# Hierarchical Clock Routing Scheme for Multi-Chip Modules Based on Area Pad Interconnection

Qing Zhu

Wayne W.M. Dai

UCSC-CRL-93-45 Oct. 10, 1993

Board of Studies in Computer Engineering University of California, Santa Cruz Santa Cruz, CA 95064

#### ABSTRACT

The flip-chip technology for multi-chip modules provides *area pads* through solder bumps which are distributed over the entire chip surface. Clock skew has been identified as one of the major limiting factors for high-speed VLSI systems. We propose a two-level clock routing scheme for an MCM-packaged VLSI system that makes use of area pads and the high connectivity of the MCM substrate. Each die is partitioned into small isochronal regions (bins), with a clock area pad assigned to each bin. We implement the clock network of the MCM system in two levels. A global clock network with longer wires is routed on the MCM substrate and connects the clock source of the module to all clock area pads of the dice. Inside every isochronous bin of a die, a delay-bounded clock tree is constructed from the clock pad to the clocked elements. Experimental results show that this two-level clock routing scheme dramatically reduces the clock wiring and achieves high clocking performance for the whole MCM system.

**Keywords:** clock routing, area pad, area pad interconnection, multi-chip module, isochronous region, delay-bounded Steiner tree, planar routing

## Contents

| Section | Title                                                  |
|---------|--------------------------------------------------------|
| 1       | Introduction                                           |
| 2       | Two-Level Clock Distribution Scheme by Using Area Pads |
| 3       | Global Clock Routing on Substrate                      |
| 4       | Local Clock Routing in Die                             |
| 5       | Experiment Results                                     |
| 6       | Conclusions                                            |
| 7       | Acknowledgement                                        |
|         | References                                             |

# List of Figures

| Figure | Caption |
|--------|---------|
| 1.1 | (a) A multichip module with IC chips (dice) attached directly on a substrate. The clock source of the module is a pin at the packaging. (b) A typical thin film MCM with flip chip assembly. |
| 2.1 | (a) A multi-chip module with four dice. The clock terminals (in darkened dots) are shown to be distributed across these dice. Note that one of the dice is an asynchronous chip and it has no clock terminal. (b) Dice are partitioned into isochronous bins according to the distribution of clock terminals. Some areas without clock terminals are not partitioned. A clock area pad is set at each isochronous bin to connect the local clock net inside the die and the global net on the substrate. |
| 2.2 | Two level clock tree buffered at area pads. |
| 3.1 | Approximations of the clock signal sent out of the clock source. (a) ramp voltage; (b) step voltage. |
| 4.1 | Hanan grid of source *o* and 10 sinks. |
| 4.2 | Model of an RC tree in chip. (a) an RC line. (b) load capacitance of a terminal. (c) a source. |
| 5.1 | (a) testMCM: a multi-chip module with 6 flip dice in PGA packaging. Die $A$, $B$, $C$ and $D$ have clock terminals and Die $E$ and $F$ are analog dice without clock terminals. The clock pin (source) is set at the center of the module. Clock area pads are assigned on Die $A$, $B$, $C$ and $D$. The dashed lines show the planar clock tree topology with equal path length which connects from the clock source to area pads. (b) Primary1: isochronous bin partition and global clock tree. |
| 5.2 | (a) testMCM: simulation voltage waveforms at inputs of area pads after sizing the global clock tree on substrate. (b) Primary1: simulation voltage waveforms at area pads. |
| 5.3 | Primary1: Result of the clock routing when set one clock pad at the side of the die. |

# List of Tables

| Table | Caption |
|-------|---------|
| 5.1 | Electrical parameters. $R_b$, $C_b$ and $L_b$ are resistance, capacitance and inductance of unit length wire. $R_d$ is the driver output resistance, and $C_t$ is the loading capacitance of a terminal. The clock driver in chip (die) is sized with two output resistances. |
| 5.2 | Comparison of Steiner tree results on random distribution of 8, 16, 32 and 64 sinks. The driver output resistance $R_d = 100\Omega$. $t_d$, $t_b$ and $t_c$ are the largest path delays of $T_d$, $T_b$ and $T_c$ respectively. $l_c$, $l_b$ and $l_d$ are the total wire lengths of $T_c$, $T_b$ and $T_d$ respectively. Sinks are randomly distributed in a chip size of 20 x 20 mm. |
| 5.3 | Comparison of Steiner tree results when the driver output resistance $R_d = 50\Omega$. |
| 5.4 | Result of two benchmarks. $f_{max}$ is the normal maximum working frequency, and $S_t$ is the system tolerable skew. $S_t = 0.05/f_{max}$. Pad number is the total clock area pads needed for each example. |

#### 1 Introduction

While IC density is doubling roughly every 18 months, conventional single-chip packages and printed circuit board technology have become the limiting factor for high-performance systems. To close the gap, multi-chip modules (MCMs) have emerged. An MCM has several bare chips or dice mounted and interconnected on a multilayer substrate, which functions as a single IC (see Figure 1.1(a)). The substrate of the MCM provides high routing density, between 200 and 1000 $cm/cm^2$. The most promising assembly technique for MCMs is *flip-chip* attachment. In flip-chip mounting, dice are attached with pads facing down via solder bumps, which form the mechanical and electrical connections, as shown in Figure 1.1(b). The flip-chip technology provides *area pads* through solder bumps which are distributed over the entire chip surface rather than being confined to the periphery as in wire bonding and most TAB assembly technologies. This technology increases the maximum number of I/O pads available for a given die size, so it may free VLSI designs from the current constraint on I/O pads. This form of die attachment also offers advantages such as low lead inductance, low signal delay, good thermal connectivity, very close proximity and rework flexibility.

Figure 1.1: (a) A multichip module with IC chips (dice) attached directly on a substrate. The clock source of the module is a pin at the packaging. (b) A typical thin film MCM with flip chip assembly.

Clock skew has been identified as one of the major limiting factors for high-speed synchronous VLSI systems. There are two major concerns in high-performance clock routing: minimizing clock skew and routability. For optimal system performance, clock signals must reach each register at exactly (or almost exactly) the same time. Clock skew is the variation in the arrival time of a clock signal at the clocked elements of a digital system. The clock net is a global net that spans the whole chip. Because of its performance requirements, this net is usually pre-routed before other signal nets. As a chip becomes denser and employs more circuits, the clock net takes up a large amount of routing area, which makes the other nets hard to route.

Clock skew becomes more serious as the operating frequency rises to enhance system performance. For example: Intel 80486 (1989), $f_{max} = 25 - 33$ MHz; Motorola 68030 (1987), $f_{max} = 16.7 - 33$ MHz; Intel 80386 (1985), $f_{max} = 12 - 16$ MHz; SUN Sparc-10 (1992), $f_{max} = 100$ MHz. Recently, DEC announced the Alpha chip, which operates at 150 - 200 MHz. The clock frequency can be expected to increase to 200 - 400 MHz within a few years to meet the needs of high-performance microprocessors. A clock skew of less than five percent of the clock period $T_c$ is normally considered acceptable. So, to achieve a frequency higher than 200 MHz, the clock skew should be kept under 0.25 ns. This small skew should also be maintained for a digital system with several chips and multiple clock phases, such as a multi-chip module.
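The five-percent rule can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `tolerable_skew_ns` and the list of sample frequencies are illustrative choices, not part of the original report.

```python
def tolerable_skew_ns(f_max_mhz: float, fraction: float = 0.05) -> float:
    """Tolerable clock skew in ns, taken as a fraction of the clock period T_c."""
    period_ns = 1000.0 / f_max_mhz  # T_c in nanoseconds for a frequency in MHz
    return fraction * period_ns


# Sample frequencies chosen for illustration only.
for f_mhz in (100, 200, 400):
    print(f"{f_mhz} MHz: T_c = {1000.0 / f_mhz:.2f} ns, "
          f"tolerable skew = {tolerable_skew_ns(f_mhz):.3f} ns")
```

At 200 MHz the period is 5 ns, so five percent of it gives the 0.25 ns figure quoted above.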

The "H" clock tree has been used in the IC industry [3]. However, it is only applicable to symmetric arrays of logic elements. Some research has addressed the construction of a clock tree for a general distribution of clock terminals. The algorithms presented in [12, 13, 21] try to construct a clock tree with equal path length from the source to the terminals. An improved algorithm in [19] balances *Elmore delay* instead of geometric length. Another algorithm in [5] improves the Elmore delay matching method by also minimizing the total wire length. However, this algorithm yields the minimal wire length for a given clock tree topology under a linear delay model.

The concept of an *isochronal region* was introduced in [15, 2]. In an isochronal region, circuits may be readily clocked without careful clock distribution. Anceau [2] uses the diffusion equation to estimate the approximate physical dimension D of an isochronal region. It is estimated in [2] that D is typically 7 mm, 1.16 mm and 0.35 mm for $6\mu m$, $1\mu m$ and $0.3\mu m$ NMOS technologies respectively. There are several problems with this estimation. (1) Even a moderate-sized VLSI chip is larger than these dimensions, which makes the concept difficult to apply in practice. (2) The estimation in [2] is based on a diffusion equation rather than on a real clock tree. It usually overestimates the dimension of an isochronal region, so the dimension actually required is even smaller than these estimated values.

However, if we can extend the clock signal from the packaging to multiple sites on a flip chip through area pads, the concept of an isochronal region can be made practical. In this paper, we propose a two-level clock distribution scheme for an MCM-packaged VLSI system that makes use of the area pads of the flip chips and the high connectivity of the MCM substrate. Each die is partitioned into several small isochronal bins, with a clock area pad assigned to each bin. We distribute the clock network in two levels to achieve a controlled (reliable) clock skew over the whole MCM. A global clock network with longer wires is routed on the MCM substrate and connects the clock source of the module to all clock area pads of the dice. Inside every isochronal bin of a die, a delay-bounded clock tree distributes the clock signal from the clock pad to the clocked elements. A new algorithm for constructing a delay-bounded Steiner tree has been proposed in [22, 23]. Experimental results show that this clock distribution scheme dramatically reduces the total wire length of the clock net while the clocking performance of the whole MCM system is guaranteed.

#### 2 Two-Level Clock Distribution Scheme by Using Area Pads

We propose a two-level clock distribution scheme for an MCM-packaged VLSI system that makes use of the area pads of the flip chips and the high connectivity of the MCM substrate. The global clock network with long clock wires is routed on the MCM substrate, where the electrical parameters are uniform. Dice are partitioned into several small regions called isochronal bins. An *isochronal bin* is a rectangular region of a VLSI chip inside which the largest clock delay $t_{max}$ from the area pad to the clock terminals is kept below a given fraction $\epsilon$ of the clock period $T_c$, i.e. $t_{max} \leq \epsilon T_c$. $t_{max}$ or $\epsilon$ is decided by the *tolerable clock skew* of the digital system. In practice, the clock skew is tolerable without harm to logic functions if it can be kept under 5% - 10% of the clock period $T_c$. For example, a digital system with a maximum clock frequency of 100 MHz has a minimum clock

phase. The routing of these subnets can be performed independently; only the path delays are bounded. There is therefore no need to balance the skew among multiple phases. As a result, this methodology for multiple-phase clock distribution requires little additional routing area while the timing is guaranteed.

A VLSI system may contain dice of different types or fabricated in different processes. Traditional clock routing methods, including the "H" tree, have trouble achieving very small and stable skew for such a system. Our scheme routes the global clock net on the substrate, which has uniform process and electrical parameters. For the local clock nets in different dice, the path delay can always be kept under a prescribed skew by partitioning the dice into isochronous bins whose sizes vary from die to die.

#### 3 Global Clock Routing on Substrate

The global clock routing on the substrate connects the clock source of the module to all clock pads of dice. This can be accomplished by constructing a planar equal path length clock tree using an algorithm developed in [21]. The clock tree constructed by this algorithm has the following properties.

- No crossing between different branches (planar);
- Exactly equal path lengths from the clock source to area pads;
- Minimum path lengths from the clock source to area pads;
- Arbitrary locations of clock source and area pads.

The last property allows flexible placement of the clock source on the MCM and an uneven distribution of area pads on the dice. This planar clock tree can be implemented on a single layer of the substrate. Since it is easier to achieve uniform electrical parameters on a single layer than when switching layers, it is easier to adjust a one-layer clock tree for minimum skew. For high-speed applications of multi-chip modules or packaging, one-layer clock routing becomes necessary to obtain a high-quality clock distribution. Also, the cost of adding a clock layer to the substrate is not as serious as in a chip, especially when ceramic or plastic packaging is used. This clock layer can also be shared with other nets if room remains.

The skew of the clock tree on the substrate can be further minimized by assigning variable widths to the clock wires. Since one layer of the substrate is used for global clock routing without any obstructions, a larger range of widths can be assigned to the clock wires than inside a chip. Clock wire sizing on the substrate is therefore a very efficient means of reducing the clock skew.

The sizing is guided by the interconnect model of the global clock tree on the MCM substrate and by the evaluation of skew. Lumped RC circuit models [19, 5] are inaccurate for the long wires of the global clock tree on the substrate at high frequency, because the interconnect exhibits transmission line effects. The reflection noise of the clock network becomes serious in high-speed circuits. However, on a state-of-the-art MCM substrate, the resistive effect still dominates the inductive effect, so the interconnect on the substrate behaves as a *lossy transmission line*.

Given a branch with width  $w_i$ , we get the resistance R and capacitance C of a unit length line [3] as

$$R = \gamma/w_i, \qquad C = \beta w_i \tag{3.1}$$

the fabrication technology and routing resource. By turning the skew minimization problem into a least-squares estimation problem, the modified Gauss-Marquardt method proposed in [24] is used to determine the optimal widths of the clock wires. An efficient algorithm is also proposed that assigns good initial widths to a clock tree, which lets the later iterative optimization converge more quickly. We apply the sizing optimization method to the global clock tree based on the lossy transmission line interconnect model expressed in (3.1) and (3.2). The sizing method internally uses a delay macromodel to evaluate the timing of the clock network during the optimization process [24]. This delay macromodel [14] is based on scattering parameters, which provide a very convenient means of describing distributed RC or lossy transmission line behavior at high frequency. The major objective of global clock tree sizing is to minimize clock skew, but in most cases the largest path delay from the source to the area pads is also reduced. The resulting widths of the clock wires satisfy the lower and upper bounds and the minimum increments imposed by the routing resources and the fabrication technology.

#### 4 Local Clock Routing in Die

Inside each isochronous bin of a die, the clock terminals together with the area pad form a local clock network. The local clock network can be routed as a *delay-bounded minimum Steiner tree*.

**Delay-Bounded Minimum Steiner Tree Problem:** Given a net consisting of a source (o), a set of sinks, and a delay bound D, construct a Steiner tree T, in which the delay from o to every sink in T is less than D and the sum of the edge lengths of the tree is minimized.

The delay bound D for a local clock tree inside a bin is obtained by subtracting the skew on the substrate from the tolerable system clock skew.
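As a worked instance of this rule, take the testMCM figures reported in Section 5: a tolerable skew of 0.25 ns and a substrate skew of 0.07 ns after sizing. The symbol $S_{sub}$ for the substrate skew is notation assumed here for illustration; $S_t$ is the system tolerable skew as in the caption of Table 5.4.

$$D = S_t - S_{sub} = 0.25\,\mathrm{ns} - 0.07\,\mathrm{ns} = 0.18\,\mathrm{ns}$$

The reported total skew of 0.22 ns leaves 0.22 - 0.07 = 0.15 ns for the dice, which is within this bound.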

The path delay evaluation depends on the delay model used. Previous work has addressed the construction of zeroth-order delay (path length) bounded Steiner trees [11, 6]. Cong et al. [6] proposed an algorithm to construct radius-bounded Steiner trees with total wire length within a constant factor of optimal. Recently, Boese et al. [4] proposed methods to generate a class of Elmore delay routing tree constructions, which iteratively add tree edges to minimize the Elmore delay from the source to the sinks.

We propose a method of constructing the *Elmore delay* bounded minimum Steiner tree based on a trade-off between two special trees, the *minimum path delay Steiner tree* and the *minimum edge length Steiner tree*. For local clock routing in an isochronous bin, the source is an area pad, and the sinks are the clock terminals located in the bin. Since an IC die is partitioned into a set of isochronous bins, the local clock tree can be modeled as an RC tree. Also, buffers are usually inserted to drive the clock input at the area pad, as shown in Figure 2.2, which results in a short rise time of the clock input. Therefore, the Elmore delay model is accurate enough to evaluate the delay of a local clock tree. So, we construct the local clock tree as an Elmore delay bounded Steiner tree while minimizing the total edge length.
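To make the delay evaluation concrete, the following sketch computes Elmore delays on a small RC tree using the chip parameters of Table 5.1. It is an illustration, not the authors' implementation: each wire is assumed to be lumped with half of its capacitance charged through its own resistance, and the function name `elmore_delays`, the tree encoding and the example topology are all hypothetical.

```python
from collections import defaultdict

# Chip parameters from Table 5.1, converted to ohms and farads per micrometer.
R_WIRE = 0.030        # 30 mOhm/um
C_WIRE = 0.352e-15    # 0.352 fF/um
C_SINK = 0.0153e-12   # load of one clock terminal
R_DRIVER = 100.0      # driver output resistance


def elmore_delays(parent, length_um, sinks, root=0):
    """Return {node: Elmore delay in seconds} for every node of an RC tree.

    parent[v] is the parent of node v (the root has no entry), length_um[v] is
    the length of the wire from parent[v] to v, and sinks is the set of nodes
    loaded with C_SINK.
    """
    children = defaultdict(list)
    for v, p in parent.items():
        children[p].append(v)

    # Preorder traversal: every parent appears before its children.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])

    # Capacitance at or below each node: terminal loads plus downstream wires.
    cap = {v: (C_SINK if v in sinks else 0.0) for v in order}
    for v in reversed(order):
        if v != root:
            cap[parent[v]] += cap[v] + C_WIRE * length_um[v]

    # The driver charges everything; each wire charges half of itself plus
    # all capacitance below it.
    delay = {root: R_DRIVER * cap[root]}
    for v in order:
        if v != root:
            r_edge = R_WIRE * length_um[v]
            c_edge = C_WIRE * length_um[v]
            delay[v] = delay[parent[v]] + r_edge * (c_edge / 2.0 + cap[v])
    return delay


# Hypothetical bin: area pad 0, Steiner node 1, clock terminals 2 and 3.
parent = {1: 0, 2: 1, 3: 1}
length_um = {1: 1000.0, 2: 500.0, 3: 2000.0}
delays = elmore_delays(parent, length_um, sinks={2, 3})
print(f"largest path delay: {max(delays[s] for s in (2, 3)) * 1e9:.4f} ns")
```

A tree satisfies the delay bound when the largest of these sink delays does not exceed D.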

Since the local clock trees in dice are usually constructed with Manhattan wiring, we use the *Hanan grid* [10] to construct rectilinear Steiner trees. The Hanan grid is derived by extending a horizontal line and a vertical line through each sink and the source, as shown in Figure 4.1. A *node* in the Hanan grid is the intersection of a vertical line and a horizontal line; an *edge* connects two adjacent nodes. Some of the nodes are the locations of the source and the sinks.
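The grid construction described above can be sketched as follows. This is a minimal illustration with assumed names (`hanan_grid`, integer coordinates) rather than code from the report.

```python
from itertools import product


def hanan_grid(points):
    """Return (nodes, edges) of the Hanan grid induced by a list of (x, y) points."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})

    # A node is every intersection of a vertical and a horizontal grid line.
    nodes = list(product(xs, ys))

    edges = []
    # Horizontal edges join neighboring nodes on the same horizontal line.
    for y in ys:
        for x0, x1 in zip(xs, xs[1:]):
            edges.append(((x0, y), (x1, y)))
    # Vertical edges join neighboring nodes on the same vertical line.
    for x in xs:
        for y0, y1 in zip(ys, ys[1:]):
            edges.append(((x, y0), (x, y1)))
    return nodes, edges


# Hypothetical example: a source at the origin and two sinks.
nodes, edges = hanan_grid([(0, 0), (2, 3), (5, 1)])
print(len(nodes), "nodes,", len(edges), "edges")
```

Restricting Steiner points to these nodes keeps the search space finite while still admitting an optimal rectilinear Steiner tree for total length [10].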

- $T_b$ : delay-bounded minimum Steiner tree, i.e. the local clock tree we try to construct.
- $T_d$ : minimum path delay Steiner tree
- $T_c$ : minimum edge length Steiner tree

Since $T_c$ minimizes the edge lengths while $T_d$ minimizes the path delays, $T_b$ is a trade-off between $T_c$ and $T_d$ in terms of the largest path delay and the total edge length, taking the delay bound D into account. A new algorithm for constructing $T_b$ based on this concept is presented in [22, 23].

#### 5 Experiment Results

The delay-bounded Steiner tree algorithm for local clock routing has been implemented in ANSI C. The algorithm has been tested both on random sink distributions and on large benchmarks. We take the electrical parameters of the chip from [4], and the electrical parameters of an advanced thin-film MCM substrate from [24]. These parameters are listed in Table 5.1. We size the driver at the clock area pad to give two output resistances, $R_d = 100\Omega$ and $R_d = 50\Omega$, to test the effect of driver sizing on the Steiner tree construction.

|      | $R_b(m\Omega/\mu m)$ | $C_b(fF/\mu m)$ | $L_b(pH/\mu m)$ | $R_d(\Omega)$ | $C_t(pF)$ |
|------|----------------------|-----------------|-----------------|---------------|-----------|
| Chip | 30                   | 0.352           | 0               | 100, 50       | 0.0153    |
| MCM  | 8                    | 0.06            | 0.38            | 25            | 0.2       |

Table 5.1: Electrical parameters.  $R_b$ ,  $C_b$  and  $L_b$  are resistance, capacitance and inductance of unit length wire.  $R_d$  is the driver output resistance, and  $C_t$  is the loading capacitance of a terminal. The clock driver in chip (die) is sized with two output resistances.

Table 5.2 and Table 5.3 show the comparison of three kinds of Steiner trees  $T_d$ ,  $T_c$  and  $T_b$  for the examples of random distribution of 8, 16, 32 and 64 sinks. In these two tables,  $t_d$ ,  $t_b$ ,  $t_c$  are the largest path delays of  $T_d$ ,  $T_b$  and  $T_c$  respectively, while  $l_c$ ,  $l_b$  and  $l_d$  are the total edge lengths of  $T_c$ ,  $T_b$  and  $T_d$  respectively. D is the delay bound for  $T_b$ , which is selected between  $t_d$  and  $t_c$ . Table 5.2 is obtained based on  $R_d = 100\Omega$ , and Table 5.3 based on  $R_d = 50\Omega$ . The largest path delays are obviously shortened in Table 5.3 because of the smaller  $R_d$  compared to Table 5.2.

On average, the minimum length Steiner tree $T_c$ has 30% less total edge length than the minimum path delay Steiner tree $T_d$, while $T_d$ has 78% less largest path delay than $T_c$. If $T_c$ and $T_d$ were optimal, $T_b$ would have less total edge length than $T_d$ and more total edge length than $T_c$. However, since $T_c$ is obtained with an approximation algorithm [16], Table 5.2 and Table 5.3 show that $l_b$ is less than $l_c$ in most examples. For all these examples, $t_b \leq D$ is satisfied.

We apply the hierarchical clock distribution scheme to a multi-chip module test case (testMCM). This module contains 6 flip dice in PGA packaging, as shown in Figure 5.1(a). The module size is 210 x 270 mm with a clock pin (source) at the center of the PGA. Die E and F are analog dice without clock terminals. Each of die A and B has 120 clock terminals and a die size of 50 x 50 mm. Each of die C and D has 60 clock terminals and a die size of 30 x 30 mm. The MCM substrate takes the electrical parameters shown in Table 5.1. The MCM is supposed to work at up to 200 MHz, which requires a tolerable clock skew of 0.25 ns. Each of die A and B is partitioned into 4 bins, and each of die C and D into 2 bins. The

|          | Largest Path Delay $(ns)$ |                     |       | Total Edge Length $(mm)$ |                     |       |
|----------|---------------------------|---------------------|-------|--------------------------|---------------------|-------|
|          | $t_d$                     | $t_b$               | $t_c$ | $l_c$                    | $l_b$               | $l_d$ |
| 8 sinks  | 4.8                       | 4.8 (D = 5.5 ns)    | 7.0   | 61.8                     | 63.9 (D = 5.5 ns)   | 79.8  |
| 16 sinks | 5.05                      | 5.06 (D = 5.06 ns)  | 5.07  | 70.4                     | 58.6 (D = 5.06 ns)  | 83.7  |
| 32 sinks | 6.8                       | 6.86 (D = 6.9 ns)   | 7.0   | 91.0                     | 71.1 (D = 6.9 ns)   | 131.0 |
| 64 sinks | 8.2                       | 10.97 (D = 13.4 ns) | 23.7  | 122.5                    | 114.1 (D = 13.4 ns) | 163.2 |

Table 5.2: Comparison of Steiner tree results on random distribution of 8, 16, 32 and 64 sinks. The driver output resistance $R_d = 100\Omega$. $t_d$, $t_b$ and $t_c$ are the largest path delays of $T_d$, $T_b$ and $T_c$ respectively. $l_c$, $l_b$ and $l_d$ are the total wire lengths of $T_c$, $T_b$ and $T_d$ respectively. Sinks are randomly distributed in a chip size of 20 x 20 mm.

|          | Largest Path Delay $(ns)$ |                   |       | Total Edge Length $(mm)$ |                     |       |
|----------|---------------------------|-------------------|-------|--------------------------|---------------------|-------|
|          | $t_d$                     | $t_b$             | $t_c$ | $l_c$                    | $l_b$               | $l_d$ |
| 8 sinks  | 3.5                       | 3.6 (D = 4.3 ns)  | 5.9   | 61.8                     | 58.3 (D = 4.3 ns)   | 75.0  |
| 16 sinks | 3.6                       | 3.66 (D = 3.7 ns) | 3.9   | 70.4                     | 59.5 (D = 3.7 ns)   | 82.6  |
| 32 sinks | 4.0                       | 4.23 (D = 4.4 ns) | 5.2   | 91.0                     | 89.4 (D = 4.4 ns)   | 127.5 |
| 64 sinks | 5.2                       | 5.4 (D = 10.7 ns) | 21.8  | 122.5                    | 117.7 (D = 10.7 ns) | 171.6 |

Table 5.3: Comparison of Steiner tree results when the driver output resistance $R_d = 50\Omega$.

area pad assignment is shown in Figure 5.1(a). The local clock routing inside each bin is accomplished with our delay-bounded Steiner tree algorithm. The clock area pads are connected to the clock source at the center of the PGA via a substrate layer. This is realized with the planar clock routing algorithm in [21], which yields exactly equal path lengths from the clock source to the clock area pads. The planar clock tree topology on the MCM substrate is shown in dashed lines in Figure 5.1(a). The clock skew on the substrate can be further reduced by assigning variable wire widths. We apply the sizing optimization method in [24] to size the clock wires on the substrate, with widths bounded between $10\mu m$ and $50\mu m$. The final skew is reduced to 0.07 ns. Figure 5.2(a) shows the simulated voltage waveforms at the clock area pads of the global clock tree on the substrate. The result is summarized in Table 5.4. The sum of the clock skews on the substrate and the dice is 0.22 ns, less than the tolerable skew of 0.25 ns.

The clock distribution scheme can also be used for single-chip packaging. The tested benchmark Primary1 [12] has 256 clock terminals, and the chip size is taken as 60 x 60 mm. A clock pin (source) is set at the center of the packaging. The chip is supposed to work at up to 100 MHz with a tolerable skew of 0.5 ns. The result is shown in Table 5.4, with a total skew of 0.3 ns over substrate and chip. Figure 5.1(b) shows the partition into 8 isochronous bins with 8 area pads for Primary1. The global clock tree topology is shown in dashed lines in Figure 5.1(b). The simulated voltage waveforms at the area pads of the global clock tree on the substrate are shown in Figure 5.2(b).

We also compare with the traditional method for Primary1, which sets only one clock pad at the side of the die and connects all clock terminals to this pad with a planar equal path length clock tree as in [21]. The result is shown in Figure 5.3; it has a total wire length of 7973 mm. Compared with the result shown in Table 5.4, by using the two-level clock

Figure 5.2: (a) testMCM: simulation voltage waveforms at inputs of area pads after sizing the global clock tree on substrate. (b) Primary1: simulation voltage waveforms at area pads.

Figure 5.3: Primary1: Result of the clock routing when one clock pad is set at the side of the die.

#### 6 Conclusions

isochronous bin of dice. So, the wiring of the clock net inside the dice is dramatically reduced. Also, inserting buffers at each area pad results in a sharp rise time of the clock signal at the terminals in the dice. This scheme achieves a controlled clock skew bound at high frequency while using only a small percentage of the area pads.

Because of the uniform process and electrical parameters and the large structures of the lines on the substrate, we can construct a very high performance clock distribution network even when the dice are fabricated in different processes. This scheme can also be extended to distribute clock nets with multiple phases, in which case there is no need to balance the skew among the phases.

This clock distribution scheme introduces a novel design style that takes advantage of area pads and designs the die and the packaging simultaneously to achieve the optimum system performance.

#### 7 Acknowledgement

This work was supported partially by Intel Corporation and partially by the National Science Foundation Presidential Young Investigator Award under Grant MIP-9009945. We want to thank Prof. Andrew Kahng and Kenneth Boese of UCLA for providing the source code of their Steiner tree algorithms in [4].

#### References

- [1] Morteza Afghahi and Christer Svensson. Performance of synchronous and asynchronous schemes for VLSI systems. *IEEE Trans. Computers*, 41(7):858-872, July 1992.
- [2] F. Anceau. A synchronous approach for clocking VLSI systems. *IEEE Journal of Solid-State Circuits*, SC-17:51-56, 1982.
- [3] H. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley Publishing Company, 1987.
- [4] K. D. Boese, A. B. Kahng, and G. Robins. High-performance routing trees with identified critical sinks. In Proc. of 30th Design Automation Conf., pages 182–187, 1993.
- [5] T.H. Chao, Y.C. Hsu, J.M. Ho, K. D. Boese, and A. B. Kahng. Zero skew clock net routing. *IEEE Transactions on Circuits and Systems*, 39(11):799-814, November 1992.
- [6] J. Cong, A.B. Kahng, G. Robins, M. Sarrafzadeh, and C.K. Wong. Provably good performance-driven global routing. *IEEE Trans. on Computer-Aided Design*, CAD-11(6):739-752, 1992.
- [7] Jason Cong, Kwok-Shing Leung, and Dian Zhou. Performance-driven interconnect design based on distributed RC delay model. Technical Report, University of California, Los Angeles, pages 1-36, 1992.
- [8] D.F. Wann and M.A. Franklin. Asynchronous and clocked control structures of VLSI-based interconnection networks. *IEEE Trans. Computers*, C-32(3):284-293, March 1983.
- [9] J. P. Fishburn. Clock skew optimization. *IEEE Transactions on Computers*, 39(7):945-951, 1990.
- [10] M. Hanan. On Steiner's problem with rectilinear distance. *SIAM Journal on Applied Mathematics*, pages 255-265, March 1966.

- [11] J. Ho, D. T. Lee, C. H. Chang, and C. K. Wong. Bounded-diameter spanning tree and related problems. In Proc. ACM Symp. on Computational Geometry, pages 276–282, 1989.
- [12] M. A. B. Jackson, A. Srinivasan, and E. S. Kuh. Clock routing for high-performance ICs. In Proc. of 27th Design Automation Conf., pages 573-579, 1990.
- [13] A. Kahng, J. Cong, and G. Robins. High-performance clock routing based on recursive geometric matching. In Proc. of 28th Design Automation Conf., pages 322-327, 1991.
- [14] H. Liao, W. Dai, R. Wang, and F.Y. Chang. S-parameter based macro model of distributed-lumped networks using exponentially decayed polynomial function. In Proceedings of 30th ACM/IEEE Design Automation Conference, pages 726-731, 1993.
- [15] C. Mead and L. Conway. Introduction to VLSI Systems. Addison-Wesley Publishing Company, 1980.
- [16] T. Ohtsuki. Layout Design and Verification, Advances in CAD for VLSI, Vol. 4. North-Holland, 1986.
- [17] J. Rubinstein, P. Penfield, and M. A. Horowitz. Signal delay in RC tree networks. IEEE Trans. on Computer-Aided Design, CAD-2(3):202-211, 1983.
- [18] Martin Taylor and Wayne Wei-Ming Dai. Tiny MCM. In Proceedings of Multichip Module Workshop, pages 143-147, 1989.
- [19] R.-S. Tsay. Exact zero skew. In Digest of Tech. Papers of IEEE Intl. Conf. on Computer Aided Design, pages 336–339, 1991.
- [20] H.T. Yuan, Y.T. Lin, and S.Y. Chiang. Properties of interconnection on silicon, sapphire, and semi-insulating gallium arsenide substrates. *IEEE Transactions on Electron Devices*, ED-31:639-644, April 1982.
- [21] Qing Zhu and Wayne W.M. Dai. Perfect-balance planar clock routing with minimal path-length. In Digest of Tech. Papers of IEEE Intl. Conf. on Computer Aided Design, pages 473-476, 1992.
- [22] Qing Zhu and Wayne W.M. Dai. Delay-bounded Steiner tree algorithm for performance-driven layout. Technical Report UCSC-CRL-93-46, University of California, Santa Cruz, 1993.
- [23] Qing Zhu and Wayne W.M. Dai. Hierarchical clock routing for multi-chip modules based on area pad interconnection. *Submitted to 31st Design Automation Conf.*, 1993.
- [24] Qing Zhu, Wayne W.M. Dai, and Joe G. Xi. Optimal sizing of high speed clock networks based on distributed and lossy transmission line models. *Digest of Tech. Papers of IEEE Intl. Conf. on Computer Aided Design*, pages 628–633, Nov. 1993.