# Speedup of Clos Packet Switches that Provide Delay Guarantees 

Aleksandra Smiljanić, Belgrade University, Stony Brook University, aleks@ieee.org Miloš Petrović, Belgrade University, pmilos@galeb.etf.bg.ac.yu



Fig. 1. Clos switching fabric


#### Abstract

A multihop Clos fabric may provide the terabit capacity by using the smaller switching elements (SE). When the traffic load is balanced over the switches in a middle stage, all the traffic would get through the fabric, as long as the switch outputs are not overloaded. We derive formulas of the maximum utilization for which certain tolerable delay is guaranteed, and formulas of the minimum speedup for which tolerable delay is guaranteed and $100 \%$ switch utilization is achieved in a Clos packet switch. Finally, performances of the architectures comprising cross-bars and shared buffers as SEs are compared and their scalability is discussed.


## I. Introduction

First generations of packet switches had output or shared buffers, and their capacity was limited by the buffer throughput. Packet switches with input buffers based on cross-bars provide higher capacity. However, terabit capacity cross-bars are still not available on the market. The capacity of a packet switch can be increased by connecting switching elements (SEs) into Clos structure. Figure 1 shows the connections between SEs in a symmetric Clos three-stage switch. The interconnection rule is: the $x$ th SE in some switching stage is connected to the $x$ th input of each SE in the next stage [2], [3]. The Clos switch parameters are: the number of external ports per an input or an output SE denoted by $n$, the number of input and output SEs denoted by $m$, and the number of center SEs denoted by $l$. It has been recognized that a Clos packet switch in which multicast sessions are balanced across the SEs provides non-blocking, i.e. with sufficiently large buffers it passes all the traffic if the outputs are not overloaded [10]. Alternatively, end-to-end sessions can be groomed into the smaller number of flows that are separately balanced [8], [9].

A cell of a flow is transmitted through the SE the number of which equals the counter value of this flow, and the counter is then incremented modulo $l$.

Recently, load balancing has been proposed in other architectures such as parallel plane switches (PPS) or Birkhoff-von-Neumann switches [1], [4]. Both of these architectures are a special case of Clos packet switches where the number of external ports per input and output SE is $n=1$. In other words, input and output SEs are ports.

In this paper, we derive the switch utilization under which the tolerable delay can be guaranteed to the most sensitive applications in Clos packet switches with arbitrary parameters.

Often, the switch transceivers that transmit data to the neighboring switches, or receive data from these switches require the most complex and costly hardware. For this reason, switching fabrics have speedups so that the transmission capacity is maximally utilized. Speedup is defined as:

$$
\begin{equation*}
S=\frac{l m R_{c}}{n m R} \geq 1 \tag{1}
\end{equation*}
$$

where $R$ is the external link rate, and $R_{c}$ is the internal link rate. The internal links connect the SEs, and the external links connect switch with other switches and routers. We will calculate the speedup required to ensure $100 \%$ utilization of the transmission capacity. Performance of load balancing algorithms in Clos packet switches based on cross-bars and shared buffers will be analyzed and compared.

## II. Performance Analysis of Load Balancing Algorithms

Traffic of each individual flow is balanced independently across the SEs. If there are many flows that transmit cells across some SE at the same time, the cells will experience long delay. Many applications, e.g. voice and video, require stringent delay guarantees. This traffic must be policed either at the edge of the network, or at the switch ports. Policing interval equals $F$ cell time slots. For example, input 1 negotiated to send 10 Mbps to output 3 , the policing interval is $F=10^{4}$ cell time slots, and port bit rate 10 Gbps . Then, input 1 will send at most one high-priority cell per frame to output 3. We also assume that SEs are non-blocking and provide rate and delay guarantees. So, they transfer all policed traffic within one frame period. These features hold when the shared buffers are used as SEs. They also hold for the crossbar SEs with the speedup of two that are run by the maximal matching algorithms [5], [7]. We assume that there is a coarse synchronization in a switch, i.e. that at some point of time
the controllers schedule cells belonging to the same frame. In addition, the SEs in each stage schedule packets that have arrived in the previous frame. The total delay that a cell may experience through a three-stage Clos packet switch including the resequencing time is four times the frame duration:

$$
\begin{equation*}
D=4 F T_{c}, \tag{2}
\end{equation*}
$$

where $T_{c}$ is the cell time slot duration. In our derivations, the number of slots per frame that can be allocated to some port is $F_{u}$. Also, all lemmas and theorems hold in large switches where $l>10$.

Lemma 1: Let $F_{c}$ denote the maximum number of cells per frame sent through some internal switch link (either from input SE to center SE or from center SE to output SE). It holds that

$$
\begin{equation*}
\frac{n F_{u}}{l}+N_{f}-n \leq F_{c}<\frac{n F_{u}}{l}+N_{f} \tag{3}
\end{equation*}
$$

where $N_{f}$ denotes the number of flows passing through that internal link either sourced by some input SE or bound for some output SE.

Proof: Let $f_{i g}, 0 \leq g<N_{f}$, denote the number of time slots per frame that are guaranteed to the individual flows sourced by $\mathrm{SE}_{1 i}$, where $\mathrm{SE}_{1 i}$ denotes the $i$ th SE in the first stage. The maximum number of cells per frame sourced by $\mathrm{SE}_{1 i}$ fulfills:

$$
\begin{equation*}
F_{c} \leq \sum_{g}\left\lceil\frac{f_{i g}}{l}\right\rceil<\frac{n F_{u}}{l}+N_{f} \tag{4}
\end{equation*}
$$

where $\lceil x\rceil$ is the smallest integer no less than $x$, i.e. $\lceil x\rceil<$ $x+1$. Assume that out of $N_{f}$ flows sourced by $\mathrm{SE}_{1 i}, N_{f}-n$ flows are assigned one time slot per frame, and the remaining $n$ flows are assigned $n F_{u}-\left(N_{f}-n\right)$ time slots per frame. If it happens that first cells in a frame of all the flows are sent through $\mathrm{SE}_{2 j}$ ( $j$ th SE in the second stage), the total number of cells per frame transmitted through $\mathrm{SE}_{2 j}$ from $\mathrm{SE}_{1 i}$ will be:

$$
\begin{align*}
F_{c} & =N_{f}-n+n\left\lceil\frac{F_{u}}{l}-\frac{N_{f}}{n l}\right\rceil \\
& =\frac{n F_{u}}{l}+\frac{(l-1) N_{f}-\left(n F_{u}-N_{f}\right) \bmod (n l)}{l} \Rightarrow \\
F_{c} & \geq \frac{n F_{u}}{l}+N_{f}-n \tag{5}
\end{align*}
$$

for $l, F_{u}>10$. Claim of the lemma follows from inequalities $(4,5)$. The proof is identical in the case of a link bound for some output SE.

When different flows bound for the same SE are not properly synchronized, they might send cells within a given frame starting from the same center SE. Alternatively, equal numbers of flows are balanced starting from different center SEs in each frame. For example, flow $g$ of $\mathrm{SE}_{1 i}$ resets its counter at the beginning of a frame to $c_{i g}=(i+g) \bmod l$. Or, flow $g$ bound to $\mathrm{SE}_{3 k}$ ( $k$ th SE in the third stage) resets its counter at the beginning of a frame to $c_{k g}=(k+g) \bmod l$.

Lemma 2: In load balancing algorithms with the synchronized counters, and $N_{f}>10 l$ or $N_{f} \bmod l=0$ :

$$
F_{c}=\left\{\begin{array}{cc}
\frac{n F_{u}}{l}+\frac{N_{f}}{2} & F_{u} \geq \frac{l N_{f}}{2 n}  \tag{6}\\
\sqrt{\frac{2 n F_{u} N_{f}}{l}} & \frac{10 N_{f}}{8 n l} \leq F_{u}<\frac{l N_{f}}{2 n} .
\end{array}\right.
$$

Proof: We follow the calculation from [8] but assume $n \neq l$. We will calculate the maximum number of cells that are transmitted from $\mathrm{SE}_{1 i}$ through $\mathrm{SE}_{2(n-1)}$ in the middle stage, and the same result would hold for any other center SE. The number of cells in flow $g$ transmitted from $\mathrm{SE}_{1 i}$ through $\mathrm{SE}_{2(n-1)}$ is $\left\lfloor\left(f_{i g}+(i+g) \bmod l\right) / l\right\rfloor$, where $\lfloor x\rfloor$ is the smallest integer not greater than $x$ i.e. $\lfloor x\rfloor \leq x$. So, the number of cells transmitted from $S E_{1 i}$ through $\mathrm{SE}_{2(n-1)}$ is:

$$
\begin{equation*}
F_{c}=\sum_{0 \leq g<N_{f}}\left\lfloor\frac{f_{i g}+(i+g) \bmod l}{l}\right\rfloor \leq \frac{n F_{u}}{l}+\frac{N_{f}}{2} \tag{7}
\end{equation*}
$$

for $l>10$ and $N_{f}>10 l$, or $l>10$ and $N_{f} \bmod l=0$. Equality in (7) can be reached iff:

$$
\begin{equation*}
F_{u} \geq \frac{N_{f}}{n} \cdot \frac{l+1}{2} \approx \frac{l N_{f}}{2 n} \tag{8}
\end{equation*}
$$

for $l>10$ and $N_{f}>10 l$, or $l>10$ and $N_{f} \bmod l=0$. If inequality (8) does not hold:

$$
\begin{align*}
\frac{N_{f}}{l} \cdot \frac{z(z+1)}{2} & \leq n F_{u}<\frac{N_{f}}{l} \cdot \frac{(z+1) \cdot(z+2)}{2} \Leftrightarrow \\
z & =\left\lfloor\frac{-1+\sqrt{1+\frac{8 n l F_{u}}{N_{f}}}}{2}\right\rfloor \tag{9}
\end{align*}
$$

where $0 \leq z<l$ is an integer. For $F_{u} \geq 10 N_{f} /(8 n l)$ :

$$
\begin{equation*}
z \approx \sqrt{\frac{2 n l F_{u}}{N_{f}}} . \tag{10}
\end{equation*}
$$

It is easy to understand that $F_{c}$ will be maximal for:

$$
f_{i g}=\left\{\begin{array}{cc}
l-q & l-z \leq q=(i+g) \bmod l<l  \tag{11}\\
0 & 0 \leq(i+g) \bmod l<l-z
\end{array}\right.
$$

If $10 N_{f} /(8 n l) \leq F_{u}<l N_{f} /(2 n)$, from $(7,10,11)$ :

$$
\begin{equation*}
F_{c}=\frac{N_{f} z}{l} \approx \sqrt{\frac{2 n F_{u} N_{f}}{l}} \tag{12}
\end{equation*}
$$

Claim of the lemma follows from equations $(7,8,12)$. It can be proven identically that the claim of the lemma holds when $F_{c}$ is the maximum number of cells that are transmitted to $\mathrm{SE}_{3 k}$ through $\mathrm{SE}_{2(n-1)}$.

## A. Switch Utilization

Theorem 1: Maximum utilization of the switch internal link is:
$\max \left(0, S-\frac{4 l N_{f} T_{c}}{n D}\right) \leq U_{a} \leq \min \left(1, S-\frac{4 l N_{f} T_{c}}{n D}+\frac{4 l T_{c}}{D}\right)$,
where D is the maximum tolerable delay.
Proof: Note that $n S F / l$ is the number of cells that may pass the link from an input to a center SE within one frame. If it holds that

$$
\begin{align*}
U_{a c}=\frac{F_{u}}{F} & =S-\frac{l N_{f}}{n F} \Leftrightarrow \\
\frac{n F_{u}}{l}+N_{f} & =\frac{n S F}{l} . \tag{14}
\end{align*}
$$

from Lemma 1 it follows that

$$
\begin{equation*}
F_{c}<\frac{n F_{u}}{l}+N_{f}=n S F / l \tag{15}
\end{equation*}
$$

and all cells will pass the link within a frame for above $U_{a c}$. So, the maximum utilization under which all cells pass the switch is $U_{a} \geq U_{a c}$. From Lemma 1, $F_{c} \geq n F_{u} / l+N_{f}-n$, so it must hold

$$
\begin{align*}
\frac{n F_{u}}{l}+N_{f}-n & \leq F_{c} \leq \frac{n S F}{l} \Rightarrow \\
U_{a}=\frac{F_{u}}{F} & \leq S-\frac{l N_{f}}{n F}+\frac{l}{F} \tag{16}
\end{align*}
$$

Theorem 1 follows from equations $(15,16)$ when $F$ is replaced with $D /\left(4 T_{c}\right)$.

Theorem 2: Maximum utilization of the switch internal link when the counters are synchronized is:

$$
U_{r}=\left\{\begin{array}{cl}
S-\frac{2 l N_{f} T_{c}}{n D} & D \geq \frac{4 l N_{f} T_{c}}{n S}  \tag{17}\\
\frac{n S^{2} D}{8 l N_{f} T_{c}} & D<\frac{4 l N_{f} T_{c}}{n S}
\end{array}\right.
$$

where D is the maximum tolerable delay, and $N_{f}>10 l$ or $N_{f} \bmod l=0$.

Proof: Since $F_{c} \leq n S F / l$, from Lemma 2 it follows that for $F_{u} \geq l N_{f} /(2 n)$,

$$
\begin{align*}
F_{c} & =\frac{n F_{u}}{l}+\frac{N_{f}}{2} \leq \frac{n S F}{l} \Rightarrow \\
U_{r} & =\frac{F_{u}}{F} \leq S-\frac{l N_{f}}{2 n F}, \quad F \geq \frac{l N_{f}}{n S} \tag{18}
\end{align*}
$$

and for $10 N_{f} /(8 n l) \leq F_{u}<l N_{f} /(2 n)$ :

$$
\begin{gather*}
F_{c}=\sqrt{\frac{2 n F_{u} N_{f}}{l}} \leq \frac{n S F}{l} \Rightarrow \\
U_{r}=\frac{F_{u}}{F} \leq \min \left(\frac{l N_{f}}{2 n F}, \frac{n S^{2} F}{2 l N_{f}}\right) . \tag{19}
\end{gather*}
$$

Since $\left(10 N_{f}\right) /(8 n l F) \ll 1$ because $N_{f} \leq F$ and $l>10$, the range $F_{u}<\left(10 N_{f}\right) /(8 n l)$ is not of a practical interest and was omitted in the final formula.

Formula (17) follows from equations $(18,19)$, when $F$ is replaced with $D /\left(4 T_{c}\right)$.

Note that Theorem 2 provides the maximum utilization when both balancing of flows sourced by an input SE, and balancing of flows bound for an output SE are synchronized. This assumption will hold in all considered algorithms.

## B. Switch Speedup

Often, signal transmission over the fibers connecting distant routers requires most complex and costly hardware. Therefore, it is important to provide the highest utilization of the fiber transmission capacity. For this reason, switching fabrics with the speedup have been previously proposed. We have defined the speedup in the introduction (1).

Theorem 3: The minimum speedup $S$ required to pass all incoming packets with a tolerable delay when the counters are not synchronized is:

$$
\begin{equation*}
1+\frac{4 l\left(N_{f}-n\right) T_{c}}{n D} \leq S_{a}<1+\frac{4 l N_{f} T_{c}}{n D} \tag{20}
\end{equation*}
$$

and the speedup when counters are synchronized, and $N_{f}>$ $10 l$ or $N_{f} \bmod l=0$ equals:

$$
S_{r} \geq\left\{\begin{align*}
1+\frac{2 l N_{f} T_{c}}{n D} & D \geq \frac{2 l N_{f} T_{c}}{n}  \tag{21}\\
\sqrt{\frac{8 l N_{f} T_{c}}{n D}} & D<\frac{2 l N_{f} T_{c}}{n} .
\end{align*}\right.
$$

where D is the maximum tolerable delay.
Proof: We are looking for minimum speedup such that for $F_{u}=F$ it holds $F_{c} \leq n S F / l$, where $F_{c}$ is the maximum number of cells per frame passing through some internal link. When the counters are not synchronized from Lemma 1 it follows that:

$$
\begin{equation*}
\frac{n S_{a} F}{l} \geq F_{c}>\frac{n F}{l}+N_{f}-n \tag{22}
\end{equation*}
$$

Also, from Lemma 1 it follows that:

$$
\begin{equation*}
\frac{n S_{a} F}{l} \geq \frac{n F}{l}+N_{f}, F_{u}=F \Rightarrow F_{c} \leq \frac{n S_{a} F}{l} \tag{23}
\end{equation*}
$$

The above statement is equivalent to:

$$
\begin{equation*}
\frac{n S_{a} F}{l} \geq \frac{n F}{l}+N_{f} \Rightarrow\left(F_{u}=F \Rightarrow F_{c} \leq \frac{n S_{a} F}{l}\right) \tag{24}
\end{equation*}
$$

meaning that $S_{a}$ in $(24)$ is sufficient. From $(22,23)$ the minimal required speedup fulfills:

$$
\begin{equation*}
1+\frac{l\left(N_{f}-n\right)}{n F} \leq S_{a} \leq 1+\frac{l N_{f}}{n F} . \tag{25}
\end{equation*}
$$

When the counters are synchronized, from Lemma 2 it follows that:

$$
\frac{n S_{r} F}{l} \geq F_{c}=\left\{\begin{array}{cc}
\frac{n F}{l}+\frac{N_{f}}{2} & F \geq \frac{l N_{f}}{2 n}  \tag{26}\\
\sqrt{\frac{2 n F N_{f}}{l}} & \frac{10 N_{f}}{8 n l} \leq F<\frac{l N_{f}}{2 n}
\end{array}\right.
$$

Formulas $(20,21)$ follow from inequalities $(25,26)$ when $F$ is replaced with $D /\left(4 T_{c}\right)$, since $F \geq N_{f}>10 N_{f} /(8 n l)$ because $l \geq 2$.

## III. Performance of Load Balancing Algorithms

It can be observed from our previous analysis that the performance of a load balancing algorithm depends on the number of flows that are separately balanced, and the tolerable delay. One way end-to-end packet delay that can be tolerated by interactive applications (e.g. conversational voice, videoconferencing, etc.) is around 150 ms , but only $50-60 \mathrm{~ms}$ of this allowed delay can be budgeted for the queuing. The switch delay below 3 ms may be required for various reasons. For example, packets might pass multiple packet switches from their sources to the destinations, and delays through these switches would add. Also, in order to provide flexible multicasting, the ports should forward packets multiple times through the packet switch, and the delay is prolonged accordingly [6], [10].

The first two load balancing schemes assume that the Clos packet switch comprises identical $n \times n$ cross-bars as SEs, i.e. $n=m=l=\sqrt{N}$. In the first algorithm a flow comprises cells from some input to some output and $N_{f}=n N$, while in the second algorithm a flow comprises cells from some input to some output SE and $N_{f}=N$. The second two load balancing schemes assume that shared buffers with equally limited throughput are used as SEs, i.e. $m=l$. In the third


Fig. 2. Switch utilization: solid curves represent the second algorithm with $N_{f}=N$, dashed curves correspond to the fourth algorithm with $N_{f}=$ $N / n, n=4$ and dash-dotted curves represent the fourth algorithm with $N_{f}=N / n, n=16$.
algorithm a flow comprises cells from some input SE to some output and $N_{f}=N$, while in the fourth algorithm a flow comprises cells from some input SE to some output SE and $N_{f}=m=N / n$. Note that the bounds in Theorems 1 and 3 become tight in the cases mentioned above. In Figures $2-5$ we will plot the performance of these four load balancing algorithms according to the formulas derived in Theorems $1-$ 3.

We have assumed that the duration of cell time slot is $T_{c}=$ 50 ns in further analysis ( 64 bytes at $10 \mathrm{~Gb} / \mathrm{s}$ ). The second and the fourth algorithm have better performance since fewer flows are being balanced in the corresponding architectures. For that reason, Figure 2 examines utilization for these algorithms as the switch size is increasing. Figure 3 shows the fabric speedup required for the $100 \%$ utilization when the same algorithms are applied. Solid curves represent the second algorithm with $N_{f}=N$, dashed curves represent the fourth algorithm with $N_{f}=N / n, n=4$ and dash-dotted curves represent the fourth algorithm with $N_{f}=N / n, n=16$. The curves for various tolerable delays of 1,3 and 5 ms are plotted. The utilization drops unacceptably when the fourth algorithm with $n=4$ is applied and the switch size exceeds 1000 ports. The utilization is somewhat improved when the counters are synchronized,


Fig. 3. Fabric speedup: solid curves represent the second algorithm with $N_{f}=N$, dashed curves correspond to the fourth algorithm with $N_{f}=$ $N / n, n=4$ and dash-dotted curves represent the fourth algorithm with $N_{f}=N / n, n=16$.
but it is still low for the larger switch sizes. The required speedup in this case increases rapidly and equals four for the switch size around 2000 ports, and the tolerable delay of 3 ms . Speedup value is reduced when the counters are synchronized, the synchronization having more effect on large switch sizes. On the other hand, utilization of the second and the fourth algorithm with $n=16$ remains as high as $70 \%$ providing a tolerable delay of 3 ms even in switches with more than 5000 ports. Or, utilization is $85 \%$ in these cases when the counters are synchronized. Also, in these cases $100 \%$ utilization is provided for the fabric speedup around two even in switches with 10000 ports, guaranteeing a tolerable delay of 3 ms .

In Figures 4 and 5, the performance of all the load balancing algorithms is more closely examined for a tolerable delay of 3 ms . Solid curve represents the first algorithm, dashed curve corresponds to the second algorithm, while dash-dotted curves represent the third algorithm and dotted curves represent the fourth algorithm, with $n \in\{1,4,8,16\}$. Let $n_{3}$ denote the number of ports per SE for the third algorithm, and $n_{4}$ for the fourth algorithm. It is easy to see from Theorems 1,2 and 3 that for $n_{3}^{3}=n_{4}^{2}$ the performances of the third and the fourth algorithm will be identical and the curves in Figures 4 and 5 that correspond to such cases overlap $\left(n_{3}=n_{4}=1\right.$


Fig. 4. Switch utilization: solid curve represents the first algorithm $N_{f}=$ $n N$, dashed curve corresponds to the second algorithm $N_{f}=N$, dash-dotted curves represent the third algorithm $N_{f}=N$, and dotted curves correspond to the fourth algorithm $N_{f}=N / n$.
or $n_{3}=4, n_{4}=8$ ). We observe that the performance of the first and the third algorithm in which $n \leq 4$ is unacceptable. For the utilization of $70 \%$ the switch size is limited to 1500 ports in the case of the third algorithm with $n=16$, and to 2000 ports in the case of the fourth algorithm with $n=8$. All the algorithms require the speedups larger than two except the second algorithm and the fourth algorithm in which the counters are synchronized and $n=16$. We can see that SEs with shared buffers should have at least 16 ports in highly scalable architecture.

## IV. Conclusion

Clos packet switch with moderate speedup can meet stringent delay requirements in arbitrarily large switches, and provide $100 \%$ utilization of the switch capacity. Architecture deploying shared buffers as switching elements is efficient if these elements can support more than 16 ports. Otherwise, the preferred architecture deploys cross-bars as SEs that are more scalable, and load balancing algorithm in which the flows comprising cells from input to output SEs are separately balanced.

(a) Asynchronous counters

(b) Synchronized counters

Fig. 5. Fabric speedup: solid curve represents the first algorithm $N_{f}=n N$, dashed curve corresponds to the second algorithm $N_{f}=N$, dash-dotted curves represent the third algorithm $N_{f}=N$, and dotted curves correspond to the fourth algorithm $N_{f}=N / n$.

## References

[1] C. S. Chang, D. S. Lee and C. Y. Yue, "Providing guaranteed rate service in the load balanced Birkhoff-von Neumann switches," Proceedings of INFOCOM 2003.
[2] C. Clos, "A study of non-blocking switching networks," Bell Systems Technology Journal, vol. 32, 1953, pp. 406-424.
[3] J. Hui, Switching and Traffic Theory for Integrated Broadband Networks, Kluwer Academic Press 1990.
[4] I. Keslassy, S. T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard, N. McKeown, "Scaling Internet routers using optics," Proceedings of ACM SIGCOMM 2003.
[5] A. Smiljanić, "Flexible bandwidth allocation in high-capacity packet switches," IEEE/ACM Transactions on Networking, April 2002, pp. 287293.
[6] A. Smiljanić, "Scheduling of multicast traffic in high-capacity packet switches," IEEE Communication Magazine, November 2002, pp. 72-77.
[7] A. Smiljanić, "Bandwidth Reservations by Maximal Matching Algorithms," IEEE Communication Letters, March 2004, pp. 177-179.
[8] A. Smiljanić, "Performance of load balancing algorithms in Clos packet switches," Proceedings of IEEE Workshop on High Performance Switching and Routing, April 2004, pp. 304-308.
[9] A. Smiljanić, "High performance Routers," invited paper at joint Optoelectronic and Communication Conference and International Conference on Optical Internet, Yokohama, Japan, July 2004.
[10] J. S. Turner, "An optimal nonblocking multicast virtual circuit switch," Proceedings of INFOCOM 1994, vol. 1, pp. 298-305.

