Goodbye, Leaf-and-Spine Networks?
Of course not
A friend of mine sent me links to a new paper published by AWS engineers, and an associated LinkedIn post which claims:
We got lean, resilient, massive aggregation fabrics that provide 33% better throughput with 69% fewer routers, savings 27% of costs, cutting power usage by 40%, and reducing CO2 emissions.
The obvious question one should ask after reading the hyperventilated Radical Network Redesign blog post is thus: is this the end of leaf-and-spine networks? Of course not. Let’s go into the details.
What exactly did they do? They rediscovered the way Plexxi tried to build data center fabrics. Instead of spine switches, Plexxi tried to connect leaf switches directly, first with CWDM (they were dreaming about dynamic leaf-to-leaf bandwidth), later with a prewired middlebox (what AWS engineers call ShuffleBox).
Obviously, you’d waste a lot of bandwidth that way, as there are always some leaf switches that do not exchange traffic even though they have a direct link. Plexxi solved that with unequal-cost multipathing (the traffic also uses longer paths, not just direct links); the AWS blog post calls that Routing through Randomness.
As anyone who has tried to understand LFA knows, unequal-cost multipathing only gets you so far. If you want further increases in link utilization, you need “proper” traffic engineering, which requires virtual circuits (and thus an extra layer of encapsulation). Whether you use MAC frames[1], MPLS, SRv6, or pigeons for that extra layer does not matter.
How could a prewired ShuffleBox be random? Yeah, that was the first major trigger of my bullshit meter. First, I thought they were using optical switches (which might turn out to be as expensive as traditional spine switches due to lower production volumes), but after reading the article, I got the impression they split the switch uplinks into individual lanes (for example, there are four 100GE lanes in a 400GE uplink port), and prewired the lane-to-lane matrix in the ShuffleBox, which makes it as random as the XKCD random number generator. It’s worth noting that Plexxi did exactly the same thing to get rid of CWDM costs, and that lane splitting is an ancient method we used more than a decade ago to build larger leaf-and-spine fabrics (and make our lives miserable in the process; some details).
They claim they used optimization methods to find the best partial mesh between N switches having D uplinks. The result is probably optimal (under some constraints) and might look random to a casual observer, but there’s nothing random in it. The arXiv paper correctly calls it a Quasi-Random Graph; that nuance is lost, for obvious reasons[2], in the blog posts and similar promotional material.
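To make the "nothing random in it" point a bit more tangible, here's a minimal sketch of a prewired partial mesh. It does not reproduce the optimization the AWS engineers describe; I'm using a circulant graph (switch i connects to switches i±k) purely as a stand-in, and the names `circulant_mesh`, `average_hops`, and the offsets are my own assumptions. The second function uses breadth-first search to compute the average number of leaf-to-leaf hops, which is the number that matters for the throughput discussion.

```python
from collections import deque

def circulant_mesh(n, offsets):
    """Deterministic partial mesh (illustrative stand-in, not the AWS design):
    switch i is wired to switches (i + k) and (i - k) modulo n for each offset k."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for k in offsets:
            adj[i].add((i + k) % n)
            adj[i].add((i - k) % n)
    return adj

def average_hops(adj):
    """Average shortest-path length (in leaf-to-leaf hops) over all ordered pairs."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                      # plain breadth-first search
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1            # reachable destinations, excluding src
    return total / pairs

# Hypothetical example: 16 switches, 4 fabric-facing links each
mesh = circulant_mesh(16, offsets=(1, 4))
print(average_hops(mesh))
```

Run it twice and you get the same wiring and the same hop count both times, which is the whole point: a prewired box is a fixed graph. An optimizer would pick a better graph than my circulant one, but the result would be just as deterministic.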
Could they get better throughput than leaf-and-spine fabrics? In an apples-to-apples comparison, of course not. I explained that ages ago, but of course nobody reads old stuff, so let’s do another simple thought experiment:
- You build a leaf-and-spine fabric with N:1 oversubscription on leaf switches – the total bandwidth of edge ports is N times higher than the total bandwidth of uplinks. N is usually set to three.
- The spine (or superspine fabric) of your fabric has no oversubscription. The only congested resources are the leaf switch uplinks.
- The traffic from any endpoint to any other endpoint in the leaf-and-spine fabric thus has to traverse exactly two leaf switch uplinks plus a non-oversubscribed fabric.
- The traffic in the Plexxi or AWS solution might have to traverse more than two leaf switch uplinks (when they use other leaves as relay nodes).
In an environment with many small flows (to make load balancing work well), it’s thus IMPOSSIBLE to get better total throughput in a partial mesh than in a leaf-and-spine fabric with no core oversubscription, and it DOES NOT MATTER what the traffic profile is as long as the leaf switch uplinks are the congestion points. The details are left as an exercise for the curious reader.
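For readers who prefer arithmetic to exercises, here's a back-of-the-envelope sketch of the argument. It assumes the leaf uplinks are the only bottleneck, that load balancing is perfect, and that every Gbps of traffic consumes one Gbps on each leaf uplink it traverses (counting both the sending and the receiving end of a leaf-to-leaf link). The fabric size, uplink bandwidth, and the 50/50 traffic split are hypothetical numbers I made up for illustration.

```python
def throughput_upper_bound(n_leaves, uplink_gbps, avg_uplink_traversals):
    """Upper bound on aggregate leaf-to-leaf throughput in Gbps.

    Each leaf offers uplink_gbps in each direction (transmit and receive),
    and every Gbps of traffic consumes one Gbps on each uplink it traverses.
    """
    total_uplink_capacity = 2 * n_leaves * uplink_gbps  # transmit + receive
    return total_uplink_capacity / avg_uplink_traversals

N_LEAVES = 32        # hypothetical fabric size
UPLINK_GBPS = 1600   # hypothetical: four 400GE uplinks per leaf

# Leaf-and-spine: source leaf uplink + destination leaf uplink = 2 traversals
leaf_and_spine = throughput_upper_bound(N_LEAVES, UPLINK_GBPS, 2)

# Hypothetical partial mesh: half of the traffic uses a direct link
# (2 traversals), the other half is relayed through one other leaf (4).
partial_mesh = throughput_upper_bound(N_LEAVES, UPLINK_GBPS, 0.5 * 2 + 0.5 * 4)

print(leaf_and_spine, partial_mesh)
```

The only way for the partial mesh to match the leaf-and-spine number in this model is to have all traffic use direct links, which requires the traffic matrix to match the wiring. Any relayed traffic pushes the average above two traversals and the bound below the leaf-and-spine one.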
But they claim they got better throughput in the arXiv paper! Yeah, I tried to figure that out, but the paper is a bit vague on the details. It looks like they used a simulation to generate the throughput graphs, but the source code is not available, so we can’t know exactly what they did[3]. Also, they compare their solution to fat trees without defining the parameters of the fat trees they’re using.
I could think of several relatively simple explanations for their results:
- The spine layer (or the core fabric) in their fabric is oversubscribed[4].
- The load balancing in their leaf-and-spine fabric is suboptimal (some uplinks are congested while the others are idle). There are multiple ways to solve this challenge before moving to packet spraying; Cisco ACI supposedly uses one of them, and I wrote several blog posts on the topic in case you’re interested in the details.
- They use load balancing across virtual paths in their solution, which results in a higher number of alternate paths and thus better load balancing performance.
- They use a routing algorithm that takes link load into account, resulting in more equally congested links.
- They use packet spraying (sending packets of the same session across multiple paths) in their solution, but not in the baseline leaf-and-spine fabric.
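The difference between per-flow load balancing and packet spraying mentioned in the list above is easy to illustrate with a toy model. The sketch below is not how any particular switch ASIC works: I'm using SHA-256 as a stand-in for whatever hash function the hardware uses, round-robin as a stand-in for spraying, and the flows are made up.

```python
import hashlib
from collections import Counter
from itertools import cycle

def flow_hash_uplink(five_tuple, n_uplinks):
    """Per-flow load balancing: hash the 5-tuple, pick one uplink for the whole flow.
    SHA-256 is an illustrative stand-in for the hardware hash function."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_uplinks

def per_flow_load(flows, n_uplinks):
    """flows: list of (five_tuple, packet_count). Returns packets per uplink."""
    load = Counter({uplink: 0 for uplink in range(n_uplinks)})
    for five_tuple, packets in flows:
        load[flow_hash_uplink(five_tuple, n_uplinks)] += packets
    return load

def sprayed_load(flows, n_uplinks):
    """Packet spraying modeled as round-robin over uplinks, ignoring flow boundaries."""
    load = Counter({uplink: 0 for uplink in range(n_uplinks)})
    uplinks = cycle(range(n_uplinks))
    for _, packets in flows:
        for _ in range(packets):
            load[next(uplinks)] += 1
    return load

# Hypothetical traffic: eight large flows, 1000 packets each, four uplinks
flows = [(("10.0.0.1", f"10.0.1.{i}", 6, 40000 + i, 443), 1000) for i in range(8)]
print("per-flow:", dict(per_flow_load(flows, 4)))
print("sprayed: ", dict(sprayed_load(flows, 4)))
```

With a handful of large flows, the per-flow hash can easily land several flows on the same uplink, while spraying spreads the packets evenly (at the cost of potential packet reordering). Comparing a sprayed partial mesh with a per-flow-hashed leaf-and-spine fabric would therefore tell you more about the load balancing than about the topology.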
I would love to believe there’s some magic solution out there that works better than an optimally implemented leaf-and-spine fabric, but I don’t think the laws of physics agree with that sentiment. However, according to Clarke’s First Law, I could also be missing something obvious; in that case, please leave a comment.
Does it matter? It’s an interesting approach, and most probably more than good enough for most use cases. After all, I always told people to connect four leaf switches into a full mesh instead of wasting time on a spine layer. I don’t believe it gives you more throughput, but I totally agree it uses less power (ShuffleBoxes are probably passive elements).
Should we expect similar solutions in enterprise-sized data centers? Probably not. There might be a reason Plexxi got nowhere[5]. Also, as long as Fortune 50 companies need fewer than a dozen switches to build two data centers (based on a true story), optimizing the fabric design might not be the best investment of everyone’s time.
On the other hand, if you build fabrics with tens of thousands of switches, you should definitely take a closer look. If you do, I’d love to hear your comments.
[1] Using destination MAC address as virtual circuit ID. Sounds crazy, but I’ve seen crazier things.
[2] A research paper published by a hyperscaler is often a thinly-veiled recruitment drive. See also OpenFlow @ Google and Google BeyondCorp.
[3] Did we learn nothing from the reproducibility crisis?
[4] I hope that’s not the case and that we’re past the access/aggregation/core data center networks with oversubscription at every layer.
[5] If you believe in the unlimited magic of novel approaches, please feel free to blame the HP acquisition.
Thanks for this blog post, Ivan!
Regarding the question of how they could get higher throughput than the optimal adapted Clos design without oversubscription: they did not.
Sections 9.3 and 9.4 of the paper (including figures 13 and 14) state a 3:1 "worst case" oversubscription of the network connecting ToR routers for the higher throughput results. Not all (simulated) traffic patterns showed higher throughput; some showed lower throughput instead. Section 9.4 does mention that the throughput of a "fat tree" without oversubscription is never worse than that of their quasi-random network ("non-blocking fat trees do not strand capacity").
The 3:1 oversubscription between ToR routers from figure 13 fits with the description of "fat tree" networks given in section 2 (including figure 1) of the paper.
According to the paper, they did find a way to handle the cabling chaos of quasi-random connections required to build a data network that approximates a random network. They also found a method to make use of many of the possible paths in such a network with merchant silicon based hardware, and called it "spraypoint routing" (neither encapsulation nor virtual circuits needed, just two VRFs). [The "spraying" is described as hash-based flow spraying, not packet spraying.]
I concur that this result does not really matter for enterprise-sized data centers. I'd expect that operational problems when adding and removing routers might negate the gains from having fewer routers.
We use wireless LAN to interconnect the switches in our data center network. So no cabling challenge.