Cisco IT designed AI-ready infrastructure with Cisco compute, best-in-class NVIDIA GPUs, and Cisco networking that supports AI model training and inferencing across dozens of use cases for Cisco product and engineering teams.
It's no secret that the pressure to implement AI across the business presents challenges for IT teams. It challenges us to deploy new technology faster than ever before and to rethink how data centers are built to meet increasing demands across compute, networking, and storage. While the pace of innovation and business advancement is exhilarating, it can also feel daunting.
How do you quickly build the data center infrastructure needed to power AI workloads and keep up with critical business needs? This is exactly what our team, Cisco IT, was facing.
The ask from the business
We were approached by a product team that needed a way to run AI workloads that will be used to develop and test new AI capabilities for Cisco products. The environment would eventually support model training and inferencing for multiple teams and dozens of use cases across the business. And they needed it done quickly. Given the need for the product teams to get innovations to our customers as quickly as possible, we had to deliver the new environment in just three months.
The technology requirements
We began by mapping out the requirements for the new AI infrastructure. A non-blocking, lossless network was essential for the AI compute fabric to ensure reliable, predictable, and high-performance data transmission within the AI cluster. Ethernet was the first-class choice. Other requirements included:
- Intelligent buffering, low latency: As in any good data center, these are essential for maintaining smooth data flow and minimizing delays, as well as enhancing the responsiveness of the AI fabric.
- Dynamic congestion avoidance for various workloads: AI workloads can vary significantly in their demands on network and compute resources. Dynamic congestion avoidance would ensure that resources were allocated efficiently, prevent performance degradation during peak usage, maintain consistent service levels, and prevent bottlenecks that could disrupt operations.
- Dedicated front-end and back-end networks, non-blocking fabric: With a goal of building scalable infrastructure, a non-blocking fabric would ensure sufficient bandwidth for data to flow freely and enable high-speed data transfer, which is crucial for handling the large data volumes typical of AI applications. By segregating our front-end and back-end networks, we could enhance security, performance, and reliability.
- Automation for Day 0 to Day 2 operations: From initial deployment and configuration through ongoing management, we wanted to reduce manual intervention to keep processes quick and minimize human error.
- Telemetry and visibility: Together, these capabilities would provide insights into system performance and health, which would allow for proactive management and troubleshooting.
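One common way to reason about the non-blocking requirement is the oversubscription ratio of a leaf switch: host-facing bandwidth divided by spine-facing bandwidth. The sketch below is illustrative only. The `LeafSwitch` class and the port counts and speeds are hypothetical and do not describe the actual Cisco IT design, and a ratio of 1.0 is a necessary condition for a non-blocking fabric rather than a full guarantee, since topology and load balancing also matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeafSwitch:
    """Hypothetical leaf switch; port counts and speeds are illustrative only."""
    downlink_ports: int    # ports facing GPU servers
    downlink_gbps: float   # speed of each server-facing port
    uplink_ports: int      # ports facing the spine layer
    uplink_gbps: float     # speed of each spine-facing port

    def oversubscription_ratio(self) -> float:
        """Return host-facing bandwidth divided by spine-facing bandwidth."""
        down = self.downlink_ports * self.downlink_gbps
        up = self.uplink_ports * self.uplink_gbps
        if up <= 0:
            raise ValueError("leaf needs at least one uplink with positive speed")
        return down / up

    def is_non_blocking(self) -> bool:
        """A ratio of 1.0 or lower means the uplinks can carry all downlink traffic."""
        return self.oversubscription_ratio() <= 1.0


# Assumed example values, not taken from the deployment described here.
balanced = LeafSwitch(downlink_ports=32, downlink_gbps=400,
                      uplink_ports=32, uplink_gbps=400)
oversubscribed = LeafSwitch(downlink_ports=48, downlink_gbps=100,
                            uplink_ports=4, uplink_gbps=400)

print(balanced.oversubscription_ratio(), balanced.is_non_blocking())
print(oversubscribed.oversubscription_ratio(), oversubscribed.is_non_blocking())
```

A back-end fabric for GPU traffic would typically target the first case, while general-purpose data center leaves often accept a ratio above 1.0.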
The plan, with a few challenges to overcome
With the requirements in place, we began figuring out where the cluster could be built. The existing data center facilities were not designed to support AI workloads. We knew that building from scratch with a full data center refresh would take 18-24 months, which was not an option. We needed to deliver an operational AI infrastructure in a matter of weeks, so we leveraged an existing facility with minor changes to cabling and device distribution to accommodate it.
Our next concerns were around the data being used to train models. Since some of that data would not be stored locally in the same facility as our AI infrastructure, we decided to replicate data from other data centers into our AI infrastructure storage systems to avoid performance issues related to network latency. Our network team had to ensure sufficient network capacity to handle this data replication into the AI infrastructure.
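To give a feel for the capacity planning involved in replicating training data, here is a back-of-the-envelope estimate of transfer time for a dataset over a link. It is a minimal sketch: the function name, the dataset sizes, the link speeds, and the `usable_fraction` default are all assumptions for illustration and are not figures from this deployment.

```python
def replication_hours(dataset_tb: float, link_gbps: float,
                      usable_fraction: float = 0.7) -> float:
    """Estimate hours needed to copy a dataset across a network link.

    dataset_tb      -- dataset size in decimal terabytes (1 TB = 1e12 bytes)
    link_gbps       -- nominal link speed in gigabits per second
    usable_fraction -- assumed share of the link available to replication
                       after protocol overhead and competing traffic
    """
    if dataset_tb < 0:
        raise ValueError("dataset size cannot be negative")
    if link_gbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("link speed and usable fraction must be positive")
    bits_to_move = dataset_tb * 1e12 * 8
    effective_bits_per_second = link_gbps * 1e9 * usable_fraction
    return bits_to_move / effective_bits_per_second / 3600


# Hypothetical scenarios: the same dataset over two different link speeds.
for gbps in (10, 100):
    hours = replication_hours(dataset_tb=45, link_gbps=gbps)
    print(f"45 TB over {gbps} Gbps at 70% usable: {hours:.1f} hours")
```

An estimate like this only bounds the bulk copy. Ongoing replication also depends on how quickly the source data changes.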
Now, getting to the actual infrastructure. We designed the heart of the AI infrastructure with Cisco compute, best-in-class GPUs from NVIDIA, and Cisco networking. On the networking side, we built a front-end Ethernet network and a back-end lossless Ethernet network. With this model, we were confident that we could quickly deploy advanced AI capabilities in any environment and continue to add them as we brought more facilities online.
Supporting a growing environment
After making the initial infrastructure available, the business added more use cases each week and we added additional AI clusters to support them. We needed a way to make it all easier to manage, including managing the switch configurations and monitoring for packet loss. We used Cisco Nexus Dashboard, which dramatically streamlined operations and ensured we could grow and scale for the future. We were already using it in other parts of our data center operations, so it was easy to extend it to our AI infrastructure and didn't require the team to learn an additional tool.
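As a generic illustration of what monitoring for packet loss involves, the sketch below compares two snapshots of cumulative interface counters and flags ports whose drop rate exceeds a threshold. It does not use or represent the Cisco Nexus Dashboard API. The `PortCounters` type, the port names, and the counter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortCounters:
    """Hypothetical snapshot of cumulative counters for one switch port."""
    tx_packets: int        # packets successfully transmitted
    dropped_packets: int   # packets discarded on output


def drop_rate(previous: PortCounters, current: PortCounters) -> float:
    """Return the fraction of packets dropped between two snapshots."""
    sent = current.tx_packets - previous.tx_packets
    dropped = current.dropped_packets - previous.dropped_packets
    if sent < 0 or dropped < 0:
        raise ValueError("counters went backwards (reset or wrap)")
    total = sent + dropped
    return dropped / total if total else 0.0


def ports_with_loss(samples: dict[str, tuple[PortCounters, PortCounters]],
                    threshold: float = 0.0) -> list[str]:
    """List ports whose drop rate exceeds the threshold, sorted by name."""
    return sorted(name for name, (prev, cur) in samples.items()
                  if drop_rate(prev, cur) > threshold)


# Made-up counter snapshots for two ports.
samples = {
    "eth1/1": (PortCounters(1000, 0), PortCounters(2000, 0)),
    "eth1/2": (PortCounters(1000, 5), PortCounters(1900, 105)),
}
print(ports_with_loss(samples))
```

On a fabric that is meant to be lossless, a threshold of zero is a reasonable starting point, since any drops on the back-end network are worth investigating.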
The results
Our team was able to move fast and overcome several hurdles in designing the solution. We were able to design and deploy the back end of the AI fabric in under three hours and deploy the entire AI cluster and fabrics in three months, which was 80% faster than the alternative rebuild.
Today, the environment supports more than 25 use cases across the business, with more added each week. These include:
- Webex Audio: Improving codec development for noise cancellation and lower bandwidth data prediction
- Webex Video: Model training for background replacement, gesture recognition, and face landmarks
- Custom LLM training for cybersecurity products and capabilities
Not only were we able to support the needs of the business today, but we're also designing how our data centers need to evolve for the future. We are actively building out more clusters and will share additional details on our journey in future blogs. The modularity and flexibility of Cisco's networking, compute, and security give us confidence that we can keep scaling with the business.