Telco infrastructures have experienced a transformative journey in recent years, aligning with the demands of the digital era. This evolution has been marked by the adoption of cutting-edge technologies, including Network Functions Virtualization (NFV), cloud computing, big data, artificial intelligence, and cloud-native architectures. In these advancements, the data center (DC) plays a pivotal role, serving as the hub where network functions, business platforms, cloud services, analytics workflows, and IT operations integrate into a converged infrastructure.
The services operating within the telco DC undergo load fluctuations throughout the day, driven directly or indirectly by the activities of the telco's subscribers. Load typically peaks during daytime hours, when people actively use their communication services, and drops during the night.
In response to these load fluctuations, operators have started embracing auto-scaling strategies, a prominent capability enabled by modern cloud-native technologies such as Kubernetes. This approach involves automatically adjusting the number of service instances to align with the current load. For instance, during peak demand, the auto-scaling logic detects the increased load and deploys additional instances, while in periods of reduced demand, it efficiently scales down the instances to prevent resource waste.
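The replica-count adjustment described above can be sketched with the proportional rule that Kubernetes' Horizontal Pod Autoscaler documents: scale the replica count by the ratio of observed load to target load. This is a minimal illustration, not DDS code.

```python
import math

def desired_replicas(current_replicas: int, current_load: float, target_load: float) -> int:
    """HPA-style scaling rule: adjust the replica count proportionally
    to the ratio of observed load to the configured target load."""
    return max(1, math.ceil(current_replicas * current_load / target_load))

# Peak: load at 90% against a 50% target -> scale out from 4 to 8 pods
print(desired_replicas(4, 0.9, 0.5))  # -> 8
# Off-peak: load at 20% against a 50% target -> scale in from 8 to 4 pods
print(desired_replicas(8, 0.2, 0.5))  # -> 4
```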
While operators embrace cloud-native principles for their services, the infrastructure supporting those services often lags behind. To guarantee service performance, operators typically provision DC servers for peak load. Consequently, a substantial portion of the servers remains underutilized during off-peak hours; even servers that sit completely idle, processing no load at all, continue to incur significant operational costs and waste energy.
Dynamic Data Center Sizing (DDS) is a novel feature within Intracom Telecom’s NFV-RI™ product that addresses exactly this challenge: the disparity between the scalability of modern, cloud-native applications and that of the underlying infrastructure. DDS optimizes telco DCs by dynamically consolidating workloads and powering off idle servers during periods of extended inactivity. Unlike traditional static approaches, DDS leverages AI to accurately forecast resource needs and make proactive adjustments that meet fluctuating demand in a timely manner. This intelligent scaling capability allows operators to manage DC servers efficiently, avoiding overprovisioning during low-demand periods and preventing bottlenecks during peak times.
At its core, DDS provides a forecasting and a decision-making module. The forecasting module analyzes historical data corresponding to resource requests (CPU, memory, NICs, etc.) to predict resource demand across the entire DC for the next time step. Predictions are then fed to the decision-making module which determines when to power on or off servers, and which ones. When predictions indicate that server resources will be underutilized in the next time step, the module identifies the most appropriate servers to power off, and initiates their “draining” before shutting them down completely. During draining, attempts are made to migrate pods from the server to other servers in the cluster. Conversely, when predictions indicate a surge in demand, the module identifies the smallest subset of servers to power on, so that the expected load can be handled efficiently.
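The server-selection step can be illustrated with a greedy sketch: keep the fewest servers whose combined capacity covers the predicted demand, and treat the rest as candidates for draining. The function and node names are hypothetical; the actual DDS policy also weighs memory, NICs, and migration cost.

```python
def servers_to_keep_on(capacities: dict, predicted_demand: float) -> set:
    """Greedy sketch: keep the fewest servers whose combined capacity
    covers the predicted demand; the rest become draining candidates."""
    keep, covered = set(), 0.0
    # Prefer the largest servers first, to minimise how many stay online.
    for name, cap in sorted(capacities.items(), key=lambda kv: -kv[1]):
        if covered >= predicted_demand:
            break
        keep.add(name)
        covered += cap
    return keep

caps = {"node-a": 64, "node-b": 32, "node-c": 32, "node-d": 16}
# 80 CPUs predicted: node-a + node-b suffice; node-c and node-d can drain
print(sorted(servers_to_keep_on(caps, 80)))  # -> ['node-a', 'node-b']
```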
Both the draining/power-off and the power-on process are non-instantaneous and may take a considerable, sometimes unpredictable, amount of time. DDS therefore ensures that every decision to switch server states accounts for these transition times. Moreover, it guarantees that the servers scheduled to be online at the next time step always have the capacity to safely serve the predicted demand. In this way, service disruptions are prevented and overall DC efficiency is optimized.
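The key consequence of non-instantaneous transitions is that power-on decisions must be made ahead of time: if a server takes several steps to boot, the decision now must consult the forecast that far into the future, not the current demand. A toy sketch, with an assumed ramp forecast:

```python
def decide_power_on(forecast, now_step: int, boot_steps: int,
                    online_capacity: float) -> bool:
    """Sketch: since booting takes `boot_steps` time steps, compare the
    demand forecast at the moment the server would become ready against
    the capacity that will be online, rather than current demand."""
    demand_at_ready_time = forecast(now_step + boot_steps)
    return demand_at_ready_time > online_capacity

# Toy forecast: demand ramps up by 10 CPUs per step from a base of 100
ramp = lambda step: 100 + 10 * step

# Demand now (100) fits in 120 CPUs, but in 3 steps it will be 130:
print(decide_power_on(ramp, now_step=0, boot_steps=3, online_capacity=120))  # -> True
print(decide_power_on(ramp, now_step=0, boot_steps=3, online_capacity=140))  # -> False
```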
Ensuring the safety and reliability of automated solutions in the face of mispredictions, unforeseen load spikes, and any other kind of uncertainty is paramount for telco operators. Apart from its highly accurate forecasting algorithms, DDS addresses this concern with a multitude of robustness mechanisms.
First of all, DDS features a fallback module which serves as a safety net in case of unexpected load increases. This module continuously monitors the Kubernetes scheduler queue, where incoming pod requests wait for placement, and forcibly powers on servers when congestion is detected. This effectively prevents significant service disruptions when demand suddenly exceeds the previously available capacity.
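The fallback logic can be sketched as follows. The function, the pending-pod limit, and the per-server CPU count are illustrative assumptions, not the actual DDS interface; in practice the pending pods would come from the Kubernetes API.

```python
import math

def fallback_power_on(pending_cpu_requests: list,
                      cpus_per_server: int = 16,
                      pending_limit: int = 5) -> int:
    """Fallback sketch: if more pods than `pending_limit` are stuck in the
    scheduler queue, return how many extra servers to force on so that
    their aggregate CPU requests can be accommodated."""
    if len(pending_cpu_requests) <= pending_limit:
        return 0  # queue is healthy; let the regular forecast drive sizing
    return math.ceil(sum(pending_cpu_requests) / cpus_per_server)

# 8 pods stuck, requesting 40 CPUs in total -> force on 3 extra 16-CPU servers
print(fallback_power_on([5.0] * 8))  # -> 3
# Only 3 pods pending -> no emergency action needed
print(fallback_power_on([5.0] * 3))  # -> 0
```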
In addition to this, DDS introduces various thresholds, including the overprediction and utilization thresholds, both key mechanisms that empower users to act with greater caution and take a more conservative approach to dynamic resource allocation. For instance, the overprediction threshold enables users to consistently overprovision nodes beyond what is predicted (e.g. “always allocate a 10% surplus of nodes”). This essentially serves as a safety buffer to accommodate unexpected demand spikes, prioritizing service stability over optimal resource sizing.
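The overprediction threshold amounts to a simple surplus on top of the forecast, rounded up to whole nodes. A minimal sketch, with the 10% figure from the example above (the function name is illustrative):

```python
import math

def nodes_with_surplus(predicted_nodes: int, overprediction: float = 0.10) -> int:
    """Apply the overprediction threshold: keep a configurable surplus of
    nodes on top of the forecast, rounded up to a whole node."""
    return predicted_nodes + math.ceil(predicted_nodes * overprediction)

print(nodes_with_surplus(20))        # forecast says 20 nodes -> keep 22 online
print(nodes_with_surplus(20, 0.25))  # a more conservative 25% buffer -> 25
```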
Finally, to accommodate new usage patterns, DDS features a retraining module that adapts the system to previously unseen behavior observed in the production environment. As new data becomes available, this module retrains its AI models in the background and, once finished, pushes them to production for inference. In this way, decision-making accuracy is maintained, or even enhanced, over time.
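The retraining idea can be illustrated with a deliberately simple stand-in: a moving-average forecaster that refits itself as new production observations arrive. DDS's actual AI models are far richer; this only shows the continuous-adaptation loop.

```python
from collections import deque

class RollingForecaster:
    """Toy stand-in for the retraining loop: a moving-average forecaster
    whose predictions always reflect the most recent observations."""
    def __init__(self, window: int = 96):  # e.g. 96 x 15-min samples = 1 day
        self.history = deque(maxlen=window)

    def observe(self, demand: float) -> None:
        # New production data: old samples fall out of the fixed window,
        # so the "model" is effectively retrained on recent behavior.
        self.history.append(demand)

    def predict_next(self) -> float:
        return sum(self.history) / len(self.history)

f = RollingForecaster(window=4)
for demand in [10, 12, 14, 16]:
    f.observe(demand)
print(f.predict_next())  # -> 13.0
```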
One of the standout benefits of DDS is its impact on energy consumption and cost efficiency. By strategically powering off underutilized servers during periods of low demand, DDS not only reduces energy consumption but also minimizes operational costs. In a comprehensive evaluation of DDS on AWS, we conducted an extensive, multi-day assessment using a Kubernetes cluster (EKS) comprising 34 virtual machine nodes of various EC2 instance types, collectively offering a total of 572 CPUs. We deployed a range of cloud-native workloads, strategically designed to auto-scale their number of pods to serve varying demand throughout the day.
To gauge DDS’s potential, we simulated diverse load patterns, mirroring various real-world cases. The 'dynamic' day emulated substantial load variations, with aggregate CPU demand fluctuating between 23% and 90% of the total DC capacity; in this scenario, DDS reduced the number of running EC2 instances enough to yield cost savings of 31%, demonstrating the tangible sustainability and financial benefits of incorporating DDS into telco DCs. Recognizing that many telco DCs may not yet exhibit such extremes, we also explored 'balanced' and 'static' days. Even in these less dynamic scenarios, DDS delivered tangible savings (15.4% and 6.3%, respectively), proving its adaptability and efficiency across a variety of realistic deployments.
DDS offers a pragmatic approach to telco infrastructure efficiency, blending seamlessly with cloud-native technologies. By addressing the disparity between application scalability and infrastructure limitations, DDS provides telco operators with a reliable and adaptive solution for the challenges of modern DCs. Recognizing the diversity of telco environments, not only in terms of workload dynamicity but also in terms of DC infrastructures, DDS integrates with a variety of setups, including both bare-metal and cloud-based DCs. This flexibility ensures that operators can deploy DDS in their existing infrastructure and start enjoying its benefits with minimal integration and deployment effort. Its proven impact on energy savings, cost efficiency, and operator confidence makes it an essential technology for telco operators seeking to accelerate their sustainability goals and reduce their costs.