In modern network architecture and operations, machine learning (ML) integration promises enhanced efficiency, security, and adaptability. In our interview, Tomasz Janaszka, a seasoned Solution Architect, sheds light on the current state of ML in networking, its challenges, and its future direction.
This conversation explores the nuanced perspectives of machine learning integration, from upgrading network architectures to addressing the complex challenges of applying ML to network design and operations.
How is machine learning currently integrated into network architectures?
Tomasz Janaszka: Well, a lot depends on who you put this question to. AI enthusiasts would say machine learning is now revolutionizing network architectures by making them more intelligent, efficient, and adaptable than ever before. AI agnostics, on the other hand, prefer to view it as an evolution, acknowledging the ongoing refinement and adaptation of network paradigms. Nevertheless, we can agree that introducing ML can enhance various networking aspects, including security, automation, optimization, management, and maintenance. This transformation is already underway, and personally, I find myself in the middle ground between AI enthusiasts and agnostics.
ML-based tools are being designed to automate routine network management tasks such as configuration management, provisioning, troubleshooting, and root cause analysis. Predictive analysis using ML models on traffic and resource utilization data helps forecast future resource demands, anticipate network congestion, and optimize routing and resource allocation, thereby improving network performance metrics like latency and packet loss.
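To make the forecasting idea concrete, here is a minimal sketch in Python, assuming a synthetic utilization series stands in for real telemetry; the lag depth and model choice are placeholders to tune.

```python
# Minimal sketch: forecasting link utilization from its own recent history.
# The synthetic sine-wave series stands in for real telemetry; lag depth
# and model choice are assumptions, not a recommended setup.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
t = np.arange(500)
series = pd.Series(0.5 + 0.3 * np.sin(2 * np.pi * t / 96) + rng.normal(0, 0.05, t.size))

# Build a supervised table: predict the current sample from the last 12.
lags = pd.DataFrame({f"lag_{i}": series.shift(i) for i in range(1, 13)})
table = lags.assign(target=series).dropna()
split = int(len(table) * 0.8)                      # chronological split
X, y = table.drop(columns="target"), table["target"]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X.iloc[:split], y.iloc[:split])
print("R^2 on the held-out window:", model.score(X.iloc[split:], y.iloc[split:]))
```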
Also, in the realm of network security, there's a notable advancement in the development and refinement of ML models for Anomaly Detection (AD), Intrusion Detection and Prevention Systems (IDS/IPS), and other security mechanisms that identify potential threats such as malware, phishing attempts, and DDoS attacks.
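As an illustration of ML-based anomaly detection, the following is a minimal sketch using an Isolation Forest on made-up flow statistics; the feature set and contamination rate are assumptions.

```python
# Minimal sketch: unsupervised anomaly detection on flow statistics
# (bytes, packets, duration). The synthetic data only stands in for
# real telemetry; contamination is an assumption to calibrate.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal_flows = rng.normal(loc=[5000, 40, 1.0], scale=[800, 6, 0.2], size=(1000, 3))
odd_flows = rng.normal(loc=[90000, 600, 0.1], scale=[5000, 50, 0.02], size=(5, 3))
flows = np.vstack([normal_flows, odd_flows])

detector = IsolationForest(contamination=0.01, random_state=42).fit(flows)
labels = detector.predict(flows)            # -1 = anomaly, 1 = normal
print("flagged flows:", np.where(labels == -1)[0])
```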
Recently, Large Language Models (LLMs) have become a valuable addition to our toolkit. When trained with network-related data, LLMs can understand network documentation and configuration files, identifying errors, inconsistencies, and security vulnerabilities. They can also generate training materials for new network engineers, summarize security reports, and describe abnormal network behavior events. Fine-tuned LLMs could suggest configuration scripts based on network requirements, simplifying the initial network setup. Moreover, utilizing trouble ticket data to fine-tune LLMs provides insights for resolving common networking issues. Many projects explore using specialized LLMs to power virtual assistants and chatbots, enabling more natural interactions with network users and engineers.
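As a purely hypothetical illustration (not any specific product's workflow), a sketch along these lines could hand a configuration snippet to a chat-completion API for review; the client, model name, and prompt are all placeholders.

```python
# Illustrative sketch only: asking a chat LLM to review a config snippet.
# The model name and prompt are assumptions; in practice a model adapted
# to network data and a guarded review workflow would be used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

config_snippet = """
interface GigabitEthernet0/1
 ip address 10.0.0.1 255.255.255.0
 no shutdown
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You review network device configurations "
                                      "for errors and security issues."},
        {"role": "user", "content": f"Review this snippet:\n{config_snippet}"},
    ],
)
print(response.choices[0].message.content)
```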
What are the key challenges in applying machine learning to network design and operations?
T. J.: My experience shows that using machine learning in network design and operations is not a walk in the park, and this awareness is becoming increasingly widespread.
In my opinion, data quality and quantity constitute the primary challenge. Adhering to the well-known maxim "garbage in, garbage out" (GIGO), it is crucial to have top-notch training data to fuel our ML models effectively. We have got to get our hands dirty to address issues stemming from noisy, incomplete, and biased data. Furthermore, gathering enough data to develop accurate and valuable ML models can cost us quite a bit of time, particularly given the diverse and dynamic nature of networks.
The next challenge concerns complexity, scale, and the need for real-time processing. Managing large, complex networks with various devices and configurations is tough for machine learning. These models must swiftly handle data streams and spot events, anomalies, and security threats in real time, which is crucial for effective network management.
Another challenge is interpretability and explainability. It is vital for network experts, especially during troubleshooting or security issue analysis, to understand why ML models make specific decisions or predictions. Transparent models build trust and help to address accountability concerns.
Following that, we have data privacy and security concerns. Networks must comply with regulations like the General Data Protection Regulation (GDPR), so ML solutions must meet the same standards, and this is another elephant in the room.
Lastly, costs, computational infrastructure requirements, and the necessity of assembling diverse teams proficient in both networking and machine learning domains emerge as significant challenges when considering ML-based solutions for networking problems.
As you can see, tackling networking challenges with ML is like embarking on a thrilling adventure. But that is what makes it exciting, doesn't it?
How do you identify suitable use cases for machine learning in a network context?
T. J.: The first step should be to talk with network experts and understand their challenges thoroughly. The better we grasp the unique obstacles of the network environment, the better we can tackle them effectively.
Then, we assess the data we have: its availability, relevance, and quality, and how complex it might be. Depending on the challenges, we might need data from various sources like network telemetry, device logs, or security logs.
For each problem we identify, we need clear business goals and key performance indicators (KPIs) to measure improvement with ML-based solutions.
Next, we dive into exploring ML techniques that could solve each problem, focusing on those with the most potential for value creation. This means looking into supervised and unsupervised learning, reinforcement learning, deep learning, or specialized frameworks like Natural Language Processing (NLP) or Large Language Models (LLMs).
After that, we prioritize potential use cases based on factors like impact, alignment with business priorities, and resource requirements. Ultimately, we focus on the use case with the highest potential for business value creation, taking into account our organization's capabilities and constraints.
Let me share one more thought on your question. Many are tempted to add machine learning to systems and products without thorough analysis. While this is often to make products more appealing, it can make things needlessly complicated. So, it's important to carefully consider if machine learning is the best solution before using it.
What factors influence the choice of machine learning models for network applications?
T. J.: As you can guess, there are many factors.
To start, the type of problem at hand determines which model family to consider. Whether it's classification, cluster analysis, forecasting, or using NLP or LLMs for knowledge extraction, the specific use case usually guides the choice of model types. This decision is closely aligned with the nature of the available data, which can be structured (tables, logs), unstructured text, or time series data. Each model operates based on assumptions about the data format it handles.
Interpretability and explainability of the model's output are essential, especially for tasks related to troubleshooting or security. Network engineers and administrators need to understand the model's output to use it effectively. Hence, decision trees, logistic regression, or random forest models may be favored over black-box deep learning models, even though interpretable models may sacrifice some accuracy.
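For instance, a decision tree's full rule set can be printed and read by an engineer, which a deep network's weights cannot. A minimal sketch, with hypothetical feature names:

```python
# Minimal sketch: an interpretable classifier whose decision logic can be
# printed and inspected. The feature names are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
feature_names = ["pkt_rate", "avg_pkt_size", "syn_ratio", "dst_entropy"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# The learned rules are human-readable if/else splits on named features.
print(export_text(tree, feature_names=feature_names))
```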
Scalability and efficiency are equally important considerations. Models selected for network and cloud infrastructure tasks must scale to manage the large volumes of data generated by IT environments. Depending on real-time processing needs, models with fast inference capabilities suitable for deployment in distributed computing environments might be necessary. Additionally, it is crucial to assess the resources required to prepare a production-ready model, including its size, the processing power needed for training, and the cost of the inference phase after deployment.
In my experience, these factors serve to narrow down the range of applicable models. The process of selecting a model is iterative and involves experimenting with various ML algorithms. The outcomes of these experiments should be integrated with the collective knowledge, experience, and skills of the team involved to make informed decisions regarding the selection of the most suitable ML model for a specific use case.
How do you evaluate the trade-offs between traditional rule-based methods and machine learning models in network design?
T. J.: This question reminds me of heated discussions with experienced engineers who advocate for the simplest and most effective solutions.
First, let us look at the pros and cons of each approach. Traditional rule-based methods are clear and straightforward, making them predictable and easy to understand. On the other hand, machine learning-based approaches may lack this transparency. Also, using machine learning often requires more computational power compared to rule-based methods. However, machine learning can adapt to new data and changing conditions, while rule-based methods need manual updates.
In complex scenarios, machine learning can be good at finding and adapting to intricate patterns in data that are hard to capture with strict rules. Managing many rules in a traditional system can be time-consuming and error-prone, while with machine learning we need to regularly monitor and update models for optimal performance.
My observations confirm that the best approach usually involves a mix of both methods. Rule-based systems are good for clear tasks and providing basic safeguards, while machine learning is better in areas such as pattern recognition, optimization, and prediction. Ultimately, the decision should come from thorough discussions among experts.
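A minimal sketch of such a hybrid, in which an illustrative hard rule short-circuits the ML score; the port, features, and thresholds are all assumptions:

```python
# Minimal sketch of the hybrid idea: a hard rule acts as a safeguard,
# while an ML anomaly score covers the subtler patterns. The banned port,
# feature choice, and synthetic training data are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

detector = IsolationForest(random_state=0).fit(
    np.random.default_rng(0).normal(loc=[1000, 20], scale=[100, 3], size=(500, 2))
)

def classify_flow(flow: dict) -> str:
    if flow["dst_port"] == 23:        # rule: Telnet is blocked outright
        return "blocked-by-rule"
    # ML layer: score everything the rules do not decide.
    score = detector.decision_function([[flow["bytes"], flow["packets"]]])[0]
    return "suspicious" if score < 0 else "normal"

print(classify_flow({"dst_port": 443, "bytes": 980, "packets": 21}))
```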
In network environments where real-time processing is critical, how do you design machine learning solutions that can operate in real time or near-real time?
T. J.: I must say, you are quite adept at posing challenging questions. To answer this question, we need to go a little deeper into the details.
When it comes to developing ML solutions for real-time operation, one thing really stands out: precise data feature selection. This not only streamlines the model but also boosts efficiency by cutting down unnecessary processing during both training and inference. So, putting effort into feature selection, principal component analysis, and dimensionality reduction techniques is key, especially in real-time scenarios.
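As a minimal sketch of this kind of dimensionality reduction, with random data standing in for telemetry and the variance threshold and component count as assumptions to tune:

```python
# Minimal sketch: trimming the feature space before real-time inference.
# Random data stands in for telemetry; the threshold and the retained
# variance ratio are assumptions to tune.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

X = np.random.default_rng(0).normal(size=(1000, 40))          # stand-in telemetry

X_pruned = VarianceThreshold(threshold=0.5).fit_transform(X)  # drop near-constant features
pca = PCA(n_components=0.95)      # keep components explaining 95% of variance
X_reduced = pca.fit_transform(X_pruned)
print(X.shape, "->", X_reduced.shape)                         # fewer inputs, cheaper inference
```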
Another effective strategy is to focus on machine learning algorithms with low computational complexity. Lightweight models like decision trees, random forests, and logistic regression offer quick and efficient processing. In deep learning, there are plenty of lightweight models tailored for real-time or resource-constrained devices, such as simplified versions of Long Short-Term Memory (LSTM) or Convolutional Neural Networks (CNN), compact architectures like ShuffleNet, MobileNet, one-dimensional CNN (1D CNN), and transformer-based models, as well as the broader TinyML approach for embedded devices.
We should realize that deployment of trained ML models relies on leveraging efficient inference engines and frameworks. TensorFlow Serving, TorchServe, TensorRT, OpenVINO, ONNX Runtime, MLflow, and Apache NiFi are essential tools, enabling seamless deployment across diverse hardware platforms like CPUs, GPUs, TPUs, VPUs, and specialized accelerators such as FPGAs or ASICs. Some frameworks also offer comprehensive toolsets for model optimization, including pruning and quantization, resulting in optimized models with reduced size and complexity for fast execution without significant performance loss.
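For example, serving an exported model with ONNX Runtime might look like the following sketch; the model file and input shape are placeholders for a real artifact:

```python
# Minimal sketch: serving an exported model with ONNX Runtime.
# "model.onnx" and the batch shape are placeholders for a real artifact.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

batch = np.random.rand(1, 16).astype(np.float32)  # shape must match the model
outputs = session.run(None, {input_name: batch})
print(outputs[0])
```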
Additionally, stream processing frameworks like Apache Kafka Streams, Apache Flink, and Apache Spark Streaming empower the construction of robust data processing pipelines capable of handling vast volumes of data generated by network and cloud environments in real time. Integration with frameworks offering distributed processing capabilities allows seamless scaling of machine learning inference and processing across multiple nodes or computing resources, which is crucial for meeting real-time operational requirements.
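A minimal sketch of such a pipeline, using the kafka-python client; the topic, brokers, message fields, and the synthetic training data are all assumptions:

```python
# Minimal sketch: scoring telemetry as it arrives from a Kafka topic.
# Topic name, brokers, and message fields are placeholders; the detector
# is trained on synthetic data purely for illustration.
import json
import numpy as np
from kafka import KafkaConsumer   # kafka-python
from sklearn.ensemble import IsolationForest

detector = IsolationForest(random_state=0).fit(
    np.random.default_rng(0).normal(loc=[1000, 20], scale=[100, 3], size=(500, 2))
)

consumer = KafkaConsumer(
    "network-telemetry",                       # placeholder topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    flow = message.value
    if detector.predict([[flow["bytes"], flow["packets"]]])[0] == -1:
        print("anomalous flow:", flow)
```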
So, to sum it up, it is not only about having fast and efficient models. You also need to make sure they have the right environment to run smoothly for real-time processing.
How do you ensure the scalability and performance of machine learning models in large-scale network deployments?
T. J.: It varies depending on the specific use case, but I believe I can offer you some helpful general hints.
In fact, what has already been discussed regarding real-time needs applies to the concerns you are raising. However, we can highlight a few specific methods to ensure scalability and performance for ML solutions on a larger scale.
When selecting a machine learning model, it's crucial to ensure its potential for parallelization in both training and inference phases across multiple computational units. Models like gradient boosting, random forests, and certain neural networks are well suited to distributed computing using frameworks like Apache Spark, PyTorch Distributed, and TensorFlow Distributed.
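Before reaching for a distributed framework, the same idea can be seen on a single node: scikit-learn's n_jobs=-1 trains forest trees across all local cores. A minimal sketch:

```python
# Minimal sketch: single-node parallelism as a first step before moving
# to Spark or distributed PyTorch/TensorFlow; n_jobs=-1 uses all cores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)                     # individual trees are trained in parallel
```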
Another good idea is decoupling data processing stages, such as ingestion, preprocessing, and inference, into separate subprocesses or containers that enable asynchronous and parallel execution.
For sure, containerization offers deployment flexibility and scalability, especially when combined with auto-scaling and load balancing capabilities of available computing clusters.
We should also think of employing caching mechanisms for frequently accessed data and precomputed or transformed features. This reduces redundant computations, which is particularly beneficial for iterative or repetitive data processing pipelines.
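A minimal single-process sketch of this idea, using functools.lru_cache to memoize a costly feature lookup; the function is hypothetical, and a shared cache such as Redis would be the distributed analogue:

```python
# Minimal sketch: memoizing an expensive, repeatedly requested feature
# transform. lru_cache suits single-process pipelines; the profile
# computation below is a placeholder for a costly historical aggregation.
from functools import lru_cache

@lru_cache(maxsize=4096)
def device_profile_features(device_id: str) -> tuple:
    # Placeholder for an expensive lookup over historical data.
    return (hash(device_id) % 100 / 100, len(device_id))

device_profile_features("router-eu-01")   # computed once...
device_profile_features("router-eu-01")   # ...then served from the cache
```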
Let us not forget about continuous monitoring of deployed data processing pipelines, focusing on metrics like accuracy, latency, and resource consumption. It helps identify bottlenecks and drives the optimization efforts of our ML solution. Iterative refinement of machine learning-based solutions is essential for ensuring scalability and performance in large-scale environments.
Let me emphasize here that in order to efficiently implement these approaches, leveraging feature-rich MLOps frameworks is highly advisable. Platforms such as MLflow, Kubeflow, or commercial frameworks provided by established cloud providers offer comprehensive toolsets tailored to streamline the management and optimization of machine learning workflows at scale.
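For instance, a few lines of MLflow tracking can record the parameters and metrics a run should be judged by; the names below are arbitrary examples:

```python
# Minimal sketch of experiment tracking with MLflow; the run name,
# parameters, and metrics are arbitrary examples.
import mlflow

with mlflow.start_run(run_name="anomaly-detector-v2"):
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("contamination", 0.01)
    mlflow.log_metric("precision", 0.93)
    mlflow.log_metric("p99_latency_ms", 12.4)
```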
Can you share insights into the integration of machine learning with traditional security measures in network environments?
T. J.: Well, this is indeed a very interesting topic to discuss.
While traditional security measures like firewalls, Intrusion Detection and Prevention Systems (IDS/IPS), vulnerability scanners, and Access Control Lists (ACLs) are essential for safeguarding network environments, they have limitations. Rule-based or signature-based threat detection systems are good at spotting known security breaches but struggle with new, sophisticated threats. Plus, they often generate false positive alerts from harmless activities, leading to alert fatigue among security analysts. Additionally, these solutions may not scale well with the increasing volume and complexity of data in modern networks, relying too much on manual interventions that can be slow and prone to error.
The integration of machine learning with traditional security measures offers a promising solution to these challenges. ML models have a knack for spotting anomalies and suspicious patterns that traditional systems might miss. By creating behavioral profiles for applications, devices, and users, machine learning techniques can detect deviations from normal behavior, helping to identify malicious activity. Moreover, the number of false positives can be reduced by analyzing historical alert data alongside contextual information, accurately distinguishing between genuine threats and harmless network quirks.
Of course, the problem is not easy, but at least the goals are clear. To sum up: combining machine learning with traditional security measures in network environments is expected to improve threat detection, speed up incident response times, reduce false positive alerts, and enable more automated and adaptive security strategies that keep up with the ever-changing cybersecurity landscape.
What strategies do you implement to keep machine learning models up to date with the evolving nature of network technologies?
T. J.: I am happy when a project reaches that phase.
Ensuring that machine learning models stay relevant over time is crucial, as models trained on historical data patterns may lose their predictive accuracy. Different strategies exist for maintaining model relevance, depending on the machine learning methods and models used.
A typical approach is to regularly monitor model performance by comparing its output with newly arrived real data. If performance drops below a predefined threshold, indicating degraded accuracy, the model needs updating. Another method is regular retraining with fresh data to prevent performance deterioration.
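A minimal sketch of such a monitoring loop, with the accuracy floor and window size as assumptions:

```python
# Minimal sketch of the monitoring loop described above: track a rolling
# accuracy over recent predictions and flag the model for retraining when
# it drops below a floor. Threshold and window size are assumptions.
from collections import deque

window = deque(maxlen=500)        # rolling record of correct/incorrect
ACCURACY_FLOOR = 0.90

def record_outcome(prediction, actual) -> bool:
    """Return True when the model should be retrained."""
    window.append(prediction == actual)
    if len(window) == window.maxlen:
        accuracy = sum(window) / len(window)
        return accuracy < ACCURACY_FLOOR
    return False
```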
In dynamic settings like networks and cloud environments, a natural strategy is to use machine learning methods that support adaptive learning. This allows models to dynamically adjust and learn from new data. Batch incremental learning involves periodically updating the model with batches of new data, while online learning adjusts the model with each new data point.
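A minimal sketch of the online variant, using scikit-learn's partial_fit on a stand-in data stream:

```python
# Minimal sketch of online learning: the model is updated with each new
# mini-batch instead of being retrained from scratch. The random batches
# stand in for a live data stream.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])          # must be declared on the first call

for step in range(10):              # stand-in for an endless stream
    X_batch = rng.normal(size=(32, 5))
    y_batch = rng.integers(0, 2, size=32)
    model.partial_fit(X_batch, y_batch, classes=classes if step == 0 else None)
```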
Maintaining model relevance may also necessitate more substantial changes, such as adopting a new, improved version of a pretrained model and fine-tuning it to suit the most recent data. Ensemble learning is another approach, combining multiple base models trained with different algorithms or data subsets to enhance robustness and generalization.
Additionally, continuous improvement of deployed machine learning models can be achieved through data augmentation. This involves introducing variations, increased noise levels, specific network conditions, and anomalies into collected data to expand the model's coverage beyond the original dataset's scope.
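A minimal sketch of this kind of augmentation, adding noise and a simulated congestion spike to stand-in telemetry (the scales are arbitrary):

```python
# Minimal sketch: widening the training distribution with added noise and
# a simulated congestion spike. All scales are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(loc=100, scale=10, size=(1000, 4))    # stand-in telemetry

noisy = X + rng.normal(0, 2.0, X.shape)              # extra measurement noise
congested = X.copy()
congested[:, 0] *= rng.uniform(1.5, 3.0, size=1000)  # inflate the traffic column

X_augmented = np.vstack([X, noisy, congested])       # broader coverage
```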
In summary, there are various approaches to maintaining machine learning model relevance, and adapting strategies to specific use cases is essential for long-term effectiveness.
—
Tomasz Janaszka, Solutions Architect at CodiLime
Tomasz Janaszka is a solutions architect with over 20 years of experience in the telco/IT industry. He holds a doctorate in technical sciences in the field of telecommunications and is an experienced engineer who has worked for many years on research and operational projects. He has worked as a leader, solution architect, and developer on projects covering network design, capacity planning, traffic engineering, load balancing, and resource optimization, focusing on software solutions that support the automation of business processes related to network management.
Currently working in CodiLime’s R&D department, he explores AI/ML technology, looking for practical applications in various networking aspects. Tomasz is the author and co-author of several scientific publications and a speaker at conferences, webinars, and technical workshops.