AI-Driven Anomaly/Fault Detection and Management in Modern Mobile Networks
- Venkateshu

- Oct 28
Case Study – Low throughput issue mitigation
Introduction
The complexity of today’s telecom networks—driven by 5G’s massive scale, distributed Radio Access Network (RAN) architectures, and virtualized infrastructure—makes operational reliability and proactive fault management both a necessity and a challenge. Static rules and threshold-based monitoring techniques, once the backbone of network assurance, are now insufficient as data velocity, volume, and variety continue to grow.
The Need for AI in Telecom Anomaly Detection
Modern telecom environments produce multi-dimensional, varied, and high-frequency telemetry streams comprising KPIs from base stations, controllers, network functions, and end-user devices. These streams, often delivered as unstructured or semi-structured logs, include throughput figures, error rates, radio signal conditions, and event histories. Relying on human analysis or simple metrics leaves blind spots and slows the response to emerging problems.
AI and advanced machine learning (ML) frameworks bring several critical capabilities:
Autonomous Real-Time Monitoring: ML models continuously observe incoming metrics across devices, cells, and network slices, capturing subtle signal degradation or performance dips.
Proactive Anomaly Detection: Instead of responding after service impact, AI algorithms flag outliers, abnormal time series, and rare event correlations as soon as they emerge, enabling instant alerts and rapid investigation.
Adaptive Learning and Scalability: Algorithms evolve with network changes, learning new patterns of normality and adjusting to dynamic operational baselines, without manual threshold updating.
Methods: Machine Learning Models and Evaluation Metrics
To establish a sound AI-driven anomaly detection system, selecting appropriate ML models and evaluation metrics is essential.
Machine Learning Models Used
Unsupervised Models:
Isolation Forest: Detects anomalies by isolating data points with fewer splits in the feature space, ideal for unlabeled telecom KPI data.
Autoencoders: Neural networks that learn to reconstruct input features; high reconstruction error indicates anomaly.
Clustering Techniques: Algorithms such as DBSCAN or k-means mark sparse or divergent clusters as anomalies.
Supervised Classification Models:
Random Forest, Gradient Boosting, Support Vector Machines: Used when historical labeled data on normal/anomalous states is available.
These models classify new instances based on learned patterns of degradation or faults.
Hybrid Models:
Ensembles or multi-stage pipelines combining unsupervised detection for initial triggers and supervised classification for validation and fault categorization.
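As a minimal sketch of the unsupervised option above, the snippet below scores per-cell KPI snapshots with scikit-learn's Isolation Forest; the kpi_df table and its column names are invented for illustration rather than taken from a real dataset.
# Minimal sketch: Isolation Forest over per-cell KPI snapshots (values are illustrative).
import pandas as pd
from sklearn.ensemble import IsolationForest

kpi_df = pd.DataFrame({
    "dl_throughput_mbps": [91, 88, 90, 42, 89],
    "cqi":                [12, 11, 12, 8, 12],
    "harq_nack_ratio":    [0.02, 0.03, 0.02, 0.12, 0.02],
    "prb_utilization":    [0.71, 0.69, 0.72, 0.68, 0.70],
})

features = kpi_df[["dl_throughput_mbps", "cqi", "harq_nack_ratio", "prb_utilization"]]
model = IsolationForest(contamination=0.2, random_state=42).fit(features)
kpi_df["anomaly"] = model.predict(features)          # -1 = anomaly, 1 = normal
kpi_df["score"] = model.decision_function(features)  # lower = more anomalous
print(kpi_df[kpi_df["anomaly"] == -1])               # typically the degraded 42 Mbps snapshot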
Evaluation Metrics for Anomaly Detection
Evaluating model performance meaningfully requires metrics aligned with telecom operational goals:
Precision and Recall:
Precision measures the fraction of detected anomalies that are true anomalies.
Recall measures how many true anomalies were detected by the model.
Often, there’s a trade-off: increasing sensitivity improves recall but may increase false alarms.
F1 Score:
Harmonic mean of precision and recall, balancing both aspects.
Useful when the cost of false positives and false negatives is comparable.
Accuracy:
Overall fraction of correct predictions but may be misleading with imbalanced data.
Receiver Operating Characteristic (ROC) and Area Under Curve (AUC):
Evaluate a model’s diagnostic ability across classification thresholds.
Mean Absolute Error (MAE) and Mean Squared Error (MSE):
For regression tasks, measuring how far predicted fault severity or KPI values deviate from actuals (less common in pure anomaly detection).
To ensure robustness, evaluation uses techniques like cross-validation or stratified sampling to account for class imbalance prevalent in fault datasets.
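As a minimal sketch using hypothetical labels and scores rather than real fault data, these metrics can be computed directly with scikit-learn:
# Evaluation-metric sketch for a binary anomaly detector (labels/scores are illustrative).
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, roc_auc_score)

y_true  = [0, 0, 0, 1, 0, 1, 0, 0]                   # 1 = true anomaly (assumed labels)
y_pred  = [0, 0, 1, 1, 0, 1, 0, 0]                   # binary decisions from the detector
y_score = [0.1, 0.2, 0.6, 0.9, 0.1, 0.8, 0.3, 0.2]   # anomaly scores used for ROC/AUC

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))  # can mislead on imbalanced data
print("ROC AUC:  ", roc_auc_score(y_true, y_score))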
End-to-End AI Pipeline in Telecom Fault Management
A typical AI-powered pipeline to transform unstructured RAN logs into actionable fault detection includes:
Data Ingestion and Cleansing: Collection of multi-source logs (UE, cell, system events), outlier filtering, and normalization.
Feature Extraction and Enrichment: Parsing raw event text and telemetry into structured feature sets (e.g., DL throughput, HARQ NACK ratio, CQI, RSRP).
Automated Training and Model Selection: Feeding engineered features into ML models; continuous training for evolving fault signatures.
Real-Time Inference and Alerting: Scoring incoming data, issuing real-time alerts on anomalies, and classifying fault types.
Root Cause Analysis and Closed-Loop Actions: Linking anomalies to likely fault domains and initiating mitigation (power adjustment, handover, troubleshooting workflow).
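A toy sketch of how these stages fit together is shown below; the KPI records, fault labels, and feature list are illustrative placeholders, not a production pipeline.
# Toy sketch of the pipeline stages chained together (all data and labels are invented).
from sklearn.ensemble import IsolationForest, RandomForestClassifier

def extract_features(record):
    # Stage 2: structured event dict -> fixed-order numeric vector.
    return [record["dl_tput_mbps"], record["cqi"], record["harq_nack_pct"], record["sched_prb"]]

# Stage 3: offline training on (hypothetical) historical engineered features.
history = [
    {"dl_tput_mbps": 91, "cqi": 12, "harq_nack_pct": 2, "sched_prb": 106},
    {"dl_tput_mbps": 88, "cqi": 11, "harq_nack_pct": 3, "sched_prb": 104},
    {"dl_tput_mbps": 42, "cqi": 8, "harq_nack_pct": 12, "sched_prb": 72},
]
X_hist = [extract_features(r) for r in history]
detector = IsolationForest(contamination=0.34, random_state=0).fit(X_hist)
classifier = RandomForestClassifier(random_state=0).fit(
    X_hist, ["normal", "normal", "scheduler_misconfig"])

# Stages 4-5: real-time inference, alerting, and hand-off to mitigation.
new_record = {"cell_id": "cell-17", "dl_tput_mbps": 40, "cqi": 8,
              "harq_nack_pct": 14, "sched_prb": 70}
x = extract_features(new_record)
if detector.predict([x])[0] == -1:                   # -1 = anomaly
    print("anomaly on", new_record["cell_id"], "- suspected fault:", classifier.predict([x])[0])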
Generalized Use Cases
AI-powered anomaly detection is fundamental to numerous telecom applications:
Early detection of RAN performance degradations, before customer impact.
Predictive maintenance for hardware/software failures in base stations or network functions.
Automated isolation and correlation of multi-domain faults, reducing alert noise and pinpointing root causes.
Enhancing SLA compliance, lowering mean time to resolution (MTTR), and improving operator and end-user experience.
Below are the top 10 most commonly observed 5G and 4G LTE KPI degradation issues faced by telco operators worldwide, how each is currently resolved by traditional means (without AI/ML), and how AI/ML models can offer more advanced solutions:
Top 10 KPI Degradation Issues
This case study focuses on the “Low Throughput” KPI issue, covering step-by-step manual troubleshooting procedures and comparing them with AI/ML-driven resolution.
Low throughput means your phone or computer is getting a slow connection, even though the network should be fast. Imagine you’re trying to stream a video or download something, but it keeps pausing or takes a long time—even if you have “full bars” or pay for fast internet.

Step-by-Step Manual Troubleshooting for Low Throughput
Initial Alarm/Complaint Collection
Receive customer trouble tickets or OSS alarms indicating low throughput in a sector/cell.
Data Gathering
Pull RAN performance counters and KPIs (e.g., PRB utilization, RSRP/SINR, CQI, MCS, BLER, HARQ NACK).
Review drive test results or UE logs for affected area/device.
Root Cause Isolation
Check for:
Abnormal PRB/PUCCH usage, congestion
Poor RF conditions (RSRP/SINR/CQI) or high interference
Incorrect scheduling/MCS values
Hardware faults or poor antenna connections
Cross-check config (bandwidth, MIMO, scheduler, backhaul, cell parameters).
Field/Physical Inspection
If anomaly persists, request a site visit:
Inspect antennas, feeders, jumpers for damage or loose connectors.
Conduct spectrum analysis to find external interference.
Remediative Actions
Tune scheduler/config parameters (e.g., PRB allocation, MCS thresholds).
Rectify RF/hardware faults, clean connectors, re-align antennas.
Rerun drive test to validate improvement; update ticket/resolution status.

Traditional Resolution Methods (Without AI/ML)
Heavy reliance on manual log extraction from OSS/NMS, experience-driven triage, checklists, and network audits.
Static thresholds generate fixed alarms; engineers must dig deeper for real issues (often using element/vendor toolkits).
Cross-team troubleshooting (RF, transport, core, device teams) based on phone calls, email escalations, and human pattern recognition.
Periodic audits and drive tests attempt to catch chronic issues.
Step-by-Step AI/ML-Driven Troubleshooting for Low Throughput
1. Automated Data Collection & Feature Extraction
What happens: Real-time ingestion of logs/KPIs from UEs, gNBs, and core (PRB usage, CQI, MCS, HARQ stats, slice config, core AMBR etc.).
AI/ML methods used:
Data Parsing/NLP: Custom scripts and sequence-to-structure models parse raw/unstructured logs into structured feature tables.
Feature Engineering: Automated pipelines or feature stores generate relevant features (rolling averages, deltas, composite metrics) for modeling.
Purpose: Ensure all aspects influencing throughput are systematically captured so models are accurate and up-to-date.
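As an illustrative sketch, the parser below turns MAC-log lines (in the format of the sample shown later in this article) into a structured feature table with simple rolling features; the regular expression and field names are assumptions based on that sample.
# Sketch: parse semi-structured MAC-log lines into structured features (regex is illustrative).
import re
import pandas as pd

LINE_RE = re.compile(
    r"Time (?P<ts>\d+): SchedPRB=(?P<prb>\d+), CQI=(?P<cqi>\d+), "
    r"DL_Tput=(?P<tput>\d+)Mbps, HARQ_NACK=(?P<nack>\d+)%, MCS=(?P<mcs>\w+)")

def parse_lines(lines):
    rows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            d = m.groupdict()
            rows.append({k: (v if k == "mcs" else int(v)) for k, v in d.items()})
    df = pd.DataFrame(rows)
    # Simple feature engineering: rolling mean and delta of DL throughput.
    df["tput_roll_mean"] = df["tput"].rolling(2, min_periods=1).mean()
    df["tput_delta"] = df["tput"].diff().fillna(0)
    return df

print(parse_lines([
    "Time 123456: SchedPRB=72, CQI=8, DL_Tput=42Mbps, HARQ_NACK=12%, MCS=QPSK",
    "Time 123789: SchedPRB=106, CQI=12, DL_Tput=91Mbps, HARQ_NACK=2%, MCS=64QAM",
]))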
2. Anomaly Detection
What happens: The AI detects cells/users whose throughput and related KPIs deviate from historical or peer-group baselines.
AI/ML methods used:
Unsupervised Learning
Isolation Forest: Randomly splits data, detecting outliers that require fewer splits. Good for identifying rare KPI states or combinations.
Autoencoders: Neural networks learn to compress and reconstruct typical KPI patterns. High “reconstruction error” = anomaly.
Time Series Models
ARIMA/LSTM: Forecast expected values; actual is compared to forecast to flag anomalies.
Statistical Methods
Z-score, moving average, and quantile-based detectors for quick filtering.
Purpose: Detect abnormal drops in throughput, even before alarms would fire or customers complain.
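A minimal sketch of the statistical baseline mentioned above (a rolling z-score on DL throughput); the throughput series and the 3-sigma threshold are illustrative.
# Sketch: rolling z-score detector on a DL-throughput series (synthetic values, in Mbps).
import pandas as pd

tput = pd.Series([90, 92, 88, 91, 89, 90, 45, 43, 88, 90])

window = 5
mean = tput.rolling(window).mean()
std = tput.rolling(window).std()
z = (tput - mean.shift(1)) / std.shift(1)   # compare each sample to its trailing baseline

anomalies = tput[z.abs() > 3]               # flag samples more than 3 sigma from baseline
print(anomalies)                            # the drop to ~45 Mbps is flagged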
3. Root Cause Analysis & Explainability
What happens: Once an anomaly is found, the system determines why throughput is low—by correlating all feature shifts (e.g., low PRB, bad CQI, high HARQ).
AI/ML methods used:
Supervised Learning/Classification
Random Forests/XGBoost: Trained on historical labeled tickets to classify the cause (“congestion,” “scheduler misconfig,” etc.)—returns “feature importance”.
Explainable AI (XAI)
SHAP/LIME: Highlights which features (e.g., “scheduler bandwidth” or “gNB PRB cap”) contributed most to the anomaly decision.
Multivariate Analysis
Correlation and association rules to find key parameter relationships.
Purpose: Pinpoint actionable causes and facilitate fast, targeted remediation.
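The sketch below illustrates the supervised part of this step: a Random Forest trained on a handful of hypothetical labeled tickets returns a suspected cause plus feature importances; SHAP's TreeExplainer can be layered on top for per-instance explanations. All labels and values are invented for illustration.
# Sketch: supervised root-cause classification with feature importances (hypothetical data).
from sklearn.ensemble import RandomForestClassifier

feature_names = ["sched_prb", "cqi", "harq_nack_pct", "prb_utilization_pct"]
X = [
    [106, 12, 2, 60],    # normal
    [104, 11, 3, 65],    # normal
    [72, 8, 12, 55],     # scheduler misconfig (few PRBs despite moderate load)
    [106, 5, 18, 95],    # congestion / poor RF
]
y = ["normal", "normal", "scheduler_misconfig", "congestion"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[70, 8, 14, 50]]))        # suspected cause for a new anomalous sample
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.2f}")       # which KPIs drove the decision
# For per-instance explanations, shap.TreeExplainer(clf) can be applied on top.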
4. Resolution Recommendation & Automation
What happens: System presents the most likely fix (e.g., “increase PRB,” “inspect feeder”, “raise AMBR”, “switch to BBR”) to engineers or triggers closed-loop corrective actions directly.
AI/ML methods used:
Reinforcement Learning (RL):
RL agents simulate/take actions (parameter changes) and learn from observed improvements in KPIs to recommend best next actions.
Can operate in closed-loop mode for auto-tuning.
Expert System/Rule Augmentation:
Augment ML with domain-encoded actions for cases where AI has lower confidence.
Purpose: Drive “zero-touch” or semi-automated fixing of issues, reducing mean-time-to-resolve and human effort.
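A sketch of the rule-augmentation side: the classifier's predicted cause is mapped to a playbook action only when its confidence clears a threshold, otherwise the case is routed to an engineer. The playbook entries and the 0.8 threshold are assumptions for illustration.
# Sketch: rule-augmented recommendation on top of a fitted scikit-learn classifier.
PLAYBOOK = {
    "scheduler_misconfig": "raise prb_count toward the cell-bandwidth maximum; switch scheduler to pf",
    "congestion": "enable/extend carrier aggregation or offload traffic to neighbour cells",
    "core_cap": "raise Session-AMBR / Slice-AMBR in SMF/PCF",
    "rf_hardware": "dispatch field team: inspect feeders, jumpers, and antenna alignment",
}

def recommend(clf, feature_vector, threshold=0.8):
    probs = clf.predict_proba([feature_vector])[0]
    cause = clf.classes_[probs.argmax()]
    if probs.max() >= threshold and cause in PLAYBOOK:
        return {"mode": "auto", "cause": cause, "action": PLAYBOOK[cause]}
    # Low confidence or unknown cause: hand off to a human with the model's best guess.
    return {"mode": "manual_review", "cause": cause, "confidence": float(probs.max())}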
5. Continuous Validation & Learning
What happens: Post-remediation, the system monitors KPIs for improvement and feeds outcome data back to training pipelines.
AI/ML methods used:
Active Learning:
System prioritizes learning from rare/edge-case resolutions to improve model generality.
Feedback Loop/Re-training:
Models auto-retrain on new diagnostic and resolution data to adapt to changing network conditions.
Purpose: Continual improvement—each fix improves the model’s accuracy, reducing future false alarms and speeding up diagnosis.
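A minimal sketch of the post-fix validation step, with illustrative KPI windows and a 20% improvement threshold chosen for the example; the outcome record is what would be queued for retraining.
# Sketch: compare KPI windows before and after remediation, then queue the outcome.
from statistics import mean

def validate_fix(pre_window, post_window, min_gain=0.20):
    pre, post = mean(pre_window), mean(post_window)
    return {"pre_mbps": pre, "post_mbps": post, "resolved": post >= pre * (1 + min_gain)}

retraining_queue = []
outcome = validate_fix(pre_window=[42, 40, 44], post_window=[89, 91, 90])
retraining_queue.append(outcome)   # later fed back into the training pipeline
print(outcome)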

Summary Table: AI/ML Amplifies Detection, Diagnosis, and Resolution
Real-time Low throughput Issue Troubleshooting
A. Example Step-by-Step Resolution: Manual vs. AI/ML
Scenario: 20MHz cell, low DL throughput detected
Step 1: Data & Symptom Collection
Manual: Engineer collects QXDM logs, checks DL BLER, CQI, PRB assignment, MCS events; downloads gNB config, checks scheduler section
AI/ML: Automated pipeline ingests UE and RAN logs, parses for correlated drops in throughput, CQI, HARQ, PRB, and MCS
Step 2: Parameter Correlation
Manual: Cross-checks whether CQI/RSRP is low; spots only 100 PRBs assigned (out of 106); suspects license shortfall or misconfigured cell BW
AI/ML: Model explains drop with feature importance: “PRB count,” “MCS,” and “CQI” strongly correlate with anomaly
Step 3: In-Depth Inspection
Manual: Downloads current gNB config:
bandwidth=20MHz (should match)
prb_count=100 (should be close to the 20 MHz maximum of 106 PRBs for NR at 15 kHz SCS; 273 PRBs applies to a 100 MHz NR carrier at 30 kHz SCS)
scheduler_type=rr (should consider pf or an advanced scheduler)
tdd_ul_dl_config=7:2 (DL:UL; DL-heavy is appropriate here)
AI/ML: Flags “max PRB assigned lower than expected”; triggers remedial action suggestion
Step 4: Reconfiguration/Tuning
Manual: Changes:
Increase prb_count to 106 (20 MHz NR at 15 kHz SCS; a 100 MHz carrier at 30 kHz SCS supports 273)
Adjust scheduler to proportional fair
Reboot cell/site
AI/ML: API/closed-loop triggers config update, validates post-fix KPIs automatically
Step 5: Validation
Manual: Repeats throughput test, validates improvement in QXDM logs; closes ticket
AI/ML: Monitors post-fix logs, confirms normal KPIs; archives scenario for retraining
B. Sample UE Log/Configuration Snippets (Field Examples)
QXDM/MAC Log:
Time 123456: SchedPRB=72, CQI=8, DL_Tput=42Mbps, HARQ_NACK=12%, MCS=QPSK
Time 123789: SchedPRB=106, CQI=12, DL_Tput=91Mbps, HARQ_NACK=2%, MCS=64QAM
gNB Config:
cell_bandwidth: 20MHz
prb_count: 100 # should be 106 for 20 MHz at 15 kHz SCS (273 for 100 MHz at 30 kHz SCS)
scheduler_type: rr # Consider 'pf'
tdd_ul_dl_config: DL:UL = 7:2
mimo_layers: 2
carrier_aggregation: enabled
C. Key Parameters to Inspect When Low Throughput Is Detected
1. UE Log (QXDM/QCAT/Chipset-Specific) Parameters
DL/UL Throughput: e.g., actual vs. scheduled throughput (per RLC, MAC logs)
CQI, PMI, RI: Low CQI/Rank Indicator often explains low modulation
BLER (Block Error Rate): High BLER degrades throughput, especially TCP
HARQ NACK Rate: Frequent NACKs signal decoding or resource/radio problems
RSRP/RSRQ/SINR: Weak/variable signal strengths lead to CQI/throughput drops
RLC Mode (AM/UM): RLC AM is sensitive to loss/retransmission; check RLC retrans counts
UE Capability Exchange: Ensure UE correctly negotiated max BW, MIMO, carrier aggregation (3GPP 38.306/38.331)
Assigned PRBs/Slot: Fewer than expected implies scheduler or license issue
Physical Cell ID/Serving Cell Info: Cross-verify expected cell selection
MCS (Modulation and Coding Scheme): Low MCS or capping/truncation may restrict throughput
L2, NAS, S1/X2 Events: Look for RLF (Radio Link Failure), drops in bearer establishment
2. gNB/eNB (Base Station) and vRAN Parameters
Bandwidth Allocation: Ensure cell defines full (20 MHz) BW for the slice/UE group (see 3GPP 38.104, 36.104)
PRB (Physical Resource Block) Mapping: Over- or under-subscription reduces throughput
Scheduler Type and Fairness Algorithm: Check for proportional fair, round robin, strict priority—misconfigurations can starve some flows (vendor-specific: Samsung vRAN, Ericsson DUS/Baseband, Nokia AirScale)
MIMO Configuration: Number of layers, beamforming settings, license/capability match
TDD/FDD Frame Settings: Wrong UL/DL ratio in TDD can throttle DL
Transmission Power, Antenna Parameters: Tx Power, tilt, beam direction
Backhaul Rate Limit/GTP-U Tunneling: Core network link/switch congestion (check for bottlenecking on S1-U, N3 interfaces)
Carrier Aggregation/GNB Capabilities: Actual use of CA/MIMO as signaled in RRC reconfigurations (see 3GPP 38.321/38.322)
Slicing/Network Slice Selection: Bandwidth/capacity reserved per NSI/SNSSAI (O-RAN, 3GPP 28.541)
Antenna/Hardware Alarms: PA/antenna feed element issues or software-flagged failures
3. Core Network/Transport Layer Checks
UPF Throughput Limits: Sufficient GTP-U tunnel resources/capacity
QoS Flows (5QI): Policed/limited throughput as per 5QI scheduling policy
TCP Window Scaling/Buffer Size: Especially relevant when TCP throughput is poor while UDP throughput is good
NSSMF/NSMF Policy: Check slice resource templates and real-time elasticity
D. Recommended values for RAN parameters
1. Physical Resource Block (PRB) Allocation
Parameter: prb_count (assigned per UE or scheduling interval)
Recommended Range:
For 5G NR: Up to 273 PRBs for 100 MHz (30 kHz SCS);
106 PRBs for 20 MHz (15 kHz SCS, FR1; LTE-compatible numerology)
Best Practice: Ensure near-maximum scheduling for eMBB UEs when cell load allows.
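As a small sanity-check sketch, the lookup below covers only the bandwidth/SCS combinations cited in this article (per 3GPP TS 38.101-1) and flags a prb_count configured well below the carrier maximum, mirroring the case study:
# Sketch: sanity-check a configured prb_count against the carrier's nominal maximum.
MAX_PRB = {
    (20, 15): 106,    # 20 MHz carrier, 15 kHz SCS (FR1)
    (100, 30): 273,   # 100 MHz carrier, 30 kHz SCS (FR1)
}

def check_prb(bandwidth_mhz, scs_khz, configured_prb, slack=0.95):
    max_prb = MAX_PRB[(bandwidth_mhz, scs_khz)]
    if configured_prb < max_prb * slack:
        return f"prb_count={configured_prb} is below {slack:.0%} of the {max_prb}-PRB maximum: check license or cell bandwidth config"
    return f"prb_count={configured_prb} OK (maximum {max_prb})"

print(check_prb(20, 15, 100))   # mirrors the case study: 100 configured vs. a 106-PRB maximum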
2. Modulation and Coding Scheme (MCS) Table
Parameter: mcs_index (0–28 in 5G NR, influences constellation and code rate)
Recommended Range:
Auto/dynamic, but MCS should align with CQI.
Low MCS index (<10) = poor channel, high retransmissions.
High MCS index (20+) = good channel for high throughput.
Note: Adaptive MCS is recommended; vendor filters such as “Filter of UE MCS value” are often set to 0–2.
3. Transmission Time Interval (TTI) / Slot Size / Mini-slot Scheduling
Parameter: tti_length, slot_duration, minislot_enabled
Recommended Range:
5G NR slot durations are 1 ms, 0.5 ms, 0.25 ms, and 0.125 ms for 15/30/60/120 kHz SCS, with mini-slots down to two OFDM symbols (~0.071 ms at 30 kHz SCS)
Shorter TTI → lower latency but more overhead; a common default is one full slot (0.5 ms at 30 kHz SCS)
Best Practice: Use smaller TTI for URLLC, default slot for eMBB.
4. HARQ (Hybrid ARQ) Settings
Parameter: harq_processes, harq_max_retx
Recommended Range:
8–16 processes
3–4 max retransmissions
Best Practice: Sufficient processes for the expected load, avoid excessive retransmissions.
5. Scheduler Algorithm and Fairness
Parameter: scheduler_type
Typical Types:
'pf' (proportional fair): preferred for balancing throughput and fairness
'rr' (round robin): testing/basic, not optimal for capacity
'priority', 'max throughput', 'QoS-aware'
Best Practice: Use ‘pf’ or advanced vendor scheduler for commercial sites.
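To illustrate why ‘pf’ balances throughput and fairness, the sketch below applies the classic proportional-fair metric (instantaneous achievable rate divided by the UE's average served rate); the UE rates are illustrative.
# Sketch: proportional-fair selection — schedule the UE with the highest inst_rate / avg_rate.
ues = {
    "ue_cell_edge":   {"inst_rate_mbps": 12, "avg_rate_mbps": 4},
    "ue_cell_center": {"inst_rate_mbps": 150, "avg_rate_mbps": 120},
}

def pf_pick(ues):
    return max(ues, key=lambda u: ues[u]["inst_rate_mbps"] / ues[u]["avg_rate_mbps"])

print(pf_pick(ues))   # picks ue_cell_edge (metric 3.0 vs. 1.25): fairness without ignoring rate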
6. BLER (Block Error Rate) Target
Parameter: bler_target
Recommended Range:
eMBB: 10% BLER at the outer-loop link adaptation point
URLLC: stricter, lower targets (e.g., 1e-5)
Best Practice: Set for intended service profile; eMBB can accept higher BLER for throughput.
7. MIMO Layer Assignment (Rank)
Parameter: rank, num_mimo_layers
Recommended Range:
1–8 layers (depending on gNB/UE support, typical: 2–4 for mid-band)
Best Practice: Adapt layers per UE capability and channel quality.
8. Uplink/Downlink Scheduling Ratios (TDD Only)
Parameter: tdd_ul_dl_config
Recommended Range:
e.g., 7:2 (DL:UL ratio for DL-heavy sites)
Note: Should match traffic profiles.
9. SRS, CSI, and Scheduling Grant Configurations
Parameter: srs_periodicity, csi_report_config
Best Practice: Configure sufficiently frequent SRS/CSI reporting (short periodicity) and adequate coverage so the scheduler has accurate channel-state information.
10. Admission/Load & Buffer Status Parameters
Parameter: bsr_threshold, admission_control_enabled
Recommended Range:
Set buffer/reporting thresholds and enable admission control for load abatement.
Example: srsRAN Config Snippet (for reference)
# srsRAN gNB config (YAML example)
scheduler:
  scheduler_type: pf
  tti_length: 0.5ms
  prb_count: 106
  mcs_table: auto
  harq_processes: 16
  bler_target: 0.1
  rank: 4
  tdd_ul_dl_config: 7:2
  admission_control_enabled: true
Quick-Reference Table
Here are core network (EPC/5GC) configuration parameters that often impact throughput and can be tuned to resolve low-throughput issues.
1. User Plane Function (UPF, SGW-U, PGW-U) Parameters
Session-AMBR (Aggregate Maximum Bit Rate)
Controls max bandwidth per PDU session (per UE or per slice).
Range/Example: Should match or exceed radio-side peak (e.g., set 200 Mbps+ for eMBB UE).
Tune: Increase if sessions are capped below radio capability.
Per-Flow QoS Policy Parameters (5QI, GBR, MBR, ARP)
5QI value chosen, GBR/MBR values set per QFI (QoS Flow Identifier).
Example: Set appropriate MBR/GBR for haptic/streaming flows, increase if constrained.
GTP-U Throughput Limits
User tunnel capacity on UPF/S-GW-U network interfaces (check interface and vSwitch limits).
Action: Increase GTP-U buffer or change switch profile to “high throughput.”
Buffer Sizes (RX/TX Buffers)
Core-side IP/TCP buffer, GTP-U buffer, virtual switch (e.g., OVS DPDK) queue length.
Recommended: Tune to avoid drops/overflow under heavy loads.
2. TCP/IP and Transport Stack Parameters
TCP Congestion Control Algorithm
Use BBR instead of CUBIC for high-latency, wireless backhaul scenarios.
Change example (Linux CLI):
sysctl -w net.ipv4.tcp_congestion_control=bbr
TCP Window Size/Scaling
Adjust rmem and wmem (socket buffer) parameters in Linux/UPF:
/proc/sys/net/ipv4/tcp_rmem
/proc/sys/net/ipv4/tcp_wmem
Example: Increase max above default for high-throughput PDU sessions (e.g., to 4MB+).
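As a small sketch, the script below reads the same /proc paths and checks whether the configured maxima meet a 4 MB target (the figure used in the example above); run it on the UPF or host under test.
# Sketch: verify TCP socket-buffer maxima against a 4 MB target (Linux only).
TARGET_MAX_BYTES = 4 * 1024 * 1024

for path in ("/proc/sys/net/ipv4/tcp_rmem", "/proc/sys/net/ipv4/tcp_wmem"):
    with open(path) as f:
        minimum, default, maximum = (int(v) for v in f.read().split())
    status = "OK" if maximum >= TARGET_MAX_BYTES else "increase via sysctl"
    print(f"{path}: max={maximum} bytes ({status})")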
3. Core Network Slice Configuration
Slice AMBR/Throughput Policy
Set proper max slice throughput (Slice-AMBR) for shared slice users.
QoS Enforcement Policy
Make sure the enforcement function doesn’t cap the flow below radio-side limits.
4. DNS/MTU/Fragmentation Handling
MTU (Maximum Transmission Unit)
Ensure the GTP-U path supports a large enough MTU to avoid fragmentation (1500 bytes standard; up to 9000 bytes with jumbo frames).
Validate that all devices along the path (vSwitch, N6, routers) agree on the MTU, since fragmentation reduces throughput.
5. Hardware and Virtualization Performance
CPU Pinning/NIC Affinity (for virtual UPF/SGWs)
Pin user-plane threads to dedicated cores and enable NIC acceleration or DPDK offloading.
SR-IOV/NVMe Acceleration
For high-throughput UEs, enable hardware offload features and fast path for TCP/IP.
Example: Diagnosing and Resolving Low Core Throughput
Case: A UE observed at most 40 Mbps DL despite excellent radio conditions and a radio link capable of 100+ Mbps.
Step 1: Check Session-AMBR and Slice-AMBR in the SMF/PCF—a default profile had capped them at 50 Mbps; adjusted to 200 Mbps.
Step 2: Check buffers/queues in the UPF—the default RX buffer was 128 KB; increased to 1 MB.
Step 3: Switch the TCP stack on the UPF from CUBIC to BBR and increase the socket buffer maximum (tcp_wmem) from 256 KB to 2 MB.
Step 4: Check the core transport MTU—it had been reduced to 1400 bytes, causing drops; restored to 1500 along the entire core path.
Result: UE throughput immediately rose to radio-side levels.
References
3GPP TS 38.306 / TS 38.331 / TS 36.104: UE capability, RRC signaling, and base-station radio requirements
3GPP TS 28.541 and O-RAN slicing documents: NSI/S-NSSAI configuration and bandwidth control
Open-source: srsRAN, Amarisoft, and OAI configuration files and debug procedures
https://www.headspin.io/blog/fixing-network-performance-issues-telecom
https://www.telecomhall.net/t/throughput-troubleshooting-drive-test-analysis/16389
https://www.mavenir.com/blog/key-to-ai-value-realization-in-telecom/
https://www.redhat.com/en/topics/ai/understanding-ai-in-telecommunications
greenwich157/telco-5G-core-faults: dataset at Hugging Face




