
AI-Driven Anomaly/Fault Detection and Management in Modern Mobile Networks

  • Writer: Venkateshu
  • Oct 28
  • 12 min read

Case Study – Low throughput issue mitigation


Introduction

The complexity of today’s telecom networks—driven by 5G’s massive scale, distributed Radio Access Network (RAN) architectures, and virtualized infrastructure—makes operational reliability and proactive fault management both a necessity and a challenge. Static rules and threshold-based monitoring techniques, once the backbone of network assurance, are now insufficient as data velocity, volume, and variety continue to grow.

 

The Need for AI in Telecom Anomaly Detection

Modern telecom environments produce multi-dimensional, varied, and high-frequency telemetry streams comprising KPIs from base stations, controllers, network functions, and end-user devices. This telemetry, much of it unstructured log data, includes throughput figures, error rates, radio signal conditions, and event histories. Relying on human analysis or simple threshold metrics leaves blind spots and slows the response to emerging problems.

AI and advanced machine learning (ML) frameworks bring several critical capabilities:

  • Autonomous Real-Time Monitoring: ML models continuously observe incoming metrics across devices, cells, and network slices, capturing subtle signal degradation or performance dips.

  • Proactive Anomaly Detection: Instead of responding after service impact, AI algorithms flag outliers, abnormal time series, and rare event correlations as soon as they emerge, enabling instant alerts and rapid investigation.

  • Adaptive Learning and Scalability: Algorithms evolve with network changes, learning new patterns of normality and adjusting to dynamic operational baselines, without manual threshold updating.

 

Methods: Machine Learning Models and Evaluation Metrics

To establish a sound AI-driven anomaly detection system, selecting appropriate ML models and evaluation metrics is essential.

Machine Learning Models Used

  • Unsupervised Models:

    • Isolation Forest: Detects anomalies by isolating data points with fewer splits in the feature space, ideal for unlabeled telecom KPI data (a minimal sketch follows this list).

    • Autoencoders: Neural networks that learn to reconstruct input features; high reconstruction error indicates anomaly.

    • Clustering Techniques: Algorithms such as DBSCAN or k-means mark sparse or divergent clusters as anomalies.

  • Supervised Classification Models:

    • Random Forest, Gradient Boosting, Support Vector Machines: Used when historical labeled data on normal/anomalous states is available.

    • These models classify new instances based on learned patterns of degradation or faults.

  • Hybrid Models:

    • Ensembles or multi-stage pipelines combining unsupervised detection for initial triggers and supervised classification for validation and fault categorization.
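
To make the Isolation Forest idea above concrete, here is a minimal sketch that fits scikit-learn's IsolationForest on a handful of made-up per-cell KPI rows. The feature names, values, and contamination setting are illustrative assumptions, not a production recipe.

# Minimal sketch: Isolation Forest over per-cell KPI snapshots (hypothetical schema).
import pandas as pd
from sklearn.ensemble import IsolationForest

kpi = pd.DataFrame({
    "dl_throughput_mbps": [88, 92, 90, 41, 89, 87],
    "cqi":                [12, 13, 12, 7, 12, 11],
    "harq_nack_ratio":    [0.02, 0.01, 0.02, 0.14, 0.02, 0.03],
    "prb_utilization":    [0.55, 0.60, 0.58, 0.97, 0.57, 0.59],
})
features = kpi[["dl_throughput_mbps", "cqi", "harq_nack_ratio", "prb_utilization"]]

# contamination is a prior guess of the anomaly fraction; tune per network.
model = IsolationForest(n_estimators=200, contamination=0.1, random_state=42)
kpi["label"] = model.fit_predict(features)        # -1 = anomaly, 1 = normal
kpi["score"] = model.decision_function(features)  # lower = more anomalous
print(kpi[kpi["label"] == -1])                    # the degraded row stands out

In practice the model would be fit per cell or per peer group over weeks of history, with the contamination rate tuned to the desired alarm budget.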

Evaluation Metrics for Anomaly Detection

Evaluating model performance meaningfully requires metrics aligned with telecom operational goals:

  • Precision and Recall:

    • Precision measures the fraction of detected anomalies that are true anomalies.

    • Recall measures how many true anomalies were detected by the model.

    • Often, there’s a trade-off: increasing sensitivity improves recall but may increase false alarms.

  • F1 Score:

    • Harmonic mean of precision and recall, balancing both aspects.

    • Useful when the cost of false positives and false negatives is comparable.

  • Accuracy:

    • Overall fraction of correct predictions but may be misleading with imbalanced data.

  • Receiver Operating Characteristic (ROC) and Area Under Curve (AUC):

    • Evaluate a model’s diagnostic ability across classification thresholds.

  • Mean Absolute Error (MAE) and Mean Squared Error (MSE):

    • For regression tasks, measuring how far predicted fault severity or KPI values deviate from actuals (less common in pure anomaly detection).

To ensure robustness, evaluation uses techniques like cross-validation or stratified sampling to account for class imbalance prevalent in fault datasets.
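
As a quick illustration, the metrics above can be computed directly with scikit-learn on a detector's output; the labels and scores below are invented.

# Evaluation sketch for a binary anomaly detector (made-up labels/scores).
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             accuracy_score, roc_auc_score)

y_true  = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]                       # 1 = true anomaly
y_pred  = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]                       # hard decisions
y_score = [0.1, 0.2, 0.7, 0.9, 0.1, 0.8, 0.3, 0.2, 0.4, 0.1]   # anomaly scores

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))   # can mislead when classes are imbalanced
print("ROC AUC:  ", roc_auc_score(y_true, y_score))   # threshold-independent view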

 

End-to-End AI Pipeline in Telecom Fault Management

A typical AI-powered pipeline to transform unstructured RAN logs into actionable fault detection includes the following stages (a compact sketch follows the list):

  1. Data Ingestion and Cleansing: Collection of multi-source logs (UE, cell, system events), outlier filtering, and normalization.

  2. Feature Extraction and Enrichment: Parsing raw event text and telemetry into structured feature sets (e.g., DL throughput, HARQ NACK ratio, CQI, RSRP).

  3. Automated Training and Model Selection: Feeding engineered features into ML models; continuous training for evolving fault signatures.

  4. Real-Time Inference and Alerting: Scoring incoming data, issuing real-time alerts on anomalies, and classifying fault types.

  5. Root Cause Analysis and Closed-Loop Actions: Linking anomalies to likely fault domains and initiating mitigation (power adjustment, handover, troubleshooting workflow).
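
The compact sketch below strings together a minimal ingest-features-score-alert flow in Python on a couple of invented records, just to show how the stages hand off to each other; the function names, fields, and threshold rule are assumptions rather than a production design.

# Skeleton of the pipeline hand-off: ingest -> features -> score -> alert.
import pandas as pd

def ingest_and_clean(raw_records):
    # Stage 1: collect multi-source records and drop incomplete rows.
    return pd.DataFrame(raw_records).dropna()

def extract_features(df):
    # Stage 2: derive simple engineered features from the raw telemetry.
    df = df.copy()
    df["tput_delta"] = df["dl_throughput_mbps"].diff().fillna(0.0)
    return df

def score(df):
    # Stages 3/4: placeholder rule; a trained anomaly model would go here.
    return (df["dl_throughput_mbps"] < 50) | (df["harq_nack_ratio"] > 0.10)

records = [
    {"cell_id": "A1", "dl_throughput_mbps": 90, "harq_nack_ratio": 0.02},
    {"cell_id": "A2", "dl_throughput_mbps": 38, "harq_nack_ratio": 0.15},
]
df = extract_features(ingest_and_clean(records))
for _, row in df[score(df)].iterrows():
    print(f"ALERT cell={row['cell_id']} dl_throughput={row['dl_throughput_mbps']} Mbps")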

 

Generalized Use Cases

AI-powered anomaly detection is fundamental to numerous telecom applications:

  • Early detection of RAN performance degradations, before customer impact.

  • Predictive maintenance for hardware/software failures in base stations or network functions.

  • Automated isolation and correlation of multi-domain faults, reducing alert noise and pinpointing root causes.

  • Enhancing SLA compliance, lowering mean time to resolution (MTTR), and improving operator and end-user experience.

 

Below are the top 10 most commonly observed 5G/4G LTE KPI degradation issues faced by telco operators worldwide, how each is currently resolved by traditional means (without AI/ML), and how AI/ML models can offer an advanced solution:

 

Top 10 KPI Degradation Issues

  1. Call Drop Rate (CDR) Increase

    • Traditional (non-AI/ML): Manual log analysis, drive tests, RF audits, parameter tuning

    • AI/ML-driven: Real-time anomaly detection on call traces with automatic root-cause mapping

  2. Low Data Throughput

    • Traditional (non-AI/ML): Element-wise KPI checks, spectrum analyzer sweeps, static thresholds

    • AI/ML-driven: Predictive throughput-degradation detection using UE logs and supervised anomaly models

  3. Attach/Connection Success Rate Drop

    • Traditional (non-AI/ML): Checking signaling counters, parameter audits, manual message tracing

    • AI/ML-driven: Pattern mining of signaling failures, automated clustering of root causes

  4. High Handover Failure Rate

    • Traditional (non-AI/ML): Cross-validation of handover parameters, site-level troubleshooting

    • AI/ML-driven: ML-based detection of abnormal mobility/KPI behavior, dynamic parameter tuning

  5. High Packet Loss/Error Rate

    • Traditional (non-AI/ML): Traffic mirroring, protocol decodes, repeated reconfigurations

    • AI/ML-driven: Sequence anomaly detection in packet flows, early warnings for growing trends

  6. Accessibility/Retainability Dips

    • Traditional (non-AI/ML): Scope isolation (cell, cluster), hardware sweeps, checklist audits

    • AI/ML-driven: Automated detection of sleeping cells using unsupervised learning

  7. Uplink/Downlink Latency Surge

    • Traditional (non-AI/ML): Latency test scripts, link path checks, physical inspections

    • AI/ML-driven: Latency root-cause inference from correlated KPIs, time-series anomaly scoring

  8. Low Voice Quality (MOS/VoLTE Fail)

    • Traditional (non-AI/ML): Drive test/voice QoE probes, audio playback assessment

    • AI/ML-driven: Audio analytics and deep learning to infer voice degradations from call traces

  9. Sudden RSSI Imbalance/Interference

    • Traditional (non-AI/ML): On-field RF sweeps, checking jumpers/connectors, interference hunting

    • AI/ML-driven: RF sensor-data fusion, deep learning classifiers for real vs. external interference

  10. Network Congestion/Overload

    • Traditional (non-AI/ML): Traffic report trending, capacity planning, static PRB allocation

    • AI/ML-driven: Predictive congestion management using reinforcement learning for RRM


The remainder of this post is a case study on the “Low Throughput” KPI issue, covering step-by-step manual troubleshooting procedures and comparing them with advanced AI/ML-driven resolution.

Low throughput means your phone or computer is getting a slow connection, even though the network should be fast. Imagine you’re trying to stream a video or download something, but it keeps pausing or takes a long time—even if you have “full bars” or pay for fast internet.

 


Step-by-Step Manual Troubleshooting for Low Throughput

  1. Initial Alarm/Complaint Collection

    • Receive customer trouble tickets or OSS alarms indicating low throughput in a sector/cell.

  2. Data Gathering

    • Pull RAN performance counters and KPIs (e.g., PRB utilization, RSRP/SINR, CQI, MCS, BLER, HARQ NACK).

    • Review drive test results or UE logs for affected area/device.

  3. Root Cause Isolation

    • Check for:

      • Abnormal PRB/PUCCH usage, congestion

      • Poor RF conditions (RSRP/SINR/CQI) or high interference

      • Incorrect scheduling/MCS values

      • Hardware faults or poor antenna connections

    • Cross-check config (bandwidth, MIMO, scheduler, backhaul, cell parameters).

  4. Field/Physical Inspection

    • If anomaly persists, request a site visit:

      • Inspect antennas, feeders, jumpers for damage or loose connectors.

      • Conduct spectrum analysis to find external interference.

  5. Remediative Actions

    • Tune scheduler/config parameters (e.g., PRB allocation, MCS thresholds).

    • Rectify RF/hardware faults, clean connectors, re-align antennas.

    • Rerun drive test to validate improvement; update ticket/resolution status.

 


Traditional Resolution Methods (Without AI/ML)

  • Heavy reliance on manual log extraction from OSS/NMS, experience-driven triage, checklists, and network audits.

  • Static thresholds generate fixed alarms; engineers must dig deeper for real issues (often using element/vendor toolkits).

  • Cross-team troubleshooting (RF, transport, core, device teams) based on phone calls, email escalations, and human pattern recognition.

  • Periodic audits and drive tests attempt to catch chronic issues.

 

Step-by-Step AI/ML-Driven Troubleshooting for Low Throughput

1. Automated Data Collection & Feature Extraction

  • What happens: Real-time ingestion of logs/KPIs from UEs, gNBs, and core (PRB usage, CQI, MCS, HARQ stats, slice config, core AMBR etc.).

  • AI/ML methods used:

    • Data Parsing/NLP: Custom scripts and sequence-to-structure models parse raw/unstructured logs into structured feature tables.

    • Feature Engineering: Automated pipelines or feature stores generate relevant features (rolling averages, deltas, composite metrics) for modeling.

Purpose: Ensure all aspects influencing throughput are systematically captured so models are accurate and up-to-date.
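
As an illustration of this stage, the sketch below parses log lines shaped like the QXDM example later in this post into a pandas table and adds simple rolling features; the regex and column names are assumptions for illustration only.

# Parse raw MAC-log-style lines into structured features (illustrative format).
import re
import pandas as pd

LOG_PATTERN = re.compile(
    r"Time (?P<t>\d+): SchedPRB=(?P<prb>\d+), CQI=(?P<cqi>\d+), "
    r"DL_Tput=(?P<tput>\d+)Mbps, HARQ_NACK=(?P<nack>\d+)%, MCS=(?P<mcs>\w+)"
)

raw_lines = [
    "Time 123456: SchedPRB=72, CQI=8, DL_Tput=42Mbps, HARQ_NACK=12%, MCS=QPSK",
    "Time 123789: SchedPRB=106, CQI=12, DL_Tput=91Mbps, HARQ_NACK=2%, MCS=64QAM",
]

rows = [m.groupdict() for line in raw_lines if (m := LOG_PATTERN.match(line))]
df = pd.DataFrame(rows).astype({"t": int, "prb": int, "cqi": int, "tput": int, "nack": int})
df["tput_roll_mean"] = df["tput"].rolling(2, min_periods=1).mean()   # rolling average
df["tput_delta"] = df["tput"].diff().fillna(0)                       # sample-to-sample delta
print(df)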

 

2. Anomaly Detection

  • What happens: The AI detects cells/users whose throughput and related KPIs deviate from historical or peer-group baselines.

  • AI/ML methods used:

    • Unsupervised Learning

      • Isolation Forest: Randomly splits data, detecting outliers that require fewer splits. Good for identifying rare KPI states or combinations.

      • Autoencoders: Neural networks learn to compress and reconstruct typical KPI patterns. High “reconstruction error” = anomaly.

    • Time Series Models

      • ARIMA/LSTM: Forecast expected values; actual is compared to forecast to flag anomalies.

    • Statistical Methods

      • Z-score, moving average, and quantile-based detectors for quick filtering.

Purpose: Detect abnormal drops in throughput, even before alarms would fire or customers complain.
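
A minimal sketch of the statistical option above: a rolling z-score on a downlink throughput series. The window size, threshold, and values are invented.

# Rolling z-score detector on a per-cell DL throughput series (Mbps, made up).
import pandas as pd

tput = pd.Series([90, 92, 88, 91, 89, 90, 52, 48, 50, 89, 91])

window = 5
mean = tput.rolling(window, min_periods=window).mean()
std = tput.rolling(window, min_periods=window).std()
z = (tput - mean.shift(1)) / std.shift(1)   # compare each sample to the previous window

anomalies = z[z.abs() > 3]                  # |z| > 3 flags the sudden throughput drop
print(anomalies)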

 

3. Root Cause Analysis & Explainability

  • What happens: Once an anomaly is found, the system determines why throughput is low—by correlating all feature shifts (e.g., low PRB, bad CQI, high HARQ).

  • AI/ML methods used:

    • Supervised Learning/Classification

      • Random Forests/XGBoost: Trained on historical labeled tickets to classify the cause (“congestion,” “scheduler misconfig,” etc.)—returns “feature importance”.

    • Explainable AI (XAI)

      • SHAP/LIME: Highlights which features (e.g., “scheduler bandwidth” or “gNB PRB cap”) contributed most to the anomaly decision.

    • Multivariate Analysis

      • Correlation and association rules to find key parameter relationships.

Purpose: Pinpoint actionable causes and facilitate fast, targeted remediation.
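
A hedged sketch of the supervised side: a Random Forest trained on a tiny, invented labelled history, with plain feature importances standing in for the SHAP/LIME layer.

# Root-cause classification with feature importances (hypothetical training data).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X = pd.DataFrame({
    "prb_assigned":  [100, 72, 106, 60, 105, 70],
    "cqi":           [12, 8, 13, 12, 12, 7],
    "harq_nack_pct": [2, 12, 1, 3, 2, 14],
})
y = ["normal", "radio", "normal", "congestion", "normal", "radio"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

new_case = pd.DataFrame({"prb_assigned": [65], "cqi": [12], "harq_nack_pct": [3]})
print("predicted cause:", clf.predict(new_case)[0])
print("feature importances:", dict(zip(X.columns, clf.feature_importances_.round(3))))

On real ticket history, per-prediction explanations (e.g., SHAP values) rather than global importances would be shown to the engineer alongside the predicted cause.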

 

4. Resolution Recommendation & Automation

  • What happens: System presents the most likely fix (e.g., “increase PRB,” “inspect feeder”, “raise AMBR”, “switch to BBR”) to engineers or triggers closed-loop corrective actions directly.

  • AI/ML methods used:

    • Reinforcement Learning (RL):

      • RL agents simulate/take actions (parameter changes) and learn from observed improvements in KPIs to recommend best next actions.

      • Can operate in closed-loop mode for auto-tuning.

    • Expert System/Rule Augmentation:

      • Augment ML with domain-encoded actions for cases where AI has lower confidence.

Purpose: Drive “zero-touch” or semi-automated fixing of issues, reducing mean-time-to-resolve and human effort.
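
The toy sketch below shows the flavour of the RL idea with an epsilon-greedy bandit that learns which remediation action tends to improve throughput. The actions, rewards, and simulated network response are all invented; a real closed loop would act on live KPIs with guardrails.

# Epsilon-greedy action recommender (toy simulation, not a production agent).
import random

actions = ["increase_prb", "switch_scheduler_pf", "raise_ambr", "inspect_feeder"]
value = {a: 0.0 for a in actions}   # running estimate of Mbps gain per action
count = {a: 0 for a in actions}
EPSILON = 0.1

def recommend():
    if random.random() < EPSILON:                # explore occasionally
        return random.choice(actions)
    return max(actions, key=lambda a: value[a])  # otherwise exploit the best estimate

def record_outcome(action, observed_gain_mbps):
    count[action] += 1
    value[action] += (observed_gain_mbps - value[action]) / count[action]  # incremental mean

for _ in range(200):                             # simulated trial loop
    a = recommend()
    gain = {"increase_prb": 25, "switch_scheduler_pf": 10,
            "raise_ambr": 5, "inspect_feeder": 2}[a] + random.gauss(0, 3)
    record_outcome(a, gain)

print("learned action values:", {a: round(v, 1) for a, v in value.items()})
print("recommended next action:", recommend())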

 

5. Continuous Validation & Learning

  • What happens: Post-remediation, the system monitors KPIs for improvement and feeds outcome data back to training pipelines.

  • AI/ML methods used:

    • Active Learning:

      • System prioritizes learning from rare/edge-case resolutions to improve model generality.

    • Feedback Loop/Re-training:

      • Models auto-retrain on new diagnostic and resolution data to adapt to changing network conditions.

Purpose: Continual improvement—each fix improves the model’s accuracy, reducing future false alarms and speeding up diagnosis.
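
A small sketch of the feedback loop: confirmed resolutions are appended to the training history and the classifier is periodically refit. The data structures and retrain trigger are assumptions.

# Feedback/retraining loop (illustrative; real pipelines would version data and models).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

history_X = pd.DataFrame({"prb": [100, 70], "cqi": [12, 8]})
history_y = ["normal", "radio"]
model = RandomForestClassifier(random_state=0).fit(history_X, history_y)

def on_resolution_confirmed(features, confirmed_cause):
    """Called once an engineer validates (or corrects) the diagnosis."""
    global history_X, history_y, model
    history_X = pd.concat([history_X, pd.DataFrame([features])], ignore_index=True)
    history_y = history_y + [confirmed_cause]
    if len(history_y) % 2 == 0:          # simple retrain trigger; real systems schedule this
        model = RandomForestClassifier(random_state=0).fit(history_X, history_y)

on_resolution_confirmed({"prb": 60, "cqi": 13}, "congestion")
on_resolution_confirmed({"prb": 104, "cqi": 4}, "radio")
print("classes now known to the model:", list(model.classes_))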


Summary Table: AI/ML Amplifies Detection, Diagnosis, and Resolution

 

  • Slow, manual triage: without AI/ML, a slow and fragmented view; with AI/ML, holistic, real-time, predictive, automated operations

  • High operational overhead: without AI/ML, human bottlenecks; with AI/ML, reduced field dispatches and faster MTTR

  • Poor root cause isolation: without AI/ML, experience-driven; with AI/ML, data-driven, systemic cause mapping

Real-Time Low Throughput Issue Troubleshooting

 

A. Example Step-by-Step Resolution: Manual vs. AI/ML

Scenario: 20MHz cell, low DL throughput detected

Step 1: Data & Symptom Collection

  • Manual: Engineer collects QXDM logs, checks DL BLER, CQI, PRB assignment, MCS events; downloads gNB config, checks scheduler section

  • AI/ML: Automated pipeline ingests UE and RAN logs, parses for correlated drops in throughput, CQI, HARQ, PRB, and MCS

Step 2: Parameter Correlation

  • Manual: Cross-checks whether CQI/RSRP is low; spots only 100 PRBs assigned (out of 106); suspects license shortfall or misconfigured cell BW

  • AI/ML: Model explains drop with feature importance: “PRB count,” “MCS,” and “CQI” strongly correlate with anomaly

Step 3: In-Depth Inspection

  • Manual: Downloads current gNB config:

    • bandwidth=20MHz (should match)

      • prb_count=100 (should be close to the 20 MHz maximum: ~106 PRBs for 5G NR at 15 kHz SCS; 273 PRBs applies only to a 100 MHz NR carrier at 30 kHz SCS)

    • scheduler_type=rr (should consider pf or advanced)

      • tdd_ul_dl_config=7:2 (DL:UL; DL-heavy is appropriate here)

  • AI/ML: Flags “max PRB assigned lower than expected”; triggers remedial action suggestion

Step 4: Reconfiguration/Tuning

  • Manual: Changes:

    • Increase prb_count to the carrier maximum (~106 for a 20 MHz NR carrier at 15 kHz SCS; 273 applies to 100 MHz at 30 kHz SCS)

    • Adjust scheduler to proportional fair

    • Reboot cell/site

  • AI/ML: API/closed-loop triggers config update, validates post-fix KPIs automatically

Step 5: Validation

  • Manual: Repeats throughput test, validates improvement in QXDM logs; closes ticket

  • AI/ML: Monitors post-fix logs, confirms normal KPIs; archives scenario for retraining

 

B. Sample UE Log/Configuration Snippets (Field Examples)

QXDM/MAC Log:

Time 123456: SchedPRB=72, CQI=8, DL_Tput=42Mbps, HARQ_NACK=12%, MCS=QPSK
Time 123789: SchedPRB=106, CQI=12, DL_Tput=91Mbps, HARQ_NACK=2%, MCS=64QAM
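
As a rough sanity check on log lines like these, one can estimate a ballpark DL ceiling from the PRB count and modulation order. The 15 kHz SCS (14 symbols per ms), 2 MIMO layers, ~0.75 code rate, and 25% overhead factor below are assumptions, not a 3GPP-exact calculation.

# Rough resource-element-counting estimate of a DL throughput ceiling.
def rough_dl_ceiling_mbps(prb, bits_per_symbol, layers=2, code_rate=0.75,
                          symbols_per_ms=14, overhead=0.25):
    res_elements_per_ms = prb * 12 * symbols_per_ms            # 12 subcarriers per PRB
    bits_per_ms = res_elements_per_ms * bits_per_symbol * code_rate * layers
    return bits_per_ms * (1 - overhead) / 1000.0               # bits/ms divided by 1000 = Mbit/s

print(rough_dl_ceiling_mbps(prb=106, bits_per_symbol=6))  # 64QAM, ~120 Mbps; the logged 91 Mbps fits under it
print(rough_dl_ceiling_mbps(prb=72,  bits_per_symbol=6))  # same modulation but PRB-starved, ~82 Mbps ceiling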

 

gNB Config:

cell_bandwidth: 20MHz
prb_count: 100              # should be ~106 for 20 MHz NR at 15 kHz SCS
scheduler_type: rr          # consider 'pf'
tdd_ul_dl_config: DL:UL = 7:2
mimo_layers: 2
carrier_aggregation: enabled

 

C. Key Parameters to Inspect When Low Throughput Is Detected

1. UE Log (QXDM/QCAT/Chipset-Specific) Parameters

  • DL/UL Throughput: e.g., actual vs. scheduled throughput (per RLC, MAC logs)

  • CQI, PMI, RI: Low CQI/Rank Indicator often explains low modulation

  • BLER (Block Error Rate): High BLER degrades throughput, especially TCP

  • HARQ NACK Rate: Frequent NACKs signal decoding or resource/radio problems

  • RSRP/RSRQ/SINR: Weak/variable signal strengths lead to CQI/throughput drops

  • RLC Mode (AM/UM): RLC AM is sensitive to loss/retransmission; check RLC retrans counts

  • UE Capability Exchange: Ensure UE correctly negotiated max BW, MIMO, carrier aggregation (3GPP 38.306/38.331)

  • Assigned PRBs/Slot: Fewer than expected implies scheduler or license issue

  • Physical Cell ID/Serving Cell Info: Cross-verify expected cell selection

  • MCS (Modulation and Coding Scheme): Low MCS or capping/truncation may restrict throughput

  • L2, NAS, S1/X2 Events: Look for RLF (Radio Link Failure), drops in bearer establishment

2. gNB/eNB (Base Station) and vRAN Parameters

  • Bandwidth Allocation: Ensure cell defines full (20 MHz) BW for the slice/UE group (see 3GPP 38.104, 36.104)

  • PRB (Physical Resource Block) Mapping: Over- or under-subscription reduces throughput

  • Scheduler Type and Fairness Algorithm: Check for proportional fair, round robin, strict priority—misconfigurations can starve some flows (vendor-specific: Samsung vRAN, Ericsson DUS/Baseband, Nokia AirScale)

  • MIMO Configuration: Number of layers, beamforming settings, license/capability match

  • TDD/FDD Frame Settings: Wrong UL/DL ratio in TDD can throttle DL

  • Transmission Power, Antenna Parameters: Tx Power, tilt, beam direction

  • Backhaul Rate Limit/GTP-U Tunneling: Core network link/switch congestion (check for bottlenecking on S1-U, N3 interfaces)

  • Carrier Aggregation/gNB Capabilities: Actual use of CA/MIMO as signaled in RRC reconfigurations (see 3GPP 38.331)

  • Slicing/Network Slice Selection: Bandwidth/capacity reserved per NSI/SNSSAI (O-RAN, 3GPP 28.541)

  • Antenna/Hardware Alarms: PA/antenna feed element issues or software-flagged failures

3. Core Network/Transport Layer Checks

  • UPF Throughput Limits: Sufficient GTP-U tunnel resources/capacity

  • QoS Flows (5QI): Policed/limited throughput as per 5QI scheduling policy

  • TCP Window Scaling/Buffer Size: Check especially when TCP throughput is poor while UDP throughput is good

  • NSSMF/NSMF Policy: Check slice resource templates and real-time elasticity

 

 

D. Recommended values for RAN parameters

 

1. Physical Resource Block (PRB) Allocation

  • Parameter: prb_count (assigned per UE or scheduling interval)

  • Recommended Range:

    • For 5G NR: up to 273 PRBs for 100 MHz (30 kHz SCS); ~106 PRBs for 20 MHz (15 kHz SCS, FR1); a 20 MHz LTE carrier uses 100 PRBs

  • Best Practice: Ensure near-maximum scheduling for eMBB UEs when cell load allows.​

2. Modulation and Coding Scheme (MCS) Table

  • Parameter: mcs_index (0–28 in 5G NR, influences constellation and code rate)

  • Recommended Range:

    • Auto/dynamic, but MCS should align with CQI.

    • Low MCS index (<10) = poor channel, high retransmissions.

    • High MCS index (20+) = good channel for high throughput.

  • Note: Adaptive MCS recommended, with “Filter of UE MCS value” often set 0–2.​

3. Transmission Time Interval (TTI) / Slot Size / Mini-slot Scheduling

  • Parameter: tti_length, slot_duration, minislot_enabled

  • Recommended Range:

    • 5G NR slot duration scales with numerology: 1 ms (15 kHz SCS), 0.5 ms (30 kHz), 0.25 ms (60 kHz), 0.125 ms (120 kHz); mini-slots (2/4/7 symbols) go shorter still (~0.071 ms)

    • Shorter TTI → lower latency but more overhead; a common default is one full slot (0.5 ms at 30 kHz SCS)

  • Best Practice: Use smaller TTI for URLLC, default slot for eMBB.​

4. HARQ (Hybrid ARQ) Settings

  • Parameter: harq_processes, harq_max_retx

  • Recommended Range:

    • 8–16 processes

    • 3–4 max retransmissions

  • Best Practice: Sufficient processes for the expected load, avoid excessive retransmissions.​

5. Scheduler Algorithm and Fairness

  • Parameter: scheduler_type

  • Typical Types:

    • 'pf' (proportional fair): preferred for balancing throughput and fairness

    • 'rr' (round robin): testing/basic, not optimal for capacity

    • 'priority', 'max throughput', 'QoS-aware'

  • Best Practice: Use ‘pf’ or advanced vendor scheduler for commercial sites.

6. BLER (Block Error Rate) Target

  • Parameter: bler_target

  • Recommended Range:

    • eMBB: 10% BLER at the outer-loop link adaptation point

    • URLLC: stricter, lower targets (e.g., 1e-5)​

  • Best Practice: Set for intended service profile; eMBB can accept higher BLER for throughput.​

7. MIMO Layer Assignment (Rank)

  • Parameter: rank, num_mimo_layers

  • Recommended Range:

    • 1–8 layers (depending on gNB/UE support, typical: 2–4 for mid-band)

  • Best Practice: Adapt layers per UE capability and channel quality.

8. Uplink/Downlink Scheduling Ratios (TDD Only)

  • Parameter: tdd_ul_dl_config

  • Recommended Range:

    • e.g., 7:2 (DL:UL ratio for DL-heavy sites)

  • Note: Should match traffic profiles.

9. SRS, CSI, and Scheduling Grant Configurations

  • Parameter: srs_periodicity, csi_report_config

  • Best Practice: Set high enough reporting periodicity and coverage to optimize scheduler accuracy.​

10. Admission/Load & Buffer Status Parameters

  • Parameter: bsr_threshold, admission_control_enabled

  • Recommended Range:

    • Set buffer/reporting thresholds and enable admission control for load abatement.​

 

Example: srsRAN Config Snippet (for reference)

# srsRAN gNB config (YAML example)
scheduler:
  scheduler_type: pf
  tti_length: 0.5ms
  prb_count: 106
  mcs_table: auto
  harq_processes: 16
  bler_target: 0.1
  rank: 4
  tdd_ul_dl_config: 7:2
  admission_control_enabled: true

 

Quick-Reference Table

  • PRB Count: 50 (10 MHz), 100/106 (20 MHz), 273 (100 MHz, 5G NR)

  • MCS Index: 0–28 (auto/dynamic, matched to CQI)

  • Scheduler Type: ‘pf’, ‘priority’, ‘QoS-aware’

  • TTI/Slot Size: 0.5 ms (standard), 0.125–0.25 ms (mini-slot)

  • HARQ Processes: 8–16

  • HARQ Retransmissions: 3–4

  • BLER Target: 10% (eMBB), <<1% (URLLC)

  • MIMO Layers/Rank: 1–8 (UE/gNB capability dependent)

  • TDD UL:DL Config: e.g., 7:2 for DL-heavy sites

Here are core network (EPC/5GC) configuration parameters that often impact end-to-end throughput and can be tuned to resolve low-throughput issues.

 

1. User Plane Function (UPF, SGW-U, PGW-U) Parameters

  • Session-AMBR (Aggregate Maximum Bit Rate)

    • Controls max bandwidth per PDU session (per UE or per slice).

    • Range/Example: Should match or exceed radio-side peak (e.g., set 200 Mbps+ for eMBB UE).

    • Tune: Increase if sessions are capped below radio capability.​

  • Per-Flow QoS Policy Parameters (5QI, GBR, MBR, ARP)

    • 5QI value chosen, GBR/MBR values set per QFI (QoS Flow Identifier).

    • Example: Set appropriate MBR/GBR for haptic/streaming flows, increase if constrained.

  • GTP-U Throughput Limits

    • User tunnel capacity on UPF/S-GW-U network interfaces (check interface and vSwitch limits).

    • Action: Increase GTP-U buffer or change switch profile to “high throughput.”

  • Buffer Sizes (RX/TX Buffers)

    • Core-side IP/TCP buffer, GTP-U buffer, virtual switch (e.g., OVS DPDK) queue length.

    • Recommended: Tune to avoid drops/overflow under heavy loads.​

2. TCP/IP and Transport Stack Parameters

  • TCP Congestion Control Algorithm

    • Use BBR instead of CUBIC for high-latency, wireless backhaul scenarios.

    • Change example (Linux CLI): sysctl -w net.ipv4.tcp_congestion_control=bbr

  • TCP Window Size/Scaling

    • Adjust rmem and wmem (socket buffer) parameters in Linux/UPF:

      • /proc/sys/net/ipv4/tcp_rmem

      • /proc/sys/net/ipv4/tcp_wmem

      • Example: Increase max above default for high-throughput PDU sessions (e.g., to 4MB+).​

3. Core Network Slice Configuration

  • Slice AMBR/Throughput Policy

    • Set proper max slice throughput (Slice-AMBR) for shared slice users.

  • QoS Enforcement Policy

    • Make sure the enforcement function doesn’t cap the flow below radio-side limits.

4. DNS/MTU/Fragmentation Handling

  • MTU (Maximum Transmission Unit)

    • Ensure the GTP-U path supports a large enough MTU to avoid fragmentation (1500 bytes standard; up to 9000-byte jumbo frames where supported).

    • Validate that all devices (vSwitch, N6, routers) align on MTU settings to avoid fragmentation, which reduces throughput.

5. Hardware and Virtualization Performance

  • CPU Pinning/NIC Affinity (for virtual UPF/SGWs)

    • Pin user-plane threads to dedicated cores and enable NIC acceleration or DPDK offloading.

  • SR-IOV/NVMe Acceleration

    • For high-throughput UEs, enable hardware offload features and fast path for TCP/IP.

 

Example: Diagnosing and Resolving Low Core Throughput

  • Case: UE observed 40 Mbps max DL even with excellent radio, 100+ Mbps radio.

  • Step 1: Check Session-AMBR and Slice-AMBR in SMF/PCF—was set at 50 Mbps due to default profile—adjusted to 200 Mbps.

  • Step 2: Check buffer/queue in UPF—found default RX buffer size 128k; increased to 1MB.

  • Step 3: Switch TCP stack on UPF from CUBIC to BBR, increased socket buffer (tcp_wmem) from 256k to 2MB.

  • Step 4: Check core transport MTU: it had been reduced to 1400, causing fragmentation and drops; fixed to 1500 along the entire core path.

  • Result: Throughput at UE instantly increased to radio-side levels.

  • Session-AMBR (UE max DL/UL rate): >= cell maximum, e.g., 200–500 Mbps

  • Slice-AMBR (slice max throughput): >= aggregate peak, e.g., 1 Gbps

  • GTP-U/UPF buffer (tunnel buffering): increase for eMBB/large-file flows

  • TCP congestion control (avoid loss-based stalls): use BBR for mobile environments

  • TCP window sizes (prevent protocol-level throttling): 1–4 MB, depending on RTT and flow

  • MTU (minimize fragmentation overhead): 1500–9000 bytes, consistent across all path elements

  • Policy controls (avoid unintended caps): audit and raise 5QI/GBR/MBR values as needed
