Genos v1.1 · IEEE AIIoT 2026 · Research Whitepaper
The proliferation of Living-off-the-Land (LotL) techniques presents a significant classification challenge for traditional security controls due to their reliance on dual-use system utilities and syntactically complex command-line telemetry. Models designed for natural language processing (NLP) are optimized for linguistic patterns and often do not capture the execution semantics and structural dependencies inherent in command-line inputs. In this paper, we propose a two-stage cascaded classification framework based on a pretrained transformer model (CodeBERT) to map command-line sequences to 141 MITRE ATT&CK techniques. The first stage performs binary classification to distinguish potentially malicious commands from benign administrative activity, while the second stage applies fine-grained multi-class classification for behavioral attribution. This cascaded design reduces unnecessary multi-class inference on benign-dominated traffic while improving focus on security-relevant inputs. Experimental results show that the second stage achieves 95.53% Top-1 accuracy, 97.94% Top-3 accuracy, and a macro F1-score of 95.46%, while the first stage achieves an Area Under the Curve (AUC) of 0.9999. End-to-end evaluation further shows that pipeline latency decreases as traffic becomes increasingly benign, reaching 6.57 ms per sequence under a production-like 99% benign distribution. These results demonstrate that transformer-based representations of command-line syntax can support high-fidelity, real-time adversarial behavior classification in endpoint environments.
Keywords: Living-off-the-Land (LotL), MITRE ATT&CK, command-line analysis, malware classification, transformer models, CodeBERT, hierarchical classification, cybersecurity machine learning, behavioral analysis.
The landscape of adversarial operations has shifted toward Living-off-the-Land (LotL) techniques, where threat actors leverage native, dual-use system utilities such as PowerShell, Windows Management Instrumentation (WMI), and bash to execute malicious objectives. Because these utilities are commonly used in legitimate administrative and DevOps workflows, distinguishing malicious intent from benign activity presents a significant challenge for modern Endpoint Detection and Response (EDR) systems. As a result, the problem extends beyond binary detection to accurately disambiguating intent within syntactically complex command-line telemetry.
Attackers further exploit this ambiguity through obfuscation techniques such as nested encoding, string manipulation, and dynamic command generation. While recent work has explored the application of NLP and deep learning models to command-line analysis, models optimized for linguistic data (e.g., RoBERTa) may not fully capture the execution semantics of command sequences. Unlike natural language, command-line inputs are governed by structured dependencies and operator relationships that resemble programmatic representations such as Abstract Syntax Trees (ASTs). In addition, single-stage classification approaches often struggle in environments with significant class imbalance, where benign activity dominates and false positive rates must be tightly controlled.
Traditional approaches to command-line analysis rely on statistical feature extraction and classical machine learning. Common techniques include TF-IDF representations combined with classifiers such as Random Forests and Support Vector Machines. These methods are effective for detecting known patterns but depend heavily on surface-level token frequency and n-gram statistics, and often struggle to capture structural dependencies and execution semantics — particularly when adversaries employ obfuscation techniques such as encoding, token splitting, and dynamic command construction.
Pre-trained models such as CodeBERT extend transformer-based approaches by incorporating training on both natural language and source code, enabling improved representation of structured inputs. In parallel, several studies have focused on mapping system activity to standardized threat frameworks such as the MITRE ATT&CK matrix for behavioral analysis and incident response. Despite these advances, two key limitations remain: existing models either rely on surface-level statistical features or apply transformer architectures designed for natural language without fully addressing command-line structural characteristics; and most approaches employ single-stage classification pipelines, leading to unnecessary computational overhead when processing benign-dominated telemetry.
To address these challenges, this paper proposes a two-stage cascaded classification framework based on CodeBERT. The first stage acts as a binary filter to distinguish potentially malicious commands from benign activity, while the second stage performs fine-grained classification across 141 MITRE ATT&CK techniques. The primary contributions are:
Command-line inputs often employ obfuscation techniques such as nested encodings (e.g., Base64), string concatenation, and dynamically generated substrings. To improve model interpretability, a deterministic preprocessing module is applied prior to inference. This module recursively decodes encoded substrings and simplifies common obfuscation patterns up to a fixed depth. The goal is to expose the underlying command structure and reduce token-level noise before transformer tokenization. The maximum recursion depth is empirically bounded to limit preprocessing overhead.
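The depth-bounded recursion described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `peel_layers`, `MAX_DEPTH`, and `B64_RUN`, the depth value of 3, and the 16-character threshold are all assumptions, and only nested Base64 of printable ASCII text is handled.

```python
import base64
import binascii
import re
from typing import Optional, Tuple

MAX_DEPTH = 3  # hypothetical bound; the paper does not state the value it uses

# Runs of 16+ Base64 alphabet characters with optional padding (threshold is an assumption).
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def _try_decode(candidate: str) -> Optional[str]:
    """Return the decoded text if candidate is valid Base64 of printable ASCII, else None."""
    try:
        text = base64.b64decode(candidate, validate=True).decode("ascii")
    except (binascii.Error, UnicodeDecodeError):
        return None
    return text if text.isprintable() else None


def peel_layers(command: str, max_depth: int = MAX_DEPTH) -> Tuple[str, int]:
    """Rewrite decodable Base64 runs until nothing changes or max_depth is reached."""
    depth = 0
    while depth < max_depth:
        rewritten = B64_RUN.sub(
            lambda m: _try_decode(m.group(0)) or m.group(0), command
        )
        if rewritten == command:  # fixed point: no further layers to remove
            break
        command = rewritten
        depth += 1
    return command, depth
```

The function returns both the normalized string and the number of layers removed, so a caller could log how deeply an input was nested. Stopping at a fixed point keeps the cost for unobfuscated commands to a single regex pass.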
The first stage performs binary classification to distinguish potentially malicious commands from benign administrative activity. Given an input sequence x, the model outputs a probability distribution P(y | x), where y ∈ {0, 1} corresponds to benign and malicious classes respectively. The predicted class is obtained via argmax over the Tier 1 output distribution:
ŷ = argmax_y P(y | x), y ∈ {0 = benign, 1 = malicious}
Inputs classified as benign terminate at Tier 1. Inputs classified as malicious are forwarded to Tier 2. This stage serves as a filtering mechanism to reduce the number of samples requiring computationally expensive multi-class inference.
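A minimal sketch of this routing logic is shown below. The two classifiers are passed in as plain callables that return probability lists; `route`, `tier1`, and `tier2` are hypothetical names standing in for the actual models, which are not reproduced here.

```python
from typing import Callable, Optional, Sequence, Tuple

BENIGN, MALICIOUS = 0, 1

Classifier = Callable[[str], Sequence[float]]


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])


def route(command: str, tier1: Classifier, tier2: Classifier) -> Tuple[int, Optional[int]]:
    """Run Tier 1; call Tier 2 only when Tier 1 predicts the malicious class."""
    label = argmax(tier1(command))
    if label == BENIGN:
        return BENIGN, None  # benign inputs terminate here
    return MALICIOUS, argmax(tier2(command))
```

Because `tier2` is invoked only inside the malicious branch, benign inputs never pay the multi-class inference cost.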
The second stage performs fine-grained classification over malicious inputs. Conditioned on activation by Tier 1, the model maps the input sequence to a probability distribution over N = 141 MITRE ATT&CK techniques using a softmax output layer:
P(y_c | x), c ∈ {1, ..., 141}
ŷ = argmax_c P(y_c | x)
Each class corresponds to a specific MITRE ATT&CK technique. This stage enables detailed behavioral attribution by associating command-line inputs with standardized adversarial techniques. Top-k predictions are also considered to account for semantic overlap between related techniques, which is reflected in the Top-3 accuracy metric reported in the evaluation.
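As an illustration of the softmax output and Top-k selection described above, the following sketch uses only the standard library. The function names and the toy four-class logits are assumptions for demonstration; the actual model emits 141 logits.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs: Sequence[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return the k most probable (class_index, probability) pairs, best first."""
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    return [(c, probs[c]) for c in ranked[:k]]


# Toy example with 4 classes instead of 141.
probs = softmax([2.0, 0.5, 3.0, -1.0])
candidates = top_k(probs, k=3)
```

The first entry of `candidates` corresponds to the argmax prediction; the remaining entries are the runner-up techniques that Top-3 accuracy gives credit for.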
Transformer-based models such as RoBERTa are optimized for natural language distributions and may exhibit limitations when applied to syntactically structured command sequences. These inputs are governed by execution semantics rather than linguistic structure, so the natural-language data on which such models are trained is distributionally mismatched with command-line inputs.
To address this, both tiers adopt microsoft/codebert-base (125M parameters), pre-trained on both natural language and source code across six programming languages. This enables the model to capture structural dependencies in command-line execution — including relationships between flags, pipelines, operators, and encoded payloads — while maintaining low inference latency.
The input sequence is tokenized using Byte-Pair Encoding (BPE) via the RoBERTa tokenizer associated with CodeBERT, with truncation enabled and padding applied to a fixed maximum sequence length of 256 tokens. This tokenization preserves subword structure including executable name fragments, flag prefixes, and path separators that remain distinctive across obfuscation variants.
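The fixed-length truncation and padding step can be sketched without the tokenizer itself. The function below takes already-tokenized subword ids and is a hypothetical helper; the special-token ids follow the usual RoBERTa vocabulary convention (`<s>` = 0, `<pad>` = 1, `</s>` = 2) and should be treated as an assumption to verify against the actual tokenizer.

```python
from typing import List, Sequence, Tuple

MAX_LEN = 256  # maximum sequence length stated in the paper
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2  # assumed RoBERTa-style special-token ids


def pad_or_truncate(subword_ids: Sequence[int], max_len: int = MAX_LEN) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_len long."""
    body = list(subword_ids)[: max_len - 2]  # leave room for <s> and </s>
    ids = [BOS_ID] + body + [EOS_ID]
    mask = [1] * len(ids)                    # 1 marks real tokens
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding
```

The attention mask lets the encoder ignore padded positions, so short commands are not affected by the fixed length.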
Transformer self-attention captures structural dependencies that span non-adjacent tokens and reflect execution order rather than linguistic proximity — including relationships between flags, pipelines, and payloads that are central to command-line behavioral semantics.
The two tiers are trained as independent but sequentially coupled models. Tier 1 is optimized using binary cross-entropy; Tier 2 is optimized using categorical cross-entropy over 141 classes. Tier 2 is conditionally executed only when Tier 1 predicts the malicious class (ŷ_Tier1 = 1).
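For reference, the two loss functions named above can be written per sample as follows. This is a plain-Python sketch of the standard definitions, not the training code; the function names and the clipping constant `EPS` are assumptions.

```python
import math
from typing import Sequence

EPS = 1e-12  # clipping constant to avoid log(0); value is an assumption


def binary_cross_entropy(p_malicious: float, y: int) -> float:
    """Tier 1 loss for one sample: y is 1 for malicious, 0 for benign."""
    p = min(max(p_malicious, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def categorical_cross_entropy(probs: Sequence[float], true_class: int) -> float:
    """Tier 2 loss for one sample: negative log-probability of the true technique."""
    return -math.log(max(probs[true_class], EPS))


# A uniform guess over 141 techniques costs ln(141), roughly 4.95 nats.
uniform_loss = categorical_cross_entropy([1.0 / 141] * 141, true_class=7)
```

The uniform-guess value gives a rough ceiling against which Tier 2 training loss can be compared early in training.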
The cascaded design reduces expected inference cost. In benign-dominated environments, the probability that Tier 1 predicts the malicious class is low, so the expected end-to-end cost approaches that of Tier 1 alone:
E[L] = L_Tier1 + P(ŷ_Tier1 = 1) · L_Tier2
where L_Tier1 is the Tier 1 latency, L_Tier2 is the Tier 2 latency, and P(ŷ_Tier1 = 1) is the probability that Tier 1 predicts the malicious class.
As the proportion of benign inputs increases, P(ŷ_Tier1 = 1) decreases, and the expected end-to-end cost converges toward Tier 1 latency alone. Empirical latency measurements confirming this behavior are reported in Section 4.
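The expected-cost expression can be evaluated directly. In the sketch below the latency values of 5.0 ms and 10.0 ms are purely hypothetical, chosen to make the arithmetic easy to follow; they are not measurements from the paper.

```python
def expected_latency(l_tier1: float, l_tier2: float, p_malicious: float) -> float:
    """E[L] = L_Tier1 + P(Tier 1 predicts malicious) * L_Tier2."""
    if not 0.0 <= p_malicious <= 1.0:
        raise ValueError("p_malicious must be a probability")
    return l_tier1 + p_malicious * l_tier2


# Hypothetical latencies in milliseconds, for illustration only.
balanced = expected_latency(5.0, 10.0, p_malicious=0.50)      # 5.0 + 0.50 * 10.0
mostly_benign = expected_latency(5.0, 10.0, p_malicious=0.01)  # 5.0 + 0.01 * 10.0
```

With these illustrative numbers the cost falls from 10.0 ms to about 5.1 ms as the malicious fraction drops from 50% to 1%, which is the convergence toward Tier 1 latency that the formula predicts.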
The dataset consists of command-line sequences labeled as either benign or malicious, with malicious samples further mapped to one of 141 MITRE ATT&CK techniques. To support reproducible evaluation, the dataset is partitioned using an 80/10/10 split: 80% training, 10% validation, 10% test.
The validation split is used for model selection and hyperparameter tuning. All final metrics reported in Section 4 are computed exclusively on the held-out test set. Benign and malicious samples are separated prior to Tier 1 evaluation; Tier 2 is evaluated only on malicious samples with valid MITRE ATT&CK labels.
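An 80/10/10 partition of this kind can be sketched as a seeded shuffle followed by slicing. The paper does not say whether the split is stratified by class, so the plain random shuffle, the function name, and the seed below are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed, then slice into train / validation / test."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the held-out test set
    return train, val, test
```

Integer arithmetic is used for the slice boundaries so that rounding never drops or duplicates a sample.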
The dataset is inherently imbalanced toward benign samples, reflecting real-world endpoint telemetry in which benign activity significantly outnumbers malicious events. During end-to-end pipeline evaluation, it is analyzed under multiple benign-to-malicious traffic ratios to approximate both balanced and production-like deployment conditions.
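One way to construct an evaluation stream at a chosen benign ratio is sketched below. How the paper actually builds these mixes is not stated; sampling with replacement, the function name `mix_at_ratio`, and the seed are all assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_at_ratio(benign: Sequence[T], malicious: Sequence[T],
                 benign_ratio: float, total: int, seed: int = 0) -> List[T]:
    """Draw `total` samples so that about benign_ratio of them are benign."""
    rng = random.Random(seed)
    n_benign = round(total * benign_ratio)
    n_malicious = total - n_benign
    # Sampling with replacement (an assumption) lets small pools fill large mixes.
    mix = rng.choices(benign, k=n_benign) + rng.choices(malicious, k=n_malicious)
    rng.shuffle(mix)  # interleave the two classes
    return mix
```

Calling this with ratios such as 0.50, 0.90, and 0.99 would yield streams that approximate balanced and production-like conditions.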
Both Tier 1 and Tier 2 use the microsoft/codebert-base encoder (~125M parameters), pre-trained on both source code and natural language. The maximum token length is fixed at 256 tokens per sequence. Tokenization uses the RoBERTa tokenizer associated with CodeBERT, with truncation enabled and padding to a fixed maximum sequence length.
Tier 1 is trained as a binary classifier using binary cross-entropy loss. Tier 2 is trained as a 141-class classifier using categorical cross-entropy loss. The inference environment uses an NVIDIA RTX 4000 GPU, and all latency measurements are reported under single-sequence inference conditions.
Prior to inference, each command is processed through a deterministic de-obfuscation pipeline. The preprocessing stage includes:
[char] constructions into plain-text strings.
This preprocessing stage is intended to expose latent command structure before tokenization, thereby reducing token-level noise for downstream transformer inference.
Example 1 — Base64 encoded PowerShell
Raw: powershell -enc VwByAGkAdABlAC0ASABvAHMAdAAgACIASABlAGwAbABvACIA
Output: write-host "hello"
Example 2 — Character-code construction
Raw: ([char]119)+hoami
Output: whoami
Example 3 — .NET Base64 decode wrapper
Raw: [System.Text.Encoding]::UTF8.GetString(
[System.Convert]::FromBase64String('d2hvYW1p'))
Output: whoami
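The three examples above can be reproduced with a small set of regular-expression rules. This is a sketch under stated assumptions, not the paper's module: the rule set, the names `normalize`, `ENC_FLAG`, `DOTNET_B64`, and `CHAR_EXPR`, and the final lowercasing (inferred from the output of Example 1) are all illustrative, and quoted string literals are not handled.

```python
import base64
import binascii
import re

# Example 1: powershell -enc <Base64 of UTF-16LE text>
ENC_FLAG = re.compile(
    r"^\s*powershell(?:\.exe)?\s+-enc(?:odedcommand)?\s+([A-Za-z0-9+/=]+)\s*$", re.I
)
# Example 3: [System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String('...'))
DOTNET_B64 = re.compile(
    r"\[System\.Text\.Encoding\]::UTF8\.GetString\(\s*"
    r"\[System\.Convert\]::FromBase64String\(\s*'([A-Za-z0-9+/=]+)'\s*\)\s*\)",
    re.I,
)
# Example 2: ([char]NNN) with an optional '+' concatenation on either side
CHAR_EXPR = re.compile(r"(?:\s*\+\s*)?\(\[char\](\d{1,5})\)(?:\s*\+\s*)?", re.I)


def _b64_text(payload: str, encoding: str, fallback: str) -> str:
    """Decode Base64 to text; return the fallback unchanged if decoding fails."""
    try:
        return base64.b64decode(payload, validate=True).decode(encoding)
    except (binascii.Error, UnicodeDecodeError):
        return fallback


def normalize(command: str) -> str:
    """Apply the three rewrite rules once, then lowercase the result."""
    match = ENC_FLAG.match(command)
    if match:
        command = _b64_text(match.group(1), "utf-16-le", command)
    command = DOTNET_B64.sub(lambda m: _b64_text(m.group(1), "utf-8", m.group(0)), command)
    command = CHAR_EXPR.sub(lambda m: chr(int(m.group(1))), command)
    return command.lower()
```

PowerShell's encoded-command flag expects UTF-16LE text, which is why Example 1 uses a different decoding than the UTF-8 wrapper in Example 3. Every rule falls back to the original text on a decoding failure, so malformed input passes through unchanged.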
Models are evaluated using Top-1 accuracy, Top-3 accuracy, macro F1-score, and inference latency. Top-3 accuracy is reported to account for semantic overlap between related MITRE ATT&CK techniques, where multiple techniques may be plausible for a given command sequence. All experiments are conducted on an NVIDIA RTX 4000 GPU; latency measurements correspond to per-sequence inference under single-sample evaluation.
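The accuracy and F1 metrics listed above follow standard definitions, sketched here in plain Python. The function names and the toy three-class data are illustrative assumptions; the macro average is taken over every class that appears in either the labels or the predictions.

```python
from typing import List, Sequence


def top_k_accuracy(y_true: Sequence[int], ranked: Sequence[Sequence[int]], k: int) -> float:
    """Fraction of samples whose true class is among the k highest-ranked predictions."""
    hits = sum(1 for truth, order in zip(y_true, ranked) if truth in order[:k])
    return hits / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1, so rare techniques count as much as common ones."""
    scores: List[float] = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 4 samples, 3 classes; each row of `ranked` is ordered best-first.
y_true = [0, 0, 1, 2]
ranked = [[0, 1, 2], [1, 0, 2], [1, 2, 0], [0, 1, 2]]
y_pred = [order[0] for order in ranked]
```

On the toy data two of four Top-1 predictions are correct, while every true class appears within the top three, which mirrors how Top-3 accuracy can exceed Top-1 when related classes are confused.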
The Tier 1 classifier is evaluated using a Receiver Operating Characteristic (ROC) curve. The model achieves an AUC of 0.9999, indicating near-perfect separability between benign and malicious command sequences. At the operating point:
| Metric | Value |
|---|---|
| AUC | 0.9999 |
| Precision | 100.00% |
| Recall | 99.93% |
| F1-score | 99.96% |
These results demonstrate that the Tier 1 classifier (the Gatekeeper) effectively distinguishes benign administrative activity from adversarial command sequences, even in the presence of syntactic variability and obfuscation. While near-perfect separability is observed, this result may be influenced by dataset characteristics and should be interpreted in the context of the evaluation distribution.
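As a quick consistency check, the reported F1-score can be recomputed from the reported precision and recall as their harmonic mean. The helper names below are illustrative, and the function taking confusion counts is included only to show the definitions; no confusion counts are given in the paper.

```python
from typing import Tuple


def f1_from_rates(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = f1_from_rates(precision, recall) if precision + recall else 0.0
    return precision, recall, f1


# Reported Tier 1 operating point: precision 100.00%, recall 99.93%.
f1 = f1_from_rates(1.0000, 0.9993)
```

The recomputed value rounds to 99.96%, matching the table. A precision of 100.00% with recall below 100% means that, at this operating point, the errors on the evaluated test set are missed malicious commands rather than false alarms.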
For fine-grained classification, performance is evaluated using Top-1 accuracy, Top-3 accuracy, and macro F1-score to account for the substantial number of classes (141 techniques) and potential class imbalance. The proposed model achieves a Top-1 accuracy of 95.53%, a Top-3 accuracy of 97.94%, and a macro F1-score of 95.46%. The high Top-3 accuracy indicates that the model successfully captures semantic proximity between related techniques, allowing correct classifications to appear within a small candidate set. The average Tier 2 inference latency, including preprocessing, is 6.74 ms per sequence.
To evaluate real-world deployment characteristics, the full pipeline — including preprocessing, Tier 1 classification, and conditional Tier 2 inference — is evaluated under varying benign-to-malicious traffic distributions:
| Benign Ratio | E2E Latency (ms) | Tier 2 Trigger Rate | Obf. Flag Rate | De-obf. Change Rate |
|---|---|---|---|---|
| 0.50 | 9.86 | 49.98% | 10.15% | 0.33% |
| 0.90 | 7.20 | 10.00% | 2.13% | 0.03% |
| 0.99 | 6.57 | 1.00% | 0.18% | 0.00% |
Under a balanced 50% benign distribution, the system exhibits a latency of 9.86 ms with a Tier 2 trigger rate of 49.98%, representing a worst-case operational scenario. Under a 90% benign distribution, the Tier 2 trigger rate decreases to 10.00%, reducing latency to 7.20 ms. Under a production-like 99% benign distribution, the trigger rate further decreases to 1.00%, yielding an average latency of 6.57 ms per sequence. These results validate that the cascaded architecture effectively reduces computational overhead as the proportion of benign inputs increases.
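The table can be checked against the linear cost model E[L] = L_Tier1 + P(ŷ_Tier1 = 1) · L_Tier2 by fitting a straight line to the three reported points. The sketch below is a back-of-the-envelope check on only three observations; the fitted intercept and slope are implied estimates, not latencies reported in the paper, and the intercept absorbs preprocessing as well as Tier 1 inference.

```python
from typing import Sequence, Tuple

# Reported Tier 2 trigger rates and end-to-end latencies (ms) from the table above.
trigger_rate = [0.4998, 0.1000, 0.0100]
e2e_latency_ms = [9.86, 7.20, 6.57]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# intercept ~ fixed per-sequence cost; slope ~ added cost when Tier 2 runs
intercept_ms, slope_ms = fit_line(trigger_rate, e2e_latency_ms)
```

The fit gives an intercept of roughly 6.5 ms and a slope of roughly 6.7 ms. The slope is close to the separately reported Tier 2 latency of 6.74 ms, which suggests the measurements are consistent with the additive cost model.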
The proposed architecture is compared against a statistical baseline consisting of TF-IDF feature extraction combined with a Random Forest classifier. The baseline achieves slightly higher Top-1 accuracy (95.89% vs. 95.53%) and macro F1 (95.73% vs. 95.46%), while the CodeBERT-based model achieves higher Top-3 accuracy (97.94% vs. 97.52%), indicating that transformer representations more effectively capture semantic relationships between related MITRE ATT&CK techniques. The baseline exhibits significantly higher latency (24.77 ms vs. 6.74 ms per sequence) due to feature extraction overhead.
| Model | Top-1 Accuracy | Top-3 Accuracy | Macro F1 | Latency (ms) |
|---|---|---|---|---|
| TF-IDF + Random Forest | 95.89% | 97.52% | 95.73% | 24.77 |
| CodeBERT Specialist (Tier 2) | 95.53% | 97.94% | 95.46% | 6.74 |
The preprocessing module detects potentially obfuscated sequences at varying rates depending on traffic distribution. Under balanced traffic, 10.15% of inputs are flagged as obfuscated; under a 99% benign distribution this decreases to 0.18%. Despite these detections, only a small fraction of inputs (≤0.33%) undergo meaningful transformation after recursive de-obfuscation. This indicates that substantial obfuscation requiring normalization is relatively rare in production-like traffic, but the preprocessing module remains important for handling edge cases and deeply nested encodings. Preprocessing overhead remains low compared to overall inference cost.
The results highlight a trade-off between statistical and transformer-based approaches. While TF-IDF-based models provide competitive Top-1 accuracy, they rely on surface-level token statistics and fail to capture structural dependencies in obfuscated command sequences. In contrast, the proposed architecture improves ranking performance (Top-k) and enables structured mapping to MITRE ATT&CK techniques. Several limitations should be noted: the near-perfect Tier 1 AUC may partially reflect dataset characteristics; additional cross-environment testing is needed to assess generalization across organizations, operating systems, and administrative workflows; and future work should address confidence-aware routing and adaptive mechanisms for incorporating analyst feedback.
This paper presented a two-stage transformer-based framework for classifying command-line inputs and mapping malicious sequences to 141 MITRE ATT&CK techniques. By combining a binary Gatekeeper with a fine-grained Specialist model, the proposed architecture separates benign filtering from multi-class behavioral attribution, enabling conditional inference over security-relevant inputs.
Experimental results showed strong Tier 1 discrimination with an AUC of 0.9999, precision of 100.00%, recall of 99.93%, and F1-score of 99.96%, while Tier 2 achieved 95.53% Top-1 accuracy, 97.94% Top-3 accuracy, and a macro F1-score of 95.46%. End-to-end evaluation demonstrated that the cascaded design becomes more efficient as the benign traffic ratio increases, with average pipeline latency decreasing from 9.86 ms under balanced traffic to 6.57 ms under a production-like 99% benign distribution.
These results indicate that code-aware transformer representations combined with cascaded inference provide a practical approach for real-time command-line threat classification in endpoint environments.
Future work will evaluate generalization across cross-environment datasets and unseen command distributions, with particular emphasis on robustness to novel obfuscation strategies and operational drift. Additional directions include confidence-aware routing, calibration of low-confidence predictions, and active learning mechanisms for incorporating analyst feedback into future model updates.
A Two-Stage Transformer-Based Framework for Command-Line Classification and MITRE ATT&CK Technique Mapping
Ahmed Khan · IEEE World AI IoT Congress (AIIoT 2026) · Seattle, USA
PDF and BibTeX will be linked here upon publication in the IEEE digital library. The paper presents the full architecture, training methodology, evaluation datasets, and benchmark results including the ROC curves and per-class F1 breakdowns.