Genos v1.0 · IEEE AIIoT 2026 · Research Whitepaper
Existing security solutions are expensive and often rely on kernel-level agents, which by design expand the attack surface and consume significant resources, making them impractical for low-overhead or constrained environments. This paper proposes and validates a novel open-source, kernel-driverless (Ring 3) intrusion detection and mitigation agent designed as a low-overhead alternative. We integrated a RoBERTa-base classifier fine-tuned with LoRA (Low-Rank Adaptation), which achieved 99.75% accuracy in real-world tests. The system analyzes selected data gathered from Windows Event Logs (WEL), on which the classifier performs 4-class syntactic analysis to distinguish malicious from benign commands. We demonstrate that it detects and mitigates three high-priority MITRE ATT&CK techniques, T1003.002 (OS Credential Dumping), T1562 (Impair Defenses), and T1134 (Access Token Manipulation), and separates them from benign commands with near-perfect fidelity. This work establishes an effective, resource-efficient security monitoring approach that matches the detection rigor of kernel-level solutions for the techniques evaluated while keeping system overhead and deployment complexity low.
Keywords: Kernel-driverless agent, Windows Event Logs, MITRE ATT&CK, intrusion detection, RoBERTa, LoRA, Syntactic Analysis, Adversarial Training.
This paper addresses the challenge of building an effective and efficient threat detection and mitigation solution without compromising either system stability or security. Existing solutions in this space are known as endpoint detection and response (EDR) solutions.
Three leading EDR solutions in terms of both effectiveness and market share are CrowdStrike's "Falcon Sensor", IBM's "QRadar EDR", and SentinelOne's "Singularity Platform". These solutions are expensive and rely on agents that reside within the kernel. They run in the kernel in order to monitor key security events such as process creation, process mutation, and raw network traffic.
Because the kernel has the highest level of access on a computer, these EDR solutions run there to obtain the most complete telemetry and therefore the highest level of detection possible. This placement carries two risks. Since drivers operate within the kernel, a bug within a driver can cause a system-wide failure. A kernel driver also expands the host attack surface, because any flaw in it is exposed at the highest privilege level of the system. Our agent mitigates these risks by requiring only Administrator privileges on a Windows computer.
In July 2024, a faulty update to CrowdStrike's Falcon sensor triggered a bug in its Windows kernel driver. The bug was a logic flaw caused by a trivial programming error, and it resulted in the Blue Screen of Death on approximately 8.5 million computers worldwide.
In summary, existing kernel-mode solutions provide the necessary fidelity, but at the unacceptable cost of reduced stability and an expanded TCB (Trusted Computing Base). This work addresses this critical gap by proposing and validating a novel, kernel-driverless agent that achieves high detection and automated mitigation efficacy through advanced WEL stream analysis, while maintaining a highly constrained attack surface and a low performance overhead on the host computer. The main verifiable contributions of this work are:
Given the risks associated with an expanded Trusted Computing Base and performance overhead, we built a three-part security solution: the 'Next Generation End-point Detection Agent', a telemetry server, and a RoBERTa text classifier. The agent is implemented as a Python service operating entirely in user space on the Windows endpoint, requiring only administrative privileges. The agent sends a command to a Flask server that has loaded the trained model; the model assigns a label and a confidence score to the command and pushes this data to the database; the server then renders that data in a dashboard that details each attack.
The agent runs as a Python process in user space (Ring 3) with elevated privileges. This design inherently avoids the risks of kernel-mode drivers. The model assigns each command one of four labels (one of three malicious techniques, or Benign) together with a confidence score. The server reads detection and mitigation data sent from the agent via MongoDB, then renders it in the telemetry dashboard.
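As a rough illustration of how a label and confidence score can be derived from a 4-class classifier, the sketch below converts four raw scores (logits) into a label and a probability. It assumes the confidence score is the softmax probability of the top class, which is a common convention but is not confirmed by the paper; the label order in `LABELS` and the function name are hypothetical.

```python
import math

# Hypothetical label order; the real index-to-label mapping is not given in the paper.
LABELS = ["Benign", "T1003.002", "T1562", "T1134"]


def label_and_confidence(logits):
    """Convert four raw classifier scores into (label, confidence).

    Assumption: confidence is the softmax probability of the
    highest-scoring class.
    """
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per class")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


if __name__ == "__main__":
    # Illustrative logits only, not output from the trained model.
    print(label_and_confidence([0.1, 6.0, 0.3, -1.2]))
```

A single dominant logit yields a confidence close to 1, while four equal logits yield 0.25, the floor for a 4-class model.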
The agent watches for events generated after program initialization and sends command-line data from specific events to the model for inference. The agent subscribes to the Windows Event Log on startup and watches for the following event IDs:
| Event ID | Description |
|---|---|
| 4688 | A new process was created |
| 4950 | A firewall setting has changed |
| 4946 | A firewall rule was added |
| 4947 | A firewall rule was modified |
| 4948 | A firewall rule was deleted |
If an event with ID 4688 is caught, the agent checks whether the command line for that event contains any indicators of a registry hive being copied. Events 4950, 4946, 4947, and 4948 are parsed by default without a filter due to the nature of firewall attacks. The agent sends the command-line data to the model server, which returns a label and a confidence score. The label is one of four classes: one of the three MITRE ATT&CK technique IDs, or Benign.
| Attack | MITRE ATT&CK Technique ID |
|---|---|
| Credential Dumping | T1003.002 |
| Access Token Manipulation | T1134 |
| Impair Defenses | T1562 |
| Benign | N/A |
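The event filtering described above can be sketched as a small routing function: firewall events always go to the model, while process-creation events go only when the command line looks like a registry hive copy. This is a minimal sketch under stated assumptions. The function name and the regular expression are hypothetical, since the paper does not list the agent's actual indicators.

```python
import re

PROCESS_CREATION = 4688
FIREWALL_EVENTS = {4950, 4946, 4947, 4948}

# Hypothetical indicator of a registry hive copy; the agent's real
# indicators are not listed in the paper.
HIVE_COPY_PATTERN = re.compile(
    r"\breg(\.exe)?\s+(save|export)\b.*\b(sam|system|security)\b",
    re.IGNORECASE,
)


def should_send_for_inference(event_id, command_line):
    """Return True if the event's command line should go to the model."""
    if event_id in FIREWALL_EVENTS:
        # Firewall events are forwarded without any filter.
        return True
    if event_id == PROCESS_CREATION:
        # Process creation is forwarded only on hive-copy indicators.
        return bool(HIVE_COPY_PATTERN.search(command_line))
    # Any other event ID is not watched by the agent.
    return False


if __name__ == "__main__":
    print(should_send_for_inference(4688, r"reg save HKLM\SAM C:\temp\sam.save"))
    print(should_send_for_inference(4688, "notepad.exe readme.txt"))
```

Keeping the pre-filter cheap means the model server is only called for the small fraction of process-creation events that could plausibly be hive copies.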
The classifier is the RoBERTa model developed by the University of Washington and Facebook AI, built on BERT (Bidirectional Encoder Representations from Transformers). RoBERTa builds on the performance of BERT by training longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to training data. This enhanced version proved useful as the commands being parsed can be lengthy, and the model must be ready to classify long commands.
The methodology focuses on confirming the system's core functional reliability: its ability to classify the four classes in Table II with a high degree of accuracy.
The data generation protocol was designed to address the inherent syntactic ambiguity of Windows command-line analysis by producing a highly balanced and structurally isolated training corpus.
Corpus Synthesis: The final dataset comprises 200,000 synthesized command-line entries, partitioned into training (80%), validation (10%), and test (10%) subsets.
Structural Isolation Ratios: The training data enforces a stringent density ratio to reinforce critical decision boundaries: 70% Malicious TTPs / 20% Benign Hard Negatives / 10% Benign Pure Noise.
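The partition sizes implied by these percentages can be worked out directly. The sketch below assumes the 70/20/10 density ratio applies to the training split only, which is our reading of the text; the dictionary keys and the `allocate` helper are illustrative names.

```python
TOTAL_ENTRIES = 200_000

# Percentages taken from the text above.
SPLITS = {"train": 80, "validation": 10, "test": 10}
TRAIN_MIX = {
    "malicious_ttp": 70,
    "benign_hard_negative": 20,
    "benign_pure_noise": 10,
}


def allocate(total, percentages):
    """Split `total` items according to integer percentages summing to 100."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in percentages.items()}


split_sizes = allocate(TOTAL_ENTRIES, SPLITS)
# Assumption: the density ratio is applied to the training split only.
train_mix_sizes = allocate(split_sizes["train"], TRAIN_MIX)

if __name__ == "__main__":
    print(split_sizes)      # train 160,000 / validation 20,000 / test 20,000
    print(train_mix_sizes)  # 112,000 / 32,000 / 16,000
```

Under that assumption, the 160,000 training entries contain 112,000 malicious commands, 32,000 hard negatives, and 16,000 noise commands.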
Toxin Removal Policy: To specifically eliminate the Catastrophic Forgetting failure mode observed in preliminary models (where the model confuses safe and malicious registry commands), a Toxin Removal policy was implemented. All ambiguous reg save/export structures were surgically eliminated from the Hard Negative class and replaced with placeholder commands (e.g., backup-utility) or safe queries (get-registry-value), ensuring that the reg save/export syntax is exclusively associated with T1003.002 payload delivery.
Preprocessing: All raw command input was subjected to a mandatory normalization step prior to tokenization: redundant whitespace was collapsed to a single space, and all characters were converted to lowercase, matching the agent's pre-inference preparation.
By default, all commands sent to the model for inference are normalized by converting every character to lowercase. This ensures that an attacker cannot evade classification simply by changing the case of characters within a command.
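The normalization step described above (collapse whitespace, then lowercase) fits in one line of Python. This is a minimal sketch rather than the agent's actual code, and the function name is hypothetical.

```python
def normalize_command(command):
    """Collapse runs of whitespace to one space and lowercase the result.

    str.split() with no arguments splits on any whitespace (spaces, tabs,
    newlines) and drops leading and trailing whitespace as a side effect.
    """
    return " ".join(command.split()).lower()


if __name__ == "__main__":
    print(normalize_command("  REG   SaVe\tHKLM\\SAM  out.hiv "))
```

Applying the same function at training time and at inference time keeps the tokenizer input consistent between the two.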
| Parameter | Value | Rationale |
|---|---|---|
| Base Model | RoBERTa-base | Optimal balance of robustness and efficiency for syntactic analysis |
| PEFT Method | LoRA (Low-Rank Adaptation) | Minimizes computational overhead and memory footprint required for training |
| LoRA Rank | 16 | Determined empirically to provide sufficient capacity for TTP semantic learning |
| Target Modules | query, value, key, dense | Targets both the attention mechanism and the final classification layers |
| Batch Size (Train/Eval) | 32 / 32 | Optimized for GPU memory utilization while allowing effective gradient accumulation |
| Learning Rate | 2.0 × 10⁻⁴ | Standard optimal rate for fine-tuning dense transformer models |
| Precision | bf16 (BFloat16) | Reduces memory usage and increases throughput on compatible hardware |
| Stopping Mechanism | Early stopping callback (Patience: 5) | Prevents overfitting and ensures the model stabilizes at its best F1 score |
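To make the "minimizes computational overhead" rationale for LoRA concrete, the sketch below counts trainable parameters for a single adapted weight matrix. LoRA freezes the original matrix and trains two low-rank factors of shapes (d_out × r) and (r × d_in). The example uses RoBERTa-base's hidden size of 768 and the rank of 16 from the table, and it ignores bias terms. The function names are illustrative.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds for one d_out x d_in weight matrix."""
    return rank * (d_in + d_out)


def full_param_count(d_in, d_out):
    """Parameters in the frozen weight matrix itself (bias ignored)."""
    return d_in * d_out


HIDDEN_SIZE = 768  # RoBERTa-base hidden dimension
RANK = 16          # LoRA rank from the table above

lora_params = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
full_params = full_param_count(HIDDEN_SIZE, HIDDEN_SIZE)

if __name__ == "__main__":
    print(lora_params, full_params, f"{lora_params / full_params:.2%}")
```

For one 768 × 768 attention projection this gives 24,576 trainable parameters against 589,824 frozen ones, roughly 4% of the matrix.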
There are two components to assess how well the model performs: (1) the post-training evaluation, which measures precision, recall, F1-score, and support; and (2) real-world tests in the form of randomly generated commands that resemble the four labels.
The model was evaluated against a held-out test set of 20,000 samples (10% of the total corpus). The system achieved an overall accuracy of 100% on post-training evaluation, with the agent successfully identifying all three malicious TTPs and distinguishing them from benign administrative activity.
| Classification Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| T1003.002 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| T1562 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| T1134 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
The resulting confusion matrix is perfectly diagonal, indicating that the structural hardening and toxin removal policies eliminated the overlap between high-privilege administrative tasks and malicious credential access payloads.
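For reference, the per-class metrics reported in the evaluation can be derived from a confusion matrix as sketched below, where support is the number of true samples in each class. The counts in the example are purely illustrative and are not the paper's data; the function name is hypothetical.

```python
def per_class_metrics(confusion, labels):
    """Compute precision, recall, F1, and support per class.

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)  # column sum
        actually_k = sum(confusion[k])                     # row sum = support
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actually_k,
        }
    return metrics


if __name__ == "__main__":
    labels = ["Benign", "T1003.002", "T1562", "T1134"]
    # Illustrative counts only: a perfectly diagonal matrix.
    diagonal = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
    for name, values in per_class_metrics(diagonal, labels).items():
        print(name, values)
```

A perfectly diagonal matrix yields precision, recall, and F1 of 1.0 for every class, with support equal to each class's sample count.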
To verify the model's reliability in a production-simulated environment, a rigorous adversarial stress test was performed. This involved 50 iterations of the "V7 Scenario Suite," comprising 1,200 individual test cases designed to represent critical boundary conditions, built around dynamic fuzzing to generate real-world edge cases and accurately check for possible overfitting.
A successful result is defined as a correct label with a confidence score of at least 80%.
| Result | Number of Examples | Percentage (out of 1,200) |
|---|---|---|
| Success | 1,194 | 99.5% |
| Failure | 3 | 0.25% |
| Low Confidence | 3 | 0.25% |
| Scenario | Predicted Label | Confidence |
|---|---|---|
| Exporting SYSTEM Hive (Malicious) | Benign | 0.98 |
| PowerShell Wrapped Security Dump | Benign | 0.51 |
| Searching Logs (Safe Pattern) | T1562 | 0.65 |
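The success criterion above (a correct label with a confidence score of at least 80%) can be expressed as a small scoring function. The sketch assumes that "Low Confidence" means a correct label below the threshold and that "Failure" means a wrong label at any confidence; the paper does not define those two categories explicitly, and the function names are hypothetical.

```python
CONFIDENCE_THRESHOLD = 0.80


def classify_outcome(expected, predicted, confidence,
                     threshold=CONFIDENCE_THRESHOLD):
    """Bucket one stress-test case as Success, Low Confidence, or Failure."""
    if predicted != expected:
        # Assumption: any wrong label is a failure, whatever the confidence.
        return "Failure"
    if confidence < threshold:
        # Assumption: correct label, but below the 80% threshold.
        return "Low Confidence"
    return "Success"


def percentages(counts):
    """Convert outcome counts into percentages of the total."""
    total = sum(counts.values())
    return {name: 100 * count / total for name, count in counts.items()}


if __name__ == "__main__":
    reported = {"Success": 1194, "Failure": 3, "Low Confidence": 3}
    print(percentages(reported))
```

Applied to the reported counts, the percentages come out to 99.5%, 0.25%, and 0.25% of the 1,200 cases.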
The experimental data confirms that high-fidelity security classification is primarily a data engineering challenge. While preliminary model architectures often struggle with "Confident Failures" or unoptimized latency, the fine-tuned RoBERTa-base model achieved 100% accuracy on evaluation and a 99.5% success rate under rigorous adversarial stress.
A critical takeaway was the necessity of Structural Hardening and Toxin Removal. By surgically isolating malicious syntax from benign administrative structures in the training corpus, the model moved from keyword-based pattern matching to a deeper syntactic understanding of command-line intent. This allowed the system to resolve critical boundary cases — such as distinguishing legitimate registry queries from malicious hive dumps — with confidence scores exceeding 0.90 in the majority of production-simulated scenarios.
The performance profiling validates the system's suitability for resource-constrained environments. With a median inference latency of 5 ms and a peak memory footprint of 255 MB, the solution offers a viable alternative to commercial kernel-level agents. This approach significantly reduces the host attack surface by operating entirely in the application layer while maintaining detection rigor that matches or exceeds traditional driver-based tools.
Future development will focus on expanding the classification ontology to include broader persistence and lateral movement TTPs. Additionally, recursive training loops will be implemented to resolve the remaining 0.25% "Low Confidence" edge cases, and the integration of automated remediation orchestration directly into the host agent's response logic will be explored. These directions are addressed in Genos v1.1 and v1.2.
Open-Source Next Gen Endpoint Detection & Response
Ahmed Khan · IEEE World AI IoT Congress (AIIoT 2026) · Seattle, USA
PDF and BibTeX will be linked here upon publication in the IEEE digital library. The paper presents the full architecture, training methodology, evaluation results, adversarial stress test data, and dashboard implementations.