Applied Machine Learning

We deliver full-lifecycle machine learning solutions using TensorFlow, PyTorch, and XGBoost for prediction, classification, and anomaly detection.

Intro

Beyond generative AI, we engineer "classic" machine learning solutions that provide measurable, predictive power. We leverage foundational frameworks like TensorFlow, PyTorch, and Scikit-learn to build and train models that excel at prediction, classification, and anomaly detection. These are solutions for high-stakes business challenges where statistical rigor, interpretability, and verifiable accuracy are paramount, turning raw data into a strategic asset.

The Code 0 Advantage: End-to-End ML Engineering

A highly detailed visualization of a neural network, representing end-to-end ML engineering.

A successful ML model is more than just an algorithm; it's the result of a rigorous engineering process. We manage the entire lifecycle to deliver robust, reliable systems.

  • Problem Framing: We begin by translating a business need into a quantifiable machine learning problem, defining the exact metrics for success before writing a single line of code.
  • Tool Agnosticism: We choose the right tool for the job. While we are experts in deep learning with TensorFlow and PyTorch, we often use Gradient Boosting models (like XGBoost or LightGBM) for tabular data, as they can offer superior performance and interpretability (a brief sketch pairing XGBoost with MLflow tracking follows this list).
  • Feature Engineering at Scale: The most important factor in ML success is the quality of the data. We specialize in advanced feature engineering, using techniques like automated feature synthesis and creating complex embeddings from unstructured data.
  • Production-Ready MLOps: We don't just deliver a model file or a Jupyter notebook. We deploy models as scalable, containerized APIs and establish full MLOps pipelines using tools like MLflow and DVC to monitor for data drift, manage model versions, and automate retraining.
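
To make the gradient-boosting and MLOps points above concrete, here is a minimal, self-contained sketch that trains an XGBoost classifier on synthetic tabular data and records the run with MLflow. The dataset, hyperparameters, and experiment name are illustrative placeholders rather than values from a real engagement, and the exact mlflow.xgboost.log_model signature may vary slightly between MLflow releases.

xgboost_mlflow_sketch.py
# Minimal sketch: gradient boosting on tabular data + MLflow experiment tracking.
# Assumes: pip install xgboost scikit-learn mlflow
import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for structured business data (e.g., churn or fraud features)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

params = {"n_estimators": 300, "max_depth": 5, "learning_rate": 0.05, "subsample": 0.8}

mlflow.set_experiment("tabular-classification-demo")  # placeholder experiment name
with mlflow.start_run():
    model = xgb.XGBClassifier(**params)
    model.fit(X_train, y_train)

    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_params(params)
    mlflow.log_metric("test_auc", auc)
    mlflow.xgboost.log_model(model, "model")  # versioned artifact for later deployment

print(f"Test AUC: {auc:.3f}")

In a full MLOps pipeline, the logged artifact would feed a model registry and a containerized serving endpoint, with drift monitoring deciding when to retrain.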

Technical Deep Dive: Our Model Selection Framework

Choosing the right model architecture is critical for success. Our decision process is guided by the data's structure and the problem's specific requirements.

Model Category | Key Frameworks/Libraries | Primary Use Case | When We Use It
---|---|---|---
Deep Neural Networks (DNNs) | TensorFlow, PyTorch | Complex, unstructured data; perception tasks. | For image classification, natural language understanding (e.g., custom BERT models), and time-series forecasting with complex, non-linear patterns.
Gradient Boosting Machines (GBMs) | XGBoost, LightGBM, CatBoost | High-stakes tabular data; classification and regression. | The default choice for most business problems involving structured data (e.g., customer churn, fraud detection). Offers high accuracy and better interpretability than DNNs.
Clustering & Anomaly Detection | Scikit-learn, PyOD | Unsupervised learning; finding hidden groups or outliers. | For network intrusion detection, identifying fraudulent transactions, or customer segmentation when data is unlabeled.
Time-Series Models | Prophet, Statsmodels (ARIMA) | Forecasting based on historical time-stamped data. | For demand forecasting, resource planning, and predicting future trends when strong seasonality and trend components exist.
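
As a small illustration of the time-series row, the sketch below fits a seasonal ARIMA model from Statsmodels to a synthetic monthly series and produces a six-month forecast. The series and the (1, 1, 1) x (1, 0, 1, 12) order are assumptions chosen for readability; a real project would select the order from diagnostics such as ACF/PACF plots or information criteria.

arima_forecast_sketch.py
# Minimal sketch: seasonal ARIMA forecast on a synthetic monthly series.
# Assumes: pip install statsmodels pandas numpy
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic demand series with trend and yearly seasonality
rng = np.random.default_rng(42)
months = pd.date_range("2021-01-01", periods=48, freq="MS")
values = 100 + 2 * np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12) + rng.normal(0, 2, 48)
series = pd.Series(values, index=months)

# Illustrative order; a real project would select it via AIC/BIC or an auto-ARIMA search
model = ARIMA(series, order=(1, 1, 1), seasonal_order=(1, 0, 1, 12))
fit = model.fit()

forecast = fit.forecast(steps=6)  # six-month-ahead forecast
print(forecast)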

Use Cases

A diseased or anomalous neuronal network, representing network intrusion detection.
  • Cybersecurity (Network Intrusion Detection): We build network intrusion detection systems (NIDS) using autoencoder neural networks. The model is trained exclusively on months of legitimate network flow data (e.g., NetFlow, Zeek logs) to learn a highly detailed baseline of "normal" traffic. In production, any live traffic that the model cannot reconstruct with high fidelity (i.e., that produces a high reconstruction error) is flagged in real time as a potential anomaly, an approach capable of detecting zero-day threats that signature-based systems would miss.
  • Intelligence (Threat Forecasting): We create predictive models to forecast geopolitical instability or supply chain disruptions. These models are trained on a massive fusion of diverse data sources, including event databases (GDELT), global news feeds, shipping manifests, and commodity pricing data. The output is a probability score for a specific event (e.g., civil unrest, port closure) in a given region, allowing organizations to take proactive measures.
  • Web Dev & Automation (User Churn Prediction): We implement sophisticated user churn prediction models for SaaS applications. The model analyzes time-series of user behavior patterns (login frequency, specific feature adoption, support ticket history, time spent in-app) to generate a "churn risk score" for every user. This allows the business to automatically trigger proactive interventions, like offering a discount to a high-value, at-risk user or providing targeted training to a user struggling with a key feature.

Complete Code Example: TensorFlow Autoencoder for Anomaly Detection

This is a complete, runnable Python script demonstrating how to build a simple autoencoder for anomaly detection. It generates synthetic data, trains the model, determines an anomaly threshold, and then uses it to classify new data points.

autoencoder_anomaly_detection.py
# Step 0: Installation
# pip install tensorflow numpy scikit-learn matplotlib

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
import matplotlib.pyplot as plt

# Step 1: Generate Synthetic Data
# Create normal data centered around a mean, representing 'normal' network traffic
normal_data = np.random.normal(loc=0.5, scale=0.1, size=(1000, 64))
# Create anomaly data, which is distinctly different
anomaly_data = np.random.normal(loc=0.9, scale=0.05, size=(100, 64))

# For training, we only use normal data. We split it into training and validation sets.
train_size = int(len(normal_data) * 0.8)
x_train, x_val = normal_data[:train_size], normal_data[train_size:]

print(f"Training data shape: {x_train.shape}")
print(f"Validation data shape: {x_val.shape}")

# Step 2: Build the Autoencoder Model
latent_dim = 8
input_dim = x_train.shape[1]

autoencoder = models.Sequential([
    keras.Input(shape=(input_dim,)),

    # Encoder
    layers.Dense(32, activation='relu'),
    layers.Dense(16, activation='relu'),
    layers.Dense(latent_dim, activation='relu', name='latent_space'),

    # Decoder
    layers.Dense(16, activation='relu'),
    layers.Dense(32, activation='relu'),
    layers.Dense(input_dim, activation='sigmoid') # Sigmoid output, since the synthetic data lies approximately in [0, 1]
])

autoencoder.compile(optimizer='adam', loss='mae')
autoencoder.summary()

# Step 3: Train the model ONLY on normal data
history = autoencoder.fit(x_train, x_train,
                          epochs=50,
                          batch_size=256,
                          shuffle=True,
                          validation_data=(x_val, x_val),
                          verbose=0) # Use verbose=1 to see training progress

# Step 4: Determine the Anomaly Threshold
# Predict on the validation data to see what a 'normal' reconstruction error looks like
reconstructions = autoencoder.predict(x_val)
val_loss = tf.keras.losses.mae(x_val, reconstructions)

# Set the threshold above the typical error on normal data
# (mean + 3 * standard deviation)
threshold = np.mean(val_loss) + 3 * np.std(val_loss)
print(f"\nCalculated Anomaly Threshold: {threshold:.4f}")

# Step 5: Evaluate on Anomalous Data
def predict_anomaly(model, data, threshold):
    """Flag samples whose reconstruction error exceeds the threshold."""
    reconstructions = model.predict(data)
    loss = tf.keras.losses.mae(data, reconstructions)
    return tf.math.greater(loss, threshold)

# Combine normal validation and anomaly data for testing
test_data = np.vstack([x_val, anomaly_data])
is_anomaly = predict_anomaly(autoencoder, test_data, threshold)

print(f"\nAnomalies detected: {np.sum(is_anomaly)} out of {len(test_data)} total samples.")
print(f"Known anomalies in test set: {len(anomaly_data)}")

# Plotting reconstruction errors
plt.hist(np.asarray(val_loss), bins=50, label='Normal Data Loss')
plt.axvline(threshold, color='r', linestyle='dashed', linewidth=2, label='Anomaly Threshold')
plt.title('Reconstruction Error Distribution')
plt.xlabel('Mean Absolute Error')
plt.ylabel('No. of Samples')
plt.legend()
plt.show()

Links