
CloudZen’s Deep Learning Optimization Project for an AdTech Company

Migrating from classic ML to deep learning in a mobile AdTech platform was one of the most challenging projects we have faced as a technology partner to an AdTech company. The customer’s Demand-Side Platform (DSP) used ML to predict whether showing an ad impression would lead to a user clicking and installing an app. While we quickly achieved competitive offline metrics with deep learning, deploying these models in production presented unexpected challenges.
CloudZen’s team initially tested deep learning models built with Keras and Vertex AI, iterating against the customer’s existing logistic regression models. Over time, we improved performance and modernized the ML pipeline. However, the transition wasn’t seamless; we encountered multiple incidents along the way.
In machine learning-driven systems, failures are inevitable. At Twitch, we used the Five W’s framework for postmortems—analyzing what went wrong, when and where it happened, who was involved, and why it occurred—then implementing measures to prevent recurrence. This structured approach helped us systematically strengthen our ML platform.

8 Key Incidents & Lessons We Learned

1. Untrained Embeddings
Issue: Many of our deployed models, including those predicting click and install conversions, suffered from poor calibration: predicted conversion rates were significantly higher than the conversion rates actually observed on served impressions. A deeper investigation revealed that the miscalibration was particularly severe for categorical features with sparse training data. Ultimately, we discovered that our install model’s embedding layers contained vocabulary entries with no corresponding training data, so some weights remained at their randomly initialized values and produced inaccurate predictions.
Fix: Limited vocabulary sizes and zeroed out untrained embeddings. Eventually, we stopped reusing vocabulary across tasks.
Lesson: Untrained embeddings led to model instability. We built tools to track weight changes between training runs.
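One simple way to implement this kind of tracking (a sketch, not CloudZen’s production tooling) is to snapshot an embedding table at initialization and compare it with the trained weights: rows that never move correspond to vocabulary entries with no training data and can be zeroed out, as described in the fix above.

```python
import numpy as np
import tensorflow as tf

def find_untrained_rows(model: tf.keras.Model, layer_name: str,
                        initial_table: np.ndarray, tol: float = 1e-7) -> np.ndarray:
    """Return vocabulary indices whose embedding vectors never changed during training."""
    trained_table = model.get_layer(layer_name).get_weights()[0]   # (vocab_size, dim)
    row_change = np.max(np.abs(trained_table - initial_table), axis=1)
    return np.where(row_change < tol)[0]                           # untouched vocab ids

# Usage sketch ("app_id_embedding" is an illustrative layer name):
#   init_table = model.get_layer("app_id_embedding").get_weights()[0].copy()
#   model.fit(train_ds, epochs=5)
#   stale_ids = find_untrained_rows(model, "app_id_embedding", init_table)
#   table = model.get_layer("app_id_embedding").get_weights()[0]
#   table[stale_ids] = 0.0                         # zero out untrained embeddings
#   model.get_layer("app_id_embedding").set_weights([table])
```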

2. TensorFlow Serving Padding Issue
Issue: Sparse tensors (e.g., installed app lists) were handled inconsistently across training and serving, leading to incorrect predictions.
Fix: Standardized handling of empty arrays across pipelines. Our initial fix involved updating the model training pipeline to mirror the logic used during model serving, replacing empty arrays with [0]. However, this only partially mitigated the issue. To fully address it, we expanded the vocabulary range from [0, n-1] to [0, n], introducing 0 as a placeholder with no inherent meaning and adding it to every tensor. This ensured that every sparse tensor contained at least one value, allowing us to effectively use batching within our sparse tensor setup.
Lesson: Lack of coordination between training and serving pipelines caused feature parity issues. Following this incident, we began including data scientists as reviewers on production pipeline pull requests to help catch these types of issues earlier.
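A minimal sketch of the padding convention described above, with illustrative names and the assumption that real vocabulary ids are shifted up by one so that id 0 stays meaningless:

```python
import tensorflow as tf

PLACEHOLDER_ID = 0  # reserved id with no inherent meaning

def encode_installed_apps(app_ids):
    """Map raw ids in [0, n-1] to [1, n] and prepend the placeholder,
    so even an empty app list produces a non-empty feature."""
    return [PLACEHOLDER_ID] + [i + 1 for i in app_ids]

# The same helper runs in both training and serving; every row of the resulting
# sparse tensor is guaranteed to have at least one value, which keeps batching simple.
examples = [encode_installed_apps(ids) for ids in ([], [3, 7], [42])]
sparse = tf.ragged.constant(examples, dtype=tf.int64).to_sparse()

# The embedding table grows by one row (ids 0..n); the placeholder row carries
# no signal, e.g. it is masked or kept at zero.
```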

3. Untrained Model Deployment
Issue: A faulty training run produced a model predicting a constant 25% click rate.
Fix: Rolled back to the previous model, added validation checks, and automated model rollback.
Lesson: Guardrails like validation metrics and real-time monitoring are essential. We also set up alerts on Datadog to flag large changes in the p50 p_ctr metric and worked on automating our model rollback process.
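As an illustration of the kind of guardrail we mean (thresholds and names are hypothetical, not the production gate), a pre-promotion check on a recent holdout set can refuse to deploy a candidate whose predictions are near-constant or badly calibrated:

```python
import numpy as np

def passes_sanity_checks(p_pred, y_true, min_std=1e-3, max_calibration_ratio=2.0):
    """Return False for candidates with degenerate or badly calibrated predictions."""
    p_pred = np.asarray(p_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    if p_pred.std() < min_std:                        # e.g. a constant 25% click rate
        return False
    ratio = p_pred.mean() / max(y_true.mean(), 1e-9)  # predicted vs. observed rate
    return 1.0 / max_calibration_ratio <= ratio <= max_calibration_ratio

# if not passes_sanity_checks(candidate.predict(x_holdout), y_holdout):
#     keep_previous_model()   # skip promotion instead of rolling back after the fact
```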

4. Bad Warmup Data for TensorFlow Serving
Issue: Mismatched warmup files and model tensors caused failed deployments.
Fix: Validated warmup files against model tensors before deployment.
Lesson: Staging environments for TensorFlow models are crucial for stability.
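A hedged sketch of such a check (paths are illustrative, and it assumes the warmup file contains standard TensorFlow Serving PredictionLog records for the Predict API): compare the tensors in every warmup request against the exported model’s serving signature before shipping the export.

```python
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2

def validate_warmup(saved_model_dir, warmup_path, signature_name="serving_default"):
    """Fail fast if warmup requests don't match the model's serving signature."""
    sig = tf.saved_model.load(saved_model_dir).signatures[signature_name]
    expected = set(sig.structured_input_signature[1].keys())  # named model inputs

    for raw in tf.data.TFRecordDataset(warmup_path):
        log = prediction_log_pb2.PredictionLog()
        log.ParseFromString(raw.numpy())
        got = set(log.predict_log.request.inputs.keys())
        if got != expected:
            raise ValueError(f"Warmup/model tensor mismatch: {sorted(got ^ expected)}")

# validate_warmup(
#     "export/1700000000",
#     "export/1700000000/assets.extra/tf_serving_warmup_requests",
# )
```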

5. Problematic Time-Based Features
Issue: Relative time features (e.g., “weeks ago”) created unintended biases.
Fix: Removed them after A/B testing revealed better live performance without them.
Lesson: Features that improve offline metrics can introduce bias in production.

6. Feedback Features & Labeling Errors
Issue: A bid price feature created a feedback loop that reinforced the model’s own bias, and performance degraded further after an experiment introduced false-positive labels.
Fix: Removed the feature and switched to alternative bidding signals.
Lesson: Certain feedback-driven features can cause cascading failures.

7. Bad Feature Encoding
Issue: Divide-by-zero handling differed between training and serving pipelines.
Fix: Standardized encoding logic.
Lesson: Feature parity checks between training and serving pipelines are critical.
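A minimal example of what standardized encoding logic looks like in practice (the helper name and default value are ours): both pipelines import the same function instead of each re-implementing the divide-by-zero edge case.

```python
import tensorflow as tf

def safe_ratio(numerator, denominator, default=0.0):
    """numerator / denominator, returning `default` wherever denominator == 0."""
    numerator = tf.cast(numerator, tf.float32)
    denominator = tf.cast(denominator, tf.float32)
    return tf.where(tf.equal(denominator, 0.0),
                    tf.fill(tf.shape(numerator), default),
                    tf.math.divide_no_nan(numerator, denominator))

# Training feature generation and the serving graph both call safe_ratio(),
# so a zero denominator encodes to the same value on both sides.
```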

8. String Parsing Inconsistencies
Issue: Different string sanitization logic led to out-of-vocabulary (OOV) issues in categorical features.
Fix: Unified parsing logic and improved vocabulary preprocessing.
Lesson: Standardizing string handling at data ingestion minimizes downstream issues.
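An illustrative version of a unified sanitizer (the exact rules below are assumptions, not the production logic); the key point is that vocabulary building and request parsing call the same function:

```python
import re
import unicodedata

_NON_ALNUM = re.compile(r"[^a-z0-9_.-]+")

def sanitize(value: str) -> str:
    """Single source of truth for cleaning categorical string values."""
    value = unicodedata.normalize("NFKC", value).strip().lower()
    return _NON_ALNUM.sub("_", value)

# Vocabulary build:   vocab = sorted({sanitize(v) for v in raw_training_values})
# Serving lookup:     token = sanitize(request_value)   # OOV only if truly unseen
```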

Key Takeaways

  • Monitor everything: Log feature values, encoded values, tensors, and predictions to detect discrepancies (see the sketch after this list).
  • Validate before deploying: Establish model validation and test environments to prevent faulty rollouts.
  • Beware of feature pitfalls: Some features introduce bias or feedback loops that degrade performance.
  • Ensure team collaboration: Training and serving pipelines must be reviewed together to maintain consistency.
  • Adopt MLOps best practices: Borrow from DevOps to improve reliability and incident response in ML systems.
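As a small illustration of the first point (field names are ours, not a production schema), logging the raw feature values, the encoded values actually fed to the model, and the prediction for every scored request makes training/serving discrepancies easy to diff offline:

```python
import json
import logging
import time

audit_log = logging.getLogger("prediction_audit")

def log_prediction(request_id, raw_features, encoded_features, p_ctr, model_version):
    """Emit one structured audit record per scored impression."""
    audit_log.info(json.dumps({
        "ts": time.time(),
        "request_id": request_id,
        "model_version": model_version,
        "raw_features": raw_features,          # as received by the server
        "encoded_features": encoded_features,  # what the model actually saw
        "p_ctr": float(p_ctr),
    }))

# An offline job re-encodes raw_features with the training pipeline and diffs the
# result against encoded_features and p_ctr to catch parity drift early.
```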

Deploying deep learning in production comes with challenges, but robust incident management and continuous iteration help build resilient ML platforms.

CloudZen is a leading Europe-based data engineering and software automation firm dedicated to crafting bespoke digital solutions for businesses worldwide.