Improving AI Service Performance by 25% Through Microservices Migration

How AnimaApp’s strategic microservices transformation reduced latency while maintaining 99.9% uptime for user-facing AI features

By Leke Ariyo

When AnimaApp, a Y Combinator-backed design automation platform, began experiencing performance bottlenecks that threatened their core AI-powered features, the engineering team faced a critical decision. The monolithic architecture that had served them well during early product development was becoming a liability as user demand scaled. Response times were degrading, resource utilization was inefficient, and the system’s ability to handle concurrent AI processing requests was hitting hard limits.

As Senior DevOps Engineer at AnimaApp, I led the strategic migration from monolithic to microservices architecture that ultimately improved service latency by 25% while maintaining the 99.9% uptime that users depended on for their design workflows. This case study examines the technical decisions, implementation strategy, and measurable outcomes of this transformation.

AnimaApp’s technology converts design files into working code, leveraging machine learning models to read design intention and output equivalent HTML, CSS, and React components. The machine learning-driven process requires significant amounts of compute power and coordination across many processing steps: file parsing, image recognition, layout examination, code generation, and quality assessment.

The Performance Challenge

The original monolithic architecture handled these processes sequentially within a single application. While this approach simplified initial development and deployment, it created several performance problems as user volume grew:

Resource Contention: CPU-intensive AI processing competed with I/O-intensive file operations within the same application boundaries, leading to suboptimal resource utilization across the system.

Scaling Limitations: The entire application had to be scaled based on the most resource-intensive component, resulting in over-provisioning of resources for lighter processing tasks.

Processing Delays: Serial processing meant that a delay at any stage delayed the entire user request, and the inference times of the AI models directly affected perceived application responsiveness.

Deployment Risk: Changing any single component required deploying the entire monolith, increasing the risk of inadvertently introducing regressions that affected every user-facing feature.
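The serial flow behind these problems can be sketched in Python. The stage names below are hypothetical stand-ins for AnimaApp’s internal pipeline, not actual code; the point is that each stage blocks the next, so total latency is the sum of every stage’s latency:

```python
# Minimal sketch of the monolithic pipeline: every stage runs in-process
# and sequentially, so one slow stage delays the whole request.
# All function bodies are illustrative placeholders.

def parse_file(design_file: bytes) -> dict:
    return {"layers": ["header", "body"]}  # stand-in for real file parsing

def run_inference(parsed: dict) -> dict:
    return {"components": parsed["layers"]}  # GPU-bound in the real system

def generate_code(analysis: dict) -> str:
    return "\n".join(f"<div>{c}</div>" for c in analysis["components"])

def optimize_assets(code: str) -> str:
    return code.strip()  # stand-in for image/asset optimization

def handle_request(design_file: bytes) -> str:
    # Serial execution: the stages also compete for the same CPU,
    # memory, and I/O budget inside a single application boundary.
    parsed = parse_file(design_file)
    analysis = run_inference(parsed)
    code = generate_code(analysis)
    return optimize_assets(code)
```

In this shape, scaling means replicating the whole `handle_request` process, even when only one stage is saturated.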

Performance monitoring showed that mean response times for AI-powered features had risen 40% over six months as user volume grew, with 95th percentile latencies breaching acceptable thresholds for design professionals who depend on fast iteration cycles.

Strategic Architecture Analysis

Before making any changes, we performed extensive performance analysis to pinpoint exactly where bottlenecks were occurring and which architectural changes would have the greatest effect.

With the aid of distributed tracing and fine-grained application profiling, we identified four distinct processing domains within the monolith, each with different resource requirements and scaling behaviors:

File Processing Service: Handled design file uploads, parsing, and initial validation. This component was I/O-intensive with predictable resource requirements.

AI Inference Engine: Executed machine learning models for layout recognition and component classification. This required GPU acceleration and had variable processing times based on design complexity.

Code Generation Service: Converted AI analysis results into functional code. This was CPU-intensive but had consistent processing patterns.

Asset Optimization Service: Compressed and optimized images and other design assets. This component could benefit from horizontal scaling during peak usage periods.
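The kind of per-domain measurement that surfaced these boundaries can be approximated with a simple stage timer. This is a standard-library stand-in for the distributed tracing we actually used (a real setup would use a tracing framework such as OpenTelemetry), and the stage names and sleep durations are illustrative:

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock time per processing domain.
timings: dict[str, float] = {}

@contextmanager
def traced(stage: str):
    # Record the duration of a stage, mimicking a tracing span.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Simulated request: sleeps stand in for real stage work.
with traced("file_processing"):
    time.sleep(0.01)
with traced("ai_inference"):
    time.sleep(0.05)  # the dominant stage, as in our measurements
with traced("code_generation"):
    time.sleep(0.02)

bottleneck = max(timings, key=timings.get)
print(bottleneck)  # → ai_inference
```

Aggregating spans like these across many requests is what let us attribute latency to specific domains rather than to the application as a whole.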

We found that the AI Inference Engine was the predominant performance bottleneck, and that its resource requirements were fundamentally distinct from those of the other components. Isolating it into a separate service would allow GPU-focused scaling and optimization without resource contention with the other processing stages.
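Concretely, isolating the inference engine lets its deployment reserve GPU resources that the other services never touch. A sketch of what such a Kubernetes manifest might look like (the service name, image, and values are illustrative, not AnimaApp’s actual configuration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference-engine        # hypothetical service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-inference-engine
  template:
    metadata:
      labels:
        app: ai-inference-engine
    spec:
      containers:
        - name: inference
          image: registry.example.com/ai-inference:latest  # placeholder image
          resources:
            requests:
              nvidia.com/gpu: 1    # GPU reserved for inference only
            limits:
              nvidia.com/gpu: 1
```

Because only this deployment requests GPUs, the CPU- and I/O-bound services can scale on ordinary nodes at their own cadence, independent of inference capacity.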


Leke Ariyo is a Cloud & Site Reliability Engineer with over 9 years of experience building resilient, scalable systems across AWS, GCP, and Azure. He specializes in Kubernetes, Terraform, and CI/CD automation, with a strong focus on observability and platform reliability. Leke has led engineering efforts for startups and large enterprises like Lloyds Banking Group, consistently driving efficiency, uptime, and cost savings.

TechCity
