Case Studies

CADY Accelerates AI Workflows and Overcomes AWS Batch Limitations with Slurm on AWS

Background

CADY is pioneering what was once considered an unsolvable challenge: the automated understanding of electrical component datasheets. CADY’s AI system translates complex hardware documentation into a formal language, enabling early-stage electronic design analysis.

This innovation allows hardware developers to significantly reduce development time, cut costs, and catch design issues before moving to advanced prototyping or manufacturing.

To power its AI workflows, CADY originally relied on AWS Batch, but as data complexity grew and job scheduling requirements became more advanced, the system ran into significant architectural limitations.

The Challenges

As CADY expanded its platform to support larger-scale AI workloads and real-time demands, AWS Batch began to pose obstacles. Specifically, the team faced:

  • Rigid orchestration limitations: AWS Batch lacked the flexibility to handle CADY’s growing variability in workload types and execution priorities
  • Scalability and latency constraints: CADY needed faster job turnaround and better support for compute elasticity
  • Observability challenges: Debugging failed jobs and tracking resource performance in AWS Batch proved cumbersome
  • Lack of control over scheduling: The team needed fine-grained control over job priority, preemption, and workload balancing—essential for optimizing AI pipelines

CADY needed a robust, low-latency solution that could scale with demand and offer the visibility and control required for production-grade AI inference and data processing.

The Solution

To overcome the limits of AWS Batch and enable intelligent orchestration at scale, CADY turned to Modality, an AWS Advanced Consulting Partner with deep expertise in AI workloads, HPC architectures, and distributed computing on AWS.

Modality implemented a custom Slurm-based job scheduling solution tailored to CADY’s AI architecture. This transformation included:

  • Design and deployment of a Slurm scheduler on AWS: replaced AWS Batch with a Slurm HPC cluster optimized for fast provisioning, queue management, and fault tolerance.
  • Elastic compute architecture: integrated EC2 Spot Instances, enabling cost-effective scaling while maintaining SLA guarantees.
  • Workflow observability and monitoring: improved visibility into job queues, task failures, and resource bottlenecks using AWS cost allocation tags, Jenkins automations, the Anodot FinOps platform, and custom dashboards and alerts.
  • Automation and DevOps enablement: delivered infrastructure-as-code automation for reproducibility, faster environment recovery, and low-overhead scaling experiments.
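
The case study does not include CADY’s actual configuration, but a setup combining elastic cloud nodes, Spot-backed capacity, and preemption can be sketched as a `slurm.conf` fragment along these lines. All cluster, node, partition, and script names here are hypothetical, and the sizes are placeholders:

```ini
# slurm.conf (fragment) -- illustrative sketch only; names and sizes invented
ClusterName=ai-cluster

# Power saving / cloud bursting: Slurm launches instances when jobs queue
# and tears them down once they sit idle.
ResumeProgram=/opt/slurm/bin/ec2-launch.sh      # hypothetical launch script
SuspendProgram=/opt/slurm/bin/ec2-terminate.sh  # hypothetical teardown script
SuspendTime=300        # seconds a node may idle before being suspended
ResumeTimeout=600      # seconds allowed for an instance to boot and register

# Partition-priority preemption: higher-tier jobs can requeue lower-tier ones
PreemptType=preempt/partition_prio
PreemptMode=REQUEUE

# Cloud nodes exist in the config but are only provisioned on demand
NodeName=node-[001-100] CPUs=8 RealMemory=30000 State=CLOUD

# Latency-sensitive inference gets a higher-priority partition; bulk data
# processing runs on Spot-backed capacity that can be preempted.
PartitionName=inference Nodes=node-[001-020] PriorityTier=10 MaxTime=1:00:00
PartitionName=batch     Nodes=node-[001-100] PriorityTier=1  Default=YES
```

With `PreemptMode=REQUEUE`, jobs preempted from the lower-tier partition are returned to the queue rather than lost, so preempted work is retried automatically.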

Through this transformation, Modality delivered a flexible, powerful, and cost-efficient orchestration layer—designed specifically for AI in production.

The Results

With Slurm running on AWS, CADY now benefits from:

  • Massively improved workload scheduling control, including queue prioritization, retries, and preemption
  • Faster AI turnaround, with reduced queue times and compute latency
  • Scalable HPC infrastructure, adaptable to unpredictable workloads and growing data complexity
  • Operational visibility, enabling faster debugging and system tuning with real-time telemetry
  • Cost savings, by leveraging Spot instances and eliminating inefficiencies associated with AWS Batch
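
At the job level, the controls listed above (priority, retries, preemption) are exposed through standard Slurm batch directives. A hypothetical submission script, with job name, partition, and command invented for illustration rather than taken from CADY’s pipeline, might look like:

```bash
#!/bin/bash
# Hypothetical Slurm batch script; names and paths are illustrative only.
#SBATCH --job-name=datasheet-parse
#SBATCH --partition=batch        # Spot-backed, preemptible partition (assumed)
#SBATCH --requeue                # if preempted, put the job back in the queue
#SBATCH --nice=100               # lower this job's priority vs. inference work
#SBATCH --time=02:00:00          # wall-clock limit
#SBATCH --cpus-per-task=8

srun python parse_datasheet.py "$INPUT_FILE"
```
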

“Migrating from AWS Batch to a Slurm-based scheduler gave us the power, flexibility, and transparency we were missing,” said Ido Port, System Architect at CADY. “Modality’s AWS expertise and hands-on support helped us streamline our AI operations and scale with confidence.”
