Multi-Agent Code Translation
(MEng Project, University of Toronto – Supervised by Prof. Eldan Cohen, University of Toronto & Vector Institute)
Sept. 2024 – May 2025
Goal
I aimed to make code migration between Python and Java faster and more reliable. Moving programs across languages is usually slow, error-prone, and costly, and standard AI translators often break syntax or logic.
This project tested whether a multi-agent system with agents for analysis, translation, testing, and repair could produce cleaner, more dependable translations as a first step toward robust end-to-end migration tools.
An overview of the multi-agent system's architecture is shown below:
Data & Evaluation Pipeline
I prepared ~600 parallel Python↔Java files and selected 50 diverse functions with unit tests. I then built a unit test evaluation pipeline: every translated function was automatically inserted into its tests, executed, and scored. The pipeline logged pass/fail rates, compilation errors, and regeneration attempts into CSVs, letting me compare the performance of all architectures in detail.
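A minimal sketch of what such an evaluation pipeline can look like is shown below. The function and file names here are illustrative assumptions, not the project's actual code: each translated function is combined with its unit tests into a runnable module, executed in a subprocess, and the pass/fail outcome is logged to a CSV for comparison.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path

def evaluate_translation(translated_code: str, test_code: str) -> dict:
    """Insert a translated function into its unit tests, run them, and score the result."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "candidate_test.py"
        # Combine the candidate translation and its tests into one runnable module.
        target.write_text(translated_code + "\n\n" + test_code)
        proc = subprocess.run(
            [sys.executable, str(target)], capture_output=True, text=True, timeout=30
        )
    # Exit code 0 means every assertion in the test module passed.
    return {"passed": proc.returncode == 0, "stderr": proc.stderr}

def log_results(rows: list[dict], out_path: str) -> None:
    """Write per-function pass/fail records to a CSV for cross-architecture comparison."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["function", "passed", "stderr"])
        writer.writeheader()
        writer.writerows(rows)
```

The same harness can be reused unchanged across every architecture, which is what makes the per-architecture comparison fair.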
Architectures & Agents
- Baseline – A single GPT-4o translator (no checks or repairs).
- Architecture #1 – Translator, Evaluator, Regenerator
- Translator produced code → Evaluator ran unit tests → Regenerator fixed errors and retried.
- Architecture #2 – Added an Analyzer to explain the source code, plan translations, and guide test generation.
- Architecture #3 – Used a Deep Analyzer with AST reasoning plus an Evaluator (Claude Sonnet) that critiqued logic instead of only running tests.
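The translate → evaluate → regenerate loop at the core of Architecture #1 can be sketched as a small control function. The agent interfaces below are assumptions for illustration (in practice each callable would wrap an LLM call or the test harness):

```python
from typing import Callable

def translate_with_repair(
    source: str,
    tests: str,
    translate: Callable[[str], str],                    # Translator agent (e.g. a GPT-4o call)
    evaluate: Callable[[str, str], tuple[bool, str]],   # Evaluator: runs tests, returns (passed, error log)
    regenerate: Callable[[str, str, str], str],         # Regenerator: repairs code given source, candidate, errors
    max_retries: int = 3,
) -> tuple[str, bool]:
    """Architecture #1 control loop: translate, test, then repair until tests pass
    or the retry budget runs out."""
    candidate = translate(source)
    for _ in range(max_retries):
        passed, errors = evaluate(candidate, tests)
        if passed:
            return candidate, True
        # Feed the failure log back so the Regenerator can target the actual error.
        candidate = regenerate(source, candidate, errors)
    passed, _ = evaluate(candidate, tests)
    return candidate, passed
```

Architectures #2 and #3 keep this loop but prepend an analysis step whose output is passed to the translator and test generator.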
Results
Each architecture was scored with the unit-test pipeline described above, which checked every translation's correctness and reliability against the dataset's tests.
- Baseline Python→Java accuracy: 46%
- Architecture #1: 68% (fewer compile errors thanks to testing & regeneration).
- Architecture #2: similar accuracy, but slower and sometimes noisier.
- Architecture #3: 78% accuracy, a 32-point gain with far fewer retries.
- Java→Python stayed high (84–86%) across all systems.
Takeaway
Designing a multi-agent system, especially with the Deep Analyzer and critique-based Evaluator, led to much more reliable translations. Using AST (Abstract Syntax Tree) analysis gave the system a clear view of each program’s structure, helping it understand control flow and edge cases before translating. This improved first-pass accuracy, reduced syntax and logic errors, and showed how structured reasoning and AST insights can make automated code migration far more dependable.
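As a rough illustration of the kind of structural view AST analysis provides, the sketch below (an assumption, not the project's actual Deep Analyzer) walks a Python function's AST and collects facts about control flow that an analyzer could hand to the translator before any code is generated:

```python
import ast

def summarize_structure(source: str) -> dict:
    """Collect structural facts about a Python function from its AST:
    defined functions, loop and branch counts, and exception handlers."""
    tree = ast.parse(source)
    summary = {"functions": [], "loops": 0, "branches": 0, "exception_handlers": 0}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            summary["functions"].append(
                {"name": node.name, "args": [a.arg for a in node.args.args]}
            )
        elif isinstance(node, (ast.For, ast.While)):
            summary["loops"] += 1
        elif isinstance(node, ast.If):
            summary["branches"] += 1
        elif isinstance(node, ast.ExceptHandler):
            summary["exception_handlers"] += 1
    return summary
```

A summary like this gives the translator an explicit map of control flow and error handling, so edge cases (e.g. exception paths that Java must express as checked exceptions) are surfaced before translation rather than discovered at test time.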
Collaboration
I worked under the guidance of Prof. Eldan Cohen (University of Toronto & Vector Institute), whose expertise helped shape the system design and experiments.