Multi-Agent Code Translation
(MEng Project, University of Toronto – Supervised by Prof. Eldan Cohen, University of Toronto & Vector Institute)
Sept. 2024 – May 2025
Goal
I aimed to make code migration between Python and Java faster and more reliable. Moving programs across languages is usually slow, error-prone, and costly, and standard AI translators often break syntax or logic.
This project tested whether a multi-agent system with agents for analysis, translation, testing, and repair could produce cleaner, more dependable translations as a first step toward robust end-to-end migration tools.
An overview of the multi-agent system's architecture is shown below:
Data & Evaluation Pipeline
I prepared ~600 parallel Python↔Java files and selected 50 diverse functions with unit tests. I then built a unit test evaluation pipeline: every translated function was automatically inserted into its tests, executed, and scored. The pipeline logged pass/fail rates, compilation errors, and regeneration attempts into CSVs, letting me compare the performance of all architectures in detail.
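A minimal sketch of what such an evaluation pipeline can look like is shown below. The function and file names here are illustrative assumptions, not the project's actual code: each translated function is combined with its unit tests into a runnable module, executed in a subprocess, and the pass/fail outcome is logged to a CSV for comparison.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path

def evaluate_translation(translated_code: str, test_code: str) -> dict:
    """Insert a translated function into its unit tests, run them, and score the result."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "candidate_test.py"
        # Combine the candidate translation and its tests into one runnable module.
        target.write_text(translated_code + "\n\n" + test_code)
        proc = subprocess.run(
            [sys.executable, str(target)], capture_output=True, text=True, timeout=30
        )
    # Exit code 0 means every assertion in the test module passed.
    return {"passed": proc.returncode == 0, "stderr": proc.stderr}

def log_results(rows: list[dict], out_path: str) -> None:
    """Write per-function pass/fail records to a CSV for cross-architecture comparison."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["function", "passed", "stderr"])
        writer.writeheader()
        writer.writerows(rows)
```

The same harness can be reused unchanged across every architecture, which is what makes the per-architecture comparison fair.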
Architectures & Agents
- Baseline – A single GPT-4o translator (no checks or repairs).
- Architecture #1 – Translator, Evaluator, Regenerator
- Translator produced code → Evaluator ran unit tests → Regenerator fixed errors and retried.
- Architecture #2 – Added an Analyzer to explain the source code, plan translations, and guide test generation.
- Architecture #3 – Used a Deep Analyzer with AST reasoning plus an Evaluator (Claude Sonnet) that critiqued logic instead of only running tests.
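The translate → evaluate → regenerate loop at the core of Architecture #1 can be sketched as a small control function. The agent interfaces below are assumptions for illustration (in practice each callable would wrap an LLM call or the test harness):

```python
from typing import Callable

def translate_with_repair(
    source: str,
    tests: str,
    translate: Callable[[str], str],                    # Translator agent (e.g. a GPT-4o call)
    evaluate: Callable[[str, str], tuple[bool, str]],   # Evaluator: runs tests, returns (passed, error log)
    regenerate: Callable[[str, str, str], str],         # Regenerator: repairs code given source, candidate, errors
    max_retries: int = 3,
) -> tuple[str, bool]:
    """Architecture #1 control loop: translate, test, then repair until tests pass
    or the retry budget runs out."""
    candidate = translate(source)
    for _ in range(max_retries):
        passed, errors = evaluate(candidate, tests)
        if passed:
            return candidate, True
        # Feed the failure log back so the Regenerator can target the actual error.
        candidate = regenerate(source, candidate, errors)
    passed, _ = evaluate(candidate, tests)
    return candidate, passed
```

Architectures #2 and #3 keep this loop but prepend an analysis step whose output is passed to the translator and test generator.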
Results
Each architecture was scored with the unit-test pipeline described above, which checked every translation's correctness and reliability against the dataset's tests.
- Baseline Python→Java accuracy: 46%
- Architecture #1: 68% (fewer compile errors thanks to testing & regeneration).
- Architecture #2: similar accuracy, but slower and sometimes noisier.
- Architecture #3: 78% accuracy, a 32-point gain with far fewer retries.
- Java→Python stayed high (84–86%) across all systems.
Takeaway
Designing a multi-agent system, especially with the Deep Analyzer and critique-based Evaluator, led to much more reliable translations. Using AST (Abstract Syntax Tree) analysis gave the system a clear view of each program’s structure, helping it understand control flow and edge cases before translating. This improved first-pass accuracy, reduced syntax and logic errors, and showed how structured reasoning and AST insights can make automated code migration far more dependable.
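As a rough illustration of the kind of structural view AST analysis provides, the sketch below (an assumption, not the project's actual Deep Analyzer) walks a Python function's AST and collects facts about control flow that an analyzer could hand to the translator before any code is generated:

```python
import ast

def summarize_structure(source: str) -> dict:
    """Collect structural facts about a Python function from its AST:
    defined functions, loop and branch counts, and exception handlers."""
    tree = ast.parse(source)
    summary = {"functions": [], "loops": 0, "branches": 0, "exception_handlers": 0}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            summary["functions"].append(
                {"name": node.name, "args": [a.arg for a in node.args.args]}
            )
        elif isinstance(node, (ast.For, ast.While)):
            summary["loops"] += 1
        elif isinstance(node, ast.If):
            summary["branches"] += 1
        elif isinstance(node, ast.ExceptHandler):
            summary["exception_handlers"] += 1
    return summary
```

A summary like this gives the translator an explicit map of control flow and error handling, so edge cases (e.g. exception paths that Java must express as checked exceptions) are surfaced before translation rather than discovered at test time.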
Collaboration
I worked under the guidance of Prof. Eldan Cohen (University of Toronto & Vector Institute), whose expertise helped shape the system design and experiments.