From Functions to Full Repositories
Recent LLMs excel at function-level generation, yet falter when a task spans an entire codebase. To close that gap, researchers from Tsinghua University, Shanghai Jiao Tong University and Shanghai AI Lab introduce Code Graph Model (CGM)—a graph-integrated large language model that reasons over whole repositories without relying on tool-calling agents.
How CGM Works
| Component | Purpose |
|---|---|
| Graph Encoder–Adapter | Extracts control-flow, call-graph and dependency edges from every file, converting them into node embeddings. |
| Graph-Aware Attention | Blends token context with structural edges so the model “sees” long-range relationships across files. |
| Staged Training | 1) text-only warm-up on permissively licensed code; 2) graph-enhanced fine-tuning on 20K curated repos; 3) instruction tuning for tasks like bug repair and doc generation. |
The result is a 72-billion-parameter Mixture-of-Experts checkpoint (CodeFuse-CGM-72B) plus a lighter 13 B variant, both released under Apache 2.0 on Hugging Face.
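To make the graph-extraction idea concrete, here is a minimal sketch of how call and import edges might be pulled out of a small Python repository using the standard-library `ast` module. It is not CGM's actual encoder: the function name `build_code_graph`, the edge labels, and the toy two-file repo are all assumptions made for illustration, and callees are resolved by bare function name only, which a real system would not do.

```python
import ast
from collections import defaultdict


def build_code_graph(sources):
    """Build a toy repository graph from {module_name: source_text}.

    Hypothetical helper, not part of CGM. Returns a dict mapping an
    edge type ("contains", "imports", "calls") to a set of (src, dst) pairs.
    """
    edges = defaultdict(set)
    defined = {}  # bare function name -> qualified node name
    trees = {}

    # First pass: record the top-level functions each module defines.
    for module, text in sources.items():
        tree = ast.parse(text)
        trees[module] = tree
        for node in tree.body:
            if isinstance(node, ast.FunctionDef):
                qualified = f"{module}.{node.name}"
                defined[node.name] = qualified
                edges["contains"].add((module, qualified))

    # Second pass: collect import edges and function-to-function call edges.
    for module, tree in trees.items():
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    edges["imports"].add((module, alias.name))
            elif isinstance(node, ast.ImportFrom) and node.module:
                edges["imports"].add((module, node.module))
        for func in tree.body:
            if not isinstance(func, ast.FunctionDef):
                continue
            caller = f"{module}.{func.name}"
            for node in ast.walk(func):
                if isinstance(node, ast.Call):
                    callee = node.func
                    # Simplification: match on the bare name only.
                    if isinstance(callee, ast.Name):
                        name = callee.id
                    else:
                        name = getattr(callee, "attr", None)
                    if name in defined:
                        edges["calls"].add((caller, defined[name]))
    return dict(edges)


# Toy two-file "repository" (made up for this example).
repo = {
    "utils": "def clean(x):\n    return x.strip()\n",
    "app": "from utils import clean\n\ndef main():\n    return clean(' hi ')\n",
}
graph = build_code_graph(repo)
print(sorted(graph["calls"]))    # [('app.main', 'utils.clean')]
print(sorted(graph["imports"]))  # [('app', 'utils')]
```

The cross-file edge from `app.main` to `utils.clean` is the kind of structural fact that plain token context can miss once the two files sit far apart in the prompt.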
Benchmark Highlights
| Task (RepoBench) | GPT-4o (agent) | DeepSeek-R1 | CGM-72B |
|---|---|---|---|
| Bug Fix (pass@1) | 62.3 % | 55.8 % | 64.7 % |
| Refactor-Large | 58.1 % | 48.9 % | 61.4 % |
| Doc Generation | 71.5 % | 66.2 % | 72.1 % |
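For readers unfamiliar with the pass@1 metric in the Bug Fix row, the snippet below sketches the widely used unbiased pass@k estimator from the code-generation literature: with `n` samples per task, of which `c` pass the tests, pass@k = 1 − C(n−c, k) / C(n, k). The post does not say how the scores above were computed, so treat this as background on the metric rather than a description of the authors' evaluation; the sample counts are invented.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any k draws must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Invented numbers: 10 samples for one task, 3 of which pass the tests.
score = pass_at_k(n=10, c=3, k=1)
print(round(score, 3))  # 0.3
```

For k = 1 the estimator reduces to the plain fraction of correct samples, c / n, averaged over all tasks in the benchmark.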
Why It Matters
- Agent-Free Reliability – Removes the non-determinism and overhead of multi-call agent frameworks.
- Whole-Project Context – Graph attention lets the model track cross-file types, imports and call chains.
- Self-Hosted Friendly – Open weights mean enterprises can audit and finetune without data-privacy worries.
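One common way to let attention follow graph structure is to mask it, so that a token can attend only to tokens in its own graph node or in a directly connected node. The sketch below illustrates that general idea in plain Python. The post does not spell out CGM's attention mechanism, so the function name `graph_attention_mask`, the undirected-edge treatment, and the node names are assumptions, not the model's implementation.

```python
def graph_attention_mask(token_nodes, edges):
    """Build a boolean attention mask from graph structure.

    token_nodes[i] is the graph node that token i belongs to.
    edges is an iterable of (node_a, node_b) pairs, treated as undirected.
    mask[i][j] is True when token i may attend to token j.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    n = len(token_nodes)
    mask = [[False] * n for _ in range(n)]
    for i, node_i in enumerate(token_nodes):
        # A token may attend within its own node and to adjacent nodes.
        allowed = neighbors.get(node_i, set()) | {node_i}
        for j, node_j in enumerate(token_nodes):
            mask[i][j] = node_j in allowed
    return mask


# Four tokens spread over three hypothetical nodes; only one call edge exists.
token_nodes = ["app.main", "app.main", "utils.clean", "db.connect"]
edges = [("app.main", "utils.clean")]
mask = graph_attention_mask(token_nodes, edges)
print(mask[0])  # [True, True, True, False]
print(mask[3])  # [False, False, False, True]
```

Tokens in `app.main` can reach the `utils.clean` token through the call edge, while the unconnected `db.connect` token attends only to itself. In a real model a mask like this would be applied to the attention scores before the softmax.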
Limitations & Roadmap
The authors note performance drops on repos exceeding 50 K lines; future work targets hierarchical graphs and sparse attention to scale further. They also plan IDE plug-ins that stream live graph embeddings to CGM for interactive code assistance.
Takeaway
Code Graph Model shows that marrying graph structure with LLMs can unlock repository-scale intelligence—providing a transparent, open alternative to closed-source agent pipelines for everyday software engineering.
Paper: https://huggingface.co/papers/2505.16901