The Asia-Pacific region is layered with overlapping Free Trade Agreements, each running to thousands of legal provisions. Comparing how two or three of them treat the same topic is slow, manual work, and the answer often turns on small differences buried deep in the text.
Trade economists call this the spaghetti bowl problem, a tangle of rules and thresholds that differ just enough between agreements to matter for exporters operating under more than one.
This project tests whether an LLM-based pipeline can help with that comparison work. It segments the legal text of three ASEAN-centred FTAs into provisions, asks the model to classify each into a policy category, and uses retrieval to draft side-by-side notes on how each agreement handles the same topic.
The output is best read as a first-pass triage layer, useful for spotting where agreements diverge enough to warrant a closer manual read, not a substitute for it. The whole pipeline runs on free-tier APIs and can be pointed at any new FTA PDF.
The project is organised around three questions, one per layer of the pipeline. Answers and supporting numbers live on the Findings tab.
Can LLMs reliably classify FTA legal provisions, and how does accuracy change across models and prompt strategies?
How do comparable provisions differ across agreements in observable design features such as thresholds, governance structures, and scope?
Do the agreements show structural convergence or fragmentation in their treatment of key trade policy topics?
Limitations and Honest Caveats
Everything that can go wrong has been flagged here. Read this before drawing conclusions from any chart in the dashboard.
The AHKFTA source PDF is a fully scanned document. Tesseract OCR extracted the goods-trade chapters (Chapter 2 Trade in Goods, Chapter 3 Rules of Origin, plus Annexes 2-1, 3-1, 3-2, 3-3) reliably, but degraded substantially on the legal-paragraph chapters that follow.
What this means in practice: AHKFTA has 14 chapters (verifiable from its Table of Contents), but our extracted dataset under-represents the following:
- Chapter 8 — Trade in Services (with Annex 8-1 Schedules of Specific Commitments)
- Chapter 10 — Intellectual Property
- Chapter 13 — Consultations and Dispute Settlement (with Annex 13-1 Rules of Procedure for Arbitral Tribunals)
- Chapter 6 (Standards), Chapter 7 (Trade Remedies), Chapter 11 (General Provisions and Exceptions), Chapter 12 (Institutional Provisions) — all partially captured at best
Affected findings: Apparent zeros for AHKFTA in Trade in Services, IP, and Dispute Settlement on the Provision Distribution and Convergence pages are extraction artefacts, not real legal absences. The fragmentation entropy scores for IP (0.37) and Dispute Settlement (0.47) are partly driven by this gap.
A separate Agreement on Investment among the Governments of the Hong Kong Special Administrative Region of the People's Republic of China and the Member States of the Association of Southeast Asian Nations was signed on 18 May 2018 as a complementary instrument to the AHKFTA goods agreement. This document has not yet been processed by the current pipeline.
Affected findings: The Investment entropy ratio (0.96, currently flagged as "convergent") cannot be interpreted reliably. The 4 AHKFTA provisions tagged as investment-related may include genuine references to the parallel Investment Agreement, but a complete substantive comparison of investment regimes across the three FTAs requires this document to be processed.
The validation gold set contains only 50 provisions, all labelled by the project author. The author is not a customs lawyer or FTA specialist, and no inter-annotator agreement (κ between two human labellers) was measured.
Statistical implications: With n = 50 and observed accuracy of 0.480, the 95% confidence interval is approximately 0.34 to 0.62. Point estimates of model performance therefore carry substantial uncertainty.
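The interval quoted above can be checked with a few lines of arithmetic. The sketch below assumes a normal-approximation (Wald) interval; the source does not say which method was used, but this one reproduces the stated bounds. The function name `wald_interval` is illustrative, not part of the project code.

```python
import math

def wald_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    Assumption: the dashboard's interval was computed this way; the
    method is not stated in the source.
    """
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error

low, high = wald_interval(0.480, 50)
print(f"{low:.2f} to {high:.2f}")  # 0.34 to 0.62
```

Because the half-width shrinks with the square root of n, quadrupling the gold set to 200 provisions would roughly halve the interval at the same accuracy.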
Path to improvement: Expanding to ≥ 200 provisions labelled by ≥ 2 customs-law / FTA experts is the single highest-leverage improvement available. This requires institutional support that one researcher cannot provide alone.
Tariff commitments at the line-item level (HS code, base rate, staging category, phase-out year) live in Annex tariff schedules that are structured as multi-page tables rather than as paragraph text. The current extraction pipeline treats these as text fragments and does not preserve the row/column structure.
Affected findings: Quantitative tariff thresholds, particularly for AANZFTA (which delegates many threshold definitions to product-specific schedules), are under-recovered in the attribute extraction module. The Tariff Commitments category in classification reflects framework provisions, not actual rate schedules.
The two in-context examples used in the few-shot prompt for the stratified classification run were both goods-trade categories (one Rules of Origin, one Tariff). This biases both models toward goods-related classifications and away from services, investment, and intellectual property.
Affected findings: The high entropy ratio for Rules of Origin (0.97) is partially inflated by this exemplar bias. Future runs should use exemplars balanced across the full target taxonomy.
The entire pipeline runs on a personal MacBook with no GPU. LLM inference uses Groq's free tier, which imposes a rolling 24-hour budget of roughly 100,000 tokens for LLaMA 3.3 70B. A full Chain-of-Thought classification run on 100 provisions consumes that quota in a single session.
Implication: Scaling beyond three agreements or running comprehensive sweeps requires either paid API access or institutional inference infrastructure. The current pipeline demonstrates feasibility, not production capacity.
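To make the scaling constraint concrete, here is a rough budget estimate. It assumes about 1,000 tokens per Chain-of-Thought classification, which is simply the quoted daily quota divided by the 100 provisions that exhaust it; actual per-provision cost will vary with provision length and prompt. The constants and the `days_needed` helper are hypothetical.

```python
import math

DAILY_TOKEN_BUDGET = 100_000   # approximate free-tier figure quoted above
TOKENS_PER_PROVISION = 1_000   # assumption: 100,000 tokens / 100 CoT provisions

def days_needed(n_provisions,
                tokens_per_provision=TOKENS_PER_PROVISION,
                daily_budget=DAILY_TOKEN_BUDGET):
    """Days required to classify n_provisions once under the daily quota."""
    provisions_per_day = daily_budget // tokens_per_provision
    return math.ceil(n_provisions / provisions_per_day)

print(days_needed(100))    # 1
print(days_needed(4059))   # 41
```

The second call uses 4,059, the sum of the three per-agreement provision counts reported on this page, and covers a single model and a single prompt strategy only.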
The corpus covers RCEP, AHKFTA, and AANZFTA in English-language versions only. Findings are suggestive of Asia-Pacific patterns but are not statistically generalisable to the broader regional landscape, which includes the Comprehensive and Progressive Agreement for Trans-Pacific Partnership (CPTPP), the ASEAN-China FTA, the Korea-ASEAN FTA, and others.
Findings that remain reliable despite the caveats above:
- Rules of Origin attribute findings (CC vs CTH, 40% RVC, 10% de minimis): the relevant chapters were extracted reliably across all three agreements.
- Customs Procedures convergence (entropy 1.00): consistent across all three agreements and consistent with the WTO Trade Facilitation baseline.
- Pairwise Cohen's κ (0.582–0.702): robust on shared-cohort comparison and methodologically defensible.
- Tariff Commitments distribution differences: the framework-level patterns are reliable even if line-item rates are not.
- Pipeline reproducibility: code, data, and gold labels are public; anyone with a Groq API key can reproduce or extend.
Terms and Abbreviations
If anything in the headlines or charts is unfamiliar, the definitions live here.
Triage-Grade Classification, One Compliance Gap, One Real Convergence
The three findings the project landed on, with the supporting numbers and the caveats around each.
Each bar shows what share of that agreement's sampled provisions falls into each policy area. AHKFTA's profile is unmistakably different: almost half of its sample goes to Rules of Origin, and entire categories are missing from the extracted sample. Much of that gap is an OCR extraction artefact rather than a real legal absence.
20 chapters · 2,171 provisions (53.5%)
362 provisions (8.9%)
1,526 provisions (37.6%)
Best Runs Reach 48% Accuracy and 0.442 Macro-F1
Six model and strategy combinations scored against the same 50-provision gold set. That ceiling is why this dashboard is positioned for triage rather than provision-level adjudication.
| Model | Strategy | Accuracy | Macro-F1 | n | Note |
|---|---|---|---|---|---|
| LLaMA 3.3 70B | Zero-Shot | 0.480 | 0.431 | 50 | Best raw accuracy |
| Qwen 3 32B | Chain-of-Thought | 0.460 | 0.442 | 50 | 🏆 Best Macro-F1 |
| Qwen 3 32B | Zero-Shot | 0.380 | 0.424 | 50 | |
| Qwen 3 32B | Few-Shot | 0.380 | 0.373 | 50 | Few-shot hurts Qwen |
| LLaMA 3.3 70B | Few-Shot | 0.340 | 0.336 | 50 | |
| LLaMA 3.3 70B | Chain-of-Thought | 0.320 | 0.327 | 50 | CoT hurts LLaMA |
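Qwen 3 32B, few-shot Macro-F1: 0.373 (−5.1 points vs zero-shot)
Qwen 3 32B, CoT Macro-F1: 0.442 (+1.8 points vs zero-shot)
LLaMA 3.3 70B, few-shot Macro-F1: 0.336 (−9.5 points vs zero-shot)
LLaMA 3.3 70B, CoT Macro-F1: 0.327 (−10.4 points vs zero-shot)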
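Accuracy and Macro-F1 can rank runs differently because Macro-F1 weights every category equally, however few gold provisions it has. The sketch below shows both metrics on toy labels. It assumes Macro-F1 is averaged over every label that appears in either the gold set or the predictions; the source does not state the exact averaging set, and the labels are invented for illustration.

```python
def accuracy(gold, pred):
    """Share of provisions whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-category F1 scores.

    Assumption: categories are the union of gold and predicted labels.
    """
    pairs = list(zip(gold, pred))
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

# Toy labels only, not project data.
gold = ["RoO", "RoO", "RoO", "RoO", "Tariff", "Services"]
pred = ["Tariff", "Tariff", "RoO", "RoO", "Tariff", "Services"]

print(round(accuracy(gold, pred), 3))  # 0.667
print(round(macro_f1(gold, pred), 3))  # 0.722
```

Here the two errors both fall in the largest category, so accuracy is penalised more than Macro-F1, the same direction of gap seen in the Qwen zero-shot row.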
Models Agree Most Strongly at Zero Shot
Pairwise Cohen's κ across the six runs, computed on the provisions they share. Agreement is substantial overall, and prompting strategies weaken it rather than strengthen it.
| Run A | Run B | κ | n shared | Interpretation |
|---|---|---|---|---|
| LLaMA zero-shot | Qwen zero-shot | 0.702 | 200 | Substantial, strongest cross-model alignment |
| Qwen zero-shot | Qwen few-shot | 0.689 | 200 | Substantial, within-model consistency |
| LLaMA zero-shot | LLaMA few-shot | 0.668 | 200 | Substantial, within-model consistency |
| LLaMA CoT | Qwen CoT | 0.640 | 100 | Substantial, CoT aligns both models |
| LLaMA few-shot | Qwen few-shot | 0.582 | 200 | Moderate, few-shot diverges the models more |
With κ values in the 0.58 to 0.70 range, the two models show moderate-to-substantial agreement with each other when working on the same provisions. They are not labelling at random; they share a meaningful common signal.
The most important finding here is that aggregate category distributions are more trustworthy than individual labels. If both LLaMA and Qwen (zero-shot) assign 48% of AHKFTA provisions to Rules of Origin, that pattern is robust. The value of this pipeline is in answering questions like "what topics does AHKFTA focus on?", not "is this exact provision a Rules of Origin provision?"
Within each model, the pairs anchored on the zero-shot run show substantial agreement (0.668 to 0.689), and the highest cross-model κ (0.702) is between the two zero-shot runs. This is the practical recommendation: zero-shot produces the most reproducible and cross-model-consistent results, even if Qwen CoT edges it out on Macro-F1.
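For readers unfamiliar with the statistic, Cohen's κ compares observed agreement between two runs with the agreement expected by chance from each run's label frequencies. The following is a minimal sketch of that standard formula on toy labels, not the project's own implementation.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same provisions."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both runs pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    # Undefined (division by zero) if both runs use one identical label throughout.
    return (observed - expected) / (1 - expected)

# Toy labels only, not project data.
run_a = ["RoO", "RoO", "Tariff", "Tariff", "Services", "Services", "RoO", "Tariff"]
run_b = ["RoO", "RoO", "Tariff", "Services", "Services", "Services", "RoO", "RoO"]

print(round(cohens_kappa(run_a, run_b), 3))  # 0.628
```

In this toy case the runs agree on 6 of 8 provisions (0.75 raw agreement), but κ is lower because some of that agreement would be expected by chance.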
Each Agreement Carries a Distinct Topical Profile
Counts per policy category from a stratified sample of 100 provisions per agreement, classified by Qwen 3 32B few-shot.
| Category | RCEP | AHKFTA | AANZFTA | Total |
|---|---|---|---|---|
Customs Procedures Converges, Disputes and IP Fragment
Per-category entropy ratio across the three agreements. A score of 1.00 means provisions are shared evenly; a score near 0 means one agreement carries the topic alone.
| Category | RCEP | AHKFTA | AANZFTA | Entropy Ratio | Signal |
|---|---|---|---|---|---|
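The behaviour described for the entropy ratio (1.00 when provisions are shared evenly, near 0 when one agreement carries the topic alone) matches Shannon entropy normalised by its maximum. The sketch below assumes that definition; the exact formula used by the dashboard is not stated in the source, and the counts are invented.

```python
import math

def entropy_ratio(counts):
    """Shannon entropy of per-agreement counts divided by its maximum, log(k).

    Assumption: this is the normalisation behind the dashboard's score.
    """
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    entropy = sum((c / total) * math.log(total / c) for c in counts if c > 0)
    return entropy / math.log(k)

# Illustrative counts for (RCEP, AHKFTA, AANZFTA); not project data.
print(round(entropy_ratio([10, 10, 10]), 2))  # 1.0
print(round(entropy_ratio([12, 0, 0]), 2))    # 0.0
print(round(entropy_ratio([9, 1, 10]), 2))    # 0.78
```

The third case shows why an under-extracted agreement matters: a single near-empty column pulls the ratio down even when the other two agreements are almost identical.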
Convergent: Customs Procedures (1.00). All three agreements allocate almost identical shares of their text to customs procedures. This reflects the WTO Trade Facilitation Agreement providing a shared procedural baseline (documentation, advance rulings, risk-based release) that all three parties follow. This is an area where regional harmonisation has genuinely occurred.
Fragmented: Dispute Settlement (0.47) and Intellectual Property (0.37). These scores are driven almost entirely by AHKFTA having near-zero provisions in our extracted sample for both categories. However, AHKFTA does in fact contain Chapter 10 (Intellectual Property) and Chapter 13 (Consultations and Dispute Settlement) — the Table of Contents of the signed agreement confirms it. The provisions were under-extracted because the AHKFTA PDF is fully scanned and OCR degraded on the legal-paragraph chapters that follow the goods chapters. So the fragmentation signal here is largely a measurement artefact, not a real structural absence.
Caveat on Rules of Origin. Rules of Origin appears convergent (high entropy ratio) but this is partly inflated by the few-shot classification method used. Both in-context examples in the prompt were goods-trade categories, which biases the model toward classifying more provisions as Rules of Origin across all three agreements and makes the distribution appear more even than it actually is.
Caveat on Investment. Investment also scores highly on the entropy ratio. The 4 AHKFTA provisions classified as investment-related may include genuine references: a separate Agreement on Investment between Hong Kong and ASEAN, signed on 18 May 2018, complements the AHKFTA goods agreement. That Investment Agreement has not yet been processed in this pipeline, so neither the high entropy ratio nor the low AHKFTA provision count can be interpreted reliably until it is added. The honest reading is "investment scope cannot be assessed from current data."
Same Threshold, Different Transformation Rule
Feature-by-feature comparison drawn from the RAG output. The CC versus CTH row between AHKFTA and RCEP is the gap most worth flagging for closer review.
| Feature | RCEP | AHKFTA | AANZFTA |
|---|---|---|---|
| Tariff Amendment | Consensus + formal procedure; unilateral importer notification | HS 2012 reference; product-specific exporter choice | Unilateral modification rights; selective HS concessions |
| RoO Governance | Committee (Annex 3A/3B); CTH rule | Sub-Committee; CC rule, stricter transformation | Certificate-based; exporter compliance burden |
| RVC Threshold | 40% | 40% | Not in main text, delegated to schedules |
| CTC Rule | CTH, 4-digit heading level. Easier to satisfy. | CC, 2-digit chapter level. More restrictive. | Not recovered in main text |
| Dispute Settlement | Own mechanism; independent of WTO DSU | WTO DSU referenced in extracted text; Ch.13 (Consultations and Dispute Settlement, Annex 13-1) under-extracted | Own mechanism; adjudication + arbitration |
| Investment | Dedicated Ch.10; national treatment | No dedicated chapter; 4 provisions classified as investment-adjacent (possibly misclassified, or references to the separate 2018 Investment Agreement, which is not yet processed) | Dedicated Ch.11; national treatment; ISDS |
| Customs | Direct consignment requirements | No legalisation / authentication required | Risk-based clearance for low-risk goods |
| Services | Mode 1 to 4; schedule of commitments | Not recovered (Ch.8 Trade in Services under-extracted by OCR) | Mode 1 to 4; schedule of commitments |
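RCEP and AHKFTA both use a 40% RVC threshold. An exporter looking at just that number would conclude that they meet the Rules of Origin requirements under both agreements. But the CTC rule, the other requirement, differs in a practically important way: RCEP applies CTH, a change at the 4-digit heading level that is easier to satisfy, while AHKFTA applies CC, a change at the 2-digit chapter level that is more restrictive.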
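The practical difference between the two tariff-shift rules can be illustrated with HS code prefixes. This is a deliberately simplified sketch: it checks only whether every non-originating input sits outside the finished good's chapter (CC) or heading (CTH), and it ignores de minimis allowances and product-specific rules. The function name and the example product are illustrative and are not taken from the agreements' schedules.

```python
def tariff_shift_satisfied(input_hs_codes, output_hs_code, rule):
    """Check a change-in-tariff-classification rule on HS code prefixes.

    rule: "CC" compares 2-digit chapters, "CTH" compares 4-digit headings.
    Simplification: de minimis and product-specific rules are not modelled.
    """
    digits = {"CC": 2, "CTH": 4}[rule]
    output_prefix = output_hs_code.replace(".", "")[:digits]
    return all(
        code.replace(".", "")[:digits] != output_prefix
        for code in input_hs_codes
    )

# Example: imported cotton yarn (heading 5205) woven into cotton fabric (heading 5208).
non_originating_inputs = ["5205"]
finished_good = "5208"

print(tariff_shift_satisfied(non_originating_inputs, finished_good, "CTH"))  # True
print(tariff_shift_satisfied(non_originating_inputs, finished_good, "CC"))   # False
```

The same transformation passes a heading-level test and fails a chapter-level one, because both products sit in HS chapter 52. That is the kind of gap a shared 40% RVC figure can hide.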
Side by Side Across Eleven Policy Categories
For each category, Qwen 3 32B read the three most relevant provisions from each agreement and wrote a structured comparison.