Trade Policy · Large Language Models · ASEAN

A Computational Framework for Comparative Analysis of Free Trade Agreements

An exploration of whether Large Language Models can read FTA legal text well enough to support comparison work: extracting, classifying, and surfacing design differences across three ASEAN-centred agreements.

4,059 provisions extracted · 6 LLM classification runs · 11 policy categories · 3 FTAs compared · 0.442 best Macro-F1

The Challenge

The Asia-Pacific region is layered with overlapping Free Trade Agreements, each running to thousands of legal provisions. Comparing how two or three of them treat the same topic is slow, manual work, and the answer often turns on small differences buried deep in the text.

Trade economists call this the spaghetti bowl problem, a tangle of rules and thresholds that differ just enough between agreements to matter for exporters operating under more than one.

The Solution

This project tests whether an LLM-based pipeline can help with that comparison work. It segments the legal text of three ASEAN-centred FTAs into provisions, asks the model to classify each into a policy category, and uses retrieval to draft side-by-side notes on how each agreement handles the same topic.

The output is best read as a first-pass triage layer, useful for spotting where agreements diverge enough to warrant a closer manual read, not a substitute for it. The whole pipeline runs on free-tier APIs and can be pointed at any new FTA PDF.

Three Research Questions

The project is organised around three questions, one per layer of the pipeline. Answers and supporting numbers live on the Findings tab.

RQ1: Classification

Can LLMs reliably classify FTA legal provisions, and how does accuracy change across models and prompt strategies?

RQ2: Policy Design

How do comparable provisions differ across agreements in observable design features such as thresholds, governance structures, and scope?

RQ3: Convergence

Do the agreements show structural convergence or fragmentation in their treatment of key trade policy topics?

How the Pipeline Works

📄

Extraction

7 PDFs parsed with pdfplumber, PyMuPDF, and OCR fallback into 4,059 provisions

🔢

Embedding

Each provision vectorised with MiniLM and stored in ChromaDB for semantic retrieval

🏷️

Classification

LLaMA 3.3 70B and Qwen 3 32B each run zero-shot, few-shot, and CoT strategies

🔍

RAG Comparison

Top provisions per category retrieved and fed to the LLM for a cross-agreement narrative

✅

Validation

50 hand-labelled provisions used to score all 6 model-strategy combinations
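
The three prompt strategies named in the Classification step can be sketched roughly as follows. This is a minimal illustration, not the project's actual prompts: the function name `build_prompt`, the prompt wording, and the category list (only the categories named on this page, not the full 11-category taxonomy) are all assumptions.

```python
# Hypothetical prompt builder; wording and names are illustrative assumptions.
CATEGORIES = [
    "Rules of Origin",
    "Tariff Commitments",
    "Customs Procedures",
    "Trade in Services",
    "Investment",
    "Intellectual Property",
    "Dispute Settlement",
]  # partial list: only categories named on this page


def build_prompt(provision, strategy, examples=()):
    """Return a classification prompt for one provision.

    strategy: "zero_shot", "few_shot", or "cot".
    examples: (text, label) pairs, used only by the few-shot strategy.
    """
    parts = [
        "Classify the FTA provision into exactly one of these categories: "
        + "; ".join(CATEGORIES) + "."
    ]
    if strategy == "few_shot":
        # Labelled examples are shown before the new provision.
        for text, label in examples:
            parts.append(f"Provision: {text}\nCategory: {label}")
    elif strategy == "cot":
        # Ask for reasoning first, final label last.
        parts.append(
            "Reason step by step, then give the final category on the last line."
        )
    elif strategy != "zero_shot":
        raise ValueError(f"unknown strategy: {strategy}")
    parts.append(f"Provision: {provision}\nCategory:")
    return "\n\n".join(parts)
```

Keeping the task description identical across strategies means that any difference in scores can be attributed to the examples or the reasoning instruction rather than to a change in the task wording.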

1. AHKFTA Source PDF — OCR Coverage Gap

The AHKFTA source PDF is a fully scanned document. Tesseract OCR extracted the goods-trade chapters (Chapter 2 Trade in Goods, Chapter 3 Rules of Origin, plus Annexes 2-1, 3-1, 3-2, 3-3) reliably, but degraded substantially on the legal-paragraph chapters that follow.

What this means in practice: AHKFTA has 14 chapters (verifiable from its Table of Contents), but our extracted dataset under-represents the following:

Chapter 8 — Trade in Services (with Annex 8-1 Schedules of Specific Commitments)
Chapter 10 — Intellectual Property
Chapter 13 — Consultations and Dispute Settlement (with Annex 13-1 Rules of Procedure for Arbitral Tribunals)
Chapter 6 (Standards), Chapter 7 (Trade Remedies), Chapter 11 (General Provisions and Exceptions), Chapter 12 (Institutional Provisions) — all partially captured at best

Affected findings: Apparent zeros for AHKFTA in Trade in Services, IP, and Dispute Settlement on the Provision Distribution and Convergence pages are extraction artefacts, not real legal absences. The fragmentation entropy scores for IP (0.37) and Dispute Settlement (0.47) are partly driven by this gap.

2. ASEAN-Hong Kong Investment Agreement Not Processed

A separate Agreement on Investment among the Governments of the Hong Kong Special Administrative Region of the People's Republic of China and the Member States of the Association of Southeast Asian Nations was signed on 18 May 2018 as a complementary instrument to the AHKFTA. This document has not yet been processed by the current pipeline.

Affected findings: The Investment entropy ratio (0.96, currently flagged as "convergent") cannot be interpreted reliably. The 4 AHKFTA provisions tagged as investment-related may include genuine references to the parallel Investment Agreement, but a complete substantive comparison of investment regimes across the three FTAs requires this document to be processed.

3. Validation Gold Set — Small and Single-Annotator

The validation gold set contains only 50 provisions, all labelled by the project author. The author is not a customs lawyer or FTA specialist, and no inter-annotator agreement (κ between two human labellers) was measured.

Statistical implications: With n = 50 and observed accuracy of 0.480, the 95% confidence interval is approximately 0.34 to 0.62. Point estimates of model performance therefore carry substantial uncertainty.
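
The interval quoted above can be reproduced with a normal-approximation (Wald) interval for a proportion. The page does not say which method was used, so treat the choice of formula as an assumption; it happens to match the stated range.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


low, high = wald_interval(0.480, 50)
print(f"{low:.2f} to {high:.2f}")  # 0.34 to 0.62
```

Because the half-width shrinks with the square root of n, moving from 50 to 200 labelled provisions would roughly halve it.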

Path to improvement: Expanding to ≥ 200 provisions labelled by ≥ 2 customs-law / FTA experts is the single highest-leverage improvement available. This requires institutional support that one researcher cannot provide alone.

4. Tariff Schedule Annexes Not Table-Extracted

Tariff commitments at the line-item level (HS code, base rate, staging category, phase-out year) live in Annex tariff schedules that are structured as multi-page tables rather than as paragraph text. The current extraction pipeline treats these as text fragments and does not preserve the row/column structure.

Affected findings: Quantitative tariff thresholds, particularly for AANZFTA (which delegates many threshold definitions to product-specific schedules), are under-recovered in the attribute extraction module. The Tariff Commitments category in classification reflects framework provisions, not actual rate schedules.

5. Few-Shot Prompt Bias

The two in-context examples used in the few-shot prompt for the stratified classification run were both goods-trade categories (one Rules of Origin, one Tariff). This biases both models toward goods-related classifications and away from services, investment, and intellectual property.

Affected findings: The high entropy ratio for Rules of Origin (0.97) is partially inflated by this exemplar bias. Future runs should use exemplars balanced across the full target taxonomy.

6. Infrastructure — Personal Laptop, Free-Tier APIs

The entire pipeline runs on a personal MacBook with no GPU. LLM inference uses Groq's free tier, which imposes a rolling 24-hour budget of ~100,000 tokens for LLaMA 3.3 70B. A full Chain-of-Thought classification run on 100 provisions consumes that quota in one session.

Implication: Scaling beyond three agreements or running comprehensive sweeps requires either paid API access or institutional inference infrastructure. The current pipeline demonstrates feasibility, not production capacity.
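
As a back-of-envelope illustration of that constraint: if a 100-provision Chain-of-Thought run uses up the ~100,000-token daily budget, the implied cost is roughly 1,000 tokens per provision. The sketch below extrapolates that figure to the full corpus. The per-provision cost is derived from the two numbers above, not measured, and the function name is hypothetical.

```python
import math

DAILY_TOKEN_BUDGET = 100_000                      # approximate free-tier budget quoted above
TOKENS_PER_PROVISION = DAILY_TOKEN_BUDGET / 100   # implied by "100 provisions use the quota"


def days_needed(n_provisions,
                tokens_per_provision=TOKENS_PER_PROVISION,
                daily_budget=DAILY_TOKEN_BUDGET):
    """Days of free-tier quota needed for one CoT run on one model."""
    return math.ceil(n_provisions * tokens_per_provision / daily_budget)


print(days_needed(4059))  # 41
```

Under these assumptions, a single Chain-of-Thought pass over all 4,059 provisions with one model would take about six weeks of daily quota.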

7. Three Agreements, English Only

The corpus covers RCEP, AHKFTA, and AANZFTA in English-language versions only. Findings are suggestive of Asia-Pacific patterns but are not statistically generalisable to the broader regional landscape, which includes the Comprehensive and Progressive Agreement for Trans-Pacific Partnership (CPTPP), the ASEAN-China FTA, the Korea-ASEAN FTA, and others.

What Remains Reliable Despite These Gaps

Rules of Origin attribute findings (CC vs CTH, 40% RVC, 10% de minimis): the relevant chapters were extracted reliably across all three agreements.
Customs Procedures convergence (entropy 1.00): consistent across all three agreements and consistent with the WTO Trade Facilitation baseline.
Pairwise Cohen's κ (0.582–0.702): robust on shared-cohort comparison and methodologically defensible.
Tariff Commitments distribution differences: the framework-level patterns are reliable even if line-item rates are not.
Pipeline reproducibility: code, data, and gold labels are public; anyone with a Groq API key can reproduce or extend.

FTA

Free Trade Agreement. A treaty between countries that reduces or eliminates tariffs and other trade barriers.

RCEP

Regional Comprehensive Economic Partnership. A 15-party FTA covering ASEAN plus China, Japan, South Korea, Australia, and New Zealand. Signed 2020.

AHKFTA

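ASEAN-Hong Kong Free Trade Agreement. An FTA between ASEAN and Hong Kong. The text extracted for this project is dominated by its goods chapters, although the agreement also contains Services, IP, and Dispute Settlement chapters. In force since 2019.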

AANZFTA

ASEAN-Australia-New Zealand Free Trade Agreement. A comprehensive FTA including services, investment, and dispute settlement. Signed 2009.

ASEAN

Association of Southeast Asian Nations. A 10-country regional bloc including Indonesia, Thailand, Vietnam, the Philippines, Singapore, and others.

LLM

Large Language Model. An AI model trained on large amounts of text that can read, classify, summarise, and generate natural language. LLaMA and Qwen are both LLMs.

RAG

Retrieval-Augmented Generation. A technique where relevant text is retrieved from a database and given to the LLM as context before it generates a response, improving factual accuracy.

Rules of Origin (RoO)

Rules that determine whether a product "originates" from a country and thus qualifies for preferential tariff treatment under an FTA. The most rule-intensive area in any goods FTA.

RVC

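Regional Value Content. A product qualifies for preferential treatment if at least X% of its value was added within the FTA region. RCEP and AHKFTA both use a 40% threshold in the extracted text; AANZFTA's threshold was not recovered from its main text because it is delegated to schedules.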

CTC

Change in Tariff Classification. An alternative way to satisfy Rules of Origin. The inputs used in manufacturing must fall under a different tariff code than the finished product, proving substantial transformation occurred.

CTH

Change in Tariff Heading. A CTC rule requiring a change at the 4-digit HS code level (the heading). Used by RCEP. Less strict than CC because transformation within the same chapter is allowed.

CC

Change in Chapter. A CTC rule requiring a change at the 2-digit HS code level (the chapter). Used by AHKFTA. Stricter than CTH because it demands more substantial transformation of the goods.

HS Code

Harmonized System code. An internationally standardised numbering system for traded goods. The 2-digit level is a "chapter," the 4-digit level is a "heading," and the 6-digit level is a "subheading."

NTM

Non-Tariff Measure. A trade barrier other than a tariff, for example import quotas, licensing requirements, or technical standards that effectively restrict trade.

SPS

Sanitary and Phytosanitary Measures. Rules protecting human, animal, and plant health from food-borne risks, diseases, and pests. Governed internationally by the WTO SPS Agreement.

IP

Intellectual Property. Legal rights covering inventions (patents), creative works (copyright), brand names (trademarks), and geographic indicators.

ISDS

Investor-State Dispute Settlement. A mechanism allowing foreign investors to bring legal claims directly against a host government in international arbitration, bypassing domestic courts.

WTO DSU

World Trade Organization Dispute Settlement Understanding. The WTO's own mechanism for resolving trade disputes between member states. The extracted AHKFTA text references it; AHKFTA also has its own Chapter 13 (Consultations and Dispute Settlement), which was under-extracted in this project.

κ, Cohen's Kappa

A statistical measure of agreement between two classifiers that corrects for the agreement you would expect by chance alone. 1.0 means perfect agreement; 0 means no better than chance; negative means worse than chance.

Macro-F1

Macro-averaged F1 score. Measures classification accuracy equally across all categories, including rare ones. More informative than plain accuracy when some categories are much more common than others.

Zero-shot

A prompt strategy where the model is given only the task description and the text to classify. No examples are provided. The model relies entirely on its training.

Few-shot

A prompt strategy where two labelled examples are shown to the model before the new provision. The examples help the model understand the expected format and reasoning.

CoT, Chain-of-Thought

A prompt strategy that instructs the model to reason step-by-step before giving its final answer. Improves Qwen's performance on this task but degrades LLaMA's.

OCR

Optical Character Recognition. Software that converts scanned images of printed text into machine-readable characters. Used as a fallback when PDF text extraction fails.

ChromaDB

An open-source vector database used in this project to store provision embeddings and retrieve semantically similar provisions for the RAG comparison pipeline.

MiniLM

all-MiniLM-L6-v2. A compact sentence embedding model from the sentence-transformers library. Converts text into numerical vectors for semantic similarity search.

Entropy (normalised)

A measure of how evenly distributed provisions are across the three agreements for a given category. A score of 1.0 means all three agreements contribute equally; 0 means one agreement holds everything.

WTO / UNCTAD

World Trade Organization / United Nations Conference on Trade and Development. The two international bodies whose standard FTA chapter taxonomy the 11 classification categories in this project follow.

Total Provisions

4,059

Extracted from 3 FTAs, 7 PDFs

Best Macro-F1

0.442

Qwen 3 32B, Chain-of-Thought

LLM Runs

6

2 models × 3 prompt strategies

Policy Categories

11

WTO/UNCTAD taxonomy

Verdict on Each Research Question

RQ1: Classification

Can LLMs reliably classify FTA legal provisions?

✓ Triage-grade, not provision-level

Accuracy lands between 32% and 48%. Qwen 3 32B with Chain-of-Thought reaches the best Macro-F1 at 0.442. CoT helps Qwen and hurts LLaMA; few-shot prompting hurts both.

RQ2: Policy Design

How do the agreements differ in observable design features?

✓ Same threshold, different transformation rule

RCEP and AHKFTA both use a 40% RVC threshold but RCEP applies CTH at the heading level while AHKFTA applies CC at the chapter level. AANZFTA is the only one with ISDS. Note: AHKFTA's Services, IP, and Dispute Settlement chapters were under-extracted from a scanned PDF — see data limitations.

RQ3: Convergence

Do the agreements converge or fragment?

✓ Procedural converges, substantive does not

Customs Procedures is the only category in genuine sync (entropy 1.00). Dispute Settlement (0.47) and Intellectual Property (0.37) score as fragmented, but this is largely a data artefact — AHKFTA actually has both chapters; OCR under-extracted them from the scanned source.

Three Agreements, Three Very Different Scopes

Each bar shows what share of that agreement's sampled provisions falls into each policy area. AHKFTA's profile is unmistakably different. Almost half of its sample goes to Rules of Origin, and entire categories appear to be missing, largely because OCR under-extracted its Services, IP, and Dispute Settlement chapters.

Three Key Findings

⚡

Prompt strategy must be chosen per model, not universally

Chain-of-thought prompting raises Qwen's Macro-F1 by 1.8 points but drops LLaMA's by 10.4. Few-shot prompting hurts both. Picking the right strategy is a per-model decision, and the asymmetry only became visible once all six combinations were scored against the same gold set.

⚖️

RCEP eligibility does not guarantee AHKFTA eligibility

RCEP and AHKFTA both use a 40% Regional Value Content threshold. But AHKFTA requires transformation at the chapter level (CC) while RCEP only requires it at the heading level (CTH). On paper, the same product can clear RCEP's rule and still fail AHKFTA's. That is the kind of detail an analyst would want flagged before a closer reading.

🌏

Most apparent fragmentation comes from missing chapters, not opposing rules

Customs Procedures is the only category where all three agreements carry similar shares of provisions, consistent with their shared WTO Trade Facilitation baseline. Dispute Settlement and Intellectual Property appear fragmented in our data, but AHKFTA does in fact contain both chapters (Ch.10 IP, Ch.13 Dispute Settlement) — they were under-extracted from a scanned source PDF. The fragmentation signal is therefore partly an artefact of OCR coverage rather than a real structural absence.

Raw Corpus Size per Agreement

RCEP contributes more than half the corpus simply because it is the largest agreement. The comparative analysis uses a balanced sample of 100 provisions per agreement so that RCEP's size does not drown out the other two.
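
A balanced sample of this kind can be drawn in a few lines. The sketch below assumes simple random sampling within each agreement and a hypothetical record schema (a dict with an `agreement` key); the project's own sampling may stratify further, for example by chapter.

```python
import random
from collections import defaultdict


def balanced_sample(provisions, per_agreement=100, seed=0):
    """Draw up to `per_agreement` provisions at random from each agreement."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    by_agreement = defaultdict(list)
    for provision in provisions:
        by_agreement[provision["agreement"]].append(provision)
    sample = []
    for name in sorted(by_agreement):
        group = by_agreement[name]
        sample.extend(rng.sample(group, min(per_agreement, len(group))))
    return sample


# Toy corpus using the per-agreement sizes reported on this page.
sizes = [("RCEP", 2171), ("AHKFTA", 362), ("AANZFTA", 1526)]
corpus = [{"agreement": name, "id": i} for name, size in sizes for i in range(size)]
sample = balanced_sample(corpus)
print(len(sample))  # 300
```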

Headline Numbers

Best Accuracy

LLaMA 3.3 70B, Zero-Shot

0.480

Best Macro-F1

Qwen 3 32B, Chain-of-Thought

0.442

Highest Cohen's κ

LLaMA ZS vs Qwen ZS, n=200

0.702

Most Convergent Category

Customs Procedures

1.00

Most Fragmented Category

Intellectual Property

0.37

RCEP

15 nations · Signed 2020 · In force 2022
20 chapters · 2,171 provisions (53.5%)

Includes: Goods, Services, Investment, IP, Dispute Settlement

AHKFTA

ASEAN + Hong Kong · Signed 2017
362 provisions (8.9%)

Includes: Goods, Services (Ch.8), IP (Ch.10), Dispute Settlement (Ch.13). A separate Agreement on Investment (2018) also exists.

⚠ Source PDF is fully scanned. OCR captured goods/RoO chapters well but under-extracted Services, IP, and Dispute Settlement chapters. Investment Agreement was not yet processed. Coverage figures below understate AHKFTA's actual scope.

AANZFTA

ASEAN + AU + NZ · Signed 2009
1,526 provisions (37.6%)

Includes: Services, Investment (Ch.11), Dispute Settlement, ISDS

Where to Find Each Result

Validation

All 6 model-strategy combinations scored against the 50-provision gold set. Accuracy and Macro-F1 by run.

Inter-Run Agreement

Pairwise Cohen's κ across runs. How consistent are the models when working on the same provisions?

Provision Distribution

What each agreement actually covers, by category and count, on the 100-provision stratified sample.

Convergence Analysis

Entropy ratio per category. Which topics are similar across all three agreements, and which are not?

Policy Design Matrix

Side-by-side feature comparison: thresholds, governance, and the CC vs CTH gap that matters for compliance.

RAG Comparisons

Per-category narratives drafted from retrieved provisions across all three agreements.

Accuracy, share of provisions correctly labelled

LLaMA zero-shot leads on raw accuracy at 0.480, with Qwen CoT close behind at 0.460. No run clears 50%, which is why aggregate patterns are more reliable than any single label.

Macro-F1, unweighted mean of per-category F1

Qwen CoT leads on Macro-F1 even though LLaMA zero-shot beats it on raw accuracy. That gap means Qwen handles the rarer categories better, while LLaMA may be over-favouring the most common ones.
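
A toy example shows how two classifiers with the same accuracy can differ on Macro-F1. The labels and predictions below are invented for illustration and are not drawn from the gold set. Averaging over every label that appears in either the gold or the predicted list is an assumption about how the score is computed.

```python
def accuracy(gold, pred):
    """Share of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over labels seen in gold or pred."""
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold = ["RoO"] * 8 + ["IP"] * 2
majority = ["RoO"] * 10                 # always predicts the common class
balanced = ["RoO"] * 6 + ["IP"] * 4     # misses two RoO, finds both IP

for name, pred in [("majority", majority), ("balanced", balanced)]:
    print(name, accuracy(gold, pred), round(macro_f1(gold, pred), 3))
# majority 0.8 0.444
# balanced 0.8 0.762
```

Both classifiers label 8 of 10 items correctly, but the one that never predicts the rare class scores zero F1 on it, which pulls its macro average down.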

Full Validation Table, all 6 runs ranked

Model	Strategy	Accuracy	Macro-F1	n	Note
LLaMA 3.3 70B	Zero-Shot	0.480	0.431	50	Best raw accuracy
Qwen 3 32B	Chain-of-Thought	0.460	0.442	50	🏆 Best Macro-F1
Qwen 3 32B	Zero-Shot	0.380	0.424	50
Qwen 3 32B	Few-Shot	0.380	0.373	50	Few-shot hurts Qwen
LLaMA 3.3 70B	Few-Shot	0.340	0.336	50
LLaMA 3.3 70B	Chain-of-Thought	0.320	0.327	50	CoT hurts LLaMA

The clearest pattern in this table is that chain-of-thought prompting moves the two models in opposite directions: it helps Qwen and hurts LLaMA. Few-shot prompting hurts both. That asymmetry only became visible after all six combinations were scored against the same gold set.

Key Finding: The Two Models Respond to Prompting in Opposite Directions

Qwen 3 32B, CoT is the best strategy

Zero-shot Macro-F1: 0.424
Few-shot Macro-F1: 0.373 (−5.1 points)
CoT Macro-F1: 0.442 (+1.8 points)

Qwen 3 is a "thinking model" that emits reasoning traces by default. CoT prompting appears to work with that behaviour, giving the model space to work through legal text before committing to a label. Few-shot examples appear to anchor it to the wrong prior.

Chart: Macro-F1 comparison across strategies (zero-shot, few-shot, Chain-of-Thought)

LLaMA 3.3 70B, Zero-shot is the best strategy

Zero-shot Macro-F1: 0.431
Few-shot Macro-F1: 0.336 (−9.5 points)
CoT Macro-F1: 0.327 (−10.4 points)

LLaMA's default zero-shot understanding of legal categories is better calibrated than what the prompting strategies produce. Adding examples or reasoning instructions actively degrades its output, possibly because it over-conditions on the in-context content.

Chart: Macro-F1 comparison across strategies (zero-shot, few-shot, Chain-of-Thought)

Pairwise κ, all run pairs on shared provisions

Run A	Run B	κ	n shared	Interpretation
LLaMA zero-shot	Qwen zero-shot	0.702	200	Substantial, strongest cross-model alignment
Qwen zero-shot	Qwen few-shot	0.689	200	Substantial, within-model consistency
LLaMA zero-shot	LLaMA few-shot	0.668	200	Substantial, within-model consistency
LLaMA CoT	Qwen CoT	0.640	100	Substantial, CoT aligns both models
LLaMA few-shot	Qwen few-shot	0.582	200	Moderate, few-shot diverges the models more

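Cross-model agreement at zero-shot (κ = 0.702) is higher than agreement within either model between its zero-shot and few-shot runs (0.689 for Qwen, 0.668 for LLaMA), and higher than cross-model agreement under few-shot (0.582) or CoT (0.640). This suggests that both models share a baseline reading of FTA legal categories, and that prompting strategies disturb that shared signal more than they reinforce it.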

What This Means for Reliability

With κ values in the 0.58 to 0.70 range, the two models are moderately to substantially consistent with each other when working on the same provisions. They are not labelling at random; they share a meaningful common signal.
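
For readers who want to recompute agreement on their own runs, Cohen's κ between two label lists can be sketched as follows. The function name and the toy labels are illustrative, not taken from the project's code or data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both runs pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both runs used a single identical label throughout
    return (observed - expected) / (1 - expected)


run_a = ["RoO", "RoO", "Tariff", "Tariff", "IP", "IP"]
run_b = ["RoO", "RoO", "Tariff", "IP", "IP", "IP"]
print(round(cohen_kappa(run_a, run_b), 2))  # 0.75
```

Here the runs agree on 5 of 6 items (0.83 raw agreement), but a third of that agreement would be expected by chance, so κ comes out lower at 0.75.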

The most important finding here is that aggregate category distributions are more trustworthy than individual labels. If both LLaMA and Qwen (zero-shot) assign 48% of AHKFTA provisions to Rules of Origin, that pattern is robust as a description of the extracted sample. The value of this pipeline is in answering questions like "what topics does AHKFTA focus on?", not "is this exact provision a Rules of Origin provision?"

The highest κ in the table is between the two zero-shot runs, and cross-model agreement drops when both models move to few-shot or CoT. The practical recommendation follows: zero-shot produces the most cross-model-consistent results, even though Qwen CoT edges it out on Macro-F1.

Provision Count by Category and Agreement

AHKFTA's bar dominates Rules of Origin while showing zero or near-zero counts in Dispute Settlement, Trade in Services, and Intellectual Property. This pattern is a data artefact, not a real design feature. AHKFTA in fact contains Chapter 8 (Trade in Services), Chapter 10 (Intellectual Property), and Chapter 13 (Consultations and Dispute Settlement). The source PDF is fully scanned, and Tesseract OCR captured the goods-trade chapters well but degraded substantially on the legal-paragraph chapters that follow. The "compression into goods chapters" is therefore extraction loss, not legal scope.

Provision Counts, colour intensity shows concentration

Category	RCEP	AHKFTA	AANZFTA	Total

Reading down the AHKFTA column, the zeros are missing values, not real absences — they reflect OCR failure on the scanned source for chapters that exist in the agreement but were not successfully extracted. The strongest cells (AHKFTA's 48 Rules of Origin and 31 Tariff Commitments, AANZFTA's 22 Dispute Settlement) are reliable; AHKFTA's zeros in Services, IP, and Dispute Settlement are not.

Three Structural Patterns Worth Highlighting

AHKFTA

Rules of Origin heavy, 48% of its sample, vs 24% for RCEP

In our extracted sample, AHKFTA's provisions concentrate heavily in Rules of Origin and Tariff schedules. The CC transformation rule, stricter than RCEP's CTH, may also generate more legal text per topic. Caveat: the sample under-represents AHKFTA's Services, IP, and Dispute Settlement chapters because those sections of the scanned source were not reliably captured by OCR. The actual AHKFTA covers all those areas, and a separate 2018 Investment Agreement has not yet been processed.

AANZFTA

Dispute Settlement leader, 22 provisions vs 6 for RCEP and 0 for AHKFTA

AANZFTA maintains its own independent dispute settlement mechanism with dedicated adjudication and arbitration procedures. AHKFTA's zero is an extraction artefact: its Chapter 13 (Consultations and Dispute Settlement) was under-extracted from the scanned source, and the text that was recovered points to the WTO DSU as a reference. RCEP runs its own mechanism, but fewer of its provisions surfaced in the stratified sample.

RCEP

Broadest coverage, the only agreement with meaningful IP provisions in the extracted sample (12)

RCEP's 15-party scope drives it to codify a wide range of policies, and its source PDF is born-digital so extraction was reliable. Its 20 Trade in Services and 12 IP provisions reflect actual chapter coverage. AHKFTA also contains Services and IP chapters in principle, but those were not fully captured from the scanned source — so a clean comparison of scope across all three agreements is not yet possible from this dataset alone.

Entropy Ratio by Category, convergence vs fragmentation

Customs Procedures is the only category where all three agreements are genuinely in sync. Dispute Settlement and Intellectual Property sit at the fragmented end of the entropy chart, but the reason is mostly an extraction artefact: AHKFTA actually has both chapters (Ch.10 IP and Ch.13 Dispute Settlement), and they were under-captured from the scanned source PDF. So the entropy score reads "fragmented" while the underlying legal scope is closer to "all three present, comparable in size."

Convergence Signal Table, raw counts behind each score

Category	RCEP	AHKFTA	AANZFTA	Entropy Ratio	Signal

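Dispute Settlement shows counts of 6, 0, and 22 across the three agreements. That spread is what drives the fragmentation score: one agreement carries almost all the sampled provisions and another has none. Because AHKFTA's zero reflects OCR under-extraction of its Chapter 13, the score measures a gap in the extracted data rather than a drafting conflict or a confirmed difference in scope.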
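
The entropy ratio can be sketched as Shannon entropy over the per-agreement counts, divided by its maximum for three agreements. The exact formula used by the dashboard is not shown on this page, so this is an assumption; it does reproduce the 0.47 reported for Dispute Settlement from the counts of 6, 0, and 22.

```python
import math


def entropy_ratio(counts):
    """Shannon entropy of the count shares, normalised to the range 0..1."""
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        return 0.0
    shares = [c / total for c in counts if c > 0]  # zero counts contribute nothing
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(counts))


print(round(entropy_ratio([6, 0, 22]), 2))    # 0.47
print(round(entropy_ratio([10, 10, 10]), 2))  # 1.0
```

A single zero count caps the ratio at log(2)/log(3), about 0.63, however evenly the other two agreements split, which is why an extraction gap in one agreement is enough to push a category toward "fragmented".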

What Convergence and Fragmentation Mean for Policy

Convergent: Customs Procedures (1.00). All three agreements allocate almost identical shares of their text to customs procedures. This reflects the WTO Trade Facilitation Agreement providing a shared procedural baseline (documentation, advance rulings, risk-based release) that all three agreements follow. This is an area where regional harmonisation has genuinely occurred.

Fragmented: Dispute Settlement (0.47) and Intellectual Property (0.37). These scores are driven almost entirely by AHKFTA having near-zero provisions in our extracted sample for both categories. However, AHKFTA does in fact contain Chapter 10 (Intellectual Property) and Chapter 13 (Consultations and Dispute Settlement) — the Table of Contents of the signed agreement confirms it. The provisions were under-extracted because the AHKFTA PDF is fully scanned and OCR degraded on the legal-paragraph chapters that follow the goods chapters. So the fragmentation signal here is largely a measurement artefact, not a real structural absence.

Caveat on Rules of Origin. Rules of Origin appears convergent (high entropy ratio) but this is partly inflated by the few-shot classification method used. Both in-context examples in the prompt were goods-trade categories, which biases the model toward classifying more provisions as Rules of Origin across all three agreements and makes the distribution appear more even than it actually is.

Caveat on Investment. Investment also scores highly on the entropy ratio. The 4 AHKFTA provisions classified as investment-related may include genuine references — there is in fact a separate Agreement on Investment among Hong Kong and ASEAN signed 18 May 2018 that complements the AHKFTA goods agreement. That separate Investment Agreement was not yet processed in this pipeline, so neither the high entropy nor the low provision count for AHKFTA Investment can be interpreted reliably until it is added. The honest reading is "investment scope cannot be assessed from current data."

Feature Comparison, how each agreement handles key trade policy topics

Feature	RCEP	AHKFTA	AANZFTA
Tariff Amendment	Consensus + formal procedure; unilateral importer notification	HS 2012 reference; product-specific exporter choice	Unilateral modification rights; selective HS concessions
RoO Governance	Committee (Annex 3A/3B); CTH rule	Sub-Committee; CC rule, stricter transformation	Certificate-based; exporter compliance burden
RVC Threshold	40%	40%	Not in main text, delegated to schedules
CTC Rule	CTH, 4-digit heading level. Easier to satisfy.	CC, 2-digit chapter level. More restrictive.	Not recovered in main text
Dispute Settlement	Own mechanism; independent of WTO DSU	Ch.13 exists but was under-extracted; extracted text references the WTO DSU	Own mechanism; adjudication + arbitration
Investment	Dedicated Ch.10; national treatment	No dedicated chapter; 4 provisions classified as investment-related (possibly misclassified, or references to the separate 2018 Investment Agreement, not yet processed)	Dedicated Ch.11; national treatment; ISDS
Customs	Direct consignment requirements	No legalisation / authentication required	Risk-based clearance for low-risk goods
Services	Mode 1 to 4; schedule of commitments	Ch.8 exists but was not recovered from the scanned source	Mode 1 to 4; schedule of commitments

The CTC Rule row is where the comparison gets practically interesting. RCEP and AHKFTA both use a 40% RVC threshold, which makes them look interchangeable at first glance. Look one row down and the transformation requirement diverges: chapter-level for AHKFTA, heading-level for RCEP. A product can therefore in principle satisfy RCEP's rule and still fail AHKFTA's.

Why the CC vs CTH Difference Matters

RCEP and AHKFTA both use a 40% RVC threshold. An exporter looking at just that number would conclude they meet the Rules of Origin requirements under both agreements. But the CTC rule, the other requirement, differs in a practically important way:

RCEP, CTH (Heading level, 4-digit)

The inputs and the finished product must fall under different 4-digit HS headings. Inputs from the same 2-digit chapter as the finished product are allowed, as long as they sit under a different heading. Easier to satisfy, more flexible for manufacturers.

AHKFTA, CC (Chapter level, 2-digit)

The inputs and the finished product must fall under different 2-digit HS chapters. This is a much stricter test: the manufacturing process must change the nature of the goods enough to move them into a different chapter.

What this implies

On the LLM-extracted text, a product that meets the 40% RVC threshold and clears CTH under RCEP would not automatically clear AHKFTA's stricter CC rule. That is a difference worth flagging for closer manual review rather than a verdict on any specific shipment. The dashboard’s role is to surface the gap, not to score it.
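
The gap can be made concrete with a simplified check on HS codes. This is a sketch under strong assumptions: it compares only the leading digits of the input and product codes and ignores product-specific rules, de minimis allowances, and the RVC test itself. The function names and example codes are illustrative.

```python
def chapter(hs_code):
    """First 2 digits of an HS code, e.g. '8703.23' -> '87'."""
    return hs_code.replace(".", "")[:2]


def heading(hs_code):
    """First 4 digits of an HS code, e.g. '8703.23' -> '8703'."""
    return hs_code.replace(".", "")[:4]


def satisfies_cth(input_codes, product_code):
    """Change in Tariff Heading: every input sits under a different heading."""
    return all(heading(code) != heading(product_code) for code in input_codes)


def satisfies_cc(input_codes, product_code):
    """Change in Chapter: every input sits under a different chapter."""
    return all(chapter(code) != chapter(product_code) for code in input_codes)


inputs, product = ["8708.99"], "8703.23"   # same chapter 87, different headings
print(satisfies_cth(inputs, product))      # True
print(satisfies_cc(inputs, product))       # False
```

Any set of inputs that passes the chapter-level test also passes the heading-level test, because headings are nested inside chapters. The reverse does not hold, which is the asymmetry described above.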

About these AI-generated comparisons

Each category below was analysed by retrieving the most relevant provisions from all three agreements and asking the model to write a structured comparison. The output is richer than a single-provision classification because the model reads primary text before responding. That said, these are AI-generated summaries and any specific legal conclusions should be checked against the original agreement text.

Limitations and Honest Caveats

Terms and Abbreviations

Triage-Grade Classification, One Compliance Gap, One Real Convergence

Best Run Reaches 48% Accuracy and 0.442 Macro-F1

Models Agree Most Strongly at Zero Shot

Each Agreement Carries a Distinct Topical Profile

Customs Procedures Converges, Disputes and IP Fragment

Same Threshold, Different Transformation Rule

Side by Side Across Eleven Policy Categories