Open Data Curation

A meaning-layer guide to Taiwan's open data × Twinkle Hub

Taiwan's government open data platform hosts nearly fifty thousand datasets. That number overwhelms anyone who actually wants to use them: you don't know which ones matter, how current each one is, or which two tables can be joined, let alone how any of it relates to the questions you care about.

Taiwan.md has written over nine hundred articles about Taiwan, and behind every one sits a judgment call: what data should verify this? This page lays that layer of judgment open: how we assess the data infrastructure, which datasets to combine (and how) when analyzing a question, and which stories on this island each data domain connects to.

about 50,000 government datasets
about 1.24 million full-text court judgments
320,000 national exam questions
226,825 nutrition analysis rows
135,000 procurement records

The web below is real: on the left, 20 data domains and five vertical corpora (crawled live); on the right, Taiwan.md articles already written. Every line is a curation judgment made on this page. Drag and hover to watch messy data connect to clear stories.

Legend: data domain · vertical corpus · Taiwan.md article · Taiwan.md meaning layer 🧬

The ecosystem map: three layers, each holding its own

For an AI (or a person) to genuinely answer questions about Taiwan, three layers have to work together: a home for the data, a path for queries, and a layer of meaning.

🏛️
The home of the data (SSOT)

data.gov.tw and agency systems

The government open data platform is each dataset's persistent identity: dataset ID, license terms, competent authority, raw downloads. Every citation should ultimately trace back here.

about 50,000 datasets
🔌
The path for queries (MCP gateway)

Twinkle Hub

Taiwan's first MCP hub, wrapping data scattered across hundreds of government portals into a single query endpoint: semantic search, structured row queries, and tools for five vertical domains. An AI gets the data in one call, skipping the manual slog across portals.

21 tools · 20 domains
🧬
The layer of meaning (curation)

Taiwan.md

Data doesn't speak for itself. Which dataset deserves pointing to, which claim it verifies, which stretch of history it connects to — that is curation work. Starting June 2026, our articles are getting 'Public data' sections one by one, stitching narrative to raw data.

900+ articles · 15 dataset pointers live

Three-dimension assessment: what our hands-on testing found

What follows is Taiwan.md's first-hand assessment as a user, run with our own verification tools (two rounds of testing, May and June 2026), laid out along three dimensions. This is not an ad; it's a checkup.

🗃️

Data completeness

The coverage is real, and it goes beyond mirroring
  • Covers roughly 96.6% of data.gov.tw in full (49,343 datasets as of our 2026-06-05 count), plus 135,000 government e-procurement records and Legislative Yuan data
  • Each of the 20 domain categories comes with 'typical questions' and anchor examples; every dataset is tagged with a quality tier (platinum to bronze), update frequency, format, and joinable keys
  • Self-curated datasets patch holes in the government portals: nationwide real-price registration (sales / pre-sale / rentals) connects straight to the Ministry of the Interior's Land Administration systems
  • Five vertical corpora go beyond simple mirroring: patent full texts, national exam question banks, court judgments, drug licenses, and food nutrition (scale detailed in the next section)
Honest gaps
  • Search ranking skews toward county-level slices: a query for 'birth rate' returns county datasets for Nantou, Taoyuan, and Kaohsiung, and a human has to pick out the national-scale one — exactly why curation exists
  • There is no way to query the dataset count per domain, so any inventory has to lean on officially claimed figures
  • Some older datasets remain in un-normalized ODS format and cannot be queried as structured rows
🫀

Stability

True to its alpha nature: it runs fast, and it changes fast
  • Measured query latency under 100ms on cache hits; every response carries a trace_id and cost fields — good transparency
  • Version numbers are embedded in tool descriptions (v1.11.2 aggregate queries, v1.18 judgments), so the iteration cadence stays visible
  • The judgment corpus explicitly labels its alpha scope (about 1.24 million records, 2024-05 through 2026-03) — marking boundaries clearly is more honest than pretending to be complete
Honest gaps
  • Two interface changes within two months: between 2026-05-11 and 2026-06-10, the connection moved to a session handshake, the tools were reorganized from 40 down to 21, and an entire set of deterministic tools was retired
  • Rate limiting (HTTP 429) has already appeared during alpha, but the limit window is unpublished
  • Our countermeasure: a thin wrapper layer isolates interface changes, and article citations are always written as static pointers, never depending on the API at runtime — the way any alpha service should be integrated
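One way to handle the unpublished rate limit inside such a wrapper is to retry with exponential backoff. The sketch below is a minimal illustration, not Twinkle Hub's client: `RateLimited` and `call_with_backoff` are hypothetical names, and the wrapper is assumed to raise `RateLimited` whenever the gateway answers HTTP 429.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical signal that the gateway answered HTTP 429."""


def call_with_backoff(call, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run call(); on RateLimited, wait with exponential backoff and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # 0.5s, 1s, 2s, ... plus jitter so parallel clients do not sync up
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


# Usage with a fake call that is throttled twice before succeeding.
attempts = {"n": 0}


def flaky_query():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return {"rows": []}


# The sleep function is replaced so the demo does not actually wait.
result = call_with_backoff(flaky_query, sleep=lambda seconds: None)
```

Because the limit window is unpublished, the delays here are guesses; keeping them in one function means they can be tuned in a single place when the window becomes known.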

Access simplification

This is its strongest side
  • One MCP endpoint replaces hundreds of government portals: a four-step flow of search, metadata fetch, row query, and aggregation, with a consistent field schema
  • Structured row queries support SQL conditions and aggregation; normalized datasets can be used directly as a database
  • A question spanning one address, one year, and one administrative district used to take 15 to 30 minutes of manual cross-checking across three to five portals; now it is a single call, under a second
  • One-click install packages plug Claude, Cursor, and ten-plus other AI clients straight in — the friction of 'making Taiwan's data readable to AI' has been cut by an order of magnitude
Honest gaps
  • Requires an API key (bearer token); free during alpha, with per-tool pricing planned — whether a free path will always exist is a question the open data ecosystem should keep asking
  • The service itself is closed-source: the data is open, the channel currently is not. Raw downloads from data.gov.tw remain the fallback path that bypasses any gateway
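The four-step flow (search, metadata fetch, row query, aggregation) can be sketched as follows. The tool names come from the catalog on this page; everything else is an assumption made for illustration: the `call_tool` transport, the argument names (`query`, `dataset_id`, `limit`, `group_by`), and the response shapes are hypothetical, and `fake_call_tool` is an in-memory stand-in rather than the real endpoint.

```python
from typing import Any, Callable

# Hypothetical transport: (tool_name, arguments) -> decoded result.
ToolCall = Callable[[str, dict], Any]


def four_step_flow(call_tool: ToolCall, question: str, group_by: str) -> dict:
    """Search, fetch metadata, read rows, then aggregate."""
    # 1. Search: find candidate datasets for a natural-language question.
    hits = call_tool("search_datasets", {"query": question})
    dataset_id = hits[0]["dataset_id"]  # a curator would choose, not take the top hit

    # 2. Metadata: check the fields before trusting the data.
    meta = call_tool("get_dataset", {"dataset_id": dataset_id})

    # 3. Row query: pull a small sample of actual rows.
    sample = call_tool("query_rows", {"dataset_id": dataset_id, "limit": 2})

    # 4. Aggregation: count rows per value of one field (assumed argument shape).
    counts = call_tool("query_rows", {"dataset_id": dataset_id, "group_by": group_by})
    return {"dataset_id": dataset_id, "fields": meta["fields"],
            "sample": sample, "counts": counts}


def fake_call_tool(name: str, args: dict) -> Any:
    """In-memory stand-in so the sketch runs without network access."""
    rows = [{"district": "A"}, {"district": "A"}, {"district": "B"}]
    if name == "search_datasets":
        return [{"dataset_id": "demo-1"}]
    if name == "get_dataset":
        return {"dataset_id": args["dataset_id"], "fields": ["district"]}
    if name == "query_rows" and "group_by" in args:
        counts: dict = {}
        for row in rows:
            key = row[args["group_by"]]
            counts[key] = counts.get(key, 0) + 1
        return counts
    if name == "query_rows":
        return rows[: args["limit"]]
    raise ValueError(f"unknown tool: {name}")


summary = four_step_flow(fake_call_tool, "crashes by district", "district")
```

Passing the transport in as a parameter is the same isolation idea as the thin wrapper layer: when the interface changes, only the function behind `call_tool` needs to change.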

Five vertical corpora: the part beyond mirroring

Wrapping datasets in a search interface is nothing special; these five vertical domains add semantic retrieval and structured extraction — the part of Twinkle Hub that goes beyond being a 'data.gov.tw mirror'.

Patents

TIPO published invention patents, full text
Full-text semantic search

Natural-language queries over the patent corpus, with full technical descriptions and claims retrievable. When writing about Taiwan's industries, 'does this company actually hold this technology' can be verified by semantic search for the first time.

National exams

64,815 exam papers · 320,000 questions (2012–2025)

Ministry of Examination papers across the years, searchable down to the question level. Taiwan's civil service exam culture — the public-sector job craze, the cram school streets — is a story no one has told with data yet.

Court judgments

about 1.24 million records (2024-05 to 2026-03, alpha)

Plain-language retrieval over the judgment corpus. For articles on the judiciary, labor disputes, and rental conflicts, 'how courts actually rule in practice' now has a verifiable entry point.

Drugs and health

71,836 drug licenses · 96,803 Chinese ICD-10 codes

Drug permits, structured package-insert fields, health food certifications, and preliminary interaction screening. The factual layer for articles on National Health Insurance and medicine.

Food nutrition

226,825 nutrition analysis rows

The Ministry of Health and Welfare's food nutrition composition database: twenty-plus nutrients per ingredient, rankable by nutrient, summable per meal. The numeric base for night market and food articles.

The magnitude bars use a log scale: the judgment corpus is 17 times the size of drug licenses, and a linear plot would squash the other four bars into invisibility.
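A small sketch of why the log scale matters, using only the counts quoted on this page. Two assumptions: the patent corpus is left out because no count is given for it here, and the exam bar is taken to be the question count (320,000) rather than the paper count.

```python
import math

# Corpus sizes as quoted on this page (patents omitted: no count given).
corpus_sizes = {
    "court judgments": 1_240_000,
    "national exam questions": 320_000,
    "food nutrition rows": 226_825,
    "drug licenses": 71_836,
}

largest = max(corpus_sizes.values())
bars = {}
for name, size in corpus_sizes.items():
    linear = size / largest                          # bar length on a linear axis
    logged = math.log10(size) / math.log10(largest)  # bar length on a log axis
    bars[name] = (linear, logged)
    print(f"{name:<24} linear {linear:6.1%}   log {logged:6.1%}")

# Ratio behind the "17 times" remark.
ratio = corpus_sizes["court judgments"] / corpus_sizes["drug licenses"]
```

On the linear axis the drug-license bar is under 6% of the judgment bar; on the log axis it is roughly 80%, which keeps every bar readable.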

Analysis recipes: which data to use, and how to combine it, to understand one thing

This is the heart of the page. Each card is a real analysis question: which datasets to use, what keys join them, what method to read them with, and which article has already turned the analysis into a story.

Housing justice: the cheap homes the government built — who got rich off them in the end?

How to combine: Align by administrative district and housing complex name: allocation records tell you what price the government sold at back then, real-price registration tells you what the same address is worth today, and the social housing statistics give the volumes after the shift from selling to renting.

How best to analyze: Build a time series for the same complex, then segment it by policy milestones: allocation sales in 1985, the resale wall coming down in 2002, rent-only-no-sale in 2016, Taoyuan's return to selling in 2026. The appreciation multiple divided by the years elapsed is the slope of the 'asset escalator'.
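The slope calculation can be sketched as below. The milestone years are the ones named above; the prices are hypothetical placeholders (NT$ million), not real transaction data, and `escalator_slope` is an illustrative name.

```python
def escalator_slope(price_then, price_now, year_then, year_now):
    """Appreciation multiple divided by the years elapsed."""
    if year_now <= year_then:
        raise ValueError("year_now must be later than year_then")
    return (price_now / price_then) / (year_now - year_then)


# Hypothetical prices (NT$ million) for one complex at each policy milestone.
observations = {1985: 1.0, 2002: 3.0, 2016: 9.0, 2026: 12.0}

# One slope per policy segment: 1985-2002, 2002-2016, 2016-2026.
years = sorted(observations)
segments = {
    (start, end): escalator_slope(observations[start], observations[end], start, end)
    for start, end in zip(years, years[1:])
}

# Slope over the whole period, from allocation price to today's price.
overall = escalator_slope(observations[1985], observations[2026], 1985, 2026)
```

Comparing the per-segment slopes against the overall one shows in which policy era the escalator ran fastest.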

Energy transition: nuclear power went to zero and then restarted — what happened in the numbers?

How to combine: Generation performance gives annual output and capacity factors; the unit table gives each reactor's decommissioning date. Align the two tables by year, then overlay the referendum and policy milestones.

How best to analyze: Draw an annual capacity-factor curve and mark the three referendums (2018, 2021, 2025): how the curve descends to zero, and whether a single number moves within a year of each vote — the time lag between 'political decisions' and 'physical reality' surfaces on its own.

NHI finances: who uses it, who pays in, and how many more years can the system hold?

How to combine: Enrollment counts by age bracket yield the structural ratio of those who pay to those who use; the minutes give the timeline of premium-rate decisions; the subsidy statistics show the implementation side of ability-to-pay.

How best to analyze: Turn the age-structure ratio into a quarterly series and overlay the premium-rate decisions: the structure keeps deteriorating while the rate stays put, so what fills the gap (budget injections, point values, copayments)? Every 'maintain without adjustment' in the minutes has a corresponding cost entry.

Democratic quality: how big is an election's electorate, and how intense is the enforcement?

How to combine: Voter counts give each election's electorate; the conviction statistics give the yearly conviction volumes for vote-buying and election interference. Both slice by county and align with election results.

How best to analyze: Build an 'electorate × conviction rate' pairing for each election cycle and compare enforcement intensity across cycles. The 2026 round adds AI disinformation as a new enforcement focus — establish the baseline for the two traditional categories (vote-buying, interference) first, so the scale of the new threat has a frame of reference.

Street economy: how was the output of 230,000 vendor stalls actually calculated?

How to combine: The five-year census by DGBAS (Directorate-General of Budget, Accounting and Statistics) gives the national vendor population (stall counts, workers, revenue); the county registries give the roster of night markets officially recognized by government.

How best to analyze: Cross-census comparison is the key: five-year changes in stall counts and revenue, set against mobile-payment penetration and tourist arrivals. When the next census lands, every article citing the 233,000-stall figure should come back and reconcile.

Road safety: what links 14 million scooters to three thousand deaths a year?

How to combine: Vehicle registrations give the denominator (exposure); crash casualties give the numerator. Use the long county-level series (24 years for Taoyuan) as the methodological template first, then extend to other counties.

How best to analyze: Don't stop at absolute death counts — normalize to casualties per 100,000 vehicles, computed separately for scooters and cars; the 'pedestrian hell' debate needs exactly this denominator.
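A minimal sketch of the normalization. Only the 14 million scooter figure comes from this page; the car fleet size and both death counts are placeholders chosen for illustration and should be replaced with values from the registration and crash datasets.

```python
def rate_per_100k(casualties: int, registered_vehicles: int) -> float:
    """Casualties per 100,000 registered vehicles."""
    if registered_vehicles <= 0:
        raise ValueError("registered_vehicles must be positive")
    return casualties / registered_vehicles * 100_000


# Denominator (exposure): registered vehicles. The car figure is a placeholder.
fleet = {"scooter": 14_000_000, "car": 8_000_000}
# Numerator: deaths by vehicle type. Both figures are placeholders.
deaths = {"scooter": 1_800, "car": 600}

rates = {kind: rate_per_100k(deaths[kind], fleet[kind]) for kind in fleet}
```

With these placeholder inputs the scooter rate comes out higher than the car rate even after accounting for the larger scooter fleet, which is the kind of comparison absolute counts cannot support.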

Twenty data domains × Taiwan.md's story map

On the left, Twinkle Hub's domain categories (crawled live); on the right, our curation mapping: which articles on this island each domain's data connects to. For domains marked 'Story not yet written', the flagship datasets and analysis paths are fully curated but the article hasn't been written: that is our development map, and an open invitation to anyone who wants to write it.

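Real estate and land administration

realestate_land

Land, buildings, housing, urban planning, land prices, building and occupancy permits, real estate transactions, rents

Typical questions: median real-price registration value for a given lot section over the past year; number of occupancy permits recently issued within a given school district; list of urban renewal cases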

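Economy, industry, companies and commerce

economy_business

Business, company, and factory registrations; industry statistics; import and export trade; business-cycle and price indices; financial markets; listed and OTC companies; fair trade

Typical questions: registration change history for a company with a given unified business number; this month's business-cycle signal for a given industry; revenue of listed companies in a given industry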

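Government procurement and subsidies

procurement_subsidy

Tender and award notices, subsidy cases, grants, government payments to individuals

Typical questions: a given supplier's award amounts over the past five years; a given agency's subsidy list for this month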

Story not yet written · Who won Taiwan: the government outsourcing map inside 135,000 contract award records

How to analyze it: Join award records to business registrations via the unified business number: which agencies a supplier has won how much from over the years. Plot a heatmap on three axes — amount, agency, year — and the geography and networks of public spending surface on their own.

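Government budgets, final accounts, and accounting

public_finance

Central and local general budgets, monthly accounting reports, subordinate-unit budgets, debt, the national treasury, budget and accounting statistics

Typical questions: a given agency's budget trend over the years; outstanding central government public debt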

Story not yet written · The nation's debt sheet: how much the central government owes, and how fast it pays it back

How to analyze it: Build a monthly series of the debt balance, set against GDP and the borrowing ceiling in the Public Debt Act; then layer on each year's special budgets (pandemic relief, Forward-Looking, resilience) item by item, and watch how 'exceptional spending' becomes the norm.

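Taxation and tax revenue

tax_revenue

Individual income tax, business tax, land value / house / vehicle license taxes, tax collection, filing and assessment statistics

Typical questions: a given county's tax revenue structure this month; net actual collections for a given tax category over the years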

Story not yet written · What taxes keep your county alive: a fitness check on local public finance

How to analyze it: Break each county's net collections down by tax category: who lives on land value and house taxes (metro areas), and who lives on centrally allotted funds (everywhere else). Align with population and housing-price data via administrative district codes, and the fiscal autonomy ranking computes itself.

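Transportation, roads, and parking

transport

Traffic crashes; buses, intercity coaches, metro, rail, and flights; parking lots; real-time road conditions; fuel prices; vehicle registrations; road facilities

Typical questions: crash count at a given intersection over the past year; real-time bus arrivals; available spaces in the city's public parking lots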

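Public order, police and fire services, and disaster prevention

public_safety

Criminal cases, policing, fire and ambulance services, disaster alerts, earthquakes / typhoons / flooding, coast guard, 110 / 119 emergency lines

Typical questions: the city's fraud-method statistics this month; real-time disaster alerts; fire and ambulance cases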

Connected articles · Typhoons

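Judiciary, legal affairs, corrections, and penalties

judicial_legal

Court judgments, prosecutorial investigations and indictments, corrections / prisons / inmates, administrative appeals, agency penalty lists

Typical questions: a given company's history of penalties from the Financial Supervisory Commission; summary of concluded investigations at a given district prosecutors office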

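Legislative Yuan / parliament

legislature

Legislative Yuan bills, legislative proposals, votes, gazettes, interpellations, floor speeches, the IVOD video index, legislator profiles, constituencies, meeting records.

Typical questions: which bills a given legislator proposed in the Nth term; a given caucus's voting tendency on bill X; every speech on a given issue recorded in the gazette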

Connected articles · The Sunflower Movement

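Medical care, public health, food, and drugs

health_food

Medical institutions, NHI-contracted providers, pharmacies, drug and food permits, epidemics, long-term care, mother-and-baby-friendly facilities, food safety

Typical questions: NHI-contracted pharmacies near home; permit information for a given drug or medical device; recent communicable disease notifications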

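Environment, weather, ecology, and hydrology

environment

Air quality (AQI), river water quality, rainfall, reservoirs, waste recycling, forest compartments, ecological conservation, noise, carbon emissions

Typical questions: today's AQI for this district; water quality history for a given river; the city's recycling results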

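Education and research

education_research

Schools at all levels, teacher and student statistics, cram schools, libraries, research projects, patents, theses and dissertations

Typical questions: list of schools in a given school district; a given school's enrollment over the years; research patents held by a given institution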

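Agriculture, forestry, fisheries, and livestock

agriculture_fisheries

Agricultural produce trading, livestock farms, fishing ports and vessels, pesticides and fertilizers, farmers' associations, aquaculture, livestock statistics

Typical questions: today's trading prices at a given fruit and vegetable market; distribution of livestock farms in a given county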

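Labor and employment

labor_employment

Labor law violations, wages, job openings, vocational training, labor pension and labor insurance, occupational accidents

Typical questions: a given employer's record of labor law violations; median wage in a given industry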

Story not yet written · The insured-salary ceiling: three Ministry of Labor tables that reveal what Taiwanese actually earn

How to analyze it: Cross-tabulate the three insured-salary series — labor, employment, and occupational accident insurance — by industry and unit size. Mind the right-censoring caused by the NT$45,800 insured-salary cap: leave it unhandled and the averages for high-paying industries get systematically understated. Half the 'average salary' controversy comes from here.
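A minimal sketch of the censoring effect. The NT$45,800 cap is the figure cited above; the six salaries are hypothetical and chosen only to make the understatement visible.

```python
CAP = 45_800  # insured-salary cap cited above (NT$)

# Hypothetical true monthly salaries in a high-paying industry.
true_salaries = [30_000, 42_000, 45_800, 60_000, 90_000, 150_000]

# What the insurance tables record: anything above the cap is stored as the cap.
insured_salaries = [min(salary, CAP) for salary in true_salaries]

true_mean = sum(true_salaries) / len(true_salaries)
insured_mean = sum(insured_salaries) / len(insured_salaries)

# Rows sitting exactly at the cap are the censored ones: the recorded value
# is only a lower bound on the true salary.
censored = sum(1 for salary in insured_salaries if salary == CAP)
censored_share = censored / len(insured_salaries)

print(f"true mean    {true_mean:,.0f}")
print(f"insured mean {insured_mean:,.0f}")
print(f"censored     {censored_share:.0%} of rows")
```

Reporting the censored share next to any average is a cheap safeguard: when a large share of an industry's rows sit at the cap, a median or an explicitly censored model is a safer summary than the mean.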

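Social welfare, household registration, population, elections, and civil service personnel

social_population

Population, household registration, births, deaths, marriages, and divorces; low-income households; persons with disabilities; Indigenous peoples and new immigrants; election voting; civil service personnel

Typical questions: vote-share structure across past elections in a given constituency; population with disabilities in a given county; the city's population change this month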

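Culture, tourism, and sports

culture_tourism_sport

Attractions, museums, historic sites, temples, event calendars, sports venues, sporting events

Typical questions: events in a given county or city this week; a given museum's collection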

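Foreign affairs, consular services, and cross-strait relations

foreign_affairs

Ministry of Foreign Affairs announcements; consular services, visas, and passports; overseas missions; cross-strait trade, policy, and cases; overseas community affairs; international cooperation; the New Southbound Policy; diplomatic allies

Typical questions: Taiwan's import and export value with a given country in recent years; recent Ministry of Foreign Affairs statements and cross-strait policy remarks; visa and passport application rules; list of overseas missions with contact information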

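Government announcements and archives

gov_publication

Agency press releases, gazettes, latest news, electronic bulletin boards, official document templates, archive catalogs, policy guidelines, information disclosure requests, public policy participation

Typical questions: a given agency's press releases this week; full-text search of the Executive Yuan Gazette; templates for a given type of official document or form; statistics on government information disclosure requests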

Story not yet written · Taiwan in the gazettes: what the government announces about itself each month

How to analyze it: Build keyword time series of regulatory changes from gazette full texts, then set them against the Legislative Yuan records in the legislature domain: the time gap between administrative announcements and the legislative trail is the real speed at which a policy takes effect.

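Geographic basemap (cross-cutting layer)

geo_basemap

Administrative boundaries, village boundaries, street addresses, coordinates, road networks, river systems, land use

Typical questions: serving as the join source for other datasets; spatial queries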

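Energy, water / electricity / gas, and telecom (cross-cutting layer)

utilities_telecom

Electricity supply and demand, gas stations, tap water, gas, renewable energy, telecom and broadband, wireless networks

Typical questions: real-time electricity load; tap water quality in a given administrative district; list of gas stations in a given area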

Joining methodology: the keys that line two tables up

A single dataset is a dot; only combination makes a web. These are the joining clues our testing found most useful.

Update frequency = analysis resolution: a spectrum of the fifteen showcase datasets

Each dot is a dataset cited on this page; hover to see its name. Before designing an analysis, check which end of the spectrum your data sits on.

Monthly ×4
Quarterly ×2
Yearly ×3
Every 4 years ×1
Every 5 years ×1
Irregular ×4
Spectrum axis: updated monthly (event studies possible) at one end; once every five years (cross-period comparison only) at the other.

Administrative district codes

The most universal join key. Standard codes for counties, cities, townships, and districts let population, housing prices, crashes, and tax revenue align onto one map; identically named districts (two Xinyi Districts) are disambiguated by code.
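A sketch of why the code, not the name, should be the join key. Taipei and Keelung each have a Xinyi District; the codes and all figures below are hypothetical placeholders, not official district codes or real statistics.

```python
# Two tables keyed by a district code (placeholder codes, placeholder figures).
population = [
    {"code": "TPE-XINYI", "district": "Xinyi District", "population": 200_000},
    {"code": "KEL-XINYI", "district": "Xinyi District", "population": 50_000},
]
crashes = [
    {"code": "TPE-XINYI", "crashes": 900},
    {"code": "KEL-XINYI", "crashes": 150},
]

# Joining on the code keeps the two districts apart.
crashes_by_code = {row["code"]: row["crashes"] for row in crashes}
joined = [
    {**row, "crashes": crashes_by_code[row["code"]]}
    for row in population
    if row["code"] in crashes_by_code
]

# Joining on the name silently collapses them: the second row overwrites the first.
by_name = {row["district"]: row for row in population}

print(len(joined), "rows when joined by code")
print(len(by_name), "row when keyed by name")
```

The name-keyed dictionary raises no error while losing a district, which is why this kind of mistake tends to surface only when totals fail to reconcile.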

Unified business numbers

A company's ID card. Business registrations, procurement awards, patent filings, and penalty lists all carry the unified number — it is how you trace a company's complete footprint.
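Tracing a footprint through the unified number can be sketched as a join followed by a grouped sum. The business numbers, company names, agencies, and amounts below are all invented for illustration; the field names are assumptions, not the schema of any real dataset.

```python
from collections import defaultdict

# Business registrations keyed by unified business number (invented values).
registrations = {
    "00000001": "Example Construction Co.",
    "00000002": "Example Software Co.",
}

# Procurement award records carrying the same key (invented values).
awards = [
    {"ubn": "00000001", "agency": "Agency A", "year": 2024, "amount": 1_000_000},
    {"ubn": "00000001", "agency": "Agency A", "year": 2024, "amount": 500_000},
    {"ubn": "00000001", "agency": "Agency B", "year": 2025, "amount": 2_000_000},
    {"ubn": "00000002", "agency": "Agency A", "year": 2025, "amount": 300_000},
]

# Total awarded per (supplier, agency, year): the cells of a heatmap.
totals = defaultdict(int)
for award in awards:
    supplier = registrations.get(award["ubn"], "unregistered")
    totals[(supplier, award["agency"], award["year"])] += award["amount"]

for key, amount in sorted(totals.items()):
    print(key, amount)
```

Using `registrations.get` with a fallback keeps award rows whose number has no registration match visible instead of dropping them, which is often where the interesting cases are.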

Coordinates and station codes

Environmental data (air quality, water quality, weather) hangs off monitoring stations; geographic data hangs off coordinates. Converting between these and administrative district codes is step one of any spatial analysis.

Quality tiers

The platinum-to-bronze tiers are a quick screen for 'can this dataset be used as-is': platinum ones are mostly normalized and queryable as structured rows; untested ones (like the self-curated real-price registration) you verify yourself.

Update frequency is analysis resolution

Monthly data supports event studies, yearly data only trends, five-yearly data (the vendor census) only cross-period comparison. Check the frequency before designing the analysis, not the other way around.

The dual-pointer principle

When Taiwan.md articles cite a dataset, the link always points to data.gov.tw or the competent authority's persistent page, with the query layer (Twinkle Hub) as a parallel value-added path. Keep the data's home and the query path separate, and a change in either layer never breaks the chain.
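As a data structure, the principle might look like the sketch below. `DatasetPointer` and `render` are illustrative names, not Taiwan.md's actual code, and the `https://data.gov.tw/dataset/<id>` URL pattern is an assumption; dataset 8543 and the `query_food_nutrition` tool are taken from the tool catalog on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetPointer:
    title: str
    home_url: str                     # persistent page: data.gov.tw or the agency
    why: str                          # one line on why the pointer is there
    query_tool: Optional[str] = None  # optional value-added path via the gateway


def render(pointer: DatasetPointer) -> str:
    """Render a static citation; the home link never depends on the gateway."""
    line = f"{pointer.title} ({pointer.home_url}): {pointer.why}"
    if pointer.query_tool:
        line += f" [also queryable via {pointer.query_tool}]"
    return line


nutrition = DatasetPointer(
    title="Food nutrition composition database",
    home_url="https://data.gov.tw/dataset/8543",  # assumed URL pattern
    why="numeric base for night market and food articles",
    query_tool="query_food_nutrition",
)

# If the gateway tool is renamed or retired, the citation degrades gracefully.
static_only = DatasetPointer(nutrition.title, nutrition.home_url, nutrition.why)

print(render(nutrition))
print(render(static_only))
```

Making the query path optional is the point of the design: a pointer with only a home URL is still a complete citation.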

Tool catalog: the complete list of 21 tools

Crawled directly from the MCP endpoint (refreshed on every page rebuild). The grouping is ours.

Core dataset four-piece kit + domain index ×5

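  • search_datasets: Search Taiwan government open datasets (Taiwan government open data / data.gov.tw /
  • get_dataset: Fetch a dataset's complete metadata and sample rows.
  • query_rows: Read the actual rows of a normalized dataset; supports aggregate queries (v1.11.2+).
  • materialize_dataset: Force download and conversion of the specified dataset (a no-op if already cached).
  • list_domains: List the definitions of all 19 domain tags (key, Chinese name, scope, typical questions, anchor examples).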

Patents ×2

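  • search_patents: Natural-language search over the TIPO published invention patent application corpus (data.gov.tw dataset 15992,
  • get_patent_body: Fetch the complete description body of a single patent (technical field / prior art / embodiments)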

National exams ×3

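  • search_exam: Natural-language search over Taiwan's national exam papers (dataset 170565, Ministry of Examination, OGDL).
  • search_exam_questions: Question-level search over national exams, using natural language plus an optional keyword filter.
  • get_exam_paper: Fetch all questions of a single national exam paper plus the standard answers (multiple-choice questions).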

Court judgments ×2

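  • search_judicial: Search the Taiwan court judgment corpus using plain language plus optional keyword / structured filters.
  • get_judicial_full: Fetch a single judgment's complete metadata + JFULL + T3 extracted fields (if already processed).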

Drugs and medical codes ×6

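  • lookup_icd10: Lookup in the Chinese edition of ICD-10-CM (translated by the National Health Insurance Administration, MOHW, v2023; data.gov.tw 177507).
  • search_drug: Search across all drug licenses from the Taiwan FDA, MOHW (data.gov.tw 9122; 71,836 licenses).
  • get_drug_details: Fetch all 28 detail fields of a single drug license (data.gov.tw 9122).
  • search_health_supplements: TFDA health food permits (data.gov.tw 6951; 562 permits).
  • search_drug_label: Search the structured fields of twinkle-ai/tw-drug-labels-vision (CC-BY-4.0, ~72k drug package inserts).
  • check_drug_interaction: Preliminary screening for interactions among multiple drugs: naive substring scan over each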

Food nutrition ×3

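  • query_food_nutrition: MOHW Taiwan food nutrition composition data (data.gov.tw 8543; 226,825 rows).
  • search_foods_by_nutrient: Rank foods by a single nutrient (per 100g), from MOHW food nutrition composition dataset 8543.
  • analyze_meal_nutrition: Given one meal (food name → grams), compute total nutrition.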
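The arithmetic behind a per-meal total is simple enough to sketch. This illustrates the idea only and is not the tool's implementation: the per-100g values are hypothetical placeholders rather than figures from dataset 8543, and `analyze_meal` is an illustrative name.

```python
# Hypothetical per-100g values; real ones would come from the nutrition dataset.
PER_100G = {
    "rice": {"kcal": 180.0, "protein_g": 3.0},
    "egg": {"kcal": 140.0, "protein_g": 12.0},
}


def analyze_meal(meal: dict) -> dict:
    """Sum nutrients for a meal given as food name -> grams."""
    totals: dict = {}
    for food, grams in meal.items():
        for nutrient, per_100g in PER_100G[food].items():
            # Scale the per-100g value to the portion actually eaten.
            totals[nutrient] = totals.get(nutrient, 0.0) + per_100g * grams / 100
    return totals


meal_totals = analyze_meal({"rice": 200, "egg": 50})
print(meal_totals)
```

A food name missing from the table raises `KeyError` here; in practice the name-matching step (which row in the database a dish corresponds to) is the hard part, and the summation is the easy part.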

Articles already wired to the data layer

Starting June 2026, we are adding a 'Public data' section at the end of articles: each lists the datasets that could verify (or overturn) the article's claims, with one line on why each pointer is there. The first batch of six:

Twinkle Hub is in alpha, and this page's assessment will be updated as it evolves. Taiwan.md currently has no commercial relationship of any kind with Twinkle Hub; this page is a heavy user's first-hand checkup — and an invitation: only when the data layer and the meaning layer work together will Taiwan be understood in full.

Tool and domain lists on this page crawled live at 2026-06-10 · hub.twinkleai.tw · data.gov.tw 🧬