Using GitHub Innovation Graph Data to Uncover the Digital Complexity of Nations: A Step-by-Step Guide

Introduction

For years, economists have measured national economic complexity by analyzing physical exports, patents, and research publications. But these metrics miss a massive and growing part of the global economy: software. Code doesn't cross borders via customs—it moves through git pushes, cloud services, and package managers. This invisible productive knowledge, often called the "digital dark matter" of the economy, is now trackable thanks to the GitHub Innovation Graph. A recent study published in Research Policy by Sándor Juhász, Johannes Wachs, Jermain Kaminski, and César A. Hidalgo used this data to measure the digital complexity of nations. Their findings show that software production complexity predicts GDP growth, inequality, and emissions in ways traditional data cannot. This guide will walk you through how to replicate their approach—step by step—so you can explore the digital complexity of any nation using open data.

Source: github.blog

What You Need

Step-by-Step Guide

Step 1: Obtain the GitHub Innovation Graph Data

Head to the GitHub Innovation Graph website and download the latest quarterly dataset (the researchers used the Q4 2025 release, but you can pick any quarter). The key table is "developers_by_country_language", which gives, for each economy and programming language, the number of active developers pushing code in that language. Developers are assigned to economies based on IP addresses. Save the table as a CSV or JSON file.

Step 2: Clean and Prepare the Data

Open the dataset in your analysis tool. You'll see columns such as country_code, language, and developer_count. Remove entries with missing country codes, and drop languages that are too rare to measure reliably (e.g., languages with fewer than 100 developers globally). Then normalize the developer counts by the total number of developers in each country, so that large countries do not dominate simply because of their size: for each country, calculate the share of its developers using each language.
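The cleaning steps above can be sketched in pandas. This is a minimal illustration, not the researchers' code: the function name prepare_shares, the min_global_devs parameter, and the toy rows are invented here, and the column names simply follow the ones mentioned in this step (the file you download may label them differently). Note that a developer who pushes in several languages is counted once per language, so the "share" is a share of developer-language counts.

```python
import pandas as pd

def prepare_shares(df: pd.DataFrame, min_global_devs: int = 100) -> pd.DataFrame:
    """Clean the table and add each language's share of a country's developers."""
    # Drop rows with a missing country code or language.
    df = df.dropna(subset=["country_code", "language"]).copy()
    # Keep only languages with at least `min_global_devs` developers worldwide.
    global_totals = df.groupby("language")["developer_count"].transform("sum")
    df = df[global_totals >= min_global_devs].copy()
    # Share of a country's remaining developer counts that falls on each language.
    country_totals = df.groupby("country_code")["developer_count"].transform("sum")
    df["share"] = df["developer_count"] / country_totals
    return df

# Toy rows, invented for illustration only.
raw = pd.DataFrame({
    "country_code": ["AA", "AA", "BB", "BB", None, "BB"],
    "language": ["Python", "Rust", "Python", "Rust", "Python", "Obscure"],
    "developer_count": [300, 100, 100, 100, 10, 5],
})
shares = prepare_shares(raw)
print(shares)
```

The rare-language filter runs before the shares are computed, so each country's shares sum to 1 over the languages that remain.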

Step 3: Apply the Economic Complexity Index (ECI) Methodology

The ECI was originally designed to measure the complexity of a country's export basket. Here, we apply it to programming languages. The logic: a country is more digitally complex if it specializes in many different languages (high diversity) and those languages are specialties of few other countries (low ubiquity). Follow these substeps:

  1. Create a binary matrix: Set cell (c, l) to 1 if country c has a revealed comparative advantage (RCA) greater than 1 in language l, and to 0 otherwise. RCA is (the share of country c's developers using language l) divided by (the global share of developers using language l).
  2. Compute diversity and ubiquity: Diversity is the number of languages in which a country has RCA > 1. Ubiquity is the number of countries that have RCA > 1 in a language.
  3. Iterate the ECI algorithm: In the standard method, calculate the average ubiquity of the languages in a country's basket, then the average diversity of the countries specializing in those languages, and repeat until convergence. Use Python's ecomplexity package or implement it manually.
  4. Standardize: Rescale the resulting ECI values to have mean 0 and standard deviation 1.
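If you implement the substeps manually, a NumPy sketch might look like the following. One substitution to be aware of: instead of looping through the averaging iteration, this sketch takes the eigenvector associated with the second-largest eigenvalue of the country-to-country matrix, which is the usual closed-form counterpart of that iteration in the economic complexity literature. The function names and the toy matrices are invented for illustration, and this is not necessarily how the paper's authors computed their scores.

```python
import numpy as np

def rca_matrix(counts: np.ndarray) -> np.ndarray:
    """RCA[c, l] = (country c's share in language l) / (global share of language l)."""
    country_share = counts / counts.sum(axis=1, keepdims=True)
    global_share = counts.sum(axis=0) / counts.sum()
    return country_share / global_share

def eci_from_binary(M: np.ndarray) -> np.ndarray:
    """Standardized ECI from a binary countries x languages matrix (1 where RCA > 1)."""
    diversity = M.sum(axis=1)   # languages with RCA > 1, per country
    ubiquity = M.sum(axis=0)    # countries with RCA > 1, per language
    if (diversity == 0).any() or (ubiquity == 0).any():
        raise ValueError("drop countries/languages without any RCA > 1 entry first")
    # M_tilde[c, c'] = sum_l M[c, l] * M[c', l] / (diversity[c] * ubiquity[l])
    M_tilde = (M / diversity[:, None]) @ (M / ubiquity[None, :]).T
    eigvals, eigvecs = np.linalg.eig(M_tilde)
    second = np.argsort(-eigvals.real)[1]   # index of the second-largest eigenvalue
    eci = eigvecs[:, second].real
    eci = (eci - eci.mean()) / eci.std()    # mean 0, standard deviation 1
    # The eigenvector's sign is arbitrary; orient it so ECI rises with diversity.
    if np.corrcoef(eci, diversity)[0, 1] < 0:
        eci = -eci
    return eci

# Toy developer counts (invented): rows are countries, columns are languages.
counts = np.array([[30.0, 10.0], [10.0, 30.0]])
rca = rca_matrix(counts)   # [[1.5, 0.5], [0.5, 1.5]]

# Toy binary matrix (invented), as would come from (rca_matrix(...) > 1).
M = np.array([[1, 1, 1, 1],
              [1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
eci = eci_from_binary(M)
print(eci)
```

On real data, build M with (rca_matrix(counts) > 1).astype(float), where counts is the country-by-language table pivoted from the cleaned dataset.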

Step 4: Analyze the Software ECI Scores

You now have a software ECI score for each country. Sort the list. Which countries rank highest? The researchers found that nations with high software complexity (like the US, Sweden, and Singapore) are not necessarily the ones with the largest developer populations, but those with diverse, specialized language usage. Create a bar chart or map to visualize the distribution. Compare the scores with traditional economic complexity indices to see where they diverge.

Step 5: Correlate with Macroeconomic Indicators

Download GDP per capita, the Gini coefficient (inequality), and CO2 emissions per capita from reliable sources (e.g., the World Bank's World Development Indicators). Align the year of the software data with the economic data, preferably with a one-year lag. Run correlation tests (Pearson, Spearman) between software ECI and each indicator, and draw scatter plots with trend lines. The researchers found that software ECI predicts GDP and emissions even after controlling for traditional complexity measures. Try adding controls for population, education, and internet penetration.
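The alignment and correlation step can be sketched as follows. The assumptions here are mine: the lag is read as "software ECI in year t against indicators in year t + 1", the column names (country_code, year, software_eci, gdp_per_capita, gini) are illustrative, and all numbers are invented. Spearman is computed as the Pearson correlation of the ranks, which avoids extra dependencies.

```python
import pandas as pd

def lagged_correlations(eci: pd.DataFrame, indicators: pd.DataFrame,
                        lag_years: int = 1) -> pd.DataFrame:
    """Correlate software ECI in year t with each indicator in year t + lag_years."""
    # Shift the ECI year forward so the merge pairs year t with year t + lag.
    shifted = eci.assign(year=eci["year"] + lag_years)
    merged = shifted.merge(indicators, on=["country_code", "year"], how="inner")
    rows = []
    for col in indicators.columns.drop(["country_code", "year"]):
        pair = merged[["software_eci", col]].dropna()
        rows.append({
            "indicator": col,
            "n": len(pair),
            "pearson": pair["software_eci"].corr(pair[col]),
            # Spearman is the Pearson correlation of the ranks.
            "spearman": pair["software_eci"].rank().corr(pair[col].rank()),
        })
    return pd.DataFrame(rows).set_index("indicator")

# Toy values, invented for illustration only.
eci_df = pd.DataFrame({
    "country_code": ["AA", "BB", "CC", "DD"],
    "year": [2022] * 4,
    "software_eci": [-1.0, 0.0, 0.5, 1.5],
})
indicators_df = pd.DataFrame({
    "country_code": ["AA", "BB", "CC", "DD"],
    "year": [2023] * 4,
    "gdp_per_capita": [1000.0, 2000.0, 3000.0, 4000.0],
    "gini": [40.0, 35.0, 30.0, 25.0],
})
result = lagged_correlations(eci_df, indicators_df)
print(result)
```

With only a handful of countries these coefficients mean little; on the real data, check the n column to see how many countries survive the merge for each indicator.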

Step 6: Validate and Interpret Findings

To check robustness, perform out-of-sample tests: predict future GDP growth using current software ECI, and compare the results with predictions from traditional ECI. Check whether software ECI adds explanatory power (e.g., using nested regression models). Also consider limitations: IP-based location can misattribute developers (for example, those using VPNs or working from offices abroad). Interpret cautiously, because correlation does not imply causation. The paper suggests that software complexity captures a distinct dimension of productive knowledge, especially in countries transitioning to digital economies.
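A nested-regression check can be sketched with plain NumPy: fit one model with traditional ECI only, a second with software ECI added, and compare the fit. Everything below is an illustrative assumption, including the variable names and the synthetic data (generated so that software ECI has a real effect). This comparison is in-sample; an out-of-sample version would fit on earlier years and score on later ones.

```python
import numpy as np

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return 1.0 - residuals.var() / y.var()

# Synthetic data, invented for illustration: 200 hypothetical countries.
rng = np.random.default_rng(0)
n = 200
trade_eci = rng.normal(size=n)
# Software ECI is correlated with trade ECI but carries its own signal.
software_eci = 0.6 * trade_eci + 0.8 * rng.normal(size=n)
gdp_growth = 0.5 * trade_eci + 0.3 * software_eci + 0.5 * rng.normal(size=n)

# Restricted model: traditional ECI only. Full model: both indices.
r2_restricted = r_squared(trade_eci.reshape(-1, 1), gdp_growth)
r2_full = r_squared(np.column_stack([trade_eci, software_eci]), gdp_growth)
print(f"R^2 traditional only: {r2_restricted:.3f}")
print(f"R^2 with software ECI: {r2_full:.3f}")
```

In-sample R^2 never decreases when a regressor is added, so the size of the gain matters more than its sign; an F-test or a holdout comparison gives a fairer reading of whether software ECI adds explanatory power.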

Tips

By following these steps, you can replicate a cutting-edge economic analysis using open-source development data. The digital complexity of nations is no longer invisible. Start exploring today and contribute to a new understanding of how software shapes economies.
