ReadMetric

April 2026

ReadMetric is a concluded research project that tested whether the quality of a repository's README predicts how popular the repository becomes. We followed the standard KDD (Knowledge Discovery in Databases) process to keep the result scientifically sound.

Across a variety of statistical models, we found no tangible correlation between the two factors. We conclude that README quality does not meaningfully affect project success; other factors likely matter far more.

Scraping

We collected the data ourselves using a combination of GitHub's REST API and GraphQL API. GHTorrent's databases are too outdated for this project, and no single official source remains, so we had no choice but to fetch the repositories directly. GraphQL helped us stay within rate limits because a single query can return more data per request.
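As a rough illustration of how one GraphQL request can carry what would otherwise take several REST calls, the sketch below builds the request body for one page of repository search results. It is not the project's actual scraper. The helper name `build_payload`, the selected fields, and the assumption that the README lives at `README.md` on the default branch are all choices made for this example.

```python
import json

# GitHub's GraphQL endpoint; the payload below would be POSTed here.
GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# One query returns stars, forks, and README text for up to 100 repos.
# Assumption: the README is named README.md; other names are not covered.
SEARCH_QUERY = """
query ($q: String!, $cursor: String) {
  search(query: $q, type: REPOSITORY, first: 100, after: $cursor) {
    pageInfo { hasNextPage endCursor }
    nodes {
      ... on Repository {
        nameWithOwner
        stargazerCount
        forkCount
        readme: object(expression: "HEAD:README.md") {
          ... on Blob { text }
        }
      }
    }
  }
}
"""


def build_payload(search_string, cursor=None):
    """Return the JSON request body for one page of search results.

    `cursor` is the `endCursor` of the previous page, or None for the first page.
    """
    return json.dumps(
        {"query": SEARCH_QUERY, "variables": {"q": search_string, "cursor": cursor}}
    )
```

The body would be sent with an `Authorization: Bearer <token>` header, and paging continues while `pageInfo.hasNextPage` is true.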

Roadblocks

Getting a valid list of repositories required many compromises in how we used the data. The goal was 5,000 clean repositories, but several obstacles stood in the way:

API ordering - GitHub's API automatically orders repos according to some unknown "relevancy" metric. Had we simply scraped the first 5,000 repos, the dataset would have been biased by default. The solution was to partition queries by month and use a low-star-count group as a control group.
Influencer Repos - Repos from popular individuals and companies would skew the dataset greatly, since their popularity comes largely from their owners' existing audience. We removed these repos by hand.
Quantity - GitHub's API does not make every repo readily available, hence the value of the now-defunct GHTorrent. Luckily, 5,000 repositories was an acceptable sample size.
Absurd READMEs - Some READMEs had odd characteristics, like 200 images of graph renders. It's hard to quantify the quality of a README like that.
Absurd fork counts - Forks should count as a success metric, but some projects require users to fork them in order to use them, which makes fork counts a flimsy measure of success unless each repo is manually reviewed. Educational repos could have hundreds of thousands of forks (and some did).
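The month-by-month partitioning described under "API ordering" could look something like the sketch below, which generates one search string per month and per star band. Treat it as an assumption-laden example rather than the project's real query builder: partitioning on the `created:` date, the star bands themselves, and the function names are all hypothetical. The `created:` and `stars:` range qualifiers are standard GitHub search syntax.

```python
from datetime import date, timedelta


def month_windows(start_year, start_month, end_year, end_month):
    """Yield (first_day, last_day) for each month in the inclusive range."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        first = date(year, month, 1)
        # The day before the first of the next month is this month's last day.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        last = date(next_year, next_month, 1) - timedelta(days=1)
        yield first, last
        year, month = next_year, next_month


def search_strings(start_year, start_month, end_year, end_month, star_bands):
    """Build one search string per (month, star band) pair.

    `star_bands` is a list of (low, high) star-count ranges; a low band can
    serve as the control group. The bands are hypothetical, not the project's.
    """
    queries = []
    for first, last in month_windows(start_year, start_month, end_year, end_month):
        for low, high in star_bands:
            queries.append(
                f"created:{first.isoformat()}..{last.isoformat()} stars:{low}..{high}"
            )
    return queries
```

Each generated string covers a small slice of repositories, so no single "relevancy"-ordered result list dominates the sample.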

Result

After extensively tuning our scraper and doing a lot of manual filtering, we believe we had a dataset worth using. Applying both basic correlation measures and machine learning algorithms, we could not find any substantial correlation between any individual README quality attribute and the success of the project.
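For readers who want to try a basic correlation check of this kind on their own data, here is a minimal sketch of Spearman's rank correlation in plain Python. The README does not say which correlation measures were actually used, so this is only one reasonable option. Rank correlation is picked here on the assumption that star counts are heavily skewed, and the function names are made up for the example.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Here `xs` would be one README attribute (for example, word count) and `ys` a success measure such as stars; values near zero indicate no monotonic relationship.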