ReadMetric

ReadMetric is a concluded research project that asked whether the quality of a repository's README predicts how popular the repository becomes. We followed the standard KDD (Knowledge Discovery in Databases) process to keep the result scientifically sound.

Across a variety of statistical models, we found no tangible correlation between the two, which suggests that README quality does not meaningfully affect project success. Other factors are likely far more important.

Scraping

We scraped the data from GitHub ourselves using a combination of GitHub's REST API and GraphQL API. The GHTorrent databases were too outdated for our purposes, and no single official source remains, so we had no choice but to fetch the repositories directly. GraphQL helped us stay within rate limits by returning more data per query.
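To illustrate how a single GraphQL request can carry what would otherwise take several REST calls, here is a minimal sketch. It is not the project's actual scraper: the search string, the page size, the assumption that the README lives at `HEAD:README.md`, and the helper names `build_payload`, `fetch_page`, and `extract_rows` are all hypothetical choices made for this example.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# One search page returns popularity counts and README text for many
# repositories at once, instead of one REST call per repository.
# Assumption: the README is stored as README.md on the default branch.
SEARCH_QUERY = """
query ($q: String!, $n: Int!, $cursor: String) {
  search(query: $q, type: REPOSITORY, first: $n, after: $cursor) {
    pageInfo { hasNextPage endCursor }
    nodes {
      ... on Repository {
        nameWithOwner
        stargazerCount
        forkCount
        object(expression: "HEAD:README.md") {
          ... on Blob { text }
        }
      }
    }
  }
}
"""


def build_payload(search_terms, page_size=50, cursor=None):
    """Build the JSON body for one page of search results."""
    return {
        "query": SEARCH_QUERY,
        "variables": {"q": search_terms, "n": page_size, "cursor": cursor},
    }


def fetch_page(token, search_terms, page_size=50, cursor=None):
    """Send one request to the GraphQL endpoint (needs a personal access token)."""
    body = json.dumps(build_payload(search_terms, page_size, cursor)).encode("utf-8")
    request = urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_rows(result):
    """Flatten one response page into plain dictionaries."""
    rows = []
    for node in result["data"]["search"]["nodes"]:
        if not node:
            continue  # skip empty or missing nodes
        readme = node.get("object") or {}  # None when no README.md exists
        rows.append(
            {
                "repo": node["nameWithOwner"],
                "stars": node["stargazerCount"],
                "forks": node["forkCount"],
                "readme": readme.get("text"),
            }
        )
    return rows
```

To page through results, pass the `endCursor` from `pageInfo` back in as `cursor` until `hasNextPage` is false.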

Roadblocks

Getting a valid list of repositories required many compromises in how the data was used. The goal was 5,000 clean repositories, but several obstacles stood in the way.

Result

After extensively tuning our scraper and filtering much of the data by hand, we believe we ended up with a dataset worth using. After applying both basic correlation measures and machine learning algorithms, we found no substantial correlation between any individual measure of README quality and the success of the project.
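As an illustration of what a basic correlation check between one README feature and one popularity measure could look like, here is a minimal sketch of Spearman's rank correlation in plain Python. The choice of Spearman is an assumption made for this example, since the specific methods used are not listed here, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    return pearson(average_ranks(xs), average_ranks(ys))
```

A rank-based measure is a reasonable fit for popularity data such as star counts, which tend to be heavily skewed. Calling `spearman(readme_feature, stars)` on two equal-length lists returns a value between -1 and 1, where values near 0 indicate no monotonic relationship.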