Recently, I’ve been working on a page to track GitHub activity for ICON-related repositories. The page has a component that displays every commit made to the tracked repositories. This component is powered by a background service that scrapes GitHub’s API for new commits every few minutes, and stores new commits in a MongoDB database.

As you can imagine, this task is fairly resource-intensive (especially for repos with a large number of commits) because GitHub only displays up to 100 commits on each API call. So, scraping commit data from a repo with 5,000 commits would require 50 API calls.

To make things a bit more efficient, I added some additional logic that queries the database for all commits to a repo, queries the GitHub API for the total number of commits, and compares the two numbers. If the two numbers are equal, the background service can just skip the resource-intensive API calls for the given repo.

Here’s a simplified version of the function I wrote to get the number of commits to a GitHub repo:

import requests
from urllib.parse import parse_qs, urlparse

def get_commits_count(self, owner_name: str, repo_name: str) -> int:
    Returns the number of commits to a GitHub repository.
    url = f"{owner_name}/{repo_name}/commits?per_page=1"
    r = requests.get(url)
    links = r.links
    rel_last_link_url = urlparse(links["last"]["url"])
    rel_last_link_url_args = parse_qs(rel_last_link_url.query)
    rel_last_link_url_page_arg = rel_last_link_url_args["page"][0]
    commits_count = int(rel_last_link_url_page_arg)
    return commits_count