Mining GitHub for Founder Signals

At Powerset, we believe the best founders leave traces of their work long before they start companies. Open source contributions, side projects, and technical writing often reveal the builders who will go on to create category-defining startups.

This post walks through how we built a system to surface these signals from GitHub data.

The Hypothesis

Most venture sourcing happens reactively: a founder announces a raise, and investors scramble to get a meeting. But what if we could identify exceptional builders before they start companies?

GitHub is a goldmine for this. Every commit, issue, and pull request tells a story. The challenge is separating signal from noise across millions of developers.

[Figure: data pipeline visualization]

Building the Data Pipeline

We started with MergeStat, an open source tool that syncs Git repository data into a Postgres database. This gives us a SQL interface to explore commit history, file changes, and contributor patterns.

Here's the basic schema we work with:

-- repos is created first because commits references it.
CREATE TABLE repos (
  id UUID PRIMARY KEY,
  name TEXT,
  owner TEXT,
  stars INTEGER,
  forks INTEGER,
  language TEXT,
  created_at TIMESTAMPTZ
);

CREATE TABLE commits (
  hash TEXT PRIMARY KEY,
  author_name TEXT,
  author_email TEXT,
  author_when TIMESTAMPTZ,
  message TEXT,
  repo_id UUID REFERENCES repos(id)
);

-- One row per file touched by a commit.
CREATE TABLE file_changes (
  commit_hash TEXT REFERENCES commits(hash),
  file_path TEXT,
  additions INTEGER,
  deletions INTEGER
);

With this foundation, we can start asking interesting questions.

Signal 1: Consistent Contributors

Our first signal looks for developers who maintain a steady cadence of contributions over time. One-off contributors are common; sustained engagement is rare.

-- Commits per author per calendar month over the last two years.
-- Months with no commits produce no row, so they are not counted below.
WITH monthly_commits AS (
  SELECT
    author_email,
    DATE_TRUNC('month', author_when) AS month,
    COUNT(*) AS commit_count
  FROM commits
  WHERE author_when > NOW() - INTERVAL '2 years'
  GROUP BY author_email, DATE_TRUNC('month', author_when)
),
contributor_stats AS (
  SELECT
    author_email,
    COUNT(DISTINCT month) AS active_months,
    AVG(commit_count) AS avg_monthly_commits,
    STDDEV(commit_count) AS commit_stddev  -- sample standard deviation
  FROM monthly_commits
  GROUP BY author_email
)
SELECT
  author_email,
  active_months,
  ROUND(avg_monthly_commits, 1) AS avg_commits,
  -- Coefficient of variation: standard deviation divided by the mean.
  ROUND(commit_stddev / NULLIF(avg_monthly_commits, 0), 2) AS consistency_score
FROM contributor_stats
WHERE active_months >= 18
ORDER BY avg_monthly_commits DESC
LIMIT 100;

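The consistency_score is the coefficient of variation: the standard deviation of monthly commit counts divided by their mean. It helps us distinguish someone who commits in sporadic bursts from someone with steady output. A lower score suggests a more disciplined, sustainable work pattern. Because months with no commits produce no row in monthly_commits, the score only reflects active months; the active_months filter is what catches long gaps.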
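
To make the coefficient of variation concrete, here is a minimal TypeScript sketch of the same calculation on an in-memory series. The function name and the input shape (one count per calendar month, with 0 for idle months) are assumptions for illustration, not part of our pipeline. Unlike the SQL, this version zero-fills idle months, so gaps raise the score.

```typescript
// Hypothetical helper: monthlyCounts holds one entry per calendar month in the
// window, including 0 for months with no commits.
function coefficientOfVariation(monthlyCounts: number[]): number | null {
  const n = monthlyCounts.length;
  if (n < 2) return null; // a sample standard deviation needs two points

  const mean = monthlyCounts.reduce((sum, c) => sum + c, 0) / n;
  if (mean === 0) return null; // same guard as NULLIF(avg_monthly_commits, 0)

  const squaredDeviations = monthlyCounts.reduce(
    (sum, c) => sum + (c - mean) ** 2,
    0,
  );
  // Divide by n - 1 to match the sample standard deviation used by STDDEV.
  const stddev = Math.sqrt(squaredDeviations / (n - 1));
  return stddev / mean;
}

// Both series average 20 commits per month, but the second is one big burst.
const steady = coefficientOfVariation([20, 22, 18, 20]);
const bursty = coefficientOfVariation([0, 80, 0, 0]);
console.log({ steady, bursty });
```

The steady series scores well below 1 while the bursty one scores 2, even though both have the same mean, which is the distinction the signal is after.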

Signal 2: Rising Stars

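We also look for developers whose influence is growing. The function below runs a query that finds contributors who had more pull requests merged into popular repositories in the last six months than in all the time before that. It relies on a pull_requests table, which is not part of the schema shown above: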

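// Assumption: the post does not name the database driver. node-postgres is
// used here because it matches the db.query(text, params) / result.rows shape.
import { Pool } from "pg";

const db = new Pool(); // reads connection settings from the PG* environment variables

interface ContributorTrend {
  email: string;
  recentPRs: number;
  olderPRs: number;
  growthRate: number;
  topRepos: string[];
}

async function findRisingContributors(
  minStars: number = 1000,
): Promise<ContributorTrend[]> {
  const result = await db.query(
    `
    -- pull_requests is assumed to have author_email, merged_at, and repo_id.
    WITH pr_activity AS (
      SELECT
        pr.author_email,
        pr.merged_at,
        r.name AS repo_name,
        r.stars,
        -- 'older' covers every merged PR before the six-month cutoff.
        CASE
          WHEN pr.merged_at > NOW() - INTERVAL '6 months' THEN 'recent'
          ELSE 'older'
        END AS period
      FROM pull_requests pr
      JOIN repos r ON pr.repo_id = r.id
      WHERE pr.merged_at IS NOT NULL
        AND r.stars >= $1
    )
    SELECT
      author_email,
      COUNT(*) FILTER (WHERE period = 'recent') AS recent_prs,
      COUNT(*) FILTER (WHERE period = 'older') AS older_prs,
      ARRAY_AGG(DISTINCT repo_name ORDER BY repo_name) AS top_repos
    FROM pr_activity
    GROUP BY author_email
    HAVING COUNT(*) FILTER (WHERE period = 'recent') >
           COUNT(*) FILTER (WHERE period = 'older')
    ORDER BY recent_prs DESC
  `,
    [minStars],
  );

  return result.rows.map((row) => {
    // COUNT(*) is a bigint, which node-postgres returns as a string by
    // default, so convert before doing arithmetic.
    const recentPRs = Number(row.recent_prs);
    const olderPRs = Number(row.older_prs);
    return {
      email: row.author_email,
      recentPRs,
      olderPRs,
      // With no older PRs there is no baseline; fall back to the recent count.
      growthRate: olderPRs > 0 ? recentPRs / olderPRs : recentPRs,
      topRepos: row.top_repos,
    };
  });
}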

This surfaces developers who are increasingly active in popular projects, which we read as an indicator of growing expertise and reputation.
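
One caveat with comparing the last six months against all earlier history is that long-tenured contributors rarely come out ahead. A possible variant, sketched here in TypeScript, compares two equal-length windows instead. The function name, the 180-day default, and the input shape are all assumptions for illustration; this is not the query we run.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

interface WindowedGrowth {
  recent: number;
  previous: number;
  growthRate: number | null; // null when there is no baseline to compare against
}

// Hypothetical helper: counts merges in (now - window, now] and in the
// equal-length window immediately before it. Older merges are ignored.
function windowedGrowth(
  mergedAt: Date[],
  now: Date,
  windowDays: number = 180,
): WindowedGrowth {
  const end = now.getTime();
  const recentStart = end - windowDays * DAY_MS;
  const previousStart = recentStart - windowDays * DAY_MS;

  let recent = 0;
  let previous = 0;
  for (const merged of mergedAt) {
    const t = merged.getTime();
    if (t > recentStart && t <= end) recent++;
    else if (t > previousStart && t <= recentStart) previous++;
  }

  return {
    recent,
    previous,
    growthRate: previous > 0 ? recent / previous : null,
  };
}

const growth = windowedGrowth(
  [
    new Date("2024-06-01T00:00:00Z"),
    new Date("2024-05-01T00:00:00Z"),
    new Date("2024-04-15T00:00:00Z"),
    new Date("2023-11-01T00:00:00Z"),
    new Date("2022-01-01T00:00:00Z"), // outside both windows
  ],
  new Date("2024-07-01T00:00:00Z"),
);
console.log(growth);
```

Returning null instead of the raw recent count keeps first-time contributors from being mixed in with genuine ratios; whether that is the right trade-off depends on how the score is consumed downstream.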

Signal 3: Project Starters

Some of the best founders are serial project creators. We track developers who start repositories that gain meaningful traction:

SELECT
  r.owner AS github_username,
  COUNT(*) AS projects_started,
  SUM(r.stars) AS total_stars,
  AVG(r.stars) AS avg_stars_per_project,
  ARRAY_AGG(r.name ORDER BY r.stars DESC) AS projects
FROM repos r
WHERE r.stars >= 100
  AND r.created_at > NOW() - INTERVAL '3 years'
GROUP BY r.owner
HAVING COUNT(*) >= 3
ORDER BY avg_stars_per_project DESC
LIMIT 50;

[Figure: analytics dashboard]

Putting It Together

Each signal alone is noisy. Someone might have consistent commits but only to their dotfiles. Another might start many projects that never get traction. The magic is in combining signals:

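interface FounderCandidate {
  email: string;
  githubUsername: string;
  signals: {
    consistency: number;
    growth: number;
    projectSuccess: number;
  };
  compositeScore: number;
  topProjects: string[];
}

// Input shapes for scoreCandidate. These types are not shown in the post; the
// fields are inferred from how scoreCandidate uses them.
interface ConsistencyData {
  email: string;
  activeMonths: number;
}

interface GrowthData {
  growthRate: number;
}

interface ProjectData {
  username: string;
  avgStars: number;
  repos: string[];
}

function scoreCandidate(
  consistency: ConsistencyData,
  growth: GrowthData,
  projects: ProjectData,
): FounderCandidate {
  // Normalize each signal to 0-100 scale.
  // Note: consistency is scored on the number of active months, not on the
  // consistency_score from the Signal 1 query.
  const consistencyScore = normalizeScore(
    consistency.activeMonths,
    12,
    24, // min, max expected range
  );

  const growthScore = normalizeScore(growth.growthRate, 1, 5);

  const projectScore = normalizeScore(projects.avgStars, 100, 5000);

  // Weighted composite (weights sum to 1, so the result is also 0-100)
  const composite =
    consistencyScore * 0.3 + growthScore * 0.3 + projectScore * 0.4;

  return {
    email: consistency.email,
    githubUsername: projects.username,
    signals: {
      consistency: consistencyScore,
      growth: growthScore,
      projectSuccess: projectScore,
    },
    compositeScore: composite,
    topProjects: projects.repos.slice(0, 5),
  };
}

// Linearly map value from [min, max] onto [0, 100], clamping anything outside.
function normalizeScore(value: number, min: number, max: number): number {
  return Math.min(100, Math.max(0, ((value - min) / (max - min)) * 100));
}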
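
As a worked example of the weighting, the sketch below scores three made-up candidates and ranks them. normalizeScore is repeated so the snippet runs on its own; the RawSignals shape, the composite helper, and the candidate values are hypothetical and only stand in for the real query outputs.

```typescript
// Repeated from the scoring code so this snippet is self-contained.
function normalizeScore(value: number, min: number, max: number): number {
  return Math.min(100, Math.max(0, ((value - min) / (max - min)) * 100));
}

// Hypothetical input: the three raw values that scoreCandidate reads.
interface RawSignals {
  name: string;
  activeMonths: number;
  growthRate: number;
  avgStars: number;
}

// Same ranges and weights as scoreCandidate: 0.3 / 0.3 / 0.4.
function composite(s: RawSignals): number {
  return (
    normalizeScore(s.activeMonths, 12, 24) * 0.3 +
    normalizeScore(s.growthRate, 1, 5) * 0.3 +
    normalizeScore(s.avgStars, 100, 5000) * 0.4
  );
}

const candidates: RawSignals[] = [
  // Maximum consistency, nothing else: 100 * 0.3
  { name: "steady-builder", activeMonths: 24, growthRate: 1, avgStars: 100 },
  // Halfway on every signal: 50 * 0.3 + 50 * 0.3 + 50 * 0.4
  { name: "midrange", activeMonths: 18, growthRate: 3, avgStars: 2550 },
  // Far above the star range, which clamps to 100: 100 * 0.4
  { name: "breakout-project", activeMonths: 12, growthRate: 1, avgStars: 20000 },
];

const ranked = candidates
  .map((c) => ({ name: c.name, score: composite(c) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked);
```

The ranking comes out as midrange (about 50), breakout-project (about 40), then steady-builder (about 30). Note the effect of clamping: a 20,000-star average earns no more than a 5,000-star one, so a single extreme signal cannot outrank a candidate who is solid across all three.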

Results and Iteration

We've been running this system for six months. Some observations:

False positives are educational. When we reach out to a high-scoring developer who isn't interested in starting a company, we learn about their motivations and refine our model.
Timing matters. The best signals often come 6-12 months before someone is ready to start something. Building relationships early pays dividends.
Context is everything. A developer contributing to AI infrastructure projects in 2024 is a different signal than the same contribution pattern in 2019.

What's Next

We're expanding beyond GitHub to include:

Technical writing — Blog posts, documentation, and conference talks
Community signals — Discord activity, Twitter engagement, podcast appearances
Team formation — When multiple high-signal developers start collaborating
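
As a rough illustration of the team formation idea, the sketch below flags repositories where two or more high-signal developers have both committed. This is an assumption about how such a check might look, not something we have built: the CommitRef shape, the findSharedRepos name, and the two-person threshold are hypothetical, and co-committing to a repository is only a proxy for collaborating.

```typescript
// Hypothetical minimal view of a commit: who authored it and in which repo.
interface CommitRef {
  authorEmail: string;
  repoId: string;
}

// Returns repoId -> sorted list of high-signal authors, for repos where at
// least minPeople of them have committed.
function findSharedRepos(
  commits: CommitRef[],
  highSignalEmails: string[],
  minPeople: number = 2,
): Record<string, string[]> {
  // Prototype-free objects are used as sets so arbitrary keys are safe.
  const isHighSignal: Record<string, boolean> = Object.create(null);
  for (const email of highSignalEmails) isHighSignal[email] = true;

  const peopleByRepo: Record<string, Record<string, boolean>> =
    Object.create(null);
  for (const c of commits) {
    if (!isHighSignal[c.authorEmail]) continue;
    let people = peopleByRepo[c.repoId];
    if (!people) {
      people = Object.create(null) as Record<string, boolean>;
      peopleByRepo[c.repoId] = people;
    }
    people[c.authorEmail] = true;
  }

  const shared: Record<string, string[]> = {};
  for (const repoId of Object.keys(peopleByRepo)) {
    const people = Object.keys(peopleByRepo[repoId]).sort();
    if (people.length >= minPeople) shared[repoId] = people;
  }
  return shared;
}

const sharedRepos = findSharedRepos(
  [
    { authorEmail: "a@example.com", repoId: "repo-1" },
    { authorEmail: "b@example.com", repoId: "repo-1" },
    { authorEmail: "a@example.com", repoId: "repo-1" }, // duplicate author
    { authorEmail: "a@example.com", repoId: "repo-2" },
    { authorEmail: "c@example.com", repoId: "repo-2" }, // not high-signal
  ],
  ["a@example.com", "b@example.com"],
);
console.log(sharedRepos);
```

In practice a time window would matter here, since the interesting event is developers who start working together, not ones who have shared a repository for years.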

The goal isn't to replace human judgment—it's to surface candidates we'd otherwise miss and give us a head start on building relationships.

Interested in how we're building this? We're hiring engineers who want to work at the intersection of data and venture. Get in touch.