Mining GitHub for Founder Signals
At Powerset, we believe the best founders leave traces of their work long before they start companies. Open source contributions, side projects, and technical writing often reveal the builders who will go on to create category-defining startups.
This post walks through how we built a system to surface these signals from GitHub data.
The Hypothesis
Most venture sourcing happens reactively: a founder announces a raise, and investors scramble to get a meeting. But what if we could identify exceptional builders before they start companies?
GitHub is a goldmine for this. Every commit, issue, and pull request tells a story. The challenge is separating signal from noise across millions of developers.
Building the Data Pipeline
We started with MergeStat, an open source tool that syncs Git repository data into a Postgres database. This gives us a SQL interface to explore commit history, file changes, and contributor patterns.
Here's the basic schema we work with:
-- repos is created first because commits.repo_id references it.
CREATE TABLE repos (
    id UUID PRIMARY KEY,
    name TEXT,
    owner TEXT,
    stars INTEGER,
    forks INTEGER,
    language TEXT,
    created_at TIMESTAMPTZ
);

CREATE TABLE commits (
    hash TEXT PRIMARY KEY,
    author_name TEXT,
    author_email TEXT,
    author_when TIMESTAMPTZ,
    message TEXT,
    repo_id UUID REFERENCES repos(id)
);

-- One row per file touched by a commit.
CREATE TABLE file_changes (
    commit_hash TEXT REFERENCES commits(hash),
    file_path TEXT,
    additions INTEGER,
    deletions INTEGER
);
With this foundation, we can start asking interesting questions.
Signal 1: Consistent Contributors
Our first signal looks for developers who maintain a steady cadence of contributions over time. One-off contributors are common; sustained engagement is rare.
WITH monthly_commits AS (
    -- Commits per author per calendar month over the last two years
    SELECT
        author_email,
        DATE_TRUNC('month', author_when) AS month,
        COUNT(*) AS commit_count
    FROM commits
    WHERE author_when > NOW() - INTERVAL '2 years'
    GROUP BY author_email, DATE_TRUNC('month', author_when)
),
contributor_stats AS (
    SELECT
        author_email,
        COUNT(DISTINCT month) AS active_months,
        AVG(commit_count) AS avg_monthly_commits,
        -- STDDEV is the sample standard deviation, not the variance
        STDDEV(commit_count) AS commit_stddev
    FROM monthly_commits
    GROUP BY author_email
)
SELECT
    author_email,
    active_months,
    ROUND(avg_monthly_commits, 1) AS avg_commits,
    -- Coefficient of variation: standard deviation divided by mean
    ROUND(commit_stddev / NULLIF(avg_monthly_commits, 0), 2) AS consistency_score
FROM contributor_stats
WHERE active_months >= 18
ORDER BY avg_monthly_commits DESC
LIMIT 100;
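The consistency_score is the coefficient of variation: the standard deviation of monthly commit counts divided by their mean. It distinguishes someone who commits in sporadic bursts from someone with steady output. A lower score means less variation relative to the mean, which suggests a more disciplined, sustainable work pattern. Months with no commits are absent from monthly_commits rather than counted as zero, so the active_months >= 18 filter is what screens out long gaps.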
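To make the score concrete, here is a minimal TypeScript sketch of the same calculation on in-memory data. The function name and both sample arrays are hypothetical, invented for illustration. It uses the sample standard deviation (dividing by n - 1), which is what Postgres STDDEV computes.

```typescript
// Coefficient of variation (CV) = sample standard deviation / mean.
// Returns null where the SQL version would yield NULL.
function coefficientOfVariation(monthlyCommits: number[]): number | null {
  const n = monthlyCommits.length;
  if (n < 2) return null; // sample stddev is undefined for fewer than 2 points

  const mean = monthlyCommits.reduce((sum, c) => sum + c, 0) / n;
  if (mean === 0) return null; // mirrors NULLIF(avg_monthly_commits, 0)

  const squaredDiffs = monthlyCommits.reduce(
    (sum, c) => sum + (c - mean) * (c - mean),
    0,
  );
  const stddev = Math.sqrt(squaredDiffs / (n - 1));
  return stddev / mean;
}

// Hypothetical contributors: both average 20 commits per month.
const steady = [18, 22, 20, 19, 21, 20];
const bursty = [2, 60, 1, 55, 0, 2];

console.log(coefficientOfVariation(steady)); // about 0.07
console.log(coefficientOfVariation(bursty)); // about 1.45
```

Both contributors have the same average, so ranking by average commits alone would not separate them. The score does.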
Signal 2: Rising Stars
We also look for developers whose influence is growing. The function below identifies contributors who are getting more pull requests merged in popular repositories over time. It queries a pull_requests table that is not part of the schema shown earlier:
interface ContributorTrend {
  email: string;
  recentPRs: number;
  olderPRs: number;
  growthRate: number;
  topRepos: string[];
}

// `db` is assumed to be an already-configured Postgres client that exposes
// query(text, params) and returns { rows }.
async function findRisingContributors(
  minStars: number = 1000,
): Promise<ContributorTrend[]> {
  const result = await db.query(
    `
    WITH pr_activity AS (
      SELECT
        pr.author_email,
        pr.merged_at,
        r.name AS repo_name,
        r.stars,
        CASE
          WHEN pr.merged_at > NOW() - INTERVAL '6 months' THEN 'recent'
          ELSE 'older'
        END AS period
      FROM pull_requests pr
      JOIN repos r ON pr.repo_id = r.id
      WHERE pr.merged_at IS NOT NULL
        -- Compare two equal six-month windows instead of six months
        -- against all prior history, which would penalize long tenure.
        AND pr.merged_at > NOW() - INTERVAL '12 months'
        AND r.stars >= $1
    )
    SELECT
      author_email,
      -- COUNT returns bigint, which some drivers return as a string;
      -- cast so the arithmetic below operates on numbers.
      CAST(COUNT(*) FILTER (WHERE period = 'recent') AS INTEGER) AS recent_prs,
      CAST(COUNT(*) FILTER (WHERE period = 'older') AS INTEGER) AS older_prs,
      ARRAY_AGG(DISTINCT repo_name ORDER BY repo_name) AS top_repos
    FROM pr_activity
    GROUP BY author_email
    HAVING COUNT(*) FILTER (WHERE period = 'recent') >
           COUNT(*) FILTER (WHERE period = 'older')
    ORDER BY recent_prs DESC
    `,
    [minStars],
  );

  return result.rows.map((row) => ({
    email: row.author_email,
    recentPRs: row.recent_prs,
    olderPRs: row.older_prs,
    // With no PRs in the older window there is no baseline to divide by,
    // so fall back to the raw recent count.
    growthRate:
      row.older_prs > 0 ? row.recent_prs / row.older_prs : row.recent_prs,
    topRepos: row.top_repos,
  }));
}
This surfaces developers who are increasingly active in widely used projects, which we treat as an indicator of growing expertise and reputation.
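A growth ratio is easiest to interpret when both periods cover the same length of time. The sketch below shows one way to compute that on in-memory merge timestamps. The helper name, the 182-day window, and the sample dates are all assumptions made for illustration, not part of our pipeline. Unlike the function above, it returns null instead of a raw count when there is no earlier baseline.

```typescript
// Hypothetical helper: counts merged PRs in two back-to-back windows of
// equal length and reports recent / older as the growth rate.
function windowedGrowth(
  mergedAt: Date[],
  now: Date,
  windowDays: number = 182,
): { recent: number; older: number; growthRate: number | null } {
  const msPerDay = 24 * 60 * 60 * 1000;
  const nowMs = now.getTime();
  const recentStart = nowMs - windowDays * msPerDay;
  const olderStart = recentStart - windowDays * msPerDay;

  let recent = 0;
  let older = 0;
  for (const d of mergedAt) {
    const t = d.getTime();
    if (t > nowMs) continue; // ignore timestamps in the future
    if (t > recentStart) recent++;
    else if (t > olderStart) older++;
    // anything before olderStart falls outside both windows
  }

  return { recent, older, growthRate: older > 0 ? recent / older : null };
}

// Invented sample: 4 merges in the recent window, 2 in the one before it,
// and 1 that is too old to count.
const sampleMerges = [
  "2024-02-10", "2024-03-05", "2024-05-20", "2024-06-15",
  "2023-09-01", "2023-11-11",
  "2022-01-01",
].map((day) => new Date(day + "T00:00:00Z"));

console.log(windowedGrowth(sampleMerges, new Date("2024-07-01T00:00:00Z")));
// recent: 4, older: 2, growthRate: 2
```

Returning null for a missing baseline keeps brand-new contributors from being scored on a different scale than everyone else. Whether to rank them separately is a judgment call.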
Signal 3: Project Starters
Some of the best founders are serial project creators. We track developers who start repositories that gain meaningful traction:
SELECT
    r.owner AS github_username,
    COUNT(*) AS projects_started,
    SUM(r.stars) AS total_stars,
    AVG(r.stars) AS avg_stars_per_project,
    ARRAY_AGG(r.name ORDER BY r.stars DESC) AS projects
FROM repos r
WHERE r.stars >= 100
    AND r.created_at > NOW() - INTERVAL '3 years'
GROUP BY r.owner
HAVING COUNT(*) >= 3
ORDER BY avg_stars_per_project DESC
LIMIT 50;
Putting It Together
Each signal alone is noisy. Someone might have consistent commits but only to their dotfiles. Another might start many projects that never get traction. Combining the signals filters out much of that noise:
// Input shapes inferred from the fields scoreCandidate reads.
interface ConsistencyData {
  email: string;
  activeMonths: number;
}

interface GrowthData {
  growthRate: number;
}

interface ProjectData {
  username: string;
  avgStars: number;
  repos: string[];
}

interface FounderCandidate {
  email: string;
  githubUsername: string;
  signals: {
    consistency: number;
    growth: number;
    projectSuccess: number;
  };
  compositeScore: number;
  topProjects: string[];
}

function scoreCandidate(
  consistency: ConsistencyData,
  growth: GrowthData,
  projects: ProjectData,
): FounderCandidate {
  // Normalize each signal to a 0-100 scale
  const consistencyScore = normalizeScore(
    consistency.activeMonths,
    12,
    24, // min, max expected range
  );
  const growthScore = normalizeScore(growth.growthRate, 1, 5);
  const projectScore = normalizeScore(projects.avgStars, 100, 5000);

  // Weighted composite; the weights sum to 1, so the result stays in 0-100
  const composite =
    consistencyScore * 0.3 + growthScore * 0.3 + projectScore * 0.4;

  return {
    email: consistency.email,
    githubUsername: projects.username,
    signals: {
      consistency: consistencyScore,
      growth: growthScore,
      projectSuccess: projectScore,
    },
    compositeScore: composite,
    topProjects: projects.repos.slice(0, 5),
  };
}

// Linearly maps [min, max] onto [0, 100] and clamps values outside the range.
function normalizeScore(value: number, min: number, max: number): number {
  if (max <= min) {
    throw new RangeError("max must be greater than min");
  }
  return Math.min(100, Math.max(0, ((value - min) / (max - min)) * 100));
}
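A worked example may help show how the normalization and weights interact. The candidate values below are hypothetical, chosen so the arithmetic is easy to follow. The ranges and weights are the ones used in scoreCandidate above.

```typescript
// Clamped linear normalization: maps [min, max] onto [0, 100].
function normalizeScore(value: number, min: number, max: number): number {
  return Math.min(100, Math.max(0, ((value - min) / (max - min)) * 100));
}

// Hypothetical candidate, invented for illustration only.
const example = { activeMonths: 21, growthRate: 2, avgStars: 1325 };

// (21 - 12) / (24 - 12) = 0.75
const consistencyScore = normalizeScore(example.activeMonths, 12, 24);
// (2 - 1) / (5 - 1) = 0.25
const growthScore = normalizeScore(example.growthRate, 1, 5);
// (1325 - 100) / (5000 - 100) = 0.25
const projectScore = normalizeScore(example.avgStars, 100, 5000);

// Weights: 0.3 consistency, 0.3 growth, 0.4 project success.
// 75 * 0.3 + 25 * 0.3 + 25 * 0.4 = 22.5 + 7.5 + 10
const composite =
  consistencyScore * 0.3 + growthScore * 0.3 + projectScore * 0.4;

console.log({ consistencyScore, growthScore, projectScore, composite });

// Values outside the expected range are clamped rather than extrapolated.
console.log(normalizeScore(30, 12, 24)); // 100
console.log(normalizeScore(0.5, 1, 5)); // 0
```

Here a strong consistency score of 75 still yields a composite of about 40, because growth and project success together carry 70% of the weight.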
Results and Iteration
We've been running this system for six months. Some observations:
- False positives are educational. When we reach out to a high-scoring developer who isn't interested in starting a company, we learn about their motivations and refine our model.
- Timing matters. The best signals often come 6-12 months before someone is ready to start something. Building relationships early pays dividends.
- Context is everything. A developer contributing to AI infrastructure projects in 2024 is a different signal than the same contribution pattern in 2019.
What's Next
We're expanding beyond GitHub to include:
- Technical writing — Blog posts, documentation, and conference talks
- Community signals — Discord activity, Twitter engagement, podcast appearances
- Team formation — When multiple high-signal developers start collaborating
The goal isn't to replace human judgment—it's to surface candidates we'd otherwise miss and give us a head start on building relationships.
Interested in how we're building this? We're hiring engineers who want to work at the intersection of data and venture. Get in touch.