We prepared a dataset from the GH Archive that contains all the events in all GitHub repositories since 2011 in structured format. The dataset was uploaded into ClickHouse, where it contains 3.1 billion records. We redistribute it for research purposes and it can be downloaded at this direct link. This dataset can help answer almost any question about GitHub that you can imagine.
This article shows the usage of the dataset and offers insights into the GitHub ecosystem. It emphasizes how easy it is to play with data using modern tools.
The WatchEvent is the event when someone gives a star to a repo.
:) SELECT count() FROM github_events WHERE event_type = 'WatchEvent'
┌───count()─┐
│ 232118474 │
└───────────┘
1 rows in set. Elapsed: 0.034 sec. Processed 232.13 million rows, 232.13 MB (6.85 billion rows/s., 6.85 GB/s.)
There is no event when someone removes the star. This means that star removal is invisible, but it should be rare in any case, so we can calculate the approximate number of stars from these events.
:) SELECT action, count() FROM github_events WHERE event_type = 'WatchEvent' GROUP BY action
┌─action──┬───count()─┐
│ started │ 232118474 │
└─────────┴───────────┘
1 rows in set. Elapsed: 0.034 sec. Processed 232.13 million rows, 464.26 MB (6.77 billion rows/s., 13.53 GB/s.)
There are over 200 million stars on GitHub!
Let's validate the star count by comparing it for my favorite repository:
:) SELECT count() FROM github_events WHERE event_type = 'WatchEvent' AND repo_name IN ('ClickHouse/ClickHouse', 'yandex/ClickHouse') GROUP BY action
┌─count()─┐
│ 14359 │
└─────────┘
1 rows in set. Elapsed: 0.006 sec. Processed 32.77 thousand rows, 166.65 KB (5.52 million rows/s., 28.07 MB/s.)
The number looks consistent with the real star count.
(The repository was moved from yandex/ClickHouse to ClickHouse/ClickHouse in October 2019, so we have to sum up the numbers for the two repos.)
This is the first report that comes to mind.
:) SET output_format_pretty_row_numbers = 1
:) SELECT repo_name, count() AS stars FROM github_events WHERE event_type = 'WatchEvent' GROUP BY repo_name ORDER BY stars DESC LIMIT 50
┌─repo_name──────────────────────────────┬──stars─┐
1. │ 996icu/996.ICU │ 354850 │
2. │ FreeCodeCamp/FreeCodeCamp │ 225490 │
3. │ vuejs/vue │ 199731 │
4. │ facebook/react │ 188575 │
5. │ tensorflow/tensorflow │ 173681 │
6. │ sindresorhus/awesome │ 160542 │
7. │ kamranahmedse/developer-roadmap │ 147561 │
8. │ getify/You-Dont-Know-JS │ 144146 │
9. │ freeCodeCamp/freeCodeCamp │ 140027 │
10. │ twbs/bootstrap │ 126371 │
11. │ torvalds/linux │ 121415 │
12. │ jwasham/coding-interview-university │ 119797 │
13. │ github/gitignore │ 119322 │
14. │ donnemartin/system-design-primer │ 118394 │
15. │ flutter/flutter │ 116303 │
16. │ airbnb/javascript │ 115651 │
17. │ jackfrued/Python-100-Days │ 108760 │
18. │ robbyrussell/oh-my-zsh │ 107173 │
19. │ facebook/react-native │ 105346 │
20. │ TheAlgorithms/Python │ 102067 │
21. │ vinta/awesome-python │ 98604 │
22. │ Snailclimb/JavaGuide │ 97793 │
23. │ vhf/free-programming-books │ 94137 │
24. │ CyC2018/CS-Notes │ 93320 │
25. │ golang/go │ 92407 │
26. │ EbookFoundation/free-programming-books │ 92016 │
27. │ trekhleb/javascript-algorithms │ 89512 │
28. │ danistefanovic/build-your-own-x │ 89253 │
29. │ jlevy/the-art-of-command-line │ 87750 │
30. │ Microsoft/vscode │ 82043 │
31. │ angular/angular │ 80602 │
32. │ justjavac/free-programming-books-zh_CN │ 78460 │
33. │ labuladong/fucking-algorithm │ 77568 │
34. │ angular/angular.js │ 76251 │
35. │ FortAwesome/Font-Awesome │ 75924 │
36. │ nodejs/node │ 75477 │
37. │ tensorflow/models │ 75206 │
38. │ laravel/laravel │ 74136 │
39. │ daneden/animate.css │ 74095 │
40. │ mrdoob/three.js │ 72597 │
41. │ M4cs/BabySploit │ 71572 │
42. │ ant-design/ant-design │ 71552 │
43. │ electron/electron │ 71394 │
44. │ kubernetes/kubernetes │ 68644 │
45. │ resume/resume.github.com │ 68423 │
46. │ atom/atom │ 68396 │
47. │ iluwatar/java-design-patterns │ 67831 │
48. │ PanJiaChen/vue-element-admin │ 67625 │
49. │ django/django │ 66415 │
50. │ apple/swift │ 66350 │
└────────────────────────────────────────┴────────┘
50 rows in set. Elapsed: 0.663 sec. Processed 232.13 million rows, 1.81 GB (350.34 million rows/s., 2.73 GB/s.)
The top repository ever is "996.ICU". It's not software, but rather a project to raise awareness about work schedules at Chinese companies. But wait... it's not the top repo. We see two copies of "FreeCodeCamp": FreeCodeCamp/FreeCodeCamp and freeCodeCamp/freeCodeCamp. If we sum up the stars it will be 225490 + 140027 = 365517, slightly more than "996.ICU". So, the top repository on GitHub is an educational resource.
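The merge of the two spellings can be checked in a few lines of Python. This is a minimal sketch that uses only the three counts from the table above; the helper name `merge_case_variants` is made up for illustration and is not part of any dataset tooling.

```python
from collections import Counter

# Star counts copied from the top-50 table above.
stars = {
    "996icu/996.ICU": 354850,
    "FreeCodeCamp/FreeCodeCamp": 225490,
    "freeCodeCamp/freeCodeCamp": 140027,
}


def merge_case_variants(counts):
    """Sum the counts of repository names that differ only in letter case."""
    merged = Counter()
    for name, n in counts.items():
        merged[name.lower()] += n
    return merged


merged = merge_case_variants(stars)
top_repo, top_stars = merged.most_common(1)[0]
print(top_repo, top_stars)  # freecodecamp/freecodecamp 365517
```

Lowercasing is a blunt rule: it merges renames that only change capitalization, but it would not catch a repository that moved to a different owner.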
Next we see: a JavaScript framework, then another JavaScript framework, then a machine learning framework, then the list of Awesome Awesomeness, then two education projects, then (skipping the second FreeCodeCamp entry) a CSS framework, and finally in 11th place there is Linux. Let's stop at this point.
Bottom line: GitHub is full of JavaScript frameworks and education content. But don't jump to conclusions yet: we've only scratched the surface!
By the way, what is the distribution of repositories by stars?
SELECT
    exp10(floor(log10(c))) AS stars,
    uniq(k)
FROM
(
    SELECT
        repo_name AS k,
        count() AS c
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY k
)
GROUP BY stars
ORDER BY stars ASC
┌──stars─┬──uniq(k)─┐
│ 1 │ 14946505 │
│ 10 │ 1196622 │
│ 100 │ 213026 │
│ 1000 │ 28944 │
│ 10000 │ 1847 │
│ 100000 │ 20 │
└────────┴──────────┘
6 rows in set. Elapsed: 1.249 sec. Processed 232.13 million rows, 1.81 GB (185.80 million rows/s., 1.45 GB/s.)
There are only 16 million repositories that have at least one star, and only 1.4 million with 10 or more. About 240 000 repositories have at least 100 stars, 31 000 have at least 1k stars, 1900 have at least 10k stars, and there are only 20 repositories with 100k stars or more.
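Each row of the result counts only the repositories inside one order of magnitude, so any "at least N stars" figure is a cumulative sum over the buckets. Here is a small Python sketch of both steps. The bucket sizes are copied from the result above; the function names are mine, and the digit-counting trick is just an integer-safe stand-in for `exp10(floor(log10(c)))`.

```python
# Bucket floor -> number of repositories, copied from the query result above.
buckets = {1: 14946505, 10: 1196622, 100: 213026, 1000: 28944, 10000: 1847, 100000: 20}


def star_bucket(star_count):
    """Round a positive integer down to a power of ten.

    For positive integers this matches exp10(floor(log10(c))),
    but avoids floating-point rounding by counting digits.
    """
    return 10 ** (len(str(star_count)) - 1)


def at_least(threshold):
    """Repositories whose bucket floor is at least `threshold` (a power of ten)."""
    return sum(n for floor, n in buckets.items() if floor >= threshold)


for threshold in sorted(buckets):
    print(f"{threshold:>6}+ stars: {at_least(threshold):>10,}")
```

For example, `star_bucket(14359)` is 10000, so the ClickHouse repository counted earlier falls into the 10k bucket.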
:) SELECT uniq(repo_name) FROM github_events
┌─uniq(repo_name)─┐
│ 164059648 │
└─────────────────┘
1 rows in set. Elapsed: 3.801 sec. Processed 3.12 billion rows, 24.87 GB (820.73 million rows/s., 6.54 GB/s.)
If we take all the stars and distribute them uniformly across all the repositories, every repository will get about 1.41 stars (232118474 / 164059648).
2015 | 2016 | 2017 | 2018 | 2019 | 2020
:) SELECT repo_name, count() AS stars FROM github_events WHERE event_type = 'WatchEvent' AND toYear(created_at) = '2020' GROUP BY repo_name ORDER BY stars DESC LIMIT 50
┌─repo_name────────────────────────────────────┬─stars─┐
1. │ labuladong/fucking-algorithm │ 77568 │
2. │ jwasham/coding-interview-university │ 54591 │
3. │ kamranahmedse/developer-roadmap │ 50944 │
4. │ donnemartin/system-design-primer │ 37384 │
5. │ EbookFoundation/free-programming-books │ 36686 │
6. │ public-apis/public-apis │ 36405 │
7. │ TheAlgorithms/Python │ 33837 │
8. │ CyC2018/CS-Notes │ 33517 │
9. │ danistefanovic/build-your-own-x │ 33283 │
10. │ microsoft/PowerToys │ 33137 │
11. │ denoland/deno │ 31550 │
12. │ flutter/flutter │ 31320 │
13. │ Snailclimb/JavaGuide │ 31103 │
14. │ trekhleb/javascript-algorithms │ 30803 │
15. │ sindresorhus/awesome │ 28424 │
16. │ florinpop17/app-ideas │ 27989 │
17. │ vuejs/vue │ 27458 │
18. │ jackfrued/Python-100-Days │ 26699 │
19. │ CSSEGISandData/COVID-19 │ 26659 │
20. │ ytdl-org/youtube-dl │ 26475 │
21. │ facebook/react │ 24122 │
22. │ bradtraversy/design-resources-for-developers │ 23170 │
23. │ geekxh/hello-algorithm │ 22650 │
24. │ ohmyzsh/ohmyzsh │ 21705 │
25. │ microsoft/vscode │ 21607 │
26. │ cli/cli │ 21541 │
27. │ torvalds/linux │ 20547 │
28. │ github/gitignore │ 19863 │
29. │ getify/You-Dont-Know-JS │ 19778 │
30. │ goldbergyoni/nodebestpractices │ 19452 │
31. │ huggingface/transformers │ 19214 │
32. │ microsoft/playwright │ 18751 │
33. │ ossu/computer-science │ 18621 │
34. │ Genymobile/scrcpy │ 18538 │
35. │ macrozheng/mall │ 18335 │
36. │ tiangolo/fastapi │ 17962 │
37. │ gothinkster/realworld │ 17696 │
38. │ AobingJava/JavaFamily │ 17495 │
39. │ PanJiaChen/vue-element-admin │ 17330 │
40. │ microsoft/terminal │ 16719 │
41. │ willmcgugan/rich │ 16627 │
42. │ evanw/esbuild │ 16621 │
43. │ jlevy/the-art-of-command-line │ 16351 │
44. │ tensorflow/tensorflow │ 16104 │
45. │ angular/angular │ 15762 │
46. │ kon9chunkit/GitHub-Chinese-Top-Charts │ 15738 │
47. │ freeCodeCamp/freeCodeCamp │ 15701 │
48. │ MisterBooo/LeetCodeAnimation │ 15686 │
49. │ azl397985856/leetcode │ 15352 │
50. │ anuraghazra/github-readme-stats │ 15193 │
└──────────────────────────────────────────────┴───────┘
50 rows in set. Elapsed: 0.400 sec. Processed 214.00 million rows, 2.41 GB (535.27 million rows/s., 6.02 GB/s.)
The top repo is a Chinese educational resource to prepare for an algorithms interview. Actually, almost all of the top 10 are related to education; the exceptions are public-apis/public-apis, a list of public APIs, and microsoft/PowerToys. Further down the list, two entries stand out: Deno, a JS runtime, and CSSEGISandData/COVID-19, a repo for COVID-19 data. Yes, I expected that 2020 would be about COVID-19, but happily enough it's not only about COVID-19.
:) SELECT repo_name, count() AS stars FROM github_events WHERE event_type = 'WatchEvent' AND toYear(created_at) = '2019' GROUP BY repo_name ORDER BY stars DESC LIMIT 50
┌─repo_name──────────────────────────────────┬──stars─┐
1. │ 996icu/996.ICU │ 344825 │
2. │ jackfrued/Python-100-Days │ 76845 │
3. │ M4cs/BabySploit │ 71013 │
4. │ Snailclimb/JavaGuide │ 53444 │
5. │ TheAlgorithms/Python │ 48859 │
6. │ CyC2018/CS-Notes │ 46609 │
7. │ MisterBooo/LeetCodeAnimation │ 41526 │
8. │ vuejs/vue │ 39159 │
9. │ flutter/flutter │ 38733 │
10. │ doocs/advanced-java │ 35338 │
11. │ microsoft/Terminal │ 34359 │
12. │ jlevy/the-art-of-command-line │ 30852 │
13. │ kamranahmedse/developer-roadmap │ 30842 │
14. │ facebook/react │ 29040 │
15. │ donnemartin/system-design-primer │ 27042 │
16. │ tensorflow/tensorflow │ 27032 │
17. │ macrozheng/mall │ 26520 │
18. │ testerSunshine/12306 │ 26414 │
19. │ sindresorhus/awesome │ 26286 │
20. │ azl397985856/leetcode │ 25765 │
21. │ jwasham/coding-interview-university │ 25556 │
22. │ PanJiaChen/vue-element-admin │ 25101 │
23. │ 0voice/interview_internal_reference │ 24557 │
24. │ lib-pku/libpku │ 23015 │
25. │ getify/You-Dont-Know-JS │ 22647 │
26. │ EbookFoundation/free-programming-books │ 21884 │
27. │ microsoft/terminal │ 21878 │
28. │ scutan90/DeepLearning-500-questions │ 21784 │
29. │ deepfakes/faceswap │ 21237 │
30. │ sveltejs/svelte │ 20969 │
31. │ torvalds/linux │ 20455 │
32. │ vinta/awesome-python │ 20217 │
33. │ justjavac/free-programming-books-zh_CN │ 20137 │
34. │ 30-seconds/30-seconds-of-code │ 19784 │
35. │ formulahendry/955.WLB │ 19772 │
36. │ NationalSecurityAgency/ghidra │ 19651 │
37. │ robbyrussell/oh-my-zsh │ 19602 │
38. │ imhuay/Algorithm_Interview_Notes-Chinese │ 19508 │
39. │ trekhleb/javascript-algorithms │ 19460 │
40. │ LisaDziuba/Awesome-Design-Tools │ 19142 │
41. │ freeCodeCamp/freeCodeCamp │ 18955 │
42. │ golang/go │ 18842 │
43. │ alibaba/flutter-go │ 18382 │
44. │ ziishaned/learn-regex │ 18352 │
45. │ github/gitignore │ 18228 │
46. │ danistefanovic/build-your-own-x │ 18155 │
47. │ Advanced-Frontend/Daily-Interview-Question │ 17797 │
48. │ ossu/computer-science │ 17794 │
49. │ dcloudio/uni-app │ 17759 │
50. │ codercom/code-server │ 17702 │
└────────────────────────────────────────────┴────────┘
50 rows in set. Elapsed: 0.366 sec. Processed 219.33 million rows, 2.55 GB (599.92 million rows/s., 6.96 GB/s.)
996.ICU, education and pentesting framework. MS Terminal is worth noting. Also a Chinese e-commerce product sits next to Tensorflow.
:) SELECT repo_name, count() AS stars FROM github_events WHERE event_type = 'WatchEvent' AND toYear(created_at) = '2018' GROUP BY repo_name ORDER BY stars DESC LIMIT 50
┌─repo_name──────────────────────────────┬─stars─┐
1. │ vuejs/vue │ 51515 │
2. │ trekhleb/javascript-algorithms │ 39249 │
3. │ facebook/react │ 38817 │
4. │ flutter/flutter │ 38357 │
5. │ danistefanovic/build-your-own-x │ 37815 │
6. │ tensorflow/tensorflow │ 36976 │
7. │ kamranahmedse/developer-roadmap │ 35278 │
8. │ x64dbg/x64dbg │ 32381 │
9. │ donnemartin/system-design-primer │ 31384 │
10. │ CyC2018/Interview-Notebook │ 29941 │
11. │ kelseyhightower/nocode │ 26326 │
12. │ Microsoft/vscode │ 26211 │
13. │ xingshaocheng/architect-awesome │ 25750 │
14. │ sindresorhus/awesome │ 24977 │
15. │ ry/deno │ 22798 │
16. │ getify/You-Dont-Know-JS │ 22014 │
17. │ tensorflow/models │ 21511 │
18. │ GoogleChrome/puppeteer │ 21137 │
19. │ leonardomso/33-js-concepts │ 21033 │
20. │ facebook/create-react-app │ 20300 │
21. │ axios/axios │ 20102 │
22. │ houshanren/hangzhou_house_knowledge │ 19227 │
23. │ ant-design/ant-design │ 19162 │
24. │ facebookresearch/Detectron │ 18611 │
25. │ jwasham/coding-interview-university │ 18255 │
26. │ robbyrussell/oh-my-zsh │ 18254 │
27. │ EbookFoundation/free-programming-books │ 18127 │
28. │ github/gitignore │ 17872 │
29. │ airbnb/javascript │ 17616 │
30. │ justjavac/free-programming-books-zh_CN │ 17293 │
31. │ PanJiaChen/vue-element-admin │ 17256 │
32. │ yangshun/front-end-interview-handbook │ 17138 │
33. │ vinta/awesome-python │ 16797 │
34. │ golang/go │ 16424 │
35. │ facebook/react-native │ 16039 │
36. │ torvalds/linux │ 15936 │
37. │ TheAlgorithms/Python │ 15914 │
38. │ tabler/tabler │ 15849 │
39. │ kubernetes/kubernetes │ 15800 │
40. │ Avik-Jain/100-Days-Of-ML-Code │ 15777 │
41. │ iluwatar/java-design-patterns │ 15370 │
42. │ parcel-bundler/parcel │ 15257 │
43. │ Meituan-Dianping/mpvue │ 15017 │
44. │ storybooks/storybook │ 14835 │
45. │ dawnlabs/carbon │ 14807 │
46. │ toddmotto/public-apis │ 14781 │
47. │ electron/electron │ 14750 │
48. │ shadowsocks/shadowsocks-windows │ 14619 │
49. │ kdn251/interviews │ 14586 │
50. │ scutan90/DeepLearning-500-questions │ 14497 │
└────────────────────────────────────────┴───────┘
50 rows in set. Elapsed: 0.304 sec. Processed 219.55 million rows, 2.73 GB (722.04 million rows/s., 8.96 GB/s.)
JavaScript frameworks and education.
:) SELECT repo_name, count() AS stars FROM github_events WHERE event_type = 'WatchEvent' AND toYear(created_at) = '2017' GROUP BY repo_name ORDER BY stars DESC LIMIT 50
┌─repo_name──────────────────────────────────────┬─stars─┐
1. │ freeCodeCamp/freeCodeCamp │ 90989 │
2. │ tensorflow/tensorflow │ 49278 │
3. │ vuejs/vue │ 48185 │
4. │ facebook/react │ 34524 │
5. │ mr-mig/every-programmer-should-know │ 30991 │
6. │ kamranahmedse/developer-roadmap │ 30497 │
7. │ sindresorhus/awesome │ 29084 │
8. │ getify/You-Dont-Know-JS │ 28902 │
9. │ thedaviddias/Front-End-Checklist │ 24745 │
10. │ facebookincubator/create-react-app │ 24479 │
11. │ Microsoft/vscode │ 23504 │
12. │ donnemartin/system-design-primer │ 22584 │
13. │ GoogleChrome/puppeteer │ 22408 │
14. │ airbnb/javascript │ 22119 │
15. │ sdmg15/Best-websites-a-programmer-should-visit │ 21764 │
16. │ jwasham/coding-interview-university │ 21395 │
17. │ twbs/bootstrap │ 21157 │
18. │ toddmotto/public-apis │ 20881 │
19. │ kamranahmedse/design-patterns-for-humans │ 19420 │
20. │ robbyrussell/oh-my-zsh │ 19325 │
21. │ airbnb/lottie-android │ 18700 │
22. │ facebook/react-native │ 18696 │
23. │ vuejs/awesome-vue │ 18499 │
24. │ tipsy/github-profile-summary │ 17847 │
25. │ FreeCodeCampChina/freecodecamp.cn │ 17794 │
26. │ electron/electron │ 17490 │
27. │ kdn251/interviews │ 17471 │
28. │ ryanmcdermott/clean-code-javascript │ 17433 │
29. │ mzabriskie/axios │ 17414 │
30. │ vinta/awesome-python │ 17254 │
31. │ torvalds/linux │ 17200 │
32. │ tensorflow/models │ 17066 │
33. │ nodejs/node │ 17062 │
34. │ github/gitignore │ 16873 │
35. │ python/cpython │ 16627 │
36. │ yangshun/tech-interview-handbook │ 16388 │
37. │ ElemeFE/element │ 16320 │
38. │ bailicangdu/vue2-elm │ 16249 │
39. │ d3/d3 │ 16167 │
40. │ Hack-with-Github/Awesome-Hacking │ 16153 │
41. │ golang/go │ 16096 │
42. │ k88hudson/git-flight-rules │ 15950 │
43. │ webpack/webpack │ 15869 │
44. │ mbeaudru/modern-js-cheatsheet │ 15321 │
45. │ EbookFoundation/free-programming-books │ 15319 │
46. │ angular/angular │ 14392 │
47. │ prettier/prettier │ 14344 │
48. │ vhf/free-programming-books │ 14304 │
49. │ JetBrains/kotlin │ 14210 │
50. │ parcel-bundler/parcel │ 14190 │
└────────────────────────────────────────────────┴───────┘
50 rows in set. Elapsed: 0.321 sec. Processed 218.60 million rows, 2.80 GB (680.46 million rows/s., 8.72 GB/s.)
:) SELECT repo_name, count() AS stars FROM github_events WHERE event_type = 'WatchEvent' AND toYear(created_at) = '2016' GROUP BY repo_name ORDER BY stars DESC LIMIT 50
┌─repo_name───────────────────────────────────────┬──stars─┐
1. │ FreeCodeCamp/FreeCodeCamp │ 182203 │
2. │ jwasham/google-interview-university │ 31522 │
3. │ vhf/free-programming-books │ 28870 │
4. │ vuejs/vue │ 28831 │
5. │ tensorflow/tensorflow │ 28282 │
6. │ facebook/react │ 26046 │
7. │ getify/You-Dont-Know-JS │ 24975 │
8. │ sindresorhus/awesome │ 24829 │
9. │ chrislgarry/Apollo-11 │ 23483 │
10. │ yarnpkg/yarn │ 21518 │
11. │ facebook/react-native │ 19779 │
12. │ twbs/bootstrap │ 19396 │
13. │ airbnb/javascript │ 19064 │
14. │ joshbuchea/HEAD │ 18439 │
15. │ facebookincubator/create-react-app │ 18191 │
16. │ firehol/netdata │ 17908 │
17. │ robbyrussell/oh-my-zsh │ 17364 │
18. │ FallibleInc/security-guide-for-developers │ 15901 │
19. │ open-guides/og-aws │ 15620 │
20. │ github/gitignore │ 15205 │
21. │ electron/electron │ 15138 │
22. │ googlesamples/android-architecture │ 14647 │
23. │ torvalds/linux │ 14434 │
24. │ getlantern/lantern │ 14189 │
25. │ reactjs/redux │ 13997 │
26. │ nodejs/node │ 13602 │
27. │ Microsoft/vscode │ 13284 │
28. │ angular/angular │ 13204 │
29. │ apple/swift │ 13060 │
30. │ ParsePlatform/parse-server │ 12837 │
31. │ braydie/HowToBeAProgrammer │ 12574 │
32. │ docker/docker │ 12482 │
33. │ atom/atom │ 12468 │
34. │ webpack/webpack │ 12186 │
35. │ ZuzooVn/machine-learning-for-software-engineers │ 11916 │
36. │ jgthms/bulma │ 11685 │
37. │ vinta/awesome-python │ 11626 │
38. │ angular/angular.js │ 11446 │
39. │ golang/go │ 11337 │
40. │ toddmotto/public-apis │ 11266 │
41. │ ReactiveX/RxJava │ 11244 │
42. │ daneden/animate.css │ 11109 │
43. │ naptha/tesseract.js │ 10955 │
44. │ tensorflow/models │ 10874 │
45. │ josephmisiti/awesome-machine-learning │ 10546 │
46. │ FortAwesome/Font-Awesome │ 10496 │
47. │ juliangarnier/anime │ 10462 │
48. │ AHAAAAAAA/PokemonGo-Map │ 10448 │
49. │ Microsoft/TypeScript │ 10340 │
50. │ caesar0301/awesome-public-datasets │ 10333 │
└─────────────────────────────────────────────────┴────────┘
50 rows in set. Elapsed: 0.277 sec. Processed 216.06 million rows, 2.88 GB (780.88 million rows/s., 10.39 GB/s.)
:) SELECT repo_name, count() AS stars FROM github_events WHERE event_type = 'WatchEvent' AND toYear(created_at) = '2015' GROUP BY repo_name ORDER BY stars DESC LIMIT 50
┌─repo_name────────────────────────────────────┬─stars─┐
1. │ FreeCodeCamp/FreeCodeCamp │ 38485 │
2. │ facebook/react-native │ 25888 │
3. │ apple/swift │ 25834 │
4. │ sindresorhus/awesome │ 24420 │
5. │ facebook/react │ 22977 │
6. │ jlevy/the-art-of-command-line │ 22105 │
7. │ NARKOZ/hacker-scripts │ 20450 │
8. │ twbs/bootstrap │ 19775 │
9. │ google/material-design-lite │ 17904 │
10. │ airbnb/javascript │ 17586 │
11. │ nvbn/thefuck │ 17168 │
12. │ vhf/free-programming-books │ 16923 │
13. │ getify/You-Dont-Know-JS │ 16404 │
14. │ atom/electron │ 16174 │
15. │ tensorflow/tensorflow │ 16009 │
16. │ FreeCodeCamp/freecodecamp │ 15321 │
17. │ nylas/N1 │ 14927 │
18. │ angular/angular.js │ 14824 │
19. │ atom/atom │ 13737 │
20. │ mbostock/d3 │ 13497 │
21. │ Dogfalo/materialize │ 13150 │
22. │ ripienaar/free-for-dev │ 12847 │
23. │ torvalds/linux │ 12455 │
24. │ robbyrussell/oh-my-zsh │ 12281 │
25. │ github/gitignore │ 11358 │
26. │ bevacqua/dragula │ 11060 │
27. │ meteor/meteor │ 10963 │
28. │ FortAwesome/Font-Awesome │ 10849 │
29. │ docker/docker │ 10845 │
30. │ daneden/animate.css │ 10732 │
31. │ zenorocha/clipboard.js │ 10471 │
32. │ 0xAX/linux-insides │ 10459 │
33. │ typicode/json-server │ 10247 │
34. │ wasabeef/awesome-android-ui │ 10187 │
35. │ babel/babel │ 10147 │
36. │ h5bp/Front-end-Developer-Interview-Questions │ 10011 │
37. │ prakhar1989/awesome-courses │ 9947 │
38. │ lukehoban/es6features │ 9890 │
39. │ driftyco/ionic │ 9771 │
40. │ gorhill/uBlock │ 9761 │
41. │ minimaxir/big-list-of-naughty-strings │ 9550 │
42. │ Semantic-Org/Semantic-UI │ 9546 │
43. │ Microsoft/vscode │ 9489 │
44. │ alex/what-happens-when │ 9479 │
45. │ vinta/awesome-python │ 9304 │
46. │ golang/go │ 9254 │
47. │ phanan/htaccess │ 9240 │
48. │ nwjs/nw.js │ 9203 │
49. │ johnpapa/angular-styleguide │ 9129 │
50. │ google/material-design-icons │ 9099 │
└──────────────────────────────────────────────┴───────┘
50 rows in set. Elapsed: 0.268 sec. Processed 213.30 million rows, 2.99 GB (795.76 million rows/s., 11.17 GB/s.)
Bonus: all these results in a single report:
SELECT
    year,
    lower(repo_name) AS repo,
    count()
FROM github_events
WHERE (event_type = 'WatchEvent') AND (year >= 2015)
GROUP BY
    repo,
    toYear(created_at) AS year
ORDER BY
    year ASC,
    count() DESC
LIMIT 10 BY year
┌─year─┬─repo──────────────────────────┬─count()─┐
│ 2015 │ freecodecamp/freecodecamp │ 53806 │
│ 2015 │ facebook/react-native │ 25888 │
│ 2015 │ apple/swift │ 25834 │
│ 2015 │ sindresorhus/awesome │ 24420 │
│ 2015 │ facebook/react │ 22977 │
│ 2015 │ jlevy/the-art-of-command-line │ 22105 │
│ 2015 │ narkoz/hacker-scripts │ 20450 │
│ 2015 │ twbs/bootstrap │ 19775 │
│ 2015 │ google/material-design-lite │ 17904 │
│ 2015 │ airbnb/javascript │ 17586 │
└──────┴───────────────────────────────┴─────────┘
┌─year─┬─repo────────────────────────────────┬─count()─┐
│ 2016 │ freecodecamp/freecodecamp │ 182203 │
│ 2016 │ jwasham/google-interview-university │ 31522 │
│ 2016 │ vhf/free-programming-books │ 28870 │
│ 2016 │ vuejs/vue │ 28831 │
│ 2016 │ tensorflow/tensorflow │ 28282 │
│ 2016 │ facebook/react │ 26046 │
│ 2016 │ getify/you-dont-know-js │ 24975 │
│ 2016 │ sindresorhus/awesome │ 24829 │
│ 2016 │ chrislgarry/apollo-11 │ 23483 │
│ 2016 │ yarnpkg/yarn │ 21518 │
└──────┴─────────────────────────────────────┴─────────┘
┌─year─┬─repo────────────────────────────────┬─count()─┐
│ 2017 │ freecodecamp/freecodecamp │ 96359 │
│ 2017 │ tensorflow/tensorflow │ 49278 │
│ 2017 │ vuejs/vue │ 48185 │
│ 2017 │ facebook/react │ 34524 │
│ 2017 │ mr-mig/every-programmer-should-know │ 30991 │
│ 2017 │ kamranahmedse/developer-roadmap │ 30497 │
│ 2017 │ sindresorhus/awesome │ 29084 │
│ 2017 │ getify/you-dont-know-js │ 28902 │
│ 2017 │ thedaviddias/front-end-checklist │ 24745 │
│ 2017 │ facebookincubator/create-react-app │ 24479 │
└──────┴─────────────────────────────────────┴─────────┘
┌─year─┬─repo─────────────────────────────┬─count()─┐
│ 2018 │ vuejs/vue │ 51515 │
│ 2018 │ trekhleb/javascript-algorithms │ 39249 │
│ 2018 │ facebook/react │ 38817 │
│ 2018 │ flutter/flutter │ 38357 │
│ 2018 │ danistefanovic/build-your-own-x │ 37815 │
│ 2018 │ tensorflow/tensorflow │ 36976 │
│ 2018 │ kamranahmedse/developer-roadmap │ 35278 │
│ 2018 │ x64dbg/x64dbg │ 32381 │
│ 2018 │ donnemartin/system-design-primer │ 31384 │
│ 2018 │ cyc2018/interview-notebook │ 29941 │
└──────┴──────────────────────────────────┴─────────┘
┌─year─┬─repo─────────────────────────┬─count()─┐
│ 2019 │ 996icu/996.icu │ 344825 │
│ 2019 │ jackfrued/python-100-days │ 76845 │
│ 2019 │ m4cs/babysploit │ 71013 │
│ 2019 │ microsoft/terminal │ 56844 │
│ 2019 │ snailclimb/javaguide │ 53444 │
│ 2019 │ thealgorithms/python │ 48859 │
│ 2019 │ cyc2018/cs-notes │ 46609 │
│ 2019 │ misterbooo/leetcodeanimation │ 41526 │
│ 2019 │ vuejs/vue │ 39159 │
│ 2019 │ flutter/flutter │ 38733 │
└──────┴──────────────────────────────┴─────────┘
┌─year─┬─repo───────────────────────────────────┬─count()─┐
│ 2020 │ labuladong/fucking-algorithm │ 77568 │
│ 2020 │ jwasham/coding-interview-university │ 54591 │
│ 2020 │ kamranahmedse/developer-roadmap │ 50944 │
│ 2020 │ donnemartin/system-design-primer │ 37384 │
│ 2020 │ ebookfoundation/free-programming-books │ 36686 │
│ 2020 │ public-apis/public-apis │ 36405 │
│ 2020 │ thealgorithms/python │ 33837 │
│ 2020 │ cyc2018/cs-notes │ 33517 │
│ 2020 │ danistefanovic/build-your-own-x │ 33283 │
│ 2020 │ microsoft/powertoys │ 33137 │
└──────┴────────────────────────────────────────┴─────────┘
60 rows in set. Elapsed: 15.368 sec. Processed 231.51 million rows, 2.71 GB (15.06 million rows/s., 176.33 MB/s.)
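The part of this query that does the work is LIMIT 10 BY year, which keeps at most ten rows for each distinct year after sorting. For readers who don't know the clause, here is a rough Python equivalent of the whole report. The events are a tiny made-up sample, not rows from the dataset, and `top_repos_per_year` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Made-up (year, repo_name) star events, for illustration only.
events = [
    (2015, "FreeCodeCamp/FreeCodeCamp"),
    (2015, "FreeCodeCamp/freecodecamp"),
    (2015, "apple/swift"),
    (2016, "vuejs/vue"),
    (2016, "vuejs/vue"),
    (2016, "yarnpkg/yarn"),
]


def top_repos_per_year(events, limit):
    """Count stars per (year, lowercased repo) and keep the top `limit` repos per year."""
    counts = Counter((year, repo.lower()) for year, repo in events)
    # ORDER BY year ASC, count() DESC (repo name breaks ties for a stable order).
    ordered = sorted(counts.items(), key=lambda kv: (kv[0][0], -kv[1], kv[0][1]))
    per_year = defaultdict(list)
    for (year, repo), n in ordered:
        if len(per_year[year]) < limit:  # the LIMIT ... BY year step
            per_year[year].append((repo, n))
    return dict(per_year)


report = top_repos_per_year(events, limit=1)
print(report)
```

Note that lowercasing the name before grouping is what merges the differently capitalized FreeCodeCamp entries into one row.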
And a multi-line graph:
SELECT
    repo AS name,
    groupArrayInsertAt(toUInt32(c), toUInt64(dateDiff('month', toDate('2015-01-01'), month))) AS data
FROM
(
    SELECT
        lower(repo_name) AS repo,
        toStartOfMonth(created_at) AS month,
        count() AS c
    FROM github_events
    WHERE (event_type = 'WatchEvent') AND (toYear(created_at) >= 2015) AND (repo IN
    (
        SELECT lower(repo_name) AS repo
        FROM github_events
        WHERE (event_type = 'WatchEvent') AND (toYear(created_at) >= 2015)
        GROUP BY repo
        ORDER BY count() DESC
        LIMIT 10
    ))
    GROUP BY
        repo,
        month
)
GROUP BY repo
ORDER BY repo ASC
10 rows in set. Elapsed: 1.729 sec. Processed 231.51 million rows, 1.58 GB (133.88 million rows/s., 913.63 MB/s.)
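The outer query turns the per-month counts into one array per repository, where the array position is the number of months since January 2015. A rough Python sketch of that step follows. The sample rows and function names are invented for illustration, and the zero padding reflects my understanding that groupArrayInsertAt fills missing positions with the type's default value.

```python
from datetime import date

ORIGIN = date(2015, 1, 1)


def month_index(month, origin=ORIGIN):
    """Whole months between `origin` and `month` (both first-of-month dates)."""
    return (month.year - origin.year) * 12 + (month.month - origin.month)


def series_per_repo(monthly_counts):
    """Turn (repo, month, count) rows into {repo: [count for month 0, month 1, ...]}."""
    series = {}
    for repo, month, count in monthly_counts:
        data = series.setdefault(repo, [])
        pos = month_index(month)
        if pos >= len(data):
            # Pad months without any rows with zeros.
            data.extend([0] * (pos + 1 - len(data)))
        data[pos] = count
    return series


# Made-up sample rows, not taken from the dataset.
rows = [
    ("vuejs/vue", date(2015, 1, 1), 120),
    ("vuejs/vue", date(2015, 4, 1), 300),
    ("golang/go", date(2015, 2, 1), 80),
]
print(series_per_repo(rows))
```

Each resulting list can be handed to a plotting library as one line of the chart.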
:) SELECT toYear(created_at) AS year, count() AS stars, bar(stars, 0, 50000000, 10) AS bar FROM github_events WHERE event_type = 'WatchEvent' GROUP BY year ORDER BY year
┌─year─┬────stars─┬─bar────────┐
│ 2011 │ 1831742 │ ▎ │
│ 2012 │ 4048676 │ ▋ │
│ 2013 │ 7432800 │ █▍ │
│ 2014 │ 11952935 │ ██▍ │
│ 2015 │ 18994833 │ ███▋ │
│ 2016 │ 26166310 │ █████▏ │
│ 2017 │ 32640040 │ ██████▌ │
│ 2018 │ 37068153 │ ███████▍ │
│ 2019 │ 46118187 │ █████████▏ │
│ 2020 │ 45864798 │ █████████▏ │
└──────┴──────────┴────────────┘
10 rows in set. Elapsed: 0.102 sec. Processed 232.13 million rows, 1.16 GB (2.27 billion rows/s., 11.35 GB/s.)
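We see steady growth through 2019 that is roughly linear in absolute terms: the number of stars per year rose by between 2 and 9 million each year, while the year-over-year rate fell from over 100% in 2012 to about 25% in 2019. The 2020 total is roughly level with 2019.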
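The year-over-year rates can be computed directly from the table above. This is a small Python sketch; the counts are copied from the query result and `yoy_growth` is a name I made up.

```python
# Stars per year, copied from the query result above.
stars_per_year = {
    2011: 1831742,
    2012: 4048676,
    2013: 7432800,
    2014: 11952935,
    2015: 18994833,
    2016: 26166310,
    2017: 32640040,
    2018: 37068153,
    2019: 46118187,
    2020: 45864798,
}


def yoy_growth(series):
    """Relative change of each year compared to the previous one."""
    years = sorted(series)
    return {cur: series[cur] / series[prev] - 1 for prev, cur in zip(years, years[1:])}


growth = yoy_growth(stars_per_year)
for year, rate in growth.items():
    print(f"{year}: {rate:+.1%}")
```

The rates start above 100% for 2012 and shrink over time, which is what you would expect when the absolute yearly increase stays in the same range while the base grows.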
:) SELECT actor_login, count() AS stars FROM github_events WHERE event_type = 'WatchEvent' GROUP BY actor_login ORDER BY stars DESC LIMIT 50
┌─actor_login─────┬──stars─┐
1. │ 4148 │ 232492 │
2. │ salifm │ 202534 │
3. │ x0rzkov │ 73531 │
4. │ fly51fly │ 57756 │
5. │ PCOffline │ 56725 │
6. │ baslr │ 50363 │
7. │ gkze │ 49078 │
8. │ StarTheWorld │ 45691 │
9. │ thanhtoan1196 │ 44715 │
10. │ mcanthony │ 44331 │
11. │ gauravssnl │ 43832 │
12. │ ik5 │ 42817 │
13. │ jenniemanphonsy │ 41784 │
14. │ madshansens │ 41045 │
15. │ denji │ 40921 │
16. │ jorgemachucav │ 39905 │
17. │ korolr │ 39723 │
18. │ esneko │ 39073 │
19. │ 5l1v3r1 │ 38034 │
20. │ roscopecoltran │ 35734 │
21. │ tdusty46 │ 34200 │
22. │ JT5D │ 32595 │
23. │ pranavlathigara │ 32068 │
24. │ reduxionist │ 31377 │
25. │ StarsEverywhere │ 31079 │
26. │ nikolay │ 28975 │
27. │ maoabc1818 │ 28857 │
28. │ gaoypChina │ 28767 │
29. │ vulcangz │ 28602 │
30. │ hcilab │ 28215 │
31. │ CNXTEoE │ 28045 │
32. │ tdusty47 │ 27534 │
33. │ lifa123 │ 24146 │
34. │ MihalczGabor │ 23416 │
35. │ romanofficial │ 22976 │
36. │ morristech │ 22269 │
37. │ savage69kr │ 22178 │
38. │ trietptm │ 22165 │
39. │ the-cc-dev │ 22094 │
40. │ Jackneill │ 21974 │
41. │ hartca │ 21820 │
42. │ jiangplus │ 21366 │
43. │ topcloud │ 21343 │
44. │ Jerzerak │ 21066 │
45. │ mrluanma │ 20812 │
46. │ pkt │ 20799 │
47. │ wangpengwen │ 20754 │
48. │ denisfitz57 │ 20680 │
49. │ mhaidarh │ 20418 │
50. │ likaihere │ 19667 │
└─────────────────┴────────┘
50 rows in set. Elapsed: 3.024 sec. Processed 232.13 million rows, 3.98 GB (76.77 million rows/s., 1.32 GB/s.)
I checked the results: 4148 looks like an empty account... very suspicious. The next user looks okay. But how is it possible to give 202 534 stars?
In comparison, I only gave 203 stars:
:) SELECT actor_login, count() AS stars FROM github_events WHERE event_type = 'WatchEvent' AND actor_login = 'alexey-milovidov' GROUP BY actor_login ORDER BY stars DESC LIMIT 50
┌─actor_login──────┬─stars─┐
│ alexey-milovidov │ 203 │
└──────────────────┴───────┘
1 rows in set. Elapsed: 0.974 sec. Processed 232.13 million rows, 3.81 GB (238.24 million rows/s., 3.91 GB/s.)
And here are my favorites sorted by the total number of stars:
SELECT
    repo_name,
    count() AS stars
FROM github_events
WHERE (event_type = 'WatchEvent') AND (repo_name IN
(
    SELECT repo_name
    FROM github_events
    WHERE (event_type = 'WatchEvent') AND (actor_login = 'alexey-milovidov')
))
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 50
┌─repo_name──────────────────────────────┬─stars─┐
1. │ koalaman/shellcheck │ 23579 │
2. │ mxgmn/WaveFunctionCollapse │ 16007 │
3. │ nothings/stb │ 15327 │
4. │ Swordfish90/cool-retro-term │ 13619 │
5. │ gabime/spdlog │ 10904 │
6. │ kovidgoyal/kitty │ 10829 │
7. │ lemire/simdjson │ 9541 │
8. │ rsms/inter │ 8653 │
9. │ yandex/ClickHouse │ 8629 │
10. │ olivernn/lunr.js │ 7635 │
11. │ llvm/llvm-project │ 7228 │
12. │ mre/awesome-static-analysis │ 7126 │
13. │ cube-js/cube.js │ 6783 │
14. │ sharkdp/hyperfine │ 6685 │
15. │ darlinghq/darling │ 6194 │
16. │ google/oss-fuzz │ 6036 │
17. │ catboost/catboost │ 6010 │
18. │ chrissimpkins/codeface │ 5998 │
19. │ jemalloc/jemalloc │ 5895 │
20. │ mozilla/rr │ 5848 │
21. │ hashicorp/packer │ 5740 │
22. │ rqlite/rqlite │ 5390 │
23. │ antirez/kilo │ 5372 │
24. │ angr/angr │ 5013 │
25. │ sysown/proxysql │ 4288 │
26. │ unicorn-engine/unicorn │ 4175 │
27. │ zmap/zmap │ 3821 │
28. │ plausible/analytics │ 3614 │
29. │ kahing/goofys │ 3351 │
30. │ weidai11/cryptopp │ 2732 │
31. │ jbeder/yaml-cpp │ 2539 │
32. │ intel/hyperscan │ 2200 │
33. │ jlfwong/speedscope │ 2177 │
34. │ lief-project/LIEF │ 2131 │
35. │ boyter/scc │ 1919 │
36. │ simongog/sdsl-lite │ 1860 │
37. │ yandex/odyssey │ 1833 │
38. │ phracker/MacOSX-SDKs │ 1793 │
39. │ apache/avro │ 1792 │
40. │ zserge/awfice │ 1706 │
41. │ randombit/botan │ 1677 │
42. │ arrowtype/recursive │ 1613 │
43. │ GoogleCloudPlatform/PerfKitBenchmarker │ 1565 │
44. │ drh/lcc │ 1449 │
45. │ uncrustify/uncrustify │ 1322 │
46. │ orlp/pdqsort │ 1319 │
47. │ google/highwayhash │ 1142 │
48. │ centaurean/density │ 988 │
49. │ google/boringssl │ 976 │
50. │ woboq/woboq_codebrowser │ 916 │
└────────────────────────────────────────┴───────┘
50 rows in set. Elapsed: 1.880 sec. Processed 1.68 million rows, 10.17 MB (893.37 thousand rows/s., 5.41 MB/s.)
A quick question I'm eager to ask: what are the top repositories sorted by the number of stars from people who starred the ClickHouse repository?
SELECT
    repo_name,
    count() AS stars
FROM github_events
WHERE (event_type = 'WatchEvent') AND (actor_login IN
(
    SELECT actor_login
    FROM github_events
    WHERE (event_type = 'WatchEvent') AND (repo_name IN ('ClickHouse/ClickHouse', 'yandex/ClickHouse'))
)) AND (repo_name NOT IN ('ClickHouse/ClickHouse', 'yandex/ClickHouse'))
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 50
┌─repo_name─────────────────────────────────┬─stars─┐
1. │ tensorflow/tensorflow │ 4895 │
2. │ golang/go │ 4354 │
3. │ pingcap/tidb │ 4128 │
4. │ donnemartin/system-design-primer │ 4020 │
5. │ sindresorhus/awesome │ 3899 │
6. │ prometheus/prometheus │ 3881 │
7. │ avelino/awesome-go │ 3792 │
8. │ kubernetes/kubernetes │ 3741 │
9. │ grafana/grafana │ 3574 │
10. │ facebook/react │ 3511 │
11. │ vuejs/vue │ 3347 │
12. │ gin-gonic/gin │ 3246 │
13. │ flutter/flutter │ 3160 │
14. │ facebook/rocksdb │ 3156 │
15. │ antirez/redis │ 3101 │
16. │ istio/istio │ 3094 │
17. │ ansible/ansible │ 3094 │
18. │ torvalds/linux │ 3092 │
19. │ minio/minio │ 3068 │
20. │ cockroachdb/cockroach │ 3006 │
21. │ robbyrussell/oh-my-zsh │ 2981 │
22. │ vinta/awesome-python │ 2964 │
23. │ 996icu/996.ICU │ 2959 │
24. │ grpc/grpc │ 2942 │
25. │ jlevy/the-art-of-command-line │ 2916 │
26. │ elastic/elasticsearch │ 2781 │
27. │ containous/traefik │ 2776 │
28. │ hashicorp/consul │ 2745 │
29. │ rust-lang/rust │ 2673 │
30. │ wg/wrk │ 2671 │
31. │ metabase/metabase │ 2666 │
32. │ facebook/react-native │ 2656 │
33. │ kamranahmedse/developer-roadmap │ 2642 │
34. │ ant-design/ant-design │ 2602 │
35. │ GoogleChrome/puppeteer │ 2549 │
36. │ getify/You-Dont-Know-JS │ 2504 │
37. │ apache/spark │ 2486 │
38. │ google/leveldb │ 2475 │
39. │ github/gitignore │ 2437 │
40. │ airbnb/javascript │ 2435 │
41. │ coreos/etcd │ 2429 │
42. │ danistefanovic/build-your-own-x │ 2392 │
43. │ Microsoft/vscode │ 2375 │
44. │ fatedier/frp │ 2367 │
45. │ tensorflow/models │ 2355 │
46. │ getsentry/sentry │ 2354 │
47. │ dgraph-io/dgraph │ 2338 │
48. │ TheAlgorithms/Python │ 2320 │
49. │ mholt/caddy │ 2317 │
50. │ astaxie/build-web-application-with-golang │ 2317 │
└───────────────────────────────────────────┴───────┘
50 rows in set. Elapsed: 1.162 sec. Processed 232.16 million rows, 7.51 GB (199.71 million rows/s., 6.46 GB/s.)
It's TensorFlow... and TiDB. Yes, I knew that we are friends with TiDB (they use the ClickHouse engine under the hood for analytical queries). Then come Go, Prometheus and Kubernetes. The overall list is slightly biased towards data-intensive applications: database engines, distributed filesystems, data orchestration, etc.
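To make the logic of the query above explicit, here is a minimal Python sketch of the same two steps on a tiny in-memory list. The events, the user names and the `co_starred` helper are hypothetical and only illustrate the idea; they are not part of the dataset or of ClickHouse.

```python
from collections import Counter

# Hypothetical toy star events: (actor_login, repo_name). Not real data.
star_events = [
    ("alice", "ClickHouse/ClickHouse"),
    ("alice", "golang/go"),
    ("alice", "pingcap/tidb"),
    ("bob", "yandex/ClickHouse"),
    ("bob", "golang/go"),
    ("carol", "golang/go"),
    ("carol", "facebook/react"),
]

TARGET_REPOS = {"ClickHouse/ClickHouse", "yandex/ClickHouse"}


def co_starred(events, target_repos, limit=50):
    """Rank other repositories by stars from people who starred the target."""
    # Inner subquery: everyone who starred any of the target repositories.
    fans = {actor for actor, repo in events if repo in target_repos}
    # Outer query: count star events by those people on all other repositories.
    counts = Counter(
        repo
        for actor, repo in events
        if actor in fans and repo not in target_repos
    )
    return counts.most_common(limit)


top = co_starred(star_events, TARGET_REPOS)
print(top)
```

Here carol never starred ClickHouse, so her stars are ignored, just like rows filtered out by the `IN` subquery.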
Let's calculate an affinity index: for each repository, the share of its stars that came from users who also gave a star to ClickHouse (or, in the later queries, to Linux or LLVM). Don't worry, the query is pretty simple.
SELECT
repo_name,
uniq(actor_login) AS total_stars,
uniqIf(actor_login, actor_login IN
(
SELECT actor_login
FROM github_events
WHERE (event_type = 'WatchEvent') AND (repo_name IN ('ClickHouse/ClickHouse', 'yandex/ClickHouse'))
)) AS clickhouse_stars,
round(clickhouse_stars / total_stars, 2) AS ratio
FROM github_events
WHERE (event_type = 'WatchEvent') AND (repo_name NOT IN ('ClickHouse/ClickHouse', 'yandex/ClickHouse'))
GROUP BY repo_name
HAVING total_stars >= 100
ORDER BY ratio DESC
LIMIT 50
┌─repo_name─────────────────────────────┬─total_stars─┬─clickhouse_stars─┬─ratio─┐
│ yandex/clickhouse-jdbc │ 340 │ 240 │ 0.71 │
│ yandex/clickhouse-presentations │ 244 │ 170 │ 0.7 │
│ f1yegor/clickhouse_exporter │ 180 │ 125 │ 0.69 │
│ cloudflare/sqlalchemy-clickhouse │ 232 │ 158 │ 0.68 │
│ Percona-Lab/clickhousedb_fdw │ 117 │ 80 │ 0.68 │
│ housepower/clickhouse_sinker │ 218 │ 147 │ 0.67 │
│ roistat/go-clickhouse │ 175 │ 115 │ 0.66 │
│ Vertamedia/clickhouse-grafana │ 434 │ 281 │ 0.65 │
│ Altinity/clickhouse-rpm-install │ 142 │ 91 │ 0.64 │
│ smi2/clickhouse-frontend │ 195 │ 125 │ 0.64 │
│ yandex/graphouse │ 214 │ 136 │ 0.64 │
│ hatarist/clickhouse-cli │ 159 │ 99 │ 0.62 │
│ Percona-Lab/PromHouse │ 216 │ 133 │ 0.62 │
│ nikepan/clickhouse-bulk │ 241 │ 146 │ 0.61 │
│ Vertamedia/chproxy │ 504 │ 306 │ 0.61 │
│ HouseOps/HouseOps │ 172 │ 105 │ 0.61 │
│ ClickHouse-China/ClickhouseMeetup │ 113 │ 68 │ 0.6 │
│ Altinity/clickhouse-mysql-data-reader │ 250 │ 149 │ 0.6 │
│ smi2/tabix.ui │ 300 │ 179 │ 0.6 │
│ kszucs/pandahouse │ 114 │ 68 │ 0.6 │
│ ClickHouse/clickhouse-jdbc │ 217 │ 127 │ 0.59 │
│ VKCOM/lighthouse │ 101 │ 60 │ 0.59 │
│ mailru/go-clickhouse │ 219 │ 129 │ 0.59 │
│ kshvakov/clickhouse │ 637 │ 379 │ 0.59 │
│ mymarilyn/clickhouse-driver │ 507 │ 295 │ 0.58 │
│ tabixio/tabix │ 764 │ 442 │ 0.58 │
│ Altinity/clicktail │ 211 │ 123 │ 0.58 │
│ mkabilov/pg2ch │ 124 │ 71 │ 0.57 │
│ lomik/carbon-clickhouse │ 140 │ 79 │ 0.56 │
│ analysys/olap │ 109 │ 61 │ 0.56 │
│ housepower/ClickHouse-Native-JDBC │ 262 │ 143 │ 0.55 │
│ mintance/nginx-clickhouse │ 102 │ 56 │ 0.55 │
│ Infinidat/infi.clickhouse_orm │ 254 │ 136 │ 0.54 │
│ lomik/graphite-clickhouse │ 161 │ 87 │ 0.54 │
│ ClickHouse/clickhouse-presentations │ 202 │ 108 │ 0.53 │
│ vectorengine/vectorsql │ 176 │ 90 │ 0.51 │
│ enqueue/metabase-clickhouse-driver │ 167 │ 85 │ 0.51 │
│ VKCOM/kittenhouse │ 117 │ 59 │ 0.5 │
│ Altinity/clickhouse-operator │ 406 │ 203 │ 0.5 │
│ ClickHouse/clickhouse-go │ 457 │ 230 │ 0.5 │
│ apla/node-clickhouse │ 153 │ 75 │ 0.49 │
│ xzkostyan/clickhouse-sqlalchemy │ 154 │ 76 │ 0.49 │
│ AlexAkulov/clickhouse-backup │ 297 │ 146 │ 0.49 │
│ pingcap/titan │ 102 │ 49 │ 0.48 │
│ blynkkk/clickhouse4j │ 101 │ 48 │ 0.48 │
│ ivi-ru/flink-clickhouse-sink │ 124 │ 58 │ 0.47 │
│ killwort/ClickHouse-Net │ 132 │ 62 │ 0.47 │
│ siddontang/xcodis │ 167 │ 76 │ 0.46 │
│ TimoKersten/db-engine-paradigms │ 100 │ 45 │ 0.45 │
│ badoo/lsd │ 181 │ 79 │ 0.44 │
└───────────────────────────────────────┴─────────────┴──────────────────┴───────┘
50 rows in set. Elapsed: 2.423 sec. Processed 232.16 million rows, 5.86 GB (95.81 million rows/s., 2.42 GB/s.)
This is the perfect exploration tool for related repositories!
I also added a threshold on the number of stars (at least 100 unique stargazers per repository), and I'm counting stars from unique authors instead of just star events, because otherwise I face some data anomalies.
SELECT
repo_name,
uniq(actor_login) AS total_stars,
uniqIf(actor_login, actor_login IN
(
SELECT actor_login
FROM github_events
WHERE (event_type = 'WatchEvent') AND (repo_name IN ('torvalds/linux'))
)) AS clickhouse_stars,
round(clickhouse_stars / total_stars, 2) AS ratio
FROM github_events
WHERE (event_type = 'WatchEvent') AND (repo_name NOT IN ('torvalds/linux'))
GROUP BY repo_name
HAVING total_stars >= 100
ORDER BY ratio DESC
LIMIT 50
┌─repo_name─────────────────────┬─total_stars─┬─clickhouse_stars─┬─ratio─┐
│ torvalds/libdc-for-dirk │ 103 │ 80 │ 0.78 │
│ torvalds/test-tlb │ 364 │ 252 │ 0.69 │
│ torvalds/subsurface │ 1218 │ 811 │ 0.67 │
│ torvalds/pesconvert │ 184 │ 123 │ 0.67 │
│ archlinux/linux │ 177 │ 112 │ 0.63 │
│ subsurface/subsurface │ 186 │ 116 │ 0.62 │
│ torvalds/uemacs │ 681 │ 419 │ 0.62 │
│ torvalds/subsurface-for-dirk │ 169 │ 103 │ 0.61 │
│ joshumax/hurd │ 112 │ 68 │ 0.61 │
│ Subsurface-divelog/subsurface │ 718 │ 433 │ 0.6 │
│ coreutils/gnulib │ 144 │ 86 │ 0.6 │
│ apple/darwin-libpthread │ 120 │ 71 │ 0.59 │
│ coreos/grub │ 301 │ 174 │ 0.58 │
│ letolabs/nasm │ 102 │ 59 │ 0.58 │
│ ValveSoftware/steamos_kernel │ 343 │ 196 │ 0.57 │
│ apple/darwin-libplatform │ 109 │ 61 │ 0.56 │
│ haskell/HTTP │ 190 │ 104 │ 0.55 │
│ bminor/binutils-gdb │ 223 │ 123 │ 0.55 │
│ xfce-mirror/thunar │ 113 │ 62 │ 0.55 │
│ mirror/busybox │ 765 │ 420 │ 0.55 │
│ mirrors/gcc │ 435 │ 240 │ 0.55 │
│ Debian/apt │ 345 │ 190 │ 0.55 │
│ google/kmsan │ 268 │ 148 │ 0.55 │
│ rostedt/trace-cmd │ 129 │ 71 │ 0.55 │
│ GNOME/gparted │ 174 │ 95 │ 0.55 │
│ gregkh/kernel-development │ 528 │ 281 │ 0.53 │
│ freebsd/openlaunchd │ 101 │ 54 │ 0.53 │
│ bminor/bash │ 165 │ 87 │ 0.53 │
│ openSUSE/kernel │ 116 │ 62 │ 0.53 │
│ mirrors/chromium │ 106 │ 56 │ 0.53 │
│ GNOME/gnome-terminal │ 255 │ 135 │ 0.53 │
│ mkerrisk/man-pages │ 275 │ 147 │ 0.53 │
│ gregkh/linux │ 173 │ 92 │ 0.53 │
│ coreboot/seabios │ 154 │ 82 │ 0.53 │
│ huawei-openlab/oct │ 105 │ 56 │ 0.53 │
│ gregkh/kdbus │ 286 │ 152 │ 0.53 │
│ coreos/init │ 102 │ 54 │ 0.53 │
│ Stebalien/term │ 119 │ 62 │ 0.52 │
│ phuang/ibus │ 133 │ 69 │ 0.52 │
│ svn2github/valgrind │ 237 │ 124 │ 0.52 │
│ archlinux/archweb │ 167 │ 87 │ 0.52 │
│ shadow-maint/shadow │ 135 │ 70 │ 0.52 │
│ mirror/ncurses │ 159 │ 83 │ 0.52 │
│ bminor/glibc │ 481 │ 250 │ 0.52 │
│ xen-project/xen │ 326 │ 168 │ 0.52 │
│ GNOME/nautilus │ 265 │ 139 │ 0.52 │
│ vlang/vc │ 103 │ 54 │ 0.52 │
│ gatieme/KernelInKernel │ 118 │ 61 │ 0.52 │
│ dell/dkms │ 300 │ 153 │ 0.51 │
│ tytso/e2fsprogs │ 200 │ 102 │ 0.51 │
└───────────────────────────────┴─────────────┴──────────────────┴───────┘
50 rows in set. Elapsed: 2.580 sec. Processed 232.26 million rows, 5.86 GB (90.04 million rows/s., 2.27 GB/s.)
SELECT
repo_name,
uniq(actor_login) AS total_stars,
uniqIf(actor_login, actor_login IN
(
SELECT actor_login
FROM github_events
WHERE (event_type = 'WatchEvent') AND (repo_name IN ('llvm/llvm-project'))
)) AS clickhouse_stars,
round(clickhouse_stars / total_stars, 2) AS ratio
FROM github_events
WHERE (event_type = 'WatchEvent') AND (repo_name NOT IN ('llvm/llvm-project'))
GROUP BY repo_name
HAVING total_stars >= 100
ORDER BY ratio DESC
LIMIT 50
┌─repo_name──────────────────────────────────────┬─total_stars─┬─clickhouse_stars─┬─ratio─┐
│ AliveToolkit/alive2 │ 171 │ 75 │ 0.44 │
│ clangd/clangd │ 234 │ 91 │ 0.39 │
│ llvm/circt │ 228 │ 86 │ 0.38 │
│ rust-lang/team │ 108 │ 40 │ 0.37 │
│ banach-space/clang-tutor │ 164 │ 61 │ 0.37 │
│ bytecodealliance/wasm-tools │ 107 │ 39 │ 0.36 │
│ WebAssembly/wasi-libc │ 146 │ 52 │ 0.36 │
│ gimli-rs/cpp_demangle │ 124 │ 43 │ 0.35 │
│ google/llvm-propeller │ 250 │ 86 │ 0.34 │
│ bondhugula/pluto │ 129 │ 42 │ 0.33 │
│ llvm-mirror/lld │ 227 │ 75 │ 0.33 │
│ itanium-cxx-abi/cxx-abi │ 274 │ 89 │ 0.32 │
│ boostorg/hof │ 100 │ 32 │ 0.32 │
│ boostorg/mp11 │ 133 │ 42 │ 0.32 │
│ KhronosGroup/SPIRV-LLVM-Translator │ 219 │ 67 │ 0.31 │
│ HongxuChen/awesome-llvm │ 163 │ 50 │ 0.31 │
│ rust-lang/rust-forge │ 167 │ 52 │ 0.31 │
│ hfinkel/llvm-project-cxxjit │ 176 │ 54 │ 0.31 │
│ abenkhadra/llvm-pass-tutorial │ 117 │ 36 │ 0.31 │
│ apple/llvm-project │ 313 │ 95 │ 0.3 │
│ seahorn/crab-llvm │ 143 │ 43 │ 0.3 │
│ 8BitMate/reflect │ 103 │ 31 │ 0.3 │
│ polyglot-compiler/JLang │ 162 │ 48 │ 0.3 │
│ intel/llvm │ 396 │ 119 │ 0.3 │
│ banach-space/llvm-tutor │ 792 │ 234 │ 0.3 │
│ m4b/faerie │ 178 │ 53 │ 0.3 │
│ rust-lang/wg-allocators │ 112 │ 34 │ 0.3 │
│ flang-compiler/f18 │ 230 │ 68 │ 0.3 │
│ rui314/mold │ 248 │ 74 │ 0.3 │
│ llvm-mirror/polly │ 102 │ 31 │ 0.3 │
│ fwsGonzo/libriscv │ 119 │ 35 │ 0.29 │
│ wasmerio/wapm-cli │ 224 │ 66 │ 0.29 │
│ WebAssembly/tool-conventions │ 120 │ 35 │ 0.29 │
│ rust-lang/compiler-team │ 187 │ 54 │ 0.29 │
│ apple/indexstore-db │ 159 │ 46 │ 0.29 │
│ SamTebbs33/pluto │ 119 │ 34 │ 0.29 │
│ cdisselkoen/llvm-ir │ 101 │ 29 │ 0.29 │
│ GabrielDosReis/ipr │ 119 │ 35 │ 0.29 │
│ executors/executors │ 108 │ 31 │ 0.29 │
│ SRI-CSL/gllvm │ 112 │ 32 │ 0.29 │
│ ingve/awesome-clang │ 170 │ 47 │ 0.28 │
│ jk-jeon/dragonbox │ 119 │ 33 │ 0.28 │
│ zufuliu/llvm-utils │ 114 │ 32 │ 0.28 │
│ mgaudet/CompilerJobs │ 160 │ 44 │ 0.28 │
│ facebookexperimental/libunifex │ 347 │ 98 │ 0.28 │
│ google/llvm-bazel │ 118 │ 33 │ 0.28 │
│ bytecodealliance/cargo-wasi │ 152 │ 42 │ 0.28 │
│ google/iree │ 540 │ 151 │ 0.28 │
│ f0rki/mapping-high-level-constructs-to-llvm-ir │ 293 │ 82 │ 0.28 │
│ llvm-mirror/compiler-rt │ 312 │ 88 │ 0.28 │
└────────────────────────────────────────────────┴─────────────┴──────────────────┴───────┘
50 rows in set. Elapsed: 2.397 sec. Processed 232.14 million rows, 5.86 GB (96.85 million rows/s., 2.45 GB/s.)
Surprisingly, this simple heuristic works really well!
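The affinity index from the queries above can be sketched in a few lines of Python. This is only an illustration on made-up events: the `affinity` helper and the user names are hypothetical, and Python sets count unique users exactly, whereas ClickHouse's `uniq` returns an approximate count.

```python
from collections import defaultdict

# Hypothetical toy star events: (actor_login, repo_name). Not real data.
# alice starred chproxy twice to show why unique users are counted.
star_events = [
    ("alice", "ClickHouse/ClickHouse"),
    ("bob", "ClickHouse/ClickHouse"),
    ("alice", "Vertamedia/chproxy"),
    ("alice", "Vertamedia/chproxy"),
    ("bob", "Vertamedia/chproxy"),
    ("carol", "Vertamedia/chproxy"),
    ("alice", "facebook/react"),
    ("carol", "facebook/react"),
    ("dave", "facebook/react"),
]


def affinity(events, target_repos, min_stars=100, limit=50):
    """Share of each repository's unique stargazers who also starred the target."""
    # Subquery: users who starred the target repositories.
    fans = {actor for actor, repo in events if repo in target_repos}
    # GROUP BY repo_name with unique stargazers per repository.
    stargazers = defaultdict(set)
    for actor, repo in events:
        if repo not in target_repos:
            stargazers[repo].add(actor)
    rows = []
    for repo, actors in stargazers.items():
        total = len(actors)
        if total < min_stars:  # HAVING total_stars >= min_stars
            continue
        shared = len(actors & fans)
        rows.append((repo, total, shared, round(shared / total, 2)))
    rows.sort(key=lambda row: row[3], reverse=True)  # ORDER BY ratio DESC
    return rows[:limit]


# The threshold is lowered to 3 only because the toy data is tiny.
result = affinity(star_events, {"ClickHouse/ClickHouse"}, min_stars=3)
print(result)
```

Without the threshold, a repository with a single stargazer who happens to like ClickHouse would get a ratio of 1 and float to the top, which is one reason to keep `HAVING total_stars >= 100`.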
Let me find friends by the overlap in starred repositories.
WITH repo_name IN
(
SELECT repo_name
FROM github_events
WHERE (event_type = 'WatchEvent') AND (actor_login IN ('alexey-milovidov'))
) AS is_my_repo
SELECT
actor_login,
sum(is_my_repo) AS stars_my,
sum(NOT is_my_repo) AS stars_other,
round(stars_my / (203 + stars_other), 3) AS ratio
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY actor_login
ORDER BY ratio DESC
LIMIT 50
┌─actor_login───────┬─stars_my─┬─stars_other─┬─ratio─┐
1. │ alexey-milovidov │ 203 │ 0 │ 1 │
2. │ leicsss │ 14 │ 14 │ 0.065 │
3. │ exitNA │ 31 │ 293 │ 0.062 │
4. │ filimonov │ 19 │ 152 │ 0.054 │
5. │ LingangJiang │ 13 │ 64 │ 0.049 │
6. │ geofflangdale │ 10 │ 11 │ 0.047 │
7. │ lzy305 │ 12 │ 58 │ 0.046 │
8. │ LiuYangkuan │ 10 │ 15 │ 0.046 │
9. │ isublimity │ 12 │ 64 │ 0.045 │
10. │ gmaurel │ 9 │ 1 │ 0.044 │
11. │ peterborodatyy │ 13 │ 97 │ 0.043 │
12. │ hagen1778 │ 22 │ 331 │ 0.041 │
13. │ ludv1x │ 9 │ 23 │ 0.04 │
14. │ jason-godden │ 8 │ 3 │ 0.039 │
15. │ powturbo │ 9 │ 25 │ 0.039 │
16. │ artpaul │ 10 │ 51 │ 0.039 │
17. │ hodgesrm │ 8 │ 8 │ 0.038 │
18. │ gabime │ 13 │ 141 │ 0.038 │
19. │ gecko984 │ 8 │ 10 │ 0.038 │
20. │ jackpgao │ 15 │ 209 │ 0.036 │
21. │ mkabilov │ 10 │ 95 │ 0.034 │
22. │ vstakhov │ 12 │ 148 │ 0.034 │
23. │ eden-shiwm │ 9 │ 75 │ 0.032 │
24. │ openbsod │ 19 │ 413 │ 0.031 │
25. │ xiaoyem │ 8 │ 67 │ 0.03 │
26. │ UmGabrielQualquer │ 6 │ 0 │ 0.03 │
27. │ Johnny-Three │ 10 │ 140 │ 0.029 │
28. │ as-hu-aabk │ 10 │ 136 │ 0.029 │
29. │ nenomius │ 12 │ 216 │ 0.029 │
30. │ ztlpn │ 7 │ 41 │ 0.029 │
31. │ oliverchang │ 6 │ 14 │ 0.028 │
32. │ leannor │ 7 │ 51 │ 0.028 │
33. │ ahmedsharifkhan │ 6 │ 10 │ 0.028 │
34. │ zhang2014 │ 7 │ 43 │ 0.028 │
35. │ dairui2 │ 9 │ 127 │ 0.027 │
36. │ yakud │ 8 │ 97 │ 0.027 │
37. │ daskol │ 10 │ 173 │ 0.027 │
38. │ devjgm │ 6 │ 21 │ 0.027 │
39. │ bocharov │ 6 │ 17 │ 0.027 │
40. │ 251 │ 9 │ 149 │ 0.026 │
41. │ s-mx │ 8 │ 109 │ 0.026 │
42. │ 4ertus2 │ 6 │ 31 │ 0.026 │
43. │ Vitaliy-Grigoriev │ 9 │ 138 │ 0.026 │
44. │ kshvakov │ 32 │ 1076 │ 0.025 │
45. │ yfedoseev │ 7 │ 78 │ 0.025 │
46. │ chinaworld │ 13 │ 312 │ 0.025 │
47. │ nikitamikhaylov │ 6 │ 35 │ 0.025 │
48. │ geek-li │ 5 │ 1 │ 0.025 │
49. │ xzkostyan │ 5 │ 8 │ 0.024 │
50. │ break1ngpoint │ 6 │ 52 │ 0.024 │
└───────────────────┴──────────┴─────────────┴───────┘
50 rows in set. Elapsed: 4.376 sec. Processed 464.26 million rows, 9.66 GB (106.10 million rows/s., 2.21 GB/s.)
Here are my friends... and I don't know what to do with this list.
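The ratio in the query above looks like a Jaccard index: shared stars divided by the size of the union of both star lists (the constant 203 is my own star count). Here is a minimal Python sketch under that reading. The data and the `jaccard_neighbors` helper are hypothetical, and the sketch assumes every user starred each repository at most once.

```python
# Hypothetical toy data: user -> set of starred repositories. Not real data.
stars_by_user = {
    "me": {"a/a", "b/b", "c/c", "d/d"},
    "close_friend": {"a/a", "b/b"},
    "heavy_starrer": {"a/a", "x/x", "y/y", "z/z"},
}


def jaccard_neighbors(stars, me, limit=50):
    """Rank users by |shared| / |union| with respect to the stars of `me`."""
    mine = stars[me]  # assumed to be non-empty
    rows = []
    for user, repos in stars.items():
        stars_my = len(repos & mine)     # sum(is_my_repo)
        stars_other = len(repos - mine)  # sum(NOT is_my_repo)
        # len(mine) plays the role of the constant 203 in the query.
        ratio = round(stars_my / (len(mine) + stars_other), 3)
        rows.append((user, stars_my, stars_other, ratio))
    rows.sort(key=lambda row: row[3], reverse=True)  # ORDER BY ratio DESC
    return rows[:limit]


neighbors = jaccard_neighbors(stars_by_user, "me")
print(neighbors)
```

The denominator penalizes users who star everything: `heavy_starrer` shares one repository with me but ends up far below `close_friend`.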
Which other repositories did ClickHouse contributors also contribute to?
SELECT
repo_name,
count() AS prs,
uniq(actor_login) AS authors
FROM github_events
WHERE (event_type = 'PullRequestEvent') AND (action = 'opened') AND (actor_login IN
(
SELECT actor_login
FROM github_events
WHERE (event_type = 'PullRequestEvent') AND (action = 'opened') AND (repo_name IN ('yandex/ClickHouse', 'ClickHouse/ClickHouse'))
)) AND (repo_name NOT ILIKE '%ClickHouse%')
GROUP BY repo_name
ORDER BY authors DESC
LIMIT 50
┌─repo_name─────────────────────────┬──prs─┬─authors─┐
│ Homebrew/homebrew-core │ 33 │ 15 │
│ pocoproject/poco │ 28 │ 14 │
│ saltstack/salt │ 38 │ 13 │
│ apache/flink │ 54 │ 12 │
│ ansible/ansible │ 42 │ 12 │
│ grafana/grafana │ 87 │ 10 │
│ helm/charts │ 35 │ 10 │
│ avelino/awesome-go │ 33 │ 10 │
│ facebook/rocksdb │ 18 │ 9 │
│ kubernetes/kubernetes │ 17 │ 9 │
│ apache/spark │ 18 │ 9 │
│ NixOS/nixpkgs │ 208 │ 9 │
│ getredash/redash │ 20 │ 9 │
│ python/cpython │ 18 │ 8 │
│ Vertamedia/chproxy │ 30 │ 8 │
│ hashicorp/consul │ 9 │ 8 │
│ django/django │ 24 │ 8 │
│ catboost/catboost │ 13 │ 7 │
│ grafana/grafana-plugin-repository │ 35 │ 7 │
│ DefinitelyTyped/DefinitelyTyped │ 26 │ 7 │
│ apache/incubator-doris │ 68 │ 7 │
│ caskroom/homebrew-cask │ 32 │ 7 │
│ iovisor/bcc │ 36 │ 7 │
│ ant-design/ant-design │ 236 │ 6 │
│ tomchristie/django-rest-framework │ 8 │ 6 │
│ edenhill/librdkafka │ 8 │ 6 │
│ aio-libs/aiohttp │ 515 │ 6 │
│ grpc/grpc │ 11 │ 6 │
│ nodejs/node │ 14 │ 6 │
│ antirez/redis │ 7 │ 6 │
│ Shopify/sarama │ 25 │ 6 │
│ ceph/ceph │ 14 │ 5 │
│ prometheus/client_java │ 25 │ 5 │
│ pinterest/secor │ 1074 │ 5 │
│ VictoriaMetrics/VictoriaMetrics │ 99 │ 5 │
│ golang/go │ 5 │ 5 │
│ symfony/symfony │ 13 │ 5 │
│ apache/hadoop │ 11 │ 5 │
│ TechEmpower/FrameworkBenchmarks │ 39 │ 5 │
│ apache/mesos │ 5 │ 5 │
│ pingcap/tidb │ 100 │ 5 │
│ php/php-src │ 12 │ 5 │
│ apache/incubator-dolphinscheduler │ 51 │ 5 │
│ apache/zookeeper │ 15 │ 5 │
│ simdjson/simdjson │ 7 │ 5 │
│ yiisoft/yii2 │ 17 │ 5 │
│ Homebrew/brew │ 627 │ 5 │
│ apache/airflow │ 13 │ 5 │
│ rust-lang/rust │ 6 │ 5 │
│ yandex/graphouse │ 74 │ 5 │
└───────────────────────────────────┴──────┴─────────┘
50 rows in set. Elapsed: 0.562 sec. Processed 214.68 million rows, 2.82 GB (382.16 million rows/s., 5.01 GB/s.)
Which other repositories did people who filed issues in ClickHouse also file issues in?
SELECT
repo_name,
count() AS prs,
uniq(actor_login) AS authors
FROM github_events
WHERE (event_type = 'IssuesEvent') AND (action = 'opened') AND (actor_login IN
(
SELECT actor_login
FROM github_events
WHERE (event_type = 'IssuesEvent') AND (action = 'opened') AND (repo_name IN ('yandex/ClickHouse', 'ClickHouse/ClickHouse'))
)) AND (repo_name NOT ILIKE '%ClickHouse%')
GROUP BY repo_name
ORDER BY authors DESC
LIMIT 50
┌─repo_name─────────────────────────┬─prs─┬─authors─┐
│ grafana/grafana │ 224 │ 68 │
│ golang/go │ 186 │ 54 │
│ apache/incubator-superset │ 124 │ 36 │
│ kubernetes/kubernetes │ 98 │ 34 │
│ elastic/elasticsearch │ 66 │ 30 │
│ ansible/ansible │ 53 │ 30 │
│ pingcap/tidb │ 122 │ 24 │
│ Microsoft/vscode │ 39 │ 23 │
│ travis-ci/travis-ci │ 32 │ 23 │
│ docker/docker │ 58 │ 23 │
│ prestodb/presto │ 97 │ 22 │
│ prometheus/prometheus │ 50 │ 22 │
│ sysown/proxysql │ 43 │ 21 │
│ Vertamedia/chproxy │ 31 │ 21 │
│ VictoriaMetrics/VictoriaMetrics │ 75 │ 20 │
│ metabase/metabase │ 56 │ 20 │
│ telegramdesktop/tdesktop │ 85 │ 19 │
│ getredash/redash │ 48 │ 18 │
│ tabixio/tabix │ 39 │ 18 │
│ edenhill/librdkafka │ 36 │ 18 │
│ influxdata/influxdb │ 30 │ 17 │
│ pypa/pip │ 16 │ 16 │
│ saltstack/salt │ 71 │ 16 │
│ alibaba/druid │ 32 │ 16 │
│ apache/incubator-dolphinscheduler │ 39 │ 16 │
│ tensorflow/tensorflow │ 32 │ 16 │
│ rust-lang/rust │ 32 │ 16 │
│ kubernetes/ingress-nginx │ 21 │ 15 │
│ pandas-dev/pandas │ 26 │ 15 │
│ minio/minio │ 31 │ 15 │
│ yiisoft/yii2 │ 43 │ 15 │
│ elastic/logstash │ 21 │ 15 │
│ timberio/vector │ 178 │ 15 │
│ druid-io/druid │ 83 │ 15 │
│ containous/traefik │ 18 │ 14 │
│ rancher/rancher │ 64 │ 14 │
│ apache/incubator-doris │ 41 │ 14 │
│ grpc/grpc │ 30 │ 14 │
│ antirez/redis │ 18 │ 14 │
│ gitlabhq/gitlabhq │ 17 │ 14 │
│ elastic/kibana │ 17 │ 14 │
│ prestosql/presto │ 89 │ 14 │
│ go-gitea/gitea │ 21 │ 13 │
│ systemd/systemd │ 25 │ 13 │
│ helm/charts │ 26 │ 13 │
│ rails/rails │ 36 │ 13 │
│ nodejs/node │ 20 │ 13 │
│ hashicorp/consul │ 27 │ 13 │
│ alibaba/DataX │ 15 │ 13 │
│ facebook/folly │ 17 │ 12 │
└───────────────────────────────────┴─────┴─────────┘
50 rows in set. Elapsed: 0.297 sec. Processed 111.30 million rows, 1.59 GB (374.69 million rows/s., 5.35 GB/s.)
SELECT
repo_name,
toDate(created_at) AS day,
count() AS stars
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY
repo_name,
day
ORDER BY count() DESC
LIMIT 1 BY repo_name
LIMIT 50
┌─repo_name──────────────────────────────────────┬────────day─┬─stars─┐
1. │ 996icu/996.ICU │ 2019-03-28 │ 76056 │
2. │ M4cs/BabySploit │ 2019-09-08 │ 46985 │
3. │ x64dbg/x64dbg │ 2018-01-06 │ 26459 │
4. │ microsoft/Terminal │ 2019-05-07 │ 14419 │
5. │ Rohfosho/CosmosBrowserBackend │ 2014-10-13 │ 13345 │
6. │ apple/swift │ 2015-12-04 │ 10619 │
7. │ desktop/desktop │ 2020-11-16 │ 10047 │
8. │ openbilibili/go-common │ 2019-04-22 │ 8176 │
9. │ amattson21/gitjob │ 2016-10-08 │ 6511 │
10. │ mr-mig/every-programmer-should-know │ 2017-09-05 │ 6433 │
11. │ skylot/jadx │ 2018-01-06 │ 6319 │
12. │ phase/sure │ 2015-06-16 │ 6039 │
13. │ Konloch/bytecode-viewer │ 2018-01-06 │ 6000 │
14. │ jlevy/the-art-of-command-line │ 2015-06-16 │ 5897 │
15. │ yarnpkg/yarn │ 2016-10-12 │ 5823 │
16. │ DankMemer/webhook-server │ 2019-02-16 │ 5667 │
17. │ tensorflow/tensorflow │ 2015-11-10 │ 5247 │
18. │ open-guides/og-aws │ 2016-10-12 │ 5097 │
19. │ danistefanovic/build-your-own-x │ 2018-05-13 │ 5077 │
20. │ facebookresearch/Detectron │ 2018-01-23 │ 5036 │
21. │ Microsoft/calculator │ 2019-03-07 │ 4965 │
22. │ NationalSecurityAgency/ghidra │ 2019-03-06 │ 4908 │
23. │ FallibleInc/security-guide-for-developers │ 2016-07-22 │ 4861 │
24. │ vk-com/kphp-kdb │ 2018-01-07 │ 4723 │
25. │ LingDong-/wenyan-lang │ 2019-12-18 │ 4478 │
26. │ facebook/react-native │ 2015-03-27 │ 4324 │
27. │ MSWorkers/support.996.ICU │ 2019-04-23 │ 4310 │
28. │ chrislgarry/Apollo-11 │ 2016-07-10 │ 4309 │
29. │ jwasham/google-interview-university │ 2016-10-06 │ 4300 │
30. │ google/material-design-lite │ 2015-07-07 │ 4209 │
31. │ GoogleChrome/puppeteer │ 2017-08-17 │ 4162 │
32. │ NARKOZ/hacker-scripts │ 2015-11-24 │ 4100 │
33. │ donnemartin/system-design-primer │ 2017-03-09 │ 4087 │
34. │ houshanren/hangzhou_house_knowledge │ 2018-02-26 │ 4085 │
35. │ AioiLight/TJAPlayer3 │ 2019-01-29 │ 4076 │
36. │ sdmg15/Best-websites-a-programmer-should-visit │ 2017-06-07 │ 4056 │
37. │ vuejs/vue │ 2018-06-15 │ 3897 │
38. │ Dman95/SASM │ 2018-01-07 │ 3761 │
39. │ formulahendry/955.WLB │ 2019-03-29 │ 3714 │
40. │ facebook/prepack │ 2017-05-04 │ 3709 │
41. │ trekhleb/javascript-algorithms │ 2018-05-24 │ 3655 │
42. │ bradtraversy/design-resources-for-developers │ 2020-05-07 │ 3624 │
43. │ facebook/react │ 2018-06-15 │ 3563 │
44. │ ry/deno │ 2018-05-31 │ 3536 │
45. │ minimaxir/big-list-of-naughty-strings │ 2017-01-16 │ 3533 │
46. │ kelseyhightower/nocode │ 2018-02-07 │ 3528 │
47. │ felixrieseberg/windows95 │ 2018-08-24 │ 3518 │
48. │ dylanaraps/pure-bash-bible │ 2019-09-19 │ 3501 │
49. │ lib-pku/libpku │ 2019-04-09 │ 3366 │
50. │ Awesome-HarmonyOS/HarmonyOS │ 2019-08-13 │ 3324 │
└────────────────────────────────────────────────┴────────────┴───────┘
50 rows in set. Elapsed: 8.959 sec. Processed 232.13 million rows, 2.74 GB (25.91 million rows/s., 305.37 MB/s.)
GitHub Desktop is the most recent entry. There is also "openbilibili" (Bilibili is something like a Chinese YouTube), but the repository was closed due to a DMCA takedown.
I accidentally made a similar query for repositories that gained the most stars over just one second:
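The query above relies on the `LIMIT 1 BY repo_name` clause, which keeps only the first row per repository after sorting, so every repository appears once with its best day. A minimal Python sketch of that behavior, on hypothetical events and with a hypothetical `peak_days` helper:

```python
from collections import Counter

# Hypothetical toy star events: (repo_name, day). Not real data.
star_events = [
    ("a/x", "2019-03-28"),
    ("a/x", "2019-03-28"),
    ("a/x", "2019-03-28"),
    ("a/x", "2019-03-29"),
    ("a/x", "2019-03-29"),
    ("b/y", "2019-03-28"),
]


def peak_days(events, limit=50):
    """Best single day per repository, ordered by stars on that day."""
    # GROUP BY repo_name, day: stars per (repository, day) pair.
    daily = Counter(events)
    # ORDER BY count() DESC
    ranked = sorted(daily.items(), key=lambda item: item[1], reverse=True)
    seen = set()
    result = []
    for (repo, day), stars in ranked:
        if repo in seen:  # LIMIT 1 BY repo_name: skip later rows of this repo
            continue
        seen.add(repo)
        result.append((repo, day, stars))
        if len(result) == limit:  # LIMIT 50
            break
    return result


peaks = peak_days(star_events)
print(peaks)
```

Dropping the `seen` check corresponds to removing `LIMIT 1 BY`: the same repository could then occupy several rows, one for each of its busy days.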
:) SELECT repo_name, created_at, count() AS stars FROM github_events WHERE event_type = 'WatchEvent' GROUP BY repo_name, created_at ORDER BY count() DESC LIMIT 50
┌─repo_name────────────────────────────┬──────────created_at─┬─stars─┐
1. │ expressjs/express │ 2018-04-29 20:30:59 │ 370 │
2. │ expressjs/express │ 2018-04-29 20:31:00 │ 195 │
3. │ danieldaeschle/swapy │ 2018-04-29 16:10:44 │ 132 │
4. │ flutter/flutter │ 2018-04-29 15:13:58 │ 118 │
5. │ danieldaeschle/amazon-music-linux │ 2018-04-08 19:15:47 │ 100 │
6. │ Apress/java-regular-expressions │ 2018-04-29 20:29:01 │ 99 │
7. │ danieldaeschle/amazon-music-linux │ 2018-04-08 19:15:42 │ 93 │
8. │ Apress/java-regular-expressions │ 2018-04-29 20:29:06 │ 93 │
9. │ simplyianm/preston │ 2015-02-28 03:58:12 │ 91 │
10. │ danieldaeschle/swapy │ 2018-04-29 16:10:41 │ 91 │
11. │ tejasmanohar/omnus │ 2015-02-28 03:59:37 │ 89 │
12. │ Dman95/SASM │ 2018-01-07 02:09:35 │ 88 │
13. │ simplyianm/motivate │ 2015-02-28 03:58:32 │ 86 │
14. │ flutter/flutter │ 2018-04-29 15:13:55 │ 83 │
15. │ LRB-Experiments/PDForumNotifications │ 2018-04-29 15:15:07 │ 83 │
16. │ LRB-Experiments/PDForumNotifications │ 2018-04-29 15:15:04 │ 82 │
17. │ torusrxxx/x64dbg │ 2018-01-07 01:54:02 │ 82 │
18. │ borisrepl/boris │ 2015-07-13 07:52:56 │ 81 │
19. │ HJLebbink/asm-dude │ 2018-01-07 01:57:30 │ 81 │
20. │ tejasmanohar/twilio-plays-2048 │ 2015-02-28 03:56:55 │ 79 │
21. │ x64dbg/x64dbg │ 2018-01-07 01:47:08 │ 79 │
22. │ tj/n │ 2015-02-28 04:03:14 │ 79 │
23. │ danieldaeschle/amazon-music-linux │ 2018-04-08 19:15:39 │ 78 │
24. │ x64dbg/x64dbg │ 2018-01-07 01:46:55 │ 78 │
25. │ Dman95/SASM │ 2018-01-07 02:08:59 │ 77 │
26. │ x64dbg/docs │ 2018-01-06 06:45:25 │ 76 │
27. │ vk-com/kphp-kdb │ 2018-01-07 02:06:09 │ 76 │
28. │ x64dbg/x64dbgpy │ 2018-01-07 01:59:23 │ 75 │
29. │ aionnetwork/aion │ 2018-05-10 20:25:09 │ 75 │
30. │ Dman95/SASM │ 2018-01-07 02:09:31 │ 74 │
31. │ LordZamy/BurstTab │ 2015-02-28 04:57:11 │ 74 │
32. │ ThomasJaeger/VisualMASM │ 2018-01-06 06:51:24 │ 74 │
33. │ Dman95/SASM │ 2018-01-07 02:08:23 │ 73 │
34. │ x64dbg/x64dbgpy │ 2018-01-07 01:59:09 │ 73 │
35. │ tejasmanohar/npm-algos │ 2015-07-13 07:50:59 │ 73 │
36. │ x64dbg/x64dbg │ 2018-01-07 01:47:30 │ 73 │
37. │ Dman95/SASM │ 2018-01-07 02:09:12 │ 73 │
38. │ x64dbg/x64dbg │ 2018-01-06 07:24:09 │ 73 │
39. │ Dman95/SASM │ 2018-01-07 02:08:54 │ 72 │
40. │ vk-com/kphp-kdb │ 2018-01-07 02:06:18 │ 72 │
41. │ x64dbg/x64dbg │ 2018-01-07 01:47:28 │ 72 │
42. │ x64dbg/x64dbg │ 2018-01-06 07:18:02 │ 71 │
43. │ vk-com/kphp-kdb │ 2018-01-07 02:06:38 │ 71 │
44. │ x64dbg/x64dbg │ 2018-01-07 01:47:17 │ 71 │
45. │ andrewtian/whatcanidoforbitcoin.xyz │ 2015-09-27 21:45:27 │ 71 │
46. │ x64dbg/x64dbg │ 2018-01-06 07:22:56 │ 70 │
47. │ pfmoore/shimmy │ 2018-04-29 13:02:22 │ 69 │
48. │ x64dbg/x64dbg │ 2018-01-07 01:47:10 │ 69 │
49. │ flutter/flutter │ 2018-04-29 15:13:32 │ 69 │
50. │ qw3rtman/gg │ 2015-08-19 02:45:39 │ 68 │
└──────────────────────────────────────┴─────────────────────┴───────┘
50 rows in set. Elapsed: 14.476 sec. Processed 232.13 million rows, 2.74 GB (16.04 million rows/s., 188.99 MB/s.)
Another JavaScript framework got 370 stars in one second on a peaceful April evening? It's complete madness... By the way, what is borisrepl?
WITH toYear(created_at) AS year
SELECT
repo_name,
sum(year = 2020) AS stars2020,
sum(year = 2019) AS stars2019,
round(stars2020 / stars2019, 3) AS yoy,
min(created_at) AS first_seen
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY repo_name
HAVING (min(created_at) <= '2019-01-01 00:00:00') AND (stars2019 >= 1000)
ORDER BY yoy DESC
LIMIT 50
┌─repo_name──────────────────────────────────┬─stars2020─┬─stars2019─┬───yoy─┬──────────first_seen─┐
1. │ puppeteer/puppeteer │ 11310 │ 1255 │ 9.012 │ 2014-03-14 12:01:39 │
2. │ ohmyzsh/ohmyzsh │ 21705 │ 2529 │ 8.582 │ 2017-03-22 16:44:20 │
3. │ halfrost/LeetCode-Go │ 8887 │ 1191 │ 7.462 │ 2018-04-16 09:29:48 │
4. │ jitsi/jitsi-meet │ 10451 │ 1548 │ 6.751 │ 2014-03-31 20:41:15 │
5. │ desktop/desktop │ 12512 │ 2600 │ 4.812 │ 2017-05-16 17:10:37 │
6. │ TheAlgorithms/Javascript │ 6009 │ 1312 │ 4.58 │ 2017-08-03 16:27:42 │
7. │ iamadamdev/bypass-paywalls-chrome │ 9503 │ 2246 │ 4.231 │ 2018-04-07 23:54:45 │
8. │ r-spacex/SpaceX-API │ 4020 │ 1083 │ 3.712 │ 2017-06-26 07:00:38 │
9. │ loadimpact/k6 │ 5018 │ 1352 │ 3.712 │ 2017-01-18 18:26:09 │
10. │ openspug/spug │ 4086 │ 1116 │ 3.661 │ 2018-02-10 11:01:08 │
11. │ Dod-o/Statistical-Learning-Method_Code │ 5227 │ 1437 │ 3.637 │ 2018-11-15 16:38:25 │
12. │ Fndroid/clash_for_windows_pkg │ 9438 │ 2979 │ 3.168 │ 2018-10-19 00:14:07 │
13. │ williamfiset/Algorithms │ 5339 │ 1778 │ 3.003 │ 2017-05-12 03:37:24 │
14. │ aniftyco/awesome-tailwindcss │ 2992 │ 1007 │ 2.971 │ 2018-08-07 10:00:59 │
15. │ LingCoder/OnJava8 │ 11972 │ 4082 │ 2.933 │ 2018-12-06 07:16:35 │
16. │ the1812/Bilibili-Evolved │ 3678 │ 1255 │ 2.931 │ 2018-08-30 09:55:58 │
17. │ jgraph/drawio-desktop │ 9253 │ 3293 │ 2.81 │ 2017-06-07 17:17:38 │
18. │ darlinghq/darling │ 3130 │ 1133 │ 2.763 │ 2015-10-11 15:42:44 │
19. │ twintproject/twint │ 5899 │ 2254 │ 2.617 │ 2018-06-12 03:34:17 │
20. │ ctgk/PRML │ 5129 │ 1970 │ 2.604 │ 2017-07-03 03:06:25 │
21. │ tomnomnom/gron │ 2733 │ 1051 │ 2.6 │ 2016-06-04 23:19:04 │
22. │ trojan-gfw/trojan │ 10218 │ 3945 │ 2.59 │ 2017-11-13 05:54:52 │
23. │ dabreegster/abstreet │ 3865 │ 1503 │ 2.572 │ 2018-06-06 22:07:38 │
24. │ an-tao/drogon │ 3223 │ 1262 │ 2.554 │ 2018-05-04 10:03:32 │
25. │ ritchieng/the-incredible-pytorch │ 3499 │ 1401 │ 2.498 │ 2017-02-11 08:37:30 │
26. │ elsewhencode/project-guidelines │ 3752 │ 1503 │ 2.496 │ 2018-03-21 18:06:16 │
27. │ AlexeyAB/darknet │ 8569 │ 3465 │ 2.473 │ 2016-12-04 19:18:28 │
28. │ denoland/deno │ 31550 │ 12833 │ 2.459 │ 2018-08-03 00:26:45 │
29. │ GitSquared/edex-ui │ 14151 │ 5797 │ 2.441 │ 2017-10-14 15:55:21 │
30. │ MathewSachin/Captura │ 3392 │ 1392 │ 2.437 │ 2015-09-12 12:31:42 │
31. │ Aircoookie/WLED │ 2767 │ 1141 │ 2.425 │ 2017-12-17 08:36:13 │
32. │ edent/SuperTinyIcons │ 3747 │ 1559 │ 2.403 │ 2017-11-12 01:57:38 │
33. │ tiangolo/fastapi │ 17962 │ 7474 │ 2.403 │ 2018-12-08 10:05:29 │
34. │ remoteintech/remote-jobs │ 6819 │ 2858 │ 2.386 │ 2017-09-09 13:12:05 │
35. │ retejs/rete │ 2541 │ 1075 │ 2.364 │ 2018-05-31 03:06:05 │
36. │ Anuken/Mindustry │ 5439 │ 2322 │ 2.342 │ 2017-05-12 17:52:31 │
37. │ Dreamacro/clash │ 8836 │ 3850 │ 2.295 │ 2018-06-10 14:41:49 │
38. │ debauchee/barrier │ 5731 │ 2505 │ 2.288 │ 2018-02-23 12:36:21 │
39. │ ryansolid/solid │ 3249 │ 1433 │ 2.267 │ 2018-05-26 14:28:51 │
40. │ wilsonfreitas/awesome-quant │ 2362 │ 1053 │ 2.243 │ 2016-05-10 10:19:52 │
41. │ TheAlgorithms/Go │ 2905 │ 1299 │ 2.236 │ 2017-07-11 04:34:48 │
42. │ JanDeDobbeleer/oh-my-posh │ 2748 │ 1229 │ 2.236 │ 2018-04-12 04:42:01 │
43. │ The-Art-of-Hacking/h4cker │ 5791 │ 2622 │ 2.209 │ 2018-10-06 12:02:56 │
44. │ TheCherno/Hazel │ 2852 │ 1292 │ 2.207 │ 2018-10-20 04:01:11 │
45. │ FreeCAD/FreeCAD │ 3897 │ 1777 │ 2.193 │ 2015-07-04 23:06:13 │
46. │ squidfunk/mkdocs-material │ 2318 │ 1066 │ 2.174 │ 2016-02-09 19:27:50 │
47. │ JaeYeopHan/Interview_Question_for_Beginner │ 3727 │ 1719 │ 2.168 │ 2017-06-16 15:39:58 │
48. │ Requarks/wiki │ 5999 │ 2769 │ 2.166 │ 2016-08-30 18:46:22 │
49. │ midwayjs/midway │ 2257 │ 1052 │ 2.145 │ 2018-07-12 07:12:06 │
50. │ jwasham/coding-interview-university │ 54591 │ 25556 │ 2.136 │ 2017-02-27 20:09:29 │
└────────────────────────────────────────────┴───────────┴───────────┴───────┴─────────────────────┘
50 rows in set. Elapsed: 0.779 sec. Processed 232.13 million rows, 2.74 GB (297.88 million rows/s., 3.51 GB/s.)
It's not surprising to see that Jitsi (open-source video conferencing) grew almost sevenfold in 2020!
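The year-over-year calculation above boils down to two conditional counters per repository plus the earliest star timestamp. A minimal Python sketch on hypothetical events (the `yoy_growth` helper and the repository names are made up for illustration):

```python
from datetime import datetime

# Hypothetical toy star events: (repo_name, created_at). Not real data.
star_events = [
    ("a/grow", datetime(2018, 6, 1)),
    ("a/grow", datetime(2019, 2, 1)),
    ("a/grow", datetime(2020, 1, 1)),
    ("a/grow", datetime(2020, 2, 1)),
    ("a/grow", datetime(2020, 3, 1)),
    ("b/new", datetime(2019, 5, 1)),
    ("b/new", datetime(2020, 1, 1)),
    ("c/flat", datetime(2018, 1, 1)),
    ("c/flat", datetime(2019, 1, 5)),
    ("c/flat", datetime(2019, 3, 1)),
    ("c/flat", datetime(2020, 1, 1)),
]


def yoy_growth(events, min_stars_2019=1000, seen_by=datetime(2019, 1, 1), limit=50):
    """Ratio of 2020 stars to 2019 stars for repositories first seen by `seen_by`."""
    stats = {}  # repo -> [stars2020, stars2019, first_seen]
    for repo, created_at in events:
        row = stats.setdefault(repo, [0, 0, created_at])
        row[0] += created_at.year == 2020    # sum(year = 2020)
        row[1] += created_at.year == 2019    # sum(year = 2019)
        row[2] = min(row[2], created_at)     # min(created_at)
    # HAVING: the filter runs before the division, so min_stars_2019 >= 1
    # also protects against dividing by zero.
    rows = [
        (repo, s2020, s2019, round(s2020 / s2019, 3), first_seen)
        for repo, (s2020, s2019, first_seen) in stats.items()
        if first_seen <= seen_by and s2019 >= min_stars_2019
    ]
    rows.sort(key=lambda row: row[3], reverse=True)  # ORDER BY yoy DESC
    return rows[:limit]


# The threshold is lowered to 1 only because the toy data is tiny.
growth = yoy_growth(star_events, min_stars_2019=1)
print(growth)
```

The `first_seen` filter drops `b/new`, which got its first star in 2019: without it, repositories that existed for only part of 2019 would show inflated growth.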
WITH toYear(created_at) AS year
SELECT
repo_name,
sum(year = 2020) AS stars2020,
sum(year = 2019) AS stars2019,
round(stars2020 / stars2019, 3) AS yoy,
min(created_at) AS first_seen
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY repo_name
HAVING (min(created_at) <= '2019-01-01 00:00:00') AND (max(created_at) >= '2020-06-01 00:00:00') AND (stars2019 >= 1000)
ORDER BY yoy ASC
LIMIT 50
┌─repo_name─────────────────────────────────────┬─stars2020─┬─stars2019─┬───yoy─┬──────────first_seen─┐
1. │ DoubleLabyrinth/navicat-keygen │ 1 │ 6459 │ 0 │ 2017-12-08 02:04:48 │
2. │ arpitjindal97/technology_books │ 1 │ 4001 │ 0 │ 2017-08-16 19:08:59 │
3. │ b3log/pipe │ 2 │ 2380 │ 0.001 │ 2017-12-20 05:08:15 │
4. │ energicryptocurrency/energi │ 12 │ 6720 │ 0.002 │ 2017-07-14 16:01:39 │
5. │ M4cs/BabySploit │ 126 │ 71013 │ 0.002 │ 2018-11-11 10:10:30 │
6. │ AioiLight/TJAPlayer3 │ 29 │ 5308 │ 0.005 │ 2018-01-17 19:59:10 │
7. │ everitoken/evt │ 8 │ 1366 │ 0.006 │ 2018-04-23 03:57:30 │
8. │ SqueezerIO/squeezer │ 41 │ 3307 │ 0.012 │ 2017-05-15 07:31:14 │
9. │ dengyuhan/magnetW │ 54 │ 4215 │ 0.013 │ 2018-03-09 09:18:23 │
10. │ react-native-training/react-native-elements │ 48 │ 3341 │ 0.014 │ 2017-04-03 04:27:02 │
11. │ app-developers/top │ 37 │ 1950 │ 0.019 │ 2018-06-27 08:04:07 │
12. │ KidkArolis/jetpack │ 32 │ 1174 │ 0.027 │ 2017-03-13 12:09:24 │
13. │ achael/eht-imaging │ 170 │ 5358 │ 0.032 │ 2016-02-03 03:06:34 │
14. │ metatron-app/metatron-discovery │ 119 │ 3120 │ 0.038 │ 2018-07-27 07:55:44 │
15. │ kezhenxu94/mini-github │ 50 │ 1283 │ 0.039 │ 2018-11-04 15:11:38 │
16. │ soheilpro/catj │ 52 │ 1307 │ 0.04 │ 2014-12-13 06:16:38 │
17. │ marcan/takeover.sh │ 112 │ 2786 │ 0.04 │ 2017-02-10 16:27:02 │
18. │ sinclairzx81/zero │ 94 │ 2160 │ 0.044 │ 2017-06-14 03:45:10 │
19. │ FreeCodeCampChina/freecodecamp.cn │ 343 │ 7651 │ 0.045 │ 2016-09-24 10:10:42 │
20. │ koekeishiya/chunkwm │ 61 │ 1331 │ 0.046 │ 2017-01-15 14:43:56 │
21. │ trojanowski/react-apollo-hooks │ 90 │ 1976 │ 0.046 │ 2018-10-29 07:56:09 │
22. │ JesseKPhillips/USA-Constitution │ 93 │ 1924 │ 0.048 │ 2017-11-16 21:14:35 │
23. │ RomuloOliveira/commit-messages-guide │ 305 │ 5817 │ 0.052 │ 2018-02-26 02:54:42 │
24. │ mgp25/Instagram-API │ 96 │ 1769 │ 0.054 │ 2015-11-16 18:48:36 │
25. │ apachecn/awesome-algorithm │ 191 │ 3424 │ 0.056 │ 2018-09-24 16:46:26 │
26. │ ecthros/uncaptcha2 │ 259 │ 4219 │ 0.061 │ 2018-12-31 17:20:19 │
27. │ TheBerkin/rant │ 83 │ 1283 │ 0.065 │ 2017-07-24 23:31:51 │
28. │ dwyl/learn-json-web-tokens │ 175 │ 2500 │ 0.07 │ 2015-07-06 13:09:04 │
29. │ trimstray/the-practical-linux-hardening-guide │ 516 │ 7355 │ 0.07 │ 2018-10-06 22:36:36 │
30. │ chrisdickinson/git-rs │ 88 │ 1197 │ 0.074 │ 2018-12-22 05:52:49 │
31. │ uswds/public-sans │ 272 │ 3601 │ 0.076 │ 2018-10-06 20:42:23 │
32. │ danijar/handout │ 140 │ 1825 │ 0.077 │ 2018-11-24 21:07:46 │
33. │ trimstray/htrace.sh │ 221 │ 2788 │ 0.079 │ 2018-07-13 22:40:41 │
34. │ jfcoz/postgresqltuner │ 146 │ 1750 │ 0.083 │ 2016-12-15 22:08:38 │
35. │ adblockradio/adblockradio │ 100 │ 1184 │ 0.084 │ 2018-11-15 14:33:52 │
36. │ jhuangtw-dev/xg2xg │ 600 │ 7033 │ 0.085 │ 2017-01-07 15:54:00 │
37. │ lib-pku/libpku │ 1950 │ 23015 │ 0.085 │ 2018-11-22 13:35:21 │
38. │ charlax/professional-programming │ 1112 │ 12892 │ 0.086 │ 2016-05-07 03:25:06 │
39. │ Bogdan-Lyashenko/codecrumbs │ 199 │ 2293 │ 0.087 │ 2018-05-03 10:50:17 │
40. │ facebookincubator/spectrum │ 131 │ 1471 │ 0.089 │ 2018-11-20 16:31:01 │
41. │ Wookai/paper-tips-and-tricks │ 251 │ 2705 │ 0.093 │ 2015-07-09 18:16:32 │
42. │ sveinbjornt/Sloth │ 329 │ 3483 │ 0.094 │ 2011-12-03 20:50:21 │
43. │ CoolPhilChen/SJTU-Courses │ 538 │ 5739 │ 0.094 │ 2018-01-20 02:04:41 │
44. │ geekinglcq/CDCS │ 154 │ 1628 │ 0.095 │ 2018-03-19 07:54:49 │
45. │ Louiszhai/tool │ 418 │ 4286 │ 0.098 │ 2016-07-14 06:04:09 │
46. │ mlabouardy/komiser │ 189 │ 1904 │ 0.099 │ 2018-03-17 16:43:27 │
47. │ hiroppy/fusuma │ 332 │ 3327 │ 0.1 │ 2018-04-27 01:43:35 │
48. │ x-ream/x7 │ 131 │ 1275 │ 0.103 │ 2018-12-23 13:46:14 │
49. │ facebookincubator/redux-react-hook │ 185 │ 1799 │ 0.103 │ 2018-11-09 16:19:29 │
50. │ jeffgerickson/algorithms │ 677 │ 6137 │ 0.11 │ 2018-12-28 03:01:40 │
└───────────────────────────────────────────────┴───────────┴───────────┴───────┴─────────────────────┘
50 rows in set. Elapsed: 0.716 sec. Processed 232.13 million rows, 2.74 GB (324.22 million rows/s., 3.82 GB/s.)
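The first entries in this list are no longer available on GitHub because the repositories were removed.
The next query measures how steadily a repository collects stars: it divides the total star count by the largest number of stars received in a single day, so a high rate means the stars arrived evenly over time rather than in one spike: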
SELECT
repo_name,
max(stars) AS daily_stars,
sum(stars) AS total_stars,
total_stars / daily_stars AS rate
FROM
(
SELECT
repo_name,
toDate(created_at) AS day,
count() AS stars
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY
repo_name,
day
)
GROUP BY repo_name
ORDER BY rate DESC
LIMIT 50
┌─repo_name───────────────────────────────┬─daily_stars─┬─total_stars─┬───rate─┐
1. │ mongodb/mongo │ 24 │ 21412 │ 892.17 │
2. │ plataformatec/devise │ 24 │ 21212 │ 883.83 │
3. │ senchalabs/connect │ 11 │ 9636 │ 876 │
4. │ wycats/handlebars.js │ 21 │ 16863 │ 803 │
5. │ janl/mustache.js │ 19 │ 15149 │ 797.32 │
6. │ fxsjy/jieba │ 34 │ 26901 │ 791.21 │
7. │ scrooloose/nerdtree │ 18 │ 14070 │ 781.67 │
8. │ rails/rails │ 70 │ 53620 │ 766 │
9. │ cheeriojs/cheerio │ 29 │ 22095 │ 761.9 │
10. │ creationix/nvm │ 49 │ 36678 │ 748.53 │
11. │ powerline/fonts │ 28 │ 20781 │ 742.18 │
12. │ tpope/vim-fugitive │ 20 │ 14643 │ 732.15 │
13. │ tornadoweb/tornado │ 21 │ 15221 │ 724.81 │
14. │ square/retrofit │ 58 │ 41066 │ 708.03 │
15. │ swagger-api/swagger-ui │ 27 │ 18724 │ 693.48 │
16. │ andymccurdy/redis-py │ 14 │ 9548 │ 682 │
17. │ capistrano/capistrano │ 19 │ 12794 │ 673.37 │
18. │ rspec/rspec-rails │ 7 │ 4668 │ 666.86 │
19. │ thoughtbot/paperclip │ 12 │ 7961 │ 663.42 │
20. │ VundleVim/Vundle.vim │ 25 │ 16555 │ 662.2 │
21. │ ansible/ansible-examples │ 15 │ 9841 │ 656.07 │
22. │ gradle/gradle │ 19 │ 12338 │ 649.37 │
23. │ tpope/vim-pathogen │ 21 │ 13592 │ 647.24 │
24. │ mishoo/UglifyJS │ 15 │ 9647 │ 643.13 │
25. │ nodejitsu/node-http-proxy │ 17 │ 10925 │ 642.65 │
26. │ zxing/zxing │ 46 │ 29547 │ 642.33 │
27. │ square/okhttp │ 67 │ 43008 │ 641.91 │
28. │ FFmpeg/FFmpeg │ 38 │ 24142 │ 635.32 │
29. │ expressjs/session │ 9 │ 5708 │ 634.22 │
30. │ nvie/gitflow │ 42 │ 26324 │ 626.76 │
31. │ tymondesigns/jwt-auth │ 17 │ 10534 │ 619.65 │
32. │ ReactiveX/RxJava │ 74 │ 45735 │ 618.04 │
33. │ JamesNK/Newtonsoft.Json │ 15 │ 9174 │ 611.6 │
34. │ jquery/jquery-ui │ 19 │ 11604 │ 610.74 │
35. │ expressjs/multer │ 15 │ 9135 │ 609 │
36. │ jashkenas/backbone │ 31 │ 18787 │ 606.03 │
37. │ miguelgrinberg/flasky │ 13 │ 7859 │ 604.54 │
38. │ oblador/react-native-vector-icons │ 25 │ 15100 │ 604 │
39. │ Maximus5/ConEmu │ 13 │ 7849 │ 603.77 │
40. │ desandro/masonry │ 27 │ 16292 │ 603.41 │
41. │ jquery/jquery │ 110 │ 65497 │ 595.43 │
42. │ benoitc/gunicorn │ 13 │ 7736 │ 595.08 │
43. │ nostra13/Android-Universal-Image-Loader │ 34 │ 20222 │ 594.76 │
44. │ greenrobot/EventBus │ 43 │ 25438 │ 591.58 │
45. │ fabric/fabric │ 23 │ 13606 │ 591.57 │
46. │ tastejs/todomvc │ 42 │ 24716 │ 588.48 │
47. │ iissnan/hexo-theme-next │ 32 │ 18709 │ 584.66 │
48. │ mochajs/mocha │ 29 │ 16839 │ 580.66 │
49. │ codemirror/CodeMirror │ 31 │ 17997 │ 580.55 │
50. │ scikit-learn/scikit-learn │ 84 │ 48654 │ 579.21 │
└─────────────────────────────────────────┴─────────────┴─────────────┴────────┘
50 rows in set. Elapsed: 41.287 sec. Processed 232.13 million rows, 2.74 GB (5.62 million rows/s., 66.26 MB/s.)
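To make the rate in this report easier to follow, here is a minimal Python sketch of the same two-level aggregation on invented toy data. The function name `steadiness` and the sample repositories are assumptions made for illustration; they are not part of the dataset or of ClickHouse.

```python
from collections import Counter, defaultdict
from datetime import datetime


def steadiness(events):
    """events: iterable of (repo_name, created_at) pairs, one per star.

    Returns {repo_name: (daily_stars, total_stars, rate)}, mirroring the
    columns of the query above.
    """
    per_day = defaultdict(Counter)
    for repo, created_at in events:
        # Inner aggregation: stars per repository per calendar day.
        per_day[repo][created_at.date()] += 1

    result = {}
    for repo, days in per_day.items():
        daily = max(days.values())   # best single day
        total = sum(days.values())   # all stars
        result[repo] = (daily, total, total / daily)
    return result


# Invented toy data: one repo gets a star a day, the other gets all at once.
events = [("steady/repo", datetime(2020, 1, d, 12)) for d in range(1, 11)]
events += [("spiky/repo", datetime(2020, 1, 1, h)) for h in range(10)]
rates = steadiness(events)
```

Both toy repositories have 10 stars, but the steady one gets a rate of 10 and the spiky one a rate of 1, which is why long-lived projects with no viral moment float to the top of the list.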
:) SELECT toDayOfWeek(created_at) AS day, count() AS stars, bar(stars, 0, 50000000, 10) AS bar FROM github_events WHERE event_type = 'WatchEvent' GROUP BY day ORDER BY day
┌─day─┬────stars─┬─bar──────┐
│ 1 │ 36491986 │ ███████▎ │
│ 2 │ 38094378 │ ███████▌ │
│ 3 │ 37570733 │ ███████▌ │
│ 4 │ 37208005 │ ███████▍ │
│ 5 │ 34924484 │ ██████▊ │
│ 6 │ 23726322 │ ████▋ │
│ 7 │ 24102566 │ ████▋ │
└─────┴──────────┴──────────┘
7 rows in set. Elapsed: 0.093 sec. Processed 232.13 million rows, 1.16 GB (2.50 billion rows/s., 12.50 GB/s.)
The best day to catch a star is Tuesday. Definitely not the weekend. Wednesday and Thursday are nearly as good, but Monday and Friday are weaker.
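The day numbers in the report follow the convention where 1 is Monday and 7 is Sunday. As a rough illustration, the same grouping can be sketched in Python, whose `isoweekday()` uses the same numbering. The helper names and the sample timestamps are invented, and `text_bar` only approximates the `bar` function with whole blocks.

```python
from collections import Counter
from datetime import datetime


def stars_by_weekday(timestamps):
    """Count events per day of week, 1 = Monday ... 7 = Sunday."""
    counts = Counter(ts.isoweekday() for ts in timestamps)
    return [(day, counts.get(day, 0)) for day in range(1, 8)]


def text_bar(value, max_value, width=10):
    """Crude bar made of whole blocks only (no fractional blocks)."""
    return "█" * int(width * min(value, max_value) / max_value)


# Invented sample: 2020-01-06 was a Monday, 2020-01-11 a Saturday.
sample = [
    datetime(2020, 1, 6, 9),
    datetime(2020, 1, 7, 10),
    datetime(2020, 1, 7, 18),
    datetime(2020, 1, 11, 12),
]
weekday_counts = stars_by_weekday(sample)
```

Calling `text_bar(38094378, 50000000)` with the Tuesday figure from the table yields seven full blocks, close to the bar shown in the report.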
:) SELECT uniq(actor_login) FROM github_events
┌─uniq(actor_login)─┐
│ 34138551 │
└───────────────────┘
1 rows in set. Elapsed: 3.358 sec. Processed 3.12 billion rows, 18.54 GB (928.96 million rows/s., 5.52 GB/s.)
About 34 million. Strictly speaking, these are users who not only registered but also did at least something that produced an event.
Total number of users that gave at least one star:
:) SELECT uniq(actor_login) FROM github_events WHERE event_type = 'WatchEvent'
┌─uniq(actor_login)─┐
│ 10176170 │
└───────────────────┘
1 rows in set. Elapsed: 1.186 sec. Processed 232.13 million rows, 3.98 GB (195.65 million rows/s., 3.36 GB/s.)
Just 10 million. I've heard that some people don't give stars. They just do their job instead.
Total number of users with at least one push:
:) SELECT uniq(actor_login) FROM github_events WHERE event_type = 'PushEvent'
┌─uniq(actor_login)─┐
│ 18796966 │
└───────────────────┘
1 rows in set. Elapsed: 0.964 sec. Processed 1.60 billion rows, 9.33 GB (1.66 billion rows/s., 9.67 GB/s.)
There are actually more people who pushed code than those who gave stars.
Total number of users with at least one created PR:
:) SELECT uniq(actor_login) FROM github_events WHERE event_type = 'PullRequestEvent' AND action = 'opened'
┌─uniq(actor_login)─┐
│ 6407734 │
└───────────────────┘
1 rows in set. Elapsed: 0.299 sec. Processed 214.63 million rows, 1.34 GB (718.00 million rows/s., 4.48 GB/s.)
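The four counts above are easier to compare as shares of all active users. The sketch below only redoes the arithmetic on the numbers printed in the reports; since `uniq` returns an approximate count, the percentages should be read as rough figures. The variable and function names are chosen for illustration.

```python
# Counts copied from the query results above.
total_users = 34_138_551
groups = {
    "gave a star": 10_176_170,
    "pushed code": 18_796_966,
    "opened a PR": 6_407_734,
}


def share(part, whole):
    """Percentage of `whole` represented by `part`, one decimal place."""
    return round(100 * part / whole, 1)


shares = {name: share(count, total_users) for name, count in groups.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}% of active users")
```

Roughly 30% of active users gave at least one star, 55% pushed at least once, and 19% opened at least one pull request.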
What if we rank repositories by stars, but count only the stars from users who have opened at least one PR in at least one repo?
SELECT
repo_name,
count()
FROM github_events
WHERE (event_type = 'WatchEvent') AND (actor_login IN
(
SELECT actor_login
FROM github_events
WHERE (event_type = 'PullRequestEvent') AND (action = 'opened')
))
GROUP BY repo_name
ORDER BY count() DESC
LIMIT 50
┌─repo_name──────────────────────────────┬─count()─┐
1. │ facebook/react │ 121976 │
2. │ vuejs/vue │ 109518 │
3. │ 996icu/996.ICU │ 109244 │
4. │ sindresorhus/awesome │ 106729 │
5. │ getify/You-Dont-Know-JS │ 102997 │
6. │ tensorflow/tensorflow │ 96960 │
7. │ kamranahmedse/developer-roadmap │ 93499 │
8. │ FreeCodeCamp/FreeCodeCamp │ 89246 │
9. │ airbnb/javascript │ 81594 │
10. │ donnemartin/system-design-primer │ 79909 │
11. │ github/gitignore │ 79316 │
12. │ torvalds/linux │ 77051 │
13. │ robbyrussell/oh-my-zsh │ 75246 │
14. │ twbs/bootstrap │ 70968 │
15. │ jwasham/coding-interview-university │ 68898 │
16. │ facebook/react-native │ 65952 │
17. │ flutter/flutter │ 65423 │
18. │ danistefanovic/build-your-own-x │ 62992 │
19. │ golang/go │ 62718 │
20. │ vhf/free-programming-books │ 62366 │
21. │ M4cs/BabySploit │ 61966 │
22. │ trekhleb/javascript-algorithms │ 59556 │
23. │ jlevy/the-art-of-command-line │ 58887 │
24. │ resume/resume.github.com │ 57943 │
25. │ Microsoft/vscode │ 57636 │
26. │ freeCodeCamp/freeCodeCamp │ 57145 │
27. │ vinta/awesome-python │ 57027 │
28. │ nodejs/node │ 50677 │
29. │ FortAwesome/Font-Awesome │ 50169 │
30. │ angular/angular.js │ 49818 │
31. │ angular/angular │ 48952 │
32. │ EbookFoundation/free-programming-books │ 48548 │
33. │ daneden/animate.css │ 47795 │
34. │ nvbn/thefuck │ 47668 │
35. │ TheAlgorithms/Python │ 46791 │
36. │ hakimel/reveal.js │ 46051 │
37. │ mrdoob/three.js │ 46046 │
38. │ atom/atom │ 46045 │
39. │ kubernetes/kubernetes │ 45303 │
40. │ webpack/webpack │ 45159 │
41. │ toddmotto/public-apis │ 45011 │
42. │ electron/electron │ 44893 │
43. │ avelino/awesome-go │ 44819 │
44. │ apple/swift │ 43735 │
45. │ laravel/laravel │ 43569 │
46. │ ant-design/ant-design │ 43134 │
47. │ GoogleChrome/puppeteer │ 42974 │
48. │ django/django │ 41836 │
49. │ tonsky/FiraCode │ 41629 │
50. │ adam-p/markdown-here │ 40647 │
└────────────────────────────────────────┴─────────┘
50 rows in set. Elapsed: 7.308 sec. Processed 446.76 million rows, 6.93 GB (61.13 million rows/s., 948.56 MB/s.)
The list is similar to the overall top list.
What if we take authors who have made at least 10 PRs?
SELECT
repo_name,
count()
FROM github_events
WHERE (event_type = 'WatchEvent') AND (actor_login IN
(
SELECT actor_login
FROM github_events
WHERE (event_type = 'PullRequestEvent') AND (action = 'opened')
GROUP BY actor_login
HAVING count() >= 10
))
GROUP BY repo_name
ORDER BY count() DESC
LIMIT 50
┌─repo_name───────────────────────────┬─count()─┐
1. │ facebook/react │ 56889 │
2. │ getify/You-Dont-Know-JS │ 49496 │
3. │ sindresorhus/awesome │ 48750 │
4. │ vuejs/vue │ 43864 │
5. │ airbnb/javascript │ 40277 │
6. │ kamranahmedse/developer-roadmap │ 39809 │
7. │ tensorflow/tensorflow │ 38087 │
8. │ resume/resume.github.com │ 37815 │
9. │ robbyrussell/oh-my-zsh │ 37015 │
10. │ donnemartin/system-design-primer │ 36259 │
11. │ torvalds/linux │ 35229 │
12. │ github/gitignore │ 34664 │
13. │ golang/go │ 31545 │
14. │ danistefanovic/build-your-own-x │ 30382 │
15. │ facebook/react-native │ 30200 │
16. │ Microsoft/vscode │ 29541 │
17. │ twbs/bootstrap │ 29045 │
18. │ vhf/free-programming-books │ 29025 │
19. │ jlevy/the-art-of-command-line │ 28095 │
20. │ FreeCodeCamp/FreeCodeCamp │ 27446 │
21. │ trekhleb/javascript-algorithms │ 26912 │
22. │ flutter/flutter │ 26693 │
23. │ jwasham/coding-interview-university │ 26360 │
24. │ nvbn/thefuck │ 26204 │
25. │ hakimel/reveal.js │ 25546 │
26. │ FortAwesome/Font-Awesome │ 24949 │
27. │ nodejs/node │ 24475 │
28. │ angular/angular.js │ 23688 │
29. │ webpack/webpack │ 23617 │
30. │ toddmotto/public-apis │ 23601 │
31. │ rust-lang/rust │ 23408 │
32. │ 996icu/996.ICU │ 23405 │
33. │ atom/atom │ 23250 │
34. │ gatsbyjs/gatsby │ 23062 │
35. │ GoogleChrome/puppeteer │ 22863 │
36. │ tonsky/FiraCode │ 22544 │
37. │ zeit/next.js │ 22455 │
38. │ daneden/animate.css │ 22396 │
39. │ kubernetes/kubernetes │ 22151 │
40. │ apple/swift │ 22129 │
41. │ avelino/awesome-go │ 21905 │
42. │ vinta/awesome-python │ 21870 │
43. │ mrdoob/three.js │ 21756 │
44. │ rails/rails │ 21095 │
45. │ typicode/json-server │ 20671 │
46. │ yarnpkg/yarn │ 20650 │
47. │ neovim/neovim │ 20466 │
48. │ Microsoft/TypeScript │ 20355 │
49. │ angular/angular │ 20242 │
50. │ papers-we-love/papers-we-love │ 20168 │
└─────────────────────────────────────┴─────────┘
50 rows in set. Elapsed: 2.575 sec. Processed 446.76 million rows, 6.94 GB (173.49 million rows/s., 2.69 GB/s.)
If we count only software, the list looks like this: React, Vue, TensorFlow, Linux, Go, VS Code, Flutter.
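The two reports above differ only in how many opened PRs a user needs before their stars are counted. A small Python sketch of that filter, on invented events, may make the logic clearer. The function name, the tuple layout, and the sample users are assumptions for illustration only.

```python
from collections import Counter


def top_starred_by_pr_authors(events, min_prs=1, limit=50):
    """events: iterable of (event_type, action, actor_login, repo_name).

    Counts stars per repository, but only from users who opened at
    least `min_prs` pull requests anywhere.
    """
    events = list(events)
    prs_opened = Counter(
        actor
        for event_type, action, actor, _ in events
        if event_type == "PullRequestEvent" and action == "opened"
    )
    qualified = {actor for actor, n in prs_opened.items() if n >= min_prs}
    stars = Counter(
        repo
        for event_type, _, actor, repo in events
        if event_type == "WatchEvent" and actor in qualified
    )
    return stars.most_common(limit)


# Invented toy events: alice opened 2 PRs, bob 1, carol none.
toy = [
    ("PullRequestEvent", "opened", "alice", "a/x"),
    ("PullRequestEvent", "opened", "alice", "b/y"),
    ("PullRequestEvent", "opened", "bob", "a/x"),
    ("PullRequestEvent", "closed", "carol", "a/x"),
    ("WatchEvent", "started", "alice", "a/x"),
    ("WatchEvent", "started", "bob", "a/x"),
    ("WatchEvent", "started", "bob", "b/y"),
    ("WatchEvent", "started", "carol", "b/y"),
]
at_least_one = top_starred_by_pr_authors(toy, min_prs=1)
at_least_two = top_starred_by_pr_authors(toy, min_prs=2)
```

Raising `min_prs` shrinks the set of qualifying users, so every count can only stay the same or drop, just as the counts in the second report are lower than in the first.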
Repositories with the largest number of pull requests:
:) SELECT repo_name, count(), uniq(actor_login) FROM github_events WHERE event_type = 'PullRequestEvent' AND action = 'opened' GROUP BY repo_name ORDER BY count() DESC LIMIT 50
┌─repo_name───────────────────────────────┬─count()─┬─uniq(actor_login)─┐
1. │ google-test/signcla-probe-repo │ 351806 │ 4 │
2. │ everypolitician/everypolitician-data │ 158134 │ 18 │
3. │ brianchandotcom/liferay-portal │ 93222 │ 338 │
4. │ NixOS/nixpkgs │ 84557 │ 3816 │
5. │ Homebrew/homebrew-core │ 60457 │ 5852 │
6. │ sauron-demo/sauron │ 54806 │ 1 │
7. │ elastic/kibana │ 52317 │ 832 │
8. │ test-organization-kkjeer/bot-validation │ 51409 │ 4 │
9. │ SmartThingsCommunity/SmartThingsPublic │ 51280 │ 716 │
10. │ test-organization-kkjeer/app-test │ 51006 │ 6 │
11. │ kubernetes/kubernetes │ 50768 │ 4124 │
12. │ odoo/odoo │ 46764 │ 2038 │
13. │ Homebrew/homebrew-cask │ 44006 │ 3378 │
14. │ ansible/ansible │ 42592 │ 7337 │
15. │ caskroom/homebrew-cask │ 38985 │ 4825 │
16. │ ceph/ceph │ 37495 │ 1413 │
17. │ code-dot-org/code-dot-org │ 37133 │ 125 │
18. │ jlord/patchwork │ 36787 │ 31919 │
19. │ rdxvsrmv/OfficeDocs-OfficeUpdates-test │ 36706 │ 7 │
20. │ DefinitelyTyped/DefinitelyTyped │ 36240 │ 13826 │
21. │ elastic/elasticsearch │ 35673 │ 2030 │
22. │ saltstack/salt │ 35222 │ 3067 │
23. │ rust-lang/rust │ 35214 │ 3259 │
24. │ apple/swift │ 33758 │ 1154 │
25. │ openmicroscopy/snoopys-sandbox │ 32507 │ 9 │
26. │ argo-testing/app │ 31861 │ 1 │
27. │ Automattic/wp-calypso │ 31687 │ 629 │
28. │ mozilla-b2g/gaia │ 30403 │ 911 │
29. │ apache/spark │ 30079 │ 2970 │
30. │ pytorch/pytorch │ 29990 │ 2192 │
31. │ cms-sw/cmssw │ 29045 │ 973 │
32. │ sauron-demo/sauron-demo │ 28654 │ 12 │
33. │ dimagi/commcare-hq │ 28143 │ 127 │
34. │ CleverRaven/Cataclysm-DDA │ 27698 │ 1430 │
35. │ cockroachdb/cockroach │ 27386 │ 449 │
36. │ firstcontributions/first-contributions │ 27015 │ 24995 │
37. │ ros/rosdistro │ 26541 │ 1197 │
38. │ ideatest1/PullRequestTest │ 26336 │ 1 │
39. │ JuliaRegistries/General │ 25487 │ 171 │
40. │ tgstation/tgstation │ 25305 │ 967 │
41. │ rails/rails │ 25232 │ 5621 │
42. │ openshift/openshift-docs │ 25146 │ 677 │
43. │ edx/edx-platform │ 25049 │ 737 │
44. │ bbq-beets/ForkPRCanary │ 24593 │ 1 │
45. │ dotnet/roslyn │ 24488 │ 582 │
46. │ dotnet/corefx │ 24151 │ 1085 │
47. │ flutter/flutter │ 23707 │ 1331 │
48. │ symfony/symfony │ 23572 │ 3617 │
49. │ void-linux/void-packages │ 23522 │ 790 │
50. │ bioconda/bioconda-recipes │ 23249 │ 1209 │
└─────────────────────────────────────────┴─────────┴───────────────────┘
50 rows in set. Elapsed: 0.990 sec. Processed 214.63 million rows, 2.82 GB (216.69 million rows/s., 2.84 GB/s.)
This list contains some rather specific repositories. The top one looks like a test repository for checking CLA signing, but only 4 users have opened PRs there, and they are obviously bots. The next one is a repository with a dataset about politicians.
Conclusion: if a repository has many pull requests from a small number of users, they are most likely produced by an automated process.
Number four is the package repository for NixOS. That is quite understandable: if you want your package to be in the OS, just make a pull request, and there are a lot of packages. The same goes for number five, Homebrew.
There is also one empty repository with a huge number of trash pull requests: sauron. It looks like a script got out of control.
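The conclusion above can be turned into a simple heuristic: divide the number of pull requests by the number of distinct authors and flag repositories where the ratio is very high. The sketch below applies it to a few rows copied from the table. The threshold of 1000 PRs per author is an arbitrary assumption for illustration, not a value taken from the analysis.

```python
def prs_per_author(pull_requests, authors):
    """Average number of opened PRs per distinct author."""
    return pull_requests / authors


def looks_automated(pull_requests, authors, threshold=1000):
    """Heuristic flag; the default threshold is an arbitrary assumption."""
    return prs_per_author(pull_requests, authors) >= threshold


# (pull requests, distinct authors) copied from the report above.
repos = {
    "google-test/signcla-probe-repo": (351806, 4),
    "everypolitician/everypolitician-data": (158134, 18),
    "NixOS/nixpkgs": (84557, 3816),
    "sauron-demo/sauron": (54806, 1),
    "jlord/patchwork": (36787, 31919),
}
flagged = [
    name
    for name, (prs, authors) in repos.items()
    if looks_automated(prs, authors)
]
```

With this threshold, the CLA test repository, the politicians dataset, and sauron are flagged, while NixOS/nixpkgs (about 22 PRs per author) and jlord/patchwork (barely more than one) are not.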
Repositories with the largest number of pull request contributors:
:) SELECT repo_name, count(), uniq(actor_login) AS u FROM github_events WHERE event_type = 'PullRequestEvent' AND action = 'opened' GROUP BY repo_name ORDER BY u DESC LIMIT 50
┌─repo_name─────────────────────────────────────────────────────────────┬─count()─┬─────u─┐
1. │ jlord/patchwork │ 36787 │ 31919 │
2. │ firstcontributions/first-contributions │ 27015 │ 24995 │
3. │ octocat/Spoon-Knife │ 20116 │ 18204 │
4. │ DefinitelyTyped/DefinitelyTyped │ 36240 │ 13826 │
5. │ deadlyvipers/dojo_rules │ 15925 │ 9751 │
6. │ google/it-cert-automation-practice │ 10132 │ 9749 │
7. │ udacity/create-your-own-adventure │ 10423 │ 8839 │
8. │ JetBrains/swot │ 9866 │ 8291 │
9. │ udacity/course-collaboration-travel-plans │ 8417 │ 8132 │
10. │ zero-to-mastery/start-here-guidelines │ 8940 │ 8112 │
11. │ githubschool/open-enrollment-classes-introduction-to-github │ 9299 │ 7534 │
12. │ learn-co-students/js-from-dom-to-node-bootcamp-prep-000 │ 7435 │ 7383 │
13. │ ansible/ansible │ 42592 │ 7337 │
14. │ learn-co-students/javascript-intro-to-functions-lab-bootcamp-prep-000 │ 7161 │ 7131 │
15. │ learn-co-students/js-functions-lab-bootcamp-prep-000 │ 6984 │ 6950 │
16. │ learn-co-students/js-node-practice-lab-bootcamp-prep-000 │ 6761 │ 6736 │
17. │ learn-co-students/js-if-else-files-lab-bootcamp-prep-000 │ 6523 │ 6509 │
18. │ learn-co-students/javascript-arithmetic-lab-bootcamp-prep-000 │ 6501 │ 6472 │
19. │ learn-co-students/js-what-is-a-test-lab-bootcamp-prep-000 │ 6344 │ 6329 │
20. │ learn-co-students/first-ide-lab-bootcamp-prep-000 │ 6250 │ 6202 │
21. │ AliceWonderland/hacktoberfest │ 7550 │ 6092 │
22. │ learn-co-students/js-what-is-a-test-bootcamp-prep-000 │ 6060 │ 6048 │
23. │ learn-co-students/javascript-fix-the-scope-lab-bootcamp-prep-000 │ 5978 │ 5957 │
24. │ Roshanjossey/first-contributions │ 6337 │ 5931 │
25. │ Homebrew/homebrew-core │ 60457 │ 5852 │
26. │ TheOdinProject/curriculum │ 21191 │ 5847 │
27. │ MicrosoftDocs/azure-docs │ 18001 │ 5827 │
28. │ freeCodeCamp/freeCodeCamp │ 19485 │ 5716 │
29. │ rails/rails │ 25232 │ 5621 │
30. │ learn-co-students/javascript-arrays-bootcamp-prep-000 │ 5475 │ 5441 │
31. │ learn-co-students/javascript-arrays-lab-bootcamp-prep-000 │ 5226 │ 5197 │
32. │ laravel/framework │ 20000 │ 5082 │
33. │ learn-co-students/javascript-objects-bootcamp-prep-000 │ 4937 │ 4911 │
34. │ freddier/hyperblog │ 5000 │ 4830 │
35. │ caskroom/homebrew-cask │ 38985 │ 4825 │
36. │ learn-co-students/your-first-lab-cb-gh-000 │ 4727 │ 4700 │
37. │ learn-co-students/javascript-objects-lab-bootcamp-prep-000 │ 4668 │ 4652 │
38. │ Homebrew/homebrew │ 19058 │ 4361 │
39. │ wbond/package_control_channel │ 7843 │ 4358 │
40. │ gatsbyjs/gatsby │ 15847 │ 4358 │
41. │ learn-co-students/javascript-intro-to-looping-bootcamp-prep-000 │ 4350 │ 4325 │
42. │ rdpeng/ProgrammingAssignment2 │ 4713 │ 4318 │
43. │ michaelliao/learngit │ 4574 │ 4299 │
44. │ mxcl/homebrew │ 12780 │ 4283 │
45. │ kubernetes/kubernetes │ 50768 │ 4124 │
46. │ learn-co-students/javascript-logging-lab-js-intro-000 │ 4112 │ 4086 │
47. │ CocoaPods/Specs │ 13787 │ 4072 │
48. │ learn-co-students/js-beatles-loops-lab-bootcamp-prep-000 │ 4042 │ 4024 │
49. │ LarryMad/recipes │ 4417 │ 4018 │
50. │ helm/charts │ 12273 │ 3956 │
└───────────────────────────────────────────────────────────────────────┴─────────┴───────┘
50 rows in set. Elapsed: 0.952 sec. Processed 214.63 million rows, 2.82 GB (225.43 million rows/s., 2.96 GB/s.)
"firstcontributions" is the obvious case: a repository that teaches you how to make a pull request... by allowing you to make a pull request to this repository. "patchwork" is similar. Most of the repositories in this list are similar.
Repositories with the largest number of opened issues:
:) SELECT repo_name, count() AS c, uniq(actor_login) AS u FROM github_events WHERE event_type = 'IssuesEvent' AND action = 'opened' GROUP BY repo_name ORDER BY c DESC LIMIT 50
┌─repo_name───────────────────────────────┬──────c─┬─────u─┐
1. │ koorellasuresh/UKRegionTest │ 379379 │ 4 │
2. │ pddemo/demo │ 216215 │ 1 │
3. │ lstjsuperman/fabric │ 173986 │ 1 │
4. │ Khan/khan-exercises │ 157016 │ 462 │
5. │ No-CQRT/GooGuns │ 149264 │ 1 │
6. │ Khan/khan-i18n │ 130920 │ 3 │
7. │ jlippold/tweakCompatible │ 127029 │ 11243 │
8. │ Ramos-dev/jniwebshell │ 108096 │ 1 │
9. │ ron190/jsql-injection │ 91086 │ 30 │
10. │ debricked/debricked-test-integration │ 83341 │ 1 │
11. │ chrmarti/testissues │ 72695 │ 19 │
12. │ BoltunovOleg/ChatRoulette │ 69970 │ 2 │
13. │ imamandrews/imamandrews.github.io │ 68926 │ 3 │
14. │ Microsoft/vscode │ 65122 │ 27038 │
15. │ pulWifi/pulWifi │ 63002 │ 4 │
16. │ AdguardTeam/AdguardFilters │ 61979 │ 634 │
17. │ webcompat/web-bugs │ 61268 │ 3959 │
18. │ nainardev/tamil-dubbed │ 58392 │ 1 │
19. │ PennyDreadfulMTG/perf-reports │ 55259 │ 5 │
20. │ joshjach/zaptest1 │ 48149 │ 1 │
21. │ ssi-qa/Test-ssi │ 45982 │ 1 │
22. │ flutter/flutter │ 45920 │ 16677 │
23. │ znjk123123/uoI3JGf │ 45797 │ 1 │
24. │ MicrosoftDocs/azure-docs │ 44131 │ 21036 │
25. │ xuyuhuan123/x │ 43079 │ 1 │
26. │ bippybop/iitest │ 41581 │ 1 │
27. │ webhintio/scan-issues │ 41570 │ 4 │
28. │ humera987/FXLabs-Test-Automation │ 41383 │ 2 │
29. │ test-organization-kkjeer/bot-validation │ 40140 │ 8 │
30. │ test-organization-kkjeer/app-test │ 39963 │ 4 │
31. │ ssi-qa/TestSSI │ 39666 │ 1 │
32. │ support-ops/sit-repo │ 39296 │ 1 │
33. │ antonba/ApimTest │ 37973 │ 1 │
34. │ microsoft/vscode │ 34798 │ 19479 │
35. │ theapache64/movie-monk-creator │ 34666 │ 2 │
36. │ ZeroK-RTS/CrashReports │ 34133 │ 16 │
37. │ koorellasuresh/samples │ 32636 │ 1 │
38. │ kubernetes/kubernetes │ 31644 │ 8615 │
39. │ golang/go │ 31443 │ 10190 │
40. │ shuhongwu/hockeyapp │ 31401 │ 1 │
41. │ hq450/fancyss │ 31124 │ 778 │
42. │ StepanFirsov/tutorials │ 30336 │ 5 │
43. │ as00789/--------- │ 29134 │ 2 │
44. │ zwwxy001/001 │ 28833 │ 1 │
45. │ cockroachdb/cockroach │ 28390 │ 1049 │
46. │ tensorflow/tensorflow │ 28380 │ 16377 │
47. │ Zhycrin/Time │ 28104 │ 1 │
48. │ rust-lang/rust │ 28036 │ 6283 │
49. │ ansible/ansible │ 27774 │ 13050 │
50. │ ikedaosushi/tech-news │ 27774 │ 6 │
└─────────────────────────────────────────┴────────┴───────┘
50 rows in set. Elapsed: 0.488 sec. Processed 111.27 million rows, 1.59 GB (228.02 million rows/s., 3.25 GB/s.)
The top repository no longer exists; it was probably some misuse of GitHub. The 2nd-place repository has a funny description: "demo: A new issue is created in this repo every minute". The 3rd-place repository has a similar purpose, "this is a test", but its list of issues looks more sane, as if someone is using GitHub issues for automated crash reports. The 4th-place repository serves as a replacement for a spreadsheet.
The first meaningful result is Microsoft/vscode, with about 65 thousand issues from about 27 thousand authors. And the issues are real. The repository also appears as microsoft/vscode in 34th place; the two spellings are counted separately because repository names are compared as case-sensitive strings.
Conclusion: if a repository has a very large number of issues, they may well be created automatically, for example from crash reports.
Let's also add the number of stars to this report:
WITH (event_type = 'IssuesEvent') AND (action = 'opened') AS issue_created
SELECT
repo_name,
sum(issue_created) AS c,
uniqIf(actor_login, issue_created) AS u,
sum(event_type = 'WatchEvent') AS stars
FROM github_events
WHERE event_type IN ('IssuesEvent', 'WatchEvent')
GROUP BY repo_name
ORDER BY c DESC
LIMIT 50
┌─repo_name───────────────────────────────┬──────c─┬─────u─┬──stars─┐
1. │ koorellasuresh/UKRegionTest │ 379379 │ 4 │ 1 │
2. │ pddemo/demo │ 216215 │ 1 │ 1 │
3. │ lstjsuperman/fabric │ 173986 │ 1 │ 5 │
4. │ Khan/khan-exercises │ 157016 │ 462 │ 1749 │
5. │ No-CQRT/GooGuns │ 149264 │ 1 │ 0 │
6. │ Khan/khan-i18n │ 130920 │ 3 │ 9 │
7. │ jlippold/tweakCompatible │ 127029 │ 11243 │ 478 │
8. │ Ramos-dev/jniwebshell │ 108096 │ 1 │ 5 │
9. │ ron190/jsql-injection │ 91086 │ 30 │ 907 │
10. │ debricked/debricked-test-integration │ 83341 │ 1 │ 0 │
11. │ chrmarti/testissues │ 72695 │ 19 │ 9 │
12. │ BoltunovOleg/ChatRoulette │ 69970 │ 2 │ 0 │
13. │ imamandrews/imamandrews.github.io │ 68926 │ 3 │ 3 │
14. │ Microsoft/vscode │ 65122 │ 27038 │ 82043 │
15. │ pulWifi/pulWifi │ 63002 │ 4 │ 24 │
16. │ AdguardTeam/AdguardFilters │ 61979 │ 634 │ 984 │
17. │ webcompat/web-bugs │ 61268 │ 3959 │ 483 │
18. │ nainardev/tamil-dubbed │ 58392 │ 1 │ 1 │
19. │ PennyDreadfulMTG/perf-reports │ 55259 │ 5 │ 3 │
20. │ joshjach/zaptest1 │ 48149 │ 1 │ 0 │
21. │ ssi-qa/Test-ssi │ 45982 │ 1 │ 0 │
22. │ flutter/flutter │ 45920 │ 16677 │ 116303 │
23. │ znjk123123/uoI3JGf │ 45797 │ 1 │ 0 │
24. │ MicrosoftDocs/azure-docs │ 44131 │ 21036 │ 4888 │
25. │ xuyuhuan123/x │ 43079 │ 1 │ 0 │
26. │ bippybop/iitest │ 41581 │ 1 │ 0 │
27. │ webhintio/scan-issues │ 41570 │ 4 │ 7 │
28. │ humera987/FXLabs-Test-Automation │ 41383 │ 2 │ 0 │
29. │ test-organization-kkjeer/bot-validation │ 40140 │ 8 │ 6 │
30. │ test-organization-kkjeer/app-test │ 39963 │ 4 │ 5 │
31. │ ssi-qa/TestSSI │ 39666 │ 1 │ 0 │
32. │ support-ops/sit-repo │ 39296 │ 1 │ 0 │
33. │ antonba/ApimTest │ 37973 │ 1 │ 0 │
34. │ microsoft/vscode │ 34798 │ 19479 │ 38395 │
35. │ theapache64/movie-monk-creator │ 34666 │ 2 │ 0 │
36. │ ZeroK-RTS/CrashReports │ 34133 │ 16 │ 11 │
37. │ koorellasuresh/samples │ 32636 │ 1 │ 0 │
38. │ kubernetes/kubernetes │ 31644 │ 8615 │ 68644 │
39. │ golang/go │ 31443 │ 10190 │ 92407 │
40. │ shuhongwu/hockeyapp │ 31401 │ 1 │ 0 │
41. │ hq450/fancyss │ 31124 │ 778 │ 7921 │
42. │ StepanFirsov/tutorials │ 30336 │ 5 │ 0 │
43. │ as00789/--------- │ 29134 │ 2 │ 1 │
44. │ zwwxy001/001 │ 28833 │ 1 │ 0 │
45. │ cockroachdb/cockroach │ 28390 │ 1049 │ 21147 │
46. │ tensorflow/tensorflow │ 28380 │ 16377 │ 173681 │
47. │ Zhycrin/Time │ 28104 │ 1 │ 1 │
48. │ rust-lang/rust │ 28036 │ 6283 │ 53027 │
49. │ ansible/ansible │ 27774 │ 13050 │ 51144 │
50. │ ikedaosushi/tech-news │ 27774 │ 6 │ 15 │
└─────────────────────────────────────────┴────────┴───────┴────────┘
50 rows in set. Elapsed: 2.344 sec. Processed 343.40 million rows, 7.68 GB (146.50 million rows/s., 3.28 GB/s.)
Now we can distinguish real issues from robot ones. Let's add a cutoff at 1000 stars:
WITH (event_type = 'IssuesEvent') AND (action = 'opened') AS issue_created
SELECT
repo_name,
sum(issue_created) AS c,
uniqIf(actor_login, issue_created) AS u,
sum(event_type = 'WatchEvent') AS stars
FROM github_events
WHERE event_type IN ('IssuesEvent', 'WatchEvent')
GROUP BY repo_name
HAVING stars >= 1000
ORDER BY c DESC
LIMIT 50
┌─repo_name───────────────────┬──────c─┬─────u─┬──stars─┐
1. │ Khan/khan-exercises │ 157016 │ 462 │ 1749 │
2. │ Microsoft/vscode │ 65122 │ 27038 │ 82043 │
3. │ flutter/flutter │ 45920 │ 16677 │ 116303 │
4. │ MicrosoftDocs/azure-docs │ 44131 │ 21036 │ 4888 │
5. │ microsoft/vscode │ 34798 │ 19479 │ 38395 │
6. │ kubernetes/kubernetes │ 31644 │ 8615 │ 68644 │
7. │ golang/go │ 31443 │ 10190 │ 92407 │
8. │ hq450/fancyss │ 31124 │ 778 │ 7921 │
9. │ cockroachdb/cockroach │ 28390 │ 1049 │ 21147 │
10. │ tensorflow/tensorflow │ 28380 │ 16377 │ 173681 │
11. │ rust-lang/rust │ 28036 │ 6283 │ 53027 │
12. │ ansible/ansible │ 27774 │ 13050 │ 51144 │
13. │ elastic/kibana │ 26962 │ 4151 │ 13434 │
14. │ dotnet/roslyn │ 24027 │ 3153 │ 15605 │
15. │ godotengine/godot │ 23546 │ 5521 │ 34238 │
16. │ 996icu/996.ICU │ 22791 │ 12818 │ 354850 │
17. │ rancher/rancher │ 22512 │ 4689 │ 15241 │
18. │ saltstack/salt │ 22450 │ 5777 │ 12989 │
19. │ angular/angular │ 21806 │ 10403 │ 80602 │
20. │ Microsoft/TypeScript │ 21004 │ 6744 │ 52328 │
21. │ fanout/pushpin │ 20793 │ 260 │ 2928 │
22. │ / │ 20676 │ 883 │ 5496 │
23. │ facebook/react-native │ 19926 │ 11861 │ 105346 │
24. │ dart-lang/sdk │ 19737 │ 2965 │ 6976 │
25. │ TrinityCore/TrinityCore │ 19513 │ 2806 │ 7942 │
26. │ spring-projects/spring-boot │ 19349 │ 5732 │ 58232 │
27. │ ant-design/ant-design │ 19149 │ 9312 │ 71552 │
28. │ magento/magento2 │ 18884 │ 7452 │ 10987 │
29. │ NixOS/nixpkgs │ 18497 │ 3236 │ 6679 │
30. │ JuliaLang/julia │ 18456 │ 3227 │ 34480 │
31. │ owncloud/core │ 18366 │ 6352 │ 9048 │
32. │ andresriancho/w3af │ 18128 │ 394 │ 3713 │
33. │ elastic/elasticsearch │ 18111 │ 5912 │ 48810 │
34. │ dotnet/corefx │ 17693 │ 4157 │ 20702 │
35. │ grafana/grafana │ 17543 │ 7973 │ 39147 │
36. │ pytorch/pytorch │ 16622 │ 7044 │ 47889 │
37. │ FortAwesome/Font-Awesome │ 16438 │ 12745 │ 75924 │
38. │ rg3/youtube-dl │ 16144 │ 8557 │ 51271 │
39. │ hashicorp/terraform │ 15869 │ 8496 │ 26227 │
40. │ CleverRaven/Cataclysm-DDA │ 15797 │ 2927 │ 5293 │
41. │ ethereum/go-ethereum │ 15632 │ 2526 │ 30598 │
42. │ ElemeFE/element │ 15541 │ 8909 │ 53749 │
43. │ atom/atom │ 14919 │ 10357 │ 68396 │
44. │ symfony/symfony │ 14712 │ 6383 │ 27656 │
45. │ Automattic/wp-calypso │ 14616 │ 953 │ 13129 │
46. │ npm/npm │ 14494 │ 11403 │ 16774 │
47. │ laravel/framework │ 14341 │ 8388 │ 25317 │
48. │ rails/rails │ 13904 │ 7909 │ 53620 │
49. │ twbs/bootstrap │ 13599 │ 8526 │ 126371 │
50. │ odoo/odoo │ 12941 │ 4196 │ 22407 │
└─────────────────────────────┴────────┴───────┴────────┘
50 rows in set. Elapsed: 2.285 sec. Processed 343.40 million rows, 7.68 GB (150.29 million rows/s., 3.36 GB/s.)
Now it looks like a reasonable list of the top repositories.
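The report above computes three aggregates in a single pass over two event types and then drops repositories below a star cutoff. Here is a minimal Python sketch of that conditional aggregation on invented events. The function name, the tuple layout, and the toy cutoff of 3 stars are assumptions; the report itself uses 1000.

```python
from collections import defaultdict


def issue_report(events, min_stars=0):
    """events: iterable of (event_type, action, actor_login, repo_name).

    Returns rows (repo_name, issues, issue_authors, stars) sorted by the
    number of opened issues, keeping repos with at least `min_stars`.
    """
    stats = defaultdict(lambda: {"issues": 0, "authors": set(), "stars": 0})
    for event_type, action, actor, repo in events:
        if event_type not in ("IssuesEvent", "WatchEvent"):
            continue  # same role as the WHERE clause
        row = stats[repo]
        if event_type == "IssuesEvent" and action == "opened":
            row["issues"] += 1          # sum(issue_created)
            row["authors"].add(actor)   # uniqIf(actor_login, issue_created)
        elif event_type == "WatchEvent":
            row["stars"] += 1           # sum(event_type = 'WatchEvent')
    rows = [
        (repo, r["issues"], len(r["authors"]), r["stars"])
        for repo, r in stats.items()
        if r["stars"] >= min_stars      # same role as HAVING
    ]
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Invented toy events: a bot-driven repo and a real one.
toy = [("IssuesEvent", "opened", "bot", "bot/repo")] * 5
toy += [
    ("IssuesEvent", "opened", "u1", "real/repo"),
    ("IssuesEvent", "opened", "u2", "real/repo"),
    ("WatchEvent", "started", "u1", "real/repo"),
    ("WatchEvent", "started", "u2", "real/repo"),
    ("WatchEvent", "started", "u3", "real/repo"),
]
unfiltered = issue_report(toy)
filtered = issue_report(toy, min_stars=3)
```

Without the cutoff the bot-driven repository leads the list; with it, only the repository that people actually starred remains.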
And for the next report, let's sort by the number of issue authors:
WITH (event_type = 'IssuesEvent') AND (action = 'opened') AS issue_created
SELECT
repo_name,
sum(issue_created) AS c,
uniqIf(actor_login, issue_created) AS u,
sum(event_type = 'WatchEvent') AS stars
FROM github_events
WHERE event_type IN ('IssuesEvent', 'WatchEvent')
GROUP BY repo_name
ORDER BY u DESC
LIMIT 50
┌─repo_name───────────────────────────────────────────────────┬──────c─┬─────u─┬──stars─┐
1. │ Microsoft/vscode │ 65122 │ 27038 │ 82043 │
2. │ MicrosoftDocs/azure-docs │ 44131 │ 21036 │ 4888 │
3. │ microsoft/vscode │ 34798 │ 19479 │ 38395 │
4. │ flutter/flutter │ 45920 │ 16677 │ 116303 │
5. │ tensorflow/tensorflow │ 28380 │ 16377 │ 173681 │
6. │ ansible/ansible │ 27774 │ 13050 │ 51144 │
7. │ 996icu/996.ICU │ 22791 │ 12818 │ 354850 │
8. │ FortAwesome/Font-Awesome │ 16438 │ 12745 │ 75924 │
9. │ facebook/react-native │ 19926 │ 11861 │ 105346 │
10. │ npm/npm │ 14494 │ 11403 │ 16774 │
11. │ jlippold/tweakCompatible │ 127029 │ 11243 │ 478 │
12. │ githubschool/open-enrollment-classes-introduction-to-github │ 12094 │ 10730 │ 921 │
13. │ angular/angular │ 21806 │ 10403 │ 80602 │
14. │ atom/atom │ 14919 │ 10357 │ 68396 │
15. │ golang/go │ 31443 │ 10190 │ 92407 │
16. │ ContinuumIO/anaconda-issues │ 11834 │ 9570 │ 688 │
17. │ ant-design/ant-design │ 19149 │ 9312 │ 71552 │
18. │ ElemeFE/element │ 15541 │ 8909 │ 53749 │
19. │ kubernetes/kubernetes │ 31644 │ 8615 │ 68644 │
20. │ rg3/youtube-dl │ 16144 │ 8557 │ 51271 │
21. │ twbs/bootstrap │ 13599 │ 8526 │ 126371 │
22. │ hashicorp/terraform │ 15869 │ 8496 │ 26227 │
23. │ laravel/framework │ 14341 │ 8388 │ 25317 │
24. │ docker/for-win │ 9429 │ 8124 │ 1418 │
25. │ grafana/grafana │ 17543 │ 7973 │ 39147 │
26. │ rails/rails │ 13904 │ 7909 │ 53620 │
27. │ magento/magento2 │ 18884 │ 7452 │ 10987 │
28. │ angular/angular-cli │ 12114 │ 7313 │ 26924 │
29. │ pytorch/pytorch │ 16622 │ 7044 │ 47889 │
30. │ qbittorrent/qBittorrent │ 10988 │ 6762 │ 12064 │
31. │ Microsoft/TypeScript │ 21004 │ 6744 │ 52328 │
32. │ docker/docker │ 12292 │ 6466 │ 33396 │
33. │ symfony/symfony │ 14712 │ 6383 │ 27656 │
34. │ owncloud/core │ 18366 │ 6352 │ 9048 │
35. │ rust-lang/rust │ 28036 │ 6283 │ 53027 │
36. │ facebook/react │ 9651 │ 6271 │ 188575 │
37. │ nodejs/node │ 11756 │ 6221 │ 75477 │
38. │ XX-net/XX-Net │ 12533 │ 6192 │ 36800 │
39. │ electron/electron │ 10445 │ 6059 │ 71394 │
40. │ elastic/elasticsearch │ 18111 │ 5912 │ 48810 │
41. │ Wynncraft/Issues │ 8840 │ 5854 │ 229 │
42. │ spyder-ide/spyder │ 11348 │ 5818 │ 6558 │
43. │ travis-ci/travis-ci │ 10057 │ 5800 │ 9368 │
44. │ saltstack/salt │ 22450 │ 5777 │ 12989 │
45. │ angular/angular.js │ 8785 │ 5757 │ 76251 │
46. │ spring-projects/spring-boot │ 19349 │ 5732 │ 58232 │
47. │ home-assistant/home-assistant │ 11875 │ 5632 │ 30023 │
48. │ gatsbyjs/gatsby │ 11540 │ 5565 │ 51589 │
49. │ godotengine/godot │ 23546 │ 5521 │ 34238 │
50. │ nextcloud/server │ 11355 │ 5471 │ 13428 │
└─────────────────────────────────────────────────────────────┴────────┴───────┴────────┘
50 rows in set. Elapsed: 2.309 sec. Processed 343.40 million rows, 7.68 GB (148.71 million rows/s., 3.32 GB/s.)
Repositories with the largest number of users who pushed to them:
:) SELECT repo_name, uniqIf(actor_login, event_type = 'PushEvent') AS u, sum(event_type = 'WatchEvent') AS stars FROM github_events WHERE event_type IN ('PushEvent', 'WatchEvent') AND repo_name != '/' GROUP BY repo_name ORDER BY u DESC LIMIT 50
┌─repo_name───────────────────────────────────────────────────┬────u─┬─stars─┐
1. │ githubschool/open-enrollment-classes-introduction-to-github │ 7869 │ 921 │
2. │ githubschool/on-demand-github-pages │ 1100 │ 87 │
3. │ llvm/llvm-project │ 826 │ 7228 │
4. │ lifo/docrails │ 592 │ 356 │
5. │ HNGInternship/HNGInternship4 │ 518 │ 44 │
6. │ bioconda/bioconda-recipes │ 465 │ 1347 │
7. │ odoo-dev/odoo │ 461 │ 150 │
8. │ cs480-projects/cs480-projects.github.io │ 449 │ 15 │
9. │ hnginterns/hngfun │ 437 │ 33 │
10. │ Automattic/wp-calypso │ 432 │ 13129 │
11. │ tencentyun/qcloud-documents │ 429 │ 1032 │
12. │ HNGInternship/HNGFun │ 419 │ 21 │
13. │ mks65/list │ 409 │ 0 │
14. │ mks65/euler │ 408 │ 1 │
15. │ armory-training/spintroui │ 390 │ 4 │
16. │ mks66/line │ 387 │ 0 │
17. │ mks66/picmaker │ 383 │ 3 │
18. │ mks65/dirinfo │ 382 │ 0 │
19. │ CocoaPods/Specs │ 375 │ 6869 │
20. │ mks66/matrix │ 374 │ 0 │
21. │ mks66/3d │ 373 │ 0 │
22. │ mks66/curves │ 357 │ 0 │
23. │ IGM-RichMedia-at-RIT/430-Git-Exercise-Part2 │ 349 │ 0 │
24. │ mks66/polygons │ 346 │ 0 │
25. │ mks66/cstack │ 343 │ 0 │
26. │ DataDog/documentation │ 339 │ 151 │
27. │ mks65/randfile │ 336 │ 0 │
28. │ mks66/mdl │ 335 │ 0 │
29. │ hmrc/jenkins-jobs │ 334 │ 27 │
30. │ gatsbyjs/gatsby │ 330 │ 51589 │
31. │ edx/edx-platform │ 330 │ 6002 │
32. │ elastic/kibana │ 329 │ 13434 │
33. │ hnginterns/getting-started-h2-2017 │ 317 │ 28 │
34. │ mks66/transform │ 307 │ 0 │
35. │ mks65/signals │ 295 │ 0 │
36. │ mks65/semaphone │ 295 │ 0 │
37. │ Pietia1978/kwztutorial │ 292 │ 9 │
38. │ stuy-softdev/workshop │ 286 │ 8 │
39. │ wix/wix-style-react │ 284 │ 814 │
40. │ mks66/final │ 281 │ 0 │
41. │ mks65/shell │ 275 │ 2 │
42. │ department-of-veterans-affairs/va.gov-team │ 273 │ 103 │
43. │ mongodb/mongo │ 268 │ 21412 │
44. │ pkzhbin1/origin │ 267 │ 64 │
45. │ chscodecamp/github │ 263 │ 12 │
46. │ Automattic/jetpack │ 263 │ 1451 │
47. │ stuycs-softdev/submissions │ 259 │ 9 │
48. │ Jabont/ITM2018 │ 256 │ 32 │
49. │ elastic/elasticsearch │ 252 │ 48810 │
50. │ githubteacher/poetry │ 249 │ 15 │
└─────────────────────────────────────────────────────────────┴──────┴───────┘
50 rows in set. Elapsed: 9.167 sec. Processed 1.83 billion rows, 26.04 GB (200.00 million rows/s., 2.84 GB/s.)
The first two are educational repositories, and I'm happy to see the LLVM project in third place. It's really fantastic to have almost a thousand people pushing to one repository. Or maybe it's just a development model in which each user or organization gets push access to its own separate branches?
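One way to probe that guess is to look at which refs people actually push to. This is only a sketch: it assumes the `ref` column of the dataset holds the pushed ref for every PushEvent, and the LIMIT of 20 is an arbitrary choice.

```sql
-- Which refs of llvm/llvm-project receive pushes, and from how many people?
-- Assumption: `ref` is populated for every PushEvent.
SELECT
    ref,
    uniq(actor_login) AS pushers,
    count() AS pushes
FROM github_events
WHERE (event_type = 'PushEvent') AND (repo_name = 'llvm/llvm-project')
GROUP BY ref
ORDER BY pushers DESC
LIMIT 20
```

If most pushers show up on a single ref, the "separate branches" explanation does not hold; if they are spread over many refs, it probably does.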
Repositories with the most people who have pushed to the main branch:
SELECT
    repo_name,
    uniqIf(actor_login, (event_type = 'PushEvent') AND match(ref, '/(main|master)$')) AS u,
    sum(event_type = 'WatchEvent') AS stars
FROM github_events
WHERE (event_type IN ('PushEvent', 'WatchEvent')) AND (repo_name != '/')
GROUP BY repo_name
ORDER BY u DESC
LIMIT 50
┌─repo_name───────────────────────────────────────────────────┬────u─┬─stars─┐
1. │ githubschool/open-enrollment-classes-introduction-to-github │ 5603 │ 921 │
2. │ llvm/llvm-project │ 824 │ 7228 │
3. │ githubschool/on-demand-github-pages │ 808 │ 87 │
4. │ lifo/docrails │ 592 │ 356 │
5. │ HNGInternship/HNGInternship4 │ 517 │ 44 │
6. │ cs480-projects/cs480-projects.github.io │ 449 │ 15 │
7. │ hnginterns/hngfun │ 437 │ 33 │
8. │ bioconda/bioconda-recipes │ 432 │ 1347 │
9. │ tencentyun/qcloud-documents │ 426 │ 1032 │
10. │ HNGInternship/HNGFun │ 419 │ 21 │
11. │ mks65/list │ 409 │ 0 │
12. │ mks65/euler │ 408 │ 1 │
13. │ mks66/line │ 387 │ 0 │
14. │ mks66/picmaker │ 383 │ 3 │
15. │ mks65/dirinfo │ 382 │ 0 │
16. │ Automattic/wp-calypso │ 378 │ 13129 │
17. │ mks66/matrix │ 374 │ 0 │
18. │ CocoaPods/Specs │ 374 │ 6869 │
19. │ mks66/3d │ 373 │ 0 │
20. │ mks66/curves │ 357 │ 0 │
21. │ IGM-RichMedia-at-RIT/430-Git-Exercise-Part2 │ 349 │ 0 │
22. │ mks66/polygons │ 346 │ 0 │
23. │ mks66/cstack │ 343 │ 0 │
24. │ mks65/randfile │ 336 │ 0 │
25. │ mks66/mdl │ 335 │ 0 │
26. │ hmrc/jenkins-jobs │ 319 │ 27 │
27. │ hnginterns/getting-started-h2-2017 │ 313 │ 28 │
28. │ mks66/transform │ 307 │ 0 │
29. │ edx/edx-platform │ 301 │ 6002 │
30. │ elastic/kibana │ 300 │ 13434 │
31. │ mks65/signals │ 295 │ 0 │
32. │ mks65/semaphone │ 295 │ 0 │
33. │ Pietia1978/kwztutorial │ 292 │ 9 │
34. │ stuy-softdev/workshop │ 286 │ 8 │
35. │ mks66/final │ 281 │ 0 │
36. │ mks65/shell │ 275 │ 2 │
37. │ department-of-veterans-affairs/va.gov-team │ 271 │ 103 │
38. │ pkzhbin1/origin │ 267 │ 64 │
39. │ chscodecamp/github │ 263 │ 12 │
40. │ mongodb/mongo │ 259 │ 21412 │
41. │ stuycs-softdev/submissions │ 259 │ 9 │
42. │ Jabont/ITM2018 │ 256 │ 32 │
43. │ mks65/stat │ 239 │ 0 │
44. │ mks65/array_swap │ 236 │ 0 │
45. │ mks65/stringy │ 231 │ 0 │
46. │ becodeorg/La-Veille │ 228 │ 22 │
47. │ nus-cs2103-AY2021S1/pe-dev-response │ 218 │ 0 │
48. │ coderefinery/centralized-workflow-exercise │ 211 │ 3 │
49. │ mks66/lighting │ 206 │ 0 │
50. │ mozilla-b2g/gaia │ 206 │ 2101 │
└─────────────────────────────────────────────────────────────┴──────┴───────┘
50 rows in set. Elapsed: 9.807 sec. Processed 1.83 billion rows, 42.95 GB (186.93 million rows/s., 4.38 GB/s.)
Almost the same, with LLVM moving up to second place and on top among the non-educational repositories.
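It is worth checking what the `match(ref, '/(main|master)$')` filter really accepts. The query below is a self-contained sketch that runs the same pattern over a few made-up ref values. The `refs/heads/...` shape is an assumption about how refs are stored, and the branch names are hypothetical.

```sql
-- Hypothetical ref values, chosen only to exercise the pattern.
SELECT
    ref,
    match(ref, '/(main|master)$') AS counted
FROM
(
    SELECT arrayJoin([
        'refs/heads/master',
        'refs/heads/main',
        'refs/heads/feature/main-menu',
        'refs/heads/users/alice/main'
    ]) AS ref
)
```

The pattern only anchors the last path component, so the first two values match, `feature/main-menu` does not, and `users/alice/main` matches as well. A personal branch that happens to be named `main` is therefore counted as the main branch too.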
With a cutoff for the number of stars:
SELECT
    repo_name,
    uniqIf(actor_login, (event_type = 'PushEvent') AND match(ref, '/(main|master)$')) AS u,
    sum(event_type = 'WatchEvent') AS stars
FROM github_events
WHERE (event_type IN ('PushEvent', 'WatchEvent')) AND (repo_name != '/')
GROUP BY repo_name
HAVING stars >= 100
ORDER BY u DESC
LIMIT 50
┌─repo_name───────────────────────────────────────────────────┬────u─┬──stars─┐
1. │ githubschool/open-enrollment-classes-introduction-to-github │ 5603 │ 921 │
2. │ llvm/llvm-project │ 824 │ 7228 │
3. │ lifo/docrails │ 592 │ 356 │
4. │ bioconda/bioconda-recipes │ 432 │ 1347 │
5. │ tencentyun/qcloud-documents │ 426 │ 1032 │
6. │ Automattic/wp-calypso │ 378 │ 13129 │
7. │ CocoaPods/Specs │ 374 │ 6869 │
8. │ edx/edx-platform │ 301 │ 6002 │
9. │ elastic/kibana │ 300 │ 13434 │
10. │ department-of-veterans-affairs/va.gov-team │ 271 │ 103 │
11. │ mongodb/mongo │ 259 │ 21412 │
12. │ mozilla-b2g/gaia │ 206 │ 2101 │
13. │ elastic/elasticsearch │ 202 │ 48810 │
14. │ dotnet/corefx │ 201 │ 20702 │
15. │ cloudfoundry/cloud_controller_ng │ 191 │ 196 │
16. │ guardian/frontend │ 189 │ 6030 │
17. │ flutter/flutter │ 181 │ 116303 │
18. │ department-of-veterans-affairs/vets-website │ 181 │ 203 │
19. │ alphagov/govuk-puppet │ 176 │ 116 │
20. │ NixOS/nixpkgs │ 175 │ 6679 │
21. │ apple/swift │ 171 │ 66350 │
22. │ edx/configuration │ 165 │ 882 │
23. │ dotnet/coreclr │ 162 │ 14326 │
24. │ perl6/doc │ 156 │ 264 │
25. │ alphagov/whitehall │ 154 │ 749 │
26. │ nodejs/node │ 153 │ 75477 │
27. │ department-of-veterans-affairs/vets-api │ 153 │ 113 │
28. │ flutter/engine │ 151 │ 4771 │
29. │ Automattic/jetpack │ 151 │ 1451 │
30. │ alphagov/static │ 150 │ 118 │
31. │ WordPress/gutenberg │ 150 │ 6936 │
32. │ perl6/ecosystem │ 148 │ 110 │
33. │ alphagov/frontend │ 145 │ 288 │
34. │ dotnet/runtime │ 144 │ 4842 │
35. │ greenplum-db/gpdb │ 141 │ 4700 │
36. │ alphagov/smart-answers │ 139 │ 145 │
37. │ web-platform-tests/wpt │ 133 │ 1644 │
38. │ w3c/web-platform-tests │ 130 │ 1662 │
39. │ Microsoft/vsts-tasks │ 130 │ 844 │
40. │ DataDog/documentation │ 125 │ 151 │
41. │ Shopify/quilt │ 124 │ 761 │
42. │ microsoft/azure-pipelines-tasks │ 120 │ 975 │
43. │ edx/ecommerce │ 120 │ 129 │
44. │ Talend/tdi-studio-se │ 117 │ 100 │
45. │ Azure/azure-sdk-for-net │ 117 │ 1845 │
46. │ OfficeDev/office-ui-fabric-react │ 116 │ 7194 │
47. │ facebook/rocksdb │ 114 │ 19694 │
48. │ xbmc/xbmc │ 114 │ 13127 │
49. │ Azure/azure-sdk-for-java │ 112 │ 671 │
50. │ JetBrains/kotlin │ 111 │ 38249 │
└─────────────────────────────────────────────────────────────┴──────┴────────┘
50 rows in set. Elapsed: 10.163 sec. Processed 1.83 billion rows, 42.95 GB (180.39 million rows/s., 4.23 GB/s.)
:) SELECT repo_name, sum(event_type = 'MemberEvent') AS invitations, sum(event_type = 'WatchEvent') AS stars FROM github_events WHERE event_type IN ('MemberEvent', 'WatchEvent') GROUP BY repo_name HAVING stars >= 100 ORDER BY invitations DESC LIMIT 50
┌─repo_name───────────────────────────────────────────────────┬─invitations─┬─stars─┐
1. │ githubschool/open-enrollment-classes-introduction-to-github │ 10453 │ 921 │
2. │ gatsbyjs/gatsby │ 3161 │ 51589 │
3. │ / │ 3144 │ 5496 │
4. │ openjournals/joss-reviews │ 1733 │ 411 │
5. │ InfiniteFlightAirportEditing/Airports │ 684 │ 192 │
6. │ zulip/zulip │ 644 │ 14027 │
7. │ tencentyun/qcloud-documents │ 446 │ 1032 │
8. │ w3c/web-platform-tests │ 343 │ 1662 │
9. │ llvm/llvm-project │ 322 │ 7228 │
10. │ koppi/iso-country-flags-svg-collection │ 318 │ 408 │
11. │ zephyrproject-rtos/zephyr │ 291 │ 4198 │
12. │ Automattic/wp-calypso │ 246 │ 13129 │
13. │ FightPandemics/FightPandemics │ 238 │ 108 │
14. │ magento/magento2 │ 234 │ 10987 │
15. │ dotnet/corefx │ 216 │ 20702 │
16. │ helm/charts │ 213 │ 11693 │
17. │ oppia/oppia │ 210 │ 1802 │
18. │ Code4SocialGood/c4sg-web │ 184 │ 100 │
19. │ pqrs-org/KE-complex_modifications │ 174 │ 701 │
20. │ IBM-Bluemix/docs │ 166 │ 153 │
21. │ OfficeDev/office-ui-fabric-react │ 156 │ 7194 │
22. │ openshiftio/openshift.io │ 155 │ 106 │
23. │ julianguyen/ifme │ 150 │ 547 │
24. │ dotnet/runtime │ 149 │ 4842 │
25. │ Techtonica/curriculum │ 149 │ 437 │
26. │ pytorch/pytorch │ 139 │ 47889 │
27. │ MicrosoftDocs/mixed-reality │ 138 │ 155 │
28. │ ifmeorg/ifme │ 136 │ 728 │
29. │ magento-engcom/msi │ 135 │ 234 │
30. │ MicrosoftDocs/office-docs-powershell │ 132 │ 342 │
31. │ MicrosoftDocs/azure-docs │ 126 │ 4888 │
32. │ leanprover-community/mathlib │ 125 │ 420 │
33. │ pytorch/vision │ 124 │ 8418 │
34. │ pytorch/tutorials │ 121 │ 4838 │
35. │ sigmavirus24/Todo.txt-python │ 120 │ 109 │
36. │ pytorch/text │ 120 │ 2706 │
37. │ MicrosoftDocs/visualstudio-docs │ 120 │ 630 │
38. │ MicrosoftDocs/OfficeDocs-SharePoint │ 119 │ 254 │
39. │ MicrosoftDocs/feedback │ 118 │ 223 │
40. │ pytorch/builder │ 116 │ 156 │
41. │ MicrosoftDocs/windowsserverdocs │ 115 │ 926 │
42. │ MicrosoftDocs/OfficeDocs-SkypeForBusiness │ 115 │ 220 │
43. │ MicrosoftDocs/windows-powershell-docs │ 115 │ 213 │
44. │ cockroachdb/cockroach │ 114 │ 21147 │
45. │ MicrosoftDocs/intellicode │ 114 │ 443 │
46. │ MicrosoftDocs/edge-developer │ 114 │ 157 │
47. │ hoodiehq/camp │ 114 │ 124 │
48. │ MicrosoftDocs/windows-itpro-docs │ 114 │ 952 │
49. │ MicrosoftDocs/microsoft-365-docs │ 113 │ 296 │
50. │ pytorch/examples │ 112 │ 16121 │
└─────────────────────────────────────────────────────────────┴─────────────┴───────┘
50 rows in set. Elapsed: 1.149 sec. Processed 246.44 million rows, 2.15 GB (214.52 million rows/s., 1.87 GB/s.)
:) SELECT repo_name, count() AS forks FROM github_events WHERE event_type = 'ForkEvent' GROUP BY repo_name ORDER BY forks DESC LIMIT 50
┌─repo_name───────────────────────────────┬──forks─┐
1. │ jtleek/datasharing │ 262926 │
2. │ octocat/Spoon-Knife │ 198031 │
3. │ rdpeng/ProgrammingAssignment2 │ 160794 │
4. │ tensorflow/tensorflow │ 98226 │
5. │ twbs/bootstrap │ 92878 │
6. │ github/gitignore │ 84075 │
7. │ SmartThingsCommunity/SmartThingsPublic │ 78551 │
8. │ barryclark/jekyll-now │ 68601 │
9. │ rdpeng/ExData_Plotting1 │ 67182 │
10. │ nightscout/cgm-remote-monitor │ 59420 │
11. │ Pierian-Data/Complete-Python-3-Bootcamp │ 49504 │
12. │ tensorflow/models │ 49502 │
13. │ torvalds/linux │ 47280 │
14. │ jlord/patchwork │ 45136 │
15. │ facebook/react │ 44678 │
16. │ eugenp/tutorials │ 44522 │
17. │ rdpeng/RepData_PeerAssessment1 │ 44449 │
18. │ angular/angular.js │ 43790 │
19. │ jackfrued/Python-100-Days │ 41558 │
20. │ opencv/opencv │ 41545 │
21. │ spring-projects/spring-boot │ 38610 │
22. │ LarryMad/recipes │ 38317 │
23. │ vuejs/vue │ 37441 │
24. │ laravel/laravel │ 36584 │
25. │ jwasham/coding-interview-university │ 36256 │
26. │ udacity/frontend-nanodegree-resume │ 35434 │
27. │ mrdoob/three.js │ 35378 │
28. │ django/django │ 34930 │
29. │ firstcontributions/first-contributions │ 34173 │
30. │ spring-projects/spring-framework │ 33862 │
31. │ Snailclimb/JavaGuide │ 33198 │
32. │ jquery/jquery │ 31858 │
33. │ ant-design/ant-design │ 31286 │
34. │ getify/You-Dont-Know-JS │ 31020 │
35. │ CyC2018/CS-Notes │ 30915 │
36. │ TheAlgorithms/Python │ 30656 │
37. │ rails/rails │ 30602 │
38. │ DefinitelyTyped/DefinitelyTyped │ 30389 │
39. │ mmistakes/minimal-mistakes │ 29852 │
40. │ kubernetes/kubernetes │ 29487 │
41. │ apache/spark │ 28904 │
42. │ facebook/react-native │ 27146 │
43. │ 996icu/996.ICU │ 27101 │
44. │ ansible/ansible │ 27001 │
45. │ scikit-learn/scikit-learn │ 26726 │
46. │ robbyrussell/oh-my-zsh │ 26561 │
47. │ coolsnowwolf/lede │ 26469 │
48. │ bitcoin/bitcoin │ 26356 │
49. │ git/git │ 25929 │
50. │ angular/angular │ 25463 │
└─────────────────────────────────────────┴────────┘
50 rows in set. Elapsed: 0.447 sec. Processed 84.72 million rows, 781.11 MB (189.73 million rows/s., 1.75 GB/s.)
There are about 85 million forks on GitHub. This is about half of all repositories. If we filter out repositories whose main purpose is to be forked, the clear winner is TensorFlow, followed by Bootstrap, Linux, React and OpenCV.
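The filtering can also be done by the query instead of by eye. A rough sketch, assuming that a repository with at least as many stars as forks is a "real" project rather than a fork target; that cutoff is my own heuristic, not something the dataset defines.

```sql
-- Most forked repositories, keeping only those with at least as many stars as forks.
-- The `stars >= forks` condition is a hypothetical heuristic for "not made to be forked".
SELECT
    repo_name,
    sum(event_type = 'ForkEvent') AS forks,
    sum(event_type = 'WatchEvent') AS stars
FROM github_events
WHERE event_type IN ('ForkEvent', 'WatchEvent')
GROUP BY repo_name
HAVING stars >= forks
ORDER BY forks DESC
LIMIT 50
```

This will also keep list-like repositories that are both starred and forked a lot, so the result still needs a look by a human.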
Let's check the proportion of stars to forks.
SELECT
    repo_name,
    sum(event_type = 'ForkEvent') AS forks,
    sum(event_type = 'WatchEvent') AS stars,
    round(stars / forks, 3) AS ratio
FROM github_events
WHERE event_type IN ('ForkEvent', 'WatchEvent')
GROUP BY repo_name
ORDER BY forks DESC
LIMIT 50
┌─repo_name───────────────────────────────┬──forks─┬──stars─┬──ratio─┐
1. │ jtleek/datasharing │ 262926 │ 6364 │ 0.024 │
2. │ octocat/Spoon-Knife │ 198031 │ 4601 │ 0.023 │
3. │ rdpeng/ProgrammingAssignment2 │ 160794 │ 990 │ 0.006 │
4. │ tensorflow/tensorflow │ 98226 │ 173681 │ 1.768 │
5. │ twbs/bootstrap │ 92878 │ 126371 │ 1.361 │
6. │ github/gitignore │ 84075 │ 119322 │ 1.419 │
7. │ SmartThingsCommunity/SmartThingsPublic │ 78551 │ 2073 │ 0.026 │
8. │ barryclark/jekyll-now │ 68601 │ 8185 │ 0.119 │
9. │ rdpeng/ExData_Plotting1 │ 67182 │ 271 │ 0.004 │
10. │ nightscout/cgm-remote-monitor │ 59420 │ 1784 │ 0.03 │
11. │ Pierian-Data/Complete-Python-3-Bootcamp │ 49504 │ 14952 │ 0.302 │
12. │ tensorflow/models │ 49502 │ 75206 │ 1.519 │
13. │ torvalds/linux │ 47280 │ 121415 │ 2.568 │
14. │ jlord/patchwork │ 45136 │ 1329 │ 0.029 │
15. │ facebook/react │ 44678 │ 188575 │ 4.221 │
16. │ eugenp/tutorials │ 44522 │ 26055 │ 0.585 │
17. │ rdpeng/RepData_PeerAssessment1 │ 44449 │ 124 │ 0.003 │
18. │ angular/angular.js │ 43790 │ 76251 │ 1.741 │
19. │ jackfrued/Python-100-Days │ 41558 │ 108760 │ 2.617 │
20. │ opencv/opencv │ 41545 │ 45223 │ 1.089 │
21. │ spring-projects/spring-boot │ 38610 │ 58232 │ 1.508 │
22. │ LarryMad/recipes │ 38317 │ 170 │ 0.004 │
23. │ vuejs/vue │ 37441 │ 199731 │ 5.335 │
24. │ laravel/laravel │ 36584 │ 74136 │ 2.026 │
25. │ jwasham/coding-interview-university │ 36256 │ 119797 │ 3.304 │
26. │ udacity/frontend-nanodegree-resume │ 35434 │ 1351 │ 0.038 │
27. │ mrdoob/three.js │ 35378 │ 72597 │ 2.052 │
28. │ django/django │ 34930 │ 66415 │ 1.901 │
29. │ firstcontributions/first-contributions │ 34173 │ 11620 │ 0.34 │
30. │ spring-projects/spring-framework │ 33862 │ 44540 │ 1.315 │
31. │ Snailclimb/JavaGuide │ 33198 │ 97793 │ 2.946 │
32. │ jquery/jquery │ 31858 │ 65497 │ 2.056 │
33. │ ant-design/ant-design │ 31286 │ 71552 │ 2.287 │
34. │ getify/You-Dont-Know-JS │ 31020 │ 144146 │ 4.647 │
35. │ CyC2018/CS-Notes │ 30915 │ 93320 │ 3.019 │
36. │ TheAlgorithms/Python │ 30656 │ 102067 │ 3.329 │
37. │ rails/rails │ 30602 │ 53620 │ 1.752 │
38. │ DefinitelyTyped/DefinitelyTyped │ 30389 │ 27883 │ 0.918 │
39. │ mmistakes/minimal-mistakes │ 29852 │ 8299 │ 0.278 │
40. │ kubernetes/kubernetes │ 29487 │ 68644 │ 2.328 │
41. │ apache/spark │ 28904 │ 32616 │ 1.128 │
42. │ facebook/react-native │ 27146 │ 105346 │ 3.881 │
43. │ 996icu/996.ICU │ 27101 │ 354850 │ 13.094 │
44. │ ansible/ansible │ 27001 │ 51144 │ 1.894 │
45. │ scikit-learn/scikit-learn │ 26726 │ 48654 │ 1.82 │
46. │ robbyrussell/oh-my-zsh │ 26561 │ 107173 │ 4.035 │
47. │ coolsnowwolf/lede │ 26469 │ 15568 │ 0.588 │
48. │ bitcoin/bitcoin │ 26356 │ 53646 │ 2.035 │
49. │ git/git │ 25929 │ 41413 │ 1.597 │
50. │ angular/angular │ 25463 │ 80602 │ 3.165 │
└─────────────────────────────────────────┴────────┴────────┴────────┘
50 rows in set. Elapsed: 1.192 sec. Processed 316.85 million rows, 2.60 GB (265.84 million rows/s., 2.18 GB/s.)
We can see a separation. Some repositories are "for forks", like "octocat/Spoon-Knife": either forking is the whole purpose of the repository, or it serves as a template to base something new on. Other repositories are "for stars": usually they are not software but some text content, like "996icu/996.ICU" and "sindresorhus/awesome".
Let's find out which repositories are the most "for forks" and "for stars".
More stars, fewer forks:
SELECT
    repo_name,
    sum(event_type = 'ForkEvent') AS forks,
    sum(event_type = 'WatchEvent') AS stars,
    round(stars / forks, 2) AS ratio
FROM github_events
WHERE event_type IN ('ForkEvent', 'WatchEvent')
GROUP BY repo_name
HAVING (stars > 100) AND (forks > 100)
ORDER BY ratio DESC
LIMIT 50
┌─repo_name─────────────────────────────┬─forks─┬─stars─┬──ratio─┐
1. │ M4cs/BabySploit │ 147 │ 71572 │ 486.88 │
2. │ tipsy/github-profile-summary │ 330 │ 22397 │ 67.87 │
3. │ doctrine/inflector │ 155 │ 10236 │ 66.04 │
4. │ phpDocumentor/ReflectionDocBlock │ 143 │ 8547 │ 59.77 │
5. │ pakastin/open-source-flash │ 128 │ 7326 │ 57.23 │
6. │ egulias/EmailValidator │ 177 │ 9835 │ 55.56 │
7. │ symfony/var-dumper │ 127 │ 6770 │ 53.31 │
8. │ laravel/tinker │ 129 │ 6773 │ 52.5 │
9. │ guzzle/promises │ 126 │ 6596 │ 52.35 │
10. │ paragonie/random_compat │ 157 │ 7941 │ 50.58 │
11. │ fideloper/TrustedProxy │ 147 │ 7201 │ 48.99 │
12. │ charmbracelet/glow │ 120 │ 5838 │ 48.65 │
13. │ woltapp/blurhash │ 168 │ 7875 │ 46.88 │
14. │ dandavison/delta │ 140 │ 6494 │ 46.39 │
15. │ simeji/jid │ 129 │ 5976 │ 46.33 │
16. │ akavel/up │ 111 │ 4873 │ 43.9 │
17. │ evanw/esbuild │ 384 │ 16631 │ 43.31 │
18. │ webmozart/assert │ 158 │ 6795 │ 43.01 │
19. │ tomnomnom/gron │ 200 │ 8597 │ 42.98 │
20. │ sharkdp/hyperfine │ 158 │ 6685 │ 42.31 │
21. │ pikapkg/web │ 119 │ 4986 │ 41.9 │
22. │ mixn/carbon-now-cli │ 116 │ 4778 │ 41.19 │
23. │ CodeHubApp/CodeHub │ 392 │ 15897 │ 40.55 │
24. │ mjswensen/themer │ 102 │ 4091 │ 40.11 │
25. │ react-spring/zustand │ 105 │ 4166 │ 39.68 │
26. │ GoogleChromeLabs/react-adaptive-hooks │ 117 │ 4620 │ 39.49 │
27. │ swc-project/swc │ 254 │ 10011 │ 39.41 │
28. │ Raathigesh/majestic │ 180 │ 7055 │ 39.19 │
29. │ sveinbjornt/Sloth │ 109 │ 4254 │ 39.03 │
30. │ luruke/browser-2020 │ 213 │ 8278 │ 38.86 │
31. │ rhysd/vim.wasm │ 119 │ 4603 │ 38.68 │
32. │ auchenberg/volkswagen │ 306 │ 11730 │ 38.33 │
33. │ resume/resume.github.com │ 1786 │ 68423 │ 38.31 │
34. │ kdeldycke/awesome-falsehood │ 411 │ 15675 │ 38.14 │
35. │ uswds/public-sans │ 102 │ 3890 │ 38.14 │
36. │ GoogleChromeLabs/ndb │ 265 │ 10035 │ 37.87 │
37. │ developit/htm │ 159 │ 5973 │ 37.57 │
38. │ Canop/broot │ 131 │ 4894 │ 37.36 │
39. │ you-dont-need/You-Dont-Need-Momentjs │ 288 │ 10749 │ 37.32 │
40. │ romefrontend/rome │ 131 │ 4843 │ 36.97 │
41. │ jlongster/prettier │ 139 │ 5111 │ 36.77 │
42. │ pojala/electrino │ 118 │ 4333 │ 36.72 │
43. │ symfony/console │ 235 │ 8546 │ 36.37 │
44. │ timqian/chart.xkcd │ 182 │ 6578 │ 36.14 │
45. │ developit/workerize │ 111 │ 4000 │ 36.04 │
46. │ joeycastillo/The-Open-Book │ 122 │ 4362 │ 35.75 │
47. │ framer/motion │ 242 │ 8626 │ 35.64 │
48. │ nearform/node-clinic │ 102 │ 3633 │ 35.62 │
49. │ dylanbeattie/rockstar │ 139 │ 4863 │ 34.99 │
50. │ muesli/duf │ 141 │ 4933 │ 34.99 │
└───────────────────────────────────────┴───────┴───────┴────────┘
50 rows in set. Elapsed: 1.227 sec. Processed 316.85 million rows, 2.60 GB (258.21 million rows/s., 2.12 GB/s.)
More forks, fewer stars:
SELECT
    repo_name,
    sum(event_type = 'ForkEvent') AS forks,
    sum(event_type = 'WatchEvent') AS stars,
    round(forks / stars, 2) AS ratio
FROM github_events
WHERE event_type IN ('ForkEvent', 'WatchEvent')
GROUP BY repo_name
HAVING (stars > 100) AND (forks > 100)
ORDER BY ratio DESC
LIMIT 50
┌─repo_name─────────────────────────────────┬──forks─┬─stars─┬──ratio─┐
1. │ rdpeng/RepData_PeerAssessment1 │ 44449 │ 124 │ 358.46 │
2. │ rdpeng/ExData_Plotting1 │ 67182 │ 271 │ 247.9 │
3. │ LarryMad/recipes │ 38317 │ 170 │ 225.39 │
4. │ rdpeng/ProgrammingAssignment2 │ 160794 │ 990 │ 162.42 │
5. │ jenkins-docs/simple-java-maven-app │ 14484 │ 222 │ 65.24 │
6. │ deadlyvipers/dojo_rules │ 13600 │ 213 │ 63.85 │
7. │ google/it-cert-automation-practice │ 15163 │ 288 │ 52.65 │
8. │ jleetutorial/maven-project │ 8191 │ 171 │ 47.9 │
9. │ jenkins-docs/simple-node-js-react-npm-app │ 5946 │ 128 │ 46.45 │
10. │ udacity/fullstack-nanodegree-vm │ 16782 │ 373 │ 44.99 │
11. │ typicode/demo │ 8291 │ 185 │ 44.82 │
12. │ octocat/Spoon-Knife │ 198031 │ 4601 │ 43.04 │
13. │ scm-ninja/starter-web │ 16299 │ 385 │ 42.34 │
14. │ jtleek/datasharing │ 262926 │ 6364 │ 41.31 │
15. │ openshift/nodejs-ex │ 6956 │ 171 │ 40.68 │
16. │ saasbook/hw-ruby-intro │ 4303 │ 109 │ 39.48 │
17. │ SmartThingsCommunity/SmartThingsPublic │ 78551 │ 2073 │ 37.89 │
18. │ saasbook/hw3_rottenpotatoes │ 8813 │ 242 │ 36.42 │
19. │ sclorg/nodejs-ex │ 6243 │ 175 │ 35.67 │
20. │ holbertonschool/your_first_code │ 4153 │ 117 │ 35.5 │
21. │ jlord/patchwork │ 45136 │ 1329 │ 33.96 │
22. │ LambdaSchool/portfolio-website │ 3526 │ 104 │ 33.9 │
23. │ nightscout/cgm-remote-monitor │ 59420 │ 1784 │ 33.31 │
24. │ udacity/OAuth2.0 │ 3779 │ 114 │ 33.15 │
25. │ xudailong/xudailong.github.io │ 3235 │ 101 │ 32.03 │
26. │ zeit/now-github-starter │ 5992 │ 190 │ 31.54 │
27. │ springframeworkguru/spring5-recipe-app │ 3585 │ 118 │ 30.38 │
28. │ udacity/course-git-blog-project │ 4720 │ 166 │ 28.43 │
29. │ cerner/smart-on-fhir-tutorial │ 3031 │ 108 │ 28.06 │
30. │ devart-by-google/devart-template │ 2887 │ 105 │ 27.5 │
31. │ codeschool-projects/HTMLPortfolioProject │ 3588 │ 133 │ 26.98 │
32. │ carbon-design-system/carbon-tutorial │ 2862 │ 109 │ 26.26 │
33. │ udacity/frontend-nanodegree-resume │ 35434 │ 1351 │ 26.23 │
34. │ udacity/create-your-own-adventure │ 16268 │ 630 │ 25.82 │
35. │ MicrosoftDocs/pipelines-dotnet-core │ 4198 │ 166 │ 25.29 │
36. │ Cutwell/Hacktoberfest-Census │ 3641 │ 144 │ 25.28 │
37. │ iloveponies/training-day │ 3274 │ 140 │ 23.39 │
38. │ yankils/Simple-DevOps-Project │ 4408 │ 196 │ 22.49 │
39. │ fluxcd/flux-get-started │ 2868 │ 129 │ 22.23 │
40. │ udacity/mws-restaurant-stage-1 │ 2807 │ 127 │ 22.1 │
41. │ udacity/devops-intro-project │ 4152 │ 190 │ 21.85 │
42. │ udacity/course-JS-and-the-DOM │ 2494 │ 115 │ 21.69 │
43. │ cnfeat/blog.io │ 5354 │ 256 │ 20.91 │
44. │ evanca/quick-portfolio │ 4234 │ 205 │ 20.65 │
45. │ udacity/frontend-nanodegree-styleguide │ 4813 │ 234 │ 20.57 │
46. │ DevMountain/learn-git │ 2279 │ 111 │ 20.53 │
47. │ Nguyen17/Hacktoberfest-Sign-In │ 3111 │ 152 │ 20.47 │
48. │ springframeworkguru/spring5webapp │ 8163 │ 408 │ 20.01 │
49. │ MLH/mlh-localhost-github │ 2056 │ 105 │ 19.58 │
50. │ do-community/cloud_haiku │ 2058 │ 106 │ 19.42 │
└───────────────────────────────────────────┴────────┴───────┴────────┘
50 rows in set. Elapsed: 1.257 sec. Processed 316.85 million rows, 2.60 GB (252.12 million rows/s., 2.07 GB/s.)
It's hard to find any actual software products in this list.
The overall proportion between stars and forks:
:) SELECT sum(event_type = 'ForkEvent') AS forks, sum(event_type = 'WatchEvent') AS stars, round(stars / forks, 2) AS ratio FROM github_events WHERE event_type IN ('ForkEvent', 'WatchEvent')
┌────forks─┬─────stars─┬─ratio─┐
│ 84709181 │ 232118474 │ 2.74 │
└──────────┴───────────┴───────┘
1 rows in set. Elapsed: 0.119 sec. Processed 316.85 million rows, 316.85 MB (2.67 billion rows/s., 2.67 GB/s.)
And it's higher for more popular repositories:
SELECT
    sum(stars) AS stars,
    sum(forks) AS forks,
    round(stars / forks, 2) AS ratio
FROM
(
    SELECT
        sum(event_type = 'ForkEvent') AS forks,
        sum(event_type = 'WatchEvent') AS stars
    FROM github_events
    WHERE event_type IN ('ForkEvent', 'WatchEvent')
    GROUP BY repo_name
    HAVING stars > 100
)
┌─────stars─┬────forks─┬─ratio─┐
│ 171567035 │ 44944118 │ 3.82 │
└───────────┴──────────┴───────┘
1 rows in set. Elapsed: 1.173 sec. Processed 316.85 million rows, 2.60 GB (270.02 million rows/s., 2.22 GB/s.)
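A single cutoff at 100 stars only gives two data points. To see whether the ratio keeps growing with popularity, the repositories can be split into several buckets. This is a sketch; the bucket boundaries are arbitrary and the bucket labels are made up for readability.

```sql
-- Stars-to-forks ratio per popularity bucket.
-- Bucket boundaries are an arbitrary choice for illustration.
SELECT
    multiIf(
        stars < 10, 'under 10',
        stars < 100, '10 to 99',
        stars < 1000, '100 to 999',
        stars < 10000, '1000 to 9999',
        '10000 and up') AS star_bucket,
    count() AS repos,
    sum(stars) AS total_stars,
    sum(forks) AS total_forks,
    round(total_stars / total_forks, 2) AS ratio
FROM
(
    SELECT
        sum(event_type = 'ForkEvent') AS forks,
        sum(event_type = 'WatchEvent') AS stars
    FROM github_events
    WHERE event_type IN ('ForkEvent', 'WatchEvent')
    GROUP BY repo_name
)
GROUP BY star_bucket
ORDER BY min(stars)
```

The first bucket also contains repositories that were forked but never starred, so its ratio is expected to be pulled down by them.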
Total number of issue comments on GitHub:
:) SELECT count() FROM github_events WHERE event_type = 'IssueCommentEvent'
┌───count()─┐
│ 218460262 │
└───────────┘
1 rows in set. Elapsed: 0.051 sec. Processed 218.47 million rows, 218.47 MB (4.32 billion rows/s., 4.32 GB/s.)
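Note that this counts comment events, not issues. A rough way to get from one to the other is to count distinct (repository, issue number) pairs among the same events. This is a sketch with two caveats: `uniq` is an approximate aggregate, and issues that never received a comment are not seen at all. As far as I know, GitHub also emits issue comment events for comments on pull requests, so pull requests are included in the number.

```sql
-- Comments versus the distinct issues (and pull requests) that received them.
-- uniq() is approximate; issues without any comment are not counted.
SELECT
    count() AS comments,
    uniq(repo_name, number) AS commented_issues,
    round(comments / commented_issues, 2) AS ratio
FROM github_events
WHERE event_type = 'IssueCommentEvent'
```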
The top repositories by issue comments:
:) SELECT repo_name, count() FROM github_events WHERE event_type = 'IssueCommentEvent' GROUP BY repo_name ORDER BY count() DESC LIMIT 50
┌─repo_name────────────────────────────┬─count()─┐
1. │ kubernetes/kubernetes │ 1450081 │
2. │ apache/spark │ 790480 │
3. │ rust-lang/rust │ 502960 │
4. │ cms-sw/cmssw │ 478607 │
5. │ google-test/signcla-probe-repo │ 477302 │
6. │ openshift/origin │ 445639 │
7. │ brianchandotcom/liferay-portal │ 392474 │
8. │ NixOS/nixpkgs │ 349733 │
9. │ elastic/kibana │ 330168 │
10. │ ansible/ansible │ 312424 │
11. │ everypolitician/everypolitician-data │ 293714 │
12. │ flutter/flutter │ 291260 │
13. │ owncloud/core │ 283295 │
14. │ istio/istio │ 272426 │
15. │ Microsoft/vscode │ 267468 │
16. │ nodejs/node │ 258525 │
17. │ tensorflow/tensorflow │ 255957 │
18. │ MicrosoftDocs/azure-docs │ 245829 │
19. │ JuliaLang/julia │ 229092 │
20. │ golang/go │ 222317 │
21. │ cockroachdb/cockroach │ 222305 │
22. │ tgstation/tgstation │ 221465 │
23. │ joomla/joomla-cms │ 218215 │
24. │ servo/servo │ 213924 │
25. │ dotnet/corefx │ 205892 │
26. │ angular/angular │ 200608 │
27. │ magento/magento2 │ 198987 │
28. │ saltstack/salt │ 196898 │
29. │ docker/docker │ 194622 │
30. │ facebook/react-native │ 188756 │
31. │ dotnet/roslyn │ 187150 │
32. │ apple/swift │ 175983 │
33. │ elastic/elasticsearch │ 175442 │
34. │ symfony/symfony │ 171438 │
35. │ rails/rails │ 170580 │
36. │ godotengine/godot │ 168950 │
37. │ Automattic/wp-calypso │ 165136 │
38. │ openshift/release │ 164215 │
39. │ DefinitelyTyped/DefinitelyTyped │ 164071 │
40. │ odoo/odoo │ 163858 │
41. │ edx/edx-platform │ 157913 │
42. │ pytorch/pytorch │ 151349 │
43. │ microsoft/vscode │ 151038 │
44. │ bitcoin/bitcoin │ 146471 │
45. │ MarlinFirmware/Marlin │ 146463 │
46. │ ManageIQ/manageiq │ 145739 │
47. │ scikit-learn/scikit-learn │ 140762 │
48. │ ceph/ceph │ 138518 │
49. │ CleverRaven/Cataclysm-DDA │ 138073 │
50. │ openshift/console │ 135628 │
└──────────────────────────────────────┴─────────┘
50 rows in set. Elapsed: 0.312 sec. Processed 218.47 million rows, 1.71 GB (700.55 million rows/s., 5.47 GB/s.)
The proportion between issue comments and issues:
SELECT
    repo_name,
    count() AS comments,
    uniq(number) AS issues,
    round(comments / issues, 2) AS ratio
FROM github_events
WHERE event_type = 'IssueCommentEvent'
GROUP BY repo_name
ORDER BY count() DESC
LIMIT 50
┌─repo_name────────────────────────────┬─comments─┬─issues─┬─ratio─┐
1. │ kubernetes/kubernetes │ 1450081 │ 85379 │ 16.98 │
2. │ apache/spark │ 790480 │ 26868 │ 29.42 │
3. │ rust-lang/rust │ 502960 │ 58464 │ 8.6 │
4. │ cms-sw/cmssw │ 478607 │ 25416 │ 18.83 │
5. │ google-test/signcla-probe-repo │ 477302 │ 353981 │ 1.35 │
6. │ openshift/origin │ 445639 │ 24421 │ 18.25 │
7. │ brianchandotcom/liferay-portal │ 392474 │ 75732 │ 5.18 │
8. │ NixOS/nixpkgs │ 349733 │ 75080 │ 4.66 │
9. │ elastic/kibana │ 330168 │ 75118 │ 4.4 │
10. │ ansible/ansible │ 312424 │ 56428 │ 5.54 │
11. │ everypolitician/everypolitician-data │ 293714 │ 161780 │ 1.82 │
12. │ flutter/flutter │ 291260 │ 62236 │ 4.68 │
13. │ owncloud/core │ 283295 │ 26167 │ 10.83 │
14. │ istio/istio │ 272426 │ 26903 │ 10.13 │
15. │ Microsoft/vscode │ 267468 │ 64724 │ 4.13 │
16. │ nodejs/node │ 258525 │ 33085 │ 7.81 │
17. │ tensorflow/tensorflow │ 255957 │ 39964 │ 6.4 │
18. │ MicrosoftDocs/azure-docs │ 245829 │ 64346 │ 3.82 │
19. │ JuliaLang/julia │ 229092 │ 24823 │ 9.23 │
20. │ golang/go │ 222317 │ 33491 │ 6.64 │
21. │ cockroachdb/cockroach │ 222305 │ 49166 │ 4.52 │
22. │ tgstation/tgstation │ 221465 │ 29711 │ 7.45 │
23. │ joomla/joomla-cms │ 218215 │ 25276 │ 8.63 │
24. │ servo/servo │ 213924 │ 21934 │ 9.75 │
25. │ dotnet/corefx │ 205892 │ 34315 │ 6 │
26. │ angular/angular │ 200608 │ 38680 │ 5.19 │
27. │ magento/magento2 │ 198987 │ 29518 │ 6.74 │
28. │ saltstack/salt │ 196898 │ 30613 │ 6.43 │
29. │ docker/docker │ 194622 │ 23236 │ 8.38 │
30. │ facebook/react-native │ 188756 │ 28972 │ 6.52 │
31. │ dotnet/roslyn │ 187150 │ 37405 │ 5 │
32. │ apple/swift │ 175983 │ 32754 │ 5.37 │
33. │ elastic/elasticsearch │ 175442 │ 51588 │ 3.4 │
34. │ symfony/symfony │ 171438 │ 25420 │ 6.74 │
35. │ rails/rails │ 170580 │ 22179 │ 7.69 │
36. │ godotengine/godot │ 168950 │ 35601 │ 4.75 │
37. │ Automattic/wp-calypso │ 165136 │ 43411 │ 3.8 │
38. │ openshift/release │ 164215 │ 13736 │ 11.96 │
39. │ DefinitelyTyped/DefinitelyTyped │ 164071 │ 39051 │ 4.2 │
40. │ odoo/odoo │ 163858 │ 49072 │ 3.34 │
41. │ edx/edx-platform │ 157913 │ 18112 │ 8.72 │
42. │ pytorch/pytorch │ 151349 │ 38739 │ 3.91 │
43. │ microsoft/vscode │ 151038 │ 39050 │ 3.87 │
44. │ bitcoin/bitcoin │ 146471 │ 14580 │ 10.05 │
45. │ MarlinFirmware/Marlin │ 146463 │ 14891 │ 9.84 │
46. │ ManageIQ/manageiq │ 145739 │ 19212 │ 7.59 │
47. │ scikit-learn/scikit-learn │ 140762 │ 14127 │ 9.96 │
48. │ ceph/ceph │ 138518 │ 29380 │ 4.71 │
49. │ CleverRaven/Cataclysm-DDA │ 138073 │ 22839 │ 6.05 │
50. │ openshift/console │ 135628 │ 7204 │ 18.83 │
└──────────────────────────────────────┴──────────┴────────┴───────┘
50 rows in set. Elapsed: 0.519 sec. Processed 218.47 million rows, 2.58 GB (420.80 million rows/s., 4.97 GB/s.)
Spark has the most active discussions among the top repositories, with almost 30 comments per issue. In contrast, the repository with the fewest comments per issue, "google-test/signcla-probe-repo", is clearly not for discussions.
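The table above is ordered by the total number of comments, so it only compares the ratio among the busiest repositories. To rank by the ratio itself, a threshold on the number of issues is needed, otherwise tiny repositories with one long thread will dominate. A sketch, where the threshold of 1000 issues is an arbitrary assumption:

```sql
-- Repositories ranked by comments per issue.
-- The threshold of 1000 issues is an arbitrary choice to skip small repositories.
SELECT
    repo_name,
    count() AS comments,
    uniq(number) AS issues,
    round(comments / issues, 2) AS ratio
FROM github_events
WHERE event_type = 'IssueCommentEvent'
GROUP BY repo_name
HAVING issues >= 1000
ORDER BY ratio DESC
LIMIT 50
```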
Now let's find the most active issues...
SELECT
    repo_name,
    number,
    count() AS comments
FROM github_events
WHERE (event_type = 'IssueCommentEvent') AND (action = 'created')
GROUP BY
    repo_name,
    number
ORDER BY count() DESC
LIMIT 50
┌─repo_name───────────────────────────────────────────────────┬─number─┬─comments─┐
1. │ sauron-demo/sauron │ 1 │ 21297 │
2. │ gafusion/regression_notifications │ 1 │ 15677 │
3. │ odoo-mergebot-testing-org/repo │ 1 │ 13315 │
4. │ zeit-github-test/github-utils-test │ 1 │ 9260 │
5. │ odoo-mergebot-testing-org/proj │ 1 │ 7681 │
6. │ tphongio/elasticbox-plugin │ 1 │ 6167 │
7. │ openshift/oc │ 270 │ 5917 │
8. │ openfoodfacts/openfoodfacts-server │ 3767 │ 5098 │
9. │ codecov/ci-repo │ 4 │ 5078 │
10. │ openshift/cluster-resource-override-admission-operator │ 29 │ 4562 │
11. │ odoo-mergebot-testing-org/proj │ 2 │ 4296 │
12. │ ingvagabund/kubernetes │ 58 │ 4136 │
13. │ garethjevans/jenkins-cwp-quickstart01 │ 1 │ 4122 │
14. │ Kitware/CDash │ 80 │ 4005 │
15. │ odoo-mergebot-testing-org/proj │ 3 │ 3767 │
16. │ cockpit-project/cockpit │ 7636 │ 3330 │
17. │ fuszenecker/CSharpDemo │ 5 │ 3329 │
18. │ openshift/tektoncd-pipeline-operator │ 494 │ 3303 │
19. │ xbrianlee/liferay-portal │ 369 │ 3201 │
20. │ D00Med/farlands │ 77 │ 3183 │
21. │ getlantern/forum │ 313 │ 3123 │
22. │ openshift/origin │ 18826 │ 3067 │
23. │ githubschool/open-enrollment-classes-introduction-to-github │ 927 │ 3005 │
24. │ gitalk/gitalk │ 1 │ 2783 │
25. │ MarshalX/yandex-music-api │ 339 │ 2712 │
26. │ MicrosoftDocs/E2E_MicrosoftDocs_Dynamic │ 1 │ 2675 │
27. │ OpenKore/openkore │ 628 │ 2574 │
28. │ MR-M3/Idksomething │ 1 │ 2557 │
29. │ D00Med/farlands │ 108 │ 2554 │
30. │ theapache64/movie-monk-commenter │ 1 │ 2546 │
31. │ openshift/ovn-kubernetes │ 159 │ 2534 │
32. │ openshift/template-service-broker-operator │ 62 │ 2533 │
33. │ sofae/pyscript │ 1 │ 2518 │
34. │ jaecSolutions/Unito-dev │ 10 │ 2512 │
35. │ oskosk/node-wms-client │ 2 │ 2510 │
36. │ Nexus-Mods/Vortex │ 6634 │ 2509 │
37. │ didiladi/mac-build-test │ 8 │ 2504 │
38. │ alistairjcbrown/mock-repo │ 7 │ 2504 │
39. │ jaecSolutions/Unito-dev │ 6 │ 2504 │
40. │ jaecSolutions/Unito-dev │ 9 │ 2502 │
41. │ openfoodfacts/openfoodfacts-server │ 4410 │ 2502 │
42. │ mesosphere-mergebot/mergebot-test-dcos │ 309 │ 2500 │
43. │ threejsworker/threejsworker │ 12 │ 2500 │
44. │ openshift/odo │ 2346 │ 2500 │
45. │ cockpit-project/cockpit │ 3455 │ 2499 │
46. │ jlord/patchwork │ 6762 │ 2499 │
47. │ cockpit-project/cockpit │ 7635 │ 2499 │
48. │ googleapis/google-cloud-go │ 3111 │ 2499 │
49. │ jlord/patchwork │ 4542 │ 2499 │
50. │ PennyDreadfulMTG/perf-reports │ 52349 │ 2498 │
└─────────────────────────────────────────────────────────────┴────────┴──────────┘
50 rows in set. Elapsed: 3.082 sec. Processed 218.47 million rows, 2.77 GB (70.88 million rows/s., 899.91 MB/s.)
The top entries have most of their comments on issue number 1. These look like technical comments made by some script. Nothing interesting here, and I actually could not find these comments on the GitHub website (maybe they have already been deleted).
I was hoping to find some "epic bugs". Let's filter by issue number and skip the first ten issues of every repository:
SELECT
    repo_name,
    number,
    count() AS comments
FROM github_events
WHERE (event_type = 'IssueCommentEvent') AND (action = 'created') AND (number > 10)
GROUP BY
    repo_name,
    number
ORDER BY count() DESC
LIMIT 50
┌─repo_name───────────────────────────────────────────────────┬─number─┬─comments─┐
1. │ openshift/oc │ 270 │ 5917 │
2. │ openfoodfacts/openfoodfacts-server │ 3767 │ 5098 │
3. │ openshift/cluster-resource-override-admission-operator │ 29 │ 4562 │
4. │ ingvagabund/kubernetes │ 58 │ 4136 │
5. │ Kitware/CDash │ 80 │ 4005 │
6. │ cockpit-project/cockpit │ 7636 │ 3330 │
7. │ openshift/tektoncd-pipeline-operator │ 494 │ 3303 │
8. │ xbrianlee/liferay-portal │ 369 │ 3201 │
9. │ D00Med/farlands │ 77 │ 3183 │
10. │ getlantern/forum │ 313 │ 3123 │
11. │ openshift/origin │ 18826 │ 3067 │
12. │ githubschool/open-enrollment-classes-introduction-to-github │ 927 │ 3005 │
13. │ MarshalX/yandex-music-api │ 339 │ 2712 │
14. │ OpenKore/openkore │ 628 │ 2574 │
15. │ D00Med/farlands │ 108 │ 2554 │
16. │ openshift/ovn-kubernetes │ 159 │ 2534 │
17. │ openshift/template-service-broker-operator │ 62 │ 2533 │
18. │ Nexus-Mods/Vortex │ 6634 │ 2509 │
19. │ openfoodfacts/openfoodfacts-server │ 4410 │ 2502 │
20. │ mesosphere-mergebot/mergebot-test-dcos │ 309 │ 2500 │
21. │ openshift/odo │ 2346 │ 2500 │
22. │ threejsworker/threejsworker │ 12 │ 2500 │
23. │ cockpit-project/cockpit │ 3455 │ 2499 │
24. │ jlord/patchwork │ 4542 │ 2499 │
25. │ jlord/patchwork │ 6762 │ 2499 │
26. │ cockpit-project/cockpit │ 7635 │ 2499 │
27. │ googleapis/google-cloud-go │ 3111 │ 2499 │
28. │ PennyDreadfulMTG/perf-reports │ 52349 │ 2498 │
29. │ PennyDreadfulMTG/perf-reports │ 50472 │ 2497 │
30. │ jlord/patchwork │ 8914 │ 2496 │
31. │ wildfly/wildfly │ 8947 │ 2495 │
32. │ PennyDreadfulMTG/perf-reports │ 53234 │ 2494 │
33. │ allencloud/daoker │ 48 │ 2491 │
34. │ RedisDesktop/rdm-debug-symbols │ 63 │ 2491 │
35. │ PennyDreadfulMTG/perf-reports │ 50762 │ 2491 │
36. │ PennyDreadfulMTG/perf-reports │ 50189 │ 2488 │
37. │ tannakartikey/currents │ 29 │ 2487 │
38. │ kubernetes/kubernetes │ 33388 │ 2485 │
39. │ ros-infrastructure/roswiki │ 139 │ 2483 │
40. │ puneet-tm/wirelessone-support │ 147 │ 2483 │
41. │ tannakartikey/currents │ 25 │ 2482 │
42. │ sauron-demo/sauron-demo │ 190 │ 2482 │
43. │ tannakartikey/currents │ 33 │ 2482 │
44. │ vitech-team/mood-feed-frontend │ 50 │ 2480 │
45. │ tannakartikey/currents │ 31 │ 2479 │
46. │ Zeroshi/Docs │ 1370 │ 2475 │
47. │ puneet-tm/wirelessone-support │ 137 │ 2473 │
48. │ PennyDreadfulMTG/perf-reports │ 53515 │ 2468 │
49. │ PennyDreadfulMTG/perf-reports │ 42826 │ 2444 │
50. │ yegor256/cactoos │ 486 │ 2426 │
└─────────────────────────────────────────────────────────────┴────────┴──────────┘
50 rows in set. Elapsed: 2.375 sec. Processed 218.47 million rows, 2.77 GB (92.00 million rows/s., 1.17 GB/s.)
I checked the first one, and it also looks like a script that went out of control.
Let's also count the number of comment authors and add a threshold:
SELECT
repo_name,
number,
count() AS comments,
uniq(actor_login) AS authors
FROM github_events
WHERE (event_type = 'IssueCommentEvent') AND (action = 'created') AND (number > 10)
GROUP BY
repo_name,
number
HAVING authors >= 10
ORDER BY count() DESC
LIMIT 50
┌─repo_name───────────────────────────────────────────────────┬─number─┬─comments─┬─authors─┐
1. │ getlantern/forum │ 313 │ 3123 │ 1658 │
2. │ githubschool/open-enrollment-classes-introduction-to-github │ 927 │ 3005 │ 1362 │
3. │ OpenKore/openkore │ 628 │ 2574 │ 373 │
4. │ D00Med/farlands │ 108 │ 2554 │ 11 │
5. │ ros-infrastructure/roswiki │ 139 │ 2483 │ 1427 │
6. │ Kodi-vStream/venom-xbmc-addons │ 2908 │ 2374 │ 12 │
7. │ gcp/leela-zero │ 78 │ 1978 │ 97 │
8. │ andreiw/RaspberryPiPkg │ 12 │ 1940 │ 137 │
9. │ ValveSoftware/halflife │ 387 │ 1716 │ 60 │
10. │ MarlinFirmware/Marlin │ 7076 │ 1695 │ 120 │
11. │ OpenKore/openkore │ 460 │ 1643 │ 239 │
12. │ finndev/PokeBuddy │ 227 │ 1546 │ 119 │
13. │ iiordanov/remote-desktop-clients │ 39 │ 1520 │ 1053 │
14. │ 996icu/996.ICU │ 20 │ 1487 │ 1142 │
15. │ Apostolique/Agar.io-bot │ 380 │ 1482 │ 108 │
16. │ OpenKore/openkore │ 221 │ 1437 │ 323 │
17. │ npm/registry │ 255 │ 1409 │ 420 │
18. │ kubernetes/kubernetes │ 46254 │ 1345 │ 11 │
19. │ BeepIsla/csgo-commend-bot │ 428 │ 1304 │ 114 │
20. │ OPS-E2E-PPE/E2E_DocsBranch_Dynamic │ 11358 │ 1293 │ 11 │
21. │ istio/istio │ 12276 │ 1269 │ 11 │
22. │ reactjs/rfcs │ 68 │ 1262 │ 332 │
23. │ SickChill/SickChill │ 5185 │ 1199 │ 37 │
24. │ kubernetes/kubernetes │ 36895 │ 1198 │ 11 │
25. │ prusa3d/Prusa-Firmware │ 602 │ 1195 │ 168 │
26. │ ValveSoftware/Proton │ 3654 │ 1183 │ 197 │
27. │ kubernetes/kubernetes │ 32214 │ 1155 │ 10 │
28. │ ValveSoftware/Proton │ 37 │ 1139 │ 253 │
29. │ kubernetes/kubernetes │ 91824 │ 1118 │ 12 │
30. │ MiCode/patchrom │ 130 │ 1109 │ 14 │
31. │ kubernetes/kubernetes │ 91592 │ 1075 │ 12 │
32. │ ros-infrastructure/roswiki │ 258 │ 1075 │ 693 │
33. │ isaacs/github │ 18 │ 1068 │ 998 │
34. │ mtdhb/mtdhb │ 101 │ 1063 │ 421 │
35. │ ant-design/ant-design │ 13848 │ 1051 │ 577 │
36. │ openshift/cluster-logging-operator │ 449 │ 1033 │ 10 │
37. │ uku/Unblock-Youku │ 618 │ 990 │ 144 │
38. │ intel/haxm │ 149 │ 961 │ 10 │
39. │ kubernetes/kubernetes │ 50457 │ 948 │ 10 │
40. │ acemod/ACE3 │ 3594 │ 945 │ 294 │
41. │ openshift/installer │ 2745 │ 934 │ 11 │
42. │ Koenkk/zigbee2mqtt │ 1429 │ 911 │ 136 │
43. │ XX-net/XX-Net │ 1977 │ 911 │ 195 │
44. │ openshift/release │ 4340 │ 907 │ 10 │
45. │ rust-lang/rust │ 65590 │ 897 │ 10 │
46. │ ValveSoftware/Proton │ 175 │ 878 │ 184 │
47. │ ValveSoftware/Proton │ 3291 │ 863 │ 167 │
48. │ kubernetes/kubernetes │ 27113 │ 830 │ 12 │
49. │ monero-project/meta │ 316 │ 829 │ 55 │
50. │ golang/go │ 32437 │ 828 │ 210 │
└─────────────────────────────────────────────────────────────┴────────┴──────────┴─────────┘
50 rows in set. Elapsed: 3.008 sec. Processed 218.47 million rows, 3.82 GB (72.63 million rows/s., 1.27 GB/s.)
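Several threads in the table above still pair more than a thousand comments with only 10 to 12 authors, which hints at automation. A minimal sketch of one way to push those down, assuming that an average of more than 20 comments per author marks a script-driven thread (the cutoff of 20 and the `comments_per_author` alias are introduced here for illustration and are not part of the original analysis):

```sql
-- Sketch only: the cutoff of 20 comments per author is an assumed, arbitrary threshold.
SELECT
    repo_name,
    number,
    count() AS comments,
    uniq(actor_login) AS authors,
    round(comments / authors, 1) AS comments_per_author
FROM github_events
WHERE (event_type = 'IssueCommentEvent') AND (action = 'created') AND (number > 10)
GROUP BY
    repo_name,
    number
HAVING (authors >= 10) AND (comments_per_author <= 20)
ORDER BY comments DESC
LIMIT 50
```

Loosening or tightening the cutoff trades missed bot threads against dropped human discussions, so it is worth trying a few values.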
I found the gem.
SELECT
    concat('https://github.com/', repo_name, '/issues/', toString(number)) AS URL,
    max(comments),
    argMax(authors, comments) AS authors,
    argMax(number, comments) AS number,
    sum(stars) AS stars
FROM
(
    SELECT *
    FROM
    (
        SELECT
            repo_name,
            number,
            count() AS comments,
            uniq(actor_login) AS authors
        FROM github_events
        WHERE (event_type = 'IssueCommentEvent') AND (action = 'created') AND (number > 10)
        GROUP BY
            repo_name,
            number
        HAVING authors >= 10
    ) AS t1
    INNER JOIN
    (
        SELECT
            repo_name,
            count() AS stars
        FROM github_events
        WHERE event_type = 'WatchEvent'
        GROUP BY repo_name
        HAVING stars > 10000
    ) AS t2 USING (repo_name)
)
GROUP BY repo_name
ORDER BY stars DESC
LIMIT 50
┌─URL───────────────────────────────────────────────────────────┬─max(comments)─┬─authors─┬─number─┬─────stars─┐
1. │ https://github.com/tensorflow/tensorflow/issues/22 │ 632 │ 156 │ 22 │ 221964318 │
2. │ https://github.com/kubernetes/kubernetes/issues/46254 │ 1345 │ 11 │ 46254 │ 209981996 │
3. │ https://github.com/facebook/react-native/issues/4968 │ 512 │ 305 │ 4968 │ 205108662 │
4. │ https://github.com/flutter/flutter/issues/51752 │ 422 │ 33 │ 51752 │ 202483523 │
5. │ https://github.com/angular/angular/issues/16477 │ 292 │ 101 │ 16477 │ 86727752 │
6. │ https://github.com/rust-lang/rust/issues/65590 │ 897 │ 10 │ 65590 │ 86062821 │
7. │ https://github.com/golang/go/issues/32437 │ 828 │ 210 │ 32437 │ 80301683 │
8. │ https://github.com/facebook/react/issues/13991 │ 301 │ 223 │ 13991 │ 78635775 │
9. │ https://github.com/Microsoft/vscode/issues/224 │ 418 │ 151 │ 224 │ 74084829 │
10. │ https://github.com/nodejs/node/issues/5020 │ 471 │ 79 │ 5020 │ 73816506 │
11. │ https://github.com/996icu/996.ICU/issues/20 │ 1487 │ 1142 │ 20 │ 68131200 │
12. │ https://github.com/FortAwesome/Font-Awesome/issues/7574 │ 410 │ 328 │ 7574 │ 60891048 │
13. │ https://github.com/atom/atom/issues/2956 │ 238 │ 95 │ 2956 │ 45072964 │
14. │ https://github.com/docker/docker/issues/9176 │ 379 │ 77 │ 9176 │ 31559220 │
15. │ https://github.com/vuejs/vue/issues/2873 │ 214 │ 63 │ 2873 │ 29759919 │
16. │ https://github.com/Microsoft/TypeScript/issues/202 │ 272 │ 64 │ 202 │ 27943152 │
17. │ https://github.com/angular/angular-cli/issues/5618 │ 426 │ 177 │ 5618 │ 26708608 │
18. │ https://github.com/ansible/ansible/issues/13262 │ 164 │ 94 │ 13262 │ 26339160 │
19. │ https://github.com/electron/electron/issues/673 │ 211 │ 52 │ 673 │ 26130204 │
20. │ https://github.com/gatsbyjs/gatsby/issues/22991 │ 173 │ 45 │ 22991 │ 25123843 │
21. │ https://github.com/webpack/webpack/issues/9802 │ 508 │ 114 │ 9802 │ 24479445 │
22. │ https://github.com/bitcoin/bitcoin/issues/6312 │ 175 │ 18 │ 6312 │ 23926116 │
23. │ https://github.com/FreeCodeCamp/FreeCodeCamp/issues/8418 │ 101 │ 17 │ 8418 │ 21872530 │
24. │ https://github.com/tensorflow/models/issues/9033 │ 126 │ 25 │ 9033 │ 19628766 │
25. │ https://github.com/twbs/bootstrap/issues/21943 │ 134 │ 112 │ 21943 │ 19461134 │
26. │ https://github.com/godotengine/godot/issues/16863 │ 372 │ 78 │ 16863 │ 19275994 │
27. │ https://github.com/JuliaLang/julia/issues/11004 │ 317 │ 28 │ 11004 │ 18860560 │
28. │ https://github.com/ant-design/ant-design/issues/13848 │ 1051 │ 577 │ 13848 │ 18388864 │
29. │ https://github.com/meteor/meteor/issues/6960 │ 302 │ 99 │ 6960 │ 17604270 │
30. │ https://github.com/XX-net/XX-Net/issues/1977 │ 911 │ 195 │ 1977 │ 17480000 │
31. │ https://github.com/grafana/grafana/issues/2209 │ 299 │ 136 │ 2209 │ 16441740 │
32. │ https://github.com/microsoft/vscode/issues/8017 │ 235 │ 93 │ 8017 │ 16433060 │
33. │ https://github.com/yarnpkg/yarn/issues/2629 │ 170 │ 72 │ 2629 │ 16398114 │
34. │ https://github.com/hashicorp/terraform/issues/1604 │ 165 │ 75 │ 1604 │ 16155832 │
35. │ https://github.com/pytorch/pytorch/issues/494 │ 787 │ 126 │ 494 │ 15851259 │
36. │ https://github.com/home-assistant/home-assistant/issues/20795 │ 378 │ 57 │ 20795 │ 15672006 │
37. │ https://github.com/rails/rails/issues/505 │ 135 │ 20 │ 505 │ 15335320 │
38. │ https://github.com/zeit/next.js/issues/9524 │ 259 │ 89 │ 9524 │ 14943456 │
39. │ https://github.com/magento/magento2/issues/24426 │ 338 │ 17 │ 24426 │ 12349388 │
40. │ https://github.com/elastic/elasticsearch/issues/4915 │ 199 │ 156 │ 4915 │ 11958450 │
41. │ https://github.com/freeCodeCamp/freeCodeCamp/issues/16358 │ 73 │ 60 │ 16358 │ 11202160 │
42. │ https://github.com/facebook/jest/issues/2441 │ 249 │ 153 │ 2441 │ 11140092 │
43. │ https://github.com/facebook/create-react-app/issues/8465 │ 220 │ 146 │ 8465 │ 11013780 │
44. │ https://github.com/dotnet/corefx/issues/23177 │ 271 │ 13 │ 23177 │ 10454510 │
45. │ https://github.com/RocketChat/Rocket.Chat/issues/1112 │ 235 │ 101 │ 1112 │ 10395488 │
46. │ https://github.com/driftyco/ionic/issues/6776 │ 164 │ 50 │ 6776 │ 9308040 │
47. │ https://github.com/laravel/framework/issues/8172 │ 190 │ 37 │ 8172 │ 8987535 │
48. │ https://github.com/getlantern/forum/issues/313 │ 3123 │ 1658 │ 313 │ 8954816 │
49. │ https://github.com/scikit-learn/scikit-learn/issues/9012 │ 233 │ 17 │ 9012 │ 8903682 │
50. │ https://github.com/moby/moby/issues/25526 │ 240 │ 96 │ 25526 │ 8752005 │
└───────────────────────────────────────────────────────────────┴───────────────┴─────────┴────────┴───────────┘
50 rows in set. Elapsed: 3.629 sec. Processed 450.59 million rows, 5.63 GB (124.15 million rows/s., 1.55 GB/s.)
For every repository with more than 10 000 stars, this query picks the issue with the most comments. I made it for you, and I hope you will find the most crucial discussions here. Enjoy!
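Because the join yields one row per qualifying issue, `sum(stars)` in the query above adds the repository's star count once for every such issue, so the last column is much larger than the real number of stars. A hedged variant, assuming the intent is to rank by the repository's own star count, takes `any(stars)` instead. The aliases `max_comments`, `top_authors`, `top_number` and `repo_stars` are names introduced here to avoid reusing the inner column names:

```sql
-- Sketch only: same logic as the query above, but each repository's stars are counted once.
SELECT
    concat('https://github.com/', repo_name, '/issues/', toString(top_number)) AS URL,
    max(comments) AS max_comments,
    argMax(authors, comments) AS top_authors,
    argMax(number, comments) AS top_number,
    any(stars) AS repo_stars   -- stars is identical on every row of a repository
FROM
(
    SELECT
        repo_name,
        number,
        count() AS comments,
        uniq(actor_login) AS authors
    FROM github_events
    WHERE (event_type = 'IssueCommentEvent') AND (action = 'created') AND (number > 10)
    GROUP BY
        repo_name,
        number
    HAVING authors >= 10
) AS t1
INNER JOIN
(
    SELECT
        repo_name,
        count() AS stars
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY repo_name
    HAVING stars > 10000
) AS t2 USING (repo_name)
GROUP BY repo_name
ORDER BY repo_stars DESC
LIMIT 50
```

The ordering of the result will differ from the table above, since repositories with many heavily commented issues no longer get an inflated total.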
A single commit rarely has comments.
Here are the top repositories by the number of commit comments:
SELECT
repo_name,
count() AS comments,
uniq(actor_login) AS authors
FROM github_events
WHERE event_type = 'CommitCommentEvent'
GROUP BY repo_name
ORDER BY count() DESC
LIMIT 50
┌─repo_name────────────────────────────────────────────────────┬─comments─┬─authors─┐
1. │ dcos/dcos │ 99251 │ 16 │
2. │ NREL/EnergyPlus │ 74922 │ 41 │
3. │ miabot/galleries.csv │ 52634 │ 2 │
4. │ siggetest/githubtest │ 52171 │ 1 │
5. │ bambootest-bot/githubtest │ 46876 │ 1 │
6. │ mozilla/rust │ 33708 │ 82 │
7. │ TrinityCore/TrinityCore │ 24386 │ 1550 │
8. │ kubernetes/kubernetes │ 21406 │ 268 │
9. │ xamarin/xamarin-macios │ 18893 │ 28 │
10. │ w4ctech/front-end-rss │ 17979 │ 1 │
11. │ rust-lang/rust │ 16160 │ 238 │
12. │ zeit-github-test/github-e2e-tests-dev │ 15344 │ 3 │
13. │ JuliaLang/julia │ 14932 │ 262 │
14. │ NREL/OpenStudio │ 14375 │ 32 │
15. │ bvcms/bvcms │ 13748 │ 12 │
16. │ mozilla-mobile/android-components │ 13675 │ 10 │
17. │ NixOS/nixpkgs │ 12854 │ 672 │
18. │ mangosR2/mangos │ 10585 │ 194 │
19. │ servo/servo │ 10283 │ 76 │
20. │ jirikuncar/invenio │ 9687 │ 17 │
21. │ rails/rails │ 9514 │ 2383 │
22. │ department-of-veterans-affairs/va.gov-cms │ 9211 │ 10 │
23. │ Wikia/app │ 9118 │ 95 │
24. │ coala-analyzer/coala │ 8878 │ 30 │
25. │ discourse/discourse │ 8625 │ 331 │
26. │ PaddlePaddle/Paddle │ 8623 │ 30 │
27. │ zeit-github-test/github-e2e-tests-dev-alias-v2-without-alias │ 8517 │ 3 │
28. │ zeit-github-test/github-e2e-tests-dev-alias-v2-with-alias │ 8504 │ 3 │
29. │ JetBrains/kotlin │ 8488 │ 153 │
30. │ zeit-github-test/github-e2e-test-dev-forked-repo │ 8364 │ 3 │
31. │ pantheon-systems/documentation │ 8001 │ 38 │
32. │ zeit-github-test/github-e2e-test-dev-domain-issues │ 7704 │ 3 │
33. │ theowenyoung/theowenyoung.github.io │ 7610 │ 3 │
34. │ odoo-dev/odoo │ 7528 │ 210 │
35. │ zeit-github-test/github-utils-test │ 7477 │ 1 │
36. │ mono/MonoGame │ 7394 │ 64 │
37. │ Hack23/cia │ 7099 │ 3 │
38. │ stellar/stellar-core │ 7098 │ 26 │
39. │ mozilla/servo │ 6962 │ 26 │
40. │ xbmc/xbmc │ 6836 │ 416 │
41. │ stdlib-js/stdlib │ 6782 │ 5 │
42. │ MyEtherWallet/MyEtherWallet │ 6583 │ 20 │
43. │ cle-event-calendar/cle-event-calendar.github.io │ 6104 │ 1 │
44. │ advancedtelematic/rvi_sota_server │ 6028 │ 6 │
45. │ ceph/ceph │ 5804 │ 208 │
46. │ mono/CppSharp │ 5790 │ 20 │
47. │ pachoclo/corona-tracker │ 5443 │ 3 │
48. │ xwiki/xwiki-platform │ 5144 │ 43 │
49. │ SthephanShinkufag/Dollchan-Extension-Tools │ 4977 │ 33 │
50. │ netty/netty │ 4759 │ 156 │
└──────────────────────────────────────────────────────────────┴──────────┴─────────┘
50 rows in set. Elapsed: 0.141 sec. Processed 9.96 million rows, 154.01 MB (70.62 million rows/s., 1.09 GB/s.)
If there are many comments but only a few comment authors, the comments usually come from a CI bot.
Here are the "hottest" commits:
SELECT
concat('https://github.com/', repo_name, '/commit/', commit_id) AS URL,
count() AS comments,
uniq(actor_login) AS authors
FROM github_events
WHERE (event_type = 'CommitCommentEvent') AND notEmpty(commit_id)
GROUP BY
repo_name,
commit_id
HAVING authors >= 10
ORDER BY count() DESC
LIMIT 50
┌─URL─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─comments─┬─authors─┐
1. │ https://github.com/nixxquality/WebMConverter/commit/c1ac0baac06fa7175677a4a1bf65860a84708d67 │ 467 │ 156 │
2. │ https://github.com/torvalds/linux/commit/8a104f8b5867c682d994ffa7a74093c54469c11f │ 410 │ 181 │
3. │ https://github.com/SEI-ATL/select-best-player/commit/fdc6e0e073fa30acf2041ca0b1c9fecd662d88c8 │ 408 │ 31 │
4. │ https://github.com/goagent/goagent/commit/e492ed0283f5cde7cf71d7ac47429f64aa48cd13 │ 378 │ 346 │
5. │ https://github.com/apple/swift/commit/18844bc65229786b96b89a9fc7739c0fc897905e │ 345 │ 331 │
6. │ https://github.com/shadowsocks/shadowsocks/commit/938bba32a4008bdde9c064dda6a0597987ddef54 │ 309 │ 258 │
7. │ https://github.com/mosh-hamedani/vidly-mvc-5/commit/b727a26e1b4b88abe84a8b48208fec537db2ed43 │ 283 │ 157 │
8. │ https://github.com/hklcf/myTV-api/commit/375fa7520455b53fad622c0149a473ebee048e15 │ 272 │ 26 │
9. │ https://github.com/NecronomiconCoding/NecroBot/commit/26e57d9feac57ab372ac466ef6cd66218faa475c │ 262 │ 75 │
10. │ https://github.com/zhaohmng/-21-/commit/0cc2b2a546ad51fb360ba800c67b057ea3270869 │ 249 │ 187 │
11. │ https://github.com/mlouielu/cn_constitution_2018/commit/646c76a573ad49414e708c091393ddb7c437f286 │ 245 │ 181 │
12. │ https://github.com/kohsuke/sandbox-ant/commit/8ae38db0ea5837313ab5f39d43a6f73de3bd9000 │ 233 │ 65 │
13. │ https://github.com/iride2020/iRide-Token/commit/8f810b3700a24de0ee4f945ef84be7b2f1605610 │ 228 │ 183 │
14. │ https://github.com/MrMEEE/bumblebee-Old-and-abbandoned/commit/a047be85247755cdbe0acce6f1dafc8beb84f2ac │ 227 │ 205 │
15. │ https://github.com/yuuwill/1024app-android/commit/0b02f1815686fa657d7b4a9583019fe4d203b80a │ 217 │ 142 │
16. │ https://github.com/3dshax/ctr/commit/bcb3734b9a26d0fe7ef66f7d3814406fee797303 │ 201 │ 137 │
17. │ https://github.com/rails/rails/commit/b83965785db1eec019edf1fc272b1aa393e6dc57 │ 200 │ 123 │
18. │ https://github.com/tmux/tmux/commit/d2b35e19cdd61d163d26c4babccc1550e72a9623 │ 196 │ 188 │
19. │ https://github.com/mosh-hamedani/vidly-mvc-5/commit/93f39efe3ef0ce9c25ef09ebef60ad2dfc1fe7f3 │ 191 │ 107 │
20. │ https://github.com/mosh-hamedani/vidly-mvc-5/commit/acb8c0a0a27255e3f0af85ab2f687e0e9b82b6db │ 165 │ 95 │
21. │ https://github.com/git/git/commit/e83c5163316f89bfbde7d9ab23ca2e25604af290 │ 163 │ 151 │
22. │ https://github.com/easylist/easylist/commit/a4d380ad1a3b33a0fab679a1a8c5a791321622b3 │ 159 │ 68 │
23. │ https://github.com/udacity/ud843-QuakeReport/commit/14541da929b771249ea8209698d324b61bbeee7e │ 153 │ 89 │
24. │ https://github.com/mikeozornin/constitution-of-russia/commit/facbe841eab3f434bd03af3627d4475ea4a671cd │ 138 │ 60 │
25. │ https://github.com/lianshang/code-review-fe/commit/552d837b334b32130a50db389e5ceee1a66f0f37 │ 133 │ 11 │
26. │ https://github.com/BETAFPV/opentx/commit/49a748044b6fd538b581db3bab0c127f6c7959bc │ 128 │ 10 │
27. │ https://github.com/octocat/Spoon-Knife/commit/d0dd1f61b33d64e29d8bc1372a94ef6a2fee76a9 │ 126 │ 93 │
28. │ https://github.com/Unad-BDBasicas/Inicial_Evidencia_1_3/commit/c7ab5c04f438f0221d2beb75a868a82f40adcb56 │ 125 │ 98 │
29. │ https://github.com/Ar1i/PokemonGo-Bot/commit/e3d12abaf0c6ab022ea33c126856cac46c530f6d │ 123 │ 19 │
30. │ https://github.com/ruby/ruby/commit/6b8d4ab840b2d76d356ba30dbccfef4f5fd10767 │ 122 │ 122 │
31. │ https://github.com/tumblr/policy/commit/991c5d73775a0fa0e06e258768f91d040e27a7a5 │ 121 │ 82 │
32. │ https://github.com/ajithcj/Stockfish/commit/e8b409d740e3170aaa52d0304bab330a845e6ffe │ 120 │ 17 │
33. │ https://github.com/torvalds/linux/commit/1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 │ 120 │ 113 │
34. │ https://github.com/mosh-hamedani/vidly-mvc-5/commit/e7fb4e6973bbcf640780e44e18f3c12344460138 │ 117 │ 67 │
35. │ https://github.com/github/dmca/commit/bccf7d0dbfec423c4a967f668be47b6339d15893 │ 114 │ 78 │
36. │ https://github.com/mikegithubber/my-first-github-repository/commit/b5ed1a1bc1a737c13818847529fb21ed615e6e66 │ 109 │ 107 │
37. │ https://github.com/google/brotli/commit/c4f439dbe6007b2a37d50c20419253d5aaa8b46b │ 106 │ 53 │
38. │ https://github.com/londonappbrewery/destini-challenge-completed/commit/69ed867992fc05f13a4fbef452173a956312993d │ 103 │ 78 │
39. │ https://github.com/bkeepers/dear-github/commit/4afa490932578027462f2a8f404a38adace02f16 │ 101 │ 94 │
40. │ https://github.com/summerhoax/summerhoax.github.io/commit/446dbba4e51319fb62c55ef55bfbd12bfd92021f │ 100 │ 10 │
41. │ https://github.com/laboratorioIS/04_CV/commit/4a63b5b50251c59e74560e60054470fc9abc8d27 │ 99 │ 94 │
42. │ https://github.com/IIITSERC/SSAD_2015_A3_Group2_23/commit/404661768e73939c0dbe3fa80ca2752587158e29 │ 99 │ 10 │
43. │ https://github.com/mpv-player/mpv/commit/a20ae0417f2d1e1a2c173f5eaf66a81974df0008 │ 98 │ 25 │
44. │ https://github.com/shadowsocks/shadowsocks/commit/5b450acfaa15cd6c2d3e8ab99f9297542df74025 │ 98 │ 92 │
45. │ https://github.com/linsonder6/Tesla/commit/217ed735d4b568cd02b6a3903b305a622c14a0b1 │ 97 │ 11 │
46. │ https://github.com/PCSX2/pcsx2/commit/f81cf360bce91649a5827967dc5e73c926711611 │ 97 │ 10 │
47. │ https://github.com/mosh-hamedani/vidly-mvc-5/commit/8a852d11f4f8eb95b9f57e296fefb25bc0acd21b │ 96 │ 64 │
48. │ https://github.com/forezp/SpringcloudConfig/commit/a68876a6211369bae723348d5f8c3defe4a55e04 │ 95 │ 66 │
49. │ https://github.com/DrKLO/Telegram/commit/64e8ec3fbd26a876b7683f83a6f59c6b67316421 │ 95 │ 46 │
50. │ https://github.com/imsun/gitment/commit/cb5779f30b603b3431c2ee6e759ae6425d89797e │ 94 │ 30 │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────┴─────────┘
50 rows in set. Elapsed: 0.233 sec. Processed 9.96 million rows, 758.60 MB (42.73 million rows/s., 3.25 GB/s.)
The first entry does not load (GitHub shows the angry unicorn). The second entry is something about CoC. The commit to Swift has quite an entertaining discussion. But there are also many politically loaded discussions.
Here are the pull requests with the largest number of review comment authors:
SELECT
concat('https://github.com/', repo_name, '/pull/', toString(number)) AS URL,
uniq(actor_login) AS authors
FROM github_events
WHERE (event_type = 'PullRequestReviewCommentEvent') AND (action = 'created')
GROUP BY
repo_name,
number
ORDER BY authors DESC
LIMIT 50
┌─URL─────────────────────────────────────────────────────────────┬─authors─┐
1. │ https://github.com/torvalds/linux/pull/684 │ 66 │
2. │ https://github.com/NixOS/rfcs/pull/49 │ 52 │
3. │ https://github.com/sunpy/sunpy/pull/3391 │ 52 │
4. │ https://github.com/dashbitco/mix_phx_gen_auth_demo/pull/1 │ 51 │
5. │ https://github.com/reactjs/rfcs/pull/2 │ 44 │
6. │ https://github.com/php/php-rfcs/pull/1 │ 40 │
7. │ https://github.com/danielmiessler/SecLists/pull/155 │ 38 │
8. │ https://github.com/symfony/symfony/pull/33553 │ 35 │
9. │ https://github.com/kubernetes/community/pull/306 │ 34 │
10. │ https://github.com/rust-lang/rfcs/pull/2850 │ 33 │
11. │ https://github.com/date-fns/date-fns/pull/1671 │ 32 │
12. │ https://github.com/github/site-policy/pull/1 │ 31 │
13. │ https://github.com/tc39/ecma262/pull/1062 │ 30 │
14. │ https://github.com/996icu/996.ICU/pull/25509 │ 30 │
15. │ https://github.com/symfony/symfony/pull/11882 │ 30 │
16. │ https://github.com/dotnet/designs/pull/92 │ 30 │
17. │ https://github.com/rust-lang/rfcs/pull/517 │ 29 │
18. │ https://github.com/open-telemetry/oteps/pull/97 │ 28 │
19. │ https://github.com/matrix-org/matrix-doc/pull/1772 │ 27 │
20. │ https://github.com/scipy/scipy/pull/11061 │ 27 │
21. │ https://github.com/Samsung/tizen-docs/pull/871 │ 26 │
22. │ https://github.com/rust-lang/rfcs/pull/1566 │ 25 │
23. │ https://github.com/bitcoin/bitcoin/pull/17977 │ 25 │
24. │ https://github.com/opscode/chef-rfc/pull/21 │ 24 │
25. │ https://github.com/rust-lang/rust-www/pull/202 │ 24 │
26. │ https://github.com/babel/babel.github.io/pull/1014 │ 24 │
27. │ https://github.com/bitcoin-core/secp256k1/pull/558 │ 24 │
28. │ https://github.com/rust-lang/rfcs/pull/2395 │ 23 │
29. │ https://github.com/NordicPlayground/fw-nrfconnect-nrf/pull/1854 │ 23 │
30. │ https://github.com/nodejs/modules/pull/23 │ 23 │
31. │ https://github.com/kubernetes/enhancements/pull/686 │ 23 │
32. │ https://github.com/nodejs/node/pull/11533 │ 23 │
33. │ https://github.com/openshift/openshift-docs/pull/18254 │ 23 │
34. │ https://github.com/bitcoin/bitcoin/pull/7910 │ 23 │
35. │ https://github.com/reactjs/rfcs/pull/6 │ 22 │
36. │ https://github.com/PowerShell/PowerShell-RFC/pull/185 │ 22 │
37. │ https://github.com/symfony/symfony/pull/23315 │ 22 │
38. │ https://github.com/EpicGames/Signup/pull/10 │ 22 │
39. │ https://github.com/nodejs/node/pull/5020 │ 22 │
40. │ https://github.com/rust-lang/rfcs/pull/1105 │ 22 │
41. │ https://github.com/tgstation/tgstation/pull/27155 │ 22 │
42. │ https://github.com/kubernetes/community/pull/1629 │ 22 │
43. │ https://github.com/symfony/symfony/pull/24411 │ 22 │
44. │ https://github.com/hannahhch/github-cheatsheet/pull/4 │ 22 │
45. │ https://github.com/whatwg/html/pull/3752 │ 22 │
46. │ https://github.com/apache/kafka/pull/6295 │ 22 │
47. │ https://github.com/bids-standard/pybids/pull/308 │ 21 │
48. │ https://github.com/rust-lang/rfcs/pull/1931 │ 21 │
49. │ https://github.com/kubernetes/enhancements/pull/1367 │ 21 │
50. │ https://github.com/rust-lang/rfcs/pull/2930 │ 21 │
└─────────────────────────────────────────────────────────────────┴─────────┘
50 rows in set. Elapsed: 0.714 sec. Processed 55.94 million rows, 945.90 MB (78.35 million rows/s., 1.32 GB/s.)
The first entry is true insanity. But there are also many interesting ones, like the proposal to add inline assembly to Rust.
Next, the authors with the largest number of pushes:
SELECT
actor_login,
count() AS c,
uniq(repo_name) AS repos
FROM github_events
WHERE event_type = 'PushEvent'
GROUP BY actor_login
ORDER BY c DESC
LIMIT 50
┌─actor_login───────────────────────┬────────c─┬───repos─┐
1. │ LombiqBot │ 46195052 │ 143 │
2. │ github-actions[bot] │ 9432261 │ 85127 │
3. │ OpenLocalizationTest │ 4872353 │ 713 │
4. │ pull[bot] │ 4191714 │ 83077 │
5. │ renovate[bot] │ 4180068 │ 30544 │
6. │ direwolf-github │ 3333365 │ 1434177 │
7. │ renovate-bot │ 3184873 │ 647 │
8. │ peter-clifford │ 3139278 │ 3 │
9. │ newstools │ 2559470 │ 868 │
10. │ unitydemo2 │ 2331219 │ 19 │
11. │ tmtmtmtm │ 2313315 │ 1111 │
12. │ dependabot-preview[bot] │ 2262606 │ 42726 │
13. │ ssbattousai │ 2155448 │ 533 │
14. │ grid-bot │ 2138318 │ 42424 │
15. │ shangwoa │ 1817350 │ 6335 │
16. │ everypoliticianbot │ 1791899 │ 59 │
17. │ commit-b0t │ 1688482 │ 3 │
18. │ Dids │ 1679560 │ 162 │
19. │ KenanSulayman │ 1597624 │ 285 │
20. │ CMSQATeam │ 1526332 │ 17 │
21. │ chuan12 │ 1449121 │ 4 │
22. │ othhotro │ 1437714 │ 3 │
23. │ geos4s │ 1415217 │ 4 │
24. │ gugod │ 1343006 │ 241 │
25. │ scriptzteam │ 1314351 │ 103 │
26. │ franck-paul │ 1313061 │ 83 │
27. │ dependabot[bot] │ 1270454 │ 280190 │
28. │ nicopeters │ 1224007 │ 5 │
29. │ himobi │ 1215075 │ 1 │
30. │ shenzhouzd │ 1136430 │ 2 │
31. │ otiny │ 1123422 │ 1 │
32. │ speedtracker-bot │ 1098677 │ 413 │
33. │ CodePipeline-Test │ 1082170 │ 7 │
34. │ bossm │ 1004828 │ 3 │
35. │ willcbaker-ext │ 987090 │ 1 │
36. │ pakeji │ 970635 │ 10 │
37. │ mirror-updates │ 952345 │ 97 │
38. │ liferay-continuous-integration-hu │ 878964 │ 256 │
39. │ breakingheatmap │ 878493 │ 1 │
40. │ yath │ 838607 │ 67 │
41. │ asfgit │ 832330 │ 1647 │
42. │ openstack-gerrit │ 830559 │ 2526 │
43. │ TalibAzir │ 806098 │ 2 │
44. │ funilrys │ 787492 │ 243 │
45. │ ntbpm │ 781965 │ 7 │
46. │ Mr-Steal-Your-Script │ 778357 │ 111 │
47. │ olprod │ 777711 │ 2602 │
48. │ pbaffiliate1 │ 772611 │ 1298 │
49. │ cmsbuild │ 729008 │ 44 │
50. │ supermobiteam2 │ 723349 │ 1 │
└───────────────────────────────────┴──────────┴─────────┘
50 rows in set. Elapsed: 3.560 sec. Processed 1.60 billion rows, 20.17 GB (449.82 million rows/s., 5.67 GB/s.)
Obviously, most of them are bots. If someone has pushed to 1 434 177 repositories, it's clearly a bot. Even someone pretending to be a human is clearly a bot.
How can we filter out bots? Let's add a threshold on the number of repositories, and count only those users who have at least two issue events and gave at least two stars. Let's also restrict the search to the top 10k repositories by stars and output the favorite repository for every user.
SELECT
    actor_login,
    sum(event_type = 'PushEvent') AS c,
    uniqIf(repo_name, event_type = 'PushEvent') AS repos,
    sum(event_type = 'IssuesEvent') AS issues,
    sum(event_type = 'WatchEvent') AS stars,
    anyHeavy(repo_name)
FROM github_events
WHERE (event_type IN ('PushEvent', 'IssuesEvent', 'WatchEvent')) AND (repo_name IN
(
    SELECT repo_name
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY repo_name
    ORDER BY count() DESC
    LIMIT 10000
))
GROUP BY actor_login
HAVING (repos < 10000) AND (issues > 1) AND (stars > 1)
ORDER BY c DESC
LIMIT 50
┌─actor_login──────┬─────c─┬─repos─┬─issues─┬─stars─┬─anyHeavy(repo_name)──────────────────────────┐
1. │ bgamari │ 64829 │ 1 │ 99 │ 11 │ syl20bnr/spacemacs │
2. │ ornicar │ 50108 │ 3 │ 4341 │ 94 │ ornicar/lila │
3. │ dtzWill │ 44031 │ 3 │ 152 │ 125 │ GitSquared/edex-ui │
4. │ Gargron │ 30983 │ 2 │ 2247 │ 152 │ arkency/reactjs_koans │
5. │ markjaquith │ 24605 │ 3 │ 43 │ 17 │ puphpet/puphpet │
6. │ PeterBot │ 23951 │ 1 │ 496 │ 3 │ cdnjs/cdnjs │
7. │ vitorgalvao │ 23817 │ 4 │ 3241 │ 106 │ caskroom/homebrew-cask │
8. │ Roshanjossey │ 22368 │ 2 │ 260 │ 38 │ bmorelli25/Become-A-Full-Stack-Web-Developer │
9. │ taylorotwell │ 21516 │ 9 │ 4508 │ 138 │ php/php-src │
10. │ kripken │ 21238 │ 6 │ 2086 │ 7 │ WebAssembly/binaryen │
11. │ tmorehouse │ 21014 │ 1 │ 1344 │ 25 │ bootstrap-vue/bootstrap-vue │
12. │ fabpot │ 20278 │ 13 │ 6543 │ 49 │ FriendsOfPHP/Goutte │
13. │ stamparm │ 17868 │ 2 │ 4170 │ 42 │ cirosantilli/x86-bare-metal-examples │
14. │ kovidgoyal │ 17593 │ 2 │ 2566 │ 17 │ Lokaltog/powerline │
15. │ radare │ 17516 │ 3 │ 6550 │ 15 │ mawww/kakoune │
16. │ josevalim │ 16490 │ 10 │ 8381 │ 3 │ rails/rails │
17. │ balloob │ 15059 │ 3 │ 2981 │ 79 │ home-assistant/home-assistant │
18. │ alexey-milovidov │ 13892 │ 2 │ 2637 │ 29 │ ClickHouse/ClickHouse │
19. │ mrdoob │ 13816 │ 4 │ 5371 │ 24 │ dataarts/webgl-globe │
20. │ kroitor │ 13593 │ 1 │ 4091 │ 122 │ vespa-engine/vespa │
21. │ jsteemann │ 13265 │ 1 │ 205 │ 9 │ arangodb/arangodb │
22. │ jerryzh168 │ 12872 │ 4 │ 113 │ 80 │ pytorch/pytorch │
23. │ akien-mga │ 12473 │ 1 │ 10464 │ 59 │ godotengine/godot │
24. │ afc163 │ 12416 │ 10 │ 10686 │ 1169 │ ant-design/ant-design │
25. │ timabbott │ 12381 │ 1 │ 5146 │ 8 │ zulip/zulip │
26. │ XhmikosR │ 12261 │ 15 │ 1698 │ 992 │ ethereum-mining/ethminer │
27. │ zachhuff386 │ 12229 │ 1 │ 395 │ 31 │ pritunl/pritunl │
28. │ yyx990803 │ 12204 │ 25 │ 7455 │ 305 │ meteor/meteor │
29. │ timgraham │ 12196 │ 1 │ 73 │ 2 │ django/django │
30. │ bkimminich │ 12009 │ 2 │ 890 │ 285 │ epeli/underscore.string │
31. │ normanmaurer │ 11957 │ 3 │ 3758 │ 8 │ eclipse/vert.x │
32. │ TomasVotruba │ 11891 │ 2 │ 2115 │ 171 │ thephpleague/flysystem │
33. │ thatch45 │ 11775 │ 1 │ 2061 │ 61 │ toml-lang/toml │
34. │ markstory │ 11644 │ 2 │ 3010 │ 79 │ neovim/neovim │
35. │ TimothyGu │ 11472 │ 9 │ 773 │ 105 │ pugjs/pug │
36. │ joaomoreno │ 11251 │ 3 │ 8579 │ 131 │ borgbackup/borg │
37. │ mmoayyed │ 10895 │ 1 │ 842 │ 156 │ apereo/cas │
38. │ eliben │ 10848 │ 5 │ 155 │ 4 │ google/go-cloud │
39. │ bpasero │ 10530 │ 3 │ 11763 │ 32 │ microsoft/vscode │
40. │ jreback │ 10498 │ 2 │ 9350 │ 6 │ pydata/pandas │
41. │ dcramer │ 10452 │ 5 │ 1680 │ 82 │ strapi/strapi │
42. │ maxlazio │ 10435 │ 1 │ 124 │ 4 │ gitlabhq/gitlabhq │
43. │ wing328 │ 10400 │ 2 │ 2847 │ 70 │ OpenAPITools/openapi-generator │
44. │ jjallaire │ 10297 │ 2 │ 109 │ 2 │ rstudio/rstudio │
45. │ mfussenegger │ 10224 │ 1 │ 323 │ 135 │ jkbr/httpie │
46. │ mikejolley │ 10189 │ 3 │ 10595 │ 7 │ Varying-Vagrant-Vagrants/VVV │
47. │ jdalton │ 10088 │ 12 │ 5116 │ 53 │ es-shims/es5-shim │
48. │ stephentoub │ 10041 │ 8 │ 5205 │ 8 │ dotnet/corefx │
49. │ bbatsov │ 10030 │ 7 │ 3644 │ 18 │ clojure-emacs/cider │
50. │ hrydgard │ 10005 │ 2 │ 2858 │ 5 │ hrydgard/ppsspp │
└──────────────────┴───────┴───────┴────────┴───────┴──────────────────────────────────────────────┘
50 rows in set. Elapsed: 4.550 sec. Processed 271.19 million rows, 5.16 GB (59.61 million rows/s., 1.13 GB/s.)
Ok, I see some real people here. I'm definitely sure about at least one of them.
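A cruder, name-based filter can complement the activity thresholds. The sketch below assumes that automated accounts can be recognized by their login suffix, which is only a heuristic: it would miss accounts such as direwolf-github and would also drop a human whose login happens to end in "bot".

```sql
-- Sketch only: matching on the login suffix is an assumed heuristic, not a reliable bot detector.
SELECT
    actor_login,
    count() AS c,
    uniq(repo_name) AS repos
FROM github_events
WHERE (event_type = 'PushEvent')
    AND NOT endsWith(actor_login, '[bot]')       -- GitHub App accounts such as github-actions[bot]
    AND NOT endsWith(lower(actor_login), 'bot')  -- logins such as LombiqBot or renovate-bot
GROUP BY actor_login
ORDER BY c DESC
LIMIT 50
```

Both conditions are needed because a login ending in `[bot]` ends with a bracket, so the second check alone does not catch it.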
SELECT
lower(substring(repo_name, 1, position(repo_name, '/'))) AS org,
count() AS stars
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY org
ORDER BY stars DESC
LIMIT 50
┌─org──────────────────┬───stars─┐
1. │ google/ │ 1414877 │
2. │ microsoft/ │ 1361303 │
3. │ facebook/ │ 1123380 │
4. │ alibaba/ │ 580707 │
5. │ sindresorhus/ │ 565535 │
6. │ apache/ │ 553204 │
7. │ vuejs/ │ 494824 │
8. │ tensorflow/ │ 425613 │
9. │ freecodecamp/ │ 407610 │
10. │ fossasia/ │ 398255 │
11. │ github/ │ 379901 │
12. │ airbnb/ │ 378280 │
13. │ 996icu/ │ 354957 │
14. │ angular/ │ 314363 │
15. │ square/ │ 296221 │
16. │ tencent/ │ 290161 │
17. │ symfony/ │ 281686 │
18. │ mozilla/ │ 275422 │
19. │ facebookresearch/ │ 269377 │
20. │ twitter/ │ 236120 │
21. │ shadowsocks/ │ 232179 │
22. │ kamranahmedse/ │ 209439 │
23. │ donnemartin/ │ 195645 │
24. │ netflix/ │ 193916 │
25. │ dotnet/ │ 191059 │
26. │ kubernetes/ │ 188557 │
27. │ golang/ │ 178867 │
28. │ googlesamples/ │ 176609 │
29. │ thealgorithms/ │ 174471 │
30. │ spring-projects/ │ 174352 │
31. │ zeit/ │ 174319 │
32. │ apple/ │ 173102 │
33. │ getify/ │ 171049 │
34. │ docker/ │ 170096 │
35. │ laravel/ │ 167122 │
36. │ jwasham/ │ 166419 │
37. │ googlechrome/ │ 162071 │
38. │ twbs/ │ 161475 │
39. │ flutter/ │ 159926 │
40. │ hashicorp/ │ 158643 │
41. │ awslabs/ │ 154676 │
42. │ jakewharton/ │ 143577 │
43. │ reactjs/ │ 142652 │
44. │ kennethreitz/ │ 140109 │
45. │ reactivex/ │ 140062 │
46. │ elastic/ │ 139639 │
47. │ googlecloudplatform/ │ 139094 │
48. │ uber/ │ 137406 │
49. │ atom/ │ 137328 │
50. │ justjavac/ │ 136694 │
└──────────────────────┴─────────┘
50 rows in set. Elapsed: 0.609 sec. Processed 232.13 million rows, 1.81 GB (380.98 million rows/s., 2.97 GB/s.)
You may notice that Google is slightly ahead of Microsoft. Actually, it depends on how you count. Maybe you should sum up TensorFlow, Kubernetes, Flutter, Golang, and Chrome for Google; GitHub and DotNet for Microsoft; Facebook Research and React for Facebook.
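One way to try such a roll-up is to map organization prefixes to a parent company before grouping. The sketch below is an illustration only: the mapping arrays and the `company` alias are assumptions made here, built from the organization names that appear in the table above, and the choice of which organization belongs to which company is debatable.

```sql
-- Sketch only: the organization-to-company mapping is an illustrative assumption.
WITH lower(substring(repo_name, 1, position(repo_name, '/'))) AS org
SELECT
    -- transform() returns org unchanged when it is not listed in the first array
    transform(
        org,
        ['tensorflow/', 'kubernetes/', 'flutter/', 'golang/', 'googlechrome/', 'github/', 'dotnet/', 'facebookresearch/', 'reactjs/'],
        ['google/', 'google/', 'google/', 'google/', 'google/', 'microsoft/', 'microsoft/', 'facebook/', 'facebook/']
    ) AS company,
    count() AS stars
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY company
ORDER BY stars DESC
LIMIT 50
```

Adding or removing entries in the two arrays is enough to test a different way of counting.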
SELECT
    lower(substring(repo_name, 1, position(repo_name, '/'))) AS org,
    uniq(repo_name) AS repos
FROM
(
    SELECT repo_name
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY repo_name
    HAVING count() >= 10
)
GROUP BY org
ORDER BY repos DESC
LIMIT 50
┌─org──────────────────┬─repos─┐
1. │ microsoft/ │ 3359 │
2. │ google/ │ 1599 │
3. │ openstack/ │ 1412 │
4. │ packtpublishing/ │ 1381 │
5. │ apache/ │ 1017 │
6. │ sindresorhus/ │ 1016 │
7. │ azure/ │ 818 │
8. │ aws-samples/ │ 761 │
9. │ mozilla/ │ 727 │
10. │ googlecloudplatform/ │ 718 │
11. │ awslabs/ │ 712 │
12. │ jenkinsci/ │ 596 │
13. │ substack/ │ 594 │
14. │ ibm/ │ 593 │
15. │ adafruit/ │ 541 │
16. │ mapbox/ │ 494 │
17. │ mafintosh/ │ 481 │
18. │ azure-samples/ │ 473 │
19. │ cyanogenmod/ │ 451 │
20. │ apress/ │ 445 │
21. │ w3c/ │ 397 │
22. │ keijiro/ │ 391 │
23. │ udacity/ │ 387 │
24. │ stackforge/ │ 359 │
25. │ alibaba/ │ 348 │
26. │ facebookresearch/ │ 328 │
27. │ llsourcell/ │ 328 │
28. │ esri/ │ 325 │
29. │ facebook/ │ 311 │
30. │ egoist/ │ 302 │
31. │ github/ │ 299 │
32. │ fossasia/ │ 292 │
33. │ spatie/ │ 292 │
34. │ mattn/ │ 288 │
35. │ heroku/ │ 283 │
36. │ unity-technologies/ │ 278 │
37. │ vim-scripts/ │ 277 │
38. │ ropensci/ │ 275 │
39. │ jonschlinkert/ │ 271 │
40. │ gnome/ │ 267 │
41. │ googlesamples/ │ 264 │
42. │ maxogden/ │ 264 │
43. │ hashicorp/ │ 262 │
44. │ jetbrains/ │ 260 │
45. │ lineageos/ │ 259 │
46. │ eclipse/ │ 257 │
47. │ automattic/ │ 255 │
48. │ officedev/ │ 255 │
49. │ segmentio/ │ 253 │
50. │ aws/ │ 251 │
└──────────────────────┴───────┘
50 rows in set. Elapsed: 1.461 sec. Processed 232.13 million rows, 1.81 GB (158.87 million rows/s., 1.24 GB/s.)
To avoid showing users with a huge number of forked repositories, or accounts that misuse GitHub in some way, we only count repositories with at least 10 stars. Microsoft wins by a huge margin. Packt Publishing is quite interesting: they have repositories for articles and books. Sindre Sorhus is also a very notable open-source contributor, known mostly for the list of awesome awesomeness.
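To see how much the threshold matters, a sketch like the following counts each organization's repositories both with and without it. The column names repos_any_stars and repos_10_stars are made up for this example.

```sql
-- Sketch: compare repository counts per organization
-- with and without the 10-star threshold.
SELECT
    lower(substring(repo_name, 1, position(repo_name, '/'))) AS org,
    count() AS repos_any_stars,           -- every repository that received at least one star
    countIf(stars >= 10) AS repos_10_stars -- only repositories with 10 or more stars
FROM
(
    SELECT repo_name, count() AS stars
    FROM github_events
    WHERE event_type = 'WatchEvent'
    GROUP BY repo_name
)
GROUP BY org
ORDER BY repos_10_stars DESC
LIMIT 50
```

Organizations whose two counts differ by a large factor are the ones the threshold is meant to filter out.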
Next, the size of the community that is visible on GitHub:
SELECT
lower(substring(repo_name, 1, position(repo_name, '/'))) AS org,
uniq(actor_login) AS authors,
uniqIf(actor_login, event_type = 'PullRequestEvent') AS pr_authors,
uniqIf(actor_login, event_type = 'IssuesEvent') AS issue_authors,
uniqIf(actor_login, event_type = 'IssueCommentEvent') AS comment_authors,
uniqIf(actor_login, event_type = 'PullRequestReviewCommentEvent') AS review_authors,
uniqIf(actor_login, event_type = 'PushEvent') AS push_authors
FROM github_events
WHERE event_type IN ('PullRequestEvent', 'IssuesEvent', 'IssueCommentEvent', 'PullRequestReviewCommentEvent', 'PushEvent')
GROUP BY org
ORDER BY authors DESC
LIMIT 50
┌─org─────────────────────┬─authors─┬─pr_authors─┬─issue_authors─┬─comment_authors─┬─review_authors─┬─push_authors─┐
│ microsoft/ │ 241626 │ 31468 │ 143654 │ 183013 │ 10727 │ 7020 │
│ facebook/ │ 114166 │ 19094 │ 49710 │ 93241 │ 4966 │ 1168 │
│ google/ │ 99459 │ 29227 │ 49833 │ 69777 │ 6777 │ 3074 │
│ microsoftdocs/ │ 91337 │ 19443 │ 60886 │ 49466 │ 1652 │ 1325 │
│ angular/ │ 79259 │ 9680 │ 34305 │ 66782 │ 2124 │ 199 │
│ docker/ │ 78973 │ 6854 │ 35368 │ 60896 │ 1810 │ 261 │
│ apache/ │ 67317 │ 32971 │ 22574 │ 48848 │ 14148 │ 2016 │
│ azure/ │ 58551 │ 13249 │ 33899 │ 44740 │ 5675 │ 3037 │
│ tensorflow/ │ 54582 │ 7833 │ 26565 │ 46277 │ 2559 │ 515 │
│ dotnet/ │ 47447 │ 7463 │ 27745 │ 36630 │ 3150 │ 875 │
│ aws/ │ 45923 │ 6675 │ 23550 │ 37467 │ 2004 │ 899 │
│ kubernetes/ │ 45667 │ 11699 │ 21882 │ 38617 │ 5407 │ 337 │
│ fortawesome/ │ 45078 │ 648 │ 13436 │ 38262 │ 46 │ 15 │
│ learn-co-students/ │ 40335 │ 40188 │ 858 │ 1775 │ 179 │ 615 │
│ alibaba/ │ 38910 │ 4307 │ 25251 │ 26695 │ 767 │ 747 │
│ atom/ │ 38636 │ 3607 │ 18516 │ 29466 │ 740 │ 113 │
│ elastic/ │ 38434 │ 6551 │ 18754 │ 30945 │ 1739 │ 726 │
│ hashicorp/ │ 37577 │ 7375 │ 19230 │ 29485 │ 1819 │ 480 │
│ flutter/ │ 37463 │ 2824 │ 17944 │ 31666 │ 1105 │ 258 │
│ mozilla/ │ 36014 │ 8653 │ 20173 │ 26490 │ 3062 │ 1363 │
│ vuejs/ │ 34443 │ 7302 │ 16766 │ 25358 │ 1046 │ 81 │
│ npm/ │ 33174 │ 4013 │ 14414 │ 24715 │ 381 │ 105 │
│ ansible/ │ 33000 │ 9272 │ 16673 │ 26281 │ 2745 │ 186 │
│ jlord/ │ 32825 │ 32253 │ 483 │ 2254 │ 22 │ 14 │
│ homebrew/ │ 31386 │ 13938 │ 11753 │ 22836 │ 4102 │ 92 │
│ / │ 30898 │ 1427 │ 1979 │ 1915 │ 195 │ 27879 │
│ laravel/ │ 30572 │ 9513 │ 14307 │ 24452 │ 1519 │ 27 │
│ aspnet/ │ 30071 │ 3840 │ 17213 │ 24046 │ 999 │ 222 │
│ github/ │ 29092 │ 11987 │ 10191 │ 19515 │ 2088 │ 682 │
│ valvesoftware/ │ 27060 │ 552 │ 12612 │ 23736 │ 62 │ 49 │
│ udacity/ │ 26844 │ 21197 │ 4065 │ 8869 │ 429 │ 356 │
│ firebase/ │ 25945 │ 2168 │ 11587 │ 21571 │ 596 │ 277 │
│ freecodecamp/ │ 25859 │ 10681 │ 10117 │ 14189 │ 792 │ 194 │
│ nextcloud/ │ 25701 │ 2376 │ 14521 │ 21043 │ 617 │ 273 │
│ firstcontributions/ │ 25343 │ 25085 │ 143 │ 2198 │ 62 │ 5 │
│ angular-ui/ │ 24559 │ 3293 │ 11466 │ 19615 │ 336 │ 93 │
│ terraform-providers/ │ 24515 │ 5038 │ 11408 │ 19948 │ 1866 │ 350 │
│ awslabs/ │ 24377 │ 5956 │ 12788 │ 17445 │ 1639 │ 1226 │
│ nodejs/ │ 23831 │ 4747 │ 10691 │ 19172 │ 2118 │ 437 │
│ jenkinsci/ │ 23737 │ 9809 │ 7190 │ 17259 │ 2700 │ 2122 │
│ rails/ │ 23513 │ 7010 │ 10397 │ 19602 │ 2299 │ 126 │
│ react-native-community/ │ 23475 │ 2610 │ 8760 │ 19944 │ 652 │ 153 │
│ googlecloudplatform/ │ 23173 │ 6961 │ 12848 │ 17364 │ 2870 │ 1466 │
│ ant-design/ │ 23159 │ 2598 │ 14263 │ 16656 │ 577 │ 129 │
│ spring-projects/ │ 22626 │ 5636 │ 12276 │ 16141 │ 931 │ 136 │
│ home-assistant/ │ 22568 │ 5707 │ 10258 │ 19222 │ 2256 │ 124 │
│ definitelytyped/ │ 21679 │ 13908 │ 4749 │ 13707 │ 4122 │ 60 │
│ golang/ │ 21245 │ 2486 │ 13049 │ 16544 │ 303 │ 53 │
│ automattic/ │ 21196 │ 3513 │ 11292 │ 16248 │ 1065 │ 748 │
│ grafana/ │ 21175 │ 2777 │ 9586 │ 17504 │ 690 │ 143 │
└─────────────────────────┴─────────┴────────────┴───────────────┴─────────────────┴────────────────┴──────────────┘
50 rows in set. Elapsed: 11.110 sec. Processed 2.20 billion rows, 27.55 GB (198.17 million rows/s., 2.48 GB/s.)
Microsoft wins in 4 out of 6 categories (not counting the row with the empty organization name, "/", which has the most push authors). Apache wins for the number of code reviewers (I thought they were using JIRA, but evidently that's not always the case). Learn-co-students wins for the number of PR authors. By the way, they also have 250 000 repositories!
Please take it with a grain of salt. Not every team uses GitHub as its issue tracker or code review system. Linux and Postgres use mailing lists. LLVM uses Bugzilla and Phabricator.
I want to get the repositories with the most added and removed code over time. If I do it in a naive way, the multiple forks of the cdnjs repository will be on top. By the way, cdnjs is the largest repository on GitHub by size in bytes (it contains all popular JavaScript libraries; the total size is 254 GB). To get something interesting, I added a cap on the diff size of each pull request (fewer than 10,000 added and fewer than 10,000 removed lines) and a limit on the ratio of added to removed code (in actively developed codebases, the ratio should be close to one).
SELECT
repo_name,
count() AS prs,
uniq(actor_login) AS authors,
sum(additions) AS adds,
sum(deletions) AS dels
FROM github_events
WHERE (event_type = 'PullRequestEvent') AND (action = 'opened') AND (additions < 10000) AND (deletions < 10000)
GROUP BY repo_name
HAVING (adds / dels) < 10
ORDER BY adds + dels DESC
LIMIT 50
┌─repo_name──────────────────────────────────┬────prs─┬─authors─┬─────adds─┬─────dels─┐
│ everypolitician/everypolitician-data │ 150531 │ 18 │ 66782324 │ 71203492 │
│ brianchandotcom/liferay-portal │ 91962 │ 337 │ 29304605 │ 13799025 │
│ googleapis/google-api-java-client-services │ 4689 │ 17 │ 12184021 │ 9874444 │
│ code-dot-org/code-dot-org │ 35776 │ 125 │ 14329448 │ 6810679 │
│ elastic/kibana │ 51783 │ 798 │ 12478455 │ 6092002 │
│ dotnet/roslyn │ 23642 │ 580 │ 12112106 │ 5703835 │
│ shuyangzhou/liferay-portal │ 7334 │ 140 │ 11510525 │ 6287677 │
│ cms-sw/cmssw │ 28087 │ 947 │ 9488742 │ 5076486 │
│ Azure/azure-sdk-for-python │ 10323 │ 339 │ 10692004 │ 3691779 │
│ Azure/azure-sdk-for-net │ 9775 │ 935 │ 9759909 │ 3188655 │
│ Azure/azure-sdk-for-java │ 11486 │ 270 │ 10819433 │ 2093449 │
│ kubernetes/kubernetes │ 50005 │ 4010 │ 8499117 │ 4214536 │
│ Azure/azure-sdk-for-js │ 8559 │ 173 │ 7966391 │ 2764315 │
│ brain-tec/odoo │ 11568 │ 25 │ 6844586 │ 3767385 │
│ rust-lang/rust │ 35150 │ 3253 │ 6333591 │ 4134718 │
│ Azure/azure-sdk-for-go │ 11374 │ 145 │ 8813959 │ 1631936 │
│ ImagicalMine/ImagicalMine │ 1808 │ 494 │ 5657300 │ 4651706 │
│ cockroachdb/cockroach │ 27273 │ 448 │ 6339775 │ 3681529 │
│ AzureSDKAutomation/azure-sdk-for-go │ 11359 │ 5 │ 7347712 │ 2664515 │
│ AzureSDKAutomation/azure-sdk-for-java │ 5625 │ 2 │ 8289316 │ 950276 │
│ dotnet/corefx │ 23909 │ 1077 │ 5551303 │ 3611403 │
│ elastic/elasticsearch │ 35505 │ 1954 │ 6096474 │ 3035768 │
│ sergiogonzalez/liferay-portal │ 3834 │ 160 │ 6082683 │ 2815344 │
│ Azure/azure-powershell │ 6641 │ 1141 │ 6148306 │ 2514765 │
│ DefinitelyTyped/DefinitelyTyped │ 35999 │ 13794 │ 5722804 │ 2630593 │
│ NixOS/nixpkgs │ 84339 │ 3806 │ 5039996 │ 2723533 │
│ ballerina-platform/ballerina-lang │ 11611 │ 200 │ 4683027 │ 2902904 │
│ AzureSDKAutomation/azure-sdk-for-python │ 3169 │ 8 │ 5340322 │ 2059824 │
│ Baystation12/Baystation12 │ 17379 │ 581 │ 4046648 │ 3326499 │
│ Roll20/roll20-character-sheets │ 6955 │ 1090 │ 5360348 │ 2000635 │
│ CleverRaven/Cataclysm-DDA │ 27370 │ 1421 │ 4512562 │ 2775415 │
│ edx/edx-platform │ 24649 │ 721 │ 4436061 │ 2720801 │
│ apple/swift │ 33600 │ 1148 │ 4628287 │ 2429224 │
│ Azure/azure-rest-api-specs │ 9771 │ 1830 │ 6060533 │ 971685 │
│ mozilla-b2g/gaia │ 30192 │ 903 │ 4533023 │ 2275032 │
│ BabylonJS/Babylon.js │ 6378 │ 379 │ 3782068 │ 2994772 │
│ apache/flink │ 13581 │ 1142 │ 4835276 │ 1780093 │
│ Ericsson/llvm-project │ 5273 │ 7 │ 4409783 │ 1959794 │
│ ansible/ansible │ 42368 │ 7278 │ 4943233 │ 1324581 │
│ ceph/ceph │ 37310 │ 1394 │ 4388528 │ 1763042 │
│ odoo/odoo │ 46326 │ 2002 │ 4017686 │ 2067374 │
│ pytorch/pytorch │ 29916 │ 2188 │ 3818920 │ 2261688 │
│ cocos2d/cocos2d-x │ 14981 │ 1125 │ 3764992 │ 2237766 │
│ Azure/azure-sdk-for-ruby │ 1582 │ 93 │ 3334700 │ 2667297 │
│ apache/ignite │ 6894 │ 366 │ 4519341 │ 1470598 │
│ juju/juju │ 11939 │ 145 │ 3949798 │ 1887980 │
│ tensorflow/tensorflow │ 15404 │ 3806 │ 4007607 │ 1503030 │
│ servo/servo │ 14402 │ 1147 │ 3660324 │ 1840343 │
│ AzureSDKAutomation/azure-sdk-for-js │ 2655 │ 4 │ 4085633 │ 1330763 │
│ natecavanaugh/liferay-portal │ 3608 │ 109 │ 3534320 │ 1779103 │
└────────────────────────────────────────────┴────────┴─────────┴──────────┴──────────┘
50 rows in set. Elapsed: 1.290 sec. Processed 214.63 million rows, 5.47 GB (166.42 million rows/s., 4.24 GB/s.)
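The HAVING clause above only bounds the ratio from above. As a sketch of a stricter variant, the query below keeps repositories whose ratio of added to removed lines is within a factor of two of one. The bounds 0.5 and 2 are arbitrary choices for illustration, not values from the original analysis.

```sql
-- Sketch: same report, but with the added/removed ratio bounded on both sides.
SELECT
    repo_name,
    count() AS prs,
    uniq(actor_login) AS authors,
    sum(additions) AS adds,
    sum(deletions) AS dels
FROM github_events
WHERE (event_type = 'PullRequestEvent') AND (action = 'opened') AND (additions < 10000) AND (deletions < 10000)
GROUP BY repo_name
-- dels > 0 avoids division by zero; 0.5 and 2 are illustrative bounds
HAVING (dels > 0) AND ((adds / dels) BETWEEN 0.5 AND 2)
ORDER BY adds + dels DESC
LIMIT 50
```

Repositories that mostly accumulate generated code, with far more additions than deletions, should drop out under the tighter bounds.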
Which of the 10,000 most-starred repositories receive the most pushes?
SELECT
repo_name,
count() AS pushes,
uniq(actor_login) AS authors
FROM github_events
WHERE (event_type = 'PushEvent') AND (repo_name IN
(
SELECT repo_name
FROM github_events
WHERE event_type = 'WatchEvent'
GROUP BY repo_name
ORDER BY count() DESC
LIMIT 10000
))
GROUP BY repo_name
ORDER BY count() DESC
LIMIT 50
┌─repo_name────────────────────┬─pushes─┬─authors─┐
│ CocoaPods/Specs │ 516861 │ 375 │
│ odoo/odoo │ 239208 │ 95 │
│ docker-library/docs │ 237496 │ 6 │
│ openstack/openstack │ 202003 │ 2 │
│ greatfire/wiki │ 169566 │ 4 │
│ pytorch/pytorch │ 147447 │ 212 │
│ / │ 145675 │ 18546 │
│ NixOS/nixpkgs │ 125285 │ 177 │
│ Automattic/wp-calypso │ 120883 │ 432 │
│ edx/edx-platform │ 118395 │ 330 │
│ freebsd/freebsd │ 109816 │ 2 │
│ ghc/ghc │ 79590 │ 6 │
│ servo/servo │ 76171 │ 33 │
│ JetBrains/kotlin │ 75102 │ 119 │
│ boostorg/boost │ 72188 │ 14 │
│ llvm-mirror/llvm │ 71895 │ 2 │
│ gradle/gradle │ 71725 │ 72 │
│ elastic/elasticsearch │ 70432 │ 252 │
│ chromium/chromium │ 67467 │ 3 │
│ getsentry/sentry │ 63600 │ 96 │
│ keybase/client │ 61854 │ 44 │
│ guardian/frontend │ 61601 │ 205 │
│ rust-lang/rust │ 60604 │ 39 │
│ elastic/kibana │ 59447 │ 329 │
│ llvm/llvm-project │ 56714 │ 826 │
│ arangodb/arangodb │ 55932 │ 46 │
│ discourse/discourse │ 55921 │ 47 │
│ cockroachdb/cockroach │ 55384 │ 139 │
│ JuliaLang/julia │ 54396 │ 75 │
│ owncloud/core │ 53321 │ 136 │
│ apple/swift │ 52099 │ 188 │
│ ornicar/lila │ 51919 │ 13 │
│ ceph/ceph │ 51024 │ 144 │
│ Homebrew/homebrew-core │ 50537 │ 50 │
│ tensorflow/tensorflow │ 50499 │ 110 │
│ WordPress/gutenberg │ 49950 │ 179 │
│ sourcegraph/sourcegraph │ 49580 │ 84 │
│ JetBrains/intellij-community │ 48047 │ 15 │
│ cdnjs/cdnjs │ 46219 │ 24 │
│ mongodb/mongo │ 43202 │ 268 │
│ Homebrew/homebrew-cask │ 41360 │ 14 │
│ ansible/ansible │ 41176 │ 74 │
│ dart-lang/sdk │ 40914 │ 104 │
│ php/php-src │ 39739 │ 8 │
│ libretro/RetroArch │ 39628 │ 17 │
│ h2oai/h2o-3 │ 38992 │ 99 │
│ kubernetes/kubernetes │ 38568 │ 94 │
│ crate/crate │ 38146 │ 47 │
│ ruby/ruby │ 37868 │ 31 │
│ mono/monodevelop │ 37843 │ 96 │
└──────────────────────────────┴────────┴─────────┘
50 rows in set. Elapsed: 1.636 sec. Processed 79.04 million rows, 626.05 MB (48.30 million rows/s., 382.62 MB/s.)
Who leaves the most review comments on pull requests?
SELECT
actor_login,
count(),
uniq(repo_name) AS repos,
uniq(repo_name, number) AS prs,
replaceRegexpAll(substringUTF8(anyHeavy(body), 1, 100), '[\r\n]', ' ') AS comment
FROM github_events
WHERE (event_type = 'PullRequestReviewCommentEvent') AND (action = 'created')
GROUP BY actor_login
ORDER BY count() DESC
LIMIT 50
┌─actor_login──────────────┬─count()─┬─repos─┬───prs─┬─comment─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ houndci-bot │ 991954 │ 5877 │ 78600 │ Trailing Whitespace Violation: Lines should not have trailing whitespace. (trailing_whitespace) │
│ houndci │ 342114 │ 3102 │ 28679 │ Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping. │
│ codacy-bot │ 327712 │ 2313 │ 21581 │ ![Codacy](https://app.codacy.com/assets/images/favicon.png) Issue found: [Use of !important](https:/ │
│ hound[bot] │ 224854 │ 1734 │ 13812 │ Parsing error: 'import' and 'export' may appear only with 'sourceType: module' │
│ codeclimate[bot] │ 90778 │ 1424 │ 11589 │ Similar blocks of code found in 2 locations. Consider refactoring. │
│ github-learning-lab[bot] │ 64453 │ 15296 │ 27480 │ ## Step 10: Add links to your list ✅ Check ✅ That ✅ Off your list! Great job with lists. Let's try │
│ stickler-ci │ 56022 │ 457 │ 7757 │ Unexpected " " (selector-descendant-combinator-no-non-space) [error] │
│ golangcibot │ 40329 │ 885 │ 8305 │ S1012: should use `time.Since` instead of `time.Now().Sub` (from `gosimple`) │
│ github-actions[bot] │ 38216 │ 733 │ 3321 │ **[misspell]** reported by [reviewdog](https://github.com/reviewdog/reviewdog) :dog: "env │
│ accesslint[bot] │ 37459 │ 1121 │ 3889 │ This image is missing a text alternative (`alt` attribute). This is a problem for people using scree │
│ daosbuild1 │ 35275 │ 5 │ 1423 │ (style) trailing whitespace │
│ foreign-sub │ 33407 │ 3 │ 196 │ Codacy found an issue: [Use of assert detected. The enclosed code will be removed when compiling to │
│ jreback │ 30364 │ 40 │ 6184 │ create this as another fixture (e.g. *like* df_strings*, but name something different) and use param │
│ staging-muse-bot[bot] │ 29825 │ 18 │ 104 │ comment ignored │
│ seanlip │ 29027 │ 18 │ 2697 │ Sgtm. Thanks @prasanna08! │
│ CyrusNajmabadi │ 28783 │ 55 │ 3628 │ ah, that's fair. :) │
│ stephentoub │ 28284 │ 69 │ 6214 │ Nit: ```C# Handle?.Dispose(); ``` │
│ TOMARUMARU │ 28211 │ 4228 │ 9736 │ doneとfailで共通する処理はalwaysにまとめましょう。 │
│ monocodus[bot] │ 26457 │ 391 │ 1596 │ _**The function is too complicated**_ `bcon_print( const bcon * bc)` now has [cyclomatic complexity │
│ liggitt │ 23745 │ 134 │ 5488 │ suggest dropping this from the initial backfill merge, and adding it in a later PR │
│ i-date │ 23523 │ 3922 │ 6280 │ > 300円未満の間違いではないでしょうか? > ありがとうございます。修正しました! 上記ですが、一度修正いただいたあとにrevertされて修正前に戻っているようですのでご確認ください。 │
│ vkurennov │ 23280 │ 666 │ 3674 │ Ок, но все равно раздели на 2 метода │
│ sourcery-ai[bot] │ 23201 │ 1177 │ 1652 │ Function `test_steps_are_treated_as_coroutines` refactored with the following changes: - Replace ran │
│ codeschool-kiddo │ 22994 │ 7137 │ 22156 │ Looks good! Could you also please mention your favorite Code School path in your introduction? This │
│ MartinHjelmare │ 21639 │ 48 │ 3957 │ Ok. I think a timeout option seems necessary here. @balloob? │
│ sonarcloud[bot] │ 21452 │ 275 │ 1421 │ ![Code Smell](https://sonarsource.github.io/sonar-github/code_smell.png 'Code Smell') Code Smell: Re │
│ shingoteshima │ 21248 │ 3742 │ 6996 │ groupsテーブルのカラムなのでnameのみで十分かと思います。 │
│ balloob │ 19240 │ 76 │ 4967 │ This should be assigned to a parameter and should only be done if `test_timestamp` was not passed in │
│ cloud-fan │ 18803 │ 18 │ 3582 │ nit: we should build the attribute set ahead, and use `partitionAttrs.contains(cond.references.head) │
│ yurii-litvinov │ 18322 │ 525 │ 3219 │ Тут result-у вовсе не обязательно быть volatile --- сначала выполняется запись в result, затем в sup │
│ deads2k │ 17824 │ 132 │ 4114 │ This is very important. I'm not super familiar with the doc layout, but is there a way to put a gol │
│ rynowak │ 17366 │ 107 │ 3002 │ This is all that's required to get MVC to look in another directory - if you need to change director │
│ DoanVanToan │ 17118 │ 243 │ 1979 │ setText(item.get... │
│ smarterclayton │ 16608 │ 138 │ 4281 │ So another option is to do what support-operator does and just ignore the tags: https://github.co │
│ pedrobaeza │ 16563 │ 194 │ 4644 │ s/it is available at/that it's available at │
│ WikiaTech │ 16347 │ 3 │ 1594 │ ![MAJOR](https://raw.githubusercontent.com/SonarCommunity/sonar-github/master/images/severity-major. │
│ Nitpick-CI │ 16113 │ 323 │ 1923 │ Spaces must be used to indent lines; tabs are not allowed │
│ boegel │ 15907 │ 102 │ 3951 │ Time will tell. Going forward we can always easily overrule the easyblock by including `sanity_ch │
│ sindresorhus │ 15829 │ 834 │ 4432 │ https://github.com/sindresorhus/path-key/blob/d60207f9ab9dc9e60d49c87faacf415a4946287c/index.js#L8-L │
│ jkotas │ 15805 │ 54 │ 4336 │ (for bytes). For `_hex` - not yet. │
│ codelingo[bot] │ 15768 │ 35 │ 14662 │ Local variable "err" with a limited scope (less than 10 lines) should be short rather than long [as │
│ actcat-bot │ 15661 │ 4 │ 421 │ __[CoffeeLint]__ object かクラスにおいて重複したキーが定義されています。 │
│ Hixie │ 15030 │ 52 │ 2909 │ When changing this number, update the error message below as well describing all expected license ty │
│ hellovietnam93 │ 14919 │ 257 │ 2567 │ validate validates :follower. em tìm hiểu vì sao giúp anh nhé │
│ sttts │ 14900 │ 160 │ 2875 │ Implied by "must be pruned", which depends on `x-preserve-unknown-fields`. │
│ dev-dotcms │ 14860 │ 1 │ 293 │ Codacy found an issue: [Avoid empty catch blocks](https://app.codacy.com/manual/dotCMS/core/pullRequ │
│ chriseth │ 14672 │ 59 │ 2746 │ This sounds like we are criticising their tool based on the conclusions they draw from using it. We │
│ SonarTech │ 14625 │ 71 │ 2397 │ ![Code Smell](https://sonarsource.github.io/sonar-github/code_smell.png 'Code Smell') Code Smell: Co │
│ mattmoor │ 14513 │ 109 │ 2364 │ maybe you need to pull HEAD? │
│ stof │ 14421 │ 400 │ 5355 │ you should switch to PSR-4 rather than PSR-0 + target-dir. target-dir is considered a legacy setting │
└──────────────────────────┴─────────┴───────┴───────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
50 rows in set. Elapsed: 1.069 sec. Processed 55.94 million rows, 11.94 GB (52.34 million rows/s., 11.17 GB/s.)
Most of the top reviewers are robots that do style checking and static analysis. But there are real people too, such as Jeff Reback (jreback). You can also find teachers who review code from students.
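To look past the robots, one rough option is to drop logins that look automated. The sketch below assumes, purely as a naming heuristic, that bot accounts contain "bot" or end with "ci"; accounts that do not follow these patterns will still slip through, and a human whose login happens to match will be dropped.

```sql
-- Sketch: most active reviewers, excluding logins that look like bots.
-- The two ILIKE patterns are a heuristic assumption, not a reliable bot detector.
SELECT
    actor_login,
    count() AS comments,
    uniq(repo_name) AS repos,
    uniq(repo_name, number) AS prs
FROM github_events
WHERE (event_type = 'PullRequestReviewCommentEvent') AND (action = 'created')
    AND NOT (actor_login ILIKE '%bot%')  -- e.g. houndci-bot, hound[bot], golangcibot
    AND NOT (actor_login ILIKE '%ci')    -- e.g. houndci, stickler-ci, Nitpick-CI
GROUP BY actor_login
ORDER BY comments DESC
LIMIT 50
```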
What are the most popular labels for issues and pull requests?
SELECT
arrayJoin(labels) AS label,
count() AS c
FROM github_events
WHERE (event_type IN ('IssuesEvent', 'PullRequestEvent', 'IssueCommentEvent')) AND (action IN ('created', 'opened', 'labeled'))
GROUP BY label
ORDER BY c DESC
LIMIT 50
┌─label─────────────────────────┬────────c─┐
1. │ bug │ 11193148 │
2. │ enhancement │ 10885289 │
3. │ dependencies │ 7752896 │
4. │ question │ 3086354 │
5. │ help wanted │ 2280160 │
6. │ greenkeeper │ 2225510 │
7. │ approved │ 1880281 │
8. │ cla: yes │ 1802662 │
9. │ cncf-cla: yes │ 1541233 │
10. │ lgtm │ 1526217 │
11. │ feature │ 1198997 │
12. │ Bug │ 1186836 │
13. │ in progress │ 1049065 │
14. │ size/XS │ 1044363 │
15. │ size/L │ 915358 │
16. │ good first issue │ 803696 │
17. │ documentation │ 747209 │
18. │ size/M │ 730813 │
19. │ kind/bug │ 709181 │
20. │ release-note-none │ 686971 │
21. │ size/S │ 639025 │
22. │ size/XXL │ 486264 │
23. │ Feature │ 475471 │
24. │ javascript │ 452995 │
25. │ discussion │ 437075 │
26. │ Enhancement │ 428585 │
27. │ feature request │ 428393 │
28. │ orp-pending │ 409974 │
29. │ pending-signatures │ 368777 │
30. │ CLA Signed │ 363443 │
31. │ type: bug │ 346937 │
32. │ release-note │ 341192 │
33. │ kind/feature │ 336981 │
34. │ do-not-merge/work-in-progress │ 333276 │
35. │ ok-to-test │ 320553 │
36. │ bugzilla/valid-bug │ 319332 │
37. │ cla-already-signed │ 313960 │
38. │ do-not-merge/hold │ 308090 │
39. │ wontfix │ 301826 │
40. │ comparison-pending │ 297687 │
41. │ review │ 297014 │
42. │ invalid │ 292288 │
43. │ triaged │ 291425 │
44. │ Gitalk │ 289986 │
45. │ needs-ok-to-test │ 277338 │
46. │ duplicate │ 274092 │
47. │ size/XL │ 269963 │
48. │ ci:test:sf - success │ 267646 │
49. │ Type: Bug │ 247042 │
50. │ needs-priority │ 239657 │
└───────────────────────────────┴──────────┘
50 rows in set. Elapsed: 0.672 sec. Processed 544.36 million rows, 7.71 GB (810.65 million rows/s., 11.49 GB/s.)
There are more bugs than enhancements. Fortunately, only by a little. "javascript" is the only programming-language-related label in the top 50.
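The list also contains spelling variants such as "bug" and "Bug", or "enhancement" and "Enhancement", as separate rows. A minimal sketch for merging variants that differ only in letter case is to group by the lowercased label. Variants that differ in other ways, such as "kind/bug" or "type: bug", remain separate.

```sql
-- Sketch: the same report with labels folded to lower case,
-- so that "bug" and "Bug" are counted together.
SELECT
    lower(arrayJoin(labels)) AS label,
    count() AS c
FROM github_events
WHERE (event_type IN ('IssuesEvent', 'PullRequestEvent', 'IssueCommentEvent')) AND (action IN ('created', 'opened', 'labeled'))
GROUP BY label
ORDER BY c DESC
LIMIT 50
```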
The diversity of bugs and features is overwhelming:
SELECT
arrayJoin(labels) AS label,
count() AS c
FROM github_events
WHERE (event_type IN ('IssuesEvent', 'PullRequestEvent', 'IssueCommentEvent')) AND (action IN ('created', 'opened', 'labeled')) AND ((label ILIKE '%bug%') OR (label ILIKE '%feature%'))
GROUP BY label
ORDER BY c DESC
LIMIT 50
┌─label──────────────────────┬────────c─┐
1. │ bug │ 11193148 │
2. │ feature │ 1198997 │
3. │ Bug │ 1186836 │
4. │ kind/bug │ 709181 │
5. │ Feature │ 475471 │
6. │ feature request │ 428393 │
7. │ type: bug │ 346937 │
8. │ kind/feature │ 336981 │
9. │ bugzilla/valid-bug │ 319332 │
10. │ Type: Bug │ 247042 │
11. │ feature-request │ 222841 │
12. │ Feature Request │ 205900 │
13. │ type:bug │ 172758 │
14. │ new feature │ 145957 │
15. │ [Type] Bug │ 96927 │
16. │ type: feature │ 93263 │
17. │ bugzilla/invalid-bug │ 88513 │
18. │ bugzilla/severity-medium │ 87248 │
19. │ type/bug │ 86832 │
20. │ bugzilla/severity-high │ 83727 │
21. │ type-bug │ 64255 │
22. │ Feature request │ 63680 │
23. │ bugfix │ 60863 │
24. │ type: feature request │ 60534 │
25. │ Type: Feature │ 60205 │
26. │ type:feature │ 59464 │
27. │ bug fix │ 55374 │
28. │ C-bug │ 54114 │
29. │ BUG │ 53977 │
30. │ New Feature │ 51552 │
31. │ bug_report │ 50716 │
32. │ doc-bug │ 47805 │
33. │ T: bug │ 46117 │
34. │ bug report │ 43876 │
35. │ 🐞 bug │ 38060 │
36. │ debug │ 34964 │
37. │ dummy import from bugzilla │ 33362 │
38. │ Type: bug │ 33067 │
39. │ type: bug/fix │ 32971 │
40. │ t/bug :bug: │ 30793 │
41. │ Type:Bug │ 30091 │
42. │ Type: Feature Request │ 28620 │
43. │ bugzilla/severity-low │ 28188 │
44. │ feature_request │ 28084 │
45. │ 0.kind: bug │ 27833 │
46. │ BZ-BUG-STATUS: RESOLVED │ 27735 │
47. │ type.DocumentationBug │ 26486 │
48. │ Issue-Bug │ 25833 │
49. │ type.FunctionalityBug │ 25390 │
50. │ bugzilla/severity-urgent │ 24975 │
└────────────────────────────┴──────────┘
50 rows in set. Elapsed: 0.620 sec. Processed 544.36 million rows, 7.71 GB (877.75 million rows/s., 12.44 GB/s.)
Actually, the "bugzilla" labels are not only about bugs, and "debug" is not a bug label either, so the %bug% pattern overcounts somewhat.
WITH arrayJoin(labels) AS label
SELECT
sum(label ILIKE '%bug%') AS bugs,
sum(label ILIKE '%feature%') AS features,
bugs / features AS ratio
FROM github_events
WHERE (event_type IN ('IssuesEvent', 'PullRequestEvent', 'IssueCommentEvent')) AND (action IN ('created', 'opened', 'labeled')) AND ((label ILIKE '%bug%') OR (label ILIKE '%feature%'))
┌─────bugs─┬─features─┬──────────────ratio─┐
│ 17430607 │ 4828695 │ 3.6097966427782247 │
└──────────┴──────────┴────────────────────┘
1 rows in set. Elapsed: 0.586 sec. Processed 544.36 million rows, 7.71 GB (928.19 million rows/s., 13.15 GB/s.)
Every feature generates 3.61 bugs on average.
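Because labels like "bugzilla/severity-medium" and "debug" also match the %bug% pattern, the ratio is somewhat inflated. A hedged sketch of a slightly stricter count follows. Excluding exactly these two substrings is an assumption for illustration and does not make the classification precise.

```sql
-- Sketch: bug/feature ratio without labels containing "bugzilla" or "debug".
-- The exclusion list is illustrative and deliberately short.
WITH arrayJoin(labels) AS label
SELECT
    sum((label ILIKE '%bug%') AND NOT (label ILIKE '%bugzilla%') AND NOT (label ILIKE '%debug%')) AS bugs,
    sum(label ILIKE '%feature%') AS features,
    bugs / features AS ratio
FROM github_events
WHERE (event_type IN ('IssuesEvent', 'PullRequestEvent', 'IssueCommentEvent')) AND (action IN ('created', 'opened', 'labeled')) AND ((label ILIKE '%bug%') OR (label ILIKE '%feature%'))
```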
Which repositories have the longest names?
:) SELECT count(), repo_name FROM github_events WHERE event_type = 'WatchEvent' GROUP BY repo_name ORDER BY length(repo_name) DESC LIMIT 50
┌─count()─┬─repo_name────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 1 │ accounts-inheritance-finders-of-america/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee │
│ 15 │ fkdjfkdkfjskfjsldkfjvndkslakfjgkdlskfjg/fksfekgkslgkkdjfkdlskfjkfkdlslakdjfkekjlklfkjdkslkjfojksjdfaskdjfsalkdfj-laskfjls-kdajfoasjfpoawefjk │
│ 1 │ joooooooooooooooooooooooooooooooooooooj/jooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo │
│ 1 │ UDtUf4aRyjv7nU3YTg056xCJw1ghAJYXgad7oB5/hJJRB7rpa3IXwX8HRsA1B4jCDmlZBY9fAzXZWNPyhrsXYG5kCeC4RPFqKQ4I9sAu1aNzX2G6wAkBjm8BjPfKjdubEqkmeAIkIwgu │
│ 1 │ a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a/a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a- │
│ 1 │ JonathanJonathanJonathanJonathanJonatha/JonathanJonathanJonathanJonathanJonathanJonathanJonathanJonathanJonathanJonathanJonathanJonathanJona │
│ 3 │ uwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwu/uwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuuwuwuwuwuwuwuwuwuwuwuw │
│ 4 │ joooooooooooooooooooooooooooooooooooooj/jooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooj │
│ 1 │ dldadnasjndfsjafajnsdfadnfadfkoafasdklk/fjkdasifjasdiofasiodfasjdogasdgajsdpgasdpgasjkdgo-askdpgspkdgvjaspgpasjidpgjiajisdgpjioasdpigasgpjia │
│ 2 │ yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy/yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy │
│ 11 │ yingyingyingyingyingyingyingyingying/yingyingyingyingyingyingyingyingyingyingyingyingyingyingyingyingyingyingyingyingyingyingyingyingying │
│ 2 │ Gremling-Machine-Learning-Study-Group/A-Simple-Deep-Neural-Network-Approach-to-Estimate-the-Reproduction-Number-for-Pandemic-Influenza-R0 │
│ 1 │ reeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/reeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee │
│ 3 │ Artificial-Intelligence-Big-Data-Lab/A-Multi-Layer-and-Multi-Ensembled-Stock-Trader-Using-Deep-Learning-and-Deep-Reinforcement-Learning │
│ 1 │ Gremling-Machine-Learning-Study-Group/Extracting-the-Ideality-Factor-of-a-Diode-on-the-Single-Diode-Solar-Cell-Model-with-Deep-Learning │
│ 1 │ 1b3634f6-9166-46d7-a43a-51d575159cf0/Name-INDIVIDUAL-ENTREPRENEURSAPRYGIN-SERGEY-ALEXANDROVICH-Legal-address-152300-RUSSIA-YAROSLAVSKAY │
│ 89 │ uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu/uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu │
│ 14 │ bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb/bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb │
│ 8466 │ eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee │
│ 3 │ kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk/kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk │
│ 1 │ 1b3634f6-9166-46d7-a43a-51d575159cf0/Correspondent-account-30101810945250000297-TIN-KPP-of-the-bank-7706092528-770543003-BIC-0445252 │
│ 1 │ Smithju1986-org-Gibbstersgame-ca/windows10-Systems-Android-10-phone-latin-reset-my-Enterprise-lib-set-IT-NETWORKING-smithju1986.org │
│ 1 │ abortionpillsforsaleintembisa/0734408121-Abortion-Pills-For-Sale-In-Tembisa-Mamelodi-Ssoshanguve-Mabopane-Witbank-Rustenburg-Mafik │
│ 2 │ hmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm/hmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm │
│ 570 │ reeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee/reeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee │
│ 1 │ Artificial-Intelligence-for-NLP/Source-Code-of-Pragmatic-Artificial-Intelligence-Algorithms-Based-on-Natrual-Language-Processing │
│ 1 │ Gremling-Machine-Learning-Study-Group/Deep-Learning-na-analise-de-imagens-de-raio-x-para-auxilio-na-decis-o-clinica-de-COVID-19 │
│ 4 │ 1b3634f6-9166-46d7-a43a-51d575159cf0/t.1FqJlUEmmyCmGBMs1HHLszGztmmESHtJfcGQQN0V4Abbj-KUGfc4s8hX2ID89sfAlR9ogNwxscREeDrfkiqzZA │
│ 4 │ HCIT-Computing-Intelligence/Source-Code-of-Pragmatic-Artificial-Intelligence-Algorithms-Based-on-Natrual-Language-Processing │
│ 1 │ EduardoMartinezEnriquez/Optimized-Update-Prediction-Assignment-for-Lifting-Transforms-on-Graphs-III-Multiresolution-example │
│ 4 │ Data-Science-and-Data-Analytics-Courses/MITx---Machine-Learning-with-Python-From-Linear-Models-to-Deep-Learning-Jun-11-2019 │
│ 1 │ deepakrajpurushothaman/Solution-to-Trajectory-planning-problem-using-dynamic-optimization-method-and-meta-heuristic-solver- │
│ 1 │ ArthurEmanuelRodriguesCosta/PIBITI-2015-2016-VISUALIZA-O-DO-GRAFO-DE-CORRELA-O-ENTRE-DISCIPLINAS-NO-CURSO-DE-COMPUTA-O-UFCG │
│ 1 │ globalmigrateimmigration/The-Australia-working-holiday-visa-subclass-417-is-a-visa-in-a-perfect-world-appropriate-for-young │
│ 1 │ BurakKahramanHacettepe/Drug-Sensitivity-Prediction-for-Cancer-Cell-lines-with-Pairwise-Input-Graph-Convolutional-Neural-Net │
│ 4 │ DeligenceTechnologies/Home-Automation-to-monitor-heat-fire-smoke-CO-and-room-by-room-movement-of-people-using-Raspberry-Pi │
│ 1 │ Md-Samiul-Abid-Chowdhury/inputting-from-and-outputting-in-any-positive-integer-in-dos-console-using-assembly-language-8086 │
│ 1 │ todo-sobre-el-universo/todo-sobre-el-universo.github.io-Te-presentamos-todo-acerca-del-universo-El-Sistema-Solar-Agujeros │
│ 5 │ during-master-degree/M.E-Energy-Efficient-Power-and-Subcarrier-Allocation-for-OFDMA-Systems-with-Value-Function-Approxima │
│ 1 │ Money-Maker-Research/commmodity-tipsCommodities-are-leveraged-products-where-one-needs-to-pay-initial-margin-to-take-posi │
│ 6 │ nirmalsenthilnathan/Open-Set-Domain-Adaptation-for-Hyperspectral-Image-Classification-Using-Generative-Adversarial-Netwo │
│ 1 │ SmartPracticeschool/llSPS-INT-3437-Predicting-the-Energy-Output-of-Wind-Turbine-Based-on-Weather-Conditions-Watson-Auto- │
│ 1 │ InternetofThings2017/is-open-is-pr-author-Eskimo2016-comments-50-user-InternetofThings2017-sort-updated-desc-is-private- │
│ 1 │ thomasrussellmurphy/eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee │
│ 5 │ Esmaeili-Najafabadi/Designing-Sequence-With-Minimum-PSL-Using-Chebyshev-Distance-and-its-Application-for-Chaotic-MIMO-Ra │
│ 1 │ DevExpress-Examples/Reporting_aspxreportdesigner-how-to-create-an-aspnet-end-user-reporting-application-with-the-t227679 │
│ 3 │ SmartPracticeschool/SBSPS-Challenge-4528-Creating-The-Twitter-Sentiment-Analysis-Program-in-Python-with-Naive-Bayes-Clas │
│ 3 │ DevExpress-Examples/XAF_how-to-use-google-facebook-and-microsoft-accounts-in-aspnet-xaf-applications-oauth2-demo-t535280 │
│ 3 │ DevExpress-Examples/how-to-generate-a-sequential-number-for-a-persistent-object-within-a-database-transaction-xaf-e2829 │
│ 6 │ Computing-Intelligence/Source-Code-of-Pragmatic-Artificial-Intelligence-Algorithms-Based-on-Natrual-Language-Processing │
└─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
50 rows in set. Elapsed: 0.761 sec. Processed 232.13 million rows, 1.81 GB (305.07 million rows/s., 2.38 GB/s.)
The most popular of these insane names is the "132 e's" repository, with 8466 stars. Did you miss the story?
Which repositories have the shortest names?
:) SELECT repo_name, count() FROM github_events WHERE event_type = 'WatchEvent' AND repo_name LIKE '%_/_%' GROUP BY repo_name ORDER BY length(repo_name) ASC LIMIT 50
┌─repo_name─┬─count()─┐
│ 2/t │ 1 │
│ s/s │ 1 │
│ e/m │ 1 │
│ f/f │ 7 │
│ 69/a │ 3 │
│ nd/e │ 1 │
│ 19/x │ 2 │
│ Xe/h │ 7 │
│ f/pq │ 162 │
│ as/a │ 289 │
│ tj/n │ 13477 │
│ 9H/Z │ 1 │
│ 9H/F │ 2 │
│ Xe/x │ 32 │
│ f/r3 │ 6 │
│ 3x/n │ 2 │
│ mg/i │ 5 │
│ i/AI │ 1 │
│ 7f/h │ 2 │
│ f/hy │ 1 │
│ hf/q │ 6 │
│ cv/t │ 23 │
│ 520/b │ 8 │
│ kjk/u │ 12 │
│ 6/Dex │ 3 │
│ now/u │ 1 │
│ yg/yg │ 6 │
│ hx/js │ 1 │
│ ry/hl │ 92 │
│ a0/ep │ 3 │
│ Vh9/Z │ 1 │
│ amj/t │ 1 │
│ knu/p │ 1 │
│ 17h/h │ 1 │
│ MD4/I │ 2 │
│ Xe/Xe │ 6 │
│ yg/ng │ 1 │
│ oe/cv │ 2 │
│ f/fka │ 11 │
│ cv/sd │ 13 │
│ h5/h5 │ 2 │
│ sjl/t │ 733 │
│ 51c/6 │ 1 │
│ t/gas │ 1 │
│ hl/hl │ 1 │
│ g-9/7 │ 1 │
│ g7v/e │ 1 │
│ as/xo │ 1 │
│ ex/js │ 1 │
│ ab/ip │ 3 │
└───────────┴─────────┘
50 rows in set. Elapsed: 0.827 sec. Processed 232.13 million rows, 1.81 GB (280.76 million rows/s., 2.19 GB/s.)
I'm surprised that tj/n is a real thing. f/pq is also a real thing and also related to node.js. Maybe node.js developers are addicted to short names?
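The `LIKE '%_/_%'` filter in the query above keeps only names that have at least one character on each side of the slash: `%` matches any run of characters (including none) and `_` matches exactly one. Below is a minimal Python sketch of the same filter and ordering. The helper `like_to_regex` is illustrative only (it is not part of ClickHouse and handles no escape sequences), and the names `/x` and `y/` are made-up malformed examples.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple SQL LIKE pattern (only % and _ wildcards) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def shortest_names(names, limit):
    """Keep names matching '%_/_%' and return the shortest ones first."""
    matcher = like_to_regex("%_/_%")
    valid = [name for name in names if matcher.fullmatch(name)]
    return sorted(valid, key=len)[:limit]


# A few names from the output above plus two hypothetical malformed ones.
names = ["tj/n", "sjl/t", "f/pq", "/x", "y/", "ClickHouse/ClickHouse", "2/t"]
print(shortest_names(names, 3))
```

Python's sort is stable, so names of equal length keep their input order, whereas the SQL query leaves the order of ties unspecified.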
:) SELECT repo_name, count() FROM github_events WHERE body ILIKE '%ClickHouse%' GROUP BY repo_name ORDER BY count() DESC LIMIT 50
┌─repo_name──────────────────────────┬─count()─┐
│ ClickHouse/ClickHouse │ 12661 │
│ yandex/ClickHouse │ 7412 │
│ traceon/ClickHouse │ 2339 │
│ kokizzu/ClickHouse │ 1794 │
│ skirdey/ClickHouse │ 881 │
│ getsentry/snuba │ 764 │
│ Mu-L/ClickHouse │ 644 │
│ Vertamedia/clickhouse-grafana │ 538 │
│ Altinity/clickhouse-operator │ 532 │
│ mymarilyn/clickhouse-driver │ 475 │
│ yandex/clickhouse-jdbc │ 428 │
│ ClickHouse/clickhouse-odbc │ 365 │
│ housepower/ClickHouse-Native-JDBC │ 331 │
│ kshvakov/clickhouse │ 303 │
│ PostHog/posthog │ 299 │
│ DataDog/integrations-core │ 286 │
│ AlexAkulov/clickhouse-backup │ 284 │
│ Infinidat/infi.clickhouse_orm │ 261 │
│ yandex/clickhouse-odbc │ 260 │
│ xzkostyan/clickhouse-sqlalchemy │ 227 │
│ Mattlk13/ClickHouse │ 217 │
│ getsentry/onpremise │ 205 │
│ lomik/graphite-clickhouse │ 204 │
│ smi2/phpClickHouse │ 203 │
│ ibis-project/ibis │ 196 │
│ getsentry/sentry │ 190 │
│ killwort/ClickHouse-Net │ 188 │
│ NixOS/nixpkgs │ 182 │
│ housepower/clickhouse_sinker │ 181 │
│ microfleet/clickhouse-adapter │ 164 │
│ timberio/vector │ 162 │
│ ClickHouse/clickhouse-jdbc │ 161 │
│ adjust/clickhouse_fdw │ 149 │
│ ClickHouse/clickhouse-go │ 143 │
│ lomik/carbon-clickhouse │ 137 │
│ sentry-kubernetes/charts │ 123 │
│ apla/node-clickhouse │ 122 │
│ InterestingLab/waterdrop │ 122 │
│ apache/incubator-superset │ 122 │
│ suharev7/clickhouse-rs │ 112 │
│ sysown/proxysql │ 110 │
│ artpaul/clickhouse-cpp │ 109 │
│ dbeaver/dbeaver │ 105 │
│ IMSMWU/RClickhouse │ 104 │
│ childe/gohangout │ 103 │
│ go-graphite/carbonapi │ 100 │
│ enqueue/metabase-clickhouse-driver │ 100 │
│ DarkWanderer/ClickHouse.Client │ 100 │
│ Altinity/ClickHouse │ 98 │
│ BayoNet/ClickHouse │ 98 │
└────────────────────────────────────┴─────────┘
50 rows in set. Elapsed: 20.424 sec. Processed 3.12 billion rows, 509.31 GB (152.75 million rows/s., 24.94 GB/s.)
There are 1430 such repositories. Setting aside ClickHouse itself and its forks, the most popular is Sentry Snuba, followed by ClickHouse Grafana and the ClickHouse Kubernetes Operator, and then the Python, JDBC, ODBC, and Go drivers.
Repositories with ClickHouse-related comments, sorted by stars:
SELECT
repo_name,
sum(event_type = 'WatchEvent') AS num_stars,
sum(body ILIKE '%ClickHouse%') AS num_comments
FROM github_events
WHERE (body ILIKE '%ClickHouse%') OR (event_type = 'WatchEvent')
GROUP BY repo_name
HAVING num_comments > 0
ORDER BY num_stars DESC
LIMIT 50
┌─repo_name───────────────────┬─num_stars─┬─num_comments─┐
1. │ 996icu/996.ICU │ 354850 │ 1 │
2. │ golang/go │ 92407 │ 6 │
3. │ nodejs/node │ 75477 │ 1 │
4. │ kubernetes/kubernetes │ 68644 │ 1 │
5. │ avelino/awesome-go │ 64700 │ 2 │
6. │ rails/rails │ 53620 │ 2 │
7. │ rust-lang/rust │ 53027 │ 3 │
8. │ ansible/ansible │ 51144 │ 5 │
9. │ elastic/elasticsearch │ 48810 │ 1 │
10. │ grafana/grafana │ 39147 │ 77 │
11. │ Kickball/awesome-selfhosted │ 37934 │ 2 │
12. │ prometheus/prometheus │ 35949 │ 5 │
13. │ apache/spark │ 32616 │ 3 │
14. │ fffaraz/awesome-cpp │ 31297 │ 2 │
15. │ rethinkdb/rethinkdb │ 28628 │ 2 │
16. │ pingcap/tidb │ 28099 │ 4 │
17. │ netty/netty │ 27854 │ 2 │
18. │ getsentry/sentry │ 27255 │ 190 │
19. │ composer/composer │ 26787 │ 2 │
20. │ Homebrew/brew │ 26658 │ 2 │
21. │ alibaba/druid │ 24851 │ 26 │
22. │ sequelize/sequelize │ 24415 │ 4 │
23. │ metabase/metabase │ 23835 │ 91 │
24. │ moby/moby │ 22615 │ 2 │
25. │ pandas-dev/pandas │ 22473 │ 5 │
26. │ dmlc/xgboost │ 21377 │ 3 │
27. │ spf13/cobra │ 20509 │ 1 │
28. │ jinzhu/gorm │ 20443 │ 4 │
29. │ hasura/graphql-engine │ 20060 │ 2 │
30. │ facebook/rocksdb │ 19694 │ 1 │
31. │ netdata/netdata │ 19605 │ 9 │
32. │ zhangdaiscott/jeecg-boot │ 19119 │ 2 │
33. │ alibaba/easyexcel │ 18773 │ 2 │
34. │ palantir/blueprint │ 18294 │ 2 │
35. │ apache/incubator-superset │ 18265 │ 122 │
36. │ protocolbuffers/protobuf │ 18210 │ 1 │
37. │ StevenBlack/hosts │ 18004 │ 1 │
38. │ sebastianbergmann/phpunit │ 17932 │ 1 │
39. │ alibaba/canal │ 17810 │ 1 │
40. │ yiisoft/yii2 │ 17490 │ 5 │
41. │ getredash/redash │ 17471 │ 45 │
42. │ celery/celery │ 16593 │ 2 │
43. │ openssl/openssl │ 15814 │ 1 │
44. │ apache/flink │ 15690 │ 6 │
45. │ taosdata/TDengine │ 15322 │ 9 │
46. │ brettwooldridge/HikariCP │ 14983 │ 4 │
47. │ rancher/k3s │ 14664 │ 3 │
48. │ influxdata/influxdb │ 14374 │ 2 │
49. │ dbeaver/dbeaver │ 13757 │ 105 │
50. │ requests/requests │ 12997 │ 4 │
└─────────────────────────────┴───────────┴──────────────┘
50 rows in set. Elapsed: 17.669 sec. Processed 3.12 billion rows, 567.41 GB (176.57 million rows/s., 32.11 GB/s.)
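The query above relies on a comparison evaluating to 0 or 1, so `sum(condition)` counts the rows where the condition holds, and both counters are computed in a single pass over the table. Here is a rough Python sketch of the same logic on a tiny in-memory sample. The function name, the tuple layout, and the sample events are all hypothetical, and `lower()` only approximates `ILIKE`.

```python
from collections import defaultdict


def stars_and_mentions(events, keyword="clickhouse", limit=50):
    """events: iterable of (repo_name, event_type, body) tuples."""
    stats = defaultdict(lambda: [0, 0])  # repo -> [num_stars, num_comments]
    needle = keyword.lower()
    for repo, event_type, body in events:
        is_star = event_type == "WatchEvent"
        is_mention = needle in body.lower()
        if not (is_star or is_mention):      # WHERE ... OR ...
            continue
        stats[repo][0] += is_star            # sum(event_type = 'WatchEvent')
        stats[repo][1] += is_mention         # sum(body ILIKE '%ClickHouse%')
    # HAVING num_comments > 0
    rows = [(repo, s, c) for repo, (s, c) in stats.items() if c > 0]
    rows.sort(key=lambda row: row[1], reverse=True)  # ORDER BY num_stars DESC
    return rows[:limit]


# Made-up sample data, not taken from the dataset.
events = [
    ("a/popular", "WatchEvent", ""),
    ("a/popular", "WatchEvent", ""),
    ("a/popular", "IssueCommentEvent", "Have you tried ClickHouse?"),
    ("b/niche", "WatchEvent", ""),
    ("b/niche", "IssueCommentEvent", "clickhouse driver bug"),
    ("b/niche", "IssueCommentEvent", "CLICKHOUSE again"),
    ("c/unrelated", "WatchEvent", ""),
]
print(stars_and_mentions(events))
```

The repository `c/unrelated` has stars but no mentions, so the `HAVING` step drops it, just as the query drops the vast majority of starred repositories.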
Even the "996.ICU" repository has something about ClickHouse. I cannot find it on the GitHub website, but I can do it with our dataset!
:) SELECT * FROM github_events WHERE body ILIKE '%ClickHouse%' AND repo_name = '996icu/996.ICU' \G
Row 1:
──────
event_type: IssuesEvent
actor_login: garyelephant
repo_name: 996icu/996.ICU
created_at: 2019-03-28 05:52:55
updated_at: 2019-03-28 05:52:54
action: opened
comment_id: 0
body: > 去了解和学习一个真正的大数据项目吧,一技在手,走遍天下。
这里有一个基于Spark的项目,可以让我们不写spark代码,用最简单的配置,迅速跑起来流式streaming或离线的数据处理或分析的spark程序,大家可以玩一玩。它有丰富的数据输入,输出插件,比如kafka, elasticsearch, mongodb, mysql, hdfs, hive,clickhouse,TiDB 还可以直接用sql做数据处理。如果觉得功能不够还可以开发自己的插件,挺方便的。目前有微博,新浪,永辉超市等多家公司在线上使用。
* 项目地址:https://github.com/InterestingLab/waterdrop
* 文档地址: https://interestinglab.github.io/waterdrop/
* 附一篇用waterdrop流式处理kafka数据写入ES的介绍:
https://interestinglab.github.io/waterdrop/#/zh-cn/case_study/3
* 用Spark支持分布式,大规模的数据写入ES:
https://interestinglab.github.io/waterdrop/#/zh-cn/case_study/3
path:
position: 0
line: 0
ref:
ref_type: none
creator_user_login: garyelephant
number: 7029
title: 除了进ICU,大数据从业人员还能做什么?
labels: []
state: open
locked: 0
assignee:
assignees: []
comments: 0
author_association: NONE
closed_at: 1970-01-01 00:00:00
merged_at: 1970-01-01 00:00:00
merge_commit_sha:
requested_reviewers: []
requested_teams: []
head_ref:
head_sha:
base_ref:
base_sha:
merged: 0
mergeable: 0
rebaseable: 0
mergeable_state: unknown
merged_by:
review_comments: 0
maintainer_can_modify: 0
commits: 0
additions: 0
deletions: 0
changed_files: 0
diff_hunk:
original_position: 0
commit_id:
original_commit_id:
push_size: 0
push_distinct_size: 0
member_login:
release_tag_name:
release_name:
review_state: none
1 rows in set. Elapsed: 0.126 sec. Processed 753.66 thousand rows, 281.05 MB (5.96 million rows/s., 2.22 GB/s.)
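The issue, whose title asks what else big data practitioners can do besides ending up in the ICU, promotes Waterdrop, a Spark-based data processing project that lists ClickHouse among its input and output plugins.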
:) SELECT body, count() FROM github_events WHERE notEmpty(body) AND length(body) < 100 GROUP BY body ORDER BY count() DESC LIMIT 50
┌─body────────────────────────────────────────────────────────────────────────────────────────────────┬─count()─┐
│ Body of PR │ 705016 │
│ LGTM │ 513739 │
│ Thanks! │ 462334 │
│ done │ 449041 │
│ +1 │ 423541 │
│ Done │ 418402 │
│ :+1: │ 391435 │
│ First from flow in UK South │ 379375 │
│ Done. │ 320999 │
│ 👍 │ 317840 │
│ Automatically generated by Netlify CMS │ 267369 │
│ test │ 250036 │
│ fixed │ 206232 │
│ Superseded by #6. │ 199058 │
│ Fixed │ 183537 │
│ Superseded by #3. │ 177251 │
│ Superseded by #4. │ 168627 │
│ Superseded by #5. │ 166942 │
│ Superseded by #7. │ 164186 │
│ Can one of the admins verify this patch? │ 161123 │
│ ok │ 159759 │
│ /retest │ 151079 │
│ Superseded by #2. │ 150775 │
│ Fixed. │ 149072 │
│ /lgtm │ 148622 │
│ Superseded by #8. │ 139940 │
│
│ 134119 │
│ first description │ 131816 │
│ :+1: │ 124157 │
│ I'm having a problem with this. │ 113431 │
│ testing-sauron-webhooks │ 112181 │
│ Thank you! │ 111932 │
│ corp-pass-fork: test succeeded │ 106839 │
│ Merged build finished. Test PASSed. │ 106484 │
│ unsigned-fail-fork: test succeeded │ 106290 │
│ @dependabot rebase │ 94526 │
│ New description for issue │ 94466 │
│ retest this please │ 90816 │
│ Superseded by #9. │ 90301 │
│ second description │ 86353 │
│ lgtm │ 85635 │
│ Thanks │ 85219 │
│ Superseded by #10. │ 74560 │
│ Done! │ 73060 │
│ @dependabot merge │ 71230 │
│ bors r+ │ 66152 │
│ @z-kasparov new game │ 65620 │
│ s │ 62689 │
│ Prefer double-quoted strings unless you need single quotes to avoid extra backslashes for escaping. │ 61822 │
│ Looks like lodash is up-to-date now, so this is no longer needed. │ 60106 │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┘
50 rows in set. Elapsed: 17.794 sec. Processed 3.12 billion rows, 508.79 GB (175.33 million rows/s., 28.59 GB/s.)
(among comments shorter than 100 bytes)
Most of the comments are very positive, except the passive-aggressive "Can one of the admins verify this patch?" from Jenkins.
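The report above is a plain "group by and count" over non-empty short bodies. A minimal Python sketch of the same idea follows. It measures length in UTF-8 bytes because ClickHouse's `length` on a String counts bytes rather than characters. The function name and sample bodies are made up for illustration.

```python
from collections import Counter


def top_short_comments(bodies, max_bytes=100, limit=50):
    """Count identical non-empty bodies shorter than max_bytes (UTF-8 bytes)."""
    counts = Counter(
        body for body in bodies
        if body and len(body.encode("utf-8")) < max_bytes
    )
    return counts.most_common(limit)


# Hypothetical sample, not taken from the dataset.
bodies = ["LGTM"] * 3 + ["Thanks!"] * 2 + ["👍", "", "x" * 200]
print(top_short_comments(bodies, limit=2))
```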
If we roll the dice and SELECT 50 random repositories with probability proportional to the number of stars, what will we get?
:) SELECT repo_name FROM github_events WHERE event_type = 'WatchEvent' ORDER BY rand() LIMIT 50
┌─repo_name───────────────────────────────────────┐
│ sonata-project/cache │
│ puikinsh/gentelella │
│ orbingol/NURBS-Python │
│ Flipboard/FLAnimatedImage │
│ v8/v8 │
│ littlefoot32/symmetrical-octo-tribble │
│ londonappbrewery/mi_card_flutter │
│ mtnviewmark/barley │
│ brianpritt/TravelTreats-blog │
│ webtorrent/webtorrent │
│ kenberkeley/react-demo │
│ SpiderOak/SpiderOakMobileClient │
│ hcy1996/vue-toutiao │
│ casatwy/CTJSBridge │
│ ustroetz/log-road │
│ gelvezz23/kahaat │
│ mkovacek/Web-app-for-veterinary-clinics │
│ apple/swift-log │
│ oldratlee/translations │
│ laserwave/topic_models │
│ Live-Charts/Live-Charts │
│ soimort/you-get │
│ appsecco/bugcrowd-levelup-subdomain-enumeration │
│ pirate/ArchiveBox │
│ segmentio/nightmare │
│ jhen0409/react-native-debugger │
│ dengyuhan/magnetW │
│ nfuruya/tensorflow │
│ prometheus/promdash │
│ aerokube/selenoid │
│ maekawatoshiki/ferrugo │
│ saitejach127/PayAll │
│ Genymobile/scrcpy │
│ diyhue/diyHue │
│ danielmiessler/SecLists │
│ okfn/recline │
│ expectocode/telegram-export │
│ codingforentrepreneurs/Reactify-Django │
│ ZzXxL1994/Machine-Learning-Papers │
│ Tomasz-Mankowski/ADS7846-X11-daemon │
│ sgr-ksmt/PullToDismiss │
│ lcatro/Hacker_Document │
│ Suwings/MCSManager │
│ agermanidis/autosub │
│ wbkd/awesome-d3 │
│ Adien-galen/SelenJA │
│ thednp/bootstrap.native │
│ carlhuda/janus │
│ donwa/oneindex │
│ CaicoLeung/CaicoLeung.github.io │
└─────────────────────────────────────────────────┘
50 rows in set. Elapsed: 0.216 sec. Processed 232.13 million rows, 1.81 GB (1.08 billion rows/s., 8.38 GB/s.)
This is the true "face" of GitHub. Run a few queries and you can estimate and taste the Octoverse this way.
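Why does `ORDER BY rand() LIMIT 50` over star events weight repositories by popularity? Every star is a separate row, so picking rows uniformly picks each repository with probability proportional to its number of stars. A small Python sketch of this effect, using two invented repositories (`big/repo` and `small/repo` are hypothetical):

```python
import random
from collections import Counter


def sample_repos_by_stars(star_events, k, seed=None):
    """star_events: list of repo names, one entry per WatchEvent row.

    Picks k rows uniformly without replacement, like ORDER BY rand() LIMIT k.
    """
    rng = random.Random(seed)
    return rng.sample(star_events, k)


# 9000 stars for one repository and 1000 for another (made-up numbers).
star_events = ["big/repo"] * 9000 + ["small/repo"] * 1000
picks = sample_repos_by_stars(star_events, 1000, seed=1)
counts = Counter(picks)
print(counts)
```

In this sketch roughly nine out of ten picks should land on `big/repo`, matching its share of the star events.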
Follow the instructions below.
You can download the dataset as a file and then insert it into ClickHouse or any other DBMS.
Here is the dataset in ClickHouse Native format (73 GB).
Alternatively, you can download the same dataset in tab-separated format (85 GB).
To decompress the dataset from .xz it's recommended to install and use the pixz (parallel xz) tool. The ordinary xz tool can be used as well.
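If neither tool is at hand, a standard `.xz` file can also be decompressed in a streaming fashion with Python's built-in `lzma` module. This is only a sketch: it is single-threaded, so expect it to be slower than pixz, and the helper name and the tiny round-trip demo file are made up for illustration.

```python
import lzma
import os
import shutil
import tempfile


def decompress_xz(src_path, dst_path, chunk_size=1 << 20):
    """Stream-decompress an .xz file without loading it into memory."""
    with lzma.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


# Round-trip demo on a small temporary file (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.native.xz")
    dst = os.path.join(tmp, "sample.native")
    payload = b"github_events" * 1000
    with lzma.open(src, "wb") as f:
        f.write(payload)
    decompress_xz(src, dst)
    with open(dst, "rb") as f:
        restored = f.read()

print(restored == payload)
```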
To insert the downloaded dataset into ClickHouse, create a table:
$ clickhouse-client
ClickHouse client version 20.11.3.3 (official build).
Connecting to localhost:9000 as user default.
Connected to ClickHouse server version 20.11.3 revision 54442.
:) CREATE TABLE github_events
(
    file_time DateTime,
    event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22),
    actor_login LowCardinality(String),
    repo_name LowCardinality(String),
    created_at DateTime,
    updated_at DateTime,
    action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20),
    comment_id UInt64,
    body String,
    path String,
    position Int32,
    line Int32,
    ref LowCardinality(String),
    ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3, 'unknown' = 4),
    creator_user_login LowCardinality(String),
    number UInt32,
    title String,
    labels Array(LowCardinality(String)),
    state Enum('none' = 0, 'open' = 1, 'closed' = 2),
    locked UInt8,
    assignee LowCardinality(String),
    assignees Array(LowCardinality(String)),
    comments UInt32,
    author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5),
    closed_at DateTime,
    merged_at DateTime,
    merge_commit_sha String,
    requested_reviewers Array(LowCardinality(String)),
    requested_teams Array(LowCardinality(String)),
    head_ref LowCardinality(String),
    head_sha String,
    base_ref LowCardinality(String),
    base_sha String,
    merged UInt8,
    mergeable UInt8,
    rebaseable UInt8,
    mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4, 'blocked' = 5),
    merged_by LowCardinality(String),
    review_comments UInt32,
    maintainer_can_modify UInt8,
    commits UInt32,
    additions UInt32,
    deletions UInt32,
    changed_files UInt32,
    diff_hunk String,
    original_position UInt32,
    commit_id String,
    original_commit_id String,
    push_size UInt32,
    push_distinct_size UInt32,
    member_login LowCardinality(String),
    release_tag_name String,
    release_name String,
    review_state Enum('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5),
    PROJECTION max_file_time
    (
        SELECT max(file_time)
    )
)
ENGINE = MergeTree
ORDER BY (event_type, repo_name, created_at)
And insert it with the command:
pixz -d < github_events.native.xz | clickhouse-client --query "INSERT INTO github_events FORMAT Native"
You can transform the datasets between various formats with the clickhouse-local tool. Example to convert data from Native to JSONEachRow (aka jsonlines aka ndjson):
xz -d < github_events.native.xz | clickhouse-local \
    --input-format Native \
    --output-format JSONEachRow \
    --query "SELECT * FROM table" \
    --structure "
        event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20),
        actor_login LowCardinality(String),
        repo_name LowCardinality(String),
        created_at DateTime,
        updated_at DateTime,
        action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19),
        comment_id UInt64,
        body String,
        path String,
        position Int32,
        line UInt32,
        ref LowCardinality(String),
        ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3),
        creator_user_login LowCardinality(String),
        number UInt32,
        title String,
        labels Array(LowCardinality(String)),
        state Enum('none' = 0, 'open' = 1, 'closed' = 2),
        locked UInt8,
        assignee LowCardinality(String),
        assignees Array(LowCardinality(String)),
        comments UInt32,
        author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5),
        closed_at DateTime,
        merged_at DateTime,
        merge_commit_sha String,
        requested_reviewers Array(LowCardinality(String)),
        requested_teams Array(LowCardinality(String)),
        head_ref LowCardinality(String),
        head_sha String,
        base_ref LowCardinality(String),
        base_sha String,
        merged UInt8,
        mergeable UInt8,
        rebaseable UInt8,
        mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4),
        merged_by LowCardinality(String),
        review_comments UInt32,
        maintainer_can_modify UInt8,
        commits UInt32,
        additions UInt32,
        deletions UInt32,
        changed_files UInt32,
        diff_hunk String,
        original_position UInt32,
        commit_id String,
        original_commit_id String,
        push_size UInt32,
        push_distinct_size UInt32,
        member_login LowCardinality(String),
        release_tag_name String,
        release_name String,
        review_state Enum('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5)"
Alternatively, you can import the dataset into ClickHouse directly, without a separate download and decompression step.
This requires ClickHouse version 20.12 or later, which supports xz decompression.
Create a foreign table that will read data from a URL:
CREATE TABLE github_events_url ENGINE = URL('https://clickhouse-public-datasets.s3.amazonaws.com/github_events_v2.native.xz', Native);
Create the destination table and insert data:
CREATE TABLE github_events ENGINE = MergeTree ORDER BY (event_type, repo_name, created_at) AS SELECT * FROM github_events_url;
This section explains how to recreate the dataset from the raw data (1.2 TB json.gz). You don't have to follow these steps because we have already prepared a preprocessed structured dataset for download (just 65 GB).
There is a wonderful project called GH Archive. It provides dumps of events on GitHub to download. Data is split into files per hour like https://data.gharchive.org/2015-01-01-15.json.gz.
You can download all the files with the command:
wget --continue https://data.gharchive.org/{2011..2020}-{01..12}-{01..31}-{0..23}.json.gz
If you see the "Argument list too long" message, split the download into multiple commands, one per year.
If the download fails, just run the command again (--continue will skip already downloaded files).
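One way to split the download by year is to generate the list of URLs for each year separately. The brace expansion above also produces impossible dates such as February 31, so the sketch below walks the calendar instead and emits only valid hours. It assumes the file naming shown above (hours are not zero-padded); the function name is made up for illustration.

```python
from datetime import datetime, timedelta

BASE_URL = "https://data.gharchive.org"


def hourly_urls(year):
    """Yield one GH Archive URL per hour of the given year (valid dates only)."""
    moment = datetime(year, 1, 1)
    while moment.year == year:
        # Hours are not zero-padded in the file names, e.g. 2015-01-01-15.json.gz
        yield f"{BASE_URL}/{moment:%Y-%m-%d}-{moment.hour}.json.gz"
        moment += timedelta(hours=1)


urls_2020 = list(hourly_urls(2020))
print(len(urls_2020))   # 366 days * 24 hours for a leap year
print(urls_2020[15])
```

Each year's list can be written to a text file, one URL per line, and passed to `wget --continue --input-file=<file>`, which avoids the long argument list entirely.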
The total number of files is 84 264 and the total size is about 1.2 TB.
Typical download speed is ~16 MB/sec. The download will take about a day.
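The "about a day" estimate follows directly from the two numbers above. As a quick sanity check, treating 1.2 TB as 1.2e12 bytes and 16 MB/sec as 16e6 bytes per second (both are rounded figures):

```python
TOTAL_BYTES = 1.2e12          # about 1.2 TB, as stated above
SPEED_BYTES_PER_SEC = 16e6    # about 16 MB/sec


def download_hours(total_bytes, bytes_per_sec):
    """Return the time needed to transfer total_bytes, in hours."""
    return total_bytes / bytes_per_sec / 3600


hours = download_hours(TOTAL_BYTES, SPEED_BYTES_PER_SEC)
print(f"{hours:.1f} hours")
```

That is a bit under 21 hours at a steady rate, so slower periods or retries can easily stretch the download beyond a day.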
There is no point in parallelizing the download within one server, because rate limits will apply and the effective download speed will remain the same. You can parallelize by using different servers, but it's better not to bother: it is just one terabyte, and you only need to wait from one day to several days at most. Prepare a server with at least 1.5 terabytes of space, run the download, leave for the weekend, and that's it.
The raw dataset is a bunch of gzipped files containing data in JSON Lines format (JSON object on every line). The structure of the JSON object is the same as what the GitHub API provides. Every JSON object contains one event and all the related info. For example, let's look at the PullRequestEvent:
gzip -cd < 2020-11-13-9.json.gz | grep -F 'PullRequestEvent' | head -n1 | jq '.' { "id": "14180283452", "type": "PullRequestEvent", "actor": { "id": 10810283, "login": "direwolf-github", "display_login": "direwolf-github", "gravatar_id": "", "url": "https://api.github.com/users/direwolf-github", "avatar_url": "https://avatars.githubusercontent.com/u/10810283?" }, "repo": { "id": 312523815, "name": "direwolf-github/ephemeral-ci-ff8a7037", "url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037" }, "payload": { "action": "opened", "number": 1, "pull_request": { "url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/pulls/1", "id": 520447878, "node_id": "MDExOlB1bGxSZXF1ZXN0NTIwNDQ3ODc4", "html_url": "https://github.com/direwolf-github/ephemeral-ci-ff8a7037/pull/1", "diff_url": "https://github.com/direwolf-github/ephemeral-ci-ff8a7037/pull/1.diff", "patch_url": "https://github.com/direwolf-github/ephemeral-ci-ff8a7037/pull/1.patch", "issue_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/issues/1", "number": 1, "state": "open", "locked": false, "title": "Direwolf review apps test branch-b928a7ac", "user": { "login": "direwolf-github", "id": 10810283, "node_id": "MDQ6VXNlcjEwODEwMjgz", "avatar_url": "https://avatars0.githubusercontent.com/u/10810283?v=4", "gravatar_id": "", "url": "https://api.github.com/users/direwolf-github", "html_url": "https://github.com/direwolf-github", "followers_url": "https://api.github.com/users/direwolf-github/followers", "following_url": "https://api.github.com/users/direwolf-github/following{/other_user}", "gists_url": "https://api.github.com/users/direwolf-github/gists{/gist_id}", "starred_url": "https://api.github.com/users/direwolf-github/starred{/owner}{/repo}", "subscriptions_url": "https://api.github.com/users/direwolf-github/subscriptions", "organizations_url": "https://api.github.com/users/direwolf-github/orgs", "repos_url": 
"https://api.github.com/users/direwolf-github/repos", "events_url": "https://api.github.com/users/direwolf-github/events{/privacy}", "received_events_url": "https://api.github.com/users/direwolf-github/received_events", "type": "User", "site_admin": false }, "body": "Direwolf review apps test branch-b928a7ac", "created_at": "2020-11-13T08:59:59Z", "updated_at": "2020-11-13T08:59:59Z", "closed_at": null, "merged_at": null, "merge_commit_sha": null, "assignee": null, "assignees": [], "requested_reviewers": [], "requested_teams": [], "labels": [], "milestone": null, "draft": false, "commits_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/pulls/1/commits", "review_comments_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/pulls/1/comments", "review_comment_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/pulls/comments{/number}", "comments_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/issues/1/comments", "statuses_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/statuses/d48d634ff6ea6973baa91724836e797232cd4424", "head": { "label": "direwolf-github:branch-b928a7ac", "ref": "branch-b928a7ac", "sha": "d48d634ff6ea6973baa91724836e797232cd4424", "user": { "login": "direwolf-github", "id": 10810283, "node_id": "MDQ6VXNlcjEwODEwMjgz", "avatar_url": "https://avatars0.githubusercontent.com/u/10810283?v=4", "gravatar_id": "", "url": "https://api.github.com/users/direwolf-github", "html_url": "https://github.com/direwolf-github", "followers_url": "https://api.github.com/users/direwolf-github/followers", "following_url": "https://api.github.com/users/direwolf-github/following{/other_user}", "gists_url": "https://api.github.com/users/direwolf-github/gists{/gist_id}", "starred_url": "https://api.github.com/users/direwolf-github/starred{/owner}{/repo}", "subscriptions_url": "https://api.github.com/users/direwolf-github/subscriptions", "organizations_url": 
"https://api.github.com/users/direwolf-github/orgs", "repos_url": "https://api.github.com/users/direwolf-github/repos", "events_url": "https://api.github.com/users/direwolf-github/events{/privacy}", "received_events_url": "https://api.github.com/users/direwolf-github/received_events", "type": "User", "site_admin": false }, "repo": { "id": 312523815, "node_id": "MDEwOlJlcG9zaXRvcnkzMTI1MjM4MTU=", "name": "ephemeral-ci-ff8a7037", "full_name": "direwolf-github/ephemeral-ci-ff8a7037", "private": false, "owner": { "login": "direwolf-github", "id": 10810283, "node_id": "MDQ6VXNlcjEwODEwMjgz", "avatar_url": "https://avatars0.githubusercontent.com/u/10810283?v=4", "gravatar_id": "", "url": "https://api.github.com/users/direwolf-github", "html_url": "https://github.com/direwolf-github", "followers_url": "https://api.github.com/users/direwolf-github/followers", "following_url": "https://api.github.com/users/direwolf-github/following{/other_user}", "gists_url": "https://api.github.com/users/direwolf-github/gists{/gist_id}", "starred_url": "https://api.github.com/users/direwolf-github/starred{/owner}{/repo}", "subscriptions_url": "https://api.github.com/users/direwolf-github/subscriptions", "organizations_url": "https://api.github.com/users/direwolf-github/orgs", "repos_url": "https://api.github.com/users/direwolf-github/repos", "events_url": "https://api.github.com/users/direwolf-github/events{/privacy}", "received_events_url": "https://api.github.com/users/direwolf-github/received_events", "type": "User", "site_admin": false }, "html_url": "https://github.com/direwolf-github/ephemeral-ci-ff8a7037", "description": null, "fork": false, "url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037", "forks_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/forks", "keys_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/keys{/key_id}", "collaborators_url": 
"https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/collaborators{/collaborator}", "teams_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/teams", "hooks_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/hooks", "issue_events_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/issues/events{/number}", "events_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/events", "assignees_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/assignees{/user}", "branches_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/branches{/branch}", "tags_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/tags", "blobs_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/git/blobs{/sha}", "git_tags_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/git/tags{/sha}", "git_refs_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/git/refs{/sha}", "trees_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/git/trees{/sha}", "statuses_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/statuses/{sha}", "languages_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/languages", "stargazers_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/stargazers", "contributors_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/contributors", "subscribers_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/subscribers", "subscription_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/subscription", "commits_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/commits{/sha}", "git_commits_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/git/commits{/sha}", 
"comments_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/comments{/number}", "issue_comment_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/issues/comments{/number}", "contents_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/contents/{+path}", "compare_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/compare/{base}...{head}", "merges_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/merges", "archive_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/{archive_format}{/ref}", "downloads_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/downloads", "issues_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/issues{/number}", "pulls_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/pulls{/number}", "milestones_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/milestones{/number}", "notifications_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/notifications{?since,all,participating}", "labels_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/labels{/name}", "releases_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/releases{/id}", "deployments_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/deployments", "created_at": "2020-11-13T08:59:50Z", "updated_at": "2020-11-13T08:59:55Z", "pushed_at": "2020-11-13T08:59:59Z", "git_url": "git://github.com/direwolf-github/ephemeral-ci-ff8a7037.git", "ssh_url": "git@github.com:direwolf-github/ephemeral-ci-ff8a7037.git", "clone_url": "https://github.com/direwolf-github/ephemeral-ci-ff8a7037.git", "svn_url": "https://github.com/direwolf-github/ephemeral-ci-ff8a7037", "homepage": null, "size": 0, "stargazers_count": 0, "watchers_count": 0, "language": null, "has_issues": true, "has_projects": 
true, "has_downloads": true, "has_wiki": true, "has_pages": false, "forks_count": 0, "mirror_url": null, "archived": false, "disabled": false, "open_issues_count": 1, "license": null, "forks": 0, "open_issues": 1, "watchers": 0, "default_branch": "master" } }, "base": { "label": "direwolf-github:master", "ref": "master", "sha": "e445f055327daff1e9f9909dbfd062c461646601", "user": { "login": "direwolf-github", "id": 10810283, "node_id": "MDQ6VXNlcjEwODEwMjgz", "avatar_url": "https://avatars0.githubusercontent.com/u/10810283?v=4", "gravatar_id": "", "url": "https://api.github.com/users/direwolf-github", "html_url": "https://github.com/direwolf-github", "followers_url": "https://api.github.com/users/direwolf-github/followers", "following_url": "https://api.github.com/users/direwolf-github/following{/other_user}", "gists_url": "https://api.github.com/users/direwolf-github/gists{/gist_id}", "starred_url": "https://api.github.com/users/direwolf-github/starred{/owner}{/repo}", "subscriptions_url": "https://api.github.com/users/direwolf-github/subscriptions", "organizations_url": "https://api.github.com/users/direwolf-github/orgs", "repos_url": "https://api.github.com/users/direwolf-github/repos", "events_url": "https://api.github.com/users/direwolf-github/events{/privacy}", "received_events_url": "https://api.github.com/users/direwolf-github/received_events", "type": "User", "site_admin": false }, "repo": { "id": 312523815, "node_id": "MDEwOlJlcG9zaXRvcnkzMTI1MjM4MTU=", "name": "ephemeral-ci-ff8a7037", "full_name": "direwolf-github/ephemeral-ci-ff8a7037", "private": false, "owner": { "login": "direwolf-github", "id": 10810283, "node_id": "MDQ6VXNlcjEwODEwMjgz", "avatar_url": "https://avatars0.githubusercontent.com/u/10810283?v=4", "gravatar_id": "", "url": "https://api.github.com/users/direwolf-github", "html_url": "https://github.com/direwolf-github", "followers_url": "https://api.github.com/users/direwolf-github/followers", "following_url": 
"https://api.github.com/users/direwolf-github/following{/other_user}", "gists_url": "https://api.github.com/users/direwolf-github/gists{/gist_id}", "starred_url": "https://api.github.com/users/direwolf-github/starred{/owner}{/repo}", "subscriptions_url": "https://api.github.com/users/direwolf-github/subscriptions", "organizations_url": "https://api.github.com/users/direwolf-github/orgs", "repos_url": "https://api.github.com/users/direwolf-github/repos", "events_url": "https://api.github.com/users/direwolf-github/events{/privacy}", "received_events_url": "https://api.github.com/users/direwolf-github/received_events", "type": "User", "site_admin": false }, "html_url": "https://github.com/direwolf-github/ephemeral-ci-ff8a7037", "description": null, "fork": false, "url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037", "forks_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/forks", "keys_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/keys{/key_id}", "collaborators_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/collaborators{/collaborator}", "teams_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/teams", "hooks_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/hooks", "issue_events_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/issues/events{/number}", "events_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/events", "assignees_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/assignees{/user}", "branches_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/branches{/branch}", "tags_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/tags", "blobs_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/git/blobs{/sha}", "git_tags_url": 
"https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/git/tags{/sha}", "git_refs_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/git/refs{/sha}", "trees_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/git/trees{/sha}", "statuses_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/statuses/{sha}", "languages_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/languages", "stargazers_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/stargazers", "contributors_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/contributors", "subscribers_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/subscribers", "subscription_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/subscription", "commits_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/commits{/sha}", "git_commits_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/git/commits{/sha}", "comments_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/comments{/number}", "issue_comment_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/issues/comments{/number}", "contents_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/contents/{+path}", "compare_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/compare/{base}...{head}", "merges_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/merges", "archive_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/{archive_format}{/ref}", "downloads_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/downloads", "issues_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/issues{/number}", "pulls_url": 
"https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/pulls{/number}", "milestones_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/milestones{/number}", "notifications_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/notifications{?since,all,participating}", "labels_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/labels{/name}", "releases_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/releases{/id}", "deployments_url": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/deployments", "created_at": "2020-11-13T08:59:50Z", "updated_at": "2020-11-13T08:59:55Z", "pushed_at": "2020-11-13T08:59:59Z", "git_url": "git://github.com/direwolf-github/ephemeral-ci-ff8a7037.git", "ssh_url": "git@github.com:direwolf-github/ephemeral-ci-ff8a7037.git", "clone_url": "https://github.com/direwolf-github/ephemeral-ci-ff8a7037.git", "svn_url": "https://github.com/direwolf-github/ephemeral-ci-ff8a7037", "homepage": null, "size": 0, "stargazers_count": 0, "watchers_count": 0, "language": null, "has_issues": true, "has_projects": true, "has_downloads": true, "has_wiki": true, "has_pages": false, "forks_count": 0, "mirror_url": null, "archived": false, "disabled": false, "open_issues_count": 1, "license": null, "forks": 0, "open_issues": 1, "watchers": 0, "default_branch": "master" } }, "_links": { "self": { "href": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/pulls/1" }, "html": { "href": "https://github.com/direwolf-github/ephemeral-ci-ff8a7037/pull/1" }, "issue": { "href": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/issues/1" }, "comments": { "href": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/issues/1/comments" }, "review_comments": { "href": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/pulls/1/comments" }, "review_comment": { "href": 
"https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/pulls/comments{/number}" }, "commits": { "href": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/pulls/1/commits" }, "statuses": { "href": "https://api.github.com/repos/direwolf-github/ephemeral-ci-ff8a7037/statuses/d48d634ff6ea6973baa91724836e797232cd4424" } }, "author_association": "OWNER", "active_lock_reason": null, "merged": false, "mergeable": null, "rebaseable": null, "mergeable_state": "unknown", "merged_by": null, "comments": 0, "review_comments": 0, "maintainer_can_modify": false, "commits": 1, "additions": 1, "deletions": 0, "changed_files": 1 } }, "public": true, "created_at": "2020-11-13T09:00:00Z" }
This is quite a rich JSON document. It contains data about the pull request itself and all the associated objects: the actor who modified the pull request, the author of the pull request, the repository where the pull request lives, and so on.
There are multiple ways to analyze a bunch of .json.gz files without any preprocessing. For example, you can use the jq tool, though it will be very inefficient.
A more efficient way to analyze data without preprocessing is by using the clickhouse-local tool. It is like a replacement for awk, sed, grep with the ClickHouse SQL engine:
clickhouse-local --query " SELECT count() FROM file('*.json.gz', LineAsString, 'data String') WHERE JSONExtractString(data, 'actor', 'login') = 'alexey-milovidov'"
Here we parse input data as a single String field per line and then process it with JSON functions.
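For comparison, the same line-at-a-time idea can be sketched in Python with only the standard library. This is an illustration of the approach, not a description of how clickhouse-local works internally, and the function name count_events_by_actor is hypothetical.

```python
import glob
import gzip
import json


def count_events_by_actor(pattern: str, login: str) -> int:
    """Count events whose actor.login equals `login` in gzipped JSON-lines files."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        # Read each line as a plain string first, then parse it as JSON.
        with gzip.open(path, "rt", encoding="utf-8") as lines:
            for line in lines:
                if not line.strip():
                    continue
                event = json.loads(line)
                actor = event.get("actor")
                # Some records store the actor as a plain string, so guard the lookup.
                if isinstance(actor, dict) and actor.get("login") == login:
                    total += 1
    return total
```

Calling count_events_by_actor('*.json.gz', 'alexey-milovidov') mirrors the query above, but it runs in a single thread and will be far slower on the full archive.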
clickhouse-local will do all the heavy lifting:
— select files by wildcard pattern;
— decompress gzip data on the fly (with zlib-ng);
— parallelize data processing by files and by chunks within each file;
— process JSON with the help of simdjson, a very fast SIMD-accelerated JSON library.
Data processing will be limited by disk speed. If we assume 1 GB/sec sequential read speed, it will take about 20 minutes per query. You may think this is okay, but it isn't. It is just the best possible speed of processing raw .json.gz data, but we can do better if we preprocess our data.
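The estimate is simply the amount of data divided by the read speed. A minimal sketch of that arithmetic follows; the 1200 GB figure is not stated in the text and is only the size implied by "20 minutes at 1 GB/sec".

```python
def scan_minutes(data_size_gb: float, read_speed_gb_per_sec: float) -> float:
    """Lower bound on full-scan time when sequential disk reads are the bottleneck."""
    return data_size_gb / read_speed_gb_per_sec / 60


# Assumption: 1200 GB is just the size implied by the quoted numbers.
implied_size_gb = 20 * 60 * 1.0
print(scan_minutes(implied_size_gb, 1.0))
```

Every query over the raw files pays this cost again, which is why preprocessing the data once is worthwhile.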
If we want to analyze our data faster, we have to transform it from semistructured JSON format to a set of well-structured relational tables. The principle: structured is always more efficient than semistructured.
First, let's find out what data we have by reading the GitHub API docs.
There are a bunch of event types:
— CommitCommentEvent;
— CreateEvent;
— DeleteEvent;
— ForkEvent;
— GollumEvent;
— IssueCommentEvent;
— IssuesEvent;
— MemberEvent;
— PublicEvent;
— PullRequestEvent;
— PullRequestReviewCommentEvent;
— PushEvent;
— ReleaseEvent;
— SponsorshipEvent;
— WatchEvent.
The list is incomplete: while analyzing the data, I found a few "secret" or obsolete event types.
Every event type has its own properties.
What is the best way to structure all this data?
We can consider two options:
1. Every event type in its own table.
For example, WatchEvent records the assignment of stars. We can create a table for these events. When we need to calculate the number of stars per repository, the graph of stars, or the top starred repositories per year, we will just SELECT from this table. But if we need to compare the ratio of stars to the number of pull request contributors, we will have to JOIN several tables.
2. All event types in a single table.
The table works like a "discriminated union". It will contain a field named event_type to discriminate various events. It will contain all columns that are common for different events like "event date and time", "actor name", "repository name". It will also contain the union of all different properties of different event types. We will have a wide table with many columns, but some columns will be very sparse (filled with values only for certain rare event types).
I strongly recommend the second option: all events in one table.
The table will contain many columns... and that's exactly what column-oriented databases are for! Sparse columns will compress well, so you don't have to worry that they are sparse. As an added benefit, you can avoid doing queries with JOINs when you need to analyze multiple event types at once.
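The "discriminated union" layout can be sketched in a few lines of Python. The column subset, the DEFAULTS mapping, and the helper names below are illustrative assumptions, not the article's actual loader; the point is only that every event, whatever its type, becomes a row with the same set of columns.

```python
from collections import Counter

# A small illustrative subset of columns; type-specific ones default to "empty" values.
DEFAULTS = {
    "event_type": "",
    "actor_login": "",
    "repo_name": "",
    "created_at": "",
    "action": "none",
    "number": 0,      # filled for pull requests only in this sketch
    "push_size": 0,   # filled for PushEvent only
}


def to_row(event: dict) -> dict:
    """Flatten one event into the shared wide-row layout."""
    payload = event.get("payload", {})
    row = dict(DEFAULTS)
    row["event_type"] = event["type"]
    row["actor_login"] = event.get("actor", {}).get("login", "")
    row["repo_name"] = event.get("repo", {}).get("name", "")
    row["created_at"] = event.get("created_at", "")
    row["action"] = payload.get("action", "none")
    if event["type"] == "PullRequestEvent":
        row["number"] = payload.get("number", 0)
    elif event["type"] == "PushEvent":
        row["push_size"] = payload.get("size", 0)
    return row


def count_by_repo_and_type(rows) -> Counter:
    """Several event types can be analyzed in one pass, with no join."""
    return Counter((r["repo_name"], r["event_type"]) for r in rows)
```

Because all rows share one layout, comparing stars with pull requests for a repository is a single aggregation over one table.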
I analyzed the properties of all event types and ended up with the following table structure in ClickHouse:
CREATE TABLE github_events ( file_time DateTime, event_type Enum('CommitCommentEvent' = 1, 'CreateEvent' = 2, 'DeleteEvent' = 3, 'ForkEvent' = 4, 'GollumEvent' = 5, 'IssueCommentEvent' = 6, 'IssuesEvent' = 7, 'MemberEvent' = 8, 'PublicEvent' = 9, 'PullRequestEvent' = 10, 'PullRequestReviewCommentEvent' = 11, 'PushEvent' = 12, 'ReleaseEvent' = 13, 'SponsorshipEvent' = 14, 'WatchEvent' = 15, 'GistEvent' = 16, 'FollowEvent' = 17, 'DownloadEvent' = 18, 'PullRequestReviewEvent' = 19, 'ForkApplyEvent' = 20, 'Event' = 21, 'TeamAddEvent' = 22), actor_login LowCardinality(String), repo_name LowCardinality(String), created_at DateTime, updated_at DateTime, action Enum('none' = 0, 'created' = 1, 'added' = 2, 'edited' = 3, 'deleted' = 4, 'opened' = 5, 'closed' = 6, 'reopened' = 7, 'assigned' = 8, 'unassigned' = 9, 'labeled' = 10, 'unlabeled' = 11, 'review_requested' = 12, 'review_request_removed' = 13, 'synchronize' = 14, 'started' = 15, 'published' = 16, 'update' = 17, 'create' = 18, 'fork' = 19, 'merged' = 20), comment_id UInt64, body String, path String, position Int32, line Int32, ref LowCardinality(String), ref_type Enum('none' = 0, 'branch' = 1, 'tag' = 2, 'repository' = 3, 'unknown' = 4), creator_user_login LowCardinality(String), number UInt32, title String, labels Array(LowCardinality(String)), state Enum('none' = 0, 'open' = 1, 'closed' = 2), locked UInt8, assignee LowCardinality(String), assignees Array(LowCardinality(String)), comments UInt32, author_association Enum('NONE' = 0, 'CONTRIBUTOR' = 1, 'OWNER' = 2, 'COLLABORATOR' = 3, 'MEMBER' = 4, 'MANNEQUIN' = 5), closed_at DateTime, merged_at DateTime, merge_commit_sha String, requested_reviewers Array(LowCardinality(String)), requested_teams Array(LowCardinality(String)), head_ref LowCardinality(String), head_sha String, base_ref LowCardinality(String), base_sha String, merged UInt8, mergeable UInt8, rebaseable UInt8, mergeable_state Enum('unknown' = 0, 'dirty' = 1, 'clean' = 2, 'unstable' = 3, 'draft' = 4, 
'blocked' = 5), merged_by LowCardinality(String), review_comments UInt32, maintainer_can_modify UInt8, commits UInt32, additions UInt32, deletions UInt32, changed_files UInt32, diff_hunk String, original_position UInt32, commit_id String, original_commit_id String, push_size UInt32, push_distinct_size UInt32, member_login LowCardinality(String), release_tag_name String, release_name String, review_state Enum('none' = 0, 'approved' = 1, 'changes_requested' = 2, 'commented' = 3, 'dismissed' = 4, 'pending' = 5) ) ENGINE = MergeTree ORDER BY (event_type, repo_name, created_at);
A few details:
1. The table is ordered by event_type first. This means that when you SELECT with a condition on the event type, the query will be processed as efficiently as if the selected events were located in a separate table.
2. Next to event_type we have repo_name. When you SELECT data for a certain repo, it will process almost instantly because only a few contiguous ranges of data will be read: only the ranges that contain data for this repo. Basically, the ORDER BY expression of a table is the most efficient index possible. It's also worth noting that data sorted by repository name will compress better due to better data locality.
3. Next to repo_name we have created_at — the time of event. We put it as the last column in the ORDER BY expression, because proper data ordering will improve compression.
4. ENGINE = MergeTree is used as the recommended choice for medium to large sized datasets.
5. We use Enum data types when we know the set of values in advance.
6. We don't use Nullable data types, because they are less efficient than just using empty string / zeros for absence of values.
7. LowCardinality(String) is used to apply additional dictionary compression when strings are expected to repeat frequently. Data is always compressed in ClickHouse, but LowCardinality works better than generic compression because it allows data to be processed without decompression.
8. We use Array data type for labels, assignees, etc. It's quite convenient to have this option.
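To see why ordering by (event_type, repo_name, created_at) makes lookups cheap, here is a simplified Python sketch. It assumes rows are kept as fully sorted tuples and uses binary search per row; ClickHouse itself works on a sparse index over blocks of rows and compares Enum values by number, so this only illustrates the "contiguous range" idea. The function name rows_for_repo is hypothetical.

```python
from bisect import bisect_left


def rows_for_repo(sorted_keys, event_type, repo_name):
    """Return the [start, end) range of rows matching (event_type, repo_name).

    `sorted_keys` holds (event_type, repo_name, created_at) tuples in ascending order.
    """
    # A shorter tuple sorts before any longer tuple that starts with it,
    # so this finds the first row with the requested prefix.
    start = bisect_left(sorted_keys, (event_type, repo_name))
    # repo_name + "\0" is the smallest string greater than repo_name,
    # so this finds the first row past the requested prefix.
    end = bisect_left(sorted_keys, (event_type, repo_name + "\0"))
    return start, end


keys = sorted([
    ("WatchEvent", "a/b", "2020-01-01"),
    ("WatchEvent", "a/b", "2020-01-02"),
    ("WatchEvent", "a/c", "2020-01-01"),
    ("PushEvent", "a/b", "2020-01-01"),
    ("WatchEvent", "a/a", "2019-12-31"),
])
start, end = rows_for_repo(keys, "WatchEvent", "a/b")
print(keys[start:end])
```

All matching rows sit next to each other, so only that slice has to be read; the rest of the table is never touched.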
The simplest way to load the data is to use jq.
find . -name '*.json.gz' | xargs -P$(nproc) -I{} bash -c " gzip -cd {} | jq -c ' [ (\"{}\" | scan(\"[0-9]+-[0-9]+-[0-9]+-[0-9]+\")), .type, .actor.login? // .actor_attributes.login? // (.actor | strings) // null, .repo.name? // (.repository.owner? + \"/\" + .repository.name?) // null, .created_at, .payload.updated_at? // .payload.comment?.updated_at? // .payload.issue?.updated_at? // .payload.pull_request?.updated_at? // null, .payload.action, .payload.comment.id, .payload.review.body // .payload.comment.body // .payload.issue.body? // .payload.pull_request.body? // .payload.release.body? // null, .payload.comment?.path? // null, .payload.comment?.position? // null, .payload.comment?.line? // null, .payload.ref? // null, .payload.ref_type? // null, .payload.comment.user?.login? // .payload.issue.user?.login? // .payload.pull_request.user?.login? // null, .payload.issue.number? // .payload.pull_request.number? // .payload.number? // null, .payload.issue.title? // .payload.pull_request.title? // null, [.payload.issue.labels?[]?.name // .payload.pull_request.labels?[]?.name], .payload.issue.state? // .payload.pull_request.state? // null, .payload.issue.locked? // .payload.pull_request.locked? // null, .payload.issue.assignee?.login? // .payload.pull_request.assignee?.login? // null, [.payload.issue.assignees?[]?.login? // .payload.pull_request.assignees?[]?.login?], .payload.issue.comments? // .payload.pull_request.comments? // null, .payload.review.author_association // .payload.issue.author_association? // .payload.pull_request.author_association? // null, .payload.issue.closed_at? // .payload.pull_request.closed_at? // null, .payload.pull_request.merged_at? // null, .payload.pull_request.merge_commit_sha? // null, [.payload.pull_request.requested_reviewers?[]?.login], [.payload.pull_request.requested_teams?[]?.name], .payload.pull_request.head?.ref? // null, .payload.pull_request.head?.sha? // null, .payload.pull_request.base?.ref? 
// null, .payload.pull_request.base?.sha? // null, .payload.pull_request.merged? // null, .payload.pull_request.mergeable? // null, .payload.pull_request.rebaseable? // null, .payload.pull_request.mergeable_state? // null, .payload.pull_request.merged_by?.login? // null, .payload.pull_request.review_comments? // null, .payload.pull_request.maintainer_can_modify? // null, .payload.pull_request.commits? // null, .payload.pull_request.additions? // null, .payload.pull_request.deletions? // null, .payload.pull_request.changed_files? // null, .payload.comment.diff_hunk? // null, .payload.comment.original_position? // null, .payload.comment.commit_id? // null, .payload.comment.original_commit_id? // null, .payload.size? // null, .payload.distinct_size? // null, .payload.member.login? // .payload.member? // null, .payload.release?.tag_name? // null, .payload.release?.name? // null, .payload.review?.state? // null ]' | clickhouse-client --input_format_null_as_default 1 --date_time_input_format best_effort --query 'INSERT INTO github_events FORMAT JSONCompactEachRow' || echo 'File {} has issues' "
jq is slow. gzip decompression is slow. Here we parallelize it with the xargs -P trick.
A quick intro to jq syntax:
[...] means to collect into a JSON array.
.actor.login is the navigation inside the JSON object.
.elem.field? suppresses the error if elem is not an object; nothing is emitted in that case, so a following // can supply a fallback such as null.
// is the alternative operator (similar to "or" in Python). If the left-hand side is missing, null, or false, the right-hand side is used.
[]?.name means to select the object's name field if it exists, for each element of the array.
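The navigation and fallback rules can be mimicked in Python to make them concrete. This is a rough sketch with hypothetical helper names (get_path, alternative), not a reimplementation of jq: in particular, jq's ? produces no output rather than a value, which is modeled here as None.

```python
def get_path(obj, *keys):
    """Navigate nested dicts like jq's `.a?.b?`; return None when a step is missing."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def alternative(*values):
    """Mimic jq's `a // b // c`: the first value that is neither None nor False."""
    for value in values:
        if value is not None and value is not False:
            return value
    # Like jq, fall through to the last operand even if it is null or false.
    return values[-1] if values else None


# An old-style record where the actor is a plain string rather than an object.
event = {"actor": "octocat", "payload": {"pull_request": {"number": 7, "merged": False}}}

login = alternative(get_path(event, "actor", "login"), event["actor"], None)
number = alternative(
    get_path(event, "payload", "issue", "number"),
    get_path(event, "payload", "pull_request", "number"),
    None,
)
print(login, number)
```

Note that, unlike Python's own or, this fallback keeps 0 and empty strings; only null and false trigger the right-hand side.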
This jq expression will create an array for every event. We pipe these arrays into clickhouse-client to INSERT INTO github_events parsing the data with JSONCompactEachRow format.
The JSONCompactEachRow is one of the input/output formats supported by ClickHouse. Basically it means "JSON array for every table row".
The option --input_format_null_as_default transforms JSON nulls (emitted when a field is not found) into default values such as 0 or an empty string. The option --date_time_input_format best_effort allows parsing the DateTime field in various formats, including ISO-8601.
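A small Python sketch shows what the two pieces amount to: one JSON array per row, and nulls replaced by a per-column default. The three-column layout and the helper names are hypothetical and only imitate the effect; the real conversion happens inside ClickHouse.

```python
import json

# Hypothetical three-column table: (repo_name String, number UInt32, labels Array(String)).
# Each entry is a factory producing that column's default: "", 0, [].
COLUMN_DEFAULTS = (str, int, list)


def apply_defaults(row):
    """Replace JSON nulls (None) with the default of the corresponding column."""
    return [make() if value is None else value for value, make in zip(row, COLUMN_DEFAULTS)]


def to_json_compact_each_row(rows):
    """Serialize rows as one JSON array per line."""
    return "\n".join(json.dumps(row, separators=(",", ":")) for row in rows)


rows = [["ClickHouse/ClickHouse", 1, ["bug"]], ["ClickHouse/ClickHouse", None, None]]
print(to_json_compact_each_row(rows))
print([apply_defaults(row) for row in rows])
```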
Data loading will take a few hours, and the bottleneck is jq.
Congratulations, now we are able to recreate the dataset from the raw data (assuming GH Archive continues to exist).
If you want to continuously update the database, just run the script in cron:
# Assuming raw data is located in './gharchive' directory.
mkdir gharchive_new
cd gharchive_new
ls -1 ../gharchive | clickhouse-local --structure 'file String' --query "WITH (SELECT max(parseDateTimeBestEffort(extract(file, '^(.+)\.json\.gz$'), 'UTC')) FROM table) AS last SELECT toString(toDate(last + INTERVAL arrayJoin(range(0, 24)) - 12 HOUR AS t)) || '-' || toString(toHour(t)) || '.json.gz' WHERE t < now()" | xargs -I{} bash -c "[ -f ../gharchive/{} ] || wget --continue 'https://data.gharchive.org/{}'"
find . -name '*.json.gz' | xargs -P$(nproc) -I{} bash -c " gzip -cd {} | jq -c ' [ (\"{}\" | scan(\"[0-9]+-[0-9]+-[0-9]+-[0-9]+\")), .type, .actor.login? // .actor_attributes.login? // (.actor | strings) // null, .repo.name? // (.repository.owner? + \"/\" + .repository.name?) // null, .created_at, .payload.updated_at? // .payload.comment?.updated_at? // .payload.issue?.updated_at? // .payload.pull_request?.updated_at? // null, .payload.action, .payload.comment.id, .payload.review.body // .payload.comment.body // .payload.issue.body? // .payload.pull_request.body? // .payload.release.body? // null, .payload.comment?.path? // null, .payload.comment?.position? // null, .payload.comment?.line? // null, .payload.ref? // null, .payload.ref_type? // null, .payload.comment.user?.login? // .payload.issue.user?.login? // .payload.pull_request.user?.login? // null, .payload.issue.number? // .payload.pull_request.number? // .payload.number? // null, .payload.issue.title? // .payload.pull_request.title? // null, [.payload.issue.labels?[]?.name // .payload.pull_request.labels?[]?.name], .payload.issue.state? // .payload.pull_request.state? // null, .payload.issue.locked? // .payload.pull_request.locked? // null, .payload.issue.assignee?.login? // .payload.pull_request.assignee?.login? // null, [.payload.issue.assignees?[]?.login? // .payload.pull_request.assignees?[]?.login?], .payload.issue.comments? // .payload.pull_request.comments? // null, .payload.review.author_association // .payload.issue.author_association? // .payload.pull_request.author_association? // null, .payload.issue.closed_at? // .payload.pull_request.closed_at? // null, .payload.pull_request.merged_at? // null, .payload.pull_request.merge_commit_sha? // null, [.payload.pull_request.requested_reviewers?[]?.login], [.payload.pull_request.requested_teams?[]?.name], .payload.pull_request.head?.ref? // null, .payload.pull_request.head?.sha? // null, .payload.pull_request.base?.ref? // null, .payload.pull_request.base?.sha? // null, .payload.pull_request.merged? // null, .payload.pull_request.mergeable? // null, .payload.pull_request.rebaseable? // null, .payload.pull_request.mergeable_state? // null, .payload.pull_request.merged_by?.login? // null, .payload.pull_request.review_comments? // null, .payload.pull_request.maintainer_can_modify? // null, .payload.pull_request.commits? // null, .payload.pull_request.additions? // null, .payload.pull_request.deletions? // null, .payload.pull_request.changed_files? // null, .payload.comment.diff_hunk? // null, .payload.comment.original_position? // null, .payload.comment.commit_id? // null, .payload.comment.original_commit_id? // null, .payload.size? // null, .payload.distinct_size? // null, .payload.member.login? // .payload.member? // null, .payload.release?.tag_name? // null, .payload.release?.name? // null, .payload.review?.state? // null ]' | clickhouse-client --input_format_null_as_default 1 --date_time_input_format best_effort --query 'INSERT INTO github_events FORMAT JSONCompactEachRow' || echo 'File {} has issues' " && mv *.json.gz ../gharchive
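The clickhouse-local query in the update script decides which hourly files to request: it takes the newest file already on disk and generates names for a 24-hour window around it (12 hours back, 11 hours forward), skipping hours that have not started yet. A rough Python equivalent of that window logic, with a hypothetical function name, might look like this:

```python
from datetime import datetime, timedelta


def candidate_files(last: datetime, now: datetime) -> list:
    """Hourly archive names from 12 hours before `last` to 11 hours after it, capped at `now`."""
    names = []
    for offset in range(24):
        t = last + timedelta(hours=offset - 12)
        if t < now:
            # toString(toHour(t)) in the query yields an unpadded hour, e.g. 2020-11-13-9.json.gz.
            names.append(f"{t:%Y-%m-%d}-{t.hour}.json.gz")
    return names


names = candidate_files(datetime(2020, 11, 13, 9), now=datetime(2020, 11, 13, 12))
print(names[0], names[-1], len(names))
```

The script then downloads only the names that are not already present in ../gharchive, so the overlap with existing files costs nothing.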
Install the pixz tool for faster xz compression and run one of the following commands:
clickhouse-client --progress --max_threads 1 --query "SELECT * FROM github_events FORMAT TSV" | pixz > github_events.xz
clickhouse-client --progress --max_threads 1 --query "SELECT * FROM github_events FORMAT Native" | pixz > github_events.native.xz
Connect with clickhouse-client:
clickhouse-client --secure --host play.clickhouse.com --user explorer
HTTPS interface: https://play.clickhouse.com/ (port 443)
Minimal web UI: https://play.clickhouse.com/play?user=play
Keep in mind that this dataset is actually very small for ClickHouse. There are multiple companies that use distributed multi-petabyte ClickHouse setups for various heavy duty tasks, e.g. to analyze a significant share of all internet traffic in real time. This article is not intended to advertise ClickHouse, but at least now you know where to look.
We encourage you to create your own research and tools based on the dataset. This article is open-source and the content is available under the CC-BY-4.0 license or the Apache 2 license. Attribution is required. You can propose changes, extend the article, and share ideas through pull requests and issues on the GitHub repository. Please notify us about interesting usages of the dataset. We also encourage the application of the dataset for DBMS benchmarks.
If you need to cite this article, please do it as follows:
"Milovidov A., 2020. Everything You Ever Wanted To Know About GitHub (But Were Afraid To Ask), https://ghe.clickhouse.tech/"
The authors don't own any rights to the dataset. The dataset includes material that may be subject to third party rights. The query results from the dataset are published under section 107 of the Copyright Act of 1976; allowance is made for "fair use" for purposes such as criticism, comment, news reporting, teaching, scholarship, education and research. Fair use is a use permitted by copyright statute that might otherwise be infringing.
© 2020-2024 Alexey Milovidov