GitHub rankings and their impact on the local free software development community

Creating rankings might seem like a vain exercise in navel-gazing, even more so for people as averse to that kind of thing as programmers. However, in this paper we will try to show how creating city (or province) based rankings in Spain has led to all kinds of interesting effects, including increased productivity and community building.


INTRODUCTION
One of the keys to creating a community is to actually identify who is part of it and how they participate. As part of the effort by the Free Software Office of the University of Granada, we have tried, through the years, to find out who is involved in the creation of open source projects. However, the only way of finding out who they were was to get them to come to one of our events or to contact us by some other means.
So the initial intention in creating a ranking of FLOSS (Free/Libre/Open Source Software) developers was to know who is out there and the kind of things they are doing, whether they are part of the academic world or outside it, in business; creating a census would allow us to discover new FLOSS developers in our own city and even to collaborate with them.
We initially used the GitHub top-1000 generation script by Paul Miller to achieve that, making small modifications to the source and creating our own version, which was eventually moved to a new repository. This had several effects. First, as soon as the ranking was published some people contacted us and the first GitHub meet-up in Granada took place. More modifications and changes were added and new data was obtained. As part of the code tests, more cities were tested and we ended up with lots of data. And data begs for analysis, which we eventually started to do. Along the way, we built a community of users that previously had not known each other. We discovered that the mere fact that a census exists does not imply that there is a community, but it definitely helps.
We have had some experience with this kind of reaction in the past. In (Tricas, Ruiz, and Merelo 2003) we carried out some studies on social network analysis and other measures of the Spanish-speaking blogosphere. Then, the reactions were twofold: on one side, people showed great interest in the index in order to be listed there, and some blog providers also provided data. On the other side, people who were expecting to appear in better positions were a bit angry about it. In any case, we feel that self-awareness is always a good thing and that this work can serve as a driving force for more code sharing and stronger relations among developers.
In this paper, using the tool that we have created for searching for the users in particular cities or provinces, we will show how the GitHub activity in these cities or provinces compares and what kind of characteristics they have, including basic metrics. We will also delve into the effect of publishing the ranking itself, which, surprisingly, seems to have increased the productivity of all communities measured. Finally, we will try to draw some conclusions on how measuring activity affects that activity and on the general characteristics of open source developers in the provinces measured, which are the top 20 in population in Spain.
Next, we will review different papers that deal with creating lists and rankings of contributions and with measuring or explaining the dynamics of communities. In section 3 we will show the methodology for obtaining the users in a particular province in Spain; next we will analyze the data obtained and show how different provinces stack up in terms of contributions, and finally we will draw some conclusions.

STATE OF THE ART
Geographically based community metrics have received some attention in recent years (Gonzalez-Barahona et al. 2008, Takhteyev and Hilts 2010, Sebastian von, Andreas, and Christoph 2013) but they do not seem to have aroused much interest in the FLOSS metrics (Herraiz et al. 2009) community.
Most efforts seem to be focused on creating tools for actually measuring activity in repositories; for instance, Laura Arjona describes the Debian Contributors tool in (Arjona-Reina July 2014), whose results are dumped to a website that includes information on the projects that every user has contributed to and some other data, such as how users are identified in the databases. However, geography seems to be relevant in human interactions even on the Internet, where one could expect this factor to be less important. See, for example, 'Visualizing Friendships', where the authors studied interactions inside the Facebook social network, or (Ratti et al. 2010), where the subject of analysis is phone calls. We can see that even when communication is mediated by technology, geography seems to be an important driver.
It is interesting to note that some models (Robles, Merelo-Guervos, and Gonzalez-Barahona 2005) use the concept of stigmergy, that is, interaction using the environment, to model the dynamics of libre software projects; the mere existence of these tools can be a catalyst of this interaction and the harbinger of new software projects. In fact, this seems to be what has happened in the community (or communities) under observation: the mere creation of a document that mentions many different users acts as a substrate that allows the creation and growth of the community through the stigmergy paradigm.
Next we will briefly explain the tool that was designed to search for geographically based GitHub users.

GITHUB CITY RANKINGS, THE TOOL
It is quite unlikely that if you are reading this you do not know what GitHub is. GitHub is a web-based git repository hosting service that has a number of "social" features, including the declaration of a profile and @ mentions in commit messages and issues. The profile page includes information on the number of followers, as well as the repositories and the number of contributions every person has made during the last year. Besides this easily scrapeable information, GitHub has a REST API that can be accessed from any language. Some other web-based repositories share many of these characteristics and, besides, are based on free software themselves, like Gitorious, SourceForge or Google Code, to name just a few. However, the number (and the activity) of users of these repositories is quite small compared to GitHub, which has become the tool of choice for FLOSS developers. That is why GitHub was chosen, apart from the availability of tools to mine profile information: it provides an API that makes this kind of study easier. Notice also that not all the projects in GitHub can be considered FLOSS, as we can see in (Williamson 2014). Nevertheless, people seem to be following the web culture, where sharing and broadcasting are the usual ways of relating, and are not paying attention to licensing issues.
The tool used initially, by Paul Miller, was written in CoffeeScript and designed for creating a ranking of the top 1000 users with more than an (arbitrary) number of followers, 256. The tool used the GitHub REST API to make requests, and saved the results in a human-readable form in Markdown and also in CSV and JSON. It was separated into three scripts that were called from a Makefile. Some utility functions were written as a Node.js module, which was called from several scripts.
Our intention was to look for users in a particular location, that is, to limit them not by a minimum number of followers but by the location declared in their profiles. Every run of the program required 10 API requests, and these were limited to 20 per hour, so we modified it (Merelo 2015) so that it used authenticated requests. Finally, we had to rearrange the whole code so that it counted the number of stars and so that results could be filtered according to regular expressions, since the country a city or province is in might be ambiguous (you know Toledo, Ohio, but there is also a Toledo in Spain). Besides, Markdown handling was kind of clumsy, so it was moved to a template-based solution.
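As an illustration only, the location-based search and the regular-expression filter could be sketched as follows in Python (the original tool is written in CoffeeScript, so this is not its actual code). The sketch assumes the GitHub REST endpoint `/search/users` with a `location:` qualifier and a personal access token sent in the `Authorization` header; the function names and the exclusion pattern are hypothetical.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "https://api.github.com"


def search_users_request(location, page=1, token=None):
    """Build (but do not send) a request for users declaring `location`."""
    params = urlencode({
        "q": f'location:"{location}"',  # search qualifier on the profile location
        "sort": "followers",
        "order": "desc",
        "per_page": 100,
        "page": page,
    })
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "city-ranking-sketch",  # hypothetical client name
    }
    if token:
        # Authenticated requests get a higher rate limit than anonymous ones.
        headers["Authorization"] = f"token {token}"
    return Request(f"{API_ROOT}/search/users?{params}", headers=headers)


def keep_user(declared_location, exclude_pattern):
    """Drop users whose declared location matches an exclusion regex."""
    return re.search(exclude_pattern, declared_location, re.IGNORECASE) is None


# Hypothetical filter separating Toledo (Spain) from Toledo, Ohio.
NOT_SPAIN = r"\b(ohio|oh|usa)\b"
request = search_users_request("Toledo", token="<your-token>")
```

Passing `request` to `urllib.request.urlopen` would perform the actual call; it is left out here so that the sketch does not consume any API quota.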
One of the main problems we found in Spain was the ambiguity of the province name. Besides the fact that people write it in any of the official languages (Spanish and, in some cases, national languages like Basque or Catalan) and, well, sometimes with typos (with or without accents), some people do not mention their province when writing their location. To make a long story short, we had to provide a configuration file (in JSON) which lists several possible names that might be used by people in a particular province; for instance, for Majorca we had to include this: "location": ["Balears","Baleares","Palma de Mallorca"]. This, of course, excludes those that simply do not care about listing their location, but more on this later on.
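A minimal Python sketch of how such a list of alternative names could be matched against a declared location is shown below. Only the "location" key and its three values come from the configuration above; the accent- and case-insensitive comparison and the function names are assumptions made for illustration, not a description of the original tool.

```python
import json
import unicodedata

# The "location" list is the Majorca example quoted in the text.
CONFIG_JSON = '{"location": ["Balears", "Baleares", "Palma de Mallorca"]}'


def normalize(text):
    """Lower-case the text and strip accents so that 'Málaga' equals 'Malaga'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def matches_province(declared_location, config):
    """True if any configured name appears in the user's declared location."""
    declared = normalize(declared_location)
    return any(normalize(name) in declared for name in config["location"])


config = json.loads(CONFIG_JSON)
print(matches_province("Palma de Mallorca, Spain", config))
```

Substring matching is deliberately permissive in this sketch; in practice it would be combined with a filter on the country to avoid false positives.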
The script is then run with a city name (Madrid, for instance, if no particular configuration is needed) or with a configuration file name, such as granada. This can be launched weekly, or simply at a particular moment or on request.
The results for each user list the number of followers, contributions, the number of stars their repositories have received, the longest and the current contribution streak, as well as the predominant language and avatar. Some of these metrics are shown in the Markdown rankings; for instance, see the one for Madrid.
Data is saved to a different repository and is aggregated and processed using R and Perl scripts. All of them are included in the same repository. As part of our commitment to free/open science, all graphics and data were published on Twitter, as soon as they were produced, from my @jjmerelo account.

RESULTS AND ANALYSIS
We reduced the search to the 20 most populated provinces in Spain, for which appropriate search strings and filters were created (population data was obtained from the National Statistics Institute, http://www.ine.es/). All data was downloaded during January 2015, and is available from the already mentioned repo. Aggregate data for these 20 provinces is shown in table 1.
The number of users shown in the table is in the hundreds, with the biggest provinces (and cities) approaching 1000. However, population and users/contributions are not directly related. We can already see some differences in figure 1, which shows the number of contributions (left) and users (right) in decreasing order. The first two, Madrid and Barcelona, are only to be expected, but then Granada (17th in population) and Zaragoza also occupy a place that does not correspond exactly to their population; the same happens with Bilbao (actually Vizcaya), but in the opposite direction. So let us look at the distribution of contributions and users in search of an explanation for these changes of place in the ranking. A plot of users and contributions vs. rank is shown in figure 2; it shows different slopes, which imply different distributions, but there is a clear indication that a Zipf-like distribution is taking place in all cases. So let us compute the Zipf exponent and objective, which we show in table 2.
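The text does not state how the Zipf exponent was fitted, so the following Python sketch is only one plausible way of doing it: an ordinary least-squares fit of log(contributions) against log(rank). The function name is hypothetical, and the residual sum of squares is used here as a stand-in for the "objective", which is an assumption.

```python
import math


def zipf_fit(counts):
    """Fit log(count) = intercept - exponent * log(rank) by least squares.

    Returns (exponent, residual_sum_of_squares). Zero counts are dropped
    because their logarithm is undefined; at least two positive counts
    are needed.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return -slope, residual


# Synthetic data following an exact 1/rank law, not data from the paper.
exponent, residual = zipf_fit([1000.0 / rank for rank in range(1, 51)])
print(round(exponent, 3))
```

A steeper slope in the rank plot corresponds to a larger exponent, that is, to contributions concentrated in fewer users.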

Table 2: Zipf coefficients for all 6 "big" cities, with exponent and objective
This can be interpreted in a different way by plotting the Lorenz curve, which is the accumulated normalized sum of contributions for these six cities. This is shown in figure 3; the Lorenz curve represents the inequality between those that contribute more and those that contribute less, and it is usually summarized by the Gini index, which is shown in table 3.
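As a sketch of what "accumulated normalized sum" means, the Lorenz curve could be computed in Python as below. The function name and the sample numbers are illustrative and do not come from the data set.

```python
def lorenz_curve(contributions):
    """Return (share of users, share of contributions) points.

    Users are sorted from least to most active; each point gives the
    fraction of all contributions made by the bottom fraction of users.
    """
    ordered = sorted(contributions)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / len(ordered), running / total))
    return points


# Perfect equality gives the diagonal; concentration bends the curve down.
print(lorenz_curve([1, 1, 1, 1]))
print(lorenz_curve([0, 0, 0, 10]))
```

The further the curve sags below the diagonal, the more unequal the distribution of contributions.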

Table 3: Gini coefficients for all 6 "big" cities
The Gini coefficient measures inequality in the sense of share of, in this case, contributions between those with the most contributions and those with the least. An index equal to 1 would mean a single person made all the contributions, while the rest made none, and an index equal to 0 would mean all users make the same number of contributions. Table 3 ranks the cities from least equal (Madrid) to most egalitarian (Valencia). However, there is no big range of variation: values hover around 0.70, which is way over the inequality of the most unequal country in the world, the Seychelles. This comparison is meaningless in absolute terms; in relative terms, it means that contributions are heavily concentrated in a small group of top contributors, and that there is no big variation among the different cities/provinces.
It is quite clear, however, that the contributions by the top contributor, as well as those made by the average one, are quite different from place to place. So we represent in figure 4 the average number of contributions, that is, the number of contributions divided by the number of users. Figure 4 shows that average contributions have a bigger range than the Gini coefficient; Valencia is right in the middle, around 100; in fact, the top contributor (pakozm) has 1462 contributions and 25% of users have more than 100. In Madrid, however, the top 10 have more than 2000 contributions and 200 users (25%) have more than 150, so that accounts for the bigger inequality. But productivity is highest in Madrid, Zaragoza and, once again, in Granada, if we consider productivity exclusively as the number of contributions.
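For reference, a minimal Python sketch of the Gini coefficient is given below, using the standard formula over sorted values, G = 2·Σ(i·x_i) / (n·Σx_i) − (n + 1)/n. It is not taken from the R and Perl scripts used for the paper, and the function name is hypothetical.

```python
def gini(contributions):
    """Gini coefficient of a list of non-negative contribution counts."""
    ordered = sorted(contributions)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0  # no users or no contributions: treat as perfectly equal
    # Weight each value by its 1-based rank in ascending order.
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


print(gini([5, 5, 5, 5]))   # everybody contributes the same
print(gini([0, 0, 0, 10]))  # a single user makes every contribution
```

With a finite number of users the maximum value is (n − 1)/n rather than exactly 1, which approaches 1 as the number of users grows.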
Finally, it is interesting to find out if the publication of these rankings had any kind of impact. This is shown in figure 5 for the three cities for which we have the most tests, including Granada. The first time the census for Granada was published it included around 140 users. A month (approximately) later, there were more than 180, a 28% increase, more or less the same as for Malaga and more than for Seville, where a small increase was noted. Take into account that this is not the absolute number of users, but only active users; in fact, it might decrease due to some users becoming inactive (no contributions) in the last year (this happens in some of the cases in Granada). So, in general, it should be expected to go either way, depending on the city. The fact that all cities whose rankings have been published have increased their number of active users after diffusion, mainly on Twitter, might be an indication of more users becoming active, more users mentioning their city/province in their profile, or small competitions taking place locally to climb up the rankings if there is a chance to do so. All hypotheses are equally valid lacking other evidence, but we would say that at least some of the increase is due to the fact that the rankings exist.

CONCLUSIONS
In this paper we have shown what happened when city/province based rankings were created using the GitHub search API and what conclusions can be extracted from measuring the number of users and the contributions made by these users.
In general, it is interesting to note that Spain hosts a vibrant, abundant and diverse community of developers. Looking at the raw numbers, most of them are based in the big cities, Madrid, Barcelona and Valencia, but some smaller cities like Zaragoza and Granada also host a numerous and productive group of developers. The publishing of the rankings has created a lively discussion on Twitter, and has also allowed the discovery of many developers in many areas. In Granada, monthly GitHub meetings have started, and many interesting projects, including this paper, have been launched.
There are many things that remain to be done. The first one is to check the ability of GitHub to act as a social network; we would like to analyze how developers in a city connect to each other, how these actual communities change with time and what their background is: companies, academia or user groups. Other productivity measures could also be taken, including number of lines; besides, a differentiation between code and artifacts could be made, in the same way it was done by Robles et al. in (Robles, Gonzalez-Barahona, and Merelo 2006).

Figure 1: Provinces ranked by number of contributions (left) and users (right)

Figure 2: Zipf graph, rank vs. number of contributions (left) or followers (right) for the top 6 provinces in number of users and contributions: Madrid, Barcelona, Valencia, Granada, Sevilla, Zaragoza

Figure 3: Lorenz graph, that is, accumulated number of contributions vs rank for the top 6 provinces in number of users and contributions: Madrid, Barcelona, Valencia, Granada, Sevilla, Zaragoza

Figure 4: Average number of contributions, sorted from top to bottom, for the 20 provinces with the most inhabitants.

Figure 5: Number of users in each ranking; each point simply indicates a new ranking, and points are not uniformly distributed in time.