Global language

This time I have prepared a short analysis comparing languages on Wikipedia. We will check which languages are most widely used around the world and which countries have the most Internet users.

Nowadays English is a lingua franca, as we can find English speakers in every country. On the other hand, there is a huge number of languages that are used only locally (in one country or even one region).

I would expect languages used globally to show smaller differences between night and day, because in every time zone there are some speakers. Compare the English and Czech page view patterns to get the idea:

Global language vs. language used locally -- page view patterns on Wikipedia

Of course, this approach is far from perfect, mainly because it is based on time of day, so it doesn’t capture migration within the same time zone, e.g. from Latin America to the U.S. Moreover, Spanish is definitely a global language, as it is used in many countries, but, as you will see, this method doesn’t show it. The population of native Spanish speakers in Europe is too small, compared to Latin America, to significantly raise page views during “European” daytime hours.

I decided to compute the index as follows:

min number of PV per hour on a given day / average number of PV per hour on a given day

I tried several variants of it (e.g. using the 20th percentile instead of the minimum), but the results were similar.
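As a rough illustration of the index and its percentile variant, here is a minimal Python sketch. It is not the code used for this analysis: the function names, the nearest-rank percentile rule, and the toy hourly counts are all assumptions made for the example.

```python
import math
from statistics import mean


def global_index(hourly_views):
    """Minimum hourly page views divided by the mean hourly page views.

    `hourly_views` is assumed to hold the page view counts for each
    hour of a single day (normally 24 values).
    """
    if not hourly_views:
        raise ValueError("hourly_views must not be empty")
    avg = mean(hourly_views)
    if avg == 0:
        return 0.0  # no traffic at all; treat the index as 0
    return min(hourly_views) / avg


def global_index_percentile(hourly_views, q=0.2):
    """Variant that uses a low percentile instead of the minimum.

    The percentile is taken with the nearest-rank rule, which is an
    assumption; other percentile definitions give slightly different values.
    """
    if not hourly_views:
        raise ValueError("hourly_views must not be empty")
    ordered = sorted(hourly_views)
    rank = max(1, math.ceil(q * len(ordered)))  # 1-based nearest rank
    avg = mean(ordered)
    if avg == 0:
        return 0.0
    return ordered[rank - 1] / avg


# Toy data (hypothetical, not real Wikipedia counts):
flat_day = [100] * 24               # same traffic every hour
peaky_day = [10] * 12 + [190] * 12  # quiet night, busy day

flat_idx = global_index(flat_day)    # 1.0: no night/day difference
peaky_idx = global_index(peaky_day)  # 0.1: strong night/day difference
```

A flat profile gives an index of 1, and the index falls towards 0 as the quietest hour gets further below the daily average.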

This is the language global index for the 20 most popular languages on Wikipedia. Higher values mean smaller differences between night and day:

                    language  glob_idx
1                    English 0.7981971
2                  Norwegian 0.5851168
3                    Persian 0.5177893
4                     Arabic 0.4909870
5                    Chinese 0.4857803
6                     Korean 0.4816817
7                    Swedish 0.4765534
8                  Ukrainian 0.4445452
9                     Polish 0.4108401
10                   Spanish 0.4013519
11                     Dutch 0.3936351
12                    French 0.3602967
13                   Italian 0.3544782
14                Portuguese 0.3493829
15                   Turkish 0.3481315
16                  Japanese 0.3357735
17                     Czech 0.3345046
18                Indonesian 0.3308224
19                   Russian 0.3130690
20                    German 0.2860656

English is the most widely spread language around the world, which is no surprise. Next come Norwegian and Persian, which I find hard to explain. Still, their values (about 0.5 to 0.6) are much lower than the value for English (0.8), so they can’t really be compared with it.

At the other end of the scale we have German and Russian, which can also be surprising. In the case of Russia, it probably means that most of the Internet users are in one part of the country.

Of course, there are more controversies about this measure. I mentioned Spanish, which is definitely used widely in many countries, but here it is in the middle of the scale. In fact, it has a value similar to Polish (which is definitely not a global language).

Internet users

But there is also something else I wanted to check: the number of page views compared with the number of native speakers. This could tell us something about Internet users in a given country. From the chart below we can see that the relation is roughly linear: the more native speakers, the more daily page views. But some languages are closer to the bottom-right corner, meaning that they have relatively few page views on Wikipedia for their number of speakers (and probably not so many Internet users). The colour is the “language global index” described above; lighter means the language is more “spread” around the globe.

Wikipedia usage -- daily page views vs. native speakers
