Deaths of players in the game Just Cause 2 visualized
Square Enix, the creators of the huge sandbox game "Just Cause 2", have created a short video visualizing where players died in the game. Every white dot marks the death of a player, and there are 11 million deaths in total. I think it's amazing that you can make out the map from these dots alone. Definitely worth checking out.
The first data!
The import just finished, and I'm running some queries now (they seem to take quite a while to run, though). Here is some general data about the database:
SELECT COUNT(*) FROM words;
The number of words in the database: 7.380.256
That's more than 7 million unique words!
SELECT COUNT(*) FROM wordcounts;
The number of entries in the wordcounts table (one row for each year in which a word occurs at least once, holding the number of times the word occurs in that year): 472.764.897
SELECT SUM(total_count) FROM wordcounts;
The number of words (non-unique) in all the books Google Books has read: 359.675.008.445
The world record for most words spoken in a minute currently stands at 650, and those are probably just short words. But even at that rate, it would take the world record holder roughly 1052 years of nonstop talking to read everything in Google Books out loud.
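The reading-time estimate can be checked with a few lines of Java. This is only a sketch of the arithmetic: the class and method names are made up for illustration, and it assumes nonstop talking and a year of 365.25 days.

```java
class ReadingTime {
    // Total word count reported by the SUM(total_count) query above.
    static final long TOTAL_WORDS = 359_675_008_445L;
    // Speaking rate quoted in the post.
    static final int WORDS_PER_MINUTE = 650;
    // Assumption: nonstop talking, 365.25 days per year.
    static final double MINUTES_PER_YEAR = 60 * 24 * 365.25;

    // Converts a word count and a speaking rate into years of talking.
    static double yearsToReadAloud(long words, int wordsPerMinute) {
        double minutes = words / (double) wordsPerMinute;
        return minutes / MINUTES_PER_YEAR;
    }

    public static void main(String[] args) {
        double years = yearsToReadAloud(TOTAL_WORDS, WORDS_PER_MINUTE);
        System.out.println(Math.round(years) + " years, give or take");
    }
}
```

Using 365 days per year instead shifts the result by less than a year, so the conclusion stays the same either way.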
SELECT word_id,SUM(total_count) FROM wordcounts GROUP BY word_id ORDER BY SUM(total_count) DESC;
The top 15 most used words:
| # | Word | Used |
|---|---|---|
| 1 | " | 21.396.850.115 times |
| 2 | the | 18.399.669.358 times |
| 3 | . | 16.834.514.285 times |
| 4 | of | 12.042.045.526 times |
| 5 | and | 8.588.851.162 times |
| 6 | to | 7.305.545.226 times |
| 7 | in | 5.956.669.421 times |
| 8 | a | 5.422.911.334 times |
| 9 | """\x15\x12" | 3.312.106.937 times <-- WTF?? |
| 10 | - | 3.285.590.930 times |
| 11 | is | 3.211.700.708 times |
| 12 | that | 2.992.927.085 times |
| 13 | for | 2.298.892.030 times |
| 14 | ( | 2.129.958.825 times |
| 15 | ) | 2.128.706.986 times |
SELECT word, LENGTH(word) FROM words ORDER BY LENGTH(word) DESC;
The longest words in the dataset. What I noticed here is that the dataset isn't very clean. First, the raw top 10:
- bababadalgharaghtakamminarronnkonnbronntonnerronntuonnthunntrovarrhounawnskawntoohoohoordenenthurnuk
- ____________________________________________________________________________________
- _________________________________________________________________________________
- 44444444444444444444444444444444444444444444444444444444444444444444444444444444
- Illlllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
- 77777777777777777777777777777777777777777777777777777777777777777777777777777777
- _______________________________________________________________________________
- ______________________________________________________________________________
- _____________________________________________________________________________
- ____________________________________________________________________________
And now cleaned:
- bababadalgharaghtakamminarronnkonnbronntonnerronntuonnthunntrovarrhounawnskawntoohoohoordenenthurnuk
- Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch
- osseocarnisanguineoviscericartilaginonervomedullary
- Pneumonoultramicroscopicsilicovolcanoconiosises
- Chargoggagoggmanchauggagoggchaubunagungamaugg
- pneumonoultramicroscopicsilicovolcanokoniosis
- pneumonoultramicroscopicsilicovolcanoconiosis
- Kummogkodonattoottummooetiteaongannunnonash
- reachunderadjustscrewdownreachunderadjust
- phosphoribosylaminoimidazolecarboxamide
Fun fact: my spellchecker doesn't flag the longest word as misspelled, but it does flag the rest.
More data is coming; these are just the queries I could run today.
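The post doesn't say how the cleaned list was produced, so the following is only one possible filter, not the method actually used. It keeps tokens that consist of letters only and contain no character repeated five or more times in a row. Both the class name and the threshold of five are assumptions chosen for illustration.

```java
import java.util.regex.Pattern;

class WordFilter {
    // The whole token must consist of Unicode letters.
    private static final Pattern LETTERS_ONLY = Pattern.compile("\\p{L}+");
    // Hypothetical threshold: the same character five or more times in a row.
    private static final Pattern LONG_RUN = Pattern.compile("(.)\\1{4,}");

    // Returns true when the token passes both checks.
    static boolean looksLikeWord(String token) {
        return LETTERS_ONLY.matcher(token).matches()
                && !LONG_RUN.matcher(token).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch",
            "Illlllllllll",
            "4444444444",
            "__________"
        };
        for (String s : samples) {
            System.out.println(looksLikeWord(s) + "\t" + s);
        }
    }
}
```

The threshold is five rather than four because the Welsh place name in the list really does contain four consecutive l's. A filter like this would still let through long made-up strings that happen to look like words, so some manual checking remains necessary.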
Google ngram dataset
And the first dataset I'll analyse will be: the Google Ngram dataset! I'm currently importing the entire 1-gram set into a PostgreSQL database (which unfortunately takes 50 hours in total), and then I'll start analysing. I'll find out the simple things, like the longest word ever used in a book, or the top 100 most used words in books. But I'll also try to uncover things lying deeper in the dataset, for example: how does a big war affect the words most used in books? Or an economic crisis? You will know these things soon enough.
As a little extra, here is the code I use to import everything:
JAVA
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ToPostgres {
    public static void main(String[] args) throws Exception {
        // The ten 1-gram files are expected in the working directory.
        String filePath = "./";
        List<String> files = new ArrayList<String>();
        for (int i = 0; i < 10; i++) {
            files.add(filePath + "googlebooks-eng-all-1gram-20090715-" + i + ".csv");
        }

        // try-with-resources closes the connection, statements and readers.
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://localhost/googlebooks", "postgres", "xxxxxx")) {
            // Commit manually in chunks instead of after every row.
            c.setAutoCommit(false);
            try (PreparedStatement wordInsert = c.prepareStatement(
                        "INSERT INTO words (id, word) VALUES (?,?)");
                 PreparedStatement countInsert = c.prepareStatement(
                        "INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) "
                        + "VALUES (?,?,?,?,?)")) {
                String lastWord = "";
                long id = 0L;
                for (String filename : files) {
                    try (BufferedReader input = new BufferedReader(new FileReader(filename))) {
                        String line;
                        int i = 0;
                        while ((line = input.readLine()) != null) {
                            // Tab-separated columns: word, year, count, pages, books.
                            String[] data = line.split("\t");
                            // Rows for the same word are assumed to be adjacent,
                            // so a different word means a new row in words.
                            if (!lastWord.equals(data[0])) {
                                id++;
                                wordInsert.setLong(1, id);
                                wordInsert.setString(2, data[0]);
                                wordInsert.executeUpdate();
                            }
                            countInsert.setLong(1, id);
                            countInsert.setInt(2, Integer.parseInt(data[1]));
                            countInsert.setInt(3, Integer.parseInt(data[2]));
                            countInsert.setInt(4, Integer.parseInt(data[3]));
                            countInsert.setInt(5, Integer.parseInt(data[4]));
                            countInsert.executeUpdate();
                            lastWord = data[0];
                            // Commit every 10,000 rows.
                            if (i % 10000 == 0) {
                                c.commit();
                            }
                            // Print progress every 100,000 rows.
                            if (i % 100000 == 0) {
                                System.out.println(i + " mark file " + filename);
                            }
                            i++;
                        }
                    }
                    // Commit whatever is left over from this file.
                    c.commit();
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
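To make the input format easier to see, here is the parsing step on its own, separated from the database work. It is a sketch rather than part of the importer: the `NgramLine` class is a made-up helper, the sample line in `main` is invented, and the field names simply mirror the columns the importer writes.

```java
class NgramLine {
    final String word;
    final int year;
    final int totalCount;
    final int totalPages;
    final int totalBooks;

    NgramLine(String word, int year, int totalCount, int totalPages, int totalBooks) {
        this.word = word;
        this.year = year;
        this.totalCount = totalCount;
        this.totalPages = totalPages;
        this.totalBooks = totalBooks;
    }

    // Splits one tab-separated line into its five fields.
    static NgramLine parse(String line) {
        String[] data = line.split("\t");
        if (data.length != 5) {
            throw new IllegalArgumentException("expected 5 fields, got " + data.length);
        }
        return new NgramLine(
                data[0],
                Integer.parseInt(data[1]),
                Integer.parseInt(data[2]),
                Integer.parseInt(data[3]),
                Integer.parseInt(data[4]));
    }

    public static void main(String[] args) {
        // Invented sample values, not taken from the dataset.
        NgramLine row = parse("example\t1950\t120\t100\t40");
        System.out.println(row.word + " appeared " + row.totalCount + " times in " + row.year);
    }
}
```

Checking the field count up front means a malformed line fails with a clear message instead of an `ArrayIndexOutOfBoundsException` halfway through a long import.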
What this blog will be about
This blog will be about what you can do with datasets. There are a lot of huge datasets available on the internet, if you know where to look. Analyzing this data can reveal some useful, fun and interesting stuff. My plan is to take a dataset, write an introduction about it, and then, over the following weeks, post what I found out from it and how I found it.
Suggestions are always welcome.
The first dataset I'm going to analyze will be posted tomorrow.