I <3 Data All about data-analysis

13 May 2011

Deaths of players in the game Just Cause 2 visualized

Square Enix, the publisher of the huge sandbox game "Just Cause 2", has released a short video visualizing where players died in the game. Every white dot marks the death of a player, and there are 11 million deaths in total. I think it's amazing that you can make out the map from these dots alone. Definitely worth checking out.

11 Apr 2011

The first data!

The import just finished, and I'm running some queries now (it looks like they take quite a while to run, though). Here is some general data about the database:

SELECT COUNT(*) FROM words;

The number of words in the database: 7.380.256
That's more than 7 million unique words!

 

SELECT COUNT(*) FROM wordcounts;

The number of entries in the wordcounts table (one row for every year in which a word occurs at least once, holding the number of times the word occurs in that year): 472.764.897

 

SELECT SUM(total_count) FROM wordcounts;

The total number of words (non-unique) in all books ever read by Google Books: 359.675.008.445
The world record for most words spoken in a minute currently stands at 650 words per minute, and those are probably just short words. But even at that rate, it would take the world record holder 1052,0689 years of non-stop talking to read everything in Google Books out loud.
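The reading-time figure can be reproduced with a few lines of Java. This is only a sketch: the class and method names are made up for illustration, and it assumes non-stop reading and a year of 365.25 days, which is the year length that gives the 1052,0689 figure above.

```java
public class ReadingTime {

    // Assumption: a year is 365.25 days long and reading never stops.
    static final double MINUTES_PER_YEAR = 60.0 * 24.0 * 365.25;

    /** Years needed to read totalWords out loud at a constant wordsPerMinute. */
    static double yearsToRead(long totalWords, int wordsPerMinute) {
        double minutes = totalWords / (double) wordsPerMinute;
        return minutes / MINUTES_PER_YEAR;
    }

    public static void main(String[] args) {
        // SUM(total_count) from the query above; it does not fit in an int, hence long.
        long totalWords = 359_675_008_445L;
        System.out.println(yearsToRead(totalWords, 650) + " years");
    }
}
```

With a 365-day year the result comes out slightly higher, so the exact decimals depend on that choice.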

 

SELECT word_id,SUM(total_count)
FROM wordcounts
GROUP BY word_id
ORDER BY SUM(total_count) DESC;

The most used word top 15:

#   Word           Used
1   "              21.396.850.115 times
2   the            18.399.669.358 times
3   .              16.834.514.285 times
4   of             12.042.045.526 times
5   and            8.588.851.162 times
6   to             7.305.545.226 times
7   in             5.956.669.421 times
8   a              5.422.911.334 times
9   """\x15\x12"   3.312.106.937 times <-- WTF??
10  -              3.285.590.930 times
11  is             3.211.700.708 times
12  that           2.992.927.085 times
13  for            2.298.892.030 times
14  (              2.129.958.825 times
15  )              2.128.706.986 times

SELECT word, LENGTH(word)
FROM words
ORDER BY LENGTH(word) DESC;

The longest words in the dataset. What I noticed here is that the dataset isn't very clean. First, the raw top 10:

  1. bababadalgharaghtakamminarronnkonnbronntonnerronntuonnthunntrovarrhounawnskawntoohoohoordenenthurnuk
  2. ____________________________________________________________________________________
  3. _________________________________________________________________________________
  4. 44444444444444444444444444444444444444444444444444444444444444444444444444444444
  5. Illlllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll
  6. 77777777777777777777777777777777777777777777777777777777777777777777777777777777
  7. _______________________________________________________________________________
  8. ______________________________________________________________________________
  9. _____________________________________________________________________________
  10. ____________________________________________________________________________

And now cleaned:

  1. bababadalgharaghtakamminarronnkonnbronntonnerronntuonnthunntrovarrhounawnskawntoohoohoordenenthurnuk
  2. Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch
  3. osseocarnisanguineoviscericartilaginonervomedullary
  4. Pneumonoultramicroscopicsilicovolcanoconiosises
  5. Chargoggagoggmanchauggagoggchaubunagungamaugg
  6. pneumonoultramicroscopicsilicovolcanokoniosis
  7. pneumonoultramicroscopicsilicovolcanoconiosis
  8. Kummogkodonattoottummooetiteaongannunnonash
  9. reachunderadjustscrewdownreachunderadjust
  10. phosphoribosylaminoimidazolecarboxamide
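One way such a cleanup could be done is sketched below. It is not necessarily the method used for the list above: the class name, the method names and the threshold are assumptions. The heuristic keeps a string only if it consists entirely of letters and never repeats the same letter more than four times in a row. Four is the lowest limit that still lets "Llanfairpwll...gogogoch" through, because of its "llll".

```java
import java.util.List;
import java.util.stream.Collectors;

public class WordCleaner {

    // Assumed threshold: longest run of one letter that is still accepted.
    static final int MAX_RUN = 4;

    /** True if s contains only letters and no letter repeats more than MAX_RUN times in a row. */
    static boolean looksLikeWord(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int run = 0;
        char prev = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetter(c)) {
                return false; // rejects "____" and "4444"
            }
            char lower = Character.toLowerCase(c);
            run = (lower == prev) ? run + 1 : 1;
            if (run > MAX_RUN) {
                return false; // rejects "Illlllll..."
            }
            prev = lower;
        }
        return true;
    }

    /** Keeps only the entries that pass looksLikeWord, preserving their order. */
    static List<String> clean(List<String> words) {
        return words.stream()
                .filter(WordCleaner::looksLikeWord)
                .collect(Collectors.toList());
    }
}
```

A filter like this still lets through glued-together strings such as "reachunderadjustscrewdownreachunderadjust", so it only removes the most obvious noise.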

Fun fact: my spellchecker doesn't flag the longest word as misspelled, but it does flag all the others.

More data is coming; these are just the queries I can run today.

10 Apr 2011

Google ngram dataset

The first dataset I'll analyze will be the Google Ngram dataset! I'm currently importing the entire 1-gram set into a PostgreSQL database (which unfortunately takes 50 hours in total), and then I'll analyze it. I'll find out the simple things, like the longest word ever used in a book, or the top 100 most used words in books. But I'll also try to uncover things lying deeper in the dataset, for example: how does a big war affect the words most used in books? Or an economic crisis? You will know these things soon enough :) .

As a little extra, here is the Java code I use to import everything:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;

public class ToPostgres {

    public static void main(String[] args) throws Exception {
        // The ten 1-gram files are expected in the working directory.
        String filePath = "./";
        List<String> files = new ArrayList<String>();
        for (int i = 0; i < 10; i++) {
            files.add(filePath + "googlebooks-eng-all-1gram-20090715-" + i + ".csv");
        }

        // try-with-resources closes the connection, statements and readers,
        // even when an exception is thrown halfway through the import.
        try (Connection c = DriverManager.getConnection(
                "jdbc:postgresql://localhost/googlebooks", "postgres", "xxxxxx")) {
            // Commit manually in chunks instead of after every single insert.
            c.setAutoCommit(false);
            try (PreparedStatement wordInsert = c.prepareStatement(
                         "INSERT INTO words (id, word) VALUES (?,?)");
                 PreparedStatement countInsert = c.prepareStatement(
                         "INSERT INTO wordcounts (word_id, \"year\", total_count, total_pages, total_books) "
                         + "VALUES (?,?,?,?,?)")) {
                String lastWord = "";
                long id = 0L;
                for (String filename : files) {
                    try (BufferedReader input = new BufferedReader(new FileReader(filename))) {
                        String line;
                        int i = 0;
                        while ((line = input.readLine()) != null) {
                            // Tab-separated columns: word, year, total count, pages, books.
                            String[] data = line.split("\t");
                            // Assumes rows for the same word are adjacent:
                            // a changed word gets the next id and one row in "words".
                            if (!lastWord.equals(data[0])) {
                                id++;
                                wordInsert.setLong(1, id);
                                wordInsert.setString(2, data[0]);
                                wordInsert.executeUpdate();
                            }
                            countInsert.setLong(1, id);
                            countInsert.setInt(2, Integer.parseInt(data[1]));
                            countInsert.setInt(3, Integer.parseInt(data[2]));
                            countInsert.setInt(4, Integer.parseInt(data[3]));
                            countInsert.setInt(5, Integer.parseInt(data[4]));
                            countInsert.executeUpdate();
                            lastWord = data[0];
                            // Commit every 10000 rows, report progress every 100000 rows.
                            if (i % 10000 == 0) {
                                c.commit();
                            }
                            if (i % 100000 == 0) {
                                System.out.println(i + " mark file " + filename);
                            }
                            i++;
                        }
                    }
                    // Commit whatever is left over from this file.
                    c.commit();
                }
            }
        }
    }
}
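The id bookkeeping in the import can be tried out without a database. The sketch below isolates that step under the same assumption the import makes, namely that all rows for one word sit next to each other in the file. The class `NgramRows`, the `Row` type and the sample lines are made up for illustration; they are not part of the import code or the real dataset.

```java
import java.util.ArrayList;
import java.util.List;

public class NgramRows {

    /** One parsed row; field names are illustrative and mirror the wordcounts columns. */
    static final class Row {
        final long wordId;
        final String word;
        final int year;
        final long totalCount;
        final long totalPages;
        final long totalBooks;

        Row(long wordId, String word, int year,
            long totalCount, long totalPages, long totalBooks) {
            this.wordId = wordId;
            this.word = word;
            this.year = year;
            this.totalCount = totalCount;
            this.totalPages = totalPages;
            this.totalBooks = totalBooks;
        }
    }

    /** Parses tab-separated lines and gives every new word the next id, starting at 1. */
    static List<Row> assignIds(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        String lastWord = null;
        long id = 0;
        for (String line : lines) {
            String[] f = line.split("\t");
            if (f.length != 5) {
                throw new IllegalArgumentException("Expected 5 tab-separated fields: " + line);
            }
            // Only works if rows for the same word are adjacent.
            if (!f[0].equals(lastWord)) {
                id++;
                lastWord = f[0];
            }
            rows.add(new Row(id, f[0], Integer.parseInt(f[1]),
                    Long.parseLong(f[2]), Long.parseLong(f[3]), Long.parseLong(f[4])));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the word/year/count/pages/books layout.
        List<String> sample = new ArrayList<>();
        sample.add("apple\t1999\t10\t8\t5");
        sample.add("apple\t2000\t12\t9\t6");
        sample.add("pear\t2000\t3\t3\t2");
        for (Row r : assignIds(sample)) {
            System.out.println(r.wordId + " " + r.word + " " + r.year + " " + r.totalCount);
        }
    }
}
```

The counts are parsed as `long` here purely as a precaution against overflow. If the same word showed up again later in the input, it would get a second id, so the adjacency assumption matters.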
9 Apr 2011

What this blog will be about

This blog will be about what you can do with datasets. There are a lot of huge datasets available on the internet, if you know where to look. Analyzing this data can reveal some useful, fun and interesting stuff. I'll probably take a dataset, write an introduction to it, and then spend the following weeks posting what I found in it and how I found it.

Suggestions are always welcome.

The first dataset I'm going to analyze will be posted tomorrow.