Wikipedia Scraping Experiments

These examples show how to read a webpage and process its HTML data to extract valuable information.

National capitals by population

We want to show a pie chart that visualizes the population of each capital, defined by:
  • The name displayed on each pie slice (see how the label is constructed later)
  • The actual number of inhabitants
  • The outer width, in pixels
  • The outer height, in pixels
Let's first download the Wikipedia page that lists capitals and their populations
We use jsoup to parse the page so we can easily query data from it
Now we want to save the data we just built, so we can reuse it later
Let's display only the most populated capitals in the chart
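One way to keep only the most populated capitals is to sort by population and cut the list off after a fixed number of entries. The sketch below assumes a hypothetical `Capital` holder and a `mostPopulated` helper; neither name comes from the original code, and the sample figures are placeholders rather than scraped values.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class TopCapitals {

    /** Hypothetical data holder; the type used for a scraped row is an assumption. */
    static final class Capital {
        final String name;
        final long population;

        Capital(String name, long population) {
            this.name = name;
            this.population = population;
        }
    }

    /** Returns at most {@code limit} capitals, ordered from most to least populated. */
    static List<Capital> mostPopulated(List<Capital> capitals, int limit) {
        return capitals.stream()
                .sorted(Comparator.comparingLong((Capital c) -> c.population).reversed())
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder rows for illustration only, not real population data.
        List<Capital> capitals = List.of(
                new Capital("Alpha City", 1_200_000L),
                new Capital("Beta Town", 8_500_000L),
                new Capital("Gamma Ville", 350_000L));

        for (Capital c : mostPopulated(capitals, 2)) {
            System.out.println(c.name + ": " + c.population);
        }
    }
}
```

Limiting the number of slices keeps the pie chart readable; the cut-off value is a presentation choice and can be tuned freely.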
Prepare a custom label to use in the chart
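The exact label format is not shown here, so the following is only a guess at what such a label could look like: the capital name followed by its country in parentheses. The class `SliceLabels` and the method `label` are hypothetical names introduced for illustration.

```java
final class SliceLabels {

    /**
     * Builds a slice label such as "Paris (France)".
     * The "capital (country)" format is an assumption, not the original one.
     */
    static String label(String capital, String country) {
        if (country == null || country.trim().isEmpty()) {
            // Fall back to the capital alone when no country is known.
            return capital;
        }
        return capital + " (" + country.trim() + ")";
    }

    public static void main(String[] args) {
        System.out.println(label("Paris", "France"));
    }
}
```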
We also want to display our list of countries in a table
We add all the columns we need, starting with an index
The country flag, which we want to display as an image
The country name
The country capital
The country population, with a custom display based on a regex
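The regex used for the population column is not shown. A common choice, assumed here, is to insert thousands separators so that large numbers are easier to read. The sketch uses a zero-width pattern that matches every position preceded by a digit and followed by a whole number of three-digit groups; `PopulationFormat` and `withSeparators` are hypothetical names.

```java
import java.util.regex.Pattern;

final class PopulationFormat {

    // Matches the empty position between a digit and a run of 3, 6, 9... digits
    // that reaches the end of the string.
    private static final Pattern THOUSANDS = Pattern.compile("(?<=\\d)(?=(\\d{3})+$)");

    /** Formats a number with comma separators, e.g. 1234567 becomes "1,234,567". */
    static String withSeparators(long population) {
        return THOUSANDS.matcher(Long.toString(population)).replaceAll(",");
    }

    public static void main(String[] args) {
        System.out.println(withSeparators(1234567L));
    }
}
```

`java.text.NumberFormat` would do the same job in a locale-aware way; the regex version is shown only because the text mentions a regex-based display.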
It's time to feed the table with our previously saved list of countries
We want to display it in descending order of population

GDP (USD million) by country

Now we will do the same for GDP
  • We want to see the name of the country on each pie slice
  • The value should be the GDP (see below how it is extracted)
  • outer width, in pixels
  • outer height, in pixels
Let's download the Wikipedia page and convert it to a list of countries
We save again so we can display it in the next table
How about exporting the list to a CSV file?
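A minimal sketch of such an export, using only the Java standard library. The `CountryGdp` holder, the column names, and the output location are assumptions made for illustration; fields containing commas, quotes, or line breaks are quoted so that the file stays parseable.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class GdpCsvExport {

    /** Hypothetical row type; the original list element type is not shown. */
    static final class CountryGdp {
        final String country;
        final long gdpUsdMillion;

        CountryGdp(String country, long gdpUsdMillion) {
            this.country = country;
            this.gdpUsdMillion = gdpUsdMillion;
        }
    }

    /** Quotes a field when it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Builds the CSV content: one header line, then one line per country. */
    static List<String> toCsvLines(List<CountryGdp> rows) {
        List<String> lines = new ArrayList<>();
        lines.add("country,gdp_usd_million");
        for (CountryGdp row : rows) {
            lines.add(escape(row.country) + "," + row.gdpUsdMillion);
        }
        return lines;
    }

    /** Writes the CSV lines to the given file as UTF-8. */
    static void write(Path target, List<CountryGdp> rows) throws IOException {
        Files.write(target, toCsvLines(rows), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder rows for illustration only, not real GDP figures.
        List<CountryGdp> rows = List.of(
                new CountryGdp("Example Land", 1234L),
                new CountryGdp("Sample, Republic of", 56L));
        Path target = Files.createTempFile("gdp", ".csv");
        write(target, rows);
        System.out.println("Wrote " + target);
    }
}
```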
For the chart, we want to limit the data to countries with a GDP greater than $1M
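Filtering by a minimum GDP can be sketched as follows. Since the figures in this section are expressed in USD million, the threshold passed to the filter has to be expressed in that same unit; the helper name `above` and the map-based representation are assumptions, and the sample values are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class GdpThreshold {

    /**
     * Keeps only the entries whose GDP is strictly greater than {@code minimum}.
     * Both the map values and {@code minimum} must use the same unit.
     * Insertion order of the input map is preserved.
     */
    static Map<String, Long> above(Map<String, Long> gdpByCountry, long minimum) {
        Map<String, Long> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : gdpByCountry.entrySet()) {
            if (entry.getValue() > minimum) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only, not real GDP figures.
        Map<String, Long> gdp = new LinkedHashMap<>();
        gdp.put("Example Land", 2_500_000L);
        gdp.put("Sample Republic", 40_000L);

        System.out.println(above(gdp, 1_000_000L));
    }
}
```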
Finally, let's display the full list of countries by GDP in a table
Index of the row
Country flag as an image
Country name
Actual GDP, formatted
Time to load the list to feed the table
We want it sorted by GDP, obviously