Using “separators and labels” in Outwit Hub pro – open notes

N.B. Separators and labels are only available to users of Outwit Hub Pro. Similar, and probably better, results can be achieved with regex (regular expressions), so I would try that if you have plenty of time before covering an event. But if you are like me (with limited regex knowledge) and want this information quickly, separators and labels are incredibly useful.

I wanted all the names of all the candidates in the Lincolnshire election. The council’s website includes this snazzy interactive map with all the information you want but it requires a lot of clickthroughs and copying and pasting if you were to get each name manually.

Naturally my first inclination was scraping with Outwit Hub. This is how I got on:

1. Click on any part of the map – I went for East Lindsey, a district bigger than the counties of Surrey, Hertfordshire, and Buckinghamshire.

Screen Shot 2013-05-03 at 23.25.36

Cripes.

2. You’ll see that you get a page with another map. Click on one of the wards within it. I went for Louth Wolds.

Screen Shot 2013-05-03 at 23.39.58

3. Now you can boot up Outwit Hub. Copy the link URL: http://www.lincolnshire.gov.uk/ElectionsResultsDetail.aspx?division=6&locationGroup=144 and paste it into the browser bar. The same page should come up. In the left-hand column, underneath “automators”, is an option called “scrapers”. Click that and you should get this:
Screen Shot 2013-05-04 at 00.11.54

That’s HTML. Most of you will know that it is the code that tells the browser what to display on the page. If you don’t know anything about HTML, don’t panic. You really don’t need to know what any of that code does for the next part. All you need to know is that parts of it contain the bits of text that you want.
Screen Shot 2013-05-04 at 00.27.54

Beneath the code is something that looks like this. As I have done, write “ward” in row 1, column 1, just to try it out.

6. You need to find the bits of the page that you want. We want the name of the ward, the names of all the candidates, their parties, their results and the overall turnout. Let’s start with the ward. Press CMD/CTRL+F to bring up the search box and look up “Louth Wolds” (without the inverted commas).

This will be what comes up first:

Screen Shot 2013-05-04 at 00.23.25

Here Louth Wolds does not stand out among the other wards, and it should, because it is the subject of the page. This is not the occurrence we want, so press next on the search.

Screen Shot 2013-05-04 at 00.25.44

That looks like the Louth Wolds you can see on the page, prefaced by “Electoral Division Map” and sitting above the election results themselves.

7. As you can see, at the bottom of the screen there are columns labelled name (where you should have written ward), marker before and marker after. If you put one bit of HTML in marker before and another bit in marker after, Outwit will scrape everything between the two. Powerful stuff.

Scraping is about knowing exactly which bit of the page you want. As all the electoral ward pages look pretty similar, the same markers should pick out the right bit of information on every page.

Here we want the ward name first – as you can see what comes before the “Louth Wolds” that we want is the following:

<div class="flash container sleeve">
<h2>Electoral Division Map -

Copy and paste that from the HTML (not from here) into the “marker before” column.

You generally don’t need to be as precise about the “marker after” but, just to be safe, do the same and copy the

</h2>
<div class="elecflash">

that comes after Louth Wolds.

8. If you press “Execute” now you should just get the name “Louth Wolds”. Minor success!
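If you are curious what the marker before and marker after idea amounts to, here is a minimal Python sketch of it. This is not how Outwit Hub is implemented; the function name `scrape_between` and the `sample_html` fragment are my own illustrations, modelled on the markup quoted above, and I use only the `<h2>` line as the marker before to keep it short.

```python
from typing import Optional


def scrape_between(html: str, marker_before: str, marker_after: str) -> Optional[str]:
    """Return the text between the first marker_before and the next marker_after."""
    start = html.find(marker_before)
    if start == -1:
        return None  # marker before is not on this page
    start += len(marker_before)
    end = html.find(marker_after, start)
    if end == -1:
        return None  # marker after never appears after marker before
    return html[start:end].strip()


# Hypothetical fragment modelled on the markup quoted in step 7.
sample_html = (
    '<div class="flash container sleeve">\n'
    '<h2>Electoral Division Map - Louth Wolds</h2>\n'
    '<div class="elecflash">'
)

ward = scrape_between(sample_html, "<h2>Electoral Division Map -", "</h2>")
print(ward)  # Louth Wolds
```

The function returns `None` when either marker is missing, which is a handy way to spot a page whose layout differs from the one the markers were copied from.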

9. Now, let’s find the winners and their percentage. Search the page for “Marfleet” (i.e. H. Marfleet, the winner of the seat).

Screen Shot 2013-05-04 at 08.33.21

There he is!

But this is where it gets a bit tricky. As you can see, the code around B.P. Burnett is more or less identical to the code around H. (Hugo) Marfleet. If we want Marfleet’s party, percentage etc. then we need to use another feature of Outwit Hub: separators and labels. If you’ve used delimiters in Excel, they work in a similar way, splitting the information into the columns that you want. Start off by naming your second scraper row “Candidate”.
Put the bit of code that comes before the candidates start in marker before:

<li="first">
<strong>

Look at the bottom of the list of candidates and you should find this code:

<li>
</ul>
</div>
</div>
</div>
</div>
<div id="elec_map">

Paste that into marker after.

What Outwit Hub is now doing is looking at all the code between the marker before and the marker after. The only thing that reliably separates one candidate from the next is an <li> (list item) tag, which you can see on code line 1294 in the picture above.

Simply put <li> in the separator column.

Labels name each column that the separator creates. As the candidates are listed in descending order of votes, we can put

“winner”, “runner up”

Press execute, name your scraper and you should have something like this:

Screen Shot 2013-05-04 at 10.52.15

Success! We have all the information we want for each candidate, in a form that we can clean up later. I also got the overall turnout and electorate, but that is up to you.
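For anyone who thinks better in code, the separator and labels step can be sketched in Python roughly as follows. This is only an illustration of the idea, not Outwit Hub’s actual behaviour: `split_with_labels` is a name I made up, the candidate text is placeholder data rather than real results, and the fallback naming for pieces without a label is my own assumption.

```python
from typing import Dict, List


def split_with_labels(block: str, separator: str, labels: List[str]) -> Dict[str, str]:
    """Split block on separator and give each non-empty piece a label."""
    pieces = [piece.strip() for piece in block.split(separator)]
    pieces = [piece for piece in pieces if piece]  # drop empty leftovers
    row = {}
    for index, piece in enumerate(pieces):
        # Fallback naming is my own choice, not documented Outwit behaviour.
        label = labels[index] if index < len(labels) else f"column {index + 1}"
        row[label] = piece
    return row


# Made-up placeholder text, not real election results.
sample_block = "Candidate A, Party X, 60%<li>Candidate B, Party Y, 40%<li>"

row = split_with_labels(sample_block, "<li>", ["winner", "runner up"])
print(row["winner"])     # Candidate A, Party X, 60%
print(row["runner up"])  # Candidate B, Party Y, 40%
```

Because the labels are matched to the pieces by position, this only makes sense when every page lists the candidates in the same order, which is the case here as they are sorted by votes.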

The next bit is pretty specific to the Lincolnshire page, but read on if you would like to find out how I found a workaround.

10. Getting all the links I needed was a little trickier than first anticipated. If you had a normal page with a series of links in it, you should be able to pull out those links and scrape them automatically. Unfortunately the map is an animation, and the links to each individual page are not available in the HTML. However, there is a way:

If you search the original page we were scraping for one of the other wards, such as “Boston Coastal”, you will see it underneath some HTML saying “locationid”, with each ward having an option value. So Boston Coastal’s option value is <option value="99">.

Look at the URL we had originally:

http://www.lincolnshire.gov.uk/ElectionsResultsDetail.aspx?division=6&locationGroup=144

Change that to simply:

http://www.lincolnshire.gov.uk/ElectionsResultsDetail.aspx?locationGroup=144

And you get the same result! Exciting. But what is more exciting is that the location ids are sequential from 94 to 170, so swapping the 144 for each id in turn gives the URL of every ward page. I put a list of URLs together (using Excel and concatenate, but you can choose your own poison) and then published it on this website.
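If Excel is not your poison, the same list of URLs could be put together with a few lines of Python. This is a sketch under assumptions: it takes the `locationGroup` parameter and the 94 to 170 range from the text above, treats both ends of the range as inclusive, and the names `build_urls` and `as_html_links` are mine.

```python
from html import escape
from typing import List

BASE_URL = "http://www.lincolnshire.gov.uk/ElectionsResultsDetail.aspx"


def build_urls(first_id: int, last_id: int) -> List[str]:
    """Build one results URL per location id, both ends included."""
    return [f"{BASE_URL}?locationGroup={i}" for i in range(first_id, last_id + 1)]


def as_html_links(urls: List[str]) -> str:
    """Turn the URLs into a bare list of HTML links, one per line."""
    return "\n".join(
        f'<a href="{escape(url, quote=True)}">{escape(url)}</a><br>' for url in urls
    )


urls = build_urls(94, 170)
print(urls[0])
print(urls[-1])
print(as_html_links(urls[:2]))
```

Saving the output of `as_html_links` as a web page would give Outwit Hub a page of ordinary links to work from, which is the role the published list plays in this walkthrough.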

I then put the web page I had created into Outwit Hub and, in the left-hand menu column, selected “links”.

Screen Shot 2013-05-04 at 11.31.03

You get quite a few URLs, but make good use of the “Catch” option at the bottom of Outwit Hub and put in:

“ElectionsResultsDetail.aspx”

All the links you want should now be selected in a lovely shade of lime green. Right click on them, press fast-scrape and apply the scraper you created earlier.
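As far as this walkthrough shows, the “Catch” box works like a simple text filter on the list of links. A minimal Python sketch of that idea, assuming a plain substring match (Outwit Hub may well do something more sophisticated), with a made-up function name `catch` and a hypothetical list of links:

```python
from typing import List


def catch(links: List[str], pattern: str) -> List[str]:
    """Keep only the links that contain pattern."""
    return [link for link in links if pattern in link]


# Hypothetical list of links; only the first URL appears in the post itself.
links = [
    "http://www.lincolnshire.gov.uk/ElectionsResultsDetail.aspx?locationGroup=144",
    "http://www.lincolnshire.gov.uk/ElectionsResultsDetail.aspx?locationGroup=99",
    "http://example.com/about",
]

wanted = catch(links, "ElectionsResultsDetail.aspx")
print(len(wanted))  # 2
```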

Now make a cup of tea while all 76 are scraped. Clean ’em up, put ’em together and here they are:

Get the data

…Well I had to clear it up a bit

2 thoughts on “Using “separators and labels” in Outwit Hub pro – open notes”

  1. A handy tip I picked up the other day – if you want to import a list of URLs into Outwit Hub, you don’t necessarily need to save them on a webpage somewhere (I did this with 11,000 URLs in a public Google Doc and it repeatedly crashed my browser).

    You can save an excel document of URLs as a .html web page, then open that up in Outwit Hub and click on the ‘tables’ section on the left to import them natively.

  2. Thanks for this great tutorial. Thanks to you, I discovered OutWit a few months ago and I have become a total addict. I have found that there are many ways to import data or URLs in OutWit Hub. You don’t really have to save them as HTML. My two favorites are the simplest: 1) drag them from any application (text editor or other) to the Hub, then, without releasing, to the queries tab, then, when the query editor appears, drop them in it. 2) save them as a text or csv file and open the file with the Hub.
