Flyleaf fetches events data from a number of London bookshops (currently four: Burley Fisher Books, Pages of Hackney, Pages Cheshire Street and Libreria), and I intend to add more as and when I have time. The column on the right is just placeholder copy for now, but it’ll probably be used for a couple of publishing news items, or maybe a recently published or forthcoming title from a press I tend to follow.

How it works

The site gathers its data by the (slightly unfortunately named) method of web scraping, ‘visiting’ chosen sites to check for listed events, then storing them as data that can be accessed and printed out to the website. I’ll run through a method that scrapes the Burley Fisher bookshop site by way of an example.

First of all, we need to locate the page where all the events are listed, in this case https://burleyfisherbooks.com/events/. Then we need to access the page using HTTP, which can be done in a number of ways, but I’m using the httparty gem as it’s the only one I’m really familiar with. We make a request to the page:

unparsed_page = HTTParty.get(url)

then, using the nokogiri gem, we parse the response from the get request so we have something to query:

parsed_page = Nokogiri::HTML(unparsed_page.body)

Now, using nokogiri’s .css method, we locate the desired information on the page in the same way it is targeted by CSS. Here I’m using Firefox’s inspector tool to find the classname of the event items:

[Screenshot: Burley Fisher events page, event item]

It might be worth making this more targeted by also specifying the container classname. Put together (unless we want to chain methods or nest classnames in a single selector), that gives us something like:

events_list = parsed_page.css("div.tribe-events-loop")
events_list_items = events_list.css("div.tribe-clearfix")

At this point we can do a check to make sure that there are items available by querying the events array with the .count method:

[1] pry(main)> events_list_items.count
=> 1

NB I’m using the pry gem as my developer console here.

Once we know that events have been detected, we use the same nokogiri .css method to target specific items of information within the event div. For example, the titles of the events are all wrapped in <h3> tags with the classname .tribe-events-list-event-title. We’ll check in a pry session that this returns what we’re looking for:

[3] pry(main)> events_list_items[0].css("h3.tribe-events-list-event-title")
=> [#<Nokogiri::XML::Element:0x3fc50fe4f87c name="h3" attributes=[#<Nokogiri::XML::Attr:0x3fc50fe4f818 name="class" value="tribe-events-list-event-title">] children=[#<Nokogiri::XML::Text:0x3fc50fe4f408 "\n\t">, #<Nokogiri::XML::Element:0x3fc50fe4f354 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fc50fe4f2f0 name="class" value="tribe-event-url">, #<Nokogiri::XML::Attr:0x3fc50fe4f2dc name="href" value="https://burleyfisherbooks.com/event/launch-minor-detail-by-abania-shibli-fitzcarraldo-editions/">, #<Nokogiri::XML::Attr:0x3fc50fe4f2c8 name="title" value="Launch: MINOR DETAIL by Abania Shibli: Fitzcarraldo Editions">, #<Nokogiri::XML::Attr:0x3fc50fe4f2b4 name="rel" value="bookmark">] children=[#<Nokogiri::XML::Text:0x3fc50fe4e88c "\n\t\tLaunch: MINOR DETAIL by Abania Shibli: Fitzcarraldo Editions\t">]>, #<Nokogiri::XML::Text:0x3fc50fe4e6e8 "\n">]>]

If it does, we can access the text with nokogiri’s .text method:

[4] pry(main)> events_list_items[0].css("h3.tribe-events-list-event-title").text
=> "\n\t\n\t\tLaunch: MINOR DETAIL by Abania Shibli: Fitzcarraldo Editions\t\n"

We only want the text, so we’ll strip any unwanted returns and tabs (this is what the \ns and \ts are all about) with Ruby’s #strip method:

[5] pry(main)> events_list_items[0].css("h3.tribe-events-list-event-title").text.strip
=> "Launch: MINOR DETAIL by Abania Shibli: Fitzcarraldo Editions"
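One thing worth knowing about #strip is that it only trims the ends of a string. The sketch below uses the title from the pry session plus a hypothetical multi-line summary (raw_summary is made up for illustration) to show one way of collapsing whitespace inside the string as well.

```ruby
raw_title = "\n\t\n\t\tLaunch: MINOR DETAIL by Abania Shibli: Fitzcarraldo Editions\t\n"

# strip removes leading and trailing whitespace (spaces, tabs, newlines).
title = raw_title.strip

# strip leaves whitespace inside the string alone, so a hypothetical
# multi-line value needs an extra step to collapse the internal runs.
raw_summary = "  Join us for the launch\n\t\tof a new novel.  \n"
summary = raw_summary.strip.gsub(/\s+/, " ")

puts title   # prints Launch: MINOR DETAIL by Abania Shibli: Fitzcarraldo Editions
puts summary # prints Join us for the launch of a new novel.
```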

Now that we’ve found the information in the console session, we can put it into our program and store the information. We’ll initialise an array to do this, then pass in the information as we iterate over the events items (in this case currently just one):

events = Array.new
events_list_items.each do |event_list_item|
  event = {
    title: event_list_item.css("h3.tribe-events-list-event-title").text.strip
  }
  events << event
end

If we then write this to a JSON file for retrieval later, we can check the information is there:

[
  {
    "title": "Launch: MINOR DETAIL by Abania Shibli: Fitzcarraldo Editions"
  }
]
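The writing step itself isn’t shown above, so here is a minimal sketch of how it could be done with Ruby’s standard json library. The helper names save_events and load_events and the file location are assumptions made for illustration, not the code running on the site.

```ruby
require "json"
require "tmpdir"

# Write the events array to disk as indented JSON.
def save_events(events, path)
  File.write(path, JSON.pretty_generate(events))
end

# Read the file back, turning the keys into symbols again.
def load_events(path)
  JSON.parse(File.read(path), symbolize_names: true)
end

events = [{ title: "Launch: MINOR DETAIL by Abania Shibli: Fitzcarraldo Editions" }]

# Hypothetical location; a real script would use a path inside the project.
path = File.join(Dir.tmpdir, "flyleaf_events.json")
save_events(events, path)

puts File.read(path)
loaded = load_events(path)
```

JSON.pretty_generate produces the indented layout shown above, and symbolize_names: true means the hash read back has the same symbol keys as the one that was written.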

Using the same method we used to locate the event title, we can find the other pieces of information we want to store. Here is the fuller set of fields extracted from the Burley Fisher site:

events = Array.new
events_list_items.each_with_index do |event_list_item, index|
  event = {
    index: index,
    bookshop: "Burley Fisher Books",
    category: "East",
    title: event_list_item.css("h3.tribe-events-list-event-title").text.strip,
    date_string: event_list_item.css("div.tribe-event-schedule-details").text.strip,
    # strptime applies the format string; DateTime.parse would treat it as its comp flag
    datetime: DateTime.strptime(event_list_item.css("div.tribe-event-schedule-details").text.strip, "%d %B @ %l:%M %P"),
    url: event_list_item.css("a.tribe-event-url")[0].attributes["href"].value,
    summary: event_list_item.css("div.tribe-events-list-event-description").text.strip.split("\n").first,
    img_src: event_list_item.css("div.tribe-events-event-image img")[0].attributes["src"].value
  }
  events << event
end
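The date handling is the part most likely to trip up. An explicit format string is applied by DateTime.strptime; DateTime.parse guesses the layout itself and takes a comp flag, not a format, as its second argument. The sketch below is one possible way to wrap that, using only Ruby’s standard date library. The helper parse_event_time, the constant name and the sample strings are all hypothetical.

```ruby
require "date"

# Format assumed from the snippet above: day, month name, "@", 12-hour time.
EVENT_TIME_FORMAT = "%d %B @ %l:%M %P"

# Returns a DateTime, or nil when the text does not match the format.
def parse_event_time(text)
  DateTime.strptime(text.strip, EVENT_TIME_FORMAT)
rescue ArgumentError
  nil
end

# Hypothetical date string; the missing year is filled in from today's date.
starts_at = parse_event_time("28 May @ 7:00 pm")
puts starts_at.strftime("%d %B, %H:%M") # prints 28 May, 19:00

# Text that does not fit the format gives nil instead of raising.
puts parse_event_time("Date to be confirmed").inspect # prints nil
```

Returning nil for text that doesn’t fit means one oddly formatted listing doesn’t stop the whole scrape, though the caller then has to decide what to do with an event that has no datetime.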

For more on parsing a response with Nokogiri, the Nokogiri documentation is a useful starting point.

All the key/value pairs that make up an event are specific to the site that is being scraped. What this means is, if the site is redesigned and the classnames or HTML structure change, the code will break and have to be rewritten to adapt to the new structure. Hopefully that won’t happen anytime soon, but you never know. In any case, for now we can put this together to create a working method:

require "httparty"
require "nokogiri"
require "date"

def burley_fisher
  base_url = "https://burleyfisherbooks.com"
  slug = "/events/"
  url = base_url + slug
  unparsed_page = HTTParty.get(url)
  parsed_page = Nokogiri::HTML(unparsed_page.body)
  events_list = parsed_page.css("div.tribe-events-loop")
  events_list_items = events_list.css("div.tribe-clearfix")
  puts "Found #{events_list_items.count} events at #{url}"
  # Initialise up front so the method returns [] (not nil) when nothing is listed
  events = []
  events_list_items.each_with_index do |event_list_item, index|
    date_string = event_list_item.css("div.tribe-event-schedule-details").text.strip
    event = {
      index: index,
      bookshop: "Burley Fisher Books",
      category: "East",
      title: event_list_item.css("h3.tribe-events-list-event-title").text.strip,
      date_string: date_string,
      # strptime applies the format string; DateTime.parse would treat it as its comp flag
      datetime: DateTime.strptime(date_string, "%d %B @ %l:%M %P"),
      url: event_list_item.css("a.tribe-event-url")[0].attributes["href"].value,
      summary: event_list_item.css("div.tribe-events-list-event-description").text.strip.split("\n").first,
      img_src: event_list_item.css("div.tribe-events-event-image img")[0].attributes["src"].value
    }
    puts "#{event[:index] + 1} #{event[:title]}"
    events << event
  end
  events
end

Update 14/04/2020 – I have stopped working on this site for the foreseeable future: because of COVID-19, most of the sites I’d planned to scrape currently have no events listed.

Written by Jamie Bowman
Last updated 3rd September 2020