Lesson 6: Filtering a Web Page

We will now combine what we learnt in the last two lessons into a Rails controller that filters out everything from a Wikipedia page except the main text content.

We start as usual with

script/generate controller wikifilter index

then edit app/views/wikifilter/index.html.erb to contain the single line:

<%= @display %>

Our job in the controller is to put something into @display. To do this, we edit app/controllers/wikifilter_controller.rb as follows:

require 'net/http'
require 'rubygems'
require 'hpricot'

class WikifilterController < ApplicationController
  def index
    w = params[:id]
    begin
      page = Net::HTTP.get('en.wikipedia.org', '/wiki/' + w)
      doc = Hpricot(page)
      bc = doc.search('#bodyContent')
      ds = bc/:p
      @display = ds.to_html
    rescue
      @display = 'Is the Internet on?'
    end
  end
end

Two lines are new. The first is w=params[:id], which puts into w the part of the URL that follows http://localhost:3000/wikifilter/index/. So if we request the URL http://localhost:3000/wikifilter/index/Ski, w will contain 'Ski'.
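To see how the last part of the URL ends up in params[:id], here is a minimal plain-Ruby sketch. It assumes the application still uses the default :controller/:action/:id route; route_params is a hypothetical helper written for this illustration, not something Rails provides.

```ruby
require 'uri'

# Hypothetical helper (not part of Rails): imitates how a
# :controller/:action/:id route splits the path of a request URL.
def route_params(url)
  controller, action, id = URI.parse(url).path.split('/').reject(&:empty?)
  { :controller => controller, :action => action, :id => id }
end

params = route_params('http://localhost:3000/wikifilter/index/Ski')
w = params[:id]
puts w
```

If the URL stops at /wikifilter/index, the :id entry is nil. In the controller, '/wiki/' + w then raises an error, which the rescue clause catches.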

The other new line is @display=ds.to_html, which puts into @display not the plain text, as we had in lesson 5, but the actual HTML of the paragraphs, with everything we do not need sliced away.
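The difference between keeping the HTML and keeping only the text can be sketched without a network connection. The snippet below is an illustration under loud assumptions: the page string is made up, and the regular expressions are a crude stand-in for Hpricot's search, used only so the example runs with nothing installed.

```ruby
# Made-up stand-in for a fetched page; not real Wikipedia markup.
page = '<div id="bodyContent"><p>A <b>ski</b> is a narrow strip.</p>' \
       '<table><tr><td>sidebar</td></tr></table><p>Second paragraph.</p></div>'

# Crude regex stand-in for selecting the <p> elements (illustration only;
# a real parser such as Hpricot copes with nesting and malformed markup).
paragraphs = page.scan(/<p>.*?<\/p>/m)

as_html = paragraphs.join              # paragraphs with their markup kept
as_text = as_html.gsub(/<[^>]+>/, '')  # the same paragraphs with tags stripped

puts as_html
puts as_text
```

The table is dropped in both cases, but only as_html keeps the paragraph breaks and the bold markup, which is why the view can render it as formatted paragraphs.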