Scrapit is an API for scraping webpages for keywords. With Scrapit you can extract the important keywords from a webpage, that is, the ones most relevant to the page being scraped. Scrapit is built on Python, since Python has some great libraries for HTML and text parsing.
Scrapit uses lxml along with BeautifulSoup for processing and parsing HTML.
Using lxml gives a significant increase in speed.
It also makes use of Topia.termextract to extract keywords from the heaps of text on a webpage and to filter out stopwords.
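To give a rough idea of the kind of pipeline described above (pull the text out of the HTML, drop stopwords, count what is left), here is a minimal sketch. It is not Scrapit's actual code: it deliberately uses only the Python standard library (`html.parser` and `collections.Counter`) as a stand-in for lxml, BeautifulSoup and Topia.termextract, and the names `TextExtractor`, `extract_keywords`, `min_occurrences` and the tiny `STOPWORDS` set are all made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption, not the list Scrapit uses).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_keywords(html, min_occurrences=1):
    """Return {word: count} for non-stopwords seen at least min_occurrences times."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Naive tokenisation: lowercase runs of ASCII letters only.
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return {word: n for word, n in counts.items() if n >= min_occurrences}
```

Calling `extract_keywords(html, min_occurrences=2)` keeps only the words that appear more than once, which is roughly the idea behind the `occurs` parameter described below. A real term extractor does considerably more than this simple word count.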
Using the API:
You need to make calls to
http://scrapit.herokuapp.com/q/?q={url}
Parameters:
- q: (required) URL to be fetched
- occurs: (optional) Returns only the words that are repeated more than once on the webpage. Set to '1' to enable it.
- pretty: (optional) Pretty-prints the response. Set to '1' to enable it.
Examples:
- http://scrapit.herokuapp.com/q/?q=http://imdb.com
- http://scrapit.herokuapp.com/q/?q=http://imdb.com&pretty=1
- http://scrapit.herokuapp.com/q/?q=http://imdb.com&pretty=1&occurs=1
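If you are calling the API from Python, the request URLs above can be put together with the standard library. This is only a sketch: the helper name `build_scrapit_url` and its keyword arguments are hypothetical, and it percent-encodes the target URL in `q`, on the assumption that the server decodes query parameters in the usual way (the examples above pass the URL unencoded).

```python
from urllib.parse import urlencode

# Endpoint taken from the examples above.
BASE_URL = "http://scrapit.herokuapp.com/q/"


def build_scrapit_url(page_url, pretty=False, occurs=False):
    """Build a Scrapit request URL for page_url (hypothetical helper)."""
    params = {"q": page_url}  # required: the URL to be fetched
    if pretty:
        params["pretty"] = "1"  # optional: pretty-print the response
    if occurs:
        params["occurs"] = "1"  # optional: only words repeated on the page
    # urlencode percent-encodes the values, so "http://imdb.com"
    # becomes "http%3A%2F%2Fimdb.com" in the query string.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_scrapit_url("http://imdb.com", pretty=True, occurs=True))
```

The resulting string can then be passed to `urllib.request.urlopen` or any other HTTP client to fetch the response.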
(Please note that the API is still under development, so the results might not be as expected.)