How to code a Scrapy spider to grab differently formatted HTML tables - python

I've used scrapy before, but only scraping information from one site. I want to use scrapy to grab information from directories on different sites. On each of these sites the information is stored in a simple html table, with the same titles. How do I calibrate scrapy to grab data from each html table even if the table classes may differ from site to site? On a larger scale, what I'm asking is how to use scrapy when I want to hit different websites that may be formatted differently. I'll include below pictures of the html source and xpaths of several of the sites.
The fields of the table, more or less the same for each site directory
The XPath for site 1 for the name column
The XPath for site 2 for the name column
General HTML formatting of site 1, with the phone number blurred out
General HTML formatting of site 2
General formatting for a third site, which is different from the first two but still a table with 4 columns

Yes, it's a bit of a pain to have to write a spider for every site, especially if there are hundreds of them and the Items are the same for all of them.
If it fits your needs, you might like to store the XPaths for each site in a file, e.g. a CSV file. Then you can fetch the URLs and expressions from the CSV and use them in your spider (adapted from an existing example):
import csv
from scrapy import Request

# inside your spider class:
def start_requests(self):
    with open(getattr(self, "file", "todo.csv")) as f:
        reader = csv.DictReader(f)
        for line in reader:
            request = Request(line.pop('url'))
            request.meta['fields'] = line
            yield request
def parse(self, response):
    xpath = response.meta['fields']['tablexpath']
    # ... use the xpath to extract your table
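For illustration, a todo.csv for this setup might look something like the lines below. The sites and XPath expressions are made-up placeholders; only the url column is assumed by the snippet above, and the remaining columns (here tablexpath) are whatever expressions you choose to store:
url,tablexpath
http://site1.example/directory,//table[@class="member-list"]
http://site2.example/members,//table[@id="directory"]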
If you need to release your spider to e.g. Scrapyd or Scrapinghub, you will need to package your .csv file along with your code. To do so, you will have to edit the setup.py that shub deploy or scrapyd-client generates and add:
setup(
    ...
    package_data={'myproject': ['my_csv.csv']}
)
Also in your spider, instead of opening your file directly with open, you should use this:
from pkg_resources import resource_stream
f = resource_stream('myproject', 'my_csv.csv')
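A minimal sketch of feeding that stream into csv.DictReader (assuming Python 3, where the binary stream needs to be wrapped as text first):

import csv
import io
from pkg_resources import resource_stream

# resource_stream returns a binary file object, so wrap it before handing it to csv
f = io.TextIOWrapper(resource_stream('myproject', 'my_csv.csv'), encoding='utf-8')
reader = csv.DictReader(f)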
If you don't deploy your spider, just ignore the above. If you do, this will save you a few hours of debugging.

I did that by creating a scrapy project with one spider per site and using the same item class for all the different spiders.

Related

How to use one bot for different websites

I want to scrape 2 different websites. One of them is plain HTML and the other one uses JavaScript (for which I need Splash to scrape it).
So I have several questions about it:
Can I scrape two different types of websites with only one bot (an HTML one and a JavaScript one)? I did two HTML websites before and it worked, but I wonder if this also works when one of them uses JavaScript.
If the first point is possible, can I export the JSON separately? Like output1.json for url1 and output2.json for url2?
As you can see from my code, the code needs to be edited and I don't know how to do that when two different types of websites need to be scraped.
Is there any tool in Scrapy to compare JSON? (The two websites have almost the same content. I want to make output1.json the base and check whether some values in output2.json are different or not.)
My code:
import scrapy

class MySpider(scrapy.Spider):
    name = 'mybot'
    allowed_domains = ['url1', 'url2']

    def start_requests(self):
        urls = (
            (self.parse1, 'url1'),
            (self.parse2, 'url2'),
        )
        for callbackfunc, url in urls:
            yield scrapy.Request(url, callback=callbackfunc)
            # in fact url2 is the JavaScript website, so I clearly need Splash here

    def parse1(self, response):
        pass

    def parse2(self, response):
        pass
Yes, you can scrape more than one site with the same spider, but it doesn't make sense if they are too different. You have already figured out the way to do it: allowed_domains and start_requests (or start_urls). However, exporting to different files won't be straightforward; you will have to write your own export code (sketched below).
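One way to do that is a custom item pipeline that writes one JSON-lines file per domain. A minimal sketch, assuming each item carries the URL it came from in a 'url' field (that field name and the output file naming are my own choices, not something Scrapy provides):

import json
from urllib.parse import urlparse

class PerSiteJsonPipeline:
    """Writes each item to output_<domain>.jl based on the item's 'url' field."""

    def open_spider(self, spider):
        self.files = {}

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        # assumes the spider stores the source URL on the item
        domain = urlparse(item['url']).netloc or 'unknown'
        if domain not in self.files:
            self.files[domain] = open('output_%s.jl' % domain, 'w')
        self.files[domain].write(json.dumps(dict(item)) + '\n')
        return item

Enable it through ITEM_PIPELINES in settings.py.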
IMHO having one spider per site is the way to go. If they share some code, you can have a base spider class from which your spiders inherit, as in the sketch below.
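A minimal sketch of that idea; the class names, XPaths and URLs are only illustrative:

import scrapy

class BaseDirectorySpider(scrapy.Spider):
    # shared helpers live here; site spiders override only what differs

    def parse_row(self, row):
        # hypothetical shared extraction for one table row
        return {
            'name': row.xpath('.//td[1]//text()').get(),
            'phone': row.xpath('.//td[2]//text()').get(),
        }

class Site1Spider(BaseDirectorySpider):
    name = 'site1'
    start_urls = ['http://site1.example/members']

    def parse(self, response):
        for row in response.xpath('//table//tr'):
            yield self.parse_row(row)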
And regarding the JavaScript site you mentioned, are you sure you cannot request its API directly?

Scrapy: How to reproduce results without downloading the html again?

Having downloaded the HTML to my hard disk with Scrapy (e.g., using the built-in Item Exporters with an HTML field, or storing all HTML files in a folder), how can I use Scrapy to read the data from my hard disk again and execute the next step in the pipeline? Is there something like an Item Importer?
If the HTML pages are stored on the local PC that you run Scrapy from, you can scrape URIs like:
file:///tmp/page1.html
using Scrapy. In this example, I assume one such page is stored in the file /tmp/page1.html.
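A minimal sketch of a spider that does this; the /tmp/*.html pattern matches the example path above and the CSS selector is a placeholder:

import glob
import scrapy

class LocalHtmlSpider(scrapy.Spider):
    name = 'local_html'
    # point start_urls at the saved pages via file:// URIs
    start_urls = ['file://' + path for path in glob.glob('/tmp/*.html')]

    def parse(self, response):
        yield {'title': response.css('h1.title::text').get()}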
The second option is to read the content of the files by whatever means and manually build a Selector object like this:
import scrapy

# read the content of the page into the page_content variable
with open('/tmp/page1.html', encoding='utf-8') as f:
    page_content = f.read()

root_sel = scrapy.Selector(text=page_content)
You can then normally process the root_sel selector, e.g.
title = root_sel.css('h1.title').extract_first()

Scraping specific elements from page

I am new to python, and I was looking into using scrapy to scrape specific elements on a page.
I need to fetch the Name and phone number listed on a members page.
This script will fetch the entire page; what can I add/change to fetch only those specific elements?
import scrapy
class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["fali.org"]
    start_urls = [
        "http://www.fali.org/members/",
    ]

    def parse(self, response):
        filename = response.url.split("/?id=")[-2] + '%random%'
        with open(filename, 'wb') as f:
            f.write(response.body)
I cannot see a page at:
http://www.fali.org/members/
Instead it redirects to the home page, and that makes it impossible to give specifics.
Here is an example:
article_title = response.xpath("//td[@id='HpWelcome']/h2/text()").extract()
That parses "Florida Association of Licensed Investigators (FALI)" from their homepage. You can get browser plugins to help you figure out XPaths; XPath Helper on Chrome makes it easy.
That said, go through the tutorials posted above, because you are going to have more questions, I'm sure, and broad questions like this aren't received well on Stack Overflow.
As shark3y states in his answer, the start_url gets redirected to the main page.
If you have read the documentation, you should know that Scrapy starts scraping from start_urls and does not know what you want to achieve.
In your case you need to start from http://www.fali.org/search/newsearch.asp, which returns the search results for all members. Now you can set up a Rule that goes through the result list, calls a parse_detail method for every member found, and follows the links through the result pagination (see the sketch below).
In the parse_detail method you can go through the member's page and extract every piece of information you need. I guess you do not need the whole page, as you save it in the example in your question, because that would generate a lot of data on your computer and you would have to parse it anyway.
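A rough sketch of that setup; the link-extractor patterns and XPaths are guesses and would need to be adapted to the site's real markup:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FaliSpider(CrawlSpider):
    name = 'fali'
    allowed_domains = ['fali.org']
    start_urls = ['http://www.fali.org/search/newsearch.asp']

    rules = (
        # hypothetical pattern for member detail pages
        Rule(LinkExtractor(allow=r'/members/\?id='), callback='parse_detail'),
        # follow the result pagination
        Rule(LinkExtractor(allow=r'newsearch\.asp')),
    )

    def parse_detail(self, response):
        # placeholder XPaths for the member page
        yield {
            'name': response.xpath('//h1/text()').get(),
            'phone': response.xpath("//td[contains(., 'Phone')]/following-sibling::td/text()").get(),
        }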

python - pull pdfs from webpage and convert to html

My goal is to have a python script that will access particular webpages, extract all pdf files on each page that have a certain word in their filename, convert them into html/xml, then go through the html files to read data from the pdfs' tables.
So far I have imported mechanize (for browsing the pages/finding the pdf files) and I have pdfminer, however I'm not sure how to use it in a script to perform the same functionality it does on the command line.
What is the most effective group of libraries for accomplishing my task, and how would you recommend approaching each step? I apologize if this is too specific for stackoverflow, but I'm having trouble using google searches and sparse documentation to piece together how to code this. Thanks!
EDIT:
So I've decided to go with Scrapy on this one. I'm really liking it so far, but now I have a new question. I've defined a PDFItem() class to use with my spider, with fields title and url. I have a selector that's grabbing all the links I want, and I want to go through those links and create a PDFItem for each one. Here's the code I have below:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
item = PDFItem()
for link in links:
    item['title'] = link.xpath('/text()')
    item['url'] = URL + link.xpath('@href').extract()[0]
The url line works well, but I don't really know how to do the same for title. I guess I could just repeat the query at the top with '/text()' added to the end of the selector, but that seems excessive. Is there a better way to go through each link object in the links array and grab both the text and the href value?
I would use Scrapy. Scrapy is the best tool for crawling an entire website and generating a list of all PDF links. A spider like this would be very easy to write. You definitely don't need Mechanize.
After that, I would use Poppler to convert each PDF to HTML. It's not a Python module, but you can use the command pdftohtml. In my experience, I've had better results with Poppler than PDFMiner.
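For example, calling pdftohtml from Python could look roughly like this; it assumes Poppler is installed and on the PATH, and the file names are placeholders:

import subprocess

# convert one downloaded PDF into a single HTML file without frames
subprocess.run(['pdftohtml', '-noframes', 'report.pdf', 'report.html'], check=True)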
Edit:
links = sel.xpath('//a[contains(@href, "enforcementactions.pdf") and contains(@class, "titlelink")]')
for link in links:
    item = PDFItem()
    item['title'] = link.xpath('text()').extract()[0]
    item['url'] = URL + link.xpath('@href').extract()[0]
In order to browse and find PDF links on a webpage, an HTTP library should suffice. Mechanize, as its documentation suggests, is used to automate interactions with a website; given your description, I find it unnecessary.
PDFMiner's pdf2txt.py converts a PDF to HTML, so you need to invoke this program as a subprocess in your script to create the output HTML files.
So the libraries you need are an HTTP library, like Requests, and PDFMiner.
The workflow of your script will be something like:
import os
import requests
from subprocess import Popen
...
r = requests.get(<url-which-has-pdf-links>)
# Do a search for pdf links in r.text
...
for pdf_url in pdf_links:
    # get the PDF content and save it to a local temp file
    ...
    # Build the command line parameters, the way pdf2txt expects
    # Invoke PDFMiner's pdf2txt on the created file as a subprocess
    Popen(cmd)
See the Requests documentation for saving the PDF content to a local file, and the subprocess documentation for more on running programs as subprocesses.
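A more concrete sketch under the same assumptions; the listing URL, the link pattern and the file names are all placeholders, and pdf2txt.py must be available from your pdfminer install:

import re
import subprocess
import requests
from urllib.parse import urljoin

# hypothetical page that lists the PDF links
page = requests.get('http://example.com/enforcement-actions')
pdf_links = re.findall(r'href="([^"]+enforcementactions[^"]*\.pdf)"', page.text)

for i, pdf_url in enumerate(pdf_links):
    # links may be relative, so join them with the page URL
    pdf_url = urljoin(page.url, pdf_url)
    pdf_path = 'report_%d.pdf' % i
    # save the PDF content to a local file
    with open(pdf_path, 'wb') as f:
        f.write(requests.get(pdf_url).content)
    # run PDFMiner's pdf2txt.py as a subprocess to produce HTML
    subprocess.run(['pdf2txt.py', '-t', 'html', '-o', pdf_path.replace('.pdf', '.html'), pdf_path],
                   check=True)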

how to crawl a file hosting website with scrapy in python?

Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the files hosted, just to index all available files with Scrapy.
I have read the tutorial and docs with respect to the spider class for Scrapy. If I only give the website's main page as the beginning URL, I won't crawl the whole site, because the crawling depends on links, and the beginning page doesn't seem to point to any file pages. That's the problem I am thinking about, and any help would be appreciated!
I have two pieces of advice. The first is to ensure that you are using Scrapy correctly, and the second pertains to the best way to collect a larger sample of the URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority of the links that each page has.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy is encountering, then there is no way that Scrapy can know about them.
One way to counter this would be to use multiple entry points on the website, in the areas you know Scrapy is having difficulty finding. You can do this by putting multiple initial URLs in the start_urls variable, as in the sketch below.
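A minimal sketch of that; the extra section URLs and the /file/ pattern are hypothetical and should be replaced with areas of the site you actually know about:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FileIndexSpider(CrawlSpider):
    name = 'fileindex'
    allowed_domains = ['filefactory.com']
    # several entry points, so the crawl is not limited to what the main page links to
    start_urls = [
        'http://www.filefactory.com/',
        'http://www.filefactory.com/section-a/',
        'http://www.filefactory.com/section-b/',
    ]

    rules = (
        # record pages that look like file pages, then keep following links everywhere
        Rule(LinkExtractor(allow=r'/file/'), callback='parse_file', follow=True),
        Rule(LinkExtractor(), follow=True),
    )

    def parse_file(self, response):
        yield {'url': response.url}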
Second:
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go onto Google, and type site:www.filefactory.com , you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com because there are some canonicalization issues. Now, when I did this, I saw that there were around 600,000 pages indexed. What you should do is crawl Google, and collect all of these indexed urls first, and store them in a database. Then, use all of these to seed further searches on the FileFactory.com website.
Also
If you have a membership to Filefactory.com, you can also program scrapy to submit forms or sign in. Doing this might allow you even further access.
