The British Library has a large number of high-quality book scans available for download. Unfortunately, their tool for downloading more than one page at a time does not work, so I have been trying to write a Python script with the Requests module that downloads every page of a given book.
Every page's JPG has its own URL: in this case, the first page is https://api.bl.uk/image/iiif/ark:/81055/vdc_000000038900.0x000001/full/2306,/0/default.jpg and the second is https://api.bl.uk/image/iiif/ark:/81055/vdc_000000038900.0x000002/full/2306,/0/default.jpg. Extrapolating from the first nine pages (in this example, the book is 456 pages long), I naively created the following script:
import requests

base_url = "https://api.bl.uk/image/iiif/ark:/81055/vdc_000000038900.0x0000"
for i in range(1, 456):
    target_url = base_url + str(i) + "/full/2306,/0/default.jpg"
    r = requests.get(target_url)
    with open('bl_' + str(i) + '.jpg', 'wb') as f:
        f.write(r.content)
    print(target_url)
This worked for the first 9 pages, but unfortunately, pages 10-15 are not 0000010-0000015, but 00000A-00000F. And the complications do not end here: pages 16-25 are 10-19, but with one leading zero fewer (likewise three-digit numbers lose two zeros, and so on). After that, pages 26-31 are 1A-1F, pages 32-41 are 20-29, and pages 42-47 are 2A-2F. This pattern continues for as long as it can: up to page 159, which is 9F. After this, in order to remain in two digits, the pattern changes: pages 160-169 are A0-A9, pages 170-175 are AA-AF, pages 176-191 are B0-BF, and so on until page 255, which is FF. After this, pages 256-265 are 100-109, pages 266-271 are 10A-10F, pages 272-281 are 110-119, pages 282-287 are 11A-11F, and so on until page 415, which is 19F. After this, pages 416-425 are 1A0-1A9, pages 426-431 are 1AA-1AF, pages 432-441 are 1B0-1B9, and so on until page 456, the final page of the book.
Evidently there is an algorithm generating this sequence according to certain parameters. Just as evidently, these parameters can be incorporated into the Python script I am trying to create. Sadly, my meagre coding knowledge was more than exhausted by the modest scriptlet above. I hope anyone here can help.
Replacing str(i) with f"{i:06X}" should give the correct numbering: the page index in hexadecimal, zero-padded to six digits (uppercase X to match the A-F pages you describe; use 06x if the server expects lowercase). Note that base_url then needs to end at "0x", without the extra zeros, since the format string supplies all six digits.
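For completeness, a minimal sketch of the whole loop with that change applied. It assumes the URL pattern shown above holds for every page and keeps the 2306, width parameter as-is:
import requests

base_url = "https://api.bl.uk/image/iiif/ark:/81055/vdc_000000038900.0x"
for i in range(1, 457):  # pages 1..456 inclusive
    # six zero-padded uppercase hex digits, e.g. 000001, 00000A, 0001C8
    target_url = base_url + f"{i:06X}" + "/full/2306,/0/default.jpg"
    r = requests.get(target_url)
    r.raise_for_status()  # fail loudly if a page is missing
    with open(f"bl_{i}.jpg", "wb") as f:
        f.write(r.content)
    print(target_url)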
My Python program is pulling data from a website from inside a subprocess. This is working properly.
url = 'https://www.website.com/us/{0}/recent/kvc-4020_120/'.format(zipCode)
However, depending on the zip code, the website may have multiple pages of results. When this occurs, the URL takes the form:
https://www.website.com/us/ZIPCODE/recent/kvc-4020_120?sortId=2&offset=48
In this case, ?sortId=2&offset= stays constant. My question is: how can I change the URL automatically, as if I were manually clicking through to the next page? The only thing that changes is the offset, which increases by 24 each page. Example:
Page 1, /recent/kvc-4020_120
Page 2, /recent/kvc-4020_120?sortId=2&offset=24
Page 3, /recent/kvc-4020_120?sortId=2&offset=48
etc etc.
This can reach up to 150 pages at most. I'm just unsure how to handle the page 1 URL versus anything past page 1.
After pulling from the website, I write to a txt file. I want to automatically check whether there is a next page; if there is, change the URL and repeat the process. If there is no next page, move on to the next zip code.
A for loop:
for i in ['/recent/kvc-' + str(y) + '_120'
          if x == 0 else
          '/recent/kvc-' + str(y) + '_120?sortId=2&offset=' + str(x)
          for x in range(0, 48, 24) for y in range(4000, 5000)]:
    your_function('web_prefix' + i)
Where:
range(0, 48, 24) # increment to 48 by 24 (just an example)
range(4000, 5000) # Assumed range of Postcodes
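A sketch of the stopping logic described in the question (paginate until a page comes back empty, then move on to the next zip code). has_results and save_results are hypothetical placeholders for the parsing and file-writing code that isn't shown:
import requests

def scrape_zip(zipCode):
    base = 'https://www.website.com/us/{0}/recent/kvc-4020_120'.format(zipCode)
    offset = 0
    while offset < 150 * 24:  # the site tops out around 150 pages
        # page 1 has no query string; later pages add ?sortId=2&offset=24, 48, ...
        url = base if offset == 0 else base + '?sortId=2&offset={0}'.format(offset)
        html = requests.get(url).text
        if not has_results(html):   # hypothetical check for an empty results page
            break                   # no next page: move on to the next zip code
        save_results(html)          # hypothetical: append results to the txt file
        offset += 24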
I was wondering if there is any way, using Tika from Python, to parse only the first page or extract the metadata from the first page only? Right now, when I pass in the PDF, it parses every single page.
I looked at this link: Is it possible to extract text by page for word/pdf files using Apache Tika?
However, that link explains things in Java, which I am not familiar with. I was hoping there could be a Python solution for it? Thanks!
from tika import parser
# running: java -jar tika-server1.18.jar before executing code below.
parsedPDF = parser.from_file('C:\\path\\to\\dir\\sample.pdf')
fulltext = parsedPDF['content']
metadata_dict = parsedPDF['metadata']
title = metadata_dict['title']
author = metadata_dict['Author'] # captures author names from all pages (say 15); I just want it to capture from the first page
pages = metadata_dict['xmpTPg:NPages']
Thanks for this info, really helpful. Here is my code to retrieve the content page by page (a bit dirty, but it works):
from tika import parser

def get_text_pages(file):
    # returns the document's text split into per-page chunks
    raw_xml = parser.from_file(file, xmlContent=True)
    body = raw_xml['content'].split('<body>')[1].split('</body>')[0]
    body_without_tag = body.replace("<p>", "").replace("</p>", "").replace("<div>", "").replace("</div>", "").replace("<p />", "")
    text_pages = body_without_tag.split("""<div class="page">""")[1:]
    num_pages = len(text_pages)
    if num_pages == int(raw_xml['metadata']['xmpTPg:NPages']):  # check if it worked correctly
        return text_pages
Following @Gagravarr's comments regarding XHTML, I found that Tika has an xmlContent option when reading the file. I used it to capture the XML output and then used a regex to extract what I needed.
This worked out for me:
parsed_data_full = parser.from_file(file_name,xmlContent=True)
parsed_data_full = parsed_data_full['content']
Each page divider starts with the first occurrence of "<div" and ends with "</div>". Basically, I wrote a small piece of code to capture the substring between those two substrings and stored it in a variable for my specific requirement.
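Putting the two answers above together, here is a rough sketch that asks Tika for XHTML and keeps only the first page. Note that Tika still parses the whole file; this just trims the result afterwards, and it assumes the page divider really is <div class="page"> as in the answer above:
from tika import parser

def first_page(file_name):
    # xmlContent=True preserves the per-page <div class="page"> markers
    parsed = parser.from_file(file_name, xmlContent=True)
    body = parsed['content'].split('<body>')[1].split('</body>')[0]
    pages = body.split('<div class="page">')[1:]  # one entry per page
    return pages[0] if pages else ''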
I am totally new to programming, but I came across a bizarre phenomenon that I could not explain. Please point me in the right direction. I started crawling an entirely JavaScript-built webpage with AJAX tables. First I started with Selenium and it worked well. However, I noticed that some of you pros here mentioned Scrapy is much faster, so I tried it and succeeded in building the crawler under Scrapy, with a hell of a headache.
I need to use re to extract the JavaScript strings, and what happened next confused me. Here is what the Python docs say (https://docs.python.org/3/library/re.html):
"but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program."
I first started by using re.search(pattern, str) inside the loop. Scrapy crawled 202 pages in 3 seconds (finish time minus start time).
Then I followed the Python docs' suggestion and compiled the pattern with re.compile(pattern) before the loop to improve efficiency. Scrapy crawled the same 202 pages in 37 seconds. What is going on here?
Here is some of the code; other suggestions to improve it are greatly appreciated. Thanks.
EDIT 2: I was too presumptuous to base my view on a single run.
Three later tests with 2000 webpages show that compiling the regex within the loop finishes in 25 s on average; with the same 2000 webpages, compiling before the loop finishes in 24 s on average.
EDIT:
Here is the webpage I am trying to crawl
http://trac.syr.edu/phptools/immigration/detainhistory/
I am trying to crawl basically everything on a year-month basis from this database. There are three JavaScript-generated columns on this webpage. When you select options from these drop-down menus, the webpage sends SQL queries back to its server and generates the corresponding contents. I figured out how to generate these pre-defined queries directly to crawl all the table contents, but it is a great pain.
def parseYM(self, response):
    c_list = response.xpath()
    c_list_num = len(c_list)
    item2 = response.meta['item']
    # compiling before the loop
    # search_c = re.compile(pattern1)
    # search_cid = re.compile(pattern2)
    for j in range(c_list_num):
        item = myItem()
        item[1] = item2[1]
        item['id'] = item2['id']
        ym_id = item['ymid']
        item[3] = re.search(pattern1, c_list[j]).group(1)
        tmp1 = re.search(pattern2, c_list[j]).group(1)
        # item[3] = search_c.search(c_list[j]).group(1)
        # tmp1 = search_cid.search(c_list[j]).group(1)
        item[4] = tmp1
        link1 = 'link1'
        request = Request(link1, self.parse2, meta={'item': item}, dont_filter=True)
        yield request
An unnecessary temp variable is used to avoid long lines; maybe there are better ways? I have a feeling that the regular-expression issue has something to do with the Twisted reactor. The Twisted docs are quite intimidating to newbies like me...
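For what it's worth, the gap is unlikely to come from the regex itself: the re module caches compiled patterns internally, so repeated re.search calls with the same pattern string cost only a dictionary lookup more than the pre-compiled version, which matches the near-identical 25 s vs 24 s numbers in EDIT 2. A quick way to check this outside Scrapy, with a made-up pattern and text:
import re
import timeit

pattern = r'"name":\s*"([^"]+)"'           # made-up pattern for illustration
text = '{"id": 42, "name": "example"}' * 100

compiled = re.compile(pattern)
print(timeit.timeit(lambda: re.search(pattern, text), number=100000))
print(timeit.timeit(lambda: compiled.search(text), number=100000))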
I need to extract more information from TripAdvisor.
My code:
item = TripadvisorItem()
item['url'] = response.url.encode('ascii', errors='ignore')
item['state'] = hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()[0].encode('ascii', errors='ignore')
if(item['state']==[]):
    item['state'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[contains(@class,"region_title")][2]/text()').extract()
item['city'] = hxs.select('//*[@id="PAGE"]/div[2]/div[1]/ul/li[3]/a/span/text()').extract()
if(item['city']==[]):
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[1]/span/text()').extract()
if(item['city']==[]):
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[3]/span/text()').extract()
item['city'] = item['city'][0].encode('ascii', errors='ignore')
item['hotelName'] = hxs.xpath('//*[@id="HEADING"]/span[2]/span/a/text()').extract()
item['hotelName'] = item['hotelName'][0].encode('ascii', errors='ignore')
reviews = hxs.select('.//div[contains(@id, "review")]')
1. For every hotel on TripAdvisor there is an ID number, like 80075 for this hotel: http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS. How can I extract this ID from the TA item?
2. More information I need for every hotel: shortDescription, stars, zipCode, country and coordinates (long, lat). Can I extract these things?
3. I need to extract the traveller type for every review. How?
My code for the reviews:
for review in reviews:
    it = Review()
    it['state'] = item['state']
    it['city'] = item['city']
    it['hotelName'] = item['hotelName']
    it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/@title').extract()
    if(it['date']==[]):
        it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/text()').extract()
    if(it['date']!=[]):
        it['date'] = it['date'][0].encode('ascii', errors='ignore').replace("Reviewed", "").strip()
    it['userName'] = review.xpath('.//div[contains(@class,"username mo")]/span/text()').extract()
    if (it['userName']!=[]):
        it['userName'] = it['userName'][0].encode('ascii', errors='ignore')
    it['userLocation'] = ''.join(review.xpath('.//div[contains(@class,"location")]/text()').extract()).strip().encode('ascii', errors='ignore')
    it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div[1]/div[contains(@class,"quote")]/text()').extract()
    if(it['reviewTitle']!=[]):
        it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    else:
        it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div/div[1]/a/span[contains(@class,"noQuotes")]/text()').extract()
        if(it['reviewTitle']!=[]):
            it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    it['reviewContent'] = review.xpath('.//div[1]/div[2]/div[1]/div[3]/p/text()').extract()
    if(it['reviewContent']!=[]):
        it['reviewContent'] = it['reviewContent'][0].encode('ascii', errors='ignore').strip()
    it['generalRating'] = review.xpath('.//div/div[2]/div/div[2]/span[1]/img/@alt').extract()
    if(it['generalRating']!=[]):
        it['generalRating'] = it['generalRating'][0].encode('ascii', errors='ignore').split()[0]
Is there a good manual on how to find these things? I got lost with all the spans and divs.
Thanks!
I'll try to do this purely in XPath. Unfortunately, it looks like most of the info you want is contained in <script> tags:
Hotel ID - Returns "80075"
substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "locId:")), ",")
Alternatively, the Hotel ID is in the URL, as another answerer mentioned. If you're sure the format will always be the same (such as including a "d" prior to the ID), then you can use that instead.
Rating (the one at the top) - Returns "3.5"
//span[contains(@class, "rating_rr")]/img/@content
There are a couple of instances of ratings on this page. The main rating at the top is what I've grabbed here. I haven't tested this within Scrapy, so it's possible that it's populated by JavaScript and not initially loaded as part of the HTML. If that's the case, you'll need to grab it somewhere else or use something like Selenium/PhantomJS.
Zip Code - Returns "10019"
(//span[@property="v:postal-code"]/text())[1]
Again, same deal as above. It's in the HTML, but you should check whether it's there upon page load.
Country - Returns ""US""
substring-before(substring-after(//script[contains(., "modelLocaleCountry")]/text(), "modelLocaleCountry = "), ";")
This one comes with quotes. You can always (and you should) use a pipeline to sanitize scraped data to get it to look the way you want.
Coordinates - Returns "40.76174" and "-73.985275", respectively
Lat: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lat:")), ",")
Lon: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lng:")), ",")
I'm not entirely sure where the short description exists on this page, so I didn't include that. It's possible you have to navigate elsewhere to get it. I also wasn't 100% sure what the "traveler type" meant, so I'll leave that one up to you.
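A sketch of how these expressions might sit together inside a Scrapy callback. The item field names here are placeholders, the string-splitting mirrors the substring-before/after XPaths above, and the selectors may need adjusting whenever TripAdvisor changes its markup:
def parse_hotel(self, response):
    item = TripadvisorItem()
    # the <script> block that holds locId/lat/lng
    script = ''.join(response.xpath('//script[contains(., "geoId:") and contains(., "lat")]/text()').extract())
    def grab(key):
        # hypothetical helper: the text between key and the next comma
        return script.split(key)[1].split(',')[0].strip() if key in script else None
    item['hotelId'] = grab('locId:')
    item['lat'] = grab('lat:')
    item['lng'] = grab('lng:')
    ratings = response.xpath('//span[contains(@class, "rating_rr")]/img/@content').extract()
    item['stars'] = ratings[0] if ratings else None
    zips = response.xpath('(//span[@property="v:postal-code"]/text())[1]').extract()
    item['zipCode'] = zips[0] if zips else None
    yield item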
As far as a manual, it's really about practice. You learn tricks and hacks for working within XPath, and Scrapy allows you to use some added features, such as regex and pipelines. I wouldn't recommend doing the whole "absolute path" XPath (i.e., ./div/div[3]/div[2]/ul/li[3]/...), since any deviation from that within the DOM will completely ruin your scraping. If you have a lot of data to scrape, and you plan on keeping this around a while, your project will become unmanageable very quickly if any site moves around even a single <div>.
I'd recommend more "querying" XPaths, such as //div[contains(@class, "foo")]//a[contains(@href, "detailID")]. Paths like that will make sure that no matter how many elements are placed between the elements you know will be there, and even if multiple target elements are slightly different from each other, you'll be able to grab them consistently.
XPaths are a lot of trial and error. A LOT. Here are a few tools that help me out significantly:
XPath Helper (Chrome extension)
scrapy shell <URL>
scrapy view <URL> (for rendering Scrapy's response in a browser)
PhantomJS (if you're interested in getting data that's been inserted via JavaScript)
Hope some of this helped.
Is it acceptable to get it from the URL using a regex?
id = re.search('(-d)([0-9]+)',url).group(2)
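For example, applied to the URL from the question (this assumes the -d<digits>- segment is always present in hotel URLs):
import re

url = 'http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html'
hotel_id = re.search('(-d)([0-9]+)', url).group(2)
print(hotel_id)  # 80075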
I have the following script that posts a search term into a form and retrieves the results:
import mechanize
url = "http://www.taliesin-arlein.net/names/search.php"
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.open(url)
br.select_form(name="form")
br["search_surname"] = "*"
res = br.submit()
content = res.read()
with open("surnames.txt", "w") as f:
f.write(content)
However, the rendered web page, and hence the script here, limits the search to 250 results. Is there any way I can bypass this limit and retrieve all results?
Thank you
You could simply iterate over possible prefixes to get around the limit. There are 270,000 names and a limit of 250 results per query, so you need to make at least 1080 requests. There are 26 letters in the alphabet, so if we assume an even distribution we would need a little over 2 letters as a prefix (log(1080)/log(26) ≈ 2.1); however, the distribution is unlikely to be that even (how many people have surnames starting with ZZ, after all?).
To get around this we use a modified depth-first search, like so:
import string
import time
import mechanize

def checkPrefix(prefix):
    # Return the list of names with this prefix.
    url = "http://www.taliesin-arlein.net/names/search.php"
    br = mechanize.Browser()
    br.open(url)
    br.select_form(name="form")
    br["search_surname"] = prefix + '*'
    res = br.submit()
    content = res.read()
    return extractSurnames(content)

def extractSurnames(pageText):
    # Write a function here to extract the surnames from the HTML.
    return []

Q = [x for x in string.ascii_lowercase]
listOfSurnames = []
while Q:
    curPrefix = Q.pop()
    print(curPrefix)
    curSurnames = checkPrefix(curPrefix)
    if len(curSurnames) < 250:
        # Store the surnames (could also write them to a file).
        listOfSurnames += curSurnames
    else:
        # We clearly didn't get all of the names; we need to subdivide more.
        Q += [curPrefix + x for x in string.ascii_lowercase]
    time.sleep(5)  # Sleep here to avoid overloading the server for other people.
Thus we query more deeply in places where there are too many results to display, but we do not query ZZZZ if fewer than 250 surnames start with ZZZ (or shorter). Without knowing how skewed the name distribution is, it is hard to estimate how long this will take, but the 5-second sleep multiplied by 1080 requests is about 1.5 hours, so you are probably looking at at least half a day, if not longer.
Note: this could be made more efficient by declaring the browser globally, but whether that is appropriate depends on where this code will be placed.
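The extractSurnames stub above depends on the markup of the results page, which isn't shown here. As a purely illustrative sketch, assuming the surnames end up in table cells (you would need to inspect the real page and adjust the selector):
from bs4 import BeautifulSoup

def extractSurnames(pageText):
    # Hypothetical: assumes each surname sits in its own <td>.
    soup = BeautifulSoup(pageText, 'html.parser')
    return [td.get_text(strip=True) for td in soup.find_all('td') if td.get_text(strip=True)]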