Getting BeautifulSoup to run through a list after finishing pagination - Python

So, I am learning both Python and web scraping, so please forgive me if this is something extremely basic.
I found a script and modified it to scrape yell.com.
Now, I understand pagination, and I am able to scrape the entire set for one city using code similar to the one below:
for x in range(1, 9):
    print(f'Scraping page {x}')
    content = extract(f'https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=134234234&keywords=dentists&location=birmingham&pageNum={x}')
    transform(content)
    time.sleep(5)

load()
print('Saved to CSV')
Now, I have a list of cities that I'd like to scrape.
So, for instance, the location=birmingham parameter above would change to location=portsmouth.
The solution I have come up with is to define the entire city list in an array (it could be huge) and then loop over it.
However, I want the scrape to run through the entire page range defined above and then move on to a different city, with the range reset, and I can't figure that bit out.

It sounds like you just need a second for loop to go through your long list of cities. Then the city can be included in your URL. For example:
cities = ['birmingham', 'portsmouth', 'london']  # long list of cities

for city in cities:
    print(f'City - {city}')
    for x in range(1, 9):
        print(f'Scraping page {x}')
        content = extract(f'https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=134234234&keywords=dentists&location={city}&pageNum={x}')
        transform(content)
        time.sleep(5)
    load()
    print('Saved to CSV')
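One practical follow-up: as written, each city's load() call will presumably write to the same CSV. If you want one file per city, you could pass the city name through to the load step. A minimal sketch, assuming load() can be modified to accept a filename (that parameter is hypothetical, not part of the original script):

def load(filename):
    # hypothetical variant of the original load() that writes the
    # transformed rows to the given file instead of a fixed CSV path
    ...

for city in cities:
    for x in range(1, 9):
        content = extract(f'https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=134234234&keywords=dentists&location={city}&pageNum={x}')
        transform(content)
        time.sleep(5)
    load(f'{city}.csv')  # e.g. birmingham.csv, portsmouth.csv, ...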

How to add numbers in link (loop)

I'm writing a script where I try to scrape data from JSON files. The website link structure looks like this:
https://go.lime-go.com/395012/Organization/pase1009/
I want the Python script to go through a certain range of numbers and try to visit each one. For example, right now the link is at pase1009. After the script has visited this link, I want it to go to pase1010, and so on.
I'm really new to Python and trying to learn how to use loops, counters, etc., but I don't get it.
My Python code:
rlista = "https://go.lime-go.com/395012/Organization/pase1009/getEmployees"
page = self.driver.get(rlista)
time.sleep(2)
Best regards,
Tobias
You can combine several strings into one with the + operator.
So you could save your base link in a variable and add the number afterwards in the loop.
It would look something like this:
baseLink = "https://your-link.com/any/further/stuff/pase"
for k in range(1000, 1010, 2):
    link = baseLink + str(k)
    print(link)
Then your links would be
https://your-link.com/any/further/stuff/pase1000
https://your-link.com/any/further/stuff/pase1002
https://your-link.com/any/further/stuff/pase1004
https://your-link.com/any/further/stuff/pase1006
https://your-link.com/any/further/stuff/pase1008
as k starts at 1000, increments by 2, and stops before 1010 (range(start, stop, step)).
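As a side note, an f-string (Python 3.6+) builds the same links without the explicit str() conversion; an equivalent sketch:

baseLink = "https://your-link.com/any/further/stuff/pase"
for k in range(1000, 1010, 2):
    # the number is converted to its string form automatically inside the f-string
    link = f"{baseLink}{k}"
    print(link)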

Printing information I have stored - Python

I am creating a news feed scraper so I can collate my favourite football team's news daily. I'm an apprentice developer and I thought doing it would increase my knowledge. It's just a simple thing to scan one or two sites for headlines and return the text of those headlines. I have downloaded Python and gained a bit of knowledge around BeautifulSoup methods, and I have managed to find a path directly to each headline on my chosen site and store these in an array:
page_soup = soup(page_html, "html.parser")  # "parses" the stored data (page_html)
allHeadlines = page_soup.findAll(class_="lakeside__title-text")  # finds all titles on the BBC Liverpool sports page
headline1 = allHeadlines[0]  # create a single entry called "headline1" from the first slot in our search results
headline1.text  # shows the "headline1" string to prove it's working, e.g. 'What do you know about Dalglish?' (my result)
But now I am puzzled as to how to create the loop needed to store and display the data.
for item in allHeadlines:
    # something here. I'm a noob, so all I know around this is usually item = item + 1
Then print to a file, etc.
Any reading material around this topic would be greatly appreciated.
Sorry for the editing issues, this is my first ever post.
Assuming allHeadlines is a list of objects (each of which has a text attribute), we can build a list of the text values with a list comprehension, for display or for writing to a file:
text_headlines = [item.text for item in allHeadlines if item.text]
print(text_headlines)
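To cover the "print to file" part of the question, a minimal sketch (the filename headlines.txt is just an example):

# write one headline per line to a text file
with open("headlines.txt", "w", encoding="utf-8") as f:
    for headline in text_headlines:
        f.write(headline + "\n")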

How to scrape movie information from the IMDb website?

I am new to Python and trying to scrape IMDb. I am scraping the list of the top 250 IMDb movies and want to get information from each movie's own page, for example the length of each movie.
I already have a list of the unique URLs. So, I want to loop over this list, and for every URL in it retrieve the 'length' of that movie. Is it possible to do this in one script?
for URL in urlofmovie:
    htmlsource = requests.get(URL)
    tree_url = html.fromstring(htmlsource)
    lengthofmovie = tree_url.xpath('//*[@class="subtext"]')
I expect that lengthofmovie will become a list of all the lengths of the movies. However, it already goes wrong at line 2: the htmlsource.
To make it a list you should first create a list and then append each length to that list.
import requests
from lxml import html

length_list = []
for URL in urlofmovie:
    htmlsource = requests.get(URL)
    tree_url = html.fromstring(htmlsource.content)  # parse the response body, not the Response object itself
    length_list.append(tree_url.xpath('//*[@class="subtext"]'))
Small tip: since you are new to Python, I would suggest going over the PEP 8 conventions. Good variable naming can make your (and other developers') lives easier, e.g. urlofmovie -> urls_of_movies.
However, it already goes wrong at line 2: the htmlsource.
Please provide the exception you are receiving.

Problem with accessing the index from a for loop and using it to create a new list

I am extremely new to Python and programming in general (I basically started a few days ago) so forgive me if I use the wrong terms or if I'm asking a silly question.
I’m writing a web scraper to get some data from a job vacancy website. I've written some code that first of all downloads the data from the main search results page, parses it and extracts from it the headings which contain a link to each of the vacancy pages where the details of each specific vacancy can be found. Then I’ve written code that opens each link and parses the html from each vacancy page.
Now this all works fine. The issue I have is with the following. I want to scrape some data from each of these vacancy pages and save the data for each vacancy in a separate list, so that later I can put all these lists in a data frame. I've therefore been looking for a way to number or 'index' (if that is the right term to use) each list so that I can refer to them later. Below is the code I have at the moment. Following the advice I found by reading existing answers on Stack Overflow, I've tried to use enumerate to create an index which I can assign to each list, as follows:
vacancy_headings = resultspage1_soup.body.findAll("a", class_="vacancy-link")
vacancydetails = []
for index, vacancy in enumerate(vacancy_headings, start=0):
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk", vacancy["href"])
    vacancypage_client = urlopen(vacancypage_url)
    vacancypage_html = vacancypage_client.read()
    vacancypage_soup = soup(vacancypage_html, "html.parser")
    vacancydetails[index]=[]

    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            if p["itemprop"] == "employmentType" or p["itemprop"] == "streetAddress" or p["itemprop"] == "addressLocality" or p["itemprop"] == "addressRegion" or p["itemprop"] == "postalCode":
                cells = p.text
                vacancydetails[index].append(cells)
But I get the following error message:
IndexError Traceback (most recent call last)
<ipython-input-10-b8a75df16395> in <module>()
9 vacancypage_html = vacancypage_client.read()
10 vacancypage_soup = soup(vacancypage_html, "html.parser")
---> 11 vacancydetails[index]=[]
12
13 for p in vacancypage_soup.select("p"):
IndexError: list assignment index out of range
Could someone explain to me (in easy-to-understand language if possible!) what is going wrong, and how I can fix this problem?
Thanks!!
Since vacancydetails is a list, trying to access a position in the list that doesn't exist is an error. And, when you first create it, the list is empty. So, before accessing any elements from the list, you'll need to first create those elements.
Thus, instead of this:
vacancydetails[index]=[]
...you want to append a new item to the list (and that new item happens to be an empty list itself), like this:
vacancydetails.append([])
The list vacancydetails is empty until you append to it (or assign to it from somewhere else). Because index is counting up from 0, you just want to manipulate the currently-final entry in vacancydetails in the for p loop.
So, rather than vacancydetails[index]=[] you want vacancydetails.append([]). But then the more pythonic thing to do is work with the last entry in vacancydetails, i.e., vacancydetails[-1], in which case you never need the index variable.
for vacancy in vacancy_headings:
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk", vacancy["href"])
    ### ...
    vacancydetails.append([])

    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            ### ...
            vacancydetails[-1].append(cells)
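A variation worth mentioning (not from the original answers): build each vacancy's list in a local variable and append it once at the end, so you never index into vacancydetails at all. A minimal sketch under that assumption:

for vacancy in vacancy_headings:
    # ... fetch and parse the vacancy page into vacancypage_soup as before ...
    details = []  # collect this vacancy's fields locally
    for p in vacancypage_soup.select("p"):
        # Tag.get returns None when the attribute is absent
        if p.get("itemprop") in ("employmentType", "streetAddress", "addressLocality", "addressRegion", "postalCode"):
            details.append(p.text)
    vacancydetails.append(details)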

Scrapy, Crawling Reviews on Tripadvisor: extract more hotel and user information

I need to extract more information from TripAdvisor.
My code:
item = TripadvisorItem()
item['url'] = response.url.encode('ascii', errors='ignore')
item['state'] = hxs.xpath('//*[@id="PAGE"]/div[2]/div[1]/ul/li[2]/a/span/text()').extract()[0].encode('ascii', errors='ignore')
if item['state'] == []:
    item['state'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[contains(@class,"region_title")][2]/text()').extract()
item['city'] = hxs.select('//*[@id="PAGE"]/div[2]/div[1]/ul/li[3]/a/span/text()').extract()
if item['city'] == []:
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[1]/span/text()').extract()
if item['city'] == []:
    item['city'] = hxs.xpath('//*[@id="HEADING_GROUP"]/div[2]/address/span/span/span[3]/span/text()').extract()
item['city'] = item['city'][0].encode('ascii', errors='ignore')
item['hotelName'] = hxs.xpath('//*[@id="HEADING"]/span[2]/span/a/text()').extract()
item['hotelName'] = item['hotelName'][0].encode('ascii', errors='ignore')
reviews = hxs.select('.//div[contains(@id, "review")]')
1. For every hotel on TripAdvisor there is an ID number for the hotel, like 80075 for this hotel: http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html#REVIEWS. How can I extract this ID from the TA item?
2. More information I need for every hotel: shortDescription, stars, zipCode, country and coordinates (long, lat). Can I extract these things?
3. I need to extract the traveller type for every review. How?
My code for the reviews:
for review in reviews:
    it = Review()
    it['state'] = item['state']
    it['city'] = item['city']
    it['hotelName'] = item['hotelName']
    it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/@title').extract()
    if it['date'] == []:
        it['date'] = review.xpath('.//div[1]/div[2]/div/div[2]/span[2]/text()').extract()
    if it['date'] != []:
        it['date'] = it['date'][0].encode('ascii', errors='ignore').replace("Reviewed", "").strip()
    it['userName'] = review.xpath('.//div[contains(@class,"username mo")]/span/text()').extract()
    if it['userName'] != []:
        it['userName'] = it['userName'][0].encode('ascii', errors='ignore')
    it['userLocation'] = ''.join(review.xpath('.//div[contains(@class,"location")]/text()').extract()).strip().encode('ascii', errors='ignore')
    it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div[1]/div[contains(@class,"quote")]/text()').extract()
    if it['reviewTitle'] != []:
        it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    else:
        it['reviewTitle'] = review.xpath('.//div[1]/div[2]/div/div[1]/a/span[contains(@class,"noQuotes")]/text()').extract()
        if it['reviewTitle'] != []:
            it['reviewTitle'] = it['reviewTitle'][0].encode('ascii', errors='ignore')
    it['reviewContent'] = review.xpath('.//div[1]/div[2]/div[1]/div[3]/p/text()').extract()
    if it['reviewContent'] != []:
        it['reviewContent'] = it['reviewContent'][0].encode('ascii', errors='ignore').strip()
    it['generalRating'] = review.xpath('.//div/div[2]/div/div[2]/span[1]/img/@alt').extract()
    if it['generalRating'] != []:
        it['generalRating'] = it['generalRating'][0].encode('ascii', errors='ignore').split()[0]
Is there a good manual on how to find these things? I lost myself in all the spans and divs.
Thanks!
I'll try to do this purely in XPath. Unfortunately, it looks like most of the info you want is contained in <script> tags:
Hotel ID - Returns "80075"
substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "locId:")), ",")
Alternatively, the Hotel ID is in the URL, as another answerer mentioned. If you're sure the format will always be the same (such as including a "d" prior to the ID), then you can use that instead.
Rating (the one at the top) - Returns "3.5"
//span[contains(@class, "rating_rr")]/img/@content
There are a couple of instances of ratings on this page. The main rating at the top is what I've grabbed here. I haven't tested this within Scrapy, so it's possible that it's populated by JavaScript and not initially loaded as part of the HTML. If that's the case, you'll need to grab it somewhere else or use something like Selenium/PhantomJS.
Zip Code - Returns "10019"
(//span[@property="v:postal-code"]/text())[1]
Again, same deal as above. It's in the HTML, but you should check whether it's there upon page load.
Country - Returns ""US""
substring-before(substring-after(//script[contains(., "modelLocaleCountry")]/text(), "modelLocaleCountry = "), ";")
This one comes with quotes. You can always (and you should) use a pipeline to sanitize scraped data to get it to look the way you want; see the pipeline sketch after this list.
Coordinates - Returns "40.76174" and "-73.985275", respectively
Lat: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lat:")), ",")
Lon: substring-before(normalize-space(substring-after(//script[contains(., "geoId:") and contains(., "lat")]/text(), "lng:")), ",")
I'm not entirely sure where the short description exists on this page, so I didn't include that. It's possible you have to navigate elsewhere to get it. I also wasn't 100% sure what the "traveler type" meant, so I'll leave that one up to you.
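Since a couple of these values come back with leftover quotes or whitespace, here is a minimal Scrapy item pipeline sketch for sanitizing them (the country field name and the quote-stripping rule are illustrative assumptions, not from the question's item definition):

# enable in settings.py (path is illustrative):
# ITEM_PIPELINES = {'myproject.pipelines.SanitizeItemPipeline': 300}

class SanitizeItemPipeline(object):
    def process_item(self, item, spider):
        # strip surrounding quotes and whitespace left over from the raw <script> text;
        # 'country' is a hypothetical field name for this example
        if item.get('country'):
            item['country'] = item['country'].strip('"').strip()
        return item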
As far as a manual goes, it's really about practice. You learn tricks and hacks for working within XPath, and Scrapy allows you to use some added features, such as regex and pipelines. I wouldn't recommend doing the whole "absolute path" XPath (i.e., ./div/div[3]/div[2]/ul/li[3]/...), since any deviation from that within the DOM will completely ruin your scraping. If you have a lot of data to scrape, and you plan on keeping this around a while, your project will become unmanageable very quickly if the site moves around even a single <div>.
I'd recommend more "querying" XPaths, such as //div[contains(@class, "foo")]//a[contains(@href, "detailID")]. Paths like that will make sure that no matter how many elements are placed between the elements you know will be there, and even if multiple target elements are slightly different from each other, you'll be able to grab them consistently.
XPaths are a lot of trial and error. A LOT. Here are a few tools that help me out significantly:
XPath Helper (Chrome extension)
scrapy shell <URL>
scrapy view <URL> (for rendering Scrapy's response in a browser)
PhantomJS (if you're interested in getting data that's been inserted via JavaScript)
Hope some of this helped.
Is it acceptable to get it from the URL using a regex?
id = re.search('(-d)([0-9]+)', url).group(2)
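For completeness, a minimal runnable sketch of that regex approach (the URL is the example hotel from the question; hotel_id is used to avoid shadowing the built-in id):

import re

url = "http://www.tripadvisor.com/Hotel_Review-g60763-d80075-Reviews-Amsterdam_Court_Hotel-New_York_City_New_York.html"
match = re.search(r'-d([0-9]+)', url)
if match:
    hotel_id = match.group(1)
    print(hotel_id)  # prints: 80075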
