How to scrape movie information from the IMDB website? - python

I am new to Python and trying to scrape IMDB. I am scraping the list of the top 250 IMDB movies and want to get information from each movie's own page, for example the length of each movie.
I already have a list of unique URLs. I want to loop over this list and, for every URL in it, retrieve the 'length' of that movie. Is it possible to do this in one piece of code?
for URL in urlofmovie:
    htmlsource = requests.get(URL)
    tree_url = html.fromstring(htmlsource)
    lengthofmovie = tree_url.xpath('//*[@class="subtext"]')
I expect that lengthofmovie will become a list of all the lengths of the movies. However, it already goes wrong at line 2: the htmlsource.

To make it a list you should first create a list and then append each length to that list.
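import requests
from lxml import html

length_list = []
for URL in urlofmovie:
    htmlsource = requests.get(URL)
    # fromstring() needs the page body, not the Response object itself
    tree_url = html.fromstring(htmlsource.content)
    # xpath() returns a list of matching elements for this page
    length_list.append(tree_url.xpath('//*[@class="subtext"]'))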
Small tip: since you are new to Python, I would suggest going over the PEP 8 conventions. Clear variable names make life easier for you and for other developers (urlofmovie -> urls_of_movies).
You wrote: "However, it already goes wrong at line 2: the htmlsource."
Please provide the exception you are receiving.
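If it helps to test the loop without any network access or third-party packages, here is a minimal sketch that swaps in the standard library's html.parser for lxml. It is not the asker's method: ClassTextParser, subtext_of and the sample page fragment are made-up names and data for illustration, and the real IMDB markup may differ.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextParser(HTMLParser):
    """Collect the text of every element whose class list contains `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []    # text pieces of the element currently being read
        self.results = []   # one cleaned-up string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.wanted in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Join the pieces and collapse runs of whitespace.
            self.results.append(" ".join(" ".join(self.chunks).split()))
            self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def subtext_of(page_html):
    """Return the text of each class="subtext" element in one page."""
    parser = ClassTextParser("subtext")
    parser.feed(page_html)
    return parser.results


# Hypothetical fragment standing in for one downloaded movie page.
sample_page = '<div class="subtext">PG-13 | <time>2h 22min</time> | Drama</div>'

# One entry per page, exactly like appending inside the for loop.
length_list = [subtext_of(page) for page in [sample_page]]
print(length_list)
```

With real pages you would feed the parser the text of each response (for example the .text attribute of a requests response) instead of sample_page.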

Related

how to get nested data with pandas and request

I'm going crazy trying to get data through an API call using requests and pandas. It looks like it's nested data, but I can't get the data I need.
https://xorosoft.docs.apiary.io/#reference/sales-orders/get-sales-orders
Above is the API documentation. I'm just trying to keep it simple and get the ItemNumber and QtyRemainingToShip, but I can't even figure out how to access the nested data. I'm trying to use a DataFrame to get it, but I am just lost. Any help would be appreciated. I keep getting stuck at the 'Data' level.
type(json['Data'])
df = pd.DataFrame(['Data'])
df.explode('SoEstimateHeader')
df.explode('SoEstimateHeader')
Cell In [64], line 1
df.explode([0:])
^
SyntaxError: invalid syntax
I used the link to grab a sample response from the API documentation page you provided. From the code you posted it looks like you are already able to get the data, and I'm assuming that you already have it as a dictionary.
From what I can tell, I don't think you need pandas here, unless it's a downstream requirement of the task you are doing. To get the ItemNumber and QtyRemainingToShip you can use the code below.
# get the interesting part of the data out of the api response
data_list = json['Data']
#the data_list is only one element long, so grab the first element which is of type dictionary
data = data_list[0]
# the dictionary has two keys at the top level
so_estimate_header = data['SoEstimateHeader']
# similar to the data list the value associated with "SoEstimateItemLineArr" is of type list and has 1 element in it, so we grab the first & only element.
so_estimate_item_line_arr = data['SoEstimateItemLineArr'][0]
# now we can grab the pieces of information we're interested in out of the dictionary
qtyremainingtoship = so_estimate_item_line_arr["QtyRemainingToShip"]
itemnumber = so_estimate_item_line_arr["ItemNumber"]
print("QtyRemainingToShip: ", qtyremainingtoship)
print("ItemNumber: ", itemnumber)
Output
QtyRemainingToShip: 1
ItemNumber: BC
Side Note
As a side note, I wouldn't name any variable json, because that's also the name of Python's standard library module for parsing JSON. It will confuse future readers, and it will clash with the module name if you end up having to import json.
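The code above takes only the first order and its first item line. If the response can hold several of each, a loop over both lists is a natural extension. This is a sketch under assumptions: item_lines is a made-up helper name, and sample_response is hypothetical data shaped like the documentation sample (the OrderNumber field in particular is invented).

```python
def item_lines(response):
    """Return (ItemNumber, QtyRemainingToShip) for every item line of every order."""
    rows = []
    for order in response.get("Data", []):
        for line in order.get("SoEstimateItemLineArr", []):
            rows.append((line.get("ItemNumber"), line.get("QtyRemainingToShip")))
    return rows


# Hypothetical response, shaped like the sample in the API documentation.
sample_response = {
    "Data": [
        {
            "SoEstimateHeader": {"OrderNumber": "SO-1"},  # invented field
            "SoEstimateItemLineArr": [
                {"ItemNumber": "BC", "QtyRemainingToShip": 1},
                {"ItemNumber": "XY", "QtyRemainingToShip": 4},
            ],
        },
    ],
}

rows = item_lines(sample_response)
print(rows)
```

If a DataFrame is needed later, the list of tuples can be passed to pd.DataFrame(rows, columns=["ItemNumber", "QtyRemainingToShip"]).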

Getting Beautiful to run through a list after finishing pagination

I am learning both Python and web scraping, so please forgive me if this is something extremely basic.
I found a script and modified it to scrape yell.com.
I understand pagination and am able to scrape the entire result set for one city using code similar to the one below.
for x in range(1,9):
    print(f'Scraping page {x}')
    content = extract(f'https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=134234234&keywords=dentists&location=birmingham&pageNum={x}')
    transform(content)
    time.sleep(5)
load()
print('Saved to CSV')
Now, I have a list of cities that I'd like to scrape.
So for instance, the location=birmingham parameter above would change to location=portsmouth.
The solution I have come up with is to define the entire city list in an array (it could be huge) and then loop over it.
However, I want the scrape to run through the entire range defined above and then move on to a different city, with the range reset. And I can't figure that bit out.
It sounds like you just need a second for loop, wrapped around the page loop, to go through your long list of cities. The city can then be included in your URL. For example:
cities = ['birmingham', 'portsmouth', 'london'] # long list of cities
for city in cities:
    print(f'City - {city}')
    for x in range(1, 9):
        print(f'Scraping page {x}')
        content = extract(f'https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=134234234&keywords=dentists&location={city}&pageNum={x}')
        transform(content)
        time.sleep(5)
load()
print('Saved to CSV')
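A variation on the nested loop, in case some city names contain spaces or other characters that need escaping: build the query string with urllib.parse.urlencode instead of an f-string. This is only a sketch; search_urls and BASE_URL are made-up names, and the parameter values are copied from the URL in the question.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.yell.com/ucs/UcsSearchAction.do"


def search_urls(cities, first_page=1, last_page=8, keywords="dentists"):
    """Yield (city, page, url) for every page of every city, one city at a time."""
    for city in cities:
        # The page range starts over for each city.
        for page in range(first_page, last_page + 1):
            query = urlencode({
                "scrambleSeed": 134234234,
                "keywords": keywords,
                "location": city,
                "pageNum": page,
            })
            yield city, page, f"{BASE_URL}?{query}"


# Small demonstration: two cities, two pages each.
urls = list(search_urls(["birmingham", "portsmouth"], last_page=2))
for city, page, url in urls:
    print(city, page, url)
```

Each url can then be passed to extract() and transform() as in the loop above, with the time.sleep(5) pause kept between requests.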

Printing information I have stored - Python

I am creating a news feed scraper so I can collate my favourite football team's news daily. I'm an apprentice developer and I thought doing this would increase my knowledge. It is just a simple thing that scans one or two sites for headlines and returns the text of each headline. I have downloaded Python and gained a bit of knowledge of the Beautiful Soup methods. I have managed to find a path directly to each headline on my chosen site, and I have stored the results in an array:
page_soup = soup(page_html, "html.parser")  # parses the stored data (page_html)
allHeadlines = page_soup.findAll(class_="lakeside__title-text")  # finds all titles on the BBC Liverpool Sports page
headline1 = allHeadlines[0]  # a single entry, "headline1", taken from the first slot in the search results
print(headline1.text)  # prints the headline string to show it is working, e.g. "What do you know about Dalglish?" (my result)
But now I am puzzled as to how to create the loop needed to store the data and display.
for item in allHeadlines{
//something here. im a noob so all i know around this is usually item = item +1
}
print to file etc.,.
Any reading material for me around this topic would be greatly appreciated
Sorry for editing issues, my first ever post.
Assuming allHeadlines is a list of objects that each have a text attribute, we can build a list of the text values with a list comprehension, for display or for writing to a file. Entries with empty text are skipped.
text_headlines = [item.text for item in allHeadlines if item.text]
print(text_headlines)
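For the "print to file" part of the question, here is a minimal sketch that writes one headline per line. The names headline_texts and save_headlines are made up for this example, and SimpleNamespace objects stand in for the Beautiful Soup tags so the snippet runs without downloading a page.

```python
import os
import tempfile
from types import SimpleNamespace


def headline_texts(tags):
    """Return the stripped text of every tag, skipping empty ones."""
    return [tag.text.strip() for tag in tags if tag.text and tag.text.strip()]


def save_headlines(tags, path):
    """Write one headline per line to `path` and return the list that was written."""
    texts = headline_texts(tags)
    with open(path, "w", encoding="utf-8") as handle:
        for text in texts:
            handle.write(text + "\n")
    return texts


# Stand-ins for the tags returned by findAll(); only .text is used.
allHeadlines = [
    SimpleNamespace(text=" What do you know about Dalglish? "),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Second headline"),
]

# A temporary folder keeps the demonstration from cluttering the working directory.
out_path = os.path.join(tempfile.mkdtemp(), "headlines.txt")
saved = save_headlines(allHeadlines, out_path)
print(saved)
```

In the real script, allHeadlines would be the result of page_soup.findAll(...) and out_path any file name you choose.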

problem with accessing index from for loop and using it to create a new list

I am extremely new to Python and programming in general (I basically started a few days ago) so forgive me if I use the wrong terms or if I'm asking a silly question.
I’m writing a web scraper to get some data from a job vacancy website. I've written some code that first of all downloads the data from the main search results page, parses it and extracts from it the headings which contain a link to each of the vacancy pages where the details of each specific vacancy can be found. Then I’ve written code that opens each link and parses the html from each vacancy page.
Now this all works fine. The issue I have is with the following. I want to scrape some data from each of these vacancy pages and save the data for each vacancy in a separate list so that later I can put all these lists in a data frame. I’ve therefore been looking for a way to number or ‘index’ (if that is the right term to use) each list so that I can refer to them later. Below is the code I have at the moment. Following the advice I found by reading existing answers on Stackoverflow I’ve tried to use enumerate to create an index which I can assign to each list, as follows:
vacancy_headings = resultspage1_soup.body.findAll("a", class_ ="vacancy-link")
vacancydetails = []
for index, vacancy in enumerate(vacancy_headings, start=0):
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk",vacancy["href"])
    vacancypage_client = urlopen(vacancypage_url)
    vacancypage_html = vacancypage_client.read()
    vacancypage_soup = soup(vacancypage_html, "html.parser")
    vacancydetails[index]=[]
    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            if p["itemprop"] == "employmentType" or p["itemprop"] == "streetAddress" or p["itemprop"] == "addressLocality" or p["itemprop"] == "addressRegion" or p["itemprop"] == "postalCode":
                cells = p.text
                vacancydetails[index].append(cells)
But I get the following error message:
IndexError Traceback (most recent call last)
<ipython-input-10-b8a75df16395> in <module>()
9 vacancypage_html = vacancypage_client.read()
10 vacancypage_soup = soup(vacancypage_html, "html.parser")
---> 11 vacancydetails[index]=[]
12
13 for p in vacancypage_soup.select("p"):
IndexError: list assignment index out of range
Could someone explain to me (in easy-to-understand language if possible!) what is going wrong, and how I can fix this problem?
Thanks!!
Since vacancydetails is a list, trying to access a position in the list that doesn't exist is an error. And, when you first create it, the list is empty. So, before accessing any elements from the list, you'll need to first create those elements.
Thus, instead of this:
vacancydetails[index]=[]
...you want to append a new item to the list (and that new item happens to be an empty list itself), like this:
vacancydetails.append([])
The list vacancydetails is empty until you append to it (or assign to it from somewhere else). Because index is counting up from 0, you just want to manipulate the currently-final entry in vacancydetails in the for p loop.
So, rather than vacancydetails[index]=[] you want vacancydetails.append([]). But then the more pythonic thing to do is work with the last entry in vacancydetails, i.e., vacancydetails[-1], in which case you never need the index variable.
for vacancy in vacancy_headings:
    vacancypage_url = urljoin("https://www.findapprenticeship.service.gov.uk",vacancy["href"])
    ### ...
    vacancydetails.append([])
    for p in vacancypage_soup.select("p"):
        if p.has_attr("itemprop"):
            ### ...
            vacancydetails[-1].append(cells)
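To see the difference between assigning to a missing index and appending without any scraping involved, here is a small self-contained sketch. The pages data and the collect_details helper are hypothetical: each page is represented as a list of (itemprop, text) pairs rather than parsed HTML.

```python
WANTED = {"employmentType", "streetAddress", "addressLocality",
          "addressRegion", "postalCode"}


def collect_details(pages):
    """Build one inner list per vacancy page, keeping only the wanted itemprops."""
    vacancydetails = []
    for paragraphs in pages:
        # append() creates the next slot; there is no index to manage.
        vacancydetails.append(
            [text for itemprop, text in paragraphs if itemprop in WANTED]
        )
    return vacancydetails


# Hypothetical, already-parsed pages: (itemprop, text) pairs.
pages = [
    [("employmentType", "Apprenticeship"), ("description", "ignored"),
     ("postalCode", "B1 1AA")],
    [("addressLocality", "Leeds")],
]
details = collect_details(pages)
print(details)

# Assigning to an index that does not exist yet raises the error from the question.
empty = []
try:
    empty[0] = []
except IndexError as error:
    message = str(error)
    print("IndexError:", message)
```

A set membership test (itemprop in WANTED) also replaces the long chain of == comparisons joined with or.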

Scrape a webpage using scrapy into tab-delimited format

I would like to scrape and parse the data on these two pages: here and here into a tab-delimited format using scrapy. I did these commands:
scrapy shell
fetch("https://www.drugbank.ca/drugs/DB04899")
print response.text
My questions:
1. For example, for this page, when I type:
response.css(".sequence::text").extract()
[u'>DB04899: Natriuretic peptides B\nSPKMVQGSGCFGRKMDRISSSSGLGCKVLRRH']
But then when I type:
>>> response.css(".synonyms::text").extract()
[]
>>> response.css(".Synonyms::text").extract()
[]
But you can see that there are synonyms listed on the webpage, so the output should not be empty. Can someone show me what I'm doing wrong? (I also tried other selectors, such as synonym and Synonym.)
2. When I type response.css(".targets::text").extract(), the output is [u'Targets (3)']. I'm wondering how I can actually parse the data within this list, but I guess this is related to not using the right tags, as in question 1 above.
3. This one is vague and a bit advanced for me at the moment: is it possible to scrape the whole page in one go, instead of having to know each individual tag? My output would be a dictionary called 'identification' with name, accession number, type etc. as keys, then a dictionary called 'pharmacology' with indication, structured indication etc. as keys, then another called 'interactions', another called 'pharmacoeconomics', and so on: one dictionary per page section.
Thanks
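There are no elements on the page with a class attribute of synonyms or Synonyms, which is why both CSS selectors return an empty list. "Synonyms" is the text of a dt element, not a class name.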
You can get to the synonyms by "going to the right" of the dt element with the "Synonyms" text using following-sibling:
In [2]: response.xpath("//dt[. = 'Synonyms']/following-sibling::dd/ul/li/text()").extract()
Out[2]:
['BNP',
'Brain natriuretic peptide 32',
'Natriuretic peptides B',
'Nesiritide recombinant']
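On the third question (one dictionary per section), the same dt/dd pairing can be turned into a dictionary by walking the children of a definition list in order. The sketch below uses the standard library's xml.etree.ElementTree on a small well-formed snippet rather than Scrapy, so it is only an illustration: definition_list_to_dict and the snippet are made up, and the real page's markup may be nested differently.

```python
import xml.etree.ElementTree as ET


def definition_list_to_dict(dl):
    """Map each dt's text to the content of the dd that follows it.

    A dd containing li items becomes a list of strings; any other dd
    becomes a single string.
    """
    result = {}
    current = None
    for child in dl:
        if child.tag == "dt":
            current = "".join(child.itertext()).strip()
        elif child.tag == "dd" and current is not None:
            items = ["".join(li.itertext()).strip() for li in child.iter("li")]
            result[current] = items or "".join(child.itertext()).strip()
    return result


# Hypothetical, simplified markup standing in for one section of the page.
snippet = """
<dl>
  <dt>Name</dt><dd>Nesiritide</dd>
  <dt>Synonyms</dt>
  <dd><ul><li>BNP</li><li>Brain natriuretic peptide 32</li></ul></dd>
</dl>
""".strip()

identification = definition_list_to_dict(ET.fromstring(snippet))
print(identification)
```

Inside a Scrapy spider the same idea would be expressed with response.xpath, selecting each dt and then its following-sibling dd as in the answer above.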
