I am trying to scrape data from Box Office Mojo for a data science project. I made some changes to code I found in an existing GitHub repository to suit my needs.
https://github.com/OscarPrediction1/boxOfficeCrawler/blob/master/crawlMovies.py
I need some help regarding scraping a particular feature.
While I can scrape a movie's gross normally, Box Office Mojo has a feature that shows the inflation-adjusted gross (the gross the movie would have earned had it been released in a given year). The inflation-adjusted page is reached by appending "&adjust_yr=2018" to the end of the normal movie link.
For example -
Titanic Normal link (http://www.boxofficemojo.com/movies/?id=titanic.htm)
Titanic 2018 Inflation adjusted link (http://www.boxofficemojo.com/movies/?id=titanic.htm&adjust_yr=2018)
In the code I linked earlier, a table of URLs is created by going through the alphabetical list (http://www.boxofficemojo.com/movies/alphabetical.htm) and then each of the URLs is visited. The problem is that the alphabetical list has the normal links of the movies, not the inflation-adjusted links. What do I change to get the inflation-adjusted values?
(The alphabetical list is the only way I found to crawl all the movies at once; I checked that earlier.)
One possible way would be simply to generate all the necessary URLs by appending "&adjust_yr=2018" to each normal URL and scraping each resulting page.
I personally like to use XPath (a language for navigating HTML structures, very useful for scraping!) and, as was once recommended to me, I suggest not relying on string matching to filter data out of HTML. A simple way to use XPath is via the lxml library.
from lxml import html

<your setup>
....

for site in major_sites:
    page = 1
    while True:
        # fetch the table of movie links for this letter/page
        url = "http://www.boxofficemojo.com/movies/alphabetical.htm?letter=" + site + "&p=.htm&page=" + str(page)
        print(url)
        element_tree = html.parse(url)
        rows = element_tree.xpath('//td/*/a[contains(@href, "movies/?id")]/@href')
        rows_adjusted = ['http://www.boxofficemojo.com' + row + '&adjust_yr=2018' for row in rows]
        # then loop over rows_adjusted and grab the necessary info from each page
        # (and remember to break out of the while loop once a page has no more rows)
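From there, the follow-up step could look like the sketch below. The label text matched in the XPath is an assumption about Box Office Mojo's markup, so inspect an adjusted page first and adapt the expression to the actual table structure.

# sketch only: visit each inflation-adjusted URL and pull out a value;
# the "Adj. Gross" label matched below is an assumption about the markup
for adjusted_url in rows_adjusted:
    movie_tree = html.parse(adjusted_url)
    gross = movie_tree.xpath('//td[contains(text(), "Adj. Gross")]/following-sibling::td/b/text()')
    print(adjusted_url, gross)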
If you're comfortable with the pandas DataFrame library, I would also point out the pd.read_html() function, which, in my opinion, is well suited to this task. It lets you scrape a whole alphabetical page in almost a single line, and you can perform any necessary substitutions/annotations afterwards column-wise.
One possible way could look like this:
import pandas as pd
import requests

<your setup>
....

for site in major_sites:
    page = 1
    while True:
        # fetch table
        url = "http://www.boxofficemojo.com/movies/alphabetical.htm?letter=" + site + "&p=.htm&page=" + str(page)
        print(url)
        req = requests.get(url=url)
        # pandas parses the HTML tables for you (via lxml/BeautifulSoup under the hood)
        content = pd.read_html(req.content)
        # choose the correct table from the result
        tabular_data = content[3]
        # drop the row with the title
        tabular_data = tabular_data.drop(0)
        # housekeeping: rename the columns
        tabular_data.columns = ['title', 'studio', 'total_gross', 'total_theaters',
                                'opening_gross', 'opening_theaters', 'opening_date']
        # now you can use pandas to perform all necessary replacement and string operations
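For example, a minimal clean-up sketch, assuming the gross columns hold strings like "$659,363,944":

# strip "$" and "," from the gross column and convert it to a number
tabular_data['total_gross'] = (
    tabular_data['total_gross']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)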
Further resources:
Wikipedia has a nice overview of XPath syntax.
Related
I would like to extract the Name, Address of School, Tel, Fax, and Head of School from the website:
https://www.edb.gov.hk/en/student-parents/sch-info/sch-search/schlist-by-district/school-list-cw.html
Is it possible to do so?
Yes, it is possible, and there are many tools that help you do that. If you do not want to use a programming language, there are plenty of tools available (though you will probably have to pay for them; here is an article that might be useful: https://popupsmart.com/blog/web-scraping-tools).
However, if you want to use Python, you should load the page and then parse the HTML. Then look for your desired element and fetch its data. This article explains the whole process with code: https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
Here is a simple piece of code that prints the tables on the page you posted, based on the code from the article above:
import requests
from bs4 import BeautifulSoup

# Make a request
page = requests.get(
    "https://www.edb.gov.hk/en/student-parents/sch-info/sch-search/schlist-by-district/school-list-cw.html")
soup = BeautifulSoup(page.content, 'html.parser')

# Select every table on the page and print it
products = soup.select('table')
for elem in products:
    print(elem)
You can try it out here:
https://colab.research.google.com/drive/13EzFWNBqpkGf4CZvGt5pYySCuW7Ij6I4?usp=sharing
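If you want the actual cell values rather than raw HTML, a follow-up along these lines could work. This is a sketch: which columns hold the name, address, tel, fax and head of school is an assumption about the page's table layout, so verify it in the browser first.

# pull the text out of each table row instead of printing raw HTML
for table in products:
    for row in table.select('tr'):
        cells = [cell.get_text(strip=True) for cell in row.select('td')]
        if cells:
            print(cells)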
Hello, I am trying to collect event data from my school's event web page and save it to a list based on locations. However, there is no current event information on the page since school is closed, so I tried to get the tags that take the user-specified dates and change their values, so that I can make a URL request with the new tags and get event data from the year before. However, doing this does not give me any new information. How do I replace an old input tag with a new one, and how do I update the HTML page with these new tags? Attached below is example code of what I am doing.
import requests
from bs4 import BeautifulSoup as bs

response = requests.get(url)  # url is the school events page

# start and end dates I want to use
st_date = "04/01/2019"
ed_date = "04/14/2019"

soup = bs(response.text, 'html.parser')
input_list = soup.findAll('input')
# the first and second values in the list are the input tags
start_date = input_list[0]
end_date = input_list[1]

# replace the value attribute with the date strings
start_date['value'] = st_date
end_date['value'] = ed_date

# insert the new tags
soup.insert(1, start_date)
You're manipulating a static copy of the web page. Instead, you should choose one of two options to achieve your goal:
Use Selenium or something equivalent that will execute any JavaScript code that observes form-field changes, or will handle injected submit/update button clicks.
Open the Developer Tools in your browser (usually F12) and watch the Network tab for outgoing requests when you change the dates. Then call these endpoints with your desired dates and get the event data from the response (see the sketch below).
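A minimal sketch of the second option, assuming the page sends the dates as GET parameters to some events endpoint. The endpoint URL and parameter names below are placeholders; copy the real ones from the Network tab.

import requests

# hypothetical endpoint and parameter names -- replace them with
# whatever the Network tab shows for your school's page
params = {
    'start_date': '04/01/2019',
    'end_date': '04/14/2019',
}
response = requests.get('https://example.edu/events/api', params=params)
print(response.json())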
Good luck.
If you really want to change the content of the HTML/XML using BeautifulSoup, you can do that too.
from bs4 import BeautifulSoup
html = BeautifulSoup(some_html, 'html.parser')
html.find('tag').string = 'some new value'
The last line changes the content of the parsed HTML page (your local copy, not the live site).
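Changing an attribute, which is what the question's code does, works the same way on the parsed copy:

# changing an attribute on the parsed tree (again, this only
# edits your local copy of the page, not the live site)
tag = html.find('input')
tag['value'] = '04/01/2019'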
Trying to do some web scraping. I am trying to make a function that will spit out the population for each country. I am trying to scrape the US Census Bureau site, but I can't get back the right information.
https://www.census.gov/popclock/world/af
<div id ="basic-facts" class = "data-cell">
<div class = "data-contianer">
<div class="data-cell" style = "background-image: url.....">
<p>population</p>
<h2 data-population="">35.8M</h2>"
This is basically what the code I'm trying to scrape looks like. What I want is the "35.8M".
I have tried a few methods, and all I can get back is the heading itself, "data-population", none of the data.
Someone mentioned to me that maybe the website has it in some format so that it can't be scraped. In my experience, when scraping is blocked the formatting looks different: the value is in an image or a dynamic item or something else that makes it harder to scrape. Does anyone have any thoughts on this?
# -*- coding: utf-8 -*-
# Tells python what encoding the string is stored in

# Import required libraries
import requests
from bs4 import BeautifulSoup

### country naming issues: In the URLs on the website the countries are coded with
### a two-letter code: "au" = australia, "in" = india. If we were to search by
### country name or something like that, we would need something to relate
### the country name to the two-letter code.
country = 'albania'
country_codes = {'albania': 'al', 'afghanistan': 'af'}
### this would take long to write it all out; maybe it's possible to scrape these names?

# Create url for the requested location through string concatenation
url = 'https://www.census.gov/popclock/world/' + country_codes[country]

# Send a request to retrieve the web page using the
# get() function from the requests library.
# The page variable stores the response.
page = requests.get(url)

# Create a BeautifulSoup object with the response from the URL.
# Access contents of the web page using .content;
# html.parser is used since our page is in HTML format.
soup = BeautifulSoup(page.content, "html.parser")

################################################################## Start what I'm not sure about
# Locate element on page to be scraped
# find() locates the element in the BeautifulSoup object

# 1. First method (note: the keyword must be class_, since class is reserved)
population = soup.find(id="basic-facts", class_="data-cell")
# I tried some methods like this; got only errors

# 2. Second method
population = soup.findAll("h2", {"data-population": ""})
for i in population:
    print(i)
# this returns the headings for the table but no data

### here we need to take out the population data
### it is listed as <h2 data-population="">35.8M</h2>
################################################################## end

# Extract text from the selected BeautifulSoup object using .text
population = population.text

# Final output: return the scraped info
print('The population of ' + country + ' is ' + population)
I outlined the part of the code I'm unsure about with #######. I tried a few methods; I listed two above.
I am pretty new to coding in general, so excuse me if I didn't describe this all right. Thanks for any advice anyone can give.
The value is dynamically retrieved from an API call, which you can find in the browser's Network tab. As you are not using a browser, where this call would have been made for you, you will need to make the request directly yourself.
import requests

# the popclock API returns a list of two rows: a header row and a value row
r = requests.get('https://www.census.gov/popclock/apiData_pop.php?get=POP,MPOP0_4,MPOP5_9,MPOP10_14,MPOP15_19,MPOP20_24,MPOP25_29,MPOP30_34,MPOP35_39,MPOP40_44,MPOP45_49,MPOP50_54,MPOP55_59,MPOP60_64,MPOP65_69,MPOP70_74,MPOP75_79,MPOP80_84,MPOP85_89,MPOP90_94,MPOP95_99,MPOP100_,FPOP0_4,FPOP5_9,FPOP10_14,FPOP15_19,FPOP20_24,FPOP25_29,FPOP30_34,FPOP35_39,FPOP40_44,FPOP45_49,FPOP50_54,FPOP55_59,FPOP60_64,FPOP65_69,FPOP70_74,FPOP75_79,FPOP80_84,FPOP85_89,FPOP90_94,FPOP95_99,FPOP100_&key=&YR=2019&FIPS=af').json()
# pair each header with its value; data[0] is ('POP', <total population>)
data = list(zip(r[0], r[1]))
# convert to millions and round, e.g. 35.8
print(round(int(data[0][1]) / 1_000_000, 1))
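Since the country is just the FIPS parameter at the end of that URL, you can wrap this in the function you're after. A sketch, assuming the API also accepts a shorter get= list (if it doesn't, keep the full list from above):

def population_in_millions(fips_code):
    # fips_code is the two-letter code the site uses in its URLs, e.g. 'af'
    api = ('https://www.census.gov/popclock/apiData_pop.php'
           '?get=POP&key=&YR=2019&FIPS=' + fips_code)
    header, values = requests.get(api).json()
    return round(int(values[header.index('POP')]) / 1_000_000, 1)

print(population_in_millions('af'))  # e.g. 35.8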
Disclaimer: This is my first foray into web scraping
I have a list of URLs corresponding to search results, e.g.,
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
I'm trying to use Selenium to access the HTML of the result as follows:
for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662
https://www.vinelink.com/#/searchResults/1
Does anyone have a tip on how to access the final search results data?
Update: After further exploration this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search results-related scripts referenced in the page_source; is there a way to determine which is relevant?
Using the browser's Inspect tool, I am able to find the information I need.
Once you have your soup variable with the HTML, follow the code below.
import json
data = soup.find('search-result')['data']
print(data)
Output:
{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}
Now treat the value like a dict:
info = json.loads(data)
print(info['first_name'], info['last_name'])
# This prints the first and last name, but you can get the other fields the same way:
# just index with a key like 'date_of_birth' or 'siteId', or assign them to variables.
I am currently trying to scrape all of the names from a specific website. I was making some progress by following a guide on python-guide.org, and I was able to scrape a lot of the information off the site, but not the information I was after. Here is my code so far:
from lxml import html
import requests

page = requests.get('http://www.behindthename.com/names/gender/feminine/usage/african')
tree = html.fromstring(page.content)
# This will create a list of names:
Names = tree.xpath('//div[@class="browsename"]/text()')
print('Names: ', Names)
Unfortunately, that returns a lot of information, but not the list of names. I'm not sure what I'm doing wrong, but I am certain it has to do with the @class="browsename" part. I'm not very familiar with HTML.
Maybe you should use:
//div[@class="browsename"]/b/a/text()
In Chrome, you can press F12 to inspect elements, then use CTRL + F and type in your XPath. Chrome will highlight the elements it matches.
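Put together with the setup from the question, the corrected expression looks like this:

from lxml import html
import requests

page = requests.get('http://www.behindthename.com/names/gender/feminine/usage/african')
tree = html.fromstring(page.content)
# the names sit inside <b><a> elements within each "browsename" div
names = tree.xpath('//div[@class="browsename"]/b/a/text()')
print('Names:', names)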