I have been following FC Python's tutorial on web scraping and I do not understand how they identified range(1,41,2) as the link locations for this page. Is this something I should be able to see in the page source?
import requests
from bs4 import BeautifulSoup

# headers is defined earlier in the tutorial and not shown here;
# Transfermarkt typically needs a browser-like User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}

# Process League Table
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers=headers)
soup = BeautifulSoup(tree.content, 'html.parser')

# Create an empty list to assign these values to
teamLinks = []

# Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

# We need the location that the link is pointing to, so for each link, take the link location.
# Additionally, we only need the links in locations 1, 3, 5, etc. of our list, so loop through those only
for i in range(1, 41, 2):
    teamLinks.append(links[i].get("href"))

# For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk" + teamLinks[i]
The range(1,41,2) is used to avoid duplicated links. That's because in each row of the table there are multiple cells that contain the same link.
We can obtain the same result by getting all the links and removing the duplicates with a set:
teamLinks = list({x.get("href") for x in links})
On the website, each row in the table contains three a.vereinprofil_tooltip links, all with the same href. To avoid the duplicates, the tutorial uses only links 1, 3, 5, etc. And yes, you can see this in the page source, and also in Chrome DevTools.
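A quick way to confirm the duplication yourself (a small sketch that reuses the links list built in the question's code):
# print the first few hrefs: each club's link should show up several times in a row
for a in links[:6]:
    print(a.get("href"))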
You can also collect the links in other ways:
Use a more specific selector, like #yw1 .zentriert a.vereinprofil_tooltip, which targets only the CLUBS - PREMIER LEAGUE 19/20 table (see the sketch after this list).
Use Python code to remove the duplicates:
team_links = list(dict.fromkeys([f"https://www.transfermarkt.co.uk{x['href']}" for x in soup.select("a.vereinprofil_tooltip")]))
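Here is a minimal sketch of the first option, reusing the soup object from the question's code and assuming the league table still sits in the #yw1 container on the Transfermarkt page (adjust the selector if the id differs):
# the narrower selector should match only one link per row, so no deduplication is needed
links = soup.select("#yw1 .zentriert a.vereinprofil_tooltip")
teamLinks = ["https://www.transfermarkt.co.uk" + a["href"] for a in links]
print(teamLinks)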
It is an attempt to remove duplicate entries.
A more robust way to achieve the same is this:
# iterate over all links in the list
for i in range(len(links)):
    teamLinks.append(links[i].get("href"))

for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk" + teamLinks[i]

# make a set to remove duplicates and then make a list of it again
teamLinks = list(set(teamLinks))
teamLinks then prints out to something like this:
['https://www.transfermarkt.co.uk/crystal-palace/spielplan/verein/873/saison_id/2019',
'https://www.transfermarkt.co.uk/afc-bournemouth/spielplan/verein/989/saison_id/2019',
'https://www.transfermarkt.co.uk/sheffield-united/spielplan/verein/350/saison_id/2019',
...
I am trying to scrape several links and collect specific information belonging to each site.
I think I need to use a for loop for this. This is the code that I wrote.
In this case, I could get only one result, but I need all the results.
I want to know how to go back to the URL collection part and repeat the process until I have all the results.
# collect urls
data = html.select("div.content a")
for i in data:
    url = "wwww...." + i["href"]

# move to url link and collect specific information
raw_each = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
html_each = BeautifulSoup(raw_each.text, 'html.parser')
reply = html_each.select("div.content td")
print(reply[14].text)
You constantly overwrite the value of url with the next link in your list.
If you want to keep all the links, you will have to store them in an object that can hold multiple links, such as a list:
data = html.select("div.content a")
urls = []
for i in data:
    urls.append("wwww...." + i["href"])
This produces a list of links you can iterate over in the next step.
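For example, a rough sketch of that next step, continuing from the urls list above and reusing the selector and index from the question (they are placeholders for the real page):
# iterate over the collected links and scrape each page in turn
for url in urls:
    raw_each = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    html_each = BeautifulSoup(raw_each.text, 'html.parser')
    reply = html_each.select("div.content td")
    print(reply[14].text)  # same index the question used; adjust it for the real page layout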
I have a loop that constantly adds a variable with an unknown value to a list and then prints the list. However, I can't find a way to ignore the values that were previously found and added to the list when I print the list the next time.
I'm scraping a constantly updating website for keyword-matching links using requests and bs4 inside a loop. Once the website adds the links I'm looking for, my code adds them to a list and prints the list. When the website adds the next wave of matching links, these will also be added to my list, but my code will also add the old links again, since they still match my keyword. Is it possible to ignore these old links?
url = "www.website.com"
keyword = "news"
results = [] #list which saves the links
while True:
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
options = soup.find_all("a", class_="name-link")
for o in options:
if keyword in o.text:
link = o.attrs["href"] #the links I want
results.append(link) #adds links to list
print(results)
time.sleep(5) #wait until next scrape
#so with every loop the value of 'link' is changing which makes it hard
for me to find a way to ignore previously found links
To make it easier to understand, you could think of a loop that adds an unknown number to a list on every execution, but the number should only be printed on the first execution.
Here is a proof of concept using sets, if the challenge is that you only want to keep unique links, and then print the new links found that have not been found previously:
import random

results = set()
for k in range(15):
    new = {random.randint(1, 5)}
    print(f"First Seen: {new - results}")
    results = results.union(new)
    print(f"All: {results}")
If it is more of a streaming issue, where you save all links to a large list but only want to print the latest ones found, you can do something like this:
import random

results = []
for k in range(5):
    n = len(results)
    new = []
    for _ in range(random.randint(1, 5)):
        new.append(random.randint(1, 5))
    results.extend(new)
    print(results[n:])
But then again, you can also just print new in this case....
This is a good use case for the set data structure. Sets do not maintain any ordering of the items. It's a very simple change to your code above:
url = "www.website.com"
keyword = "news"
results = {}
while True:
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
options = soup.find_all("a", class_="name-link")
for o in options:
if keyword in o.text:
link = o.attrs["href"] #the links I want
results.add(link) #adds links to list
print(results)
time.sleep(5) #wait until next scrape
If you want to maintain order, you can use some variation of an ordered dictionary. Please see here: Does Python have an ordered set?
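For example, here is a small sketch that uses a plain dict as an insertion-ordered set (dict keys keep insertion order in Python 3.7+); the example links are made up for illustration:
results = {}  # dict used as an ordered set: keys are the links, values are ignored
for link in ["/news/1", "/news/2", "/news/1", "/news/3"]:
    if link not in results:
        print("new link:", link)  # only printed the first time a link is seen
        results[link] = None
print(list(results))  # ['/news/1', '/news/2', '/news/3'] - order kept, duplicates dropped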
Disclaimer: This is my first foray into web scraping
I have a list of URLs corresponding to search results, e.g.,
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
I'm trying to use Selenium to access the HTML of the result as follows:
for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662
https://www.vinelink.com/#/searchResults/1
Does anyone have a tip on how to access the final search results data?
Update: After further exploration this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search results-related scripts referenced in the page_source; is there a way to determine which is relevant?
I am able to Inspect the information I need per this image:
Once you have your soup variable with the HTML, follow the code below:
import json
data = soup.find('search-result')['data']
print(data)
Output:
{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}
Now treat each value like a dict.
Next:
info = json.loads(data)
print(info['first_name'], info['last_name'])
# This prints the first and last name but you can get others, just get the key like 'date_of_birth' or 'siteId'.
# You can also assign them to variables.
I am trying to scrape data from BoxOfficeMojo for a Data Science project. I made some changes to this code I found from an already existing GitHub repository to suit my needs.
https://github.com/OscarPrediction1/boxOfficeCrawler/blob/master/crawlMovies.py
I need some help regarding scraping a particular feature.
While I can scrape a movie's gross normally, Box Office Mojo has a feature that shows you the inflation-adjusted gross (the gross of the movie if it had been released in a particular year). The inflation-adjusted gross comes with an additional "&adjust_yr=2018" at the end of the normal movie link.
For example -
Titanic Normal link (http://www.boxofficemojo.com/movies/?id=titanic.htm)
Titanic 2018 Inflation adjusted link (http://www.boxofficemojo.com/movies/?id=titanic.htm&adjust_yr=2018)
In the code I linked earlier, a table of URLs is created by going through the alphabetical list (http://www.boxofficemojo.com/movies/alphabetical.htm) and then each of the URLs is visited. The problem is, the alphabetical list has the normal links of the movies and not the inflation-adjusted links. What do I change to get the inflation-adjusted values from here?
(The only way I could crawl all the movies at once is via the alphabetical list. I have checked that earlier)
One possible way would be simply to generate all the necessary URLs by appending "&adjust_yr=2018" to each of the normal URLs and scraping each site.
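For example, a quick sketch of that idea (normal_urls here is a hypothetical list standing in for the links scraped from the alphabetical pages):
# append the inflation-adjustment parameter to every movie link
normal_urls = ["http://www.boxofficemojo.com/movies/?id=titanic.htm"]  # hypothetical example list
adjusted_urls = [u + "&adjust_yr=2018" for u in normal_urls]
# then scrape each adjusted URL instead of the normal one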
I personally like to use XPath (a language for navigating HTML structures, very useful for scraping!) and recommend not using string matches to filter data out of HTML, as was once recommended to me. A simple way to use XPath is via the lxml library.
from lxml import html

<your setup>
....

for site in major_sites:
    page = 1
    while True:
        # fetch table
        url = "http://www.boxofficemojo.com/movies/alphabetical.htm?letter=" + site + "&p=.htm&page=" + str(page)
        print(url)
        element_tree = html.parse(url)
        rows = element_tree.xpath('//td/*/a[contains(@href, "movies/?id")]/@href')
        rows_adjusted = ['http://www.boxofficemojo.com' + row + '&adjust_yr=2018' for row in rows]
        # then loop over rows_adjusted and grab the necessary info from each page
If you're comfortable using the pandas DataFrame library, I would also like to point out the pd.read_html() function, which, in my opinion, is predestined for this task. It would allow you to scrape a whole alphabetical page in almost a single line. Plus, you can perform any necessary substitutions/annotations column-wise afterwards.
One possible way could be this.
import pandas as pd

<your setup>
....

for site in major_sites:
    page = 1
    while True:
        # fetch table
        url = "http://www.boxofficemojo.com/movies/alphabetical.htm?letter=" + site + "&p=.htm&page=" + str(page)
        print(url)
        req = requests.get(url=url)
        # pandas uses beautifulsoup to parse the html content
        content = pd.read_html(req.content)
        # choose the correct table from the result
        tabular_data = content[3]
        # drop the row with the title
        tabular_data = tabular_data.drop(0)
        # housekeeping renamer
        tabular_data.columns = ['title', 'studio', 'total_gross', 'total_theaters',
                                'opening_gross', 'opening_theaters', 'opening_date']
        # now you can use the pandas library to perform all necessary replacement and string operations
Further resources:
Wikipedia has a nice overview of xpath syntax
I am trying to scrape information from these kinds of pages.
I need the information contained under Internship, Residency, and Fellowship. I can extract values from tables, but in this case I cannot decide which table to use, because the heading (like Internship) sits in a div tag outside the table as plain text, and the table whose values I need comes after it. I have many pages of this kind, and not every page has all of these values; for example, on some pages Residency may not be present at all (which reduces the total number of tables on the page). One example of such a page is this; on that page, Internship is not present at all.
The main problem I am facing is that all the tables have the same attribute values, so I cannot decide which table to use on different pages. If a value of interest is not present on a page, I have to return an empty string for that value.
I am using BeautifulSoup in Python. Can someone point out how I could proceed with extracting those values?
It looks like the ids for the headings and data each have a unique value and standard suffixes. You can use that to search for the appropriate values. Here's my solution:
from bs4 import BeautifulSoup

# Insert whatever networking stuff you're doing here. I'm going to assume
# that you've already downloaded the page and assigned it to a variable
# named 'html'
soup = BeautifulSoup(html, 'html.parser')

headings = ['Internship', 'Residency', 'Fellowship']
values = []
for heading in headings:
    x = soup.find('span', text=heading)
    if x:
        span_id = x.parent['id']
        table_id = span_id.replace('dnnTITLE_lblTitle', 'Display_HtmlHolder')
        values.append(soup.find('td', attrs={'id': table_id}).text)
    else:
        values.append('')

print(list(zip(headings, values)))