Web scraping - how to append columns - Python

I am scraping multiple Google Scholar pages and I have already written code using Beautiful Soup to extract information on title, author, and journal.
This is a sample page.
https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en
I want to now extract information on h-index, i10-index and citations. When I inspected the page, I saw that all of these have the same class name (gsc_rsb_std). Given this, my doubt is:
How do I extract this information using Beautiful Soup? The following code extracted the citations information. How do I do it for the other two, since the class name is the same?
columns['Citations'] = soup.findAll('td',{'class':'gsc_rsb_std'}).text
There is only one value each for name, citations, h-index and i10-index. However, there are multiple rows of papers. Ideally, I want my output in the following form.
Name h-index paper1
Name h-index paper2
Name h-index paper3
I tried the following and I get output in the form above, but only the last paper is repeated. Not sure what is happening here.
soup = BeautifulSoup(driver.page_source, 'html.parser')
columns = {}
columns['Name'] = soup.find('div', {'id': 'gsc_prf_in'}).text
papers = soup.find_all('tr', {'class': 'gsc_a_tr'})
for paper in papers:
    columns['title'] = paper.find('a', {'class': 'gsc_a_at'}).text
    File.append(columns)
My output is like this. Looks like there is something wrong with the loop.
Name h-index paper3
Name h-index paper3
Name h-index paper3
Appreciate any help. Thanks in advance!

I would consider using :has and :contains and targeting by search string.
import requests
from bs4 import BeautifulSoup
searches = ['Citations', 'h-index', 'i10-index']
r = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en')
soup = BeautifulSoup(r.text, 'html.parser')
for search in searches:
    all_value = soup.select_one(f'td:has(a:contains("{search}")) + td')
    print(f'{search} All:', all_value.text)
    since_2016 = all_value.find_next('td')
    print(f'{search} since 2016:', since_2016.text)
You could also have used pandas read_html to grab that table by index.
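For reference, a minimal read_html sketch (the table index is an assumption; check which of the returned DataFrames holds the Citations/h-index/i10-index rows):
import requests
import pandas as pd
from io import StringIO
# fetch with a browser-like User-Agent in case the default one is rejected
r = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en',
                 headers={'User-Agent': 'Mozilla/5.0'})
tables = pd.read_html(StringIO(r.text))  # one DataFrame per <table> on the page
cited_by = tables[0]  # assumed position of the "Cited by" stats table
print(cited_by)  # rows: Citations, h-index, i10-index; columns: All / Since 20xx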
Selenium question:
The element has an id, which is faster to match on using CSS selectors / find_element_by_id, e.g.
driver.find_element_by_id("gsc_prf_in").text
I see no need, however, for selenium when scraping this page.
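As for the loop in the question that repeats only the last paper: every iteration writes into the same columns dict, and File.append(columns) stores a reference to that one object, so all appended rows end up showing its final state. A minimal sketch of a fix, assuming File is a plain list (called rows here), is to build a fresh dict per paper:
rows = []  # hypothetical list standing in for File
name = soup.find('div', {'id': 'gsc_prf_in'}).text
for paper in soup.find_all('tr', {'class': 'gsc_a_tr'}):
    row = {'Name': name}  # new dict on every iteration
    row['title'] = paper.find('a', {'class': 'gsc_a_at'}).text
    rows.append(row)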

You can use the SelectorGadget Chrome extension to visually grab CSS selectors.
Element(s) highlighted in:
red are excluded from the search.
green are included in the search.
yellow are what it guesses the user is looking for and need additional clarification.
(Screenshots: grabbing the h-index and the i10-index with SelectorGadget.)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml, os
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
proxies = {
    'http': os.getenv('HTTP_PROXY')
}
html = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')
for cited_by_public_access in soup.select('.gsc_rsb'):
    citations_all = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
    citations_since2016 = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text
    h_index_all = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
    h_index_2016 = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text
    i10_index_all = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
    i10_index_2016 = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text
    articles_num = cited_by_public_access.select_one('.gsc_rsb_m_a:nth-child(1) span').text.split(' ')[0]
    articles_link = cited_by_public_access.select_one('#gsc_lwp_mndt_lnk')['href']
    print('Citation info:')
    print(f'{citations_all}\n{citations_since2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n{articles_num}\nhttps://scholar.google.com{articles_link}\n')
Output:
Citation info:
55399
34899
69
59
148
101
23
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=cp-8uaAAAAAJ
Alternatively, you can do the same thing using Google Scholar Author API from SerpApi. It's a paid API with a free plan.
The main difference, in this particular example, is that you don't have to guess and tinker with how to grab certain elements of the HTML page.
Another thing is that you don't have to think about how to solve the CAPTCHA (it could appear at some point) or find good proxies if you need to make many requests.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "cp-8uaAAAAAJ",
    "hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']
print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')
public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']
print(f'{public_access_link}\n{public_access_available_articles}')
Output:
55399
34899
69
59
148
101
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=cp-8uaAAAAAJ
23
Disclaimer, I work for SerpApi.

Related

How do I extract specific pieces of text in a website with soup?

I am making Japanese flashcards by scraping this website.
My plan is to format it in a text file with the kanji, its 3 word examples, the hiragana reading on top of each of the words, and the English translation below it.
I want it to look like this:
kanji {word1},{hiragana},{english translation}
{word2},{hiragana},{english translation}
{word3},{hiragana},{english translation}
Example:
福 祝福,しゅくふく,blessing
幸福,こうふく,happiness; well-being; joy; welfare; blessedness
裕福,ゆうふく,wealthy; rich; affluent; well-off
So far I am trying it just with the website I mentioned, and eventually I'll loop it for a list of kanji characters I have. However, I am not sure how to extract the text I need from the website.
I know soup can be used, however I don't know what to put in the function to get the text I want.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
word1_list = []
word2_list = []
word3_list = []
kanji = '福'
url = f'https://jpdb.io/search?q={kanji}+%23kanji&lang=english#a'
session = HTMLSession()
response = session.get(url)
# uncertain what I should put here
soup = BeautifulSoup(response.html.html, 'html.parser')
words = soup.select('div.jp')  # uncertain what I should put here
word1_list.append(words)  # I want to try putting the data I want here
Here is one way of getting the information you're after:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
url = 'https://jpdb.io/search?q=%E7%A6%8F+%23kanji&lang=english#a'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'
}
soup = bs(requests.get(url, headers=headers).text, 'html.parser')
translation = soup.select_one('h6:-soup-contains("Keyword")').find_next('div').get_text(strip=True)
print(translation)
big_list = []
japanese_texts = soup.select('div[class="jp"]')
for j in japanese_texts:
    japanese_text = j.get_text(strip=True)
    translation = j.find_next_sibling('div').get_text(strip=True)
    big_list.append((japanese_text, translation))
df = pd.DataFrame(big_list, columns = ['Japanese', 'English'])
print(df)
Result in terminal:
good fortune
Japanese English
0  祝福    blessing
1  幸福    happiness; well-being; joy; welfare; bless...
2  裕福    wealthy; rich; affluent; well-off
3  愛することそして愛されること...    To love and to be loved is the greatest happin...
4  彼は幸福であるようだ。    He seems happy.
5  幸福な者もあれば、また不幸な者も...    Some are happy; others unhappy.
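If you then want the rows in a flat text file, a minimal sketch (note that the hiragana readings are not scraped above, so only the word and translation columns are written):
# one "Japanese,English" line per row; add a reading column first if you extract it separately
df.to_csv('flashcards.txt', index=False, header=False, encoding='utf-8')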
BeautifulSoup documentation can be found here. Also, try to avoid using deprecated packages: requests-html was last released on Feb 17, 2019, so it's pretty much unmaintained.

How can I get my python code to scrape the correct part of a website?

I am trying to get python to scrape a page on Mississippi's state legislature website. My goal is scrape a page and add what I've scraped into a new csv. My command prompt doesn't give me errors, but I am only scraping a " symbol and that is it. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['http://www.legislature.ms.gov/legislation/all-measures/']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict = [item.text for item in soup.select('tbody')]
df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
df.to_csv('3-New Bills.csv')
I believe the problem is with line 13:
temp_dict = [item.text for item in soup.select('tbody')]
What should I replace 'tbody' with in this code to see all of the bills? Thank you so much for your help.
EDIT: Please see Sergey K's comment below for a more elegant solution.
That table is being loaded in an iframe, so you would have to scrape that iframe's source for the data. The following code will return a dataframe with 3 columns (measure, shorttitle, author):
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
list_for_df = []
r = requests.get('http://billstatus.ls.state.ms.us/2022/pdf/all_measures/allmsrs.xml', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
for x in soup.select('msrgroup'):
    list_for_df.append((x.measure.text.strip(), x.shorttitle.text.strip(), x.author.text.strip()))
df = pd.DataFrame(list_for_df, columns = ['measure', 'short_title', 'author'])
df
Result:
measure short_title author
0 HB 1 Use of technology portals by those on probatio... Bell (65th)
1 HB 2 Youth court records; authorize judge to releas... Bell (65th)
2 HB 3 Sales tax; exempt retail sales of severe weath... Bell (65th)
3 HB 4 DPS; require to establish training component r... Bell (65th)
4 HB 5 Bonds; authorize issuance to assist City of Ja... Bell (65th)
... ... ... ...
You can add more data to that table, like measurelink, authorlink, action, etc. - whatever is available in the XML document tags.
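For instance, a hedged sketch extending the same loop with those extra tags (the extra tag names are assumed to match the XML exactly; adjust to whatever is actually present):
list_for_df = []
for x in soup.select('msrgroup'):
    list_for_df.append((
        x.measure.text.strip(),
        x.shorttitle.text.strip(),
        x.author.text.strip(),
        x.measurelink.text.strip() if x.measurelink else '',  # assumed tag name
        x.authorlink.text.strip() if x.authorlink else '',    # assumed tag name
        x.action.text.strip() if x.action else '',            # assumed tag name
    ))
df = pd.DataFrame(list_for_df, columns=['measure', 'short_title', 'author', 'measure_link', 'author_link', 'action'])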
Try get_text instead
https://beautiful-soup-4.readthedocs.io/en/latest/#get-text
temp_dict = [item.get_text() for item in soup.select('tbody')]
IIRC, .text only shows the direct child text, not the text of descendant tags. See XPath - Difference between node() and text() (which I think applies here for .text as well: it is the child text node, not other child nodes).
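For reference, get_text also accepts separator and strip arguments, which usually give more readable output than the raw concatenation (a sketch reusing the soup from the question):
# join descendant text with spaces and trim whitespace around each piece
temp_dict = [item.get_text(separator=' ', strip=True) for item in soup.select('tbody')]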

Can't scrape the Banggood site with Beautiful Soup and Selenium

Hi guys, I found some problems using Beautiful Soup.
I'm trying to scrape Banggood's website, but, I don't know why, I've only succeeded in scraping the item's name.
Using Selenium I scraped the item's price (only in USD, not in euros).
So I ask for your help; I would be very pleased if you knew any way to overcome these problems.
I would like to scrape the name, price in euros, discount, stars, and image, but I cannot understand why Beautiful Soup doesn't work.
P.S. Obviously I don't want all the functions, just the reason why Beautiful Soup gives all these problems, and an example if you can.
Below is the HTML I want to scrape (with Beautiful Soup if possible).
Thanks for everything!
The link I want to scrape: https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html?rmmds=flashdeals&cur_warehouse=USA
<span class="main-price" oriprice-range="0-0" oriprice="22.99">19,48€</span>
<strong class="average-num">4.95</strong>
<img src="https://imgaz1.staticbg.com/thumb/large/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg.webp" id="landingImage" data-large="https://imgaz1.staticbg.com/thumb/large/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg" dpid="left_largerView_image_180411|product|18101211554" data-src="https://imgaz1.staticbg.com/thumb/large/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg" style="height: 100%; transform: translate3d(0px, 0px, 0px);">
These are the functions I'm using.
This doesn't work:
def take_image_bang(soup):  # beautiful soup and json
    img_div = soup.find("div", attrs={"class":'product-image'})
    imgs_str = img_div.img.get('data-large')  # a string in JSON format
    # convert to a dictionary
    imgs_dict = json.loads(imgs_str)
    print(imgs_dict)
    # each key in the dictionary is a link of an image, and the value shows the size (print all the dictionary to inspect)
    # num_element = 0
    # first_link = list(imgs_dict.keys())[num_element]
    return imgs_dict
These work (but only USD not Euros for the function get_price):
def get_title_bang(soup):  # beautiful soup
    try:
        # Outer Tag Object
        title = soup.find("span", attrs={"class":'product-title-text'})
        # Inner NavigableString Object
        title_value = title.string
        # Title as a string value
        title_string = title_value.strip()
        # # Printing types of values for efficient understanding
        # print(type(title))
        # print(type(title_value))
        # print(type(title_string))
        # print()
    except AttributeError:
        title_string = ""
    return title_string
def get_Bangood_price(driver):  # selenium
    c = CurrencyConverter()
    prices = driver.find_elements_by_class_name('main-price')
    for price in prices:
        price = price.text.replace("US$","")
        priceZ = float(price)
        price_EUR = c.convert(priceZ, 'USD', 'EUR')
    return price_EUR
Since you want the price in EUR, the URL needs to be changed; you can set the currency accordingly from the web page.
import requests
from bs4 import BeautifulSoup
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
res=requests.get("https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html?rmmds=flashdeals&cur_warehouse=USA&DCC=IT&currency=EUR",headers=headers)
soup=BeautifulSoup(res.text,"lxml")
Finding title:
main_div=soup.find("div",class_="product-info")
title=main_div.find("h1",class_="product-title").get_text(strip=True)
print(title)
Output:
ANENG AN8008 Vero RMS Digitale Multimetri Tester di AC DC Corrente Tensione Resistenza Frenquenza CapacitàCOD
For finding the reviews:
star=[i.get_text(strip=True) for i in main_div.find("div",class_="product-reviewer").find_all("dd")]
star
Output:
['5 Stella2618 (95.8%)',
'4 Stella105 (3.8%)',
'3 Stella9 (0.3%)',
'2 Stella0 (0.0%)',
'1 Stella2 (0.1%)']
The price and other details you can get from the script tag; use json to load it:
data=soup.find("script", attrs={"type":"application/ld+json"}).string.strip().strip(";")
import json
main_data=json.loads(data)
finding values from it:
price=main_data['offers']['priceCurrency']+" "+main_data['offers']['price']
image=main_data['image']
print(price,image)
Output:
EUR 19.48 https://imgaz3.staticbg.com/thumb/view/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg
For finding the discount price: since prices are updated dynamically, you can
use the XHR link to call and get the data from it. Here is the
URL.
Use a POST request for it.
To scrape the data in euros, you need to change your link address and add this to the end of the link:
For EURO add: &currency=EUR
For USD add: &currency=USD
For Euro the link should be :
https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html?rmmds=flashdeals&cur_warehouse=USA&currency=EUR
For another example: if you wish to change the warehouse for the product change:
For CN change: cur_warehouse=CN
For USA change: cur_warehouse=USA
For PL change: cur_warehouse=PL
These are dynamic variables for a URL that changes the webpage depending on their inputs.
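For example, a minimal sketch that builds such a URL with requests query parameters (parameter names taken from the URLs above):
import requests
url = ("https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-"
       "AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html")
params = {
    "rmmds": "flashdeals",
    "cur_warehouse": "USA",  # or CN / PL
    "currency": "EUR",       # or USD
}
headers = {"User-Agent": "Mozilla/5.0"}
r = requests.get(url, params=params, headers=headers)
print(r.url)  # the final URL with the currency and warehouse parameters appended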
After this, your second method should work just fine. Happy scraping!!!

Python web-scraping using Beautifulsoup: lowes stores

I am new to scraping. I have been asked to get a list of store number, city, and state from the website: https://www.lowes.com/Lowes-Stores
Below is what I have tried so far. Since the structure does not have an attribute, I am not sure how to continue my code. Please guide me!
import requests
from bs4 import BeautifulSoup
import json
from pandas import DataFrame as df
url = "https://www.lowes.com/Lowes-Stores"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
page = requests.get(url, headers=headers)
page.encoding = 'ISO-8859-1'
soup = BeautifulSoup(page.text, 'html.parser')
lowes_list = soup.find_all(class_ = "list unstyled")
for i in lowes_list[:2]:
    print(i)
example = lowes_list[0]
example_content = example.contents
example_content
You've found the list elements that contain the links that you need for state store lookups in your for loop. You will need to get the href attribute from the "a" tag inside each "li" element.
This is only the first step since you'll need to follow those links to get the store results for each state.
Since you know the structure of this state link result, you can simply do:
for i in lowes_list:
    list_items = i.find_all('li')
    for x in list_items:
        for link in x.find_all('a'):
            print(link['href'])
There are definitely more efficient ways of doing this, but the list is very small and this works.
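For example, a more compact sketch of the same idea with a single CSS selector (assuming the "list unstyled" containers hold the li/a structure shown above):
# one selector instead of three nested loops
state_stores_links = [a["href"] for a in soup.select(".list.unstyled li a")]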
Once you have the links for each state, you can create another request for each one to visit those store results pages. Then obtain the href attribute from the search result links on each state's page. The link for each store, e.g. "Anchorage Lowe's", contains the city and the store number.
Here is a full example. I included lots of comments to illustrate the points.
You pretty much had everything up to Line 27, but you needed to follow the links for each state. A good technique for approaching these is to test the path out in your web browser first with the dev tools open, watching the HTML so you have a good idea of where to start with the code.
This script will obtain the data you need, but doesn't provide any data presentation.
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.lowes.com/Lowes-Stores"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
page = requests.get(url, headers=headers, timeout=5)
page.encoding = "ISO-8859-1"
soup = bs(page.text, "html.parser")
lowes_state_lists = soup.find_all(class_="list unstyled")

# we will store the links for each state in this array
state_stores_links = []

# now we populate the state_stores_links array by finding the href in each li tag
for ul in lowes_state_lists:
    list_items = ul.find_all("li")
    # now we have all the list items from the page, we have to extract the href
    for li in list_items:
        for link in li.find_all("a"):
            state_stores_links.append(link["href"])

# This next part is what the original question was missing, following the state links to their respective search result pages.
# at this point we have to request a new page for each state and store the results
# you can use pandas, but a dict works too.
states_stores = {}
for link in state_stores_links:
    # splitting up the link on the / gives us the parts of the URLs.
    # by inspecting with Chrome DevTools, we can see that each state follows the same pattern (state name and state abbreviation)
    link_components = link.split("/")
    state_name = link_components[2]
    state_abbreviation = link_components[3]
    # let's use the state_abbreviation as the dict's key, and we will have a stores array that we can do reporting on
    # the type and shape of this dict is irrelevant at this point. This example illustrates how to obtain the info you're after
    # in the end the states_stores[state_abbreviation]['stores'] array will hold dicts, each with a store_number and a city key
    states_stores[state_abbreviation] = {"state_name": state_name, "stores": []}
    try:
        # simple error catching in case something goes wrong, since we are sending many requests.
        # our link is just the second half of the URL, so we have to craft the new one.
        new_link = "https://www.lowes.com" + link
        state_search_results = requests.get(new_link, headers=headers, timeout=5)
        stores = []
        if state_search_results.status_code == 200:
            store_directory = bs(state_search_results.content, "html.parser")
            store_directory_div = store_directory.find("div", class_="storedirectory")
            # now we get the links inside the storedirectory div
            individual_store_links = store_directory_div.find_all("a")
            # we now have all the stores for this state! Let's parse and save them into our store dict
            # the store's city is after the state's abbreviation followed by a dash, the store number is the last thing in the link
            # example: "/store/AK-Wasilla/2512"
            for store in individual_store_links:
                href = store["href"]
                try:
                    # by splitting the href which looks to be consistent throughout the site, we can get the info we need
                    split_href = href.split("/")
                    store_number = split_href[3]
                    # the store city is after the -, so we have to split that element up into its two parts and access the second part.
                    store_city = split_href[2].split("-")[1]
                    # creating our store dict
                    store_object = {"city": store_city, "store_number": store_number}
                    # adding the dict to our state's dict
                    states_stores[state_abbreviation]["stores"].append(store_object)
                except Exception as e:
                    print(
                        "Error getting store info from {0}. Exception: {1}".format(
                            split_href, e
                        )
                    )
            # let's print something so we can confirm our script is working
            print(
                "State store count for {0} is: {1}".format(
                    states_stores[state_abbreviation]["state_name"],
                    len(states_stores[state_abbreviation]["stores"]),
                )
            )
        else:
            print(
                "Error fetching: {0}, error code: {1}".format(
                    link, state_search_results.status_code
                )
            )
    except Exception as e:
        print("Error fetching: {0}. Exception: {1}".format(state_abbreviation, e))

Adding a for loop to a working web scraper (Python and Beautifulsoup)

I have a query regarding for loops and adding one to an already working web scraper to run through a list of webpages. What I'm looking for is probably two or three simple lines of code.
I appreciate this has probably been asked many times before and answered, but I've been struggling to get some code to work for quite some time now. I'm relatively new to Python and looking to improve.
Background info:
I've written a web scraper using Python and Beautifulsoup which is successfully able to take a webpage from TransferMarkt.com and scrape all the required web links. The script is made up of two parts:
In the first part, I take the webpage for a football league, e.g. the Premier League, extract the webpage links for all the individual teams in the league table, and put them in a list.
In the second part of my script, I then take this list of individual teams, further extract information on each of the individual players for each team, and then join this together to form one big pandas DataFrame of player information.
My query is regarding how to add a for loop to the first part of this web scraper to not just extract the team links from one league webpage, but to extract links from a list of league webpages.
Below I've included an example of a football league webpage, my web scraper code, and the output.
Example:
Example webpage to scrape (Premier League - code GB1): https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/gb1/plus/?saison_id=2019
Code (part 1 of 2) - scrape individual team links from league webpage:
# Python libraries
## Data Preprocessing
import pandas as pd
## Data scraping libraries
from bs4 import BeautifulSoup
import requests
# Assign league by code, e.g. Premier League = 'GB1', to the list_league_selected variable
list_league_selected = 'GB1'
# Assign season by year to season variable e.g. 2014/15 season = 2014
season = '2019'
# Create an empty list to assign these values to
team_links = []
# Web scraper script
## Process League Table
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = 'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/' + list_league_selected + '/plus/?saison_id=' + season
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')
## Create an empty list to assign these values to - team_links
team_links = []
## Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")
## We need the location that the link is pointing to, so for each link, take the link location.
## Additionally, we only need the links in locations 1, 3, 5, etc. of our list, so loop through those only
for i in range(1,59,3):
    team_links.append(links[i].get("href"))
## For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(team_links)):
team_links[i] = "https://www.transfermarkt.co.uk" + team_links[i]
# View list of team weblinks assigned to variable - team_links
team_links
Output:
Extracted links from example webpage (20 links in total for example webpage, just showing 4):
team_links = ['https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2019',
'https://www.transfermarkt.co.uk/fc-liverpool/startseite/verein/31/saison_id/2019',
'https://www.transfermarkt.co.uk/tottenham-hotspur/startseite/verein/148/saison_id/2019',
'https://www.transfermarkt.co.uk/fc-chelsea/startseite/verein/631/saison_id/2019',
...,
'https://www.transfermarkt.co.uk/sheffield-united/startseite/verein/350/saison_id/2019']
Using this list of teams - team_links, I am then able to further extract information for all the players of each team with the following code. From this output I'm then able to create a pandas DataFrame of all players' info:
Code (part 2 of 2) - scrape individual player information using the team_links list:
# Create an empty DataFrame for the data, df
df = pd.DataFrame()
# Run the scraper through each of the links in the team_links list
for i in range(len(team_links)):
    # Download and process the team page
    page = team_links[i]
    df_headers = ['position_number' , 'position_description' , 'name' , 'dob' , 'nationality' , 'value']
    pageTree = requests.get(page, headers = headers)
    pageSoup = BeautifulSoup(pageTree.content, 'lxml')
    # Extract all data
    position_number = [item.text for item in pageSoup.select('.items .rn_nummer')]
    position_description = [item.text for item in pageSoup.select('.items td:not([class])')]
    name = [item.text for item in pageSoup.select('.hide-for-small .spielprofil_tooltip')]
    dob = [item.text for item in pageSoup.select('.zentriert:nth-of-type(4):not([id])')]
    nationality = ['/'.join([i['title'] for i in item.select('[title]')]) for item in pageSoup.select('.zentriert:nth-of-type(5):not([id])')]
    value = [item.text for item in pageSoup.select('.rechts.hauptlink')]
    df_temp = pd.DataFrame(list(zip(position_number, position_description, name, dob, nationality, value)), columns = df_headers)
    df = df.append(df_temp)  # This last line of code is mine. It appends the temporary data to the master DataFrame, df
# View the pandas DataFrame
df
My question to you - adding a for loop to go through all the leagues:
What I need to do is replace the list_league_selected variable assigned to an individual league code in the first part of my code, and instead use a for loop to go through the full list of league codes - list_all_leagues. This list of league codes is as follows:
list_all_leagues = ['L1', 'GB1', 'IT1', 'FR1', 'ES1'] # codes for the top 5 European leagues
I've read through several solutions but I'm struggling to get the loop to work and append the full list of team webpages at the correct part. I believe I'm now really close to completing my scraper and any advice on how to create this for loop would be much appreciated!
Thanks in advance for your help!
Actually, I've taken the time to clear up many mistakes in your code and shorten the long road. Below you can achieve your target.
I considered being under antibiotic protection (😋), meaning using requests.Session() to maintain the session during my loop, which prevents the TCP layer security from blocking/refusing/dropping my packets/requests while scraping.
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}
leagues = ['L1', 'GB1', 'IT1', 'FR1', 'ES1']

def main(url):
    with requests.Session() as req:
        links = []
        for lea in leagues:
            print(f"Fetching Links from {lea}")
            r = req.get(url.format(lea), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            link = [f"{url[:31]}{item.next_element.get('href')}" for item in soup.findAll(
                "td", class_="hauptlink no-border-links hide-for-small hide-for-pad")]
            links.extend(link)
        print(f"Collected {len(links)} Links")
        goals = []
        for num, link in enumerate(links):
            print(f"Extracting Page# {num + 1}")
            r = req.get(link, headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            target = soup.find("table", class_="items")
            pn = [pn.text for pn in target.select("div.rn_nummer")]
            pos = [pos.text for pos in target.findAll("td", class_=False)]
            name = [name.text for name in target.select("td.hide")]
            dob = [date.find_next("td").text for date in target.select("td.hide")]
            nat = [" / ".join([a.get("alt") for a in nat.find_all_next("td")[1] if a.get("alt")]) for nat in target.findAll(
                "td", itemprop="athlete")]
            val = [val.get_text(strip=True) for val in target.select('td.rechts.hauptlink')]
            goal = zip(pn, pos, name, dob, nat, val)
            df = pd.DataFrame(goal, columns=[
                'position_number', 'position_description', 'name', 'dob', 'nationality', 'value'])
            goals.append(df)
        new = pd.concat(goals)
        new.to_csv("data.csv", index=False)

main("https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2019")
Output: View Online
