Hi guys, I've run into some problems using Beautiful Soup.
I'm trying to scrape Banggood's website, but, I don't know why, I've only succeeded in scraping the item's name.
Using Selenium I scraped the item's price (but only in USD, not in euros).
So I'm asking for your help; I would be very grateful if you knew any way to overcome these problems.
I would like to scrape the name, price in euros, discount, stars and image, but I cannot understand why Beautiful Soup doesn't work for these.
P.S. Obviously I don't want all the functions, just the reason why Beautiful Soup gives all these problems, and an example if you can.
Below is the HTML I want to scrape (with Beautiful Soup if possible).
Thanks for everything!
The link I want to scrape: https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html?rmmds=flashdeals&cur_warehouse=USA
<span class="main-price" oriprice-range="0-0" oriprice="22.99">19,48€</span>
<strong class="average-num">4.95</strong>
<img src="https://imgaz1.staticbg.com/thumb/large/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg.webp" id="landingImage" data-large="https://imgaz1.staticbg.com/thumb/large/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg" dpid="left_largerView_image_180411|product|18101211554" data-src="https://imgaz1.staticbg.com/thumb/large/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg" style="height: 100%; transform: translate3d(0px, 0px, 0px);">
These are the functions I'm using.
This doesn't work:
def take_image_bang(soup):  # Beautiful Soup and json
    img_div = soup.find("div", attrs={"class": 'product-image'})
    imgs_str = img_div.img.get('data-large')  # a string in JSON format
    # convert to a dictionary
    imgs_dict = json.loads(imgs_str)
    print(imgs_dict)
    # each key in the dictionary is a link of an image, and the value shows the size (print the whole dictionary to inspect)
    # num_element = 0
    # first_link = list(imgs_dict.keys())[num_element]
    return imgs_dict
These work (but get_Bangood_price returns only USD, not euros):
def get_title_bang(soup):  # Beautiful Soup
    try:
        # Outer Tag Object
        title = soup.find("span", attrs={"class": 'product-title-text'})
        # Inner NavigableString Object
        title_value = title.string
        # Title as a string value
        title_string = title_value.strip()
        # # Printing types of values for efficient understanding
        # print(type(title))
        # print(type(title_value))
        # print(type(title_string))
        # print()
    except AttributeError:
        title_string = ""
    return title_string
def get_Bangood_price(driver):  # Selenium
    c = CurrencyConverter()
    prices = driver.find_elements_by_class_name('main-price')
    for price in prices:
        price = price.text.replace("US$", "")
        priceZ = float(price)
        price_EUR = c.convert(priceZ, 'USD', 'EUR')
        return price_EUR
Since you want the price in EUR, the URL needs to be changed; you can set the currency according to the web page:
import requests
from bs4 import BeautifulSoup
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
res=requests.get("https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html?rmmds=flashdeals&cur_warehouse=USA&DCC=IT&currency=EUR",headers=headers)
soup=BeautifulSoup(res.text,"lxml")
Finding title:
main_div=soup.find("div",class_="product-info")
title=main_div.find("h1",class_="product-title").get_text(strip=True)
print(title)
Output:
ANENG AN8008 Vero RMS Digitale Multimetri Tester di AC DC Corrente Tensione Resistenza Frenquenza CapacitàCOD
For finding reviews:
star=[i.get_text(strip=True) for i in main_div.find("div",class_="product-reviewer").find_all("dd")]
star
Output:
['5 Stella2618 (95.8%)',
'4 Stella105 (3.8%)',
'3 Stella9 (0.3%)',
'2 Stella0 (0.0%)',
'1 Stella2 (0.1%)']
For finding the price and other details, you can get them from the script tag; use json to load it:
data=soup.find("script", attrs={"type":"application/ld+json"}).string.strip().strip(";")
import json
main_data=json.loads(data)
finding values from it:
price=main_data['offers']['priceCurrency']+" "+main_data['offers']['price']
image=main_data['image']
print(price,image)
Output:
EUR 19.48 https://imgaz3.staticbg.com/thumb/view/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg
For finding the discount price: prices are updated dynamically, so you can call the page's XHR link and read the data from it (here is the url); use a POST request for it.
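I don't know the exact endpoint off-hand, so the URL and payload field names in this rough sketch are placeholders; copy the real XHR request (URL and form data) from your browser's Network tab when the price loads:

import requests

# NOTE: the endpoint and payload names below are hypothetical placeholders --
# replace them with the actual XHR request seen in the dev tools' Network tab.
xhr_url = "https://it.banggood.com/ajax/getDynamicPrice"   # hypothetical
payload = {"products_id": "1157985"}                       # hypothetical field name
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

res = requests.post(xhr_url, data=payload, headers=headers)
if res.ok:
    print(res.json())  # inspect the returned JSON for the price/discount fields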
To scrape the data in euros, you need to change your link address and add this to the end of the link:
For EURO add: &currency=EUR
For USD add: &currency=USD
For Euro the link should be:
https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html?rmmds=flashdeals&cur_warehouse=USA&currency=EUR
As another example, if you wish to change the warehouse for the product:
For CN change: cur_warehouse=CN
For USA change: cur_warehouse=USA
For PL change: cur_warehouse=PL
These are dynamic URL parameters that change the webpage depending on their values.
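If you'd rather not splice the query string together by hand, here is a small sketch of the same idea that lets requests build the URL from a params dict (parameter names taken from the URLs above):

import requests

base_url = ("https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-"
            "AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html")
params = {
    "rmmds": "flashdeals",
    "cur_warehouse": "USA",  # or CN, PL, ...
    "currency": "EUR",       # or USD
}
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

res = requests.get(base_url, params=params, headers=headers)
print(res.url)  # the final URL, with the currency/warehouse parameters appended by requests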
After this, your second method should work just fine. Happy scraping!!!
Related
I am trying to build a simple 'stock-checker' for a T-shirt I want to buy. Here is the link: https://yesfriends.co/products/mens-t-shirt-black?variant=40840532689069
As you can see, I am presented with 'Coming Soon' text, whereas usually, if an item is in stock, it will show 'Add To Cart'.
I thought the simplest way would be to use requests and BeautifulSoup to isolate this <button> tag and read the value of its text. If it eventually says 'Add To Cart', then I will write the code to email/message myself that it's back in stock.
However, here's the code I have so far, and you'll see that the response says the text contains 'Add To Cart', which is not what the website actually shows.
import requests
import bs4
URL = 'https://yesfriends.co/products/mens-t-shirt-black?variant=40840532689069'
def check_stock(url):
    page = requests.get(url)
    soup = bs4.BeautifulSoup(page.content, "html.parser")
    buttons = soup.find_all('button', {'name': 'add'})
    return buttons

if __name__ == '__main__':
    buttons = check_stock(URL)
    print(buttons[0].text)
All the data is available in a <script> tag as JSON, so we need to get it and extract the information we need. Let's use a simple slice by index to get clean JSON:
import requests
import json
url = 'https://yesfriends.co/products/mens-t-shirt-black'
response = requests.get(url)
index_start = response.text.index('product:', 0) + len('product:')
index_finish = response.text.index(', }', index_start)
json_obj = json.loads(response.text[index_start:index_finish])
for variant in json_obj['variants']:
    available = 'IN STOCK' if variant['available'] else 'OUT OF STOCK'
    print(variant['id'], variant['option1'], available)
OUTPUT:
40840532623533 XXS OUT OF STOCK
40840532656301 XS OUT OF STOCK
40840532689069 S OUT OF STOCK
40840532721837 M OUT OF STOCK
40840532754605 L OUT OF STOCK
40840532787373 XL OUT OF STOCK
40840532820141 XXL OUT OF STOCK
40840532852909 3XL IN STOCK
40840532885677 4XL OUT OF STOCK
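If you only care about the single variant from your original link (the id in the ?variant= part of the URL), a minimal follow-up to the snippet above could be:

# Reuses json_obj from the snippet above; the id comes from the ?variant= query parameter
target_id = '40840532689069'
for variant in json_obj['variants']:
    if str(variant['id']) == target_id:
        if variant['available']:
            print('Back in stock - time to send that email!')
        else:
            print('Still out of stock.')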
I am trying to get Python to scrape a page on Mississippi's state legislature website. My goal is to scrape the page and write what I've scraped into a new CSV. My command prompt doesn't give me errors, but I am only scraping a " symbol and that is it. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['http://www.legislature.ms.gov/legislation/all-measures/']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict = [item.text for item in soup.select('tbody')]
    df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
    df.to_csv('3-New Bills.csv')
I believe the problem is with line 13:
temp_dict = [item.text for item in soup.select('tbody')]
What should I replace 'tbody' with in this code to see all of the bills? Thank you so much for your help.
EDIT: Please see Sergey K's comment below for a more elegant solution.
That table is being loaded in an iframe, so you would have to scrape that iframe's source for its data. The following code will return a DataFrame with 3 columns (measure, shorttitle, author):
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
list_for_df = []
r = requests.get('http://billstatus.ls.state.ms.us/2022/pdf/all_measures/allmsrs.xml', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
for x in soup.select('msrgroup'):
    list_for_df.append((x.measure.text.strip(), x.shorttitle.text.strip(), x.author.text.strip()))
df = pd.DataFrame(list_for_df, columns = ['measure', 'short_title', 'author'])
df
Result:
measure short_title author
0 HB 1 Use of technology portals by those on probatio... Bell (65th)
1 HB 2 Youth court records; authorize judge to releas... Bell (65th)
2 HB 3 Sales tax; exempt retail sales of severe weath... Bell (65th)
3 HB 4 DPS; require to establish training component r... Bell (65th)
4 HB 5 Bonds; authorize issuance to assist City of Ja... Bell (65th)
... ... ... ...
You can add more data to that table, like measurelink, authorlink, action, etc - whatever is available in the xml document tags.
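For instance, assuming those extra tags are present on every msrgroup element, pulling two more of them in only needs the tuple and the column list extended:

# Same loop as above, but with two extra tags per msrgroup (tag names as they appear in the XML)
list_for_df = []
for x in soup.select('msrgroup'):
    list_for_df.append((x.measure.text.strip(),
                        x.shorttitle.text.strip(),
                        x.author.text.strip(),
                        x.measurelink.text.strip(),
                        x.authorlink.text.strip()))
df = pd.DataFrame(list_for_df, columns=['measure', 'short_title', 'author', 'measure_link', 'author_link'])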
Try get_text instead
https://beautiful-soup-4.readthedocs.io/en/latest/#get-text
temp_dict = [item.get_text() for item in soup.select('tbody')]
IIRC, .text only shows the direct child text, not including the text of descendant tags. See XPath - Difference between node() and text() (which I think applies here for .text as well; it is the child text node, not other child nodes).
Hi guys, I am trying to create a program in Python that compares prices from websites, but I can't get the prices. I have managed to get the title of the product and the quantity using the code below.
page = requests.get(urls[7],headers=Headers)
soup = BeautifulSoup(page.text, 'html.parser')
title = soup.find("h1",{"class" : "Titlestyles__TitleStyles-sc-6rxg4t-0 fDKOTS"}).get_text().strip()
quantity = soup.find("li", class_="quantity").get_text().strip()
total_price = soup.find('div', class_='Pricestyles__ProductPriceStyles-sc-118x8ec-0 fzwZWj price')
print(title)
print(quantity)
print(total_price)
I am trying to get the price from this website (I am creating a program to look for diaper prices, lol): https://www.drogasil.com.br/fralda-huggies-tripla-protecao-tamanho-m.html
The price is not coming through; even when I get the text, it always says that it's NoneType.
Some of the information is built up via javascript from data stored in <script> sections in the HTML. You can access this directly by searching for it and using Python's JSON library to decode it into a Python structure. For example:
from bs4 import BeautifulSoup
import requests
import json
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
url = 'https://www.drogasil.com.br/fralda-huggies-tripla-protecao-tamanho-m.html'
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
script = soup.find('script', type='application/ld+json')
data = json.loads(script.text)
title = data['name']
total_price = data['offers']['price']
quantity = soup.find("li", class_="quantity").get_text().strip()
print(title)
print(quantity)
print(total_price)
Giving you:
HUGGIES FRALDAS DESCARTAVEL INFANTIL TRIPLA PROTECAO TAMANHO M COM 42 UNIDADES
42 Tiras
38.79
I recommend you add print(data) to see what other information is available.
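For instance, pretty-printing the whole structure makes it easy to browse:

# Dump the entire JSON-LD structure with indentation to see every field it exposes
print(json.dumps(data, indent=2, ensure_ascii=False))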
I am scraping multiple Google Scholar pages and I have already written code using Beautiful Soup to extract information on title, author, and journal.
This is a sample page.
https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en
I now want to extract information on h-index, i10-index, and citations. When I inspected the page, I saw that all of these have the same class name (gsc_rsb_std). Given this, my question is:
How to extract this information using beautiful soup? The following code extracted information on citations. How to do it for the other two since the class name is the same?
columns['Citations'] = soup.findAll('td',{'class':'gsc_rsb_std'}).text
There is only one value each for name, citations, h-index and i10-index. However, there are multiple rows of papers. Ideally, I want my output in the following form.
Name h-index paper1
Name h-index paper2
Name h-index paper3
I tried the following and I am getting the output as above but only the last paper is repeated. Not sure what is happening here.
soup = BeautifulSoup(driver.page_source, 'html.parser')
columns = {}
columns['Name'] = soup.find('div', {'id': 'gsc_prf_in'}).text
papers = soup.find_all('tr', {'class': 'gsc_a_tr'})
for paper in papers:
    columns['title'] = paper.find('a', {'class': 'gsc_a_at'}).text
    File.append(columns)
My output is like this. Looks like there is something wrong with the loop.
Name h-index paper3
Name h-index paper3
Name h-index paper3
Appreciate any help. Thanks in advance!
I would consider using :has and :contains and target by search string
import requests
from bs4 import BeautifulSoup
searches = ['Citations', 'h-index', 'i10-index']
r = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en')
soup = BeautifulSoup(r.text, 'html.parser')
for search in searches:
    all_value = soup.select_one(f'td:has(a:contains("{search}")) + td')
    print(f'{search} All:', all_value.text)
    since_2016 = all_value.find_next('td')
    print(f'{search} since 2016:', since_2016.text)
You could also have used pandas read_html to grab that table by index.
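For example, something like this should work (the table index is a guess, so print the list of tables to confirm which one holds the citation stats; read_html also needs lxml or html5lib installed):

import pandas as pd
import requests

r = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en')
tables = pd.read_html(r.text)   # parses every <table> on the page into a list of DataFrames
print(tables[0])                # index 0 is a guess -- inspect the list to find the stats table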
Selenium question:
The element has an id, which is faster to match on using css selectors/find_element_by_id e.g.
driver.find_element_by_id("gsc_prf_in").text
I see no need, however, for selenium when scraping this page.
You can use the SelectorGadget Chrome extension to visually grab CSS selectors.
Element(s) highlighted in:
red are excluded from the search.
green are included in the search.
yellow is what the extension guesses the user is looking for and needs additional clarification.
Grab h-index:
Grab i10-index:
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml, os
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
    'http': os.getenv('HTTP_PROXY')
}
html = requests.get('https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')
for cited_by_public_access in soup.select('.gsc_rsb'):
    citations_all = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
    citations_since2016 = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text
    h_index_all = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
    h_index_2016 = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text
    i10_index_all = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
    i10_index_2016 = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text
    articles_num = cited_by_public_access.select_one('.gsc_rsb_m_a:nth-child(1) span').text.split(' ')[0]
    articles_link = cited_by_public_access.select_one('#gsc_lwp_mndt_lnk')['href']

    print('Citation info:')
    print(f'{citations_all}\n{citations_since2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n{articles_num}\nhttps://scholar.google.com{articles_link}\n')
Output:
Citation info:
55399
34899
69
59
148
101
23
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=cp-8uaAAAAAJ
Alternatively, you can do the same thing using Google Scholar Author API from SerpApi. It's a paid API with a free plan.
The main difference, in a particular example, is that you don't have to guess and tinker with how to grab certain elements of the HTML page.
Another thing is that you don't have to think about how to solve the CAPTCHA, (it could appear at some point) or find good proxies if there's a need for many requests.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "cp-8uaAAAAAJ",
    "hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']
print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')
public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']
print(f'{public_access_link}\n{public_access_available_articles}')
Output:
55399
34899
69
59
148
101
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=cp-8uaAAAAAJ
23
Disclaimer, I work for SerpApi.
I have a query regarding for loops: I want to add one to an already working web scraper so that it runs through a list of webpages. What I'm looking for is probably two or three simple lines of code.
I appreciate this has probably been asked many times before and answered, but I've been struggling to get some code to work for me for quite some time now. I'm relatively new to Python and looking to improve.
Background info:
I've written a web scraper using Python and Beautifulsoup which is successfully able to take a webpage from TransferMarkt.com and scrape all the required web links. The script is made up of two parts:
In the first part, I take the webpage for a football league, e.g. the Premier League, and extract the webpage links for all the individual teams in the league table and put them in a list.
In the second part of my script, I then take this list of individual teams and further extract information of each of the individual players for each team and then join this together to form one big pandas DataFrame of player information.
My query is regarding how to add a for loop to the first part of this web scraper to not just extract the team links from one league webpage, but to extract links from a list of league webpages.
Below I've included an example of a football league webpage, my web scraper code, and the output.
Example:
Example webpage to scrape (Premier League - code GB1): https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/gb1/plus/?saison_id=2019
Code (part 1 of 2) - scrape individual team links from league webpage:
# Python libraries
## Data Preprocessing
import pandas as pd
## Data scraping libraries
from bs4 import BeautifulSoup
import requests
# Assign league by code, e.g. Premier League = 'GB1', to the list_league_selected variable
list_league_selected = 'GB1'
# Assign season by year to season variable e.g. 2014/15 season = 2014
season = '2019'
# Create an empty list to assign these values to
team_links = []
# Web scraper script
## Process League Table
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = 'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/' + list_league_selected + '/plus/?saison_id=' + season
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')
## Create an empty list to assign these values to - team_links
team_links = []
## Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")
## We need the location that the link is pointing to, so for each link, take the link location.
## Additionally, we only need the links in locations 1, 3, 5, etc. of our list, so loop through those only
for i in range(1,59,3):
    team_links.append(links[i].get("href"))

## For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(team_links)):
    team_links[i] = "https://www.transfermarkt.co.uk" + team_links[i]
# View list of team weblinks assigned to variable - team_links
team_links
Output:
Extracted links from example webpage (20 links in total for example webpage, just showing 4):
team_links = ['https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2019',
'https://www.transfermarkt.co.uk/fc-liverpool/startseite/verein/31/saison_id/2019',
'https://www.transfermarkt.co.uk/tottenham-hotspur/startseite/verein/148/saison_id/2019',
'https://www.transfermarkt.co.uk/fc-chelsea/startseite/verein/631/saison_id/2019',
...,
'https://www.transfermarkt.co.uk/sheffield-united/startseite/verein/350/saison_id/2019']
Using this list of teams - team_links, I am then able to further extract information for all the players of each team with the following code. From this output I'm then able to create a pandas DataFrame of all players info:
Code (part 2 of 2) - scrape individual player information using the team_links list:
# Create an empty DataFrame for the data, df
df = pd.DataFrame()
# Run the scraper through each of the links in the team_links list
for i in range(len(team_links)):
    # Download and process the team page
    page = team_links[i]
    df_headers = ['position_number' , 'position_description' , 'name' , 'dob' , 'nationality' , 'value']
    pageTree = requests.get(page, headers = headers)
    pageSoup = BeautifulSoup(pageTree.content, 'lxml')

    # Extract all data
    position_number = [item.text for item in pageSoup.select('.items .rn_nummer')]
    position_description = [item.text for item in pageSoup.select('.items td:not([class])')]
    name = [item.text for item in pageSoup.select('.hide-for-small .spielprofil_tooltip')]
    dob = [item.text for item in pageSoup.select('.zentriert:nth-of-type(4):not([id])')]
    nationality = ['/'.join([i['title'] for i in item.select('[title]')]) for item in pageSoup.select('.zentriert:nth-of-type(5):not([id])')]
    value = [item.text for item in pageSoup.select('.rechts.hauptlink')]
    df_temp = pd.DataFrame(list(zip(position_number, position_description, name, dob, nationality, value)), columns = df_headers)
    df = df.append(df_temp)  # This last line of code is mine. It appends the temporary data to the master DataFrame, df
# View the pandas DataFrame
df
My question to you - adding a for loop to go through all the leagues:
What I need to do is replace the list_league_selected variable assigned to an individual league code in the first part of my code, and instead use a for loop to go through the full list of league codes - list_all_leagues. This list of league codes is as follows:
list_all_leagues = ['L1', 'GB1', 'IT1', 'FR1', 'ES1'] # codes for the top 5 European leagues
I've read through several solutions but I'm struggling to get the loop to work and append the full list of team webpages at the correct part. I believe I'm now really close to completing my scraper and any advice on how to create this for loop would be much appreciated!
Thanks in advance for your help!
Actually, I've taken the time to clean up many of the mistakes in your code and to shorten it considerably. Below is how you can achieve your target.
To stay "under antibiotic protection" (😋), I use requests.Session() to maintain the session during my loop, which helps prevent the TCP-layer security from blocking/refusing/dropping my packets/requests while scraping.
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}
leagues = ['L1', 'GB1', 'IT1', 'FR1', 'ES1']
def main(url):
    with requests.Session() as req:
        links = []
        for lea in leagues:
            print(f"Fetching Links from {lea}")
            r = req.get(url.format(lea), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            link = [f"{url[:31]}{item.next_element.get('href')}" for item in soup.findAll(
                "td", class_="hauptlink no-border-links hide-for-small hide-for-pad")]
            links.extend(link)

        print(f"Collected {len(links)} Links")
        goals = []
        for num, link in enumerate(links):
            print(f"Extracting Page# {num +1}")
            r = req.get(link, headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            target = soup.find("table", class_="items")
            pn = [pn.text for pn in target.select("div.rn_nummer")]
            pos = [pos.text for pos in target.findAll("td", class_=False)]
            name = [name.text for name in target.select("td.hide")]
            dob = [date.find_next(
                "td").text for date in target.select("td.hide")]
            nat = [" / ".join([a.get("alt") for a in nat.find_all_next("td")[1] if a.get("alt")]) for nat in target.findAll(
                "td", itemprop="athlete")]
            val = [val.get_text(strip=True)
                   for val in target.select('td.rechts.hauptlink')]
            goal = zip(pn, pos, name, dob, nat, val)
            df = pd.DataFrame(goal, columns=[
                'position_number', 'position_description', 'name', 'dob', 'nationality', 'value'])
            goals.append(df)

        new = pd.concat(goals)
        new.to_csv("data.csv", index=False)

main("https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2019")
Output: the combined data for all five leagues is written to data.csv.