I'm trying to scrape the website (https://na.op.gg/champion/statistics) with XPath to find out which champions have the highest win rate, which I was able to do using this:
champion = tree.xpath('//div[@class="champion-index-table__name"]/text()')
But I realized that the names I'm trying to get are in a table whose size changes with the current game meta, so I want to scrape only the names that fall under specific categories, so I won't have problems later when the number of champions in the table changes. The website separates them under different "tiers" that look like this:
<tbody class="tabItem champion-trend-tier-TOP" style="display: table-row-group;">
<tr>
<td class="champion-index-table__cell champion-index-table__cell--rank">1</td>
<td class="champion-index-table__cell champion-index-table__cell--change champion-index-table__cell--change--stay">
<img src="//opgg-static.akamaized.net/images/site/champion/icon-championtier-stay.png" alt="">
0
</td>
<td class="champion-index-table__cell champion-index-table__cell--image">
<i class="__sprite __spc32 __spc32-32"></i>
</td>
<td class="champion-index-table__cell champion-index-table__cell--champion">
<a href="/champion/garen/statistics/top">
<div class="champion-index-table__name">Garen</div>
<div class="champion-index-table__position">
Top, Middle </div>
</a>
</td>
<td class="champion-index-table__cell champion-index-table__cell--value">53.12%</td>
<td class="champion-index-table__cell champion-index-table__cell--value">16.96%</td>
<td class="champion-index-table__cell champion-index-table__cell--value">
<img src="//opgg-static.akamaized.net/images/site/champion/icon-champtier-1.png" alt="">
</td>
</tr>
<tr>
Then the next one goes to
<tbody class="tabItem champion-trend-tier-JUNGLE" style="display: table-row-group;">
So, I've tried this, but it outputs nothing but []:
championtop = tree.xpath('//div/table/tbody/tr//td[4][@class="champion-index-table__name"]/text()')
Hopefully my question makes sense.
I was able to accomplish my goal by just doing
champion = tree.xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-TOP"]/tr/td[4]/a/div[1]/text()')
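For reference, a minimal self-contained sketch of that approach with lxml (requests is assumed for the fetch; the site may require request headers or render the table with JavaScript, in which case lxml alone won't see the rows):
import requests
from lxml import html

page = requests.get('https://na.op.gg/champion/statistics')
tree = html.fromstring(page.content)
# names in the TOP tier only; the expression is the same one as above
top_names = tree.xpath(
    '//table[@class="champion-index-table tabItems"]'
    '/tbody[@class="tabItem champion-trend-tier-TOP"]/tr/td[4]/a/div[1]/text()'
)
print(top_names)  # e.g. ['Garen', ...] depending on the current meta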
Thanks for reading!
You may go through all the lines of code; I'm sure you will get your answer.
This is how you can target elements with XPath easily and reliably.
Initial setup
from selenium import webdriver
wb = webdriver.Chrome('Path to your chrome webdriver')
wb.get('https://na.op.gg/champion/statistics')
For TOP Tier
y_top = {}
tbody_top = wb.find_elements_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-TOP"]/tr')
for i in range(len(tbody_top)):
    y_top[wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-TOP"]/tr[' + str(i + 1) + ']/td[4]/a/div[1]').text] = float(wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-TOP"]/tr[' + str(i + 1) + ']/td[5]').text.rstrip('%'))
For Jungle
y_jung = {}
tbody_jung = wb.find_elements_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-JUNGLE"]/tr')
for i in range(len(tbody_jung)):
    y_jung[wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-JUNGLE"]/tr[' + str(i + 1) + ']/td[4]/a/div[1]').text] = float(wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-JUNGLE"]/tr[' + str(i + 1) + ']/td[5]').text.rstrip('%'))
For Middle
y_mid = {}
tbody_mid = wb.find_elements_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-MID"]/tr')
for i in range(len(tbody_mid)):
    y_mid[wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-MID"]/tr[' + str(i + 1) + ']/td[4]/a/div[1]').text] = float(wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-MID"]/tr[' + str(i + 1) + ']/td[5]').text.rstrip('%'))
For Bottom
y_bott = {}
tbody_bott = wb.find_elements_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-ADC"]/tr')
for i in range(len(tbody_bott)):
    y_bott[wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-ADC"]/tr[' + str(i + 1) + ']/td[4]/a/div[1]').text] = float(wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-ADC"]/tr[' + str(i + 1) + ']/td[5]').text.rstrip('%'))
For Support
y_sup = {}
tbody_sup = wb.find_elements_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-SUPPORT"]/tr')
for i in range(len(tbody_sup)):
    y_sup[wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-SUPPORT"]/tr[' + str(i + 1) + ']/td[4]/a/div[1]').text] = float(wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-SUPPORT"]/tr[' + str(i + 1) + ']/td[5]').text.rstrip('%'))
print(max(y_top, key=y_top.get))   # prints the champion with the highest win rate in the TOP tier
print(max(y_jung, key=y_jung.get))
print(max(y_mid, key=y_mid.get))
print(max(y_bott, key=y_bott.get))
print(max(y_sup, key=y_sup.get))
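The five blocks above differ only in the tier suffix, so one possible consolidation is a single loop (a sketch using the same locators and the same old-style Selenium calls; the role keys are just the class-name suffixes used above):
roles = ['TOP', 'JUNGLE', 'MID', 'ADC', 'SUPPORT']
winrates = {}
for role in roles:
    rows = wb.find_elements_by_xpath(
        '//table[@class="champion-index-table tabItems"]'
        '/tbody[@class="tabItem champion-trend-tier-%s"]/tr' % role)
    winrates[role] = {}
    for row in rows:
        # relative XPath per row: td[4] holds the name, td[5] the win rate
        name = row.find_element_by_xpath('./td[4]/a/div[1]').text
        rate = float(row.find_element_by_xpath('./td[5]').text.rstrip('%'))
        winrates[role][name] = rate
for role, rates in winrates.items():
    print(role, max(rates, key=rates.get))  # champion with the highest win rate per role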
==========================================
You can target any element using an attribute-based XPath like below:
1. wb.find_element_by_xpath('//div[@id="the_id_of_div"]/a/span')
2. wb.find_element_by_xpath('//div[@class="class name"]/p/span')
3. wb.find_element_by_xpath('//div[@title="title of element"]/p/span')
4. wb.find_element_by_xpath('//div[@style="style x "]/p/span')
5. wb.find_elements_by_xpath('//*[contains(text(),"the text you want to find")]') # keep in mind the page may contain multiple elements with the same text you are searching for
6. To find the parent of a found element:
that_found_element.find_element_by_xpath('..') # you can repeat this in a loop to walk up to the topmost parent
7. To find a sibling of an element (see the small example after this list):
To find the preceding sibling:
wb.find_element_by_xpath('//span[@id="theidx"]/preceding-sibling::input') # targets an input tag that is a preceding sibling of the span with id "theidx"
To find the following sibling:
wb.find_element_by_xpath('//span[@id="theidy"]/following-sibling::input') # targets an input tag that is a following sibling of the span with id "theidy"
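As a quick, self-contained illustration of the parent and sibling lookups above (lxml is used here only so the example runs without a browser; the same XPath strings work with wb.find_element_by_xpath, and the small HTML snippet is made up):
from lxml import html

# made-up markup: a span with one input sibling on each side
doc = html.fromstring(
    '<div><input name="before"/><span id="theidx">label</span><input name="after"/></div>')
span = doc.xpath('//span[@id="theidx"]')[0]
print(span.getparent().tag)  # div -- same idea as find_element_by_xpath('..')
print(doc.xpath('//span[@id="theidx"]/preceding-sibling::input')[0].get('name'))  # before
print(doc.xpath('//span[@id="theidx"]/following-sibling::input')[0].get('name'))  # after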
Related
I'm trying, for a bootcamp project, to scrape data from the following website: https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30. My aim is to get separate lists for the following columns: Date, Market Cap, Volume, Open, and Close. My problem is that for the columns Market Cap, Volume, Open and Close the class name (text-center) is the same:
<td class="text-center">
$161,716,193,676
</td>
<td class="text-center">
$16,571,161,476
</td>
<td class="text-center">
$1,340.02
</td>
<td class="text-center">
N/A
</td>
I've tried to solve it with this:
import requests
url_get = requests.get('https://www.coingecko.com/en/coins/ethereum/historical_data#panel')
from bs4 import BeautifulSoup
soup = BeautifulSoup(url_get.content,"html.parser")
table = soup.find('div', attrs={'class':'card-block'})
row = table.find_all('th', attrs={'class':'font-semibold text-center'})
row_length = len(row)
row_length
temp = []
for i in range(0, row_length):
    Date = table.find_all('th', attrs={'class':'font-semibold text-center'})[i].text
    Market_Cap = table.find_all('td', attrs={'class':'text-center'})[i].text
    Market_Cap = Market_Cap.strip()
    Volume = table.find_all('td', attrs={'class':'text-center'})[i].text
    Volume = Market_Cap.strip()
    Open = table.find_all('td', attrs={'class':'text-center'})[i].text
    Open = Open.strip()
    Close = table.find_all('td', attrs={'class':'text-center'})[i].text
    Close = Close.strip()
    temp.append((Date,Market_Cap,Volume,Open,Close))
temp
and the output looked as frustrating as this:
[('2022-09-29',
'$161,716,193,676',
'$161,716,193,676',
'$161,716,193,676',
('2022-09-28',
'$16,571,161,476',
'$16,571,161,476',
'$16,571,161,476',
'$16,571,161,476'),
(and on)
I need to do it with the row-length method (which is 31 according to my code), but since the class names are not distinct for those columns, I suppose I can't get the output I wanted. It would be much appreciated if anyone could help me figure it out. Cheers.
soup.find_all('td', class_='text-center') returns a list of cells, so take the text per element, e.g. [td.get_text(strip=True) for td in soup.find_all('td', class_='text-center')].
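A sketch of the row-wise idea: rather than indexing one flat find_all('td') result, walk each table row and take its cells by position (this assumes each data row holds the date in a th followed by the Market Cap, Volume, Open and Close cells, all with class text-center, as in the markup quoted above):
import requests
from bs4 import BeautifulSoup

url_get = requests.get('https://www.coingecko.com/en/coins/ethereum/historical_data#panel')
soup = BeautifulSoup(url_get.content, "html.parser")
table = soup.find('div', attrs={'class': 'card-block'})

dates, market_caps, volumes, opens, closes = [], [], [], [], []
for tr in table.find_all('tr'):
    th = tr.find('th')                                      # the date cell
    tds = tr.find_all('td', attrs={'class': 'text-center'})
    if th is None or len(tds) < 4:
        continue                                            # header row or an incomplete row
    dates.append(th.get_text(strip=True))
    market_caps.append(tds[0].get_text(strip=True))
    volumes.append(tds[1].get_text(strip=True))
    opens.append(tds[2].get_text(strip=True))
    closes.append(tds[3].get_text(strip=True))

print(list(zip(dates, market_caps, volumes, opens, closes))[:3])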
I want to get
the number after: page=
the number after: "new">
the number after: /a>-
<td> </td>
<td> qwqwqwqwqw <br/> qwqwqwqwqw <a href="link?page=99" target="new">4449</a>-4450<br/> </td>
<td> </td>
<td> qwqwqwqwqw <br/> qwqwqwqwqw <a href="link?page=77" target="new">5111</a>-5550<br/> </td>
<td> </td>
...
My code
tables = soup.find_all('a', attrs={'target': 'new'})
gives me only a list (see below) without the third number
[4449,
5111,
...]
Here is how I would try to extract the 3 numbers from my list, once it has the third number in it.
import re

list_of_number1 = []
list_of_number2 = []
list_of_number3 = []
regex = re.compile(r"page=(\d+)")
for table in tables:
    number1 = filter(regex.match, tables)
    number2 = table.next_sibling.strip()
    number3 =
    list_of_number1.append(number1)
    list_of_number2.append(number2)
    list_of_number3.append(number3)
Do I use BeautifulSoup for the third number, or is it feasible to regex through the whole HTML for any number following "/a>-"?
Here is how you can obtain your result using just the information that you need to get the numbers in the specific a element and in the text node immediately on the right:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all('a', attrs={'target': 'new'})
print([(t.text, t["href"].split('=')[-1], t.next_sibling.replace('-', '')) for t in tables])
# => [('4449', '99', '4450'), ('5111', '77', '5550')]
You may certainly go the harder way with regexps:
import re
# ... the same initialization code as above
results = []
for t in tables:
    page = ""
    page_m = re.search(r"[#?]page=(\d+)", t["href"])
    if page_m:
        page = page_m.group(1)
    else:
        page = ""
    num = "".join([x for x in t.next_sibling if x.isdigit()])
    results.append((int(t.text), int(page), int(num)))
print(results)
# => [(4449, 99, 4450), (5111, 77, 5550)]
NOTE:
t.text - gets the element text
t["href"] - gets the href attribute value of the t element
t.next_sibling - gets the next node after current one that is on the same hierarchy level.
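Put together as a self-contained check (the html string below mirrors the snippet in the question; the href values are assumed placeholders, only their page= part matters):
from bs4 import BeautifulSoup

html = ('<td> qwqwqwqwqw <br/> qwqwqwqwqw <a href="link?page=99" target="new">4449</a>-4450<br/> </td>'
        '<td> qwqwqwqwqw <br/> qwqwqwqwqw <a href="link?page=77" target="new">5111</a>-5550<br/> </td>')

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all('a', attrs={'target': 'new'})
print([(t.text, t["href"].split('=')[-1], t.next_sibling.replace('-', '')) for t in tables])
# => [('4449', '99', '4450'), ('5111', '77', '5550')]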
You can also try:
for b in soup.select('a'):
print(b.attrs['href'].split('=')[1], b.text, b.nextSibling)
Output:
99 4449 -4450
77 5111 -5550
I am a self-learner and a beginner; I have searched a lot but maybe my searching is lacking. I am scraping some values from two web sites and I want to compare them in an HTML output. On each web page I combine two classes into a list. But when producing the HTML output I don't want the whole list printed, so I made a function to choose which keywords to print. When I print that function's result, it turns out as 'None' in the HTML output, but it prints what I wanted on the console. So how do I show that specific list?
OS: Windows, Python 3.
from bs4 import BeautifulSoup
import requests
import datetime
import os
import webbrowser
carf_meySayf = requests.get('https://www.carrefoursa.com/tr/tr/meyve/c/1015?show=All').text
carf_soup = BeautifulSoup(carf_meySayf, 'lxml')
#spans
carf_name_span = carf_soup.find_all('span', {'class' : 'item-name'})
carf_price_span = carf_soup.find_all('span', {'class' : 'item-price'})
#spans to list
carf_name_list = [span.get_text() for span in carf_name_span]
carf_price_list = [span.get_text() for span in carf_price_span]
#combine lists
carf_mey_all = [carf_name_list +' = ' + carf_price_list for carf_name_list, carf_price_list in zip(carf_name_list, carf_price_list)]
#Function to choose and print special product
def test(namelist, product):
    for i in namelist:
        if product in i:
            print(i)
a = test(carf_mey_all,'Muz')
# Date
date = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
# HTML part
html_str = """
<html>
<title>Listeler</title>
<h2>Tarih: %s</h2>
<h3>Product & Shop List</h3>
<table style="width:100%%">
<tr>
<th>Carrefour</th>
</tr>
<tr>
%s
</tr>
</html>
"""
whole = html_str %(date,a)
Html_file= open("Meyve.html","w")
Html_file.write(whole)
Html_file.close()
The test() method must have a return value, for example:
def test(namelist, product):
    results = ''
    for i in namelist:
        if product in i:
            print(i)
            results += '<td>%s</td>\n' % i
    return results
Meyve.html results:
<html>
<title>Listeler</title>
<h2>Tarih: 2018-12-29 07:34:00</h2>
<h3>Product & Shop List</h3>
<table style="width:100%">
<tr>
<th>Carrefour</th>
</tr>
<tr>
<td>Muz = 6,99 TL</td>
<td>İthal Muz = 12,90 TL</td>
<td>Paket Yerli Muz = 9,99 TL</td>
</tr>
</html>
Note: to be valid HTML you need to add <body></body>.
The problem is that your test() function isn't explicitly returning anything, so it is implicitly returning None.
To fix this, test() should accumulate the text it wants to return (i.e, by building a list or string) and return a string containing the text you want to insert into html_str.
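For example, a compact variant of that fix (the same idea as the answer above, just returning the joined string in one step; carf_mey_all, html_str and date are the names from the question):
def test(namelist, product):
    # collect every matching entry as a <td> cell and return them as one string
    return '\n'.join('<td>%s</td>' % i for i in namelist if product in i)

a = test(carf_mey_all, 'Muz')   # a is now a string instead of None
whole = html_str % (date, a)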
I want to get the movie title, year, rating, genres, and run time of five movies from the HTML page given in the code. These are in the rows of a table called results.
from bs4 import BeautifulSoup
import urllib2

def read_from_url(url, num_m=5):
    html_string = urllib2.urlopen(url)
    soup = BeautifulSoup(html_string)
    movie_table = soup.find('table', 'results')  # table of movie
    list_movies = []
    count = 0
    for row in movie_table.find_all("tr"):
        dict_each_movie = {}
        title = title.encode("ascii", "ignore")  # getting title
        dict_each_movie["title"] = title
        year = year.encode("ascii", "ignore")  # getting year
        dict_each_movie["year"] = year
        rank = rank.encode("ascii", "ignore")  # getting rank
        dict_each_movie["rank"] = rank
        # genres = []  # getting genres of a movie
        runtime = runtime.encode("ascii", "ignore")  # getting rank
        dict_each_movie["runtime"] = runtime
        list_movies.append(dict_each_movie)
        count += 1
        if count == num_of_m:
            break
    return list_movies

print read_from_url('http://www.imdb.com/search/title?at=0&sort=user_rating&start=1&title_type=feature&year=2005,2015', 2)
Expected output:
[{'rating': '10.0', 'genres': ['Comedy', 'Family'], 'title': 'How to Beat a Bully', 'rank': '1', 'year': '2014', 'runtime': '90'},..........]
You're accessing a variable that hasn't been declared. When the interpreter sees title.encode("ascii", "ignore"), it looks for the variable title, which hasn't been assigned previously. Python can't possibly know what title is, so you can't call encode on it. The same goes for year and rank. Instead use:
title = 'How to Beat a Bully'.encode('ascii','ignore')
Why so???
Make your life easier with CSS Selectors.
<table>
  <tr class="my_class">
    <td id="id_here">
      <a href="link_here">First Link</a>
    </td>
    <td id="id_here">
      <a href="link_here">Second Link</a>
    </td>
  </tr>
</table>
for tr in movie_table.select("tr.my_class"):
    for td in tr.select("td#id_here"):
        print("Link " + td.select("a")[0]["href"])
        print("Text " + td.select("a")[0].text)
Hello all… I am trying to use BeautifulSoup to pick up the content of "Date of Employment:" on a webpage. The webpage contains 5 tables. The 5 tables are similar and look like below.
<table class="table1"><thead><tr><th style="width: 140px;" class="CII">Design Team</th><th class="top">Top</th></tr></thead><tbody><tr><td style="width:20px;">Designer:</td><td>Michael Linnen</td></tr>
<tr><td style="width:20px;">Date of Employment:</td><td>07 Jan 2012</td></tr>
<tr><td style="width:20px;">No of Works:</td><td>6</td></tr>
<tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>
<table class="table1"><thead><tr><th style="width: 140px;" class="CII">Operation Team</th><th class="top">Top</th></tr></thead><tbody><tr><td style="width:20px;">Manager:</td><td>Nich Sharmen</td></tr>
<tr><td style="width:20px;">Date of Employment:</td><td>02 Nov 2005</td></tr>
<tr><td style="width:20px;">Zones:</td><td>6</td></tr>
<tr><td style="width: 15px">No of teams:</td><td vAlign="top">2<br>Combined</td></tr>
The text I want is in the 3rd table; the table header is "Design Team".
I am using the below:
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
aa = soup.find_all(text=re.compile("Date of Employment:"))
bb = aa[2].findNext('td')
print bb.text
The problem is that the "Date of Employment:" row is sometimes not present in this table. When it's not there, the code picks up the "Date of Employment:" from the next table.
How do I restrict my code to pick only the wanted one in the table named "Design Team"? Thanks.
Rather than finding all the "Date of Employment:" matches and taking the next td, you can directly target the table whose th is "Design Team":
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
design_table = soup.find(text="Design Team").find_parent('table')   # the table whose header says "Design Team"
label = design_table.find('td', text="Date of Employment:")         # look for the label only inside that table
if label:
    print label.find_next_sibling('td').text
else:
    print "No Date of Employment:"
soup.find(text="Design Team").find_parent('table') locates the "Design Team" text node and climbs up to the table that contains it.
design_table.find('td', text="Date of Employment:") searches for the label cell only inside that table, so a "Date of Employment:" in another table can no longer be picked up.
label.find_next_sibling('td') is the td immediately after the label, and .text prints the date.
When the row is missing, find() returns None and the else branch reports it.
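As a quick self-contained check, the same logic run against a trimmed copy of the two tables pasted in the question (BeautifulSoup repairs the unclosed tags):
from bs4 import BeautifulSoup

html = '''
<table class="table1"><thead><tr><th class="CII">Design Team</th><th class="top">Top</th></tr></thead><tbody>
<tr><td>Designer:</td><td>Michael Linnen</td></tr>
<tr><td>Date of Employment:</td><td>07 Jan 2012</td></tr></tbody></table>
<table class="table1"><thead><tr><th class="CII">Operation Team</th><th class="top">Top</th></tr></thead><tbody>
<tr><td>Manager:</td><td>Nich Sharmen</td></tr>
<tr><td>Date of Employment:</td><td>02 Nov 2005</td></tr></tbody></table>
'''

soup = BeautifulSoup(html, "html.parser")
design_table = soup.find(text="Design Team").find_parent('table')
label = design_table.find('td', text="Date of Employment:")
print(label.find_next_sibling('td').text if label else "No Date of Employment:")
# => 07 Jan 2012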