How can I make a proper loop for this case - python

I'm trying, for a bootcamp project, to scrape data from the following website: https://www.coingecko.com/en/coins/bitcoin/historical_data/usd?start_date=2021-01-01&end_date=2021-09-30. My aim is to get separate lists for the following columns: Date, Market Cap, Volume, Open, and Close. My problem is that for the Market Cap, Volume, Open, and Close columns the class name (text-center) is the same:
<td class="text-center">
$161,716,193,676
</td>
<td class="text-center">
$16,571,161,476
</td>
<td class="text-center">
$1,340.02
</td>
<td class="text-center">
N/A
</td>
I've tried to solve it with this:
import requests
from bs4 import BeautifulSoup

url_get = requests.get('https://www.coingecko.com/en/coins/ethereum/historical_data#panel')
soup = BeautifulSoup(url_get.content, "html.parser")
table = soup.find('div', attrs={'class': 'card-block'})
row = table.find_all('th', attrs={'class': 'font-semibold text-center'})
row_length = len(row)
row_length
temp = []
for i in range(0, row_length):
    Date = table.find_all('th', attrs={'class': 'font-semibold text-center'})[i].text
    Market_Cap = table.find_all('td', attrs={'class': 'text-center'})[i].text
    Market_Cap = Market_Cap.strip()
    Volume = table.find_all('td', attrs={'class': 'text-center'})[i].text
    Volume = Market_Cap.strip()
    Open = table.find_all('td', attrs={'class': 'text-center'})[i].text
    Open = Open.strip()
    Close = table.find_all('td', attrs={'class': 'text-center'})[i].text
    Close = Close.strip()
    temp.append((Date, Market_Cap, Volume, Open, Close))
temp
and the output looked as frustrating as this:
[('2022-09-29',
  '$161,716,193,676',
  '$161,716,193,676',
  '$161,716,193,676',
  '$161,716,193,676'),
 ('2022-09-28',
  '$16,571,161,476',
  '$16,571,161,476',
  '$16,571,161,476',
  '$16,571,161,476'),
 (and so on)
I need to do it with the row-length method (which gives 31 according to my code), but since the class names are all the same, I suppose I can't get the output I wanted. It would be much appreciated if anyone could help me figure it out. Cheers.

cells = [td.get_text(strip=True) for td in soup.find_all('td', class_='text-center')]
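Since find_all returns a list, one way to split the flat run of identically-classed cells into per-row groups is to slice it in strides of four. This is a sketch using made-up sample HTML standing in for two date rows of the CoinGecko table, not the live page:

```python
from bs4 import BeautifulSoup

# hypothetical sample: two rows, each contributing four text-center cells
html = """
<table><tr>
<td class="text-center">$161,716,193,676</td>
<td class="text-center">$16,571,161,476</td>
<td class="text-center">$1,340.02</td>
<td class="text-center">N/A</td>
</tr><tr>
<td class="text-center">$162,000,000,000</td>
<td class="text-center">$17,000,000,000</td>
<td class="text-center">$1,350.00</td>
<td class="text-center">N/A</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
cells = [td.get_text(strip=True) for td in soup.find_all("td", class_="text-center")]
# each date row has exactly four text-center cells: Market Cap, Volume, Open, Close
rows = [cells[i:i + 4] for i in range(0, len(cells), 4)]
```

Pairing each group with the corresponding `<th>` date then gives one complete tuple per row instead of the same cell repeated.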

Related

XPath for items under a certain td

I'm trying to scrape the website (https://na.op.gg/champion/statistics) to get which champions have the highest winrate using xpath, which I was able to do using this:
champion = tree.xpath('//div[@class="champion-index-table__name"]/text()')
but I realized that the names I'm trying to get are in a table that changes in size depending on the current game meta, so I wanted to scrape only the names that fall under specific categories so I won't have problems later on when the number of champions in the table changes. The website separates them under different "tiers" that look like this:
<tbody class="tabItem champion-trend-tier-TOP" style="display: table-row-group;">
<tr>
<td class="champion-index-table__cell champion-index-table__cell--rank">1</td>
<td class="champion-index-table__cell champion-index-table__cell--change champion-index-table__cell--change--stay">
<img src="//opgg-static.akamaized.net/images/site/champion/icon-championtier-stay.png" alt="">
0
</td>
<td class="champion-index-table__cell champion-index-table__cell--image">
<i class="__sprite __spc32 __spc32-32"></i>
</td>
<td class="champion-index-table__cell champion-index-table__cell--champion">
<a href="/champion/garen/statistics/top">
<div class="champion-index-table__name">Garen</div>
<div class="champion-index-table__position">
Top, Middle </div>
</a>
</td>
<td class="champion-index-table__cell champion-index-table__cell--value">53.12%</td>
<td class="champion-index-table__cell champion-index-table__cell--value">16.96%</td>
<td class="champion-index-table__cell champion-index-table__cell--value">
<img src="//opgg-static.akamaized.net/images/site/champion/icon-champtier-1.png" alt="">
</td>
</tr>
<tr>
Then the next one goes to
<tbody class="tabItem champion-trend-tier-JUNGLE" style="display: table-row-group;">
So, I've tried this, but it's not outputting anything but []:
championtop = tree.xpath('//div/table/tbody/tr//td[4][@class="champion-index-table__name"]/text()')
Hopefully my question is making sense.
I was able to accomplish my goal by just doing
champion = tree.xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-TOP"]/tr/td[4]/a/div[1]/text()')
Thanks for reading!
You may go through all the lines of code below; I'm sure you will get your answer. This is how you can target elements with XPath easily and reliably.
Initial setup
from selenium import webdriver
wb = webdriver.Chrome('Path to your chrome webdriver')
wb.get('https://na.op.gg/champion/statistics')
For TOP Tier
y_top = {}
tbody_top = wb.find_elements_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-TOP"]/tr')
for i in range(len(tbody_top)):
    y_top[wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-TOP"]/tr['+str(i+1)+']/td[4]/a/div[1]').text] = wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-TOP"]/tr['+str(i+1)+']/td[5]').text.rstrip('%')
For Jungle
y_jung = {}
tbody_jung = wb.find_elements_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-JUNGLE"]/tr')
for i in range(len(tbody_jung)):
    y_jung[wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-JUNGLE"]/tr['+str(i+1)+']/td[4]/a/div[1]').text] = wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-JUNGLE"]/tr['+str(i+1)+']/td[5]').text.rstrip('%')
For Middle
y_mid = {}
tbody_mid = wb.find_elements_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-MID"]/tr')
for i in range(len(tbody_mid)):
    y_mid[wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-MID"]/tr['+str(i+1)+']/td[4]/a/div[1]').text] = wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-MID"]/tr['+str(i+1)+']/td[5]').text.rstrip('%')
For Bottom
y_bott = {}
tbody_bott = wb.find_elements_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-ADC"]/tr')
for i in range(len(tbody_bott)):
    y_bott[wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-ADC"]/tr['+str(i+1)+']/td[4]/a/div[1]').text] = wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-ADC"]/tr['+str(i+1)+']/td[5]').text.rstrip('%')
For Support
y_sup = {}
tbody_sup = wb.find_elements_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-SUPPORT"]/tr')
for i in range(len(tbody_sup)):
    y_sup[wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-SUPPORT"]/tr['+str(i+1)+']/td[4]/a/div[1]').text] = wb.find_element_by_xpath('//table[@class="champion-index-table tabItems"]/tbody[@class="tabItem champion-trend-tier-SUPPORT"]/tr['+str(i+1)+']/td[5]').text.rstrip('%')
print(max(y_top, key=y_top.get)) # prints the champion with the highest win rate in the TOP tier
print(max(y_jung,key = y_jung.get))
print(max(y_mid,key = y_mid.get))
print(max(y_bott,key = y_bott.get))
print(max(y_sup,key = y_sup.get))
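The five tier blocks above differ only in the tier name and row index, so the long XPath can be built once by a small helper instead of being repeated. The sketch below uses a hypothetical helper `tier_row_xpath` that is not part of the original answer:

```python
# hypothetical helper: build the name and win-rate XPaths for one row of one tier
def tier_row_xpath(tier, row):
    base = ('//table[@class="champion-index-table tabItems"]'
            '/tbody[@class="tabItem champion-trend-tier-{}"]/tr[{}]'.format(tier, row))
    # td[4] holds the champion name link, td[5] the win-rate cell
    return base + '/td[4]/a/div[1]', base + '/td[5]'

name_xp, rate_xp = tier_row_xpath("TOP", 1)
```

The same helper can then be reused for JUNGLE, MID, ADC, and SUPPORT in a single loop over the tier names.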
==========================================
You can target any element using an attribute XPath like below:
1. wb.find_element_by_xpath('//div[@id="the_id_of_div"]/a/span')
2. wb.find_element_by_xpath('//div[@class="class name"]/p/span')
3. wb.find_element_by_xpath('//div[@title="title of element"]/p/span')
4. wb.find_element_by_xpath('//div[@style="style x"]/p/span')
5. wb.find_elements_by_xpath('//*[contains(text(),"the text you want to find")]') # the page may contain multiple elements with the same text, so keep that in mind
6. To find the parent of a found element:
that_found_element.find_element_by_xpath('..') # and you can iterate up to the topmost parent using a loop
7. To find a sibling of an element:
to find the preceding element
wb.find_element_by_xpath('//span[@id="theidx"]/preceding-sibling::input') # this targets an input tag that is a preceding sibling of the span tag with id "theidx"
to find the following element
wb.find_element_by_xpath('//span[@id="theidy"]/following-sibling::input') # this targets an input tag that is a following sibling of the span tag with id "theidy"
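The sibling axes work the same way outside Selenium; here is a small self-contained sketch with lxml and made-up markup (the `theidx` id and input names are assumptions for illustration):

```python
from lxml import html

# hypothetical markup to exercise the sibling axes described above
doc = html.fromstring("""
<div>
  <input name="before"/>
  <span id="theidx">label</span>
  <input name="after"/>
</div>
""")
# input element before the span in document order
prev_input = doc.xpath('//span[@id="theidx"]/preceding-sibling::input')[0]
# input element after the span in document order
next_input = doc.xpath('//span[@id="theidx"]/following-sibling::input')[0]
```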

How to scrape a table from any site and store it in a data frame?

I need to scrape a table from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
and store the data in a Python dataframe.
I have pulled the table but I am unable to pick out the columns (Postcode, Borough, Neighbourhood).
My table looks like this:
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td>North York</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A</td>
<td>North York</td>
<td>Victoria Village
</td></tr>
...
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    columns = row.find_all('td')
    Postcode = row.columns[1].get_text()
    Borough = row.columns[2].get_text()
    Neighbourhood = row.column[3].get_text()
    df.append([Postcode, Borough, Neighbourhood])
With the above code I am getting
TypeError: 'NoneType' object is not subscriptable
I googled it and got to know that I cannot do
Postcode = row.columns[1].get_text()
because of the inline property of the function.
I tried something else too but got an "index error" message.
It should be simple: I need to traverse the rows, picking the three columns for each row and storing them in a list, but I am not able to write it in code.
Expected output is
Postcode Borough Neighbourhood
M1A Not assigned Not assigned
M2A Not assigned Not assigned
M3A North York Parkwoods
The scraping code is wrong in the parts below.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
table = soup.find('table', {'class': 'wikitable sortable'})
df = []
for row in table.find_all('tr'):
    columns = row.find_all('td')  # the first row contains <th> tags, so querying <td> returns an empty list
    if len(columns) > 0:  # skip the first row (and, in general, empty rows)
        # use the indices properly to get the different values
        Postcode = columns[0].get_text()
        Borough = columns[1].get_text()
        Neighbourhood = columns[2].get_text()
        df.append([Postcode, Borough, Neighbourhood])
Then again, be careful: get_text will also return the text of links and anchor tags intact. You might want to change the code to avoid that.
Happy web scraping :)
I don't know pandas, but I use this script to scrape tables. Hope it is helpful.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
tbl = soup.find('table', {'class': 'wikitable sortable'})
table_dict = {
    "head": [th.text.strip() for th in tbl.find_all('th')],
    "rows": [
        [td.text.strip() for td in tr.find_all("td")]
        for tr in tbl.find_all("tr")
        if not tr.find("th")
    ]
}
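A small sketch of one way to consume that head/rows structure, turning each row into a dict keyed by the header names. The sample data is made up to mirror the Wikipedia table:

```python
# stand-ins for table_dict["head"] and table_dict["rows"]
head = ["Postcode", "Borough", "Neighbourhood"]
rows = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
]
# one dict per table row, keyed by the column headers
records = [dict(zip(head, row)) for row in rows]
```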
If you want to scrape a table from the web, you can use the pandas library.
import pandas as pd

url = 'valid_url'
df = pd.read_html(url)  # returns a list of DataFrames, one per table on the page
print(df[0].head())

Wiki scraping using python

I am trying to scrape the data stored in the table of this Wikipedia page https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India).
However, I am unable to scrape the full data.
Here's what I wrote so far:
from bs4 import BeautifulSoup
import urllib2
wiki = "https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India)"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page,"html.parser")
name = ""
pic = ""
strt = ""
end = ""
pri = ""
x=""
table = soup.find("table", {"class": "wikitable"})
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) == 8:
        name = cells[0].find(text=True)
        print name
The output obtained is:
Jairamdas Daulatram, Surjit Singh Barnala, Rao Birendra Singh
Whereas the output should be: Jairamdas Daulatram followed by Panjabrao Deshmukh
Have you read the raw html?
Because some of the cells span several rows (e.g. Political Party), most rows do not have 8 cells in them.
You cannot therefore do if len(cells) == 8 and expect it to work. Think about what this line was meant to achieve. If it was to ignore the header row then you could replace it with if len(cells) > 0 because all the header cells are <th> tags (and therefore will not appear in your list).
Page source (showing your problem):
<tr>
<td>Jairamdas Daulatram</td>
<td></td>
<td>1948</td>
<td>1952</td>
<td rowspan="6">Indian National Congress</td>
<td rowspan="6" bgcolor="#00BFFF" width="4px"></td>
<td rowspan="3">Jawaharlal Nehru</td>
<td><sup id="cite_ref-1" class="reference"><span>[</span>1<span>]</span></sup></td>
</tr>
<tr>
<td>Panjabrao Deshmukh</td>
<td></td>
<td>1952</td>
<td>1962</td>
<td><sup id="cite_ref-2" class="reference"><span>[</span>2<span>]</span></sup></td>
</tr>
As already stated in a previous post, it does not make sense to test for a static length. Just check whether any <td> exists. The code below is written in Python 3, but should work in Python 2.7 as well with some small adjustments.
from bs4 import BeautifulSoup
from urllib.request import urlopen
wiki = urlopen("https://en.wikipedia.org/wiki/Minister_of_Agriculture_(India)")
soup = BeautifulSoup(wiki, "html.parser")
table = soup.find("table", { "class" : "wikitable" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if cells:
        name = cells[0].find(text=True)
        print(name)
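Checking for non-empty cells prints every name, but the rowspan cells (party, prime minister) will still be missing from most rows. A sketch of one way to expand them, with a hypothetical helper `expand_rowspans` that is not part of the original answer; the two-row sample mirrors the page source shown above:

```python
from bs4 import BeautifulSoup

# minimal sample: the party cell spans two rows, like the Wikipedia table
html = """
<table>
<tr><td>Jairamdas Daulatram</td><td rowspan="2">Indian National Congress</td></tr>
<tr><td>Panjabrao Deshmukh</td></tr>
</table>
"""

def expand_rowspans(table):
    carry = {}  # column index -> (rows_remaining, text)
    out = []
    for tr in table.find_all("tr"):
        cells = iter(tr.find_all("td"))
        row, col = [], 0
        while True:
            if col in carry:  # a cell from an earlier row spans into this one
                left, text = carry[col]
                row.append(text)
                if left - 1 == 0:
                    del carry[col]
                else:
                    carry[col] = (left - 1, text)
                col += 1
                continue
            td = next(cells, None)
            if td is None:
                break
            span = int(td.get("rowspan", 1))
            text = td.get_text(strip=True)
            if span > 1:  # remember this value for the next span-1 rows
                carry[col] = (span - 1, text)
            row.append(text)
            col += 1
        out.append(row)
    return out

soup = BeautifulSoup(html, "html.parser")
rows = expand_rowspans(soup.find("table"))
```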

python parsing beautiful soup data to csv

I have written code in Python 3 to parse an HTML/CSS table. I have a few issues with it:
1. my CSV output file headers are not generated based on the HTML (tag: td, class: t1) by my code (on the first run, when the output file is being created)
2. if the incoming HTML table has a few additional fields (tag: td, class: t1), my code cannot currently capture them and create additional headers in the CSV output file
3. the data is not written to the output CSV file until ALL the ids (A001, A002, A003...) from my input file are processed. I want to write to the output CSV file as the processing of each id completes (i.e. A001 should be written to CSV before processing A002)
4. whenever I rerun the code, the data does not begin on the next line in the output CSV
Being a noob, I am sure my code is very rudimentary and there will be a better way to do this; I would like to learn to write this better and fix the above as well.
Need advice and guidance, please help. Thank you.
My Code:
import csv
import requests
from bs4 import BeautifulSoup
## SIDs.csv contains ids in col2 based on which the 'url' variable pulls the respective data
SIDFile = open('SIDs.csv')
SIDReader = csv.reader(SIDFile)
SID = list(SIDReader)
SqID_data = []

# create and open output file
with open('output.csv', 'a', newline='') as csv_h:
    fields = [
        "ID",
        "Financial Year",
        "Total Income",
        "Total Expenses",
        "Tax Expense",
        "Net Profit"
    ]
    for row in SID:
        col1, col2 = row
        SID = "%s" % (col2)
        url = requests.get("http://.......")
        soup = BeautifulSoup(url.text, "lxml")
        fy = soup.findAll('td', {'class': 'tablehead'})
        titles = soup.findAll('td', {'class': 't1'})
        values = soup.findAll('td', {'class': 't0'})
        if titles:
            data = {}
            for title in titles:
                name = title.find("td", class_="t1")
                data["ID"] = SID
                data["Financial Year"] = fy[0].string.strip()
                data["Total Income"] = values[0].string.strip()
                data["Total Expenses"] = values[1].string.strip()
                data["Tax Expense"] = values[2].string.strip()
                data["Net Profit"] = values[3].string.strip()
            SqID_data.append(data)
            # Prepare CSV writer.
            writer = csv.DictWriter(
                csv_h,
                fields,
                quoting=csv.QUOTE_ALL,
                extrasaction="ignore",
                dialect="excel",
                lineterminator="\n",
            )
            writer.writeheader()
            writer.writerows(SqID_data)
print("write rows complete")
Excerpt of HTML being processed:
<p>
<TABLE border=0 cellspacing=1 cellpadding=6 align=center class="vTable">
<TR>
<TD class=tablehead>Financial Year</TD>
<TD class=t1>01-Apr-2015 To 31-Mar-2016</TD>
</TR>
</TABLE>
</p>
<p>
<br>
<table cellpadding=3 cellspacing=1 class=vTable>
<TR>
<TD class=t1><b>Total income from operations (net) ( a + b)</b></td>
<TD class=t0 nowrap>675529.00</td>
</tr>
<TR>
<TD class=t1><b>Total expenses</b></td>
<TD class=t0 nowrap>446577.00</td>
</tr>
<TR>
<TD class=t1>Tax expense</td>
<TD class=t0 nowrap>71708.00</td>
</tr>
<TR>
<TD class=t1><b>Net Profit / (Loss)</b></td>
<TD class=t0 nowrap>157621</td>
</tr>
</table>
</p>
SIDs.csv (no header row)
1,A0001
2,A0002
3,A0003
Expected Output: output.csv (create header row)
ID,Financial Year,Total Income,Total Expenses,Tax Expense,Net Profit,OtherFieldsAsAndWhenFound
A001,01-Apr-2015 To 31-Mar-2016,675529.00,446577.00,71708.00,157621.00
A002,....
A003,....
I would recommend looking at pandas.read_html for parsing your web data; on your sample data this gives you:
import pandas as pd
tables = pd.read_html(s, index_col=0)
tables[0]
Out[11]:
1
0
Financial Year 01-Apr-2015 To 31-Mar-2016
tables[1]
1
0
Total income from operations (net) ( a + b) 675529
Total expenses 446577
Tax expense 71708
Net Profit / (Loss) 157621
You can then do whatever data manipulations you need (adding IDs, etc.) using pandas functions, and then export with DataFrame.to_csv.
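Sticking with the csv module from the question, here is a sketch of two of the fixes: write the header only once (when the file is new), and write each ID's row as soon as it is ready instead of after the whole loop. An in-memory buffer stands in for the real output file, and the row values are made up from the sample HTML:

```python
import csv
import io

fields = ["ID", "Financial Year", "Total Income",
          "Total Expenses", "Tax Expense", "Net Profit"]

buf = io.StringIO()  # stands in for open('output.csv', 'a', newline='')
writer = csv.DictWriter(buf, fields, extrasaction="ignore")
if buf.tell() == 0:       # file is empty: first run, so write the header once
    writer.writeheader()
for sid in ["A0001", "A0002"]:   # stands in for the loop over SIDs.csv
    row = {"ID": sid, "Financial Year": "01-Apr-2015 To 31-Mar-2016",
           "Total Income": "675529.00", "Total Expenses": "446577.00",
           "Tax Expense": "71708.00", "Net Profit": "157621"}
    writer.writerow(row)  # written per-ID, not accumulated until the end
out = buf.getvalue()
```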

Scraping html page

I want to get the movie title, year, rating, genres, and run time of five movies from the HTML page given in the code. These are in the rows of a table called results.
from bs4 import BeautifulSoup
import urllib2

def read_from_url(url, num_m=5):
    html_string = urllib2.urlopen(url)
    soup = BeautifulSoup(html_string)
    movie_table = soup.find('table', 'results')  # table of movies
    list_movies = []
    count = 0
    for row in movie_table.find_all("tr"):
        dict_each_movie = {}
        title = title.encode("ascii", "ignore")  # getting title
        dict_each_movie["title"] = title
        year = year.encode("ascii", "ignore")  # getting year
        dict_each_movie["year"] = year
        rank = rank.encode("ascii", "ignore")  # getting rank
        dict_each_movie["rank"] = rank
        # genres = []  # getting genres of a movie
        runtime = runtime.encode("ascii", "ignore")  # getting runtime
        dict_each_movie["runtime"] = runtime
        list_movies.append(dict_each_movie)
        count += 1
        if count == num_of_m:
            break
    return list_movies
print read_from_url('http://www.imdb.com/search/title?at=0&sort=user_rating&start=1&title_type=feature&year=2005,2015',2)
Expected output:
[{'rating': '10.0', 'genres': ['Comedy', 'Family'], 'title': 'How to Beat a Bully', 'rank': '1', 'year': '2014', 'runtime': '90'},..........]
You're accessing a variable that hasn't been declared. When the interpreter sees title.encode("ascii", "ignore"), it looks for the variable title, which hasn't been declared previously. Python can't possibly know what title is, so you can't call encode on it. The same goes for year and rank. Instead use:
title = 'How to Beat a Bully'.encode('ascii', 'ignore')
Why so???
Make your life easier with CSS Selectors.
<table>
<tr class="my_class">
<td id="id_here">
<a href="link_here">First Link</a>
</td>
<td id="id_here">
<a href="link_here">Second Link</a>
</td>
</tr>
</table>
for tr in movie_table.select("tr.my_class"):
    for td in tr.select("td#id_here"):
        print("Link " + td.select("a")[0]["href"])
        print("Text " + td.select("a")[0].text)
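A self-contained sketch of the same selector pattern with BeautifulSoup; the ids here are made distinct (HTML ids are supposed to be unique), and the links/text are placeholder values:

```python
from bs4 import BeautifulSoup

# hypothetical markup mirroring the example table above
html = """
<table>
  <tr class="my_class">
    <td id="id_one"><a href="link_one">First Link</a></td>
    <td id="id_two"><a href="link_two">Second Link</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
# CSS selectors: tag.class for the row, then any anchor inside its cells
links = [(a["href"], a.get_text())
         for tr in soup.select("tr.my_class")
         for a in tr.select("td a")]
```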
