Web Scraping with BS4 - Can you sort this out? - python

Can you fix this code for me? It gives me an error message like:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Can anyone please help me with this? Below is the code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url="https://www.cse.lk/pages/trade-summary/trade-summary.component.html"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
cse = pd.DataFrame(columns=["Company Name", "Symbol", "Share Volume", "Trade Volume", "Previous Close (Rs.)", "Open (Rs.)", "High (Rs.)", "Low (Rs.)", "**Last Traded Price (Rs.)", "Change (Rs.)", "Change Percentage (%)"])
for row in soup.find_all('tbody').find_all('tr'): ##for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    Company_Name = col[0].text
    Symbol = col[1].text
    Share_Volume = col[2].text
    Trade_Volume = col[3].text
    Previous_Close = col[4].text
    Open = col[5].text
    High = col[6].text
    Low = col[7].text
    Last_Traded_Price = col[8].text
    Change = col[9].text
    Change_Percentage = col[10].text
    cse = cse.append({"Company Name":Company_Name,"Symbol":Symbol,"Share Volume":Share_Volume,"Trade Volume":Trade_Volume,"Previous Close (Rs.)":Previous_Close,"Open (Rs.)":Open,"High (Rs.)":High,"Low (Rs.)":Low,"**Last Traded Price (Rs.)":Last_Traded_Price,"Change (Rs.)":Change,"Change Percentage (%)":Change_Percentage}, ignore_index=True)

The data is loaded from an external URL via JavaScript, so BeautifulSoup doesn't see it. You can use this example to load it directly from the API:
import requests
import pandas as pd
url = "https://www.cse.lk/api/tradeSummary"
data = requests.post(url).json()
df = pd.DataFrame(data["reqTradeSummery"])
print(df)
df.to_csv("data.csv", index=None)
Prints:
id name symbol quantity percentageChange change price previousClose high low lastTradedTime issueDate turnover sharevolume tradevolume marketCap marketCapPercentage open closingPrice crossingVolume crossingTradeVol status
0 204 ABANS ELECTRICALS PLC ABAN.N0000 317 4.184704 7.25 180.50 173.25 183.00 172.00 1626944252441 01/JAN/1984 1.256363e+06 7012 44 9.224561e+08 0.0 179.00 180.50 7012 44 0
1 1845 ABANS FINANCE PLC AFSL.N0000 89 -3.225806 -1.00 30.00 31.00 30.10 30.00 1626944124197 27/JUN/2011 1.160916e+06 38652 11 1.996847e+09 0.0 30.10 30.00 38652 11 3
2 2065 ACCESS ENGINEERING PLC AEL.N0000 500 -0.432900 -0.10 23.00 23.10 23.40 22.90 1626944388726 27/MAR/2012 1.968675e+07 855534 264 2.300000e+10 0.0 23.10 23.00 855534 264 0
3 472 ACL CABLES PLC ACL.N0000 1000 -0.963855 -0.40 41.10 41.50 41.70 40.90 1626944397450 01/JAN/1976 3.037800e+07 738027 421 9.846521e+09 0.0 41.50 41.10 738027 421 0
4 406 ACL PLASTICS PLC APLA.N0000 20 0.842697 2.25 269.25 267.00 272.75 266.00 1626943847820 05/APR/1995 1.436916e+06 5333 26 1.134216e+09 0.0 272.75 269.25 5333 26 0
...
and saves data.csv.
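As the quoted error message hints, `find_all()` returns a `ResultSet` (a list of tags), which has no `find_all()` method of its own; `find()` returns a single tag you can keep searching within. A minimal sketch on inline HTML:

```python
from bs4 import BeautifulSoup

html = "<table><tbody><tr><td>A</td></tr><tr><td>B</td></tr></tbody></table>"
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag, so chaining find_all() works
rows = soup.find("tbody").find_all("tr")
print([tr.td.text for tr in rows])  # ['A', 'B']

# find_all() returns a ResultSet (a list of tags); chaining raises AttributeError
try:
    soup.find_all("tbody").find_all("tr")
except AttributeError as err:
    print(err)
```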

Related

Trying to scrape a specific table but getting no results

I tried three different techniques to scrape a table named 'table-light', but nothing is actually working for me. The code below shows my attempts to extract the data.
import pandas as pd
tables = pd.read_html('https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap')
tables
############################################################################
import requests
import pandas as pd
url = 'https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[10]
print(df)
############################################################################
import requests
from bs4 import BeautifulSoup
url = "https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all('table-light')
print(table)
The table that I am trying to extract data from is named 'table-light'. I want to get all the columns and all 144 rows. How can I do that?
You can try setting the User-Agent header to get the correct HTML (and not a captcha page):
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml") # <-- don't use html.parser here
table = soup.select_one(".table-light")
for td in table.tr.select('td'):
    td.name = 'th'
df = pd.read_html(str(table))[0]
print(df.head())
Prints:
No. Name Market Cap P/E Fwd P/E PEG P/S P/B P/C P/FCF EPS past 5Y EPS next 5Y Sales past 5Y Change Volume
0 1 Real Estate - Development 3.14B 3.21 21.12 0.24 0.60 0.52 2.28 17.11 43.30% 13.42% 13.69% 1.43% 715.95K
1 2 Textile Manufacturing 3.42B 32.58 25.04 - 1.43 2.58 9.88 90.16 15.31% -0.49% 3.54% 1.83% 212.71K
2 3 Coking Coal 5.31B 2.50 4.93 0.37 0.64 1.53 4.20 2.54 38.39% 6.67% 22.92% 5.43% 1.92M
3 4 Real Estate - Diversified 6.71B 17.38 278.89 0.87 2.78 1.51 15.09 91.64 0.48% 19.91% 11.97% 3.31% 461.33K
4 5 Other Precious Metals & Mining 8.10B 24.91 29.07 2.71 6.52 1.06 14.47 97.98 16.30% 9.19% 20.71% 0.23% 4.77M
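The td-to-th renaming trick works because `pandas.read_html` treats `<th>` cells as the header row. A minimal sketch on made-up inline HTML whose header cells are `<td>`, as on the finviz page:

```python
import pandas as pd
from bs4 import BeautifulSoup

# a tiny table whose header row uses <td> instead of <th>
html = ("<table><tr><td>Name</td><td>Market Cap</td></tr>"
        "<tr><td>Coking Coal</td><td>5.31B</td></tr></table>")
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# rename the first row's cells so read_html uses them as the header
for td in table.tr.select("td"):
    td.name = "th"

df = pd.read_html(str(table))[0]
print(df)
```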

Cannot scrape some table using Pandas

I'm more than a noob in Python. I'm trying to get some tables from this page:
https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html
Using pandas and pd.read_html I'm able to get most of them, but not the "Line Score" and the "Four Factors". If I print all the tables (there are 19), these two are missing. Inspecting with Chrome, they seem to be tables, and I can also get them in Excel by importing from the web.
What am I missing here?
Any help appreciated, thanks!
If you look at the page source (not by inspecting), you'd see those tables are inside HTML comments. You can either a) edit the HTML string and remove the <!-- and --> markers, then let pandas parse it, or b) use bs4 to pull out the comments, then parse the tables that way.
I'll show you both options:
Option 1: Remove the comment tags from the page source
import requests
import pandas as pd
url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
response = requests.get(url).text.replace("<!--","").replace("-->","")
dfs = pd.read_html(response, header=1)
Output:
You can see you now have 21 tables, with the 4th and 5th tables the ones in question.
print(len(dfs))
for each in dfs[3:5]:
    print('\n\n', each, '\n')
21
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
Option 2: Pull out comments with bs4
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
comments = data.find_all(string=lambda text: isinstance(text, Comment))
other_tables = []
for each in comments:
    if '<table' in str(each):
        try:
            other_tables.append(pd.read_html(str(each), header=1)[0])
        except ValueError:
            continue
Output:
for each in other_tables:
    print(each, '\n')
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
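The comment-extraction approach can be seen on a self-contained snippet: bs4 exposes comment nodes as `Comment` strings, and `pd.read_html` can parse the table markup found inside them.

```python
import pandas as pd
from bs4 import BeautifulSoup, Comment

# a table hidden inside an HTML comment, as on basketball-reference
html = """<div><!-- <table><tr><th>Team</th><th>T</th></tr>
<tr><td>MIN</td><td>77</td></tr><tr><td>SEA</td><td>89</td></tr></table> --></div>"""
soup = BeautifulSoup(html, "html.parser")

comments = soup.find_all(string=lambda text: isinstance(text, Comment))
tables = [pd.read_html(str(c))[0] for c in comments if "<table" in str(c)]
print(tables[0])
```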

How can I return multiple rows instead of just one using Python?

I'm attempting to pull data from finviz, and I'm only able to pull one row at a time.
Here's my code:
url = ('https://finviz.com/quote.ashx?t=' + ticker.upper())
r = Request(url, headers = header)
html = urlopen(r).read()
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')
rows = rows[13:20]
for row in rows:
    row_td = row.find_all('td')  # <------------ I believe the issue is with this section?
    #print(row_td)
    str_cells = str(row_td)
    clean = BeautifulSoup(str_cells, "lxml").get_text()
    print(clean)
Only prints:
[Dividend %, 2.97%, Quick Ratio, 1.30, Sales past 5Y, -5.70%, Gross Margin, 60.60%, 52W Low, 20.59%, ATR, 0.64] - even though I specify rows[13:30]
I'd like to print out all of the rows from the table on the page.
here is a screenshot of the table
You can do this easily using pandas alone.
Here is the full working output.
CODE:
import requests
import pandas as pd
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
url = "https://finviz.com/quote.ashx?t=KO"
req = requests.get(url,headers=headers)
wiki_table = pd.read_html(req.text, attrs = {"class":"snapshot-table2"} )
df = wiki_table[0]
print(df)
OUTPUT:
0 1 2 3 ... 8 9 10 11
0 Index DJIA S&P500 P/E 30.35 ... Shs Outstand 4.31B Perf Week -1.03%
1 Market Cap 245.44B Forward P/E 23.26 ... Shs Float 4.29B Perf Month 0.30%
2 Income 8.08B PEG 3.00 ... Short Float 0.75% Perf Quarter 3.70%
3 Sales 36.41B P/S 6.74 ... Short Ratio 2.34 Perf Half Y 11.87%
4 Book/sh 5.16 P/B 10.98 ... Target Price 62.06 Perf Year 19.62%
5 Cash/sh 3.01 P/C 18.82 ... 52W Range 46.97 - 57.56 Perf YTD 3.28%
6 Dividend 1.68 P/FCF 95.02 ... 52W High -1.60% Beta 0.63
7 Dividend % 2.97% Quick Ratio 1.30 ... 52W Low 20.59% ATR 0.64
8 Employees 80300 Current Ratio 1.50 ... RSI (14) 51.63 Volatility 1.17% 0.94%
9 Optionable Yes Debt/Eq 1.89 ... Rel Volume 0.76 Prev Close 56.86
10 Shortable Yes LT Debt/Eq 1.79 ... Avg Volume 13.67M Price 56.64
11 Recom 2.20 SMA20 -0.42% ... Volume 10340772 Change -0.39%
[12 rows x 12 columns]
In the for-loop, you're overwriting the variable row_td on every iteration. Store each row's content in a list (in my example, I use the list all_data to store all rows).
To print all rows from the table, you can use the following example:
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}
url = "https://finviz.com/quote.ashx?t=KO"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml")
all_data = []
for tr in soup.select(".snapshot-table2 tr"):
    tds = [td.get_text(strip=True) for td in tr.select("td")]
    all_data.append(tds)
fmt_string = "{:<15}" * 12
for row in all_data:
    print(fmt_string.format(*row))
Prints:
Index DJIA S&P500 P/E 30.35 EPS (ttm) 1.87 Insider Own 0.30% Shs Outstand 4.31B Perf Week -1.03%
Market Cap 245.44B Forward P/E 23.26 EPS next Y 2.44 Insider Trans -2.65% Shs Float 4.29B Perf Month 0.30%
Income 8.08B PEG 3.00 EPS next Q 0.58 Inst Own 69.00% Short Float 0.75% Perf Quarter 3.70%
Sales 36.41B P/S 6.74 EPS this Y -13.30% Inst Trans 0.55% Short Ratio 2.34 Perf Half Y 11.87%
Book/sh 5.16 P/B 10.98 EPS next Y 7.84% ROA 8.90% Target Price 62.06 Perf Year 19.62%
Cash/sh 3.01 P/C 18.82 EPS next 5Y 10.12% ROE 40.10% 52W Range 46.97 - 57.56 Perf YTD 3.28%
Dividend 1.68 P/FCF 95.02 EPS past 5Y 1.40% ROI 12.20% 52W High -1.60% Beta 0.63
Dividend % 2.97% Quick Ratio 1.30 Sales past 5Y -5.70% Gross Margin 60.60% 52W Low 20.59% ATR 0.64
Employees 80300 Current Ratio 1.50 Sales Q/Q 41.70% Oper. Margin 25.70% RSI (14) 51.63 Volatility 1.17% 0.94%
Optionable Yes Debt/Eq 1.89 EPS Q/Q 47.70% Profit Margin 22.20% Rel Volume 0.76 Prev Close 56.86
Shortable Yes LT Debt/Eq 1.79 Earnings Jul 21 BMO Payout 87.90% Avg Volume 13.67M Price 56.64
Recom 2.20 SMA20 -0.42% SMA50 1.65% SMA200 6.56% Volume 10,340,772 Change -0.39%
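The accumulate-into-a-list pattern from the answer above can be seen on inline HTML: each row's cell texts become their own list, instead of a single variable being overwritten each pass.

```python
from bs4 import BeautifulSoup

# a tiny stand-in for the finviz snapshot table
html = ("<table class='snapshot-table2'>"
        "<tr><td> Index </td><td>DJIA</td></tr>"
        "<tr><td>P/E</td><td> 30.35 </td></tr></table>")
soup = BeautifulSoup(html, "html.parser")

all_data = []
for tr in soup.select(".snapshot-table2 tr"):
    # get_text(strip=True) trims the stray whitespace around cell text
    all_data.append([td.get_text(strip=True) for td in tr.select("td")])
print(all_data)  # [['Index', 'DJIA'], ['P/E', '30.35']]
```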

Python Beautiful Soup Webscraping: Cannot get a full table to display

I am relatively new to Python and this is my first web scrape. I am trying to scrape a table and can only get the first column to show up. I am using the find method instead of find_all, which I am pretty sure is what is causing this, but when I use the find_all method I cannot get any text to display. Here is the URL I am scraping from: https://www.fangraphs.com/teams/mariners/stats
I am trying to get the top table (Batting Stat Leaders) to work. My code is below:
from bs4 import BeautifulSoup
import requests
import time
htmlText = requests.get('https://www.fangraphs.com/teams/mariners/stats').text
soup = BeautifulSoup(htmlText, 'lxml')
playerTable = soup.find('div', class_='team-stats-table')
input = input("Would you like to see Batting, Starting Pitching, Relief Pitching, or Fielding Stats? \n")
def BattingStats():
    print("BATTING STATS:")
    print("Player Name: ")
    for tr in playerTable.find_all('tr')[1:55]:
        tds = tr.find('td').text
        print(tds)
if input == "Batting" or "batting":
    BattingStats()
You can use list-comprehension to get text from all rows:
import requests
from bs4 import BeautifulSoup

# fetch the page and build the soup (this step was missing from the snippet)
htmlText = requests.get("https://www.fangraphs.com/teams/mariners/stats").text
soup = BeautifulSoup(htmlText, "lxml")
playerTable = soup.find("div", class_="team-stats-table")

def BattingStats():
    print("BATTING STATS:")
    print("Player Name: ")
    for tr in playerTable.find_all("tr")[1:55]:
        tds = [td.text for td in tr.select("td")]
        print(tds)

BattingStats()
Prints:
BATTING STATS:
Player Name:
Mitch Haniger 30 94 406 25 0 6.7% 23.4% .257 .291 .268 .323 .524 .358 133 0.2 16.4 -6.5 2.4
Ty France 26 89 372 9 0 7.3% 16.9% .150 .314 .276 .355 .426 .341 121 0.0 9.5 -2.6 2.0
Kyle Seager 33 97 403 18 2 8.4% 25.8% .201 .246 .215 .285 .416 .302 95 -0.3 -2.9 5.4 1.6
...
Solution with pandas:
import pandas as pd
url = "https://www.fangraphs.com/teams/mariners/stats"
df = pd.read_html(url)[7]
print(df)
Prints:
Name Age G PA HR SB BB% K% ISO BABIP AVG OBP SLG wOBA wRC+ BsR Off Def WAR
0 Mitch Haniger 30 94 406 25 0 6.7% 23.4% 0.257 0.291 0.268 0.323 0.524 0.358 133.0 0.2 16.4 -6.5 2.4
1 Ty France 26 89 372 9 0 7.3% 16.9% 0.150 0.314 0.276 0.355 0.426 0.341 121.0 0.0 9.5 -2.6 2.0
2 Kyle Seager 33 97 403 18 2 8.4% 25.8% 0.201 0.246 0.215 0.285 0.416 0.302 95.0 -0.3 -2.9 5.4 1.6
...
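Hard-coding the table index `[7]` is brittle if the page layout changes. `pd.read_html` also accepts a `match` argument that keeps only tables whose text matches a pattern; a sketch on made-up HTML, assuming a string like "wOBA" appears only in the batting table:

```python
import pandas as pd

# two tables; only the second contains the marker text "wOBA"
html = """
<table><tr><th>Other</th></tr><tr><td>x</td></tr></table>
<table><tr><th>Name</th><th>wOBA</th></tr>
<tr><td>Mitch Haniger</td><td>.358</td></tr></table>
"""
# match filters out every table that does not contain the given text
dfs = pd.read_html(html, match="wOBA")
print(len(dfs))  # 1
print(dfs[0])
```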

Multiple table header <thead> in table <table> and how to scrape data from <thead> as a table row

I'm trying to scrape data from a website, but the table has two sets of data: the first 2-3 rows are in thead and the rest are in tbody. I can easily extract data from only one at a time; when I try both, I get errors like TypeError and AttributeError. BTW, I'm using Python.
Here is the code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url="https://www.worldometers.info/world-population/"
r=requests.get(url)
print(r)
html=r.text
soup=BeautifulSoup(html,'html.parser')
print(soup.title.text)
print()
print()
live_data=soup.find_all('div',id='maincounter-wrap')
print(live_data)
for i in live_data:
    print(i.text)
table_body=soup.find('thead')
table_rows=table_body.find_all('tr')
table_body_2=soup.find('tbody')
table_rows_2=soup.find_all('tr')
year_july1=[]
population=[]
yearly_change_in_perchantage=[]
yearly_change=[]
median_age=[]
fertillity_rate=[]
density=[]#density (p\km**)
urban_population_in_perchantage=[]
urban_population=[]
for tr in table_rows:
    td=tr.find_all('td')
    year_july1.append(td[0].text)
    population.append(td[1].text)
    yearly_change_in_perchantage.append(td[2].text)
    yearly_change.append(td[3].text)
    median_age.append(td[4].text)
    fertillity_rate.append(td[5].text)
    density.append(td[6].text)
    urban_population_in_perchantage.append(td[7].text)
    urban_population.append(td[8].text)
for tr in table_rows_2:
    td=tr.find_all('td')
    year_july1.append(td[0].text)
    population.append(td[1].text)
    yearly_change_in_perchantage.append(td[2].text)
    yearly_change.append(td[3].text)
    median_age.append(td[4].text)
    fertillity_rate.append(td[5].text)
    density.append(td[6].text)
    urban_population_in_perchantage.append(td[7].text)
    urban_population.append(td[8].text)
headers=['year_july1','population','yearly_change_in_perchantage','yearly_change','median_age','fertillity_rate','density','urban_population_in_perchantage','urban_population']
data_2= pd.DataFrame(list(zip(year_july1,population,yearly_change_in_perchantage,yearly_change,median_age,fertillity_rate,density,urban_population_in_perchantage,urban_population)),columns=headers)
print(data_2)
data_2.to_csv("C:\\Users\\data_2.csv")
You can try the code below; it generates the required data. Do let me know if you need any clarification:
import requests
import pandas as pd
url = 'https://www.worldometers.info/world-population/'
html = requests.get(url).content
df_list = pd.read_html(html, header=0)
df = df_list[0]
#print(df)
df.to_csv("data.csv", index=False)
This gives me the output below:
print(df)
Year (July 1) Population ... Urban Pop % Urban Population
0 2020 7794798739 ... 56.2 % 4378993944
1 2019 7713468100 ... 55.7 % 4299438618
2 2018 7631091040 ... 55.3 % 4219817318
3 2017 7547858925 ... 54.9 % 4140188594
4 2016 7464022049 ... 54.4 % 4060652683
5 2015 7379797139 ... 54.0 % 3981497663
6 2010 6956823603 ... 51.7 % 3594868146
7 2005 6541907027 ... 49.2 % 3215905863
8 2000 6143493823 ... 46.7 % 2868307513
9 1995 5744212979 ... 44.8 % 2575505235
10 1990 5327231061 ... 43.0 % 2290228096
11 1985 4870921740 ... 41.2 % 2007939063
12 1980 4458003514 ... 39.3 % 1754201029
13 1975 4079480606 ... 37.7 % 1538624994
14 1970 3700437046 ... 36.6 % 1354215496
15 1965 3339583597 ... N.A. N.A.
16 1960 3034949748 ... 33.7 % 1023845517
17 1955 2773019936 ... N.A. N.A.
[18 rows x 9 columns]
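One likely cause of the original TypeError/AttributeError: the header row inside `<thead>` uses `<th>` cells, so `tr.find_all('td')` comes back empty and `td[0]` fails, while the data rows use `<td>`. If you want to stay with bs4 instead of `read_html`, searching for both cell types handles rows from either section; a minimal sketch on inline HTML mimicking the worldometers layout:

```python
from bs4 import BeautifulSoup

# a header row plus a data row inside <thead>, more data inside <tbody>
html = """<table>
<thead><tr><th>Year</th><th>Population</th></tr>
<tr><td>2020</td><td>7794798739</td></tr></thead>
<tbody><tr><td>2019</td><td>7713468100</td></tr></tbody>
</table>"""
soup = BeautifulSoup(html, "html.parser")

# find_all(['th', 'td']) picks up cells of either type
rows = [[c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")]
print(rows)
```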
