How to scrape a table in a given website? - python

I am trying to extract the value of the "Fee" field for each row on the website https://remittanceprices.worldbank.org/en/corridor/United-States/Mexico. I came really close, but I am not able to extract the values for the "Fee" column. The code is shown below:
import requests
from bs4 import BeautifulSoup
url = 'https://remittanceprices.worldbank.org/en/corridor/Australia/China'
r = requests.get(url,verify=False)
soup = BeautifulSoup(r.text,'lxml')
rows = [i.get_text("|").split("|") for i in soup.select('#tab-1 .corridor-row')]
for row in rows:
    #a,b,c,d,e = row[2],row[15],row[18],row[21],row[25]
    #print(a,b,c,d,e,sep='|')
    print('{0[2]}|{0[15]}|{0[18]}|{0[21]}|{0[25]}'.format(row))

def main():
    import requests
    from bs4 import BeautifulSoup
    url = "https://remittanceprices.worldbank.org/en/corridor/United-States/Mexico"
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    for cell in soup.select("div.views-field-field-cc1-fee > div > span:first-of-type"):
        print(cell.text.strip())
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
0.00
0.00
2.99
2.99
3.00
2.99
4.00
5.00
3.99
3.99
3.99
3.99
5.00
5.00
5.00
3.99
3.99
8.00
8.50
10.00
3.99
7.00
10.00
4.95
10.00
9.99
8.00
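The selector used in the answer can be tried offline. The sketch below runs it against a small hand-written fragment; the markup is an assumption modelled only on the class name in the selector (the real page may nest its elements differently), and extract_fees is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not copied from the real page).
SAMPLE_HTML = """
<div class="views-field-field-cc1-fee"><div><span> 2.99 </span><span>USD</span></div></div>
<div class="views-field-field-cc1-fee"><div><span>5.00</span><span>USD</span></div></div>
"""

def extract_fees(html):
    """Return the text of the first <span> inside each fee cell."""
    soup = BeautifulSoup(html, "html.parser")
    # "> div > span:first-of-type" keeps only the first span of the inner div,
    # so the trailing currency span is skipped.
    cells = soup.select("div.views-field-field-cc1-fee > div > span:first-of-type")
    return [cell.text.strip() for cell in cells]

print(extract_fees(SAMPLE_HTML))
```

If the selector returns nothing on the live page, checking the class names in the browser's inspector is the first thing to try, since they can change.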

Related

Trying to scrape a specific table but getting no results

I tried three different techniques to scrape a table named 'table-light', but nothing is actually working for me. The code below shows my attempts to extract the data.
import pandas as pd
tables = pd.read_html('https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap')
tables
############################################################################
import requests
import pandas as pd
url = 'https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[10]
print(df)
############################################################################
import requests
from bs4 import BeautifulSoup
url = "https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all('table-light')
print(table)
The table that I am trying to extract data from is named 'table-light'. I want to get all the columns and all 144 rows. How can I do that?
You can try setting a User-Agent header so that the server returns the real HTML (and not a captcha page):
from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml") # <-- don't use html.parser here
table = soup.select_one(".table-light")
# turn the first row's cells into header cells so pandas uses them as column names
for td in table.tr.select('td'):
    td.name = 'th'
df = pd.read_html(StringIO(str(table)))[0]  # recent pandas versions expect a file-like object here
print(df.head())
Prints:
   No.                            Name  Market Cap    P/E  Fwd P/E   PEG   P/S   P/B    P/C  P/FCF  EPS past 5Y  EPS next 5Y  Sales past 5Y  Change   Volume
0    1       Real Estate - Development       3.14B   3.21    21.12  0.24  0.60  0.52   2.28  17.11       43.30%       13.42%         13.69%   1.43%  715.95K
1    2           Textile Manufacturing       3.42B  32.58    25.04     -  1.43  2.58   9.88  90.16       15.31%       -0.49%          3.54%   1.83%  212.71K
2    3                     Coking Coal       5.31B   2.50     4.93  0.37  0.64  1.53   4.20   2.54       38.39%        6.67%         22.92%   5.43%    1.92M
3    4       Real Estate - Diversified       6.71B  17.38   278.89  0.87  2.78  1.51  15.09  91.64        0.48%       19.91%         11.97%   3.31%  461.33K
4    5  Other Precious Metals & Mining       8.10B  24.91    29.07  2.71  6.52  1.06  14.47  97.98       16.30%        9.19%         20.71%   0.23%    4.77M
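Two details from this exchange can be checked without touching the network: find_all('table-light') searches for a tag named table-light, whereas 'table-light' here is a CSS class, and renaming the first row's td cells to th turns them into header cells. A minimal sketch on hand-written markup follows (the sample table and the parse_table helper are assumptions for illustration; html.parser is enough for this tidy fragment even though the answer advises lxml for the real page):

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption); values borrowed from the output above.
SAMPLE_HTML = """
<table class="table-light">
  <tr><td>No.</td><td>Name</td><td>Market Cap</td></tr>
  <tr><td>2</td><td>Textile Manufacturing</td><td>3.42B</td></tr>
  <tr><td>3</td><td>Coking Coal</td><td>5.31B</td></tr>
</table>
"""

def parse_table(html):
    soup = BeautifulSoup(html, "html.parser")
    # Looks for <table-light> tags, which do not exist, so this is empty.
    by_tag_name = soup.find_all("table-light")
    # A CSS class selector finds the table.
    table = soup.select_one(".table-light")
    # Promote the first row's cells to header cells, as in the answer.
    for td in table.tr.select("td"):
        td.name = "th"
    headers = [th.get_text(strip=True) for th in table.tr.find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]
    ]
    return by_tag_name, headers, rows

by_tag_name, headers, rows = parse_table(SAMPLE_HTML)
print(by_tag_name)
print(headers)
print(rows)
```

soup.find_all("table", class_="table-light") would be the find_all equivalent of the class selector.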

Web Scraping with BS4 - Can you sort this out?

Can you fix this code for me? It raises an error message like this:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Can anyone please help me on this? Below is the code
import pandas as pd
import requests
from bs4 import BeautifulSoup
url="https://www.cse.lk/pages/trade-summary/trade-summary.component.html"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
cse = pd.DataFrame(columns=["Company Name", "Symbol", "Share Volume", "Trade Volume", "Previous Close (Rs.)", "Open (Rs.)", "High (Rs.)", "Low (Rs.)", "**Last Traded Price (Rs.)", "Change (Rs.)", "Change Percentage (%)"])
for row in soup.find_all('tbody').find_all('tr'): ##for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    Company_Name = col[0].text
    Symbol = col[1].text
    Share_Volume = col[2].text
    Trade_Volume = col[3].text
    Previous_Close = col[4].text
    Open = col[5].text
    High = col[6].text
    Low = col[7].text
    Last_Traded_Price = col[8].text
    Change = col[9].text
    Change_Percentage = col[10].text
    cse = cse.append({"Company Name":Company_Name,"Symbol":Symbol,"Share Volume":Share_Volume,"Trade Volume":Trade_Volume,"Previous Close (Rs.)":Previous_Close,"Open (Rs.)":Open,"High (Rs.)":High,"Low (Rs.)":Low,"**Last Traded Price (Rs.)":Last_Traded_Price,"Change (Rs.)":Change,"Change Percentage (%)":Change_Percentage}, ignore_index=True)
The data is loaded from an external URL via JavaScript, so BeautifulSoup doesn't see it. You can load it directly from that URL, for example:
import requests
import pandas as pd
url = "https://www.cse.lk/api/tradeSummary"
data = requests.post(url).json()
df = pd.DataFrame(data["reqTradeSummery"])
print(df)
df.to_csv("data.csv", index=False)
Prints:
id name symbol quantity percentageChange change price previousClose high low lastTradedTime issueDate turnover sharevolume tradevolume marketCap marketCapPercentage open closingPrice crossingVolume crossingTradeVol status
0 204 ABANS ELECTRICALS PLC ABAN.N0000 317 4.184704 7.25 180.50 173.25 183.00 172.00 1626944252441 01/JAN/1984 1.256363e+06 7012 44 9.224561e+08 0.0 179.00 180.50 7012 44 0
1 1845 ABANS FINANCE PLC AFSL.N0000 89 -3.225806 -1.00 30.00 31.00 30.10 30.00 1626944124197 27/JUN/2011 1.160916e+06 38652 11 1.996847e+09 0.0 30.10 30.00 38652 11 3
2 2065 ACCESS ENGINEERING PLC AEL.N0000 500 -0.432900 -0.10 23.00 23.10 23.40 22.90 1626944388726 27/MAR/2012 1.968675e+07 855534 264 2.300000e+10 0.0 23.10 23.00 855534 264 0
3 472 ACL CABLES PLC ACL.N0000 1000 -0.963855 -0.40 41.10 41.50 41.70 40.90 1626944397450 01/JAN/1976 3.037800e+07 738027 421 9.846521e+09 0.0 41.50 41.10 738027 421 0
4 406 ACL PLASTICS PLC APLA.N0000 20 0.842697 2.25 269.25 267.00 272.75 266.00 1626943847820 05/APR/1995 1.436916e+06 5333 26 1.134216e+09 0.0 272.75 269.25 5333 26 0
...
and saves the data to data.csv.
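As for the AttributeError in the question: find_all() returns a ResultSet, which behaves like a list, so find_all() cannot be chained onto it; find() returns a single element and can. A small sketch on hand-written markup follows (the sample rows reuse company names from the output above; the table structure is an assumption, since the real page fills its table via JavaScript):

```python
from bs4 import BeautifulSoup

# Hand-written sample table (an assumption for illustration).
SAMPLE_HTML = (
    "<table><tbody>"
    "<tr><td>ABANS ELECTRICALS PLC</td><td>ABAN.N0000</td></tr>"
    "<tr><td>ACL CABLES PLC</td><td>ACL.N0000</td></tr>"
    "</tbody></table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Chaining find_all() on the list-like ResultSet fails.
try:
    soup.find_all("tbody").find_all("tr")
    chained_ok = True
except AttributeError:
    chained_ok = False

# find() returns one element, so find_all() can be called on it.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find("tbody").find_all("tr")
]

print(chained_ok)
print(rows)
```

On the real page this only helps once the rows are actually present in the HTML, which is why the answer goes to the JSON endpoint instead.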

Python Beautiful Soup Webscraping: Cannot get a full table to display

I am relatively new to Python and this is my first web scrape. I am trying to scrape a table and can only get the first column to show up. I am using the find method instead of find_all, which I am pretty sure is what is causing this, but when I use the find_all method I cannot get any text to display. Here is the URL I am scraping from: https://www.fangraphs.com/teams/mariners/stats
I am trying to get the top table (Batting Stat Leaders) to work. My code is below:
from bs4 import BeautifulSoup
import requests
import time
htmlText = requests.get('https://www.fangraphs.com/teams/mariners/stats').text
soup = BeautifulSoup(htmlText, 'lxml', )
playerTable = soup.find('div', class_='team-stats-table')
input = input("Would you like to see Batting, Starting Pitching, Relief Pitching, or Fielding Stats? \n")
def BattingStats():
    print("BATTING STATS:")
    print("Player Name: ")
    for tr in playerTable.find_all('tr')[1:55]:
        tds = tr.find('td').text
        print(tds)
if input == "Batting" or "batting":
    BattingStats()
You can use a list comprehension to get the text of every cell in each row:
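import requests
from bs4 import BeautifulSoup
htmlText = requests.get("https://www.fangraphs.com/teams/mariners/stats").text
soup = BeautifulSoup(htmlText, "lxml")
playerTable = soup.find("div", class_="team-stats-table")
def BattingStats():
    print("BATTING STATS:")
    print("Player Name: ")
    for tr in playerTable.find_all("tr")[1:55]:
        tds = [td.text for td in tr.select("td")]  # text of every cell in the row
        print(*tds)  # unpack the list so the cells print separated by spaces
BattingStats()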
Prints:
BATTING STATS:
Player Name:
Mitch Haniger 30 94 406 25 0 6.7% 23.4% .257 .291 .268 .323 .524 .358 133 0.2 16.4 -6.5 2.4
Ty France 26 89 372 9 0 7.3% 16.9% .150 .314 .276 .355 .426 .341 121 0.0 9.5 -2.6 2.0
Kyle Seager 33 97 403 18 2 8.4% 25.8% .201 .246 .215 .285 .416 .302 95 -0.3 -2.9 5.4 1.6
...
Solution with pandas:
import pandas as pd
url = "https://www.fangraphs.com/teams/mariners/stats"
df = pd.read_html(url)[7]
print(df)
Prints:
            Name  Age   G   PA  HR  SB   BB%     K%    ISO  BABIP    AVG    OBP    SLG   wOBA   wRC+   BsR   Off   Def  WAR
0  Mitch Haniger   30  94  406  25   0  6.7%  23.4%  0.257  0.291  0.268  0.323  0.524  0.358  133.0   0.2  16.4  -6.5  2.4
1      Ty France   26  89  372   9   0  7.3%  16.9%  0.150  0.314  0.276  0.355  0.426  0.341  121.0   0.0   9.5  -2.6  2.0
2    Kyle Seager   33  97  403  18   2  8.4%  25.8%  0.201  0.246  0.215  0.285  0.416  0.302   95.0  -0.3  -2.9   5.4  1.6
...
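The difference between the question's code and the first answer comes down to find() versus collecting every cell. A minimal sketch on hand-written markup follows (the sample table is an assumption that only mimics the team-stats-table wrapper; the real page has many more columns):

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption); names and numbers taken from the output above.
SAMPLE_HTML = """
<div class="team-stats-table"><table>
  <tr><th>Name</th><th>Age</th><th>G</th></tr>
  <tr><td>Mitch Haniger</td><td>30</td><td>94</td></tr>
  <tr><td>Ty France</td><td>26</td><td>89</td></tr>
</table></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
player_table = soup.find("div", class_="team-stats-table")
data_rows = player_table.find_all("tr")[1:]  # skip the header row

# find() stops at the first match, so only the first column comes back.
first_cells = [tr.find("td").text for tr in data_rows]

# find_all() returns every cell of the row; .text is read from each cell,
# not from the list that find_all() returns.
full_rows = [[td.text for td in tr.find_all("td")] for tr in data_rows]

print(first_cells)
print(full_rows)
```

Separately, the condition input == "Batting" or "batting" in the question is always true, because the non-empty string "batting" is truthy on its own; comparing a lower-cased copy of the answer with "batting" avoids that.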

Pandas and bs4 skip hyperlink in scraped table

I'm trying to scrape a table from MTGGoldfish using pandas and bs4. The long-term goal is to text myself the movers and shakers list. I get 4 out of 5 columns, but the one that contains a hyperlink is skipped or gives an odd result. All I want is the displayed name of the hyperlink so I can read it as a table.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
response = requests.get("https://www.mtggoldfish.com/movers/paper/standard")
soup = bs(response.text, "html.parser")
table = soup.find_all('table')
df = pd.read_html(str(table))[0]
print(df)
The output is this:
   Top Winners  Top Winners.1  ...  Top Winners.3  Top Winners.4
0         5.49           xznr  ...        $ 16.00           +52%
1         0.96            thb  ...        $ 18.99            +5%
2         0.63           xznr  ...         $ 5.46           +13%
3         0.49            m21  ...         $ 4.99           +11%
4         0.41           xznr  ...         $ 4.45           +10%
5         0.32           xznr  ...        $ 17.10            +2%
6         0.25           xznr  ...         $ 0.71           +54%
7         0.25           xznr  ...         $ 0.67           +60%
8         0.15            eld  ...        $ 18.70            +1%
9         0.12            thb  ...        $ 11.87            +1%
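The 3rd column is the name of the card, attached to a hyperlink to the card's page on the site. I can't figure out how to extract everything together.
The column is not skipped: pandas only abbreviates wide frames when printing, which is what the ... marks. Just call .to_string() to print every column: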
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
response = requests.get("https://www.mtggoldfish.com/movers/paper/standard")
soup = bs(response.text, "html.parser")
table = soup.find_all("table")
df = pd.read_html(str(table))[0]
print(df.to_string())
Output:
   Top Winners  Top Winners.1                   Top Winners.2  Top Winners.3  Top Winners.4
0         0.96            thb  Kroxa, Titan of Death's Hunger        $ 18.99            +5%
1         0.63           xznr              Clearwater Pathway         $ 5.46           +13%
2         0.49            m21         Thieves' Guild Enforcer         $ 4.99           +11%
3         0.41           xznr             Skyclave Apparition         $ 4.45           +10%
4         0.32           xznr        Scourge of the Skyclaves        $ 17.10            +2%
5         0.25           xznr                 Malakir Rebirth         $ 0.71           +54%
6         0.25           xznr                Blackbloom Rogue         $ 0.67           +60%
7         0.16           xznr                 Zof Consumption         $ 0.63           +34%
8         0.15            eld            Oko, Thief of Crowns        $ 18.70            +1%
9         0.12            thb             Heliod, Sun-Crowned        $ 11.87            +1%
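The abbreviation can be reproduced without scraping anything. The sketch below builds a small frame from two of the rows shown above and forces the shortened display with the display.max_columns option; limiting it to 4 columns is an assumption chosen to trigger the effect, not the setting of the asker's console.

```python
import pandas as pd

# Two sample rows taken from the output above.
df = pd.DataFrame(
    [
        [0.96, "thb", "Kroxa, Titan of Death's Hunger", "$ 18.99", "+5%"],
        [0.63, "xznr", "Clearwater Pathway", "$ 5.46", "+13%"],
    ],
    columns=[
        "Top Winners",
        "Top Winners.1",
        "Top Winners.2",
        "Top Winners.3",
        "Top Winners.4",
    ],
)

# With fewer columns allowed than the frame has, the middle ones are
# replaced by "..." in the printed representation.
with pd.option_context("display.max_columns", 4):
    truncated = repr(df)

# to_string() ignores the display limits and renders every column.
full = df.to_string()

print(truncated)
print(full)
```

The data is intact in both cases; only the printed form differs, so df["Top Winners.2"] returns the card names either way.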
You can read HTML tables directly into pandas. flavor can be set to "bs4" (BeautifulSoup with html5lib), but "lxml" is faster.
import pandas as pd
tables = pd.read_html("https://www.mtggoldfish.com/movers/paper/standard", flavor='lxml')
# Daily Change
daily_winners = tables[0]
daily_losers = tables[1]
# Weekly Change
weekly_winners = tables[2]
weekly_losers = tables[3]

Python Beginner: How to Extract data from selected row from a website table

import urllib.request
from bs4 import BeautifulSoup
url = ('http://texaset.tamu.edu/')
page = urllib.request.urlopen(url).read()
soup = BeautifulSoup(page)
#table = soup.find_all('table')
gdata = soup.find_all('td',{"class":"Data"})
for item in gdata:
    print(item.text)
This is the code for extracting the data from the website.
After executing the code, the output is similar to this:
Conroe
0.12
58
45
28
15.76
0.00
4.70
6.06
Huntsville
0.10
56
41
27
16.21
0.00
2.10
3.57
Overton
0.12
53
35
42
16.34
0.00
7.52
16.89
But I need the data of only one city, like this:
Conroe
0.12
58
45
28
15.76
0.00
4.70
6.06
I'm not sure what you are asking here, but my guess is that you are trying to extract the text from a table cell.
Have you tried print(gdata[0].text)? (gdata is a list of cells, so gdata.text itself raises an AttributeError.)
EDIT 1
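If the CSS selectors are the same for all the cells, there may be more than two lines of extracted data, and the desired data can be anywhere in the table, I suggest extracting all the cell texts into a list and then searching it for Conroe, like this:
List1 = [item.text.strip() for item in gdata]  # text of every cell, in page order
for i, item in enumerate(List1):
    if item.startswith('Conroe'):
        print(*List1[i:i + 9], sep='\n')  # the city name plus the 8 values that follow it
        break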
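If each city's cells share one table row, the row itself can be used instead of counting cells. This is a sketch under that assumption: the sample markup below is hand-written (the real page's layout is not shown in the question), and city_values is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption); values taken from the output above.
SAMPLE_HTML = """
<table>
  <tr><td class="Data">Conroe</td><td class="Data">0.12</td><td class="Data">58</td></tr>
  <tr><td class="Data">Huntsville</td><td class="Data">0.10</td><td class="Data">56</td></tr>
</table>
"""

def city_values(html, city):
    """Return the text of every Data cell in the row that starts with `city`."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all("td", {"class": "Data"}):
        if cell.get_text(strip=True) == city:
            row = cell.find_parent("tr")  # the row that holds this city
            return [td.get_text(strip=True) for td in row.find_all("td", {"class": "Data"})]
    return []  # city not found

print(city_values(SAMPLE_HTML, "Conroe"))
```

If the cells of one city are spread over several rows on the real page, the index-based slice is the safer choice.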
