I tried three different techniques to scrape a table named 'table-light', but nothing is actually working for me. The code below shows my attempts to extract the data.
import pandas as pd
tables = pd.read_html('https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap')
tables
############################################################################
import requests
import pandas as pd
url = 'https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[10]
print(df)
############################################################################
import requests
from bs4 import BeautifulSoup
url = "https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all('table-light')
print(table)
The table that I am trying to extract data from is named 'table-light'. I want to get all the columns and all 144 rows. How can I do that?
You can try to set the User-Agent header to get the correct HTML (and not a captcha page):
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml")  # <-- don't use html.parser here
table = soup.select_one(".table-light")
# Promote the first row's cells to <th> so pd.read_html treats them as headers.
for td in table.tr.select('td'):
    td.name = 'th'
df = pd.read_html(str(table))[0]
print(df.head())
Prints:
No. Name Market Cap P/E Fwd P/E PEG P/S P/B P/C P/FCF EPS past 5Y EPS next 5Y Sales past 5Y Change Volume
0 1 Real Estate - Development 3.14B 3.21 21.12 0.24 0.60 0.52 2.28 17.11 43.30% 13.42% 13.69% 1.43% 715.95K
1 2 Textile Manufacturing 3.42B 32.58 25.04 - 1.43 2.58 9.88 90.16 15.31% -0.49% 3.54% 1.83% 212.71K
2 3 Coking Coal 5.31B 2.50 4.93 0.37 0.64 1.53 4.20 2.54 38.39% 6.67% 22.92% 5.43% 1.92M
3 4 Real Estate - Diversified 6.71B 17.38 278.89 0.87 2.78 1.51 15.09 91.64 0.48% 19.91% 11.97% 3.31% 461.33K
4 5 Other Precious Metals & Mining 8.10B 24.91 29.07 2.71 6.52 1.06 14.47 97.98 16.30% 9.19% 20.71% 0.23% 4.77M
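Everything pd.read_html returns here is text, so figures like "43.30%" and "3.14B" arrive as strings. If you want to sort or filter on them, here is a small sketch of conversion helpers; the function names are mine, not part of the answer above:

```python
# Hypothetical helpers (not part of the answer above) for turning
# Finviz-style strings such as "43.30%" or "3.14B" into plain floats.

def pct_to_float(s):
    """'43.30%' -> 0.433; '-' (Finviz's placeholder) -> None."""
    s = s.strip()
    if s in ("-", ""):
        return None
    return float(s.rstrip("%")) / 100.0

def abbrev_to_float(s):
    """'3.14B' -> 3.14e9, '715.95K' -> 715950.0; plain numbers pass through."""
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
    s = s.strip()
    if s and s[-1].upper() in multipliers:
        return float(s[:-1]) * multipliers[s[-1].upper()]
    return float(s)

print(round(pct_to_float("43.30%"), 4))   # 0.433
print(round(abbrev_to_float("715.95K")))  # 715950
```

With these you could, for example, run df["Change"].map(pct_to_float) after the read_html call.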
I'm trying to find a way to paste some trades (which I've already correctly formatted) in a field box on this Online UK Capital Gains Tax Calculator - http://www.cgtcalculator.com/calculator.aspx - and then click on the 'Calculate' button to finally receive the information in the results box.
An example of some trades (not my trades) - this data is currently in a DataFrame:
B 01/12/2000 BARC 3000 4.82 15.00 72.3
B 02/12/2000 BARC 5000 4.825 15.00 120.62
B 03/09/2002 VOD 18000 3.04 10.00 273.60
B 15/01/2003 BP. 5000 3.75 10.00 93.75
B 24/03/2003 BP. 3000 3.82 10.00 57.30
S 14/04/2003 BARC 6000 5.52 15.00 0.00
S 23/02/2004 VOD 9000 3.62 10.00 0.00
S 24/02/2004 VOD 9000 3.625 10.00 0.00
S 15/07/2005 BP. 8000 6.28 10.00 0.00
B 22/01/2007 BP. 5000 5.50 10.00 124.20
B 22/06/2009 BP. 2000 5.02 10.00 50.20
S 24/12/2012 BP. 5000 4.336 10.00 0.00
When I paste the data above into the entry field and then click on 'Format/Sort', I can see it appearing in the HTML source like this:
<textarea name="TEXTAREA1" id="TEXTAREA1" class="trades" cols="114">
"Trade Data Appears Here"
Then, when I click on the 'Calculate' button, a number of <span> tags appear to flash. The results can be seen under the Network and then Response tabs.
What would be the best way of pasting the trades data into the webpage field, clicking the Calculate button, and then obtaining the results in Python? Would a POST request work? I'm just after some guidance here, as I'm new to Python, and I understand that what I'm asking for might not be possible or might be slightly too complex for me.
I've tried the following but I'm only getting the content of the webpage back.
import requests
import pandas as pd
data = pd.read_csv("/Users/Test.csv")
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36'}
url = 'http://www.cgtcalculator.com/calculator.aspx'
response = requests.post(url, data=data, headers=headers)
print(response.text)
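One likely reason you only get the page content back: calculator.aspx is an ASP.NET WebForms page, and those normally ignore a POST that doesn't echo back the hidden state fields (__VIEWSTATE, __EVENTVALIDATION) served with the form, plus the name/value pair of the button you "clicked". Below is a minimal sketch of collecting those fields with the standard library; the sample form markup and the Button1 name are assumptions, so check the real field names in your browser's DevTools before posting:

```python
# A minimal sketch, not a verified solution: collect the hidden <input>
# fields an ASP.NET page serves, then build the POST payload from them.
# The sample form below and the "Button1" name are assumptions.
from html.parser import HTMLParser

class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        if a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

# In practice you would first GET the page and feed response.text in here.
sample_form = '''
<form method="post" action="calculator.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz..." />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWAg..." />
  <textarea name="TEXTAREA1" id="TEXTAREA1" class="trades"></textarea>
</form>
'''
collector = HiddenInputCollector()
collector.feed(sample_form)

payload = dict(collector.fields)
payload["TEXTAREA1"] = "B 01/12/2000 BARC 3000 4.82 15.00 72.3"
# The calculate button's name is a guess; inspect the real <input> element.
payload["Button1"] = "Calculate"
print(sorted(payload))  # ['Button1', 'TEXTAREA1', '__EVENTVALIDATION', '__VIEWSTATE']
```

You would then send this payload with requests.post(url, data=payload, headers=headers) and parse the returned HTML for the results.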
I'm attempting to pull data from Finviz, but I'm only able to pull one row each time.
Here's my code:
url = ('https://finviz.com/quote.ashx?t=' + ticker.upper())
r = Request(url, headers = header)
html = urlopen(r).read()
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')
rows = rows[13:20]
for row in rows:
    row_td = row.find_all('td')  # <------------ I believe the issue is with this section?
    #print(row_td)
    str_cells = str(row_td)
    clean = BeautifulSoup(str_cells, "lxml").get_text()
    print(clean)
Only prints:
[Dividend %, 2.97%, Quick Ratio, 1.30, Sales past 5Y, -5.70%, Gross Margin, 60.60%, 52W Low, 20.59%, ATR, 0.64] - even though I specify rows[13:30]
I'd like to print out all of the rows from the table on the page.
here is a screenshot of the table
You can do it easily using only pandas DataFrame.
Here is the full working output.
CODE:
import requests
import pandas as pd
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
url = "https://finviz.com/quote.ashx?t=KO"
req = requests.get(url,headers=headers)
wiki_table = pd.read_html(req.text, attrs={"class": "snapshot-table2"})
df = wiki_table[0]
print(df)
OUTPUT:
0 1 2 3 ... 8 9 10 11
0 Index DJIA S&P500 P/E 30.35 ... Shs Outstand 4.31B Perf Week -1.03%
1 Market Cap 245.44B Forward P/E 23.26 ... Shs Float 4.29B Perf Month 0.30%
2 Income 8.08B PEG 3.00 ... Short Float 0.75% Perf Quarter 3.70%
3 Sales 36.41B P/S 6.74 ... Short Ratio 2.34 Perf Half Y 11.87%
4 Book/sh 5.16 P/B 10.98 ... Target Price 62.06 Perf Year 19.62%
5 Cash/sh 3.01 P/C 18.82 ... 52W Range 46.97 - 57.56 Perf YTD 3.28%
6 Dividend 1.68 P/FCF 95.02 ... 52W High -1.60% Beta 0.63
7 Dividend % 2.97% Quick Ratio 1.30 ... 52W Low 20.59% ATR 0.64
8 Employees 80300 Current Ratio 1.50 ... RSI (14) 51.63 Volatility 1.17% 0.94%
9 Optionable Yes Debt/Eq 1.89 ... Rel Volume 0.76 Prev Close 56.86
10 Shortable Yes LT Debt/Eq 1.79 ... Avg Volume 13.67M Price 56.64
11 Recom 2.20 SMA20 -0.42% ... Volume 10340772 Change -0.39%
[12 rows x 12 columns]
In the for-loop, you're overwriting the variable row_td on every iteration. Store its contents in a list instead (in my example, I use the list all_data to store all rows).
To print all rows from the table, you can use next example:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0"
}
url = "https://finviz.com/quote.ashx?t=KO"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml")
all_data = []
for tr in soup.select(".snapshot-table2 tr"):
    tds = [td.get_text(strip=True) for td in tr.select("td")]
    all_data.append(tds)
fmt_string = "{:<15}" * 12
for row in all_data:
    print(fmt_string.format(*row))
Prints:
Index DJIA S&P500 P/E 30.35 EPS (ttm) 1.87 Insider Own 0.30% Shs Outstand 4.31B Perf Week -1.03%
Market Cap 245.44B Forward P/E 23.26 EPS next Y 2.44 Insider Trans -2.65% Shs Float 4.29B Perf Month 0.30%
Income 8.08B PEG 3.00 EPS next Q 0.58 Inst Own 69.00% Short Float 0.75% Perf Quarter 3.70%
Sales 36.41B P/S 6.74 EPS this Y -13.30% Inst Trans 0.55% Short Ratio 2.34 Perf Half Y 11.87%
Book/sh 5.16 P/B 10.98 EPS next Y 7.84% ROA 8.90% Target Price 62.06 Perf Year 19.62%
Cash/sh 3.01 P/C 18.82 EPS next 5Y 10.12% ROE 40.10% 52W Range 46.97 - 57.56 Perf YTD 3.28%
Dividend 1.68 P/FCF 95.02 EPS past 5Y 1.40% ROI 12.20% 52W High -1.60% Beta 0.63
Dividend % 2.97% Quick Ratio 1.30 Sales past 5Y -5.70% Gross Margin 60.60% 52W Low 20.59% ATR 0.64
Employees 80300 Current Ratio 1.50 Sales Q/Q 41.70% Oper. Margin 25.70% RSI (14) 51.63 Volatility 1.17% 0.94%
Optionable Yes Debt/Eq 1.89 EPS Q/Q 47.70% Profit Margin 22.20% Rel Volume 0.76 Prev Close 56.86
Shortable Yes LT Debt/Eq 1.79 Earnings Jul 21 BMO Payout 87.90% Avg Volume 13.67M Price 56.64
Recom 2.20 SMA20 -0.42% SMA50 1.65% SMA200 6.56% Volume 10,340,772 Change -0.39%
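Since the snapshot table alternates label/value cells, the flat row lists collected in all_data can also be folded into a single dict for keyed lookups. A small follow-up sketch (the sample rows below mirror the output shape; values are illustrative):

```python
# Fold rows of alternating label/value cells into one {label: value} dict.
# Assumes each row is [label, value, label, value, ...], as in the
# snapshot-table2 rows scraped above.
def rows_to_dict(rows):
    data = {}
    for row in rows:
        # Pair even-indexed cells (labels) with odd-indexed cells (values).
        for label, value in zip(row[0::2], row[1::2]):
            data[label] = value
    return data

sample_rows = [
    ["Index", "DJIA S&P500", "P/E", "30.35"],
    ["Market Cap", "245.44B", "Forward P/E", "23.26"],
]
snapshot = rows_to_dict(sample_rows)
print(snapshot["P/E"])         # 30.35
print(snapshot["Market Cap"])  # 245.44B
```

With the real page you would call rows_to_dict(all_data) and look up any metric by name.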
Can you fix this code for me? It is giving me an error message like:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Can anyone please help me with this? Below is the code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url="https://www.cse.lk/pages/trade-summary/trade-summary.component.html"
data = requests.get(url).text
soup = BeautifulSoup(data, 'html5lib')
cse = pd.DataFrame(columns=["Company Name", "Symbol", "Share Volume", "Trade Volume", "Previous Close (Rs.)", "Open (Rs.)", "High (Rs.)", "Low (Rs.)", "**Last Traded Price (Rs.)", "Change (Rs.)", "Change Percentage (%)"])
for row in soup.find_all('tbody').find_all('tr'): ##for row in soup.find("tbody").find_all('tr'):
    col = row.find_all("td")
    Company_Name = col[0].text
    Symbol = col[1].text
    Share_Volume = col[2].text
    Trade_Volume = col[3].text
    Previous_Close = col[4].text
    Open = col[5].text
    High = col[6].text
    Low = col[7].text
    Last_Traded_Price = col[8].text
    Change = col[9].text
    Change_Percentage = col[10].text
    cse = cse.append({"Company Name":Company_Name,"Symbol":Symbol,"Share Volume":Share_Volume,"Trade Volume":Trade_Volume,"Previous Close (Rs.)":Previous_Close,"Open (Rs.)":Open,"High (Rs.)":High,"Low (Rs.)":Low,"**Last Traded Price (Rs.)":Last_Traded_Price,"Change (Rs.)":Change,"Change Percentage (%)":Change_Percentage}, ignore_index=True)
The data is loaded from an external URL via JavaScript, so BeautifulSoup doesn't see it. You can use this example to load it:
import requests
import pandas as pd
url = "https://www.cse.lk/api/tradeSummary"
data = requests.post(url).json()
df = pd.DataFrame(data["reqTradeSummery"])
print(df)
df.to_csv("data.csv", index=None)
Prints:
id name symbol quantity percentageChange change price previousClose high low lastTradedTime issueDate turnover sharevolume tradevolume marketCap marketCapPercentage open closingPrice crossingVolume crossingTradeVol status
0 204 ABANS ELECTRICALS PLC ABAN.N0000 317 4.184704 7.25 180.50 173.25 183.00 172.00 1626944252441 01/JAN/1984 1.256363e+06 7012 44 9.224561e+08 0.0 179.00 180.50 7012 44 0
1 1845 ABANS FINANCE PLC AFSL.N0000 89 -3.225806 -1.00 30.00 31.00 30.10 30.00 1626944124197 27/JUN/2011 1.160916e+06 38652 11 1.996847e+09 0.0 30.10 30.00 38652 11 3
2 2065 ACCESS ENGINEERING PLC AEL.N0000 500 -0.432900 -0.10 23.00 23.10 23.40 22.90 1626944388726 27/MAR/2012 1.968675e+07 855534 264 2.300000e+10 0.0 23.10 23.00 855534 264 0
3 472 ACL CABLES PLC ACL.N0000 1000 -0.963855 -0.40 41.10 41.50 41.70 40.90 1626944397450 01/JAN/1976 3.037800e+07 738027 421 9.846521e+09 0.0 41.50 41.10 738027 421 0
4 406 ACL PLASTICS PLC APLA.N0000 20 0.842697 2.25 269.25 267.00 272.75 266.00 1626943847820 05/APR/1995 1.436916e+06 5333 26 1.134216e+09 0.0 272.75 269.25 5333 26 0
...
and saves data.csv.
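Once the JSON is in hand you don't strictly need pandas; plain Python works for simple selections too. A sketch using illustrative records shaped like the reqTradeSummery entries above:

```python
# A small sketch of filtering the API's JSON records directly.
# The records below only mimic the shape of the real response;
# the values are illustrative, not live data.
records = [
    {"name": "ABANS ELECTRICALS PLC", "symbol": "ABAN.N0000",
     "price": 180.50, "percentageChange": 4.184704},
    {"name": "ABANS FINANCE PLC", "symbol": "AFSL.N0000",
     "price": 30.00, "percentageChange": -3.225806},
]

# Keep only gainers, sorted by percentage change (largest first).
gainers = sorted(
    (r for r in records if r["percentageChange"] > 0),
    key=lambda r: r["percentageChange"],
    reverse=True,
)
print([r["symbol"] for r in gainers])  # ['ABAN.N0000']
```

The same filtering could of course be done on data["reqTradeSummery"] from the real response before building the DataFrame.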
I am trying to extract the value of the "Fee" field for each row on the website https://remittanceprices.worldbank.org/en/corridor/United-States/Mexico. I came really close, but I am not able to extract the values for the "Fee" column. The code is shown below:
import requests
from bs4 import BeautifulSoup

url = 'https://remittanceprices.worldbank.org/en/corridor/Australia/China'
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.text, 'lxml')
rows = [i.get_text("|").split("|") for i in soup.select('#tab-1 .corridor-row')]
for row in rows:
    #a,b,c,d,e = row[2],row[15],row[18],row[21],row[25]
    #print(a,b,c,d,e,sep='|')
    print('{0[2]}|{0[15]}|{0[18]}|{0[21]}|{0[25]}'.format(row))
def main():
    import requests
    from bs4 import BeautifulSoup

    url = "https://remittanceprices.worldbank.org/en/corridor/United-States/Mexico"

    response = requests.get(url)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, "html.parser")
    for cell in soup.select("div.views-field-field-cc1-fee > div > span:first-of-type"):
        print(cell.text.strip())

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
0.00
0.00
2.99
2.99
3.00
2.99
4.00
5.00
3.99
3.99
3.99
3.99
5.00
5.00
5.00
3.99
3.99
8.00
8.50
10.00
3.99
7.00
10.00
4.95
10.00
9.99
8.00
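The scraped fees come back as strings, so if you want to summarize them, convert them first. A tiny follow-up sketch using a sample subset of the output above:

```python
# Convert the scraped fee strings to floats and summarize.
# fees_text here is a sample subset of the printed output above.
fees_text = ["0.00", "2.99", "3.00", "5.00", "8.50", "10.00"]
fees = [float(f) for f in fees_text]

print(min(fees), max(fees))  # 0.0 10.0
print(round(sum(fees) / len(fees), 2))  # average fee
```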
I'm trying to scrape a table off of MTGGoldfish using pandas and bs4. The long-term goal is to text myself the movers and shakers list, but I only get 4 out of 5 columns: it skips the one that has a hyperlink and gives an odd result. All I want is the displayed name of the hyperlink so I can read it as a table.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
response = requests.get("https://www.mtggoldfish.com/movers/paper/standard")
soup = bs(response.text, "html.parser")
table = soup.find_all('table')
df = pd.read_html(str(table))[0]
print(df)
The output is this:
Top Winners Top Winners.1 ... Top Winners.3 Top Winners.4
0 5.49 xznr ... $ 16.00 +52%
1 0.96 thb ... $ 18.99 +5%
2 0.63 xznr ... $ 5.46 +13%
3 0.49 m21 ... $ 4.99 +11%
4 0.41 xznr ... $ 4.45 +10%
5 0.32 xznr ... $ 17.10 +2%
6 0.25 xznr ... $ 0.71 +54%
7 0.25 xznr ... $ 0.67 +60%
8 0.15 eld ... $ 18.70 +1%
9 0.12 thb ... $ 11.87 +1%
The 3rd column is the name of the card, attached to a hyperlink to the card's page on the site. I can't figure out how to extract everything together.
Just call .to_string(): the column isn't missing, pandas is only truncating the printed output.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
response = requests.get("https://www.mtggoldfish.com/movers/paper/standard")
soup = bs(response.text, "html.parser")
table = soup.find_all("table")
df = pd.read_html(str(table))[0]
print(df.to_string())
Output:
Top Winners Top Winners.1 Top Winners.2 Top Winners.3 Top Winners.4
0 0.96 thb Kroxa, Titan of Death's Hunger $ 18.99 +5%
1 0.63 xznr Clearwater Pathway $ 5.46 +13%
2 0.49 m21 Thieves' Guild Enforcer $ 4.99 +11%
3 0.41 xznr Skyclave Apparition $ 4.45 +10%
4 0.32 xznr Scourge of the Skyclaves $ 17.10 +2%
5 0.25 xznr Malakir Rebirth $ 0.71 +54%
6 0.25 xznr Blackbloom Rogue $ 0.67 +60%
7 0.16 xznr Zof Consumption $ 0.63 +34%
8 0.15 eld Oko, Thief of Crowns $ 18.70 +1%
9 0.12 thb Heliod, Sun-Crowned $ 11.87 +1%
You can read HTML tables directly into pandas. flavor can be set to "html.parser", but "lxml" is faster.
import pandas as pd
tables = pd.read_html("https://www.mtggoldfish.com/movers/paper/standard", flavor='lxml')
# Daily Change
daily_winners = tables[0]
daily_losers = tables[1]
# Weekly Change
weekly_winners = tables[2]
weekly_losers = tables[3]