I'm trying to scrape a table off of MTG Goldfish using pandas and bs4. The long-term goal is to text myself the movers and shakers list, but I only get 4 of the 5 columns cleanly; the one that contains a hyperlink comes out oddly. All I want is the displayed name of the hyperlink so I can read it as a table.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
response = requests.get("https://www.mtggoldfish.com/movers/paper/standard")
soup = bs(response.text, "html.parser")
table = soup.find_all('table')
df = pd.read_html(str(table))[0]
print(df)
The output is this:
Top Winners Top Winners.1 ... Top Winners.3 Top Winners.4
0 5.49 xznr ... $ 16.00 +52%
1 0.96 thb ... $ 18.99 +5%
2 0.63 xznr ... $ 5.46 +13%
3 0.49 m21 ... $ 4.99 +11%
4 0.41 xznr ... $ 4.45 +10%
5 0.32 xznr ... $ 17.10 +2%
6 0.25 xznr ... $ 0.71 +54%
7 0.25 xznr ... $ 0.67 +60%
8 0.15 eld ... $ 18.70 +1%
9 0.12 thb ... $ 11.87 +1%
The 3rd column is the name of the card, linked to the card's page on the site. I can't figure out how to extract everything together.
The columns aren't actually missing; pandas just elides the middle ones when printing a wide DataFrame. Just call .to_string() to see them all:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
response = requests.get("https://www.mtggoldfish.com/movers/paper/standard")
soup = bs(response.text, "html.parser")
table = soup.find_all("table")
df = pd.read_html(str(table))[0]
print(df.to_string())
Output:
Top Winners Top Winners.1 Top Winners.2 Top Winners.3 Top Winners.4
0 0.96 thb Kroxa, Titan of Death's Hunger $ 18.99 +5%
1 0.63 xznr Clearwater Pathway $ 5.46 +13%
2 0.49 m21 Thieves' Guild Enforcer $ 4.99 +11%
3 0.41 xznr Skyclave Apparition $ 4.45 +10%
4 0.32 xznr Scourge of the Skyclaves $ 17.10 +2%
5 0.25 xznr Malakir Rebirth $ 0.71 +54%
6 0.25 xznr Blackbloom Rogue $ 0.67 +60%
7 0.16 xznr Zof Consumption $ 0.63 +34%
8 0.15 eld Oko, Thief of Crowns $ 18.70 +1%
9 0.12 thb Heliod, Sun-Crowned $ 11.87 +1%
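If you'd rather pull the link text out yourself before building a DataFrame, bs4's get_text() returns a cell's visible text even when the cell wraps an <a> tag. A minimal sketch on a made-up two-row stand-in for the movers table:

```python
from bs4 import BeautifulSoup

# made-up stand-in for two rows of the movers table
html = """
<table>
  <tr><td>xznr</td><td><a href="/price/x">Clearwater Pathway</a></td></tr>
  <tr><td>thb</td><td><a href="/price/y">Heliod, Sun-Crowned</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    # get_text(strip=True) returns the link's displayed name, not its href
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
print(rows)
```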
You can read HTML tables directly into pandas with read_html. flavor can be set to "html.parser", but "lxml" is faster.
import pandas as pd
tables = pd.read_html("https://www.mtggoldfish.com/movers/paper/standard", flavor='lxml')
# Daily Change
daily_winners = tables[0]
daily_losers = tables[1]
# Weekly Change
weekly_winners = tables[2]
weekly_losers = tables[3]
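If you also want the URL behind each card name (not just the displayed text), newer pandas (1.5+) can return (text, href) tuples via the extract_links argument. A sketch on a made-up one-row stand-in for the real table:

```python
import pandas as pd
from io import StringIO

# made-up one-row stand-in for the movers table
html = """
<table>
  <tr><th>Set</th><th>Card</th></tr>
  <tr><td>xznr</td><td><a href="/price/x">Clearwater Pathway</a></td></tr>
</table>
"""

# extract_links="body" turns each body cell into a (text, href) tuple,
# with href set to None for cells that contain no link
df = pd.read_html(StringIO(html), extract_links="body")[0]
name, url = df.loc[0, "Card"]
print(name, url)
```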
I tried three different techniques to scrape a table with the class 'table-light', but nothing is actually working for me. The code below shows my attempts to extract the data.
import pandas as pd
tables = pd.read_html('https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap')
tables
############################################################################
import requests
import pandas as pd
url = 'https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[10]
print(df)
############################################################################
import requests
from bs4 import BeautifulSoup
url = "https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
table = soup.find_all('table-light')
print(table)
The table I am trying to extract data from has the class 'table-light'. I want to get all the columns and all 144 rows. How can I do that?
You can try setting the User-Agent header to get the correct HTML (and not a captcha page):
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = "https://finviz.com/groups.ashx?g=industry&v=120&o=marketcap"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "lxml") # <-- don't use html.parser here
table = soup.select_one(".table-light")
# promote the header row's <td> cells to <th> so read_html uses them as column names
for td in table.tr.select('td'):
    td.name = 'th'
df = pd.read_html(str(table))[0]
print(df.head())
Prints:
No. Name Market Cap P/E Fwd P/E PEG P/S P/B P/C P/FCF EPS past 5Y EPS next 5Y Sales past 5Y Change Volume
0 1 Real Estate - Development 3.14B 3.21 21.12 0.24 0.60 0.52 2.28 17.11 43.30% 13.42% 13.69% 1.43% 715.95K
1 2 Textile Manufacturing 3.42B 32.58 25.04 - 1.43 2.58 9.88 90.16 15.31% -0.49% 3.54% 1.83% 212.71K
2 3 Coking Coal 5.31B 2.50 4.93 0.37 0.64 1.53 4.20 2.54 38.39% 6.67% 22.92% 5.43% 1.92M
3 4 Real Estate - Diversified 6.71B 17.38 278.89 0.87 2.78 1.51 15.09 91.64 0.48% 19.91% 11.97% 3.31% 461.33K
4 5 Other Precious Metals & Mining 8.10B 24.91 29.07 2.71 6.52 1.06 14.47 97.98 16.30% 9.19% 20.71% 0.23% 4.77M
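As a side note, the td-to-th renaming above can often be avoided by telling read_html which row holds the labels. A minimal sketch on an inline table with made-up column names:

```python
import pandas as pd
from io import StringIO

# the header row uses <td> cells, as on the finviz page
html = """
<table class="table-light">
  <tr><td>No.</td><td>Name</td></tr>
  <tr><td>1</td><td>Real Estate - Development</td></tr>
</table>
"""

# header=0 promotes the first row to column labels without any tag surgery
df = pd.read_html(StringIO(html), header=0)[0]
print(df.columns.tolist())
```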
I'm a bit of a noob in Python. I'm trying to get some tables from this page:
https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html
Using pandas and pd.read_html I'm able to get most of them, but not the "Line Score" and the "Four Factors". If I print all the tables (there are 19), these two are missing; inspecting with Chrome they seem to be tables, and I can also get them in Excel by importing from the web.
What am I missing here?
Any help appreciated, thanks!
If you look at the page source (not by inspecting), you'd see those tables are inside HTML comments. You can either a) edit the HTML string and remove the <!-- and --> markers, then let pandas parse it, or b) use bs4 to pull out the comments, then parse the tables that way.
I'll show you both options:
Option 1: Remove the comment tags from the page source
import requests
import pandas as pd
url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
response = requests.get(url).text.replace("<!--","").replace("-->","")
dfs = pd.read_html(response, header=1)
Output:
You can see you now have 21 tables, with the 4th and 5th tables the ones in question.
print(len(dfs))
for each in dfs[3:5]:
    print('\n\n', each, '\n')
21
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
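The same comment-stripping trick works on any HTML string, so you can test it without hitting the site; the snippet below is made up:

```python
import pandas as pd
from io import StringIO

# a made-up page with a table hidden inside an HTML comment
html = """
<div>
<!--
<table>
  <tr><th>Team</th><th>T</th></tr>
  <tr><td>Seattle Storm</td><td>89</td></tr>
</table>
-->
</div>
"""

# once the comment markers are gone, pandas can see the table
uncommented = html.replace("<!--", "").replace("-->", "")
dfs = pd.read_html(StringIO(uncommented))
print(dfs[0])
```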
Option 2: Pull out comments with bs4
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
dfs = pd.read_html(result, header=1)  # the regular (non-commented) tables, if you still want them
comments = data.find_all(string=lambda text: isinstance(text, Comment))
other_tables = []
for each in comments:
    if '<table' in str(each):
        try:
            other_tables.append(pd.read_html(str(each), header=1)[0])
        except ValueError:
            continue
Output:
for each in other_tables:
    print(each, '\n')
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
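Option 2's comment handling can likewise be sketched offline. A Comment node is a NavigableString subclass, which is why the isinstance check works (the markup below is made up):

```python
from bs4 import BeautifulSoup, Comment

# made-up markup with a commented-out table
html = "<div><!-- <table><tr><td>hidden</td></tr></table> --></div>"

soup = BeautifulSoup(html, "html.parser")
# Comment is a NavigableString subclass, so find_all(string=...) can filter on it
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
table_comments = [c for c in comments if "<table" in c]
print(len(table_comments))
```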
I am trying to parse the page https://statusinvest.com.br/fundos-imobiliarios/knhy11, where I need the info from a dividends history table, which is paginated dynamically on the page.
So, by doing
url = 'https://statusinvest.com.br/fundos-imobiliarios/knhy11'
from urllib.request import Request, urlopen
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
text = webpage.decode('utf-8')
spt = text.split('\n')
file1 = open("myfile.txt", "w")
for line in spt:
    file1.write(line + '\n')
file1.close()
I am able to get the first (default) page, but the info from the other pages doesn't come through.
How could I fix that?
Thanks.
The full data is located in an input tag with id "results", in its value attribute; the value is JSON.
You can use BeautifulSoup to parse the HTML, as in the following script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
r = requests.get("https://statusinvest.com.br/fundos-imobiliarios/knhy11")
soup = BeautifulSoup(r.content, "html.parser")
data = pd.DataFrame([
    {
        "TIPO": t["et"],
        "DATA COM": t["ed"],
        "PAGAMENTO": t["pd"],
        "VALOR": t["v"]
    }
    for t in json.loads(soup.find("input", {"id": "results"})["value"])
])
print(data)
Output:
TIPO DATA COM PAGAMENTO VALOR
0 Rendimento 30/09/2020 14/10/2020 0.64
1 Rendimento 31/08/2020 14/09/2020 0.65
2 Rendimento 31/07/2020 13/08/2020 0.45
3 Rendimento 30/06/2020 13/07/2020 0.32
4 Rendimento 29/05/2020 12/06/2020 0.30
5 Rendimento 30/04/2020 14/05/2020 0.50
6 Rendimento 31/03/2020 14/04/2020 0.70
.............................................
.............................................
18 Rendimento 29/03/2019 11/04/2019 0.73
19 Rendimento 28/02/2019 15/03/2019 0.64
20 Rendimento 31/01/2019 13/02/2019 0.49
21 Rendimento 28/12/2018 14/01/2019 0.36
22 Rendimento 30/11/2018 13/12/2018 0.56
23 Rendimento 31/10/2018 14/11/2018 0.57
24 Rendimento 28/09/2018 11/10/2018 0.32
25 Rendimento 31/08/2018 14/09/2018 0.42
26 Rendimento 31/07/2018 13/08/2018 0.11
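The JSON-in-an-attribute pattern is easy to reproduce offline; the snippet below is a made-up stand-in for the page's "results" input:

```python
import json
from bs4 import BeautifulSoup

# made-up stand-in for the page's <input id="results" value="..."> tag
html = '<input id="results" value=\'[{"et": "Rendimento", "v": 0.64}]\'>'

soup = BeautifulSoup(html, "html.parser")
# the value attribute holds a JSON array, so json.loads yields plain dicts
payload = json.loads(soup.find("input", {"id": "results"})["value"])
print(payload[0]["v"])
```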
I am using Selenium to parse
https://www.worldometers.info/coronavirus/
Doing the following, I get an AttributeError and the table variable remains empty. What is the reason?
I use Chrome 80. Are the tags right?
AttributeError: 'NoneType' object has no attribute 'tbody'
from selenium import webdriver
import bs4
browser = webdriver.Chrome()
browser.get("https://www.worldometers.info/coronavirus/")
html = bs4.BeautifulSoup(browser.page_source, "html.parser")
table = html.find("table", class_="table table-bordered table-hover main_table_countries dataTable no-footer")
Wherever I have table tags, I find it easier to use pandas to capture the table.
import pandas as pd
url = 'https://www.worldometers.info/coronavirus/'
table = pd.read_html(url)[0]
Output:
print(table)
Country,Other TotalCases ... Tot Cases/1M pop Tot Deaths/1M pop
0 China 81093 ... 56.00 2.0
1 Italy 63927 ... 1057.00 101.0
2 USA 43734 ... 132.00 2.0
3 Spain 35136 ... 751.00 49.0
4 Germany 29056 ... 347.00 1.0
.. ... ... ... ... ...
192 Somalia 1 ... 0.06 NaN
193 Syria 1 ... 0.06 NaN
194 Timor-Leste 1 ... 0.80 NaN
195 Turks and Caicos 1 ... 26.00 NaN
196 Total: 378782 ... 48.60 2.1
[197 rows x 10 columns]
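If the page has several tables and you only want one, read_html can also filter by HTML attributes instead of taking index [0]. A sketch on an inline snippet (the id here is an assumption; check the real page for the actual one):

```python
import pandas as pd
from io import StringIO

# two tables; we only want the one with the (assumed) id
html = """
<table id="main_table_countries_today">
  <tr><th>Country,Other</th><th>TotalCases</th></tr>
  <tr><td>China</td><td>81093</td></tr>
</table>
<table>
  <tr><th>x</th></tr>
  <tr><td>1</td></tr>
</table>
"""

# attrs restricts matching to tables carrying these HTML attributes
dfs = pd.read_html(StringIO(html), attrs={"id": "main_table_countries_today"})
print(len(dfs))
```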
import urllib.request
from bs4 import BeautifulSoup
url = ('http://texaset.tamu.edu/')
page = urllib.request.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
#table = soup.find_all('table')
gdata = soup.find_all('td', {"class": "Data"})
for item in gdata:
    print(item.text)
This is the code for extracting the data from the website. After executing the code, the output is similar to this:
Conroe
0.12
58
45
28
15.76
0.00
4.70
6.06
Huntsville
0.10
56
41
27
16.21
0.00
2.10
3.57
Overton
0.12
53
35
42
16.34
0.00
7.52
16.89
But I need the data for only one city, like this:
Conroe
0.12
58
45
28
15.76
0.00
4.70
6.06
I'm not sure about what you are asking here, but my guess is that you are trying to extract the text from a table cell.
Have you tried printing item.text for each item? (Note that gdata itself is a ResultSet, so gdata.text would raise an AttributeError.)
EDIT 1
If the CSS selectors are the same for all the cells, and considering that there may be more than two lines of extracted data and that the desired data can be anywhere in the table, I suggest extracting them all to a list and then searching for Conroe, as:
for item in List1:
    if item.startswith('Conroe'):
        print(item)
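Building on that: since each city appears to occupy a fixed run of cells (the name followed by its eight readings), the flat list can be folded into a dict keyed by city. A sketch on a made-up subset of the extracted text (the group size of 9 is an assumption based on the output above):

```python
# made-up subset of the flat cell text printed earlier
cells = ["Conroe", "0.12", "58", "45", "28", "15.76", "0.00", "4.70", "6.06",
         "Huntsville", "0.10", "56", "41", "27", "16.21", "0.00", "2.10", "3.57"]

# every 9 consecutive cells belong to one city: name + 8 readings
by_city = {cells[i]: cells[i + 1:i + 9] for i in range(0, len(cells), 9)}
print(by_city["Conroe"])
```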