How to extract a specific table from a URL via Python?

I am currently looking at the following link:
https://www.ishares.com/de/privatanleger/de/produkte/251931/ishares-stoxx-europe-600-ucits-etf-de-fund
There is a table that displays all positions of the ETF. My goal is to extract the table and save it to an xlsx file. I wrote this code:
import requests
import pandas as pd
url = 'https://www.ishares.com/de/privatanleger/de/produkte/251931/ishares-stoxx-europe-600-ucits-etf-de-fund'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
df.to_excel('my data.xlsx')
However, pd.read_html(html) always tells me that no tables could be found on the website. Does somebody know how to identify and pull the desired table via Python?

The problem is that the website uses cookies: when you use the default link, it redirects you to a landing page where you have to click a button to accept cookies before you reach the actual page. I found a URL that goes straight to the page you are looking for; try this:
url = 'https://www.ishares.com/de/privatanleger/de/produkte/251931/ishares-stoxx-europe-600-ucits-etf-de-fund?switchLocale=y&siteEntryPassthrough=true'
html = requests.get(url).content
df_list = pd.read_html(html)
print(df_list)
Here is my Output:
[Empty DataFrame
Columns: [Ex-Tag, Fälligkeitsdatum, Gesamtausschüttung]
Index: [], Empty DataFrame
Columns: [Ex-Tag, Fälligkeitsdatum, Gesamtausschüttung]
Index: [], Unnamed: 0 2012 2013 2014 2015 ... 2017 2018 2019 2020 2021
0 Gesamtrendite (%) 177 210 74 108 ... 108 -110 276 -19 251
1 Vergleichsindex (%) 178 212 72 96 ... 106 -108 268 -20 249
[2 rows x 11 columns], Unnamed: 0 ... Von 31.Mär.2021Bis 31.Mär.2022
0 Gesamtrendite (%) Per 31.Mär.2022 ... 863
1 Vergleichsindex (%) Per 31.Mär.2022 ... 849
[2 rows x 6 columns], Unnamed: 0 1J 3J 5J 10J Seit Auflage
0 Gesamtrendite (%) 197 894 538 940 651
1 Vergleichsindex (%) 178 879 522 920 636, Unnamed: 0 Seit 1.1. 1M 3M ... 3J 5J 10J Seit Auflage
0 Gesamtrendite (%) -736 -80 -43 ... 2929 2994 14548 21694
1 Vergleichsindex (%) -755 -92 -65 ... 2874 2896 14119 20904
[2 rows x 10 columns], Empty DataFrame
Columns: [Emittententicker, Name, Sektor, Anlageklasse, Marktwert, Gewichtung (%), Nominalwert, Nominale, ISIN, Kurs, Standort, Börse, Marktwährung]
Index: [], Empty DataFrame
Columns: [Kategorie, Fonds]
Index: [], Empty DataFrame
Columns: [Kategorie, Fonds]
Index: [], Börse Ticker ... Common Code (EOC) iNAV ISIN
0 Xetra EXSA ... 186 794 77 -
1 Bolsa Mexicana De Valores EXSA ... - -
2 Borsa Italiana EXSA ... - -
3 SIX Swiss Exchange EXSA ... - -
[4 rows x 14 columns]]
Process finished with exit code 0
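Note that the Positions table you are after (the one with Emittententicker, Name, ISIN, ...) comes back as an empty DataFrame, most likely because it is filled in client-side by JavaScript, so plain requests plus read_html never sees its rows. Once you have HTML in which that table is populated, a small sketch for picking it out of df_list and saving it (assuming the column names shown above) would be:
# pick the holdings table by a column it should contain, then save it
positions = next((t for t in df_list if 'ISIN' in t.columns), None)
if positions is not None and not positions.empty:
    positions.to_excel('my data.xlsx', index=False)
To get a populated version of that table you need to render the page in a real browser first, for example with Selenium.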

This is how you would use Selenium:
from selenium import webdriver

driver = webdriver.Chrome()
# maximize the window size
driver.maximize_window()
# navigate to the ETF page (the passthrough URL skips the cookie landing page)
driver.get("https://www.ishares.com/de/privatanleger/de/produkte/251931/ishares-stoxx-europe-600-ucits-etf-de-fund?switchLocale=y&siteEntryPassthrough=true")


Cannot scrape some table using Pandas

I'm more than a noob in Python, and I'm trying to get some tables from this page:
https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html
Using pandas and pd.read_html I'm able to get most of them, but not the "Line Score" and the "Four Factors". If I print all the tables (there are 19), these two are missing; inspecting with Chrome they seem to be tables, and I can also get them in Excel by importing from the web.
What am I missing here?
Any help appreciated, thanks!
If you look at the page source (not the inspector), you'll see those tables are inside HTML comments. You can either a) edit the HTML string and remove the <!-- and --> markers, then let pandas parse it, or b) use bs4 to pull out the comments and parse the tables from those.
I'll show you both options:
Option 1: Remove the comment tags from the page source
import requests
import pandas as pd
url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
# strip the comment markers so the hidden tables become plain HTML
response = requests.get(url).text.replace("<!--","").replace("-->","")
dfs = pd.read_html(response, header=1)
You can see you now have 21 tables, with the 4th and 5th tables being the ones in question.
print(len(dfs))
for each in dfs[3:5]:
    print('\n\n', each, '\n')
Output:
21
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
Option 2: Pull out comments with bs4
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd

url = 'https://www.basketball-reference.com/wnba/boxscores/202208030SEA.html'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')
dfs = pd.read_html(result, header=1)  # the tables that are not hidden in comments
# find every comment node and try to parse a table out of each one
comments = data.find_all(string=lambda text: isinstance(text, Comment))
other_tables = []
for each in comments:
    if '<table' in str(each):
        try:
            other_tables.append(pd.read_html(str(each), header=1)[0])
        except ValueError:  # comment contained no parsable table
            continue
for each in other_tables:
    print(each, '\n')
Output:
Unnamed: 0 1 2 3 4 T
0 Minnesota Lynx 18 14 22 23 77
1 Seattle Storm 30 26 22 11 89
Unnamed: 0 Pace eFG% TOV% ORB% FT/FGA ORtg
0 MIN 97.0 0.507 16.1 14.3 0.101 95.2
1 SEA 97.0 0.579 11.8 9.7 0.114 110.1
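If you want every table in a single list you can simply combine the two results; the count should then match the 21 tables from Option 1:
all_tables = dfs + other_tables
print(len(all_tables))  # 19 regular tables + 2 comment-hidden ones = 21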

How to get distinct rows from pandas dataframe?

I am having trouble getting distinct values from my dataframe. Below is the code I currently use; the issue is in line 25 (the third function, vier()): I would like to show the top 10 fastest drivers based on their average heat (go-kart heat) time.
Input:
HeatNumber,NumberOfKarts,KartNumber,DriverName,Laptime
334,11,5,Monique,00:53.862
334,11,5,Monique,00:59.070
334,11,5,Monique,00:47.832
334,11,5,Monique,00:47.213
334,11,5,Monique,00:51.975
334,11,5,Monique,00:46.423
334,11,5,Monique,00:49.539
334,11,5,Monique,00:49.935
334,11,5,Monique,00:45.267
334,11,12,Robert-Jan,00:55.606
334,11,12,Robert-Jan,00:52.249
334,11,12,Robert-Jan,00:50.965
334,11,12,Robert-Jan,00:53.878
334,11,12,Robert-Jan,00:48.802
334,11,12,Robert-Jan,00:48.766
334,11,12,Robert-Jan,00:46.003
334,11,12,Robert-Jan,00:46.257
334,11,12,Robert-Jan,00:47.334
334,11,20,Katja,00:56.222
334,11,20,Katja,01:01.005
334,11,20,Katja,00:50.296
334,11,20,Katja,00:48.004
334,11,20,Katja,00:51.203
334,11,20,Katja,00:47.672
334,11,20,Katja,00:50.243
334,11,20,Katja,00:50.453
334,11,20,Katja,01:06.192
334,11,13,Bensu,00:56.332
334,11,13,Bensu,00:54.550
334,11,13,Bensu,00:52.023
334,11,13,Bensu,00:52.518
334,11,13,Bensu,00:50.738
334,11,13,Bensu,00:50.359
334,11,13,Bensu,00:49.307
334,11,13,Bensu,00:49.595
334,11,13,Bensu,00:50.504
334,11,17,Marit,00:56.740
334,11,17,Marit,00:52.534
334,11,17,Marit,00:48.331
334,11,17,Marit,00:56.204
334,11,17,Marit,00:49.066
334,11,17,Marit,00:49.210
334,11,17,Marit,00:45.655
334,11,17,Marit,00:46.261
334,11,17,Marit,00:46.837
334,11,11,Niels,00:58.518
334,11,11,Niels,01:01.562
334,11,11,Niels,00:51.238
334,11,11,Niels,00:48.808
Code:
import pandas as pd
import matplotlib.pyplot as plt

# Data
df = pd.read_csv('dataset_kartanalyser.csv')
df = df.dropna(axis=0, how='any')
# split "MM:SS.mmm" laptimes into minutes and seconds, then convert to seconds
df = df.join(df['Laptime'].str.split(':', n=1, expand=True).rename(columns={0:'M', 1:'S'}))
df['M'] = df['M'].astype(int)
df['S'] = df['S'].astype(float)
df['Laptime'] = (df['M'] * 60) + df['S']
df.drop(['M', 'S'], axis=1, inplace=True)

# Functions
def twee():
    print("Het totaal aantal karts = " + str(df['KartNumber'].nunique()))
    print("Het aantal unique drivers = " + str(df['DriverName'].nunique()))
    print("Het totaal aantal heats = " + str(df['HeatNumber'].nunique()))

def drie():
    print("De 10 snelste Drivers obv individuele tijd zijn: ")
    print((df.groupby('DriverName')['Laptime'].nsmallest(1)).nsmallest(10))

def vier():
    print('De 10 snelste Drivers obv snelste heat gemiddelde:')
    print((df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)).nsmallest(10))

print(df)
HeatNumber NumberOfKarts KartNumber DriverName Laptime
0 334 11 5 Monique 53.862
1 334 11 5 Monique 59.070
2 334 11 5 Monique 47.832
3 334 11 5 Monique 47.213
4 334 11 5 Monique 51.975
... ... ... ... ... ...
4053 437 2 20 luuk 39.678
4054 437 2 20 luuk 39.872
4055 437 2 20 luuk 39.454
4056 437 2 20 luuk 39.575
4057 437 2 20 luuk 39.648
Output:
DriverName HeatNumber
giovanni 411 26.233
ryan 411 27.747
giovanni 408 27.938
papa 394 28.075
guus 406 28.998
Rob 427 29.371
Suus 427 29.416
Jan-jullius 394 29.428
Joep 427 29.934
Indy 423 29.991
The output I get is almost correct, except that the driver "giovanni" occurs twice. I would like to show only the fastest average heat time for each driver. Does anyone know how to do this?
OK, so add drop_duplicates on a column like this; you just need to add a sort as well:
df.sort_values('B', ascending=True).drop_duplicates('A', keep='first')
Applied to your groupby (reset_index turns the grouped Series back into a DataFrame so drop_duplicates can work on the DriverName column):
(df.groupby(['DriverName', 'HeatNumber'])['Laptime'].mean().round(3)
   .reset_index()
   .sort_values('Laptime', ascending=True)
   .drop_duplicates('DriverName', keep='first')
   .nsmallest(10, 'Laptime'))
You grouped the data by DriverName and HeatNumber. Look at the HeatNumbers: one of them is 411 and another is 408, so pandas treats them as two different groups. If they were equal, they would be collapsed into one.
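If you prefer to stay close to the groupby in the question, another way (a sketch) is to take the per-heat averages and then keep only each driver's best one:
# per-driver, per-heat average lap time, then the best (lowest) average per driver
best_heat_avg = (df.groupby(['DriverName', 'HeatNumber'])['Laptime']
                   .mean().round(3)
                   .groupby(level='DriverName').min())
print(best_heat_avg.nsmallest(10))
Unlike the drop_duplicates version above, this keeps only the time and drops the HeatNumber from the result.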

Adding a matching value from 3 different DataFrames, not the entire column Python

I have three different DataFrames (df2019, df2020, and df2021) and they all have the same columns (here are a few) with some overlapping 'BrandID' values:
BrandID StockedOutDays Profit SalesQuantity
243 01-02760 120 516452.76 64476
138 01-01737 96 603900.0 80520
166 01-02018 125 306796.8 52896
141 01-01770 109 297258.6 39372
965 02-35464 128 214039.2 24240
385 01-03857 92 326255.16 30954
242 01-02757 73 393866.4 67908
What I'm trying to do is add up the value from one column for a specific BrandID across the 3 DataFrames. In my specific case, I'd like to add the 'SalesQuantity' values for 'BrandID' = 01-02757 from df2019, df2020 and df2021 and get a line I can run to see a single number.
I've searched around and tried a bunch of different things, but am stuck. Please help, thank you!
EDIT *** I'm looking for something like this, I think; I just don't know how to sum them all together:
df2021.set_index('BrandID',inplace=True)
df2020.set_index('BrandID',inplace=True)
df2019.set_index('BrandID',inplace=True)
(df2021.loc['01-02757']['SalesQuantity']
 + df2020.loc['01-02757']['SalesQuantity']
 + df2019.loc['01-02757']['SalesQuantity'])
import pandas as pd

df2019 = pd.DataFrame([{"BrandID":"01-02760", "StockedOutDays":120, "Profit":516452.76, "SalesQuantity":64476},
                       {"BrandID":"01-01737", "StockedOutDays":96, "Profit":603900.0, "SalesQuantity":80520}])
df2020 = pd.DataFrame([{"BrandID":"01-02760", "StockedOutDays":123, "Profit":76481.76, "SalesQuantity":2457},
                       {"BrandID":"01-01737", "StockedOutDays":27, "Profit":203014.0, "SalesQuantity":15648}])
df2019["year"] = 2019
df2020["year"] = 2020
# stack the yearly frames, then sum per BrandID (pd.concat replaces the removed DataFrame.append)
df = pd.concat([df2019, df2020])
df_sum = df.groupby("BrandID").agg("sum").drop("year", axis=1)
print(df)
print(df_sum)
df:
BrandID StockedOutDays Profit SalesQuantity year
0 01-02760 120 516452.76 64476 2019
1 01-01737 96 603900.00 80520 2019
0 01-02760 123 76481.76 2457 2020
1 01-01737 27 203014.00 15648 2020
df_sum:
StockedOutDays Profit SalesQuantity
BrandID
01-01737 123 806914.00 96168
01-02760 243 592934.52 66933
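To get the single number asked for in the question, you can then look up the BrandID in df_sum (using '01-02760' from the sample frames above; with the real data it would be '01-02757'):
print(df_sum.loc['01-02760', 'SalesQuantity'])  # 66933 = 64476 + 2457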

Scrape data from an HTML table

I'm trying to scrape a table from the B3 site, but the result is an empty DataFrame.
What's wrong?
import pandas as pd
url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-ptBR.asp?Data=31/08/2020&Data1=20200831&slcTaxa=PRE"
df = pd.read_html(io=url)
print (df)
The default parser for read_html is lxml, which is not able to parse this document. Switching to the BeautifulSoup 4 / html5lib parser does the trick (this requires the beautifulsoup4 and html5lib packages to be installed).
Below is your code with the addition of a flavor parameter.
import pandas as pd
url = "http://www2.bmf.com.br/pages/portal/bmfbovespa/lumis/lum-taxas-referenciais-bmf-ptBR.asp?Data=31/08/2020&Data1=20200831&slcTaxa=PRE"
df = pd.read_html(io=url, flavor='bs4')
print (df)
[ Dias Corridos DI x pré
Dias Corridos 252(2)(4) 360(1)
0 1 190 0
1 8 191 171
2 11 191 199
283 10760 848 832
284 10801 848 832
285 10941 848 833
286 12677 854 838
[287 rows x 3 columns]]

Selecting top % of rows in pandas

I have a sample dataframe as below (actual dataset is roughly 300k entries long):
user_id revenue
----- --------- ---------
0 234 100
1 2873 200
2 827 489
3 12 237
4 8942 28934
... ... ...
96 498 892384
97 2345 92
98 239 2803
99 4985 98332
100 947 4588
which displays the revenue generated by users. I would like to select the rows where the top 20% of the revenue is generated (hence giving the top 20% revenue generating users).
The method that comes closest to mind for me is calculating the total number of users, working out 20% of that, sorting the dataframe with sort_values() and then using head() or nlargest(), but I'd like to know if there is a simpler and more elegant way.
Can anybody propose a way for this?
Thank you!
Suppose you have a dataframe df:
user_id revenue
234 21
2873 20
827 23
12 23
8942 28
498 22
2345 20
239 24
4985 21
947 25
I've flattened the revenue distribution to show the idea.
Now, calculating step by step:
df = pd.read_clipboard()
df = df.sort_values(by = 'revenue', ascending = False)
df['revenue_cum'] = df['revenue'].cumsum()
df['%revenue_cum'] = df['revenue_cum']/df['revenue'].sum()
df
result:
user_id revenue revenue_cum %revenue_cum
4 8942 28 28 0.123348
9 947 25 53 0.233480
7 239 24 77 0.339207
2 827 23 100 0.440529
3 12 23 123 0.541850
5 498 22 145 0.638767
0 234 21 166 0.731278
8 4985 21 187 0.823789
1 2873 20 207 0.911894
6 2345 20 227 1.000000
The top 2 users alone generate 23.3% of the total revenue.
This looks like a case for df.quantile: per the pandas documentation, if you are looking for the top 20%, all you need to do is pass the quantile value you desire.
A case example from your dataset:
import pandas as pd
import numpy as np
df = pd.DataFrame({'user_id':[234,2873,827,12,8942],
'revenue':[100,200,489,237,28934]})
df.quantile([0.8,1],interpolation='nearest')
This prints the values at the 0.8 and 1.0 quantiles of each column:
user_id revenue
0.8 2873 489
1.0 8942 28934
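If you want the actual rows rather than the quantile summary, you can also filter against the 0.8 quantile of revenue (a sketch; this is the "top 20% of users by revenue" reading, not the cumulative-revenue one):
top_20pct = df[df['revenue'] >= df['revenue'].quantile(0.8)]
print(top_20pct)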
I usually find it useful to use sort_values to see the cumulative effect of every row and then keep rows up to some threshold:
# Sort values from highest to lowest:
df = df.sort_values(by='revenue', ascending=False)
# Add a column with aggregated effect of the row:
df['cumulative_percentage'] = 100*df.revenue.cumsum()/df.revenue.sum()
# Define the threshold I need to analyze and keep those rows:
min_threshold = 30
top_percent = df.loc[df['cumulative_percentage'] <= min_threshold]
The original df will be nicely sorted, with a clear indication of the top contributing rows, and the new top_percent df will contain the rows that need to be analyzed in particular.
I am assuming you are looking for the users who cumulatively generate the top 20% of revenue. Here is a function that will get you the expected output and more. Just specify your dataframe, the revenue column name and the n_percent you are looking for:
import pandas as pd

def n_percent_revenue_generating_users(df, col, n_percent):
    df.sort_values(by=[col], ascending=False, inplace=True)
    df[f'{col}_cs'] = df[col].cumsum()                    # cumulative revenue
    df[f'{col}_csp'] = 100*df[f'{col}_cs']/df[col].sum()  # cumulative revenue as a %
    # the row whose cumulative % is closest past the requested threshold sets the revenue cutoff
    df_ = df[df[f'{col}_csp'] > n_percent]
    index_nearest = (df_[f'{col}_csp']-n_percent).abs().idxmin()
    threshold_revenue = df_.loc[index_nearest, col]
    output = df[df[col] >= threshold_revenue].drop(columns=[f'{col}_cs', f'{col}_csp'])
    return output

n_percent_revenue_generating_users(df, 'revenue', 20)
