Turn table-element into Pandas DataFrame - python

I would like to turn a table into a pandas.DataFrame.
URL = 'https://ladieseuropeantour.com/reports-page/?tourn=1202&tclass=rnk&report=tmscores~season=2015~params=P*4ESC04~#/profile'
The element in question is
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()  # driver setup was omitted in the original snippet
driver.get(URL)
ranking = driver.find_element(By.XPATH, ".//*[@id='maintablelive']")
I tried the following:
import pandas as pd
pd.read_html(ranking.get_attribute('outerHTML'))[0]
I am also using the dropdown menu to select multiple rounds. When a different round is selected, driver.current_url doesn't change, so I think it's not possible to load these new tables with requests or anything.
Please advise!

Instead of using selenium, you want to access the URL's API endpoint.
Finding the API endpoint
You can trace it as follows:
Open the URL in Chrome
Use Ctrl + Shift + J to open DevTools, navigate to Network, select Fetch/XHR from the sub navbar, and refresh the URL.
This will reload the network connections; when you click on one of the requests that appear, you can select Response from the second sub navbar to see what data it returns.
Going through them, we can locate the connection that is responsible for returning the data for the table, namely https://ladieseuropeantour.com/api/let/cache/let/2015/2015-1202-scores-P*4ESC04.json (you can leave out the query part ?randomadd=1673086921319 at the end).
Now, with this knowledge, we can simply load all the data with requests, investigate the type of information contained in the JSON, and dump the required subpart into a df.
For example, let's recreate the scores table from your URL:
Code
import requests
import pandas as pd
url = 'https://ladieseuropeantour.com/api/let/cache/let/2015/2015-1202-scores-P*4ESC04.json'
data = requests.get(url).json()
df = pd.DataFrame(data['scores']['scores_entry'])
cols = ['pos', 'name', 'nationality', 'vspar',
        'score_R1', 'score_R2', 'score_R3', 'score_R4', 'score']
df_table = df.loc[:, cols]
df_table.head()
pos name nationality vspar score_R1 score_R2 score_R3 \
0 1 Lydia Ko (NZL) NZL -9 70 70 72
1 2 Amy Yang (KOR) KOR -7 73 70 70
2 3 Ariya Jutanugarn (THA) THA -4 69 71 72
3 4 Jenny Shin (KOR) KOR -2 76 71 74
4 4 Ilhee Lee (KOR) KOR -2 68 82 69
score_R4 score
0 71 283
1 72 285
2 76 288
3 69 290
4 71 290
If you check print(df.columns), you'll see that the main df contains all the data that lies behind all of the tables you can select on the URL. So, have a go at that to select whatever you are looking for.
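For instance, a quick way to see what else is available (a small sketch, assuming the same JSON structure as above):
import requests
import pandas as pd
url = 'https://ladieseuropeantour.com/api/let/cache/let/2015/2015-1202-scores-P*4ESC04.json'
data = requests.get(url).json()
print(list(data.keys()))  # top-level sections of the JSON payload
df = pd.DataFrame(data['scores']['scores_entry'])
print(df.columns.tolist())  # every column behind the selectable tables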

Related

Python Pandas Question: Index / Match with Missing Values + Duplicates + Everything In Between

Basically, I have a smaller table of assets purchased this year and a table of assets the company holds. I want to get the symbol for certain CUSIPs from the holdings table and merge it into the purchases dataset, keyed on CUSIP. If a CUSIP in the purchases table is blank, the code can return blank or NaN. If there are duplicate CUSIPs in the holdings dataset, return the first value. I have tried four different ways of merging these tables without much luck; I run into a memory error for some reason.
The equivalent Excel formula would be:
=IFNA(INDEX(asset_holdings!ADMIN_SYMBOLS,MATCH(asset_purchases!CUSIP_n, asset_holdings!CUSIPs, 0)),"")
Holdings Table
CUSIP       SYMBOL
353187EV5   1A
74727PAY7   3A
80413TAJ8   FE
02765UCR3   3G
000000000   3G
74727PAYA   3E
000000000   4E
Purchase Table
CUSIP       SHARES
353187EV5   10
74727PAY7   67
80413TAJ8   35
02765UCR4   3666
74727PAY7   3613
74727PAYA   13
000000000   14
Desired Result
CUSIP       SHARES  SYMBOL
353187EV5   10      1A
74727PAY7   67      3A
80413TAJ8   35      FE
02765UCR4   3666    ""
74727PAY7   3613    3A
74727PAYA   13      3E
000000000   14      3G
C:\ProgramData\Continuum\Anaconda\lib\site-packages\pandas\core\reshape\merge.py in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
1140 join_func = _join_functions[how]
1141
-> 1142 return join_func(lkey, rkey, count, **kwargs)
1143
1144
pandas\_libs\join.pyx in pandas._libs.join.left_outer_join()
MemoryError:
What I tried:
dfnew = dfPurchases.merge(dfHoldings[['CUSIP','SYMBOL']],how='left', on='CUSIP')
dfPurchases = dfPurchases.set_index('CUSIP')
dfPurchases['SYMBOL'] = dfHoldings.lookup(dfHoldings['CUSIP'], df1['SYMBOL'])
Let me elaborate on the question a little so you can check whether I have understood it correctly. You want a left outer join of the purchases dataset with the holdings dataset. But since your holdings dataset has duplicate CUSIP ids, it will not be a one-to-one join.
Now you have two options:
Accept multiple rows for one row of the purchase dataset
Make CUSIP id unique in the Holdings dataset and then perform the merge
First way:
import pandas as pd
left = pd.read_csv('purchase.csv')
right = pd.read_csv('holdings.csv')
result = pd.merge(left, right, on="CUSIP", how='left')
print(result)
But, as per your question, the above result isn't acceptable, so we make the CUSIP column unique in the right dataset.
import pandas as pd
left = pd.read_csv('purchase.csv')
right = pd.read_csv('holdings.csv')
# keep='first' is the default, but it is spelled out here for clarity
right_unique = right.drop_duplicates('CUSIP', keep='first')
result = pd.merge(left, right_unique, on="CUSIP", how='left', validate="many_to_one")
print(result)
Bonus: you can also explore the validate parameter by adding it to the first version and seeing the validation errors.
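As a minimal sketch of that bonus, using the same hypothetical csv files as above: with the duplicate CUSIPs still present, validate='many_to_one' makes the merge fail fast instead of silently multiplying rows.
import pandas as pd
left = pd.read_csv('purchase.csv')
right = pd.read_csv('holdings.csv')
try:
    # fails because CUSIP is not unique on the right-hand side
    pd.merge(left, right, on='CUSIP', how='left', validate='many_to_one')
except pd.errors.MergeError as err:
    print(err)  # reports that the merge keys are not unique in the right dataset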

Web scraping table data using beautiful soup

I am trying to learn web scraping in Python for a project using Beautiful Soup by doing the following:
Scraping the Kansas City Chiefs' active roster player names together with the college attended. This is the url used: https://www.chiefs.com/team/players-roster/.
After running it, I get an error saying "IndexError: list index out of range".
I don't know if the classes I set are wrong. Help would be appreciated.
import requests
from bs4 import BeautifulSoup
url = "https://www.chiefs.com/team/players-roster/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
roster_table = soup.find('table', class_ = 'd3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
        print(player_name, player_university)
TL;DR: Two issues to solve: (1) indexing, (2) HTML element-queries
Indexing
The Python index operator is represented by opening and closing square brackets: []. The syntax requires you to put a number inside the brackets.
Example:
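Say we index a plain three-element list (a toy, not the scraped data):
cells = ['a', 'b', 'c']
print(cells[0])  # first element -> 'a'
print(cells[2])  # element at index 2 -> 'c'
print(cells[7])  # raises IndexError: list index out of range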
So [7] applies indexing to the preceding iterable (all found tds) to get the element with index 7. In Python, indices are 0-based, so they start at 0 for the first element.
In your statement, you take all the found cells (the <td> HTML elements with the given classes) as the iterable and want to get the 8th element by indexing with [7].
row.find_all('td', class_='sorter-lastname selected')[7]
How to avoid index errors?
Are you sure there are any td elements found in the row?
If some are found, can we guarantee that there are always at least 8?
In this case, there were apparently fewer than 8 elements.
That's why Python raises an IndexError, here in line 15 of the given script:
Traceback (most recent call last):
File "<stdin>", line 15, in <module>
player_university = row.find_all('td', class_='sorter-lastname selected')[7].text
IndexError: list index out of range
Better to test the length before indexing:
import requests
from bs4 import BeautifulSoup
url = "https://www.chiefs.com/team/players-roster/"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
roster_table = soup.find('table', class_='d3-o-table d3-o-table--row-striping d3-o-table--detailed d3-o-table--sortable {sortlist: [[0,0]]}')
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        print(f"person row: {row}")  # debug-print helps to fix element-query
        player_name = row.find('td', class_='sorter-lastname selected"')
        cells = row.find_all('td', class_='sorter-lastname selected')
        player_university = None  # define a default to avoid NameError
        if len(cells) > 7:  # test on minimum length of 8 for index 7
            player_university = cells[7].text
        print(player_name, player_university)
Element-queries
Once the index was fixed, the queried names still returned empty results, printed as None None.
We need to debug (hence the added print inside the loop) and adjust the queries:
(1) for the university-name:
If you follow RJ's answer and choose the last cell without any class condition, then a negative index like -1 counts from the back, so here it means the last cell. The number of cells only needs to be at least 1.
(2) for the player-name:
It appears to be in the first cell (also with a CSS class for sorting), nested either in a link title <a .. title="Player Name"> or in a following sibling as the inner text of span > a.
CSS selectors
You may use CSS selectors for that, via bs4's select or select_one functions. Then you can select the path like td > ? > ? > a and get the title.
Note: the ? placeholders are left as a challenging exercise for you.
💡️ Tip: most browsers have an inspector (right click on the element, e.g. the player-name), then choose "inspect element" and an HTML source view opens selecting the element. Right-click again to "Copy" the element as "CSS selector".
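To illustrate the select_one pattern without solving the exercise, here is a self-contained toy row (illustrative markup, not the real roster HTML):
from bs4 import BeautifulSoup
html = '<tr><td class="sorter-lastname selected"><a title="Player Name">P. Name</a></td></tr>'
row = BeautifulSoup(html, 'html.parser')
link = row.select_one('td > a')  # child combinator, as in td > ? > ? > a
print(link['title'])  # -> Player Name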
Further Reading
About indexing, and the magic of negative numbers like [-1]:
AskPython: Indexing in Python - A Complete Beginners Guide
.. a bit further, about slicing:
Real Python: Indexing and Slicing
Research on Beautiful Soup here:
Using BeautifulSoup to extract the title of a link
Get text with BeautifulSoup CSS Selector
I couldn't find a td with class sorter-lastname selected in the source code. You basically need the last td in each row, so this would do:
for person in roster_table.find_all('tbody'):
    rows = person.find_all('tr')
    for row in rows:
        player_name = row.find('td', class_='sorter-lastname selected"')
        player_university = row.find_all('td')[-1].text
PS. Scraping tables is extremely easy in pandas:
import pandas as pd
df = pd.read_html('https://www.chiefs.com/team/players-roster/')
df[0].to_csv('output.csv')
It may take a bit longer, but the output is impressive; for example, print(df[0]):
Player # Pos HT WT Age Exp College
0 Josh Pederson NaN TE 6-5 235 24 R Louisiana-Monroe
1 Brandin Dandridge NaN WR 5-10 180 25 R Missouri Western
2 Justin Watson NaN WR 6-3 215 25 4 Pennsylvania
3 Jonathan Woodard NaN DE 6-5 271 28 3 Central Arkansas
4 Andrew Billings NaN DT 6-1 311 26 5 Baylor
.. ... ... .. ... ... ... .. ...
84 James Winchester 41.0 LS 6-3 242 32 7 Oklahoma
85 Travis Kelce 87.0 TE 6-5 256 32 9 Cincinnati
86 Marcus Kemp 85.0 WR 6-4 208 26 4 Hawaii
87 Chris Jones 95.0 DT 6-6 298 27 6 Mississippi State
88 Harrison Butker 7.0 K 6-4 196 26 5 Georgia Tech
[89 rows x 8 columns]

Making a table by combining four different lists

I am trying to make a table by combining four lists.
My code is below:
from selenium import webdriver
import time
driver_path= "C:\\Users\\Bacanli\\Desktop\\chromedriver.exe"
browser=webdriver.Chrome(driver_path)
browser.get("http://www.bddk.org.tr/BultenHaftalik")
time.sleep(3)
Krediler=browser.find_element_by_xpath("//*[@id='tabloListesiItem-253']/span")
Krediler.click()
elements = browser.find_elements_by_css_selector("td.ortala:nth-child(2)")
TPs=browser.find_elements_by_css_selector("td[data-label='TP']")
YPs=browser.find_elements_by_css_selector("td[data-label='YP']")
Toplams=browser.find_elements_by_css_selector("td[data-label='Toplam']")
My intent is to make a new table by combining elements, TPs, YPs, and Toplams.
Thanks for your help.
Pandas makes this easy for you:
import pandas as pd
df = pd.read_html('http://www.bddk.org.tr/BultenHaftalik')
will create a list of pandas dataframes from html tables on the page. The table you want is df[3].
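If you're unsure which index holds the table you need, a quick scan of shapes and headers helps (a small sketch against the same page):
import pandas as pd
dfs = pd.read_html('http://www.bddk.org.tr/BultenHaftalik')
for i, t in enumerate(dfs):
    print(i, t.shape, list(t.columns)[:3])  # index, size, first few headers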
Result df[3].head():
   Unnamed: 0  Sektör / Krediler ( 9 Temmuz 2021 Cuma ) (Milyon TL)           TP           YP       TOPLAM
0           1                                Toplam Krediler (2+10)  2.479.94928  1.427.80395  3.907.75323
1           2   Tüketici Kredileri ve Bireysel Kredi Kartları (3+7)    877.62363        30181    877.92544
2           3                            Tüketici Kredileri (4+5+6)    710.18775        11070    710.29845
3           4                                              a) Konut    278.38213         7473    278.45686
4           5                                              b) Taşıt     14.91958          000     14.91958
export to csv with df[3].to_csv('filename.csv')
(or you could use the export to excel button above the table on the website)
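And if you do want to assemble the table from your four element lists instead, here is a minimal sketch, assuming the selectors in your question return equal-length lists:
import pandas as pd
# continues from the question's code: elements, TPs, YPs, Toplams
rows = [
    {'Sektör': e.text, 'TP': tp.text, 'YP': yp.text, 'Toplam': top.text}
    for e, tp, yp, top in zip(elements, TPs, YPs, Toplams)
]
df = pd.DataFrame(rows)
print(df.head())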

How to extract multiple tables from one PDF file using Pandas and tabula-py

Can someone help me extract multiple tables from ONE pdf file? I have 5 pages; every page has a table with the same header columns, for example:
Table example on every page:
student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4
I want to extract all these tables into one dataframe. First I did
df = tabula.read_pdf(file_path,pages='all',multiple_tables=True)
but I got a messy output that looks like this:
[student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4 ,student Score Rang
Maxim 43 34
Nourah 93 5]
So I edited my code like this:
import pandas as pd
import tabula
file_path = "filePath.pdf"
# read my file
df1 = tabula.read_pdf(file_path, pages=1, multiple_tables=True)
df2 = tabula.read_pdf(file_path, pages=2, multiple_tables=True)
df3 = tabula.read_pdf(file_path, pages=3, multiple_tables=True)
df4 = tabula.read_pdf(file_path, pages=4, multiple_tables=True)
df5 = tabula.read_pdf(file_path, pages=5, multiple_tables=True)
It gives me a dataframe for each page, but I don't know how to regroup them into one single dataframe, nor how to avoid repeating the same line of code.
According to the documentation of tabula, read_pdf returns a list when passed the multiple_tables=True option.
Thus, you can use pandas.concat on its output to concatenate the dataframes:
df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))
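If you prefer one clean running index over each page's own 0..n numbering, ignore_index=True is a small variation on the same idea:
import pandas as pd
import tabula
file_path = "filePath.pdf"
tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)
df = pd.concat(tables, ignore_index=True)  # renumbers rows 0..n-1 across all pages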

Dataframe: recursively trace the values until one column is zero

I am trying to analyze a legacy menu system and trace the path of its menu options. The menu system has a main menu followed by sub menus. I am trying to get the details from bottom to top. Here are the records I extracted from the csv for the 'Pay' screen.
If we look at it, the Pay menu is called from 3 sub menus, for example Rules and Dispatch. Rules is in turn called from Test Menu.
So for the 3 instances where Pay is called, I want to extract the option paths as
2-10
18-2-10
98-13-4-4
How is this possible?
MOKEY#MO  MOMNU#MO  MOMNUOPT  MOMNUSEQ  MOOPTDES                        MOOPTCMD
111       0         2         20        Dispatch Menu
131       111       10        120       Pay                             CALL AS650G
283       0         98        980       Utilities Menu
985       3,028     2         30        Rules                           CALL IS216G PARM(' ')
1,131     985       10        120       Pay                             CALL AS650G
2,391     283       13        300       Key Performance Indicator Menu
2,434     2,445     4         380       Pay                             CALL AS650G
2,445     2,391     4         40        Quick Look Weekly Menu
3,028     0         18        190       Test Menu
Below is what I have been doing; I have only a very basic knowledge of pandas. How can I combine all these statements and get that output?
import pandas as pd
statDir = 'C:/Users/jerry/Documents/STAT_TABLES/'
csvFile = statDir + 'menu' + '.csv'
dd = pd.read_csv(csvFile, low_memory=False)
fd1 = dd[dd['MOOPTCMD'].str.contains('AS650G')][['MOKEY#MO','MOMNU#MO','MOMNUOPT']]
print(fd1)
print('==============')
fd2 = dd[dd['MOKEY#MO'].isin(fd1['MOMNU#MO'])][['MOKEY#MO','MOMNU#MO','MOMNUOPT']]
print(fd2)
print('==============')
fd3 = dd[dd['MOKEY#MO'].isin(fd2['MOMNU#MO'])][['MOKEY#MO','MOMNU#MO','MOMNUOPT']]
print(fd3)
print('==============')
fd4 = dd[dd['MOKEY#MO'].isin(fd3['MOMNU#MO'])][['MOKEY#MO','MOMNU#MO','MOMNUOPT']]
print(fd4)
print('==============')
fd5 = dd[dd['MOKEY#MO'].isin(fd4['MOMNU#MO'])]
print(fd5)
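For what it's worth, here is a minimal sketch of the bottom-to-top trace itself, assuming the column meanings implied by your sample (MOKEY#MO is the menu key, MOMNU#MO its parent, MOMNUOPT the option number, 0 marking a top-level menu) and that MOKEY#MO is unique:
import pandas as pd
dd = pd.read_csv('menu.csv', thousands=',')  # thousands=',' parses keys like 3,028
# map each menu key to its parent key and option number
parents = dd.set_index('MOKEY#MO')[['MOMNU#MO', 'MOMNUOPT']]
def trace_path(key):
    # walk the parent links upward until a top-level menu (parent 0)
    options = []
    while key != 0:
        row = parents.loc[key]
        options.append(str(row['MOMNUOPT']))
        key = row['MOMNU#MO']
    return '-'.join(reversed(options))
# trace every row whose command calls the Pay screen
pay_keys = dd.loc[dd['MOOPTCMD'].str.contains('AS650G', na=False), 'MOKEY#MO']
for k in pay_keys:
    print(trace_path(k))  # -> 2-10, 18-2-10, 98-13-4-4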
