I have this df:
CODIGO NOMBRE Enero Enero Febrero Febrero Marzo Marzo ....
000130 RICA PLAYA 31.3 21.0 31.7 22.0 31.8 22.0
000132 PUERTO PIZARRO 32.5 19.0 32.2 18.0 32.5 17.0
000134 PAPAYAL 31.7 25.0 31.5 27.0 31.8 26.0
000135 EL SALTO 31.1 27.0 31.5 26.0 31.5 26.0
000136 CAÑAVERAL 32.4 17.0 32.0 16.0 32.3 16.0
... ... ... ... ... ...
158317 SUSAPAYA 17.3 20.0 16.8 20.0 17.2 19.0
158321 PALCA 17.9 16.0 17.8 17.0 18.4 16.0
158323 TALABAYA 17.1 12.0 16.7 12.0 17.2 12.0
158326 CAPAZO 13.7 19.0 13.6 19.0 13.5 17.0
158328 PAUCARANI 13.1 15.0 12.9 15.0 13.4 14.0 ....
with 26 columns.
I want to rename the second Enero to N1, the second Febrero to N2, the second Marzo to N3, and so on, like this:
CODIGO NOMBRE Enero N1 Febrero N2 Marzo N3 ....
000130 RICA PLAYA 31.3 21.0 31.7 22.0 31.8 22.0
000132 PUERTO PIZARRO 32.5 19.0 32.2 18.0 32.5 17.0
000134 PAPAYAL 31.7 25.0 31.5 27.0 31.8 26.0
000135 EL SALTO 31.1 27.0 31.5 26.0 31.5 26.0
000136 CAÑAVERAL 32.4 17.0 32.0 16.0 32.3 16.0
... ... ... ... ... ...
158317 SUSAPAYA 17.3 20.0 16.8 20.0 17.2 19.0
158321 PALCA 17.9 16.0 17.8 17.0 18.4 16.0
158323 TALABAYA 17.1 12.0 16.7 12.0 17.2 12.0
158326 CAPAZO 13.7 19.0 13.6 19.0 13.5 17.0
158328 PAUCARANI 13.1 15.0 12.9 15.0 13.4 14.0 ....
So I did:
df.columns = ['CODIGO', 'NOMBRE', 'Enero', 'N1', 'Febrero', 'N2', ...]   # and so on, writing out all 26 names
Is there a more efficient or faster way to do this than writing every name?
Assuming the duplicated names appear in the correct order, they can be replaced by modifying the column values at the positions where the name is a duplicate:
m = df.columns.duplicated()  # True for the second and later occurrences of each name
df.columns.values[m] = [f'N{i}' for i in range(1, 1 + m.sum())]
Or with arange and Series:
import numpy as np
import pandas as pd
df.columns.values[m] = 'N' + pd.Series(np.arange(1, 1 + m.sum()), dtype=str)
Or with cumsum:
df.columns.values[m] = 'N' + pd.Series(m.cumsum()[m], dtype=str)
A full reproducible example:
import pandas as pd
df = pd.DataFrame(columns=['CODIGO', 'NOMBRE', 'Enero', 'Enero',
'Febrero', 'Febrero', 'Marzo', 'Marzo'])
print('Before', df)
m = df.columns.duplicated()
df.columns.values[m] = [f'N{i}' for i in range(1, 1 + m.sum())]
print('After', df)
Before Empty DataFrame
Columns: [CODIGO, NOMBRE, Enero, Enero, Febrero, Febrero, Marzo, Marzo]
Index: []
After Empty DataFrame
Columns: [CODIGO, NOMBRE, Enero, N1, Febrero, N2, Marzo, N3]
Index: []
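If mutating the underlying values array feels fragile (an Index is normally treated as immutable), a hedged alternative sketch is to build a fresh list of names and assign it wholesale:
seen = set()
new_cols, n = [], 0
for c in df.columns:
    if c in seen:                 # a repeat: replace it with the next N-label
        n += 1
        new_cols.append(f'N{n}')
    else:
        seen.add(c)
        new_cols.append(c)
df.columns = new_cols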
Related
The following code creates a sample DataFrame. How can I bulk assign/modify all the Temp numbers (say, convert from deg C to deg F)?
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'],['Day', 'Night'], ['HR', 'Temp']])
# mock some data
data = np.round(np.random.randn(4, 12), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
hd = pd.DataFrame(data, index=index, columns=columns)
print(hd,'\n')
Bob Guido Sue
Day Night Day Night Day Night
HR Temp HR Temp HR Temp HR Temp HR Temp HR Temp
year visit
2013 1 38.0 37.5 21.0 36.6 33.0 37.4 38.0 35.8 15.0 38.5 27.0 37.0
2 47.0 36.5 37.0 36.3 31.0 38.8 37.0 38.4 62.0 34.9 45.0 35.6
2014 1 51.0 35.5 41.0 35.9 26.0 36.7 33.0 36.3 18.0 34.6 39.0 38.0
2 46.0 37.6 29.0 37.3 42.0 37.0 31.0 37.0 47.0 37.3 30.0 36.0
Filter the columns for 'Temp', modify them, and assign the values back:
hd
Bob Guido Sue
Day Night Day Night Day Night
HR Temp HR Temp HR Temp HR Temp HR Temp HR Temp
year visit
2013 1 35.0 38.2 45.0 38.1 38.0 36.7 43.0 40.0 38.0 37.5 35.0 37.8
2 31.0 36.7 30.0 37.7 43.0 36.6 38.0 37.5 33.0 38.2 42.0 35.8
2014 1 32.0 37.7 39.0 35.0 37.0 37.7 51.0 37.5 28.0 39.6 43.0 37.8
2 59.0 37.5 28.0 34.6 60.0 38.0 38.0 36.7 63.0 37.9 25.0 37.2
modified = hd.loc(axis=1)[:,:, 'Temp'].mul(1.8).add(32)
hd.update(modified)
hd
Bob Guido Sue
Day Night Day Night Day Night
HR Temp HR Temp HR Temp HR Temp HR Temp HR Temp
year visit
2013 1 35.0 100.76 45.0 100.58 38.0 98.06 43.0 104.00 38.0 99.50 35.0 100.04
2 31.0 98.06 30.0 99.86 43.0 97.88 38.0 99.50 33.0 100.76 42.0 96.44
2014 1 32.0 99.86 39.0 95.00 37.0 99.86 51.0 99.50 28.0 103.28 43.0 100.04
2 59.0 99.50 28.0 94.28 60.0 100.40 38.0 98.06 63.0 100.22 25.0 98.96
I'm looking to use Dash to make a DataFrame I created interactive and in a clean-looking format. It only needs to be a table with the external stylesheet included; I'll mess around with the styles once I can get the code to run correctly.
When I print the DataFrame, it comes out ok, as seen below, but it's missing the first column header.
R HR RBI SB AVG ... QS SV+H K ERA WHIP
Democracy . 186.0 45.0 164.0 32.0 0.261 ... 18.0 15.0 244.0 2.17 1.05
Wassup Pham 181.0 55.0 198.0 20.0 0.263 ... 12.0 34.0 226.0 2.52 0.99
Myrtle Bea. 180.0 50.0 153.0 9.0 0.262 ... 17.0 21.0 236.0 3.33 1.13
The Rotter. 176.0 46.0 183.0 21.0 0.270 ... 25.0 13.0 275.0 2.41 0.85
Scranton S. 172.0 56.0 164.0 15.0 0.272 ... 24.0 18.0 265.0 2.45 1.01
New York N. 164.0 56.0 203.0 13.0 0.287 ... 28.0 0.0 297.0 2.84 1.05
Springfiel. 156.0 39.0 154.0 15.0 0.251 ... 11.0 21.0 236.0 3.65 1.18
Collective. 151.0 38.0 150.0 33.0 0.283 ... 10.0 25.0 214.0 2.41 1.05
Cron Job 146.0 33.0 145.0 20.0 0.244 ... 14.0 22.0 237.0 2.79 1.01
Patrick's . 142.0 37.0 162.0 19.0 0.252 ... 9.0 24.0 253.0 2.92 1.01
I'm thinking it's possible that the lack of a column header is causing the entire column to be lost when converting to a Dash DataTable, but I'm not sure what to do to fix it.
Here's my code, from the printing of the DataFrame, to the Dash app creation and layout, to running the code locally.
print(statsdf_transposed)
######################
from dash import Dash, dash_table, html
import dash_bootstrap_components as dbc

app = Dash(__name__, external_stylesheets=[dbc.themes.LUX])
app.layout = html.Div([
html.H4('The Show - Season Stats'),
dash_table.DataTable(
id='stats_table',
columns=[{"name": i, "id": i}
for i in statsdf_transposed.columns],
data=statsdf_transposed.to_dict('records'),
)
])
if __name__ == '__main__':
app.run_server(debug=True)
Thank you in advance for any help this community could offer!
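A likely fix, sketched under the assumption that the team names live in the DataFrame's index, which to_dict('records') silently drops: reset the index so the names become a regular column before building the table (the 'Team' column name here is hypothetical):
statsdf_reset = statsdf_transposed.reset_index().rename(columns={'index': 'Team'})  # 'Team' is an assumed name
dash_table.DataTable(
    id='stats_table',
    columns=[{"name": i, "id": i} for i in statsdf_reset.columns],
    data=statsdf_reset.to_dict('records'),
)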
import requests
from bs4 import BeautifulSoup

URL = 'https://www.basketball-reference.com/leagues/NBA_2019.html'
content = requests.get(URL)
soup = BeautifulSoup(content.text, 'html.parser')
table = soup.find_all('table', {'class' : 'sortable stats_table now_sortable'})
rows = table.find_all('td')
for i in rows:
    print(i.get_text())
I want to get the content of the table with team per-game stats from this website, but I got this error:
AttributeError: 'NoneType' object has no attribute 'find_all'
The table that you want is dynamically loaded, meaning it is not in the HTML when you first make a request to the page. So the table you are searching for does not yet exist.
To scrape sites that use JavaScript, you can look into using the Selenium WebDriver with PhantomJS, as better described in this post: https://stackoverflow.com/a/26440563/13275492
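For example, a minimal sketch with Selenium and headless Chrome (PhantomJS itself is now deprecated; this assumes chromedriver is available on your PATH):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
html = driver.page_source  # HTML after JavaScript has run
driver.quit()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'team-stats-per_game'})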
Actually, you can use pandas.read_html(), which will read all the tables on the page in a nice format and return them as a list, so you can access each one as a DataFrame by index, e.g. print(df[0]).
import pandas as pd
df = pd.read_html("https://www.basketball-reference.com/leagues/NBA_2019.html")
print(df)
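Since read_html returns a list, a quick way to inspect what was picked up (note the per-game table may not be among the results if it sits inside an HTML comment, as the next answer explains):
print(len(df))  # number of tables pandas found on the page
print(df[0])    # the first table as a DataFrame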
The tables (with the exception of a few) on these sports-reference sites are inside HTML comments. You would need to pull out the comments, then render those tables with pandas.
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
headers = {'User-Agent':
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.basketball-reference.com/leagues/NBA_2019.html"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
comments = pageSoup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'table' in each and 'id="team-stats-per_game"' in each:
        df = pd.read_html(each, attrs={'id': 'team-stats-per_game'})[0]
Output:
print(df)
Rk Team G MP FG ... STL BLK TOV PF PTS
0 1.0 Milwaukee Bucks* 82 241.2 43.4 ... 7.5 5.9 13.9 19.6 118.1
1 2.0 Golden State Warriors* 82 241.5 44.0 ... 7.6 6.4 14.3 21.4 117.7
2 3.0 New Orleans Pelicans 82 240.9 43.7 ... 7.4 5.4 14.8 21.1 115.4
3 4.0 Philadelphia 76ers* 82 241.5 41.5 ... 7.4 5.3 14.9 21.3 115.2
4 5.0 Los Angeles Clippers* 82 241.8 41.3 ... 6.8 4.7 14.5 23.3 115.1
5 6.0 Portland Trail Blazers* 82 242.1 42.3 ... 6.7 5.0 13.8 20.4 114.7
6 7.0 Oklahoma City Thunder* 82 242.1 42.6 ... 9.3 5.2 14.0 22.4 114.5
7 8.0 Toronto Raptors* 82 242.4 42.2 ... 8.3 5.3 14.0 21.0 114.4
8 9.0 Sacramento Kings 82 240.6 43.2 ... 8.3 4.4 13.4 21.4 114.2
9 10.0 Washington Wizards 82 243.0 42.1 ... 8.3 4.6 14.1 20.7 114.0
10 11.0 Houston Rockets* 82 241.8 39.2 ... 8.5 4.9 13.3 22.0 113.9
11 12.0 Atlanta Hawks 82 242.1 41.4 ... 8.2 5.1 17.0 23.6 113.3
12 13.0 Minnesota Timberwolves 82 241.8 41.6 ... 8.3 5.0 13.1 20.3 112.5
13 14.0 Boston Celtics* 82 241.2 42.1 ... 8.6 5.3 12.8 20.4 112.4
14 15.0 Brooklyn Nets* 82 243.7 40.3 ... 6.6 4.1 15.1 21.5 112.2
15 16.0 Los Angeles Lakers 82 241.2 42.6 ... 7.5 5.4 15.7 20.7 111.8
16 17.0 Utah Jazz* 82 240.9 40.4 ... 8.1 5.9 15.1 21.1 111.7
17 18.0 San Antonio Spurs* 82 241.5 42.3 ... 6.1 4.7 12.1 18.1 111.7
18 19.0 Charlotte Hornets 82 241.8 40.2 ... 7.2 4.9 12.2 18.9 110.7
19 20.0 Denver Nuggets* 82 240.6 41.9 ... 7.7 4.4 13.4 20.0 110.7
20 21.0 Dallas Mavericks 82 241.2 38.8 ... 6.5 4.3 14.2 20.1 108.9
21 22.0 Indiana Pacers* 82 240.3 41.3 ... 8.7 4.9 13.7 19.4 108.0
22 23.0 Phoenix Suns 82 242.4 40.1 ... 9.0 5.1 15.6 23.6 107.5
23 24.0 Orlando Magic* 82 241.2 40.4 ... 6.6 5.4 13.2 18.6 107.3
24 25.0 Detroit Pistons* 82 242.1 38.8 ... 6.9 4.0 13.8 22.1 107.0
25 26.0 Miami Heat 82 240.6 39.6 ... 7.6 5.5 14.7 20.9 105.7
26 27.0 Chicago Bulls 82 242.7 39.8 ... 7.4 4.3 14.1 20.3 104.9
27 28.0 New York Knicks 82 241.2 38.2 ... 6.8 5.1 14.0 20.9 104.6
28 29.0 Cleveland Cavaliers 82 240.9 38.9 ... 6.5 2.4 13.5 20.0 104.5
29 30.0 Memphis Grizzlies 82 242.4 38.0 ... 8.3 5.5 14.0 22.0 103.5
30 NaN League Average 82 241.6 41.1 ... 7.6 5.0 14.1 20.9 111.2
[31 rows x 25 columns]
Continuing from my previous question (link; things are explained there), I have now obtained an array. How to use that array is a further question. The point of this question: there are NaN values in the 63 x 2 array I created, and I want the rows containing NaN dropped so that I can use the data (graphing and exporting as x, y arrays will be another question).
Here's what I have. This code works.
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = [df.iloc[:, [0, 1]]]
A sample of the .csv file is located at the link.
I tried inputting
data1.dropna()
but it didn't work.
I want the NaN values/rows to drop so that I'm left with a 28 x 2 array. (I am using the first column with actual values as an example).
Thank you.
Try
import pandas as pd
df = pd.read_csv("~/Truncated raw data hcl.csv")
data1 = df.iloc[:, [0, 1]]
cleaned_data = data1.dropna()
You were probably getting an exception like AttributeError: 'list' object has no attribute 'dropna'. That's because your data1 was not a pandas DataFrame but a list, and inside that list was a DataFrame.
Although an answer has already been given, I would like to put some thoughts across.
Importing your DataFrame, taking the example dataset you provided in your earlier post:
>>> import pandas as pd
>>> df = pd.read_csv("so.csv")
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
22 11.0 30.0 29.7 29.6 ... 39.3 NaN 43.8 44.3
23 11.5 30.0 29.8 29.7 ... 40.2 NaN 43.8 44.3
24 12.0 30.0 29.8 29.7 ... 40.9 NaN 43.9 44.3
25 12.5 30.1 29.8 29.7 ... 41.4 NaN 43.9 44.3
26 13.0 30.1 29.8 29.8 ... 41.8 NaN 43.9 44.4
27 13.5 30.1 29.9 29.8 ... 42.0 NaN 43.9 44.4
28 14.0 30.1 29.9 29.8 ... 42.1 NaN NaN 44.4
29 14.5 NaN 29.9 29.8 ... 42.3 NaN NaN 44.4
30 15.0 NaN 29.9 NaN ... 42.4 NaN NaN NaN
31 15.5 NaN NaN NaN ... 42.4 NaN NaN NaN
However, it is good to clean the data beforehand and then process it as desired, so dropping the NA values during the import itself is significantly more useful.
>>> df = pd.read_csv("so.csv").dropna()  # drop the NA values at import time
>>> df
time 1mnaoh trial 1 1mnaoh trial 2 1mnaoh trial 3 ... 5mnaoh trial 1 5mnaoh trial 2 5mnaoh trial 3 5mnaoh trial 4
0 0.0 23.2 23.1 23.1 ... 23.3 24.3 24.1 24.1
1 0.5 23.2 23.1 23.1 ... 23.4 24.3 24.1 24.1
2 1.0 23.2 23.1 23.1 ... 23.5 24.3 24.1 24.1
3 1.5 23.2 23.1 23.1 ... 23.6 24.3 24.1 24.1
4 2.0 23.3 23.2 23.2 ... 23.7 24.5 24.7 25.1
5 2.5 24.0 23.5 23.5 ... 23.8 27.2 26.7 28.1
6 3.0 25.4 24.4 24.1 ... 23.9 31.4 29.8 31.3
7 3.5 26.9 25.5 25.1 ... 23.9 35.1 33.2 34.4
8 4.0 27.8 26.5 26.2 ... 24.0 37.7 35.9 36.8
9 4.5 28.5 27.3 27.0 ... 24.0 39.7 38.0 38.7
10 5.0 28.9 27.9 27.7 ... 24.0 40.9 39.6 40.2
11 5.5 29.2 28.2 28.3 ... 24.0 41.9 40.7 41.0
12 6.0 29.4 28.5 28.6 ... 24.1 42.5 41.6 41.2
13 6.5 29.5 28.8 28.9 ... 24.1 43.1 42.3 41.7
14 7.0 29.6 29.0 29.1 ... 24.1 43.4 42.8 42.3
15 7.5 29.7 29.2 29.2 ... 24.0 43.7 43.1 42.9
16 8.0 29.8 29.3 29.3 ... 24.2 43.8 43.3 43.3
17 8.5 29.8 29.4 29.4 ... 27.0 43.9 43.5 43.6
18 9.0 29.9 29.5 29.5 ... 30.8 44.0 43.6 43.8
19 9.5 29.9 29.6 29.5 ... 33.9 44.0 43.7 44.0
20 10.0 30.0 29.7 29.6 ... 36.2 44.0 43.7 44.1
21 10.5 30.0 29.7 29.6 ... 37.9 44.0 43.8 44.2
And lastly, slice your DataFrame as you wish:
>>> df = [df.iloc[:, [0, 1]]]
# Or: new_df = [df.iloc[:, [0, 1]]] if you don't want to alter the actual DataFrame
>>> df
[ time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0]
Better solution:
Looking at the end result, you are only concerned with two columns, 'time' and '1mnaoh trial 1'. Ideally, then, use the usecols option, which reduces the memory footprint of the read because only the useful columns are loaded, and chain dropna(), which should give you what you wanted.
>>> df = pd.read_csv("so.csv", usecols=['time', '1mnaoh trial 1']).dropna()
>>> df
time 1mnaoh trial 1
0 0.0 23.2
1 0.5 23.2
2 1.0 23.2
3 1.5 23.2
4 2.0 23.3
5 2.5 24.0
6 3.0 25.4
7 3.5 26.9
8 4.0 27.8
9 4.5 28.5
10 5.0 28.9
11 5.5 29.2
12 6.0 29.4
13 6.5 29.5
14 7.0 29.6
15 7.5 29.7
16 8.0 29.8
17 8.5 29.8
18 9.0 29.9
19 9.5 29.9
20 10.0 30.0
21 10.5 30.0
22 11.0 30.0
23 11.5 30.0
24 12.0 30.0
25 12.5 30.1
26 13.0 30.1
27 13.5 30.1
28 14.0 30.1
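Since the stated end goal is to graph and export the data as x, y arrays, a small hedged follow-up sketch:
# Convert the two cleaned columns to NumPy arrays for plotting or exporting
x = df['time'].to_numpy()
y = df['1mnaoh trial 1'].to_numpy()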
I'm trying to filter a groupby to contain only those rows in the group from group beginning to first local max, and I'm having some trouble.
To select the local max, I'm using x[((x.B.diff().fillna(1) >= 0).cumprod()) == 1].tail(1)
To get the rows I want, I figured I'd try to use groupby filter and try to get rows with indices smaller than the index of the first local max of the group. (Maybe there's a better way?)
Here's what I'm working on so far:
df.groupby('Flag').filter(lambda x: x.index.values < x.index.get_loc(x[((x.B.diff().fillna(1) >= 0).cumprod()) == 1].tail(1)))
With this I'm currently getting a TypeError that says that one of the rows is an invalid key. I'm assuming I've got some malformed code in the line above.
Sample Data:
Flag B
60738 10.0 27.2
60739 10.0 27.3
60740 10.0 27.4
60741 10.0 27.6
60742 10.0 27.8
60743 10.0 28.1
60744 10.0 28.4
60745 10.0 28.7
60746 10.0 29.0
60747 10.0 29.3
60748 10.0 29.6
60749 10.0 29.9
60750 10.0 29.9
60751 10.0 29.9
60752 10.0 29.9
60753 10.0 29.9
60754 10.0 30.1
60755 10.0 30.4
60756 10.0 30.6
60757 10.0 30.9
60758 10.0 31.1
60759 10.0 31.3
60760 10.0 31.6
60761 10.0 31.9
60762 10.0 32.3
60763 10.0 32.6
60764 10.0 33.0
60765 10.0 33.1
60766 10.0 33.3
60767 10.0 33.5
60768 10.0 33.9
60769 10.0 34.3
60770 10.0 34.6
60771 10.0 35.0
60772 10.0 35.4
60773 10.0 35.7
60774 10.0 36.1
60775 10.0 36.2
60776 10.0 36.1
60777 10.0 36.0
60778 10.0 35.8
60779 10.0 35.5
60780 10.0 35.0
60781 10.0 34.6
60782 10.0 34.0
60783 10.0 33.6
60784 10.0 33.3
60785 10.0 33.0
60786 10.0 32.7
60787 10.0 32.4
I believe for this group (10), I'd like the grouping to contain indexes 60738-60775.
I think you need scipy:
import numpy as np
from scipy.signal import argrelextrema

df.groupby('Flag').apply(lambda x: x.iloc[argrelextrema(x['B'].values, np.greater)[0][0], :])
Out[1508]:
      Flag     B
Flag
10.0  10.0  36.2
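This returns only the first local-max row itself (the row at original index 60775). To keep every row from the start of each group through that point, a hedged sketch building on the same idea (it assumes each group has at least one strict local maximum):
import numpy as np
from scipy.signal import argrelextrema

def up_to_first_max(g):
    # Positional index of the group's first strict local maximum in column B
    peak = argrelextrema(g['B'].values, np.greater)[0][0]
    return g.iloc[:peak + 1]

df.groupby('Flag', group_keys=False).apply(up_to_first_max)  # indexes 60738-60775 for group 10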