Change values of one column in pandas dataframe - python

How can I change the values of column 4 to 1 and -1, so that Iris-setosa is replaced with 1 and Iris-virginica is replaced with -1?
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
.. ... ... ... ... ...
120 6.9 3.2 5.7 2.3 Iris-virginica
121 5.6 2.8 4.9 2.0 Iris-virginica
122 7.7 2.8 6.7 2.0 Iris-virginica
123 6.3 2.7 4.9 1.8 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica
126 6.2 2.8 4.8 1.8 Iris-virginica
I would appreciate the help.

You can use replace with a dictionary:
d = {'Iris-setosa': 1, 'Iris-virginica': -1}
df['4'] = df['4'].replace(d)
0 1 2 3 4
0 5.1 3.5 1.4 0.2 1
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 5.4 3.9 1.7 0.4 1
6 4.6 3.4 1.4 0.3 1
.. ... ... ... ... ...
120 6.9 3.2 5.7 2.3 -1
121 5.6 2.8 4.9 2.0 -1
122 7.7 2.8 6.7 2.0 -1
123 6.3 2.7 4.9 1.8 -1
124 6.7 3.3 5.7 2.1 -1
125 7.2 3.2 6.0 1.8 -1
126 6.2 2.8 4.8 1.8 -1
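For reference, here is a minimal self-contained sketch of the replace approach (it assumes the file was read with header=None, so the column labels are integers; use the string '4' instead if your labels are strings):
import pandas as pd

# tiny stand-in for the iris frame, with default integer column labels
df = pd.DataFrame([[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
                   [6.9, 3.2, 5.7, 2.3, 'Iris-virginica']])
d = {'Iris-setosa': 1, 'Iris-virginica': -1}
df[4] = df[4].replace(d)
print(df)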

You can also assign through a boolean mask (note that this needs .loc, not .iloc, because both the mask and the column label are label-based):
df.loc[df["4"]=="Iris-setosa","4"] = 1
df.loc[df["4"]=="Iris-virginica","4"] = -1

I would do something like this:
def encode_row(row):
    if row[4] == "Iris-setosa":
        return 1
    return -1

df_test[4] = df_test.apply(encode_row, axis=1)
assuming that df_test is your data frame.

Sounds like
import numpy as np
df['4'] = np.where(df['4'] == 'Iris-setosa', 1, -1)
should do the job.
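A related alternative, offered only as a sketch: Series.map performs the same label-to-number conversion, and any label missing from the mapping becomes NaN, which makes unexpected values easy to spot:
d = {'Iris-setosa': 1, 'Iris-virginica': -1}
df['4'] = df['4'].map(d)  # labels not in d become NaN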

Related

How to iterate through multiple urls (teams) to combine NBA players names and stats into one dataframe?

I am still learning web scraping and would appreciate any help that I can get. Thanks to help from the community I was able to successfully scrape NBA player data (player name and player stats) and concatenate the data into one dataframe.
Here is the code below:
import pandas as pd
import requests
url = 'https://www.espn.com/nba/team/stats/_/name/lal/season/2020/seasontype/2'
df = pd.read_html(url)
df_concat = pd.concat([df[0], df[1], df[3]], axis=1)
I would now like to iterate through multiple urls to get data for different teams and then combine all of the different teams into one dataframe.
Here is the code that I have so far:
import pandas as pd
teams = ['chi','den','lac']
for team in teams:
    print(team)
    url = 'https://www.espn.com/nba/team/stats/_/name/{team}/season/2020/seasontype/2'.format(team=team)
    print(url)
    df = pd.read_html(url)
    df_concat = pd.concat([df[0], df[1], df[3]], axis=1)
I tried changing 'lal' in the url to the variable team. When I ran this script, the scrape was really slow and only gave me a dataframe for the team 'lac', not 'chi' or 'den'. Any advice on the best way to do this? I have never tried scraping multiple urls.
Again, I would like the data for each team combined into one large dataframe if possible. Thanks in advance for any help you may offer. I will learn a lot from this project. =)
The principle is the same: use pd.concat with a list of dataframes. For example:
import requests
import pandas as pd
teams = ["chi", "den", "lac"]
dfs_to_concat = []
for team in teams:
    print(team)
    url = "https://www.espn.com/nba/team/stats/_/name/{team}/season/2020/seasontype/2".format(
        team=team
    )
    print(url)
    df = pd.read_html(url)
    df_concat = pd.concat([df[0], df[1], df[3]], axis=1)
    dfs_to_concat.append(df_concat)

df_final = pd.concat(dfs_to_concat)
print(df_final)
df_final.to_csv("data.csv", index=False)
Prints:
chi
https://www.espn.com/nba/team/stats/_/name/chi/season/2020/seasontype/2
den
https://www.espn.com/nba/team/stats/_/name/den/season/2020/seasontype/2
lac
https://www.espn.com/nba/team/stats/_/name/lac/season/2020/seasontype/2
Name GP GS MIN PTS OR DR REB AST STL BLK TO PF AST/TO PER FGM FGA FG% 3PM 3PA 3P% FTM FTA FT% 2PM 2PA 2P% SC-EFF SH-EFF
0 Zach LaVine SG 60 60.0 34.8 25.5 0.7 4.1 4.8 4.2 1.5 0.5 3.4 2.2 1.2 19.52 9.0 20.0 45.0 3.1 8.1 38.0 4.5 5.6 80.2 5.9 11.9 49.7 1.276 0.53
1 Lauri Markkanen PF 50 50.0 29.8 14.7 1.2 5.1 6.3 1.5 0.8 0.5 1.6 1.9 0.9 14.32 5.0 11.8 42.5 2.2 6.3 34.4 2.5 3.1 82.4 2.8 5.5 51.8 1.247 0.52
2 Coby White PG 65 1.0 25.8 13.2 0.4 3.1 3.5 2.7 0.8 0.1 1.7 1.8 1.6 11.92 4.8 12.2 39.4 2.0 5.8 35.4 1.6 2.0 79.1 2.8 6.4 43.0 1.085 0.48
3 Otto Porter Jr. SF 14 9.0 23.6 11.9 0.9 2.5 3.4 1.8 1.1 0.4 0.8 2.2 2.3 15.87 4.4 10.0 44.3 1.7 4.4 38.7 1.4 1.9 70.4 2.7 5.6 48.7 1.193 0.53
4 Wendell Carter Jr. C 43 43.0 29.2 11.3 3.2 6.2 9.4 1.2 0.8 0.8 1.7 3.8 0.7 15.51 4.3 8.0 53.4 0.1 0.7 20.7 2.6 3.5 73.7 4.1 7.3 56.4 1.411 0.54
5 Thaddeus Young PF 64 16.0 24.9 10.3 1.5 3.5 4.9 1.8 1.4 0.4 1.6 2.1 1.1 13.36 4.2 9.4 44.8 1.2 3.5 35.6 0.7 1.1 58.3 3.0 5.9 50.1 1.097 0.51
6 Tomas Satoransky SG 65 64.0 28.9 9.9 1.2 2.7 3.9 5.4 1.2 0.1 2.0 2.1 2.7 13.52 3.6 8.5 43.0 1.0 3.1 32.2 1.6 1.9 87.6 2.7 5.4 49.1 1.169 0.49
7 Chandler Hutchison F 28 10.0 18.8 7.8 0.6 3.2 3.9 0.9 1.0 0.3 1.0 1.7 1.0 12.45 2.9 6.3 45.7 0.4 1.4 31.6 1.6 2.8 59.0 2.4 4.9 49.6 1.246 0.49
8 Kris Dunn PG 51 32.0 24.9 7.3 0.5 3.2 3.6 3.4 2.0 0.3 1.3 3.1 2.5 12.15 3.0 6.7 44.4 0.6 2.2 25.9 0.8 1.1 74.1 2.4 4.5 53.5 1.091 0.49
9 Denzel Valentine SG 36 5.0 13.6 6.8 0.3 1.8 2.1 1.2 0.7 0.2 0.7 1.4 1.7 13.09 2.7 6.6 40.9 1.3 3.8 33.6 0.2 0.2 75.0 1.4 2.8 51.0 1.038 0.51
10 Luke Kornet C 36 14.0 15.5 6.0 0.6 1.7 2.3 0.9 0.3 0.7 0.4 1.5 2.3 12.70 2.3 5.2 43.8 0.9 3.0 28.7 0.6 0.8 71.4 1.4 2.2 64.6 1.150 0.52
11 Daniel Gafford C 43 7.0 14.2 5.1 1.2 1.3 2.5 0.5 0.3 1.3 0.7 2.3 0.7 16.21 2.2 3.1 70.1 0.0 0.0 0.0 0.7 1.4 53.3 2.2 3.1 70.1 1.642 0.70
12 Shaquille Harrison G 43 10.0 11.3 4.9 0.5 1.5 2.0 1.1 0.8 0.4 0.4 1.3 2.6 17.81 1.8 3.8 46.7 0.4 1.0 38.1 0.9 1.2 78.0 1.4 2.9 49.6 1.267 0.52
13 Ryan Arcidiacono PG 58 4.0 16.0 4.5 0.3 1.6 1.9 1.7 0.5 0.1 0.6 1.7 2.6 9.04 1.6 3.8 40.9 0.9 2.4 39.1 0.5 0.7 71.1 0.6 1.4 43.9 1.186 0.53
14 Cristiano Felicio PF 22 0.0 17.5 3.9 2.5 2.1 4.6 0.7 0.5 0.1 0.8 1.5 0.9 12.79 1.5 2.5 63.0 0.0 0.1 0.0 0.8 1.0 78.3 1.5 2.4 65.4 1.593 0.63
15 Adam Mokoka SG 11 0.0 10.2 2.9 0.6 0.3 0.9 0.4 0.4 0.0 0.2 1.5 2.0 8.18 1.1 2.5 42.9 0.5 1.4 40.0 0.2 0.4 50.0 0.5 1.2 46.2 1.143 0.54
16 Max Strus SG 2 0.0 3.0 2.5 0.5 0.0 0.5 0.0 0.0 0.0 0.0 0.5 0.0 30.82 1.0 1.5 66.7 0.0 0.5 0.0 0.5 0.5 100.0 1.0 1.0 100.0 1.667 0.67
17 Total 65 NaN NaN 106.8 10.5 31.4 41.9 23.2 10.0 4.1 14.6 21.8 1.6 NaN 39.6 88.6 44.7 12.2 35.1 34.8 15.5 20.5 75.5 27.4 53.5 51.1 1.205 0.52
0 Nikola Jokic C 73 73.0 32.0 19.9 2.3 7.5 9.7 7.0 1.2 0.6 3.1 3.0 2.3 24.97 7.7 14.7 52.8 1.1 3.5 31.4 3.4 4.1 81.7 6.6 11.2 59.4 1.359 0.56
1 Jamal Murray PG 59 59.0 32.3 18.5 0.8 3.2 4.0 4.8 1.1 0.3 2.2 1.7 2.2 17.78 6.9 15.2 45.6 1.9 5.5 34.6 2.8 3.1 88.1 5.0 9.7 51.9 1.220 0.52
2 Will Barton SF 58 58.0 33.0 15.1 1.3 5.0 6.3 3.7 1.1 0.5 1.5 2.1 2.4 15.70 5.7 12.7 45.0 1.9 5.0 37.5 1.8 2.3 76.7 3.9 7.8 49.8 1.184 0.52
3 Jerami Grant SF 71 24.0 26.6 12.0 0.8 2.7 3.5 1.2 0.7 0.8 0.9 2.2 1.4 14.46 4.3 8.9 47.8 1.4 3.5 38.9 2.1 2.8 75.0 2.9 5.4 53.7 1.342 0.56
4 Paul Millsap PF 51 48.0 24.3 11.6 1.9 3.8 5.7 1.6 0.9 0.6 1.4 2.9 1.2 16.96 4.1 8.6 48.2 1.1 2.4 43.5 2.3 2.8 81.6 3.1 6.2 50.0 1.349 0.54
5 Gary Harris SG 56 55.0 31.8 10.4 0.5 2.4 2.9 2.1 1.4 0.3 1.1 2.1 2.0 9.78 3.9 9.3 42.0 1.3 3.8 33.3 1.3 1.6 81.5 2.6 5.5 47.9 1.119 0.49
6 Michael Porter Jr. SF 55 8.0 16.4 9.3 1.2 3.5 4.7 0.8 0.5 0.5 0.9 1.8 0.9 19.84 3.5 7.0 50.9 1.1 2.7 42.2 1.1 1.3 83.3 2.4 4.3 56.4 1.337 0.59
7 Monte Morris PG 73 12.0 22.4 9.0 0.3 1.5 1.9 3.5 0.8 0.2 0.7 1.0 4.8 14.98 3.6 7.8 45.9 0.9 2.4 37.8 1.0 1.2 84.3 2.7 5.4 49.5 1.166 0.52
8 Malik Beasley SG * 41 0.0 18.2 7.9 0.2 1.7 1.9 1.2 0.8 0.1 0.9 1.2 1.3 10.51 2.9 7.3 38.9 1.4 3.9 36.0 0.8 0.9 86.8 1.4 3.4 42.1 1.080 0.49
9 Mason Plumlee C 61 1.0 17.3 7.2 1.6 3.6 5.2 2.5 0.5 0.6 1.3 2.3 1.9 18.86 2.9 4.7 61.5 0.0 0.1 0.0 1.4 2.5 53.5 2.9 4.6 62.5 1.517 0.61
10 PJ Dozier SG 29 0.0 14.2 5.8 0.3 1.6 1.9 2.2 0.5 0.2 0.9 1.6 2.3 11.66 2.2 5.4 41.4 0.6 1.7 34.7 0.7 1.0 72.4 1.7 3.7 44.4 1.070 0.47
11 Bol Bol C 7 0.0 12.4 5.7 0.7 2.0 2.7 0.9 0.3 0.9 1.4 1.6 0.6 14.41 2.0 4.0 50.0 0.6 1.3 44.4 1.1 1.4 80.0 1.4 2.7 52.6 1.429 0.57
12 Torrey Craig SF 58 27.0 18.5 5.4 1.1 2.2 3.3 0.8 0.4 0.6 0.4 2.3 1.9 10.79 2.1 4.6 46.1 0.8 2.4 32.6 0.4 0.6 61.1 1.4 2.3 60.3 1.171 0.54
13 Keita Bates-Diop SF * 7 0.0 14.0 5.3 0.6 1.9 2.4 0.0 0.3 0.6 0.4 1.0 0.0 12.13 1.9 4.0 46.4 0.4 1.3 33.3 1.1 1.4 80.0 1.4 2.7 52.6 1.321 0.52
14 Troy Daniels G * 6 0.0 12.7 4.3 0.0 1.0 1.0 0.5 0.5 0.0 0.5 1.2 1.0 5.35 1.7 4.7 35.7 1.0 3.3 30.0 0.0 0.0 0.0 0.7 1.3 50.0 0.929 0.46
15 Juancho Hernangomez PF * 34 0.0 12.4 3.1 0.7 2.1 2.8 0.6 0.1 0.1 0.5 0.9 1.2 6.89 1.1 3.2 34.5 0.4 1.8 25.0 0.5 0.7 64.0 0.7 1.5 46.0 0.973 0.41
16 Jordan McRae G * 4 0.0 8.0 2.3 0.3 1.0 1.3 1.0 0.5 0.3 0.0 0.5 inf 16.74 0.5 1.5 33.3 0.5 1.0 50.0 0.8 1.0 75.0 0.0 0.5 0.0 1.500 0.50
17 Tyler Cook F * 2 0.0 9.5 2.0 1.0 1.0 2.0 0.0 1.0 0.0 1.0 0.5 0.0 11.31 0.5 1.0 50.0 0.0 0.0 0.0 1.0 1.0 100.0 0.5 1.0 50.0 2.000 0.50
18 Noah Vonleh F * 7 0.0 4.3 1.9 0.4 0.7 1.1 0.3 0.0 0.0 0.3 0.6 1.0 17.61 0.7 0.9 83.3 0.1 0.1 100.0 0.3 0.6 50.0 0.6 0.7 80.0 2.167 0.92
19 Vlatko Cancar SF 14 0.0 3.2 1.2 0.4 0.4 0.7 0.2 0.1 0.1 0.2 0.5 1.0 11.45 0.4 1.1 40.0 0.1 0.4 16.7 0.3 0.3 100.0 0.4 0.6 55.6 1.133 0.43
20 Jarred Vanderbilt PF * 9 0.0 4.6 1.1 0.3 0.6 0.9 0.2 0.3 0.1 0.8 0.7 0.3 7.20 0.6 0.8 71.4 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.8 71.4 1.429 0.71
21 Total 73 NaN NaN 111.3 10.8 33.4 44.1 26.7 8.0 4.6 13.1 20.3 2.0 NaN 42.0 88.9 47.3 11.0 30.6 35.9 16.2 20.9 77.7 31.1 58.3 53.3 1.252 0.53
0 Kawhi Leonard SF 57 57.0 32.4 27.1 0.9 6.1 7.1 4.9 1.8 0.6 2.6 2.0 1.9 26.91 9.3 19.9 47.0 2.2 5.7 37.8 6.2 7.1 88.6 7.2 14.2 50.6 1.362 0.52
1 Paul George SG 48 48.0 29.6 21.5 0.5 5.2 5.7 3.9 1.4 0.4 2.6 2.4 1.5 21.14 7.1 16.3 43.9 3.3 7.9 41.2 4.0 4.5 87.6 3.9 8.4 46.4 1.321 0.54
2 Montrezl Harrell C 63 2.0 27.8 18.6 2.6 4.5 7.1 1.7 0.6 1.1 1.7 2.3 1.0 23.26 7.5 12.9 58.0 0.0 0.3 0.0 3.7 5.6 65.8 7.5 12.6 59.3 1.445 0.58
3 Lou Williams SG 65 8.0 28.7 18.2 0.5 2.6 3.1 5.6 0.7 0.2 2.8 1.2 2.0 17.38 6.0 14.4 41.8 1.7 4.8 35.2 4.5 5.2 86.1 4.3 9.6 45.1 1.266 0.48
4 Marcus Morris Sr. SF * 19 19.0 28.9 10.1 0.6 3.5 4.1 1.4 0.7 0.7 1.3 2.7 1.1 8.96 3.9 9.2 42.5 1.4 4.4 31.0 0.9 1.2 81.8 2.5 4.7 53.3 1.103 0.50
5 Reggie Jackson PG * 17 6.0 21.3 9.5 0.4 2.6 3.0 3.2 0.3 0.2 1.6 2.2 1.9 12.66 3.4 7.5 45.3 1.5 3.7 41.3 1.1 1.2 90.5 1.9 3.8 49.2 1.258 0.55
6 Landry Shamet SG 53 30.0 27.4 9.3 0.1 1.8 1.9 1.9 0.4 0.2 0.8 2.7 2.4 8.51 3.0 7.4 40.4 2.1 5.6 37.5 1.2 1.4 85.5 0.9 1.8 49.5 1.258 0.55
7 Ivica Zubac C 72 70.0 18.4 8.3 2.7 4.8 7.5 1.1 0.2 0.9 0.8 2.3 1.3 21.75 3.3 5.3 61.3 0.0 0.0 0.0 1.7 2.3 74.7 3.3 5.3 61.6 1.548 0.61
8 Patrick Beverley PG 51 50.0 26.3 7.9 1.1 4.1 5.2 3.6 1.1 0.5 1.3 3.1 2.8 12.54 2.9 6.7 43.1 1.6 4.0 38.8 0.6 0.9 66.0 1.3 2.6 49.6 1.188 0.55
9 JaMychal Green PF 63 1.0 20.7 6.8 1.2 4.9 6.2 0.8 0.5 0.4 0.9 2.8 0.9 11.11 2.4 5.6 42.9 1.5 3.8 38.7 0.6 0.8 75.0 0.9 1.8 51.8 1.222 0.56
10 Maurice Harkless SF * 50 38.0 22.8 5.5 0.9 3.1 4.0 1.0 1.0 0.6 0.9 2.4 1.0 9.70 2.2 4.3 51.6 0.5 1.5 37.0 0.5 0.8 57.1 1.7 2.9 59.0 1.267 0.58
11 Patrick Patterson PF 59 18.0 13.2 4.9 0.6 2.0 2.6 0.7 0.1 0.1 0.4 0.9 2.0 11.57 1.6 3.9 40.8 1.1 2.9 39.0 0.6 0.7 81.4 0.5 1.0 45.9 1.253 0.55
12 Mfiondu Kabengele F 12 0.0 5.3 3.5 0.1 0.8 0.9 0.2 0.2 0.2 0.2 0.8 1.0 18.28 1.2 2.7 43.8 0.8 1.7 45.0 0.4 0.4 100.0 0.4 1.0 41.7 1.313 0.58
13 Rodney McGruder SG 56 4.0 15.6 3.3 0.5 2.2 2.7 0.6 0.5 0.1 0.4 1.3 1.5 6.75 1.3 3.2 39.8 0.4 1.6 27.0 0.3 0.6 55.9 0.9 1.6 52.2 1.033 0.46
14 Amir Coffey SG 18 1.0 8.8 3.2 0.2 0.7 0.9 0.8 0.3 0.1 0.4 1.1 1.8 8.55 1.3 3.0 42.6 0.3 1.1 31.6 0.3 0.6 54.5 0.9 1.9 48.6 1.074 0.48
15 Jerome Robinson SG * 42 1.0 11.3 2.9 0.1 1.3 1.4 1.1 0.3 0.2 0.6 1.3 1.8 4.86 1.1 3.2 33.8 0.5 1.6 28.4 0.3 0.5 57.9 0.6 1.6 39.1 0.897 0.41
16 Joakim Noah C 5 0.0 10.0 2.8 1.0 2.2 3.2 1.4 0.2 0.2 1.2 1.8 1.2 11.11 0.8 1.6 50.0 0.0 0.0 0.0 1.2 1.6 75.0 0.8 1.6 50.0 1.750 0.50
17 Terance Mann SG 41 6.0 8.8 2.4 0.2 1.1 1.3 1.3 0.3 0.1 0.4 1.1 2.9 10.58 0.9 1.9 46.8 0.2 0.5 35.0 0.4 0.7 66.7 0.7 1.4 50.8 1.253 0.51
18 Derrick Walton Jr. G * 23 1.0 9.7 2.2 0.1 0.6 0.7 1.0 0.2 0.0 0.2 0.8 5.5 8.43 0.7 1.6 47.2 0.4 0.9 42.9 0.3 0.4 77.8 0.3 0.7 53.3 1.389 0.60
19 Johnathan Motley F 13 0.0 3.2 2.2 0.2 0.5 0.8 0.6 0.2 0.0 0.4 0.5 1.6 28.53 0.8 1.2 73.3 0.1 0.1 100.0 0.4 0.5 71.4 0.8 1.1 71.4 1.867 0.77
20 Total 72 NaN NaN 116.3 10.7 37.0 47.7 23.7 7.1 4.7 14.0 22.1 1.7 NaN 41.6 89.2 46.6 12.4 33.5 37.1 20.8 26.3 79.1 29.1 55.8 52.2 1.304 0.54
and creates data.csv.
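One small addition, offered as a sketch (the toy frames and the 'team' level name below are illustrative, not from the original answer): passing keys= to pd.concat records which rows came from which team:
import pandas as pd

teams = ['chi', 'den', 'lac']
# toy stand-ins for the scraped per-team frames
dfs_to_concat = [pd.DataFrame({'PTS': [10.0 + i, 12.0 + i]}) for i in range(len(teams))]

# keys= adds an outer index level identifying the team for each block of rows
df_final = pd.concat(dfs_to_concat, keys=teams, names=['team', None])
print(df_final.reset_index(level='team'))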

Is there short Pandas method chain for assigning grouped nth value?

I want to use the nth value of each group as a column, without aggregating the rows, because I want to create a feature that can be tracked with window and aggregation functions at any time.
R:
library(tidyverse)
iris %>% arrange(Species, Sepal.Length) %>% group_by(Species) %>%
mutate(cs = cumsum(Sepal.Length), cs4th = cumsum(Sepal.Length)[4]) %>%
slice(c(1:4))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species cs cs4th
<dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 4.3 3 1.1 0.1 setosa 4.3 17.5
2 4.4 2.9 1.4 0.2 setosa 8.7 17.5
3 4.4 3 1.3 0.2 setosa 13.1 17.5
4 4.4 3.2 1.3 0.2 setosa 17.5 17.5
5 4.9 2.4 3.3 1 versicolor 4.9 20
6 5 2 3.5 1 versicolor 9.9 20
7 5 2.3 3.3 1 versicolor 14.9 20
8 5.1 2.5 3 1.1 versicolor 20 20
9 4.9 2.5 4.5 1.7 virginica 4.9 22
10 5.6 2.8 4.9 2 virginica 10.5 22
11 5.7 2.5 5 2 virginica 16.2 22
12 5.8 2.7 5.1 1.9 virginica 22 22
Python: Too long and verbose!
import numpy as np
import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
iris.sort_values(['species','sepal_length']).assign(
    index_species=lambda x: x.groupby('species').cumcount(),
    cs=lambda x: x.groupby('species').sepal_length.cumsum(),
    tmp=lambda x: np.where(x.index_species==3, x.cs, 0),
    cs4th=lambda x: x.groupby('species').tmp.transform(sum)
).iloc[list(range(0,4))+list(range(50,54))+list(range(100,104))]
sepal_length sepal_width petal_length ... cs tmp cs4th
13 4.3 3.0 1.1 ... 4.3 0.0 17.5
8 4.4 2.9 1.4 ... 8.7 0.0 17.5
38 4.4 3.0 1.3 ... 13.1 0.0 17.5
42 4.4 3.2 1.3 ... 17.5 17.5 17.5
57 4.9 2.4 3.3 ... 4.9 0.0 20.0
60 5.0 2.0 3.5 ... 9.9 0.0 20.0
93 5.0 2.3 3.3 ... 14.9 0.0 20.0
98 5.1 2.5 3.0 ... 20.0 20.0 20.0
106 4.9 2.5 4.5 ... 4.9 0.0 22.0
121 5.6 2.8 4.9 ... 10.5 0.0 22.0
113 5.7 2.5 5.0 ... 16.2 0.0 22.0
101 5.8 2.7 5.1 ... 22.0 22.0 22.0
Python: my better solution (not elegant; there is room for improvement in how groupby is used):
iris.sort_values(['species','sepal_length']).assign(
    cs=lambda x: x.groupby('species').sepal_length.transform('cumsum'),
    cs4th=lambda x: x.merge(
        x.groupby('species', as_index=False).nth(3).loc[:, ['species','cs']], on='species')
        .iloc[:, -1]
)
This doesn't work well:
iris.groupby('species').transform('nth(3)')
Here is an updated solution, using Pandas, which is still longer than what you will get with dplyr:
import seaborn as sns
import pandas as pd
iris = sns.load_dataset('iris')
iris['cs'] = (iris
              .sort_values(['species','sepal_length'])
              .groupby('species')['sepal_length']
              .transform('cumsum'))
M = (iris
     .sort_values(['species','cs'])
     .groupby('species')['cs'])
groupby has an nth function that gets you one row per group: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.nth.html
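For a quick illustration of what nth returns, here is a sketch on a toy frame (separate from the iris data used in this answer):
import pandas as pd

toy = pd.DataFrame({'g': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                    'v': [1, 2, 3, 4, 10, 20, 30, 40]})
# nth(3) selects the 4th row (0-based) within each group
print(toy.groupby('g')['v'].nth(3))
With that in mind, the answer continues by merging the nth cumulative sum back onto the frame: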
iris = (iris
        .sort_values(['species','cs'])
        .reset_index(drop=True)
        .merge(M.nth(3), how='left', on='species')
        .rename(columns={'cs_x':'cs',
                         'cs_y':'cs4th'})
        )
iris.head()
sepal_length sepal_width petal_length petal_width species cs cs4th
0 4.3 3.0 1.1 0.1 setosa 4.3 17.5
1 4.4 2.9 1.4 0.2 setosa 8.7 17.5
2 4.4 3.0 1.3 0.2 setosa 13.1 17.5
3 4.4 3.2 1.3 0.2 setosa 17.5 17.5
4 4.5 2.3 1.3 0.3 setosa 22.0 17.5
Update: 16/04/2021 ... Below is a better way to achieve the OP's goal:
(iris
 .sort_values(['species', 'sepal_length'])
 .assign(cs = lambda df: df.groupby('species')
                           .sepal_length
                           .transform('cumsum'),
         cs4th = lambda df: df.groupby('species')
                              .cs
                              .transform('nth', 3)
         )
 .groupby('species')
 .head(4)
)
sepal_length sepal_width petal_length petal_width species cs cs4th
13 4.3 3.0 1.1 0.1 setosa 4.3 17.5
8 4.4 2.9 1.4 0.2 setosa 8.7 17.5
38 4.4 3.0 1.3 0.2 setosa 13.1 17.5
42 4.4 3.2 1.3 0.2 setosa 17.5 17.5
57 4.9 2.4 3.3 1.0 versicolor 4.9 20.0
60 5.0 2.0 3.5 1.0 versicolor 9.9 20.0
93 5.0 2.3 3.3 1.0 versicolor 14.9 20.0
98 5.1 2.5 3.0 1.1 versicolor 20.0 20.0
106 4.9 2.5 4.5 1.7 virginica 4.9 22.0
121 5.6 2.8 4.9 2.0 virginica 10.5 22.0
113 5.7 2.5 5.0 2.0 virginica 16.2 22.0
101 5.8 2.7 5.1 1.9 virginica 22.0 22.0
Now you can do it in a non-verbose way, as you did in R, with datar in Python:
>>> from datar.datasets import iris
>>> from datar.all import f, arrange, group_by, mutate, cumsum, slice
>>>
>>> (iris >>
... arrange(f.Species, f.Sepal_Length) >>
... group_by(f.Species) >>
... mutate(cs=cumsum(f.Sepal_Length), cs4th=cumsum(f.Sepal_Length)[3]) >>
... slice(f[1:4]))
Sepal_Length Sepal_Width Petal_Length Petal_Width Species cs cs4th
0 4.3 3.0 1.1 0.1 setosa 4.3 17.5
1 4.4 2.9 1.4 0.2 setosa 8.7 17.5
2 4.4 3.0 1.3 0.2 setosa 13.1 17.5
3 4.4 3.2 1.3 0.2 setosa 17.5 17.5
4 4.9 2.4 3.3 1.0 versicolor 4.9 20.0
5 5.0 2.0 3.5 1.0 versicolor 9.9 20.0
6 5.0 2.3 3.3 1.0 versicolor 14.9 20.0
7 5.1 2.5 3.0 1.1 versicolor 20.0 20.0
8 4.9 2.5 4.5 1.7 virginica 4.9 22.0
9 5.6 2.8 4.9 2.0 virginica 10.5 22.0
10 5.7 2.5 5.0 2.0 virginica 16.2 22.0
11 5.8 2.7 5.1 1.9 virginica 22.0 22.0
[Groups: ['Species'] (n=3)]
I am the author of the package. Feel free to submit issues if you have any questions.

pandas dataframe assign doesn't update the dataframe

I made a pandas dataframe of the Iris dataset and I want to put 4 extra columns in it. The contents of the columns have to be SepalRatio, PetalRatio, SepalMultiplied, PetalMultiplied. I used the assign() function of the DataFrame to add these four columns, but the DataFrame remains the same.
My code to add the columns is:
iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm']).assign(PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm']).assign(SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm']).assign(PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
When executing in a Jupyter notebook a correct table is shown, but if I use the print statement the four columns aren't added.
Output in Jupyter notebook :
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species SepalRatio PetalRatio SepalMultiplied PetalMultiplied
0 1 5.1 3.5 1.4 0.2 Iris-setosa 1.457143 7.000000 17.85 0.28
1 2 4.9 3.0 1.4 0.2 Iris-setosa 1.633333 7.000000 14.70 0.28
2 3 4.7 3.2 1.3 0.2 Iris-setosa 1.468750 6.500000 15.04 0.26
3 4 4.6 3.1 1.5 0.2 Iris-setosa 1.483871 7.500000 14.26 0.30
4 5 5.0 3.6 1.4 0.2 Iris-setosa 1.388889 7.000000 18.00 0.28
5 6 5.4 3.9 1.7 0.4 Iris-setosa 1.384615 4.250000 21.06 0.68
6 7 4.6 3.4 1.4 0.3 Iris-setosa 1.352941 4.666667 15.64 0.42
7 8 5.0 3.4 1.5 0.2 Iris-setosa 1.470588 7.500000 17.00 0.30
8 9 4.4 2.9 1.4 0.2 Iris-setosa 1.517241 7.000000 12.76 0.28
9 10 4.9 3.1 1.5 0.1 Iris-setosa 1.580645 15.000000 15.19 0.15
output after printing the dataframe :
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
5 6 5.4 3.9 1.7 0.4
6 7 4.6 3.4 1.4 0.3
7 8 5.0 3.4 1.5 0.2
8 9 4.4 2.9 1.4 0.2
9 10 4.9 3.1 1.5 0.1
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
5 Iris-setosa
6 Iris-setosa
7 Iris-setosa
8 Iris-setosa
9 Iris-setosa
You need to assign the output back to a variable, like:
iris = iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm']).assign(PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm']).assign(SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm']).assign(PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
Better is to use only one assign:
iris = iris.assign(SepalRatio = iris['SepalLengthCm'] / iris['SepalWidthCm'],
                   PetalRatio = iris['PetalLengthCm'] / iris['PetalWidthCm'],
                   SepalMultiplied = iris['SepalLengthCm'] * iris['SepalWidthCm'],
                   PetalMultiplied = iris['PetalLengthCm'] * iris['PetalWidthCm'])
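As a side note, a sketch rather than part of the original answer (the SepalRatioSquared column is purely illustrative): within a single assign call, a later column can refer to one created earlier in the same call if you pass callables, since keyword arguments are applied in order on Python 3.6+:
iris = iris.assign(
    SepalRatio=lambda df: df['SepalLengthCm'] / df['SepalWidthCm'],
    # this lambda sees the frame with SepalRatio already added
    SepalRatioSquared=lambda df: df['SepalRatio'] ** 2)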

Comparing two DataFrames elements with different DataFrame

I have a DataFrame which has multiple samples
df1 = [[ 0 1 2 3 4 5 6 7 8 9 10 11 \
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 13.4 5.2 7.7 -2.1 1.6 -4.1 -0.5 8.2 15.9 12.9 11.8 9.3
2 -3.1 -0.6 -5.1 -0.5 -4.1 0.5 -3.6 -5.6 -9.7 -3.6 -4.7 -5.7
3 -10.3 -1.0 -9.8 0.5 -3.6 1.0 -1.5 -1.6 -5.1 -4.6 -13.3 -10.7
4 0.0 -5.6 -4.1 1.5 3.0 -1.0 2.6 6.7 12.3 6.6 -0.5 1.0
5 6.2 0.5 2.6 1.1 1.6 0.5 2.0 0.0 -0.5 0.5 7.7 5.6
6 -1.6 5.1 6.1 -1.1 -2.1 0.0 -1.5 -9.2 -13.9 -7.1 1.5 -0.5
7 -6.1 -4.1 -1.5 -1.5 -0.5 -0.5 -0.5 -2.6 2.6 -2.6 -6.6 -3.1
8 -0.5 -4.1 -6.7 1.5 0.0 0.5 1.0 8.2 8.7 4.1 -3.1 0.6
9 5.1 7.2 4.6 0.6 1.0 0.0 0.5 -0.5 -7.7 -0.5 5.6 0.5
10 0.5 2.6 5.7 -2.1 2.1 -1.0 0.0 -8.2 -10.2 -6.2 3.6 0.0
11 -5.1 -7.2 -4.6 1.0 -1.0 2.0 -2.0 8.7 14.3 10.3 2.1 8.2
12 0.5 -2.6 -1.6 3.1 -0.5 3.1 4.1 10.8 15.9 9.7 2.5 7.7
13 2.6 2.1 0.0 0.0 -0.6 -0.5 8.7 -8.7 -15.9 -13.3 -3.0 -9.3
14 -2.1 4.6 1.0 -2.6 1.1 0.0 0.0 -6.2 -10.8 -9.7 -1.1 -4.1
15 4.6 5.1 6.2 -0.5 7.7 3.1 -3.6 8.2 19.0 11.7 7.2 12.9
16 1.6 -6.1 -2.6 0.5 5.1 2.0 1.0 0.0 5.7 5.7 2.1 1.0
17 -11.8 -10.8 -10.7 -1.5 -6.2 -1.0 -3.1 -10.7 -23.1 -11.3 -6.7 -12.8
18 -6.2 -0.5 -0.5 0.0 -7.7 -3.6 -7.7 0.0 -3.1 0.0 -4.1 -0.6
19 9.8 3.6 4.1 1.5 -2.0 -4.6 -1.0 11.2 25.1 14.9 1.5 8.8
20 7.7 3.0 2.0 0.0 3.6 2.5 3.1 1.6 3.1 2.0 6.2 2.5
21 -1.1 3.1 3.1 -0.5 7.2 7.2 2.0 -11.3 -26.7 -12.8 3.1 -2.5
22 0.5 -1.0 0.0 0.5 3.0 0.5 -2.0 -0.5 -2.0 -1.0 -3.1 1.0
23 1.1 -2.1 -2.6 -1.5 -0.5 -0.5 -2.6 9.2 23.1 6.6 -1.0 1.0
24 -2.1 3.6 2.1 -1.0 1.1 3.6 1.6 -3.6 -9.3 -5.1 2.0 -2.0
25 1.6 4.6 5.1 2.0 -4.7 -2.6 1.0 -2.0 -11.8 -2.0 2.1 3.0
26 0.5 -1.5 -2.6 1.1 -7.7 -9.2 0.5 6.1 15.4 5.6 -2.1 0.0
27 -6.2 -11.3 -11.8 -0.6 -1.5 2.1 5.1 -3.6 -2.0 -4.6 -6.7 -11.2
28 -1.0 -4.1 -1.0 0.6 1.0 8.7 7.7 -4.1 -10.3 -2.1 0.0 -1.1
29 8.2 10.3 10.3 0.0 -0.5 0.5 3.6 3.1 6.2 4.7 6.2 10.3
.. ... ... ... ... ... ... ... ... ... ... ... ...
98 -1.0 -4.1 -1.0 1.1 0.5 -3.1 -6.7 1.5 10.3 2.5 -1.0 -6.1
99 5.6 9.8 5.1 -1.6 1.6 -1.0 4.7 11.8 18.4 22.6 11.8 13.3
100 -1.0 2.0 1.1 0.5 -1.1 1.5 11.2 -5.6 -4.1 8.2 4.1 9.3
101 -4.6 -9.2 -5.7 2.6 -1.0 1.6 -0.5 -10.8 -14.3 -16.4 -8.2 -14.4
102 1.5 0.0 2.6 -1.5 0.5 -0.5 -10.7 4.6 -2.1 -13.3 -0.5 -3.1
103 -2.5 -2.6 1.5 -1.6 -4.1 -2.6 -5.2 2.6 2.6 -2.6 -3.1 2.6
104 -7.7 -6.7 -7.2 1.6 -6.1 -4.1 1.1 -5.6 -2.1 -2.1 -10.3 -12.8
105 2.6 7.2 -1.0 0.0 2.0 -9.2 0.0 4.1 1.6 4.1 -1.0 -3.1
106 5.6 7.2 7.2 -2.6 5.6 -1.1 -2.1 4.6 1.5 7.2 4.1 9.7
107 -0.5 -5.6 1.0 3.1 1.6 8.8 1.0 -4.6 -1.5 -7.7 -2.6 -6.6
108 -4.6 -3.6 -8.7 1.5 -2.6 0.0 2.6 0.5 5.1 -4.6 -2.5 -7.7
109 -3.6 1.5 -5.6 -5.6 -7.7 -10.8 -7.7 2.5 -0.5 5.1 0.5 4.1
110 0.0 2.1 4.1 0.0 -1.5 -1.5 -8.7 -7.7 -11.3 -8.2 0.5 0.5
111 7.2 6.6 9.7 4.6 9.7 10.2 8.7 -4.1 -7.2 -7.7 3.6 2.5
112 5.1 2.6 2.1 -1.6 5.2 5.1 5.6 1.6 3.6 7.7 1.0 8.8
113 -7.7 -5.1 2.5 -2.5 -8.2 -8.7 -11.8 5.6 16.9 11.3 -1.0 -1.1
114 -7.2 1.5 10.8 2.0 -6.2 -8.2 -1.0 4.6 9.8 10.3 6.1 2.6
115 -2.6 -1.5 -2.1 0.5 5.1 5.7 12.8 -9.2 -22.1 -10.3 1.1 2.6
116 -2.0 -8.2 -9.7 -0.5 3.1 9.2 3.1 -9.2 -16.9 -24.6 -11.8 -11.8
117 0.5 1.5 5.6 0.0 -5.1 -1.5 -3.6 13.3 24.1 6.2 -1.6 -3.1
118 1.5 7.2 6.7 -0.5 1.0 -2.1 2.1 8.7 10.3 18.4 9.3 10.8
119 8.3 6.1 3.1 1.0 8.2 4.1 -0.5 -16.4 -22.6 -6.6 1.5 2.0
120 9.2 5.7 5.6 1.6 2.6 1.0 -10.3 -5.6 2.6 -2.1 2.1 0.0
121 -2.1 -4.1 -1.5 -2.6 -2.1 -3.6 -8.2 13.3 22.5 9.7 7.1 4.1
122 -5.6 -2.6 -2.6 -2.0 -1.0 1.1 6.2 4.6 -2.5 -0.5 -1.0 -0.5
123 4.1 7.2 6.7 2.5 -1.5 2.0 11.2 -7.2 -14.4 -6.1 -1.0 -1.0
124 3.1 1.5 4.6 -1.0 0.0 -1.0 -2.5 -3.5 -2.0 0.0 1.5 1.0
125 -2.1 -2.0 -3.6 -2.1 1.5 1.0 -6.7 -1.1 1.5 0.0 -5.1 -5.1
126 1.6 3.0 -0.5 1.1 0.0 -1.5 1.0 1.6 4.6 0.5 -4.6 0.0
127 -2.1 -1.0 -2.6 1.0 0.5 -6.2 4.1 7.1 12.3 3.6 2.0 3.6
12 13
0 NaN NaN
1 12.8 14.9
2 -3.0 -7.7
3 -5.7 -10.8
4 2.6 1.5
5 4.1 4.1
6 -2.1 1.1
7 -7.7 -2.6
8 1.6 -3.6
9 4.1 0.5
10 -5.7 3.1
11 2.1 2.6
12 8.2 3.6
13 -2.1 -1.6
14 -2.5 -2.0
15 6.1 5.6
16 0.5 0.0
17 -5.1 -11.3
18 -2.0 -6.1
19 1.5 8.7
20 2.1 7.2
21 -0.6 -0.5
22 2.1 3.0
23 2.0 2.6
24 -5.1 -3.6
25 2.1 2.6
26 2.5 3.6
27 -13.8 -12.3
28 -3.6 -4.1
29 11.3 11.2
.. ... ...
98 6.2 1.0
99 12.3 9.7
100 -2.1 2.1
101 -7.2 -6.2
102 3.6 -1.5
103 -4.1 -3.1
104 -12.8 -7.2
105 3.6 0.6
106 13.8 4.6
107 -2.5 -3.1
108 -11.3 -7.2
109 -2.6 -3.1
110 4.1 1.1
111 7.2 4.6
112 4.1 4.1
113 -4.1 -2.1
114 3.6 3.1
115 0.5 -4.1
116 -11.3 -12.3
117 0.0 3.1
118 9.3 11.8
119 0.5 3.6
120 -1.1 1.5
121 2.6 -0.5
122 -2.6 -2.6
123 -1.5 4.1
124 2.1 1.0
125 -1.1 -1.0
126 0.5 5.7
127 0.6 -2.6
[128 rows x 14 columns x 60 samples]]
I have around 60 of these, and then I have another DataFrame, df2, of size (1 x 14).
What I want to do is check whether each value in the row of df2 is equal to, greater than, or less than the corresponding values in the rows of df1, so that it gives me 1, 0 or -1 for each element in that row.
which should look something like this
0 1 0 0 0 0 -1 0 -0....
-1 0 1 1 0 -1 0 0 ....
.
.
can anyone help me with this?
OK I think the following should work:
In [208]:
# load some data
import io
import pandas as pd
import numpy as np
t="""1 13.4 5.2 7.7 -2.1 1.6 -4.1 -0.5 8.2 15.9 12.9 11.8 9.3
2 -3.1 -0.6 -5.1 -0.5 -4.1 0.5 -3.6 -5.6 -9.7 -3.6 -4.7 -5.7
3 -10.3 -1.0 -9.8 0.5 -3.6 1.0 -1.5 -1.6 -5.1 -4.6 -13.3 -10.7
4 0.0 -5.6 -4.1 1.5 3.0 -1.0 2.6 6.7 12.3 6.6 -0.5 1.0
5 6.2 0.5 2.6 1.1 1.6 0.5 2.0 0.0 -0.5 0.5 7.7 5.6
6 -1.6 5.1 6.1 -1.1 -2.1 0.0 -1.5 -9.2 -13.9 -7.1 1.5 -0.5
7 -6.1 -4.1 -1.5 -1.5 -0.5 -0.5 -0.5 -2.6 2.6 -2.6 -6.6 -3.1
8 -0.5 -4.1 -6.7 1.5 0.0 0.5 1.0 8.2 8.7 4.1 -3.1 0.6
9 5.1 7.2 4.6 0.6 1.0 0.0 0.5 -0.5 -7.7 -0.5 5.6 0.5"""
df = pd.read_csv(io.StringIO(t), delim_whitespace=True, header = None, index_col=[0])
df.reset_index(inplace=True, drop=True)
df
Out[208]:
1 2 3 4 5 6 7 8 9 10 11 12
0 13.4 5.2 7.7 -2.1 1.6 -4.1 -0.5 8.2 15.9 12.9 11.8 9.3
1 -3.1 -0.6 -5.1 -0.5 -4.1 0.5 -3.6 -5.6 -9.7 -3.6 -4.7 -5.7
2 -10.3 -1.0 -9.8 0.5 -3.6 1.0 -1.5 -1.6 -5.1 -4.6 -13.3 -10.7
3 0.0 -5.6 -4.1 1.5 3.0 -1.0 2.6 6.7 12.3 6.6 -0.5 1.0
4 6.2 0.5 2.6 1.1 1.6 0.5 2.0 0.0 -0.5 0.5 7.7 5.6
5 -1.6 5.1 6.1 -1.1 -2.1 0.0 -1.5 -9.2 -13.9 -7.1 1.5 -0.5
6 -6.1 -4.1 -1.5 -1.5 -0.5 -0.5 -0.5 -2.6 2.6 -2.6 -6.6 -3.1
7 -0.5 -4.1 -6.7 1.5 0.0 0.5 1.0 8.2 8.7 4.1 -3.1 0.6
8 5.1 7.2 4.6 0.6 1.0 0.0 0.5 -0.5 -7.7 -0.5 5.6 0.5
Now use nested np.where to mask the df, using gt and lt to set 1 and -1 respectively, and 0 when neither condition is met (i.e. the values are equal):
In [213]:
df1 = pd.DataFrame(np.arange(12)).astype(float)
df = pd.DataFrame(np.where(df.gt(df1.squeeze(), axis=0), 1, np.where(df.lt(df1.squeeze(), axis=0), -1, 0)))
df
Out[213]:
0 1 2 3 4 5 6 7 8 9 10 11
0 1 1 1 -1 1 -1 -1 1 1 1 1 1
1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
3 -1 -1 -1 -1 0 -1 -1 1 1 1 -1 -1
4 1 -1 -1 -1 -1 -1 -1 -1 -1 -1 1 1
5 -1 1 1 -1 -1 -1 -1 -1 -1 -1 -1 -1
6 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
7 -1 -1 -1 -1 -1 -1 -1 1 1 -1 -1 -1
8 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
9 0 0 0 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0 0 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0
The above should work so long as the indices and column labels match.
Setup
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.choice(range(10), (20, 10)))
df2 = pd.Series(np.random.choice(range(10), (10,)))
Try:
1 * (df1 > df2) - (df1 < df2)
0 1 2 3 4 5 6 7 8 9
0 -1 1 1 -1 -1 0 -1 -1 1 -1
1 -1 1 -1 -1 0 0 -1 -1 -1 -1
2 -1 1 1 -1 -1 -1 -1 -1 -1 1
3 0 1 1 -1 -1 -1 -1 -1 0 -1
4 -1 1 1 -1 -1 -1 1 -1 -1 -1
5 -1 1 1 -1 -1 -1 1 -1 -1 -1
6 -1 1 1 -1 0 1 -1 -1 -1 -1
7 -1 1 1 0 -1 -1 1 -1 -1 -1
8 -1 1 1 -1 -1 -1 -1 -1 -1 1
9 -1 1 1 -1 -1 -1 0 -1 1 1
10 -1 1 1 -1 -1 -1 -1 -1 -1 1
11 -1 1 1 -1 -1 -1 1 -1 1 0
12 -1 1 0 -1 -1 -1 -1 0 1 -1
13 1 1 1 -1 -1 1 -1 -1 -1 -1
14 -1 1 -1 -1 -1 0 1 -1 1 -1
15 -1 1 1 -1 -1 1 0 -1 -1 -1
16 -1 0 1 1 -1 -1 1 -1 1 0
17 -1 1 1 -1 -1 1 -1 -1 -1 1
18 -1 1 1 -1 -1 -1 -1 -1 1 1
19 -1 1 1 -1 -1 -1 -1 0 -1 -1
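A more compact alternative, offered as a sketch (it assumes numeric data with aligned column labels, and note that NaN values propagate as NaN rather than 0): take the sign of the element-wise difference.
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df1 = pd.DataFrame(np.random.choice(range(10), (20, 10)))
df2 = pd.Series(np.random.choice(range(10), (10,)))

# 1 where df1 > df2, -1 where df1 < df2, 0 where equal
result = np.sign(df1.sub(df2, axis=1)).astype(int)
print(result.head())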

converting daily data temperature into months in pandas

I am trying to convert 10 years (1991-2000) of daily temperature data into monthly data in pandas using Python 2.7. I have taken the data from the web page ("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/201j_EN.htm"), but I have run into trouble. The data looks as follows:
datum      d_ta  d_tx  d_tn  d_rs d_rf  d_ss
---------- ----- ----- ----- ----- ---- -----
1991-01-01 3.0 5.4 1.5 0.2 1 0.0
1991-01-02 4.0 7.2 1.9 0.0 1 6.8
1991-01-03 6.0 8.8 3.6 0.0 1 2.5
1991-01-04 3.7 7.6 2.3 . 2.9
1991-01-05 4.9 7.2 1.5 . 0.0
1991-01-06 2.7 6.2 0.5 . 0.9
1991-01-07 4.0 8.4 1.9 . 3.2
1991-01-08 6.7 8.9 4.6 0.0 0 0.0
1991-01-09 4.1 8.0 3.0 0.3 0 0.0
1991-01-10 4.2 8.1 2.4 0.0 0 0.2
1991-01-11 4.7 6.9 3.6 . 0.7
1991-01-12 7.0 9.8 3.2 . 0.1
1991-01-13 6.3 8.2 4.6 . 0.0
1991-01-14 3.7 6.8 2.2 . 4.7
1991-01-15 0.7 3.4 -1.0 . 7.6
1991-01-16 -1.4 1.4 -3.0 . 7.5
1991-01-17 -2.5 2.1 -5.0 . 8.1
1991-01-18 -1.8 4.0 -5.1 . 7.0
1991-01-19 -3.0 0.1 -4.0 . 5.8
1991-01-20 -2.8 0.5 -5.2 . 5.6
1991-01-21 -5.0 -1.7 -7.8 . 0.0
1991-01-22 -3.3 -1.8 -4.2 . 0.0
1991-01-23 -1.7 0.4 -2.5 . 0.0
1991-01-24 0.0 3.2 -1.6 . 2.2
1991-01-25 1.1 5.1 -0.9 . 6.4
1991-01-26 0.6 4.5 -0.5 . 7.1
1991-01-27 -1.5 2.2 -4.0 . 0.0
1991-01-28 1.3 5.6 -0.8 . 3.8
1991-01-29 0.7 2.6 -0.4 . 1.1
1991-01-30 0.3 4.0 -1.2 . 7.3
1991-01-31 -5.0 -0.2 -7.4 . 8.0
1991-02-01 -8.1 -3.7 -11.7 . 7.6
1991-02-02 -7.0 -2.0 -10.2 . 7.4
1991-02-03 -5.3 0.8 -9.9 . 7.8
1991-02-04 -5.1 -2.3 -7.7 0.1 4 3.7
1991-02-05 -7.5 -4.4 -8.3 . 2.6
1991-02-06 -7.1 -2.2 -11.0 2.0 4 4.9
1991-02-07 -1.8 0.0 -2.7 2.7 4 0.0
1991-02-08 -1.8 0.4 -3.6 21.8 4 0.0
1991-02-09 0.8 2.0 -0.2 1.3 1 0.0
1991-02-10 1.6 3.4 -0.2 3.4 1 0.0
1991-02-11 0.7 2.5 -0.5 1.1 4 0.0
1991-02-12 -0.5 1.2 -1.0 4.7 4 0.0
1991-02-13 -2.0 -0.8 -2.6 0.0 4 0.0
1991-02-14 -1.8 1.4 -3.5 0.1 4 6.3
1991-02-15 -4.2 -0.8 -6.4 . 8.4
1991-02-16 -5.6 -2.4 -9.5 0.1 4 1.5
1991-02-17 -1.3 1.9 -3.8 . 8.3
1991-02-18 -1.3 4.5 -5.5 . 8.5
1991-02-19 -1.5 3.6 -4.7 . 5.8
1991-02-20 -1.4 4.7 -5.4 . 7.3
1991-02-21 1.0 6.1 -2.1 . 6.9
1991-02-22 4.1 10.1 0.5 . 3.2
1991-02-23 5.1 9.7 2.9 . 7.5
1991-02-24 6.0 8.6 5.5 0.0 1 1.8
1991-02-25 3.6 9.2 0.6 . 8.1
1991-02-26 3.9 9.3 1.2 . 2.9
1991-02-27 3.1 6.5 0.3 . 8.8
1991-02-28 1.4 5.3 -2.4 . 4.3
1991-03-01 1.7 3.5 -0.2 . 0.0
1991-03-02 2.4 3.3 1.7 0.8 4 0.0
1991-03-03 3.1 3.8 1.7 . 0.0
1991-03-04 4.3 6.2 2.7 . 1.5
1991-03-05 3.0 5.7 0.6 . 1.2
.........
Can somebody please help me convert it into monthly values? Thanks!
After copying the table into memory starting from the numbers:
import pandas, bs4, requests, itertools, io
html = requests.get("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/201j_EN.htm").text
soup = bs4.BeautifulSoup(html)
# the manual way:
# data = pandas.read_clipboard(names=["datum", "d_ta", "d_tx", "d_tn", "d_rs", "d_rf", "d_ss"], index_col='datum', parse_dates=['datum'])
# the automatic way:
table_html = '\n'.join(itertools.islice(map(lambda _: _.text, soup.find_all("pre")), 3, None))
data = pandas.read_table(io.StringIO(table_html), header=None, sep='\s+', index_col=0, parse_dates=True,
                         names=["datum", "d_ta", "d_tx", "d_tn", "d_rs", "d_rf", "d_ss"])
data.resample('M').mean()
You can, of course, use an aggregation function other than the mean. The output:
d_ta d_tx d_tn d_rf d_ss
datum
1991-01-31 1.345161 4.609677 -0.574194 3.000000 1.583333
1991-02-28 -1.142857 2.592857 -3.639286 5.157143 1.516667
1991-03-31 8.158065 12.093548 5.141935 2.645161 0.775000
1991-04-30 9.920000 14.570000 6.510000 4.066667 4.450000
1991-05-31 13.396774 17.780645 9.738710 4.529032 4.280000
...
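If you want more than the monthly mean, a sketch along these lines (reusing the data frame built above) computes several aggregations at once:
# monthly mean, maximum and minimum of the daily mean temperature d_ta
monthly = data['d_ta'].resample('M').agg(['mean', 'max', 'min'])
print(monthly.head())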
