Selecting data in pandas based on conditions - python
I am importing data from Excel using pandas, and it looks like this:
time Column1 Column2 Column3 ID
0 1.0 181.359 -1.207 9.734 10
1 2.0 181.357 -1.179 9.729 10
2 3.0 181.357 -0.713 9.732 10
3 602.0 179.148 505.520 17.774 1810
4 603.0 179.153 506.824 17.765 1810
5 604.0 179.128 506.169 17.773 1810
6 605.0 179.129 504.141 17.776 1810
7 606.0 179.165 505.214 17.774 1810
8 3003.0 180.032 278.810 17.748 2010
9 3004.0 180.025 279.382 17.749 2010
10 16955.0 450.377 7.271 17.710 4510
11 16956.0 450.375 6.806 17.720 4510
12 16957.0 450.368 7.428 17.710 4510
13 16958.0 450.372 7.892 17.723 4510
14 16959.0 450.359 8.085 17.714 4510
I want to pick up values from Column1, Column2 and Column3 based on a certain value of ID. For example, if I give ID=1810 I should get the values from Column1, Column2 and Column3 corresponding to 1810 (rows 3 to 7). I am using the numpy.where function to get the correct row numbers:

    a = np.where(data['ID'] == 1810)

but I could not find out how to select the column data based on that. Thank you in advance for your help!
Use pandas.DataFrame.loc (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html):

    df.loc[df['ID'] == 1810, ['Column1', 'Column2', 'Column3']]
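A minimal runnable sketch of this (the small frame below is a made-up stand-in for the imported Excel data; column names are taken from the question):

    import numpy as np
    import pandas as pd

    # stand-in for the data imported from Excel
    data = pd.DataFrame({
        'time': [602.0, 603.0, 3003.0],
        'Column1': [179.148, 179.153, 180.032],
        'Column2': [505.520, 506.824, 278.810],
        'Column3': [17.774, 17.765, 17.748],
        'ID': [1810, 1810, 2010],
    })

    # boolean mask on ID, then select only the wanted columns
    subset = data.loc[data['ID'] == 1810, ['Column1', 'Column2', 'Column3']]
    print(subset)
    #    Column1  Column2  Column3
    # 0  179.148  505.520   17.774
    # 1  179.153  506.824   17.765

If you already have row positions from np.where, they can also be fed to .iloc, e.g. data.iloc[np.where(data['ID'] == 1810)[0]], but the boolean mask with .loc is the more direct route.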
Related
Fastest way to access a dataframe cell by column values?
I have the following dataframe:

       time  bk1_lvl0_id  bk2_lvl0_id  pr_ss  order_upto_level  initial_inventory  leadtime1  leadtime2  adjusted_leadtime
    0  2020         1000            3     16                18                 17          3   0.100000                  1
    1  2020        10043            3     65                78                 72         12   0.400000                  1
    2  2020         1005            3      0                 1                  1          9   0.300000                  1
    3  2020         1009            3    325               363                344         21   0.700000                  1
    4  2020          102            3      0                 1                  1          7   0.233333                  1

I want a function that gets the pr_ss for, for example, (bk1_lvl0_id=1000, bk2_lvl0_id=3). This is the code I've tried, but it takes too long:

    def get_safety_stock(df, bk1, bk2):
        # a function that returns the safety stock for any given (bk1, bk2)
        for index, row in df.iterrows():
            if (row["bk1_lvl0_id"] == bk1) and (row["bk2_lvl0_id"] == bk2):
                return int(row["pr_ss"])
If your dataframe has no duplicate values based on bk1_lvl0_id and bk2_lvl0_id, you can write the function as follows:

    def get_safety_stock(df, bk1, bk2):
        return df.loc[df.bk1_lvl0_id.eq(bk1) & df.bk2_lvl0_id.eq(bk2), 'pr_ss'].iloc[0]

Note that this accesses the first value in the Series positionally, which shouldn't be an issue if there are no duplicates in the data. If you want all of them, just remove the .iloc[0] from the end and it will give you the whole Series. This can be called as follows:

    get_safety_stock(df, 1000, 3)
    >>> 16
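If this lookup is done many times, a possible alternative (not from the answer above, just a sketch) is to pay the indexing cost once with set_index and then use label-based access:

    import pandas as pd

    # stand-in rows copied from the question's sample data
    df = pd.DataFrame({
        'bk1_lvl0_id': [1000, 10043, 1005],
        'bk2_lvl0_id': [3, 3, 3],
        'pr_ss': [16, 65, 0],
    })

    # build the MultiIndex once; each subsequent .loc is a fast index lookup
    indexed = df.set_index(['bk1_lvl0_id', 'bk2_lvl0_id'])

    def get_safety_stock(indexed_df, bk1, bk2):
        return int(indexed_df.loc[(bk1, bk2), 'pr_ss'])

    print(get_safety_stock(indexed, 1000, 3))  # 16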
Pandas: calculating the mean value of multiple columns using datetime and Grouper removes columns or doesn't return a correct DataFrame
As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close. The dataframe has many columns, but ultimately I only use the following information:

                Date  Value  Station_Name
    0     2006-01-03     18             2
    1     2006-01-04     12             2
    2     2006-01-05     11             2
    3     2006-01-06     10             2
    4     2006-01-09     22             2
    ...          ...    ...           ...
    3510  2006-12-23     47            45
    3511  2006-12-24     46            45
    3512  2006-12-26     35            45
    3513  2006-12-27     35            45
    3514  2006-12-30     28            45

I am running into two issues, using:

    df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()

It results in something like:

    Station_Name  Date
    2             2003-01-31    29.448387
                  2003-02-28    30.617857
                  2003-03-31    28.758065
                  2003-04-30    28.392593
                  2003-05-31    30.318519
                                  ...
    45            2003-09-30    16.160000
                  2003-10-31    18.906452
                  2003-11-30    26.296667
                  2003-12-31    30.306667
                  2004-01-31    29.330000

which I can't seem to use as a regular dataframe. The datetime is also messed up: instead of the month of the mean, it gives back the last day of the month. Also, the station name is a single index entry, not repeated down the whole column, and the mean value doesn't have a column name at all. This isn't a dataframe but a pandas.core.series.Series. Using the .to_frame() method does turn it into a DataFrame, but it's still not in the shape I want. I don't get this part.

I found that in order to return a normal dataframe, I should use as_index=False in the groupby method. But this results in the months not being shown:

    df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()

gives:

         Station_Name      Value
    0               2  29.448387
    1               2  30.617857
    2               2  28.758065
    3               2  28.392593
    4               2  30.318519
    ..            ...        ...
    142            45  16.160000
    143            45  18.906452
    144            45  26.296667
    145            45  30.306667
    146            45  29.330000

I can't simply add the month back later, as not every station has an observation in every month. I've tried other methods, such as:

    df.resample("M").mean()

but it doesn't seem possible to do this on multiple columns; it returns the mean value of everything.

Edit: this is ultimately what I would want:

         Station_Name     Date      Value
    0               2  2003-01  29.448387
    1               2  2003-02  30.617857
    2               2  2003-03  28.758065
    3               2  2003-04  28.392593
    4               2  2003-05  30.318519
    ..            ...      ...        ...
    142            45  2003-08  16.160000
    143            45  2003-09  18.906452
    144            45  2003-10  26.296667
    145            45  2003-11  30.306667
    146            45  2003-12  29.330000
OK, how about this:

    df = df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value'].mean().reset_index()

Output:

       Station_Name     Date  Value
    0             2  2006-01   14.6
    1            45  2006-12   38.2
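For reference, a minimal self-contained sketch of this approach (the rows and values below are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({
        'Date': pd.to_datetime(['2006-01-03', '2006-01-04', '2006-12-23', '2006-12-24']),
        'Value': [18, 12, 47, 46],
        'Station_Name': [2, 2, 45, 45],
    })

    # group by station and by calendar month; to_period('M') keeps the
    # month label (e.g. 2006-01) instead of the last day of the month
    monthly = (df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value']
                 .mean()
                 .reset_index())
    print(monthly)
    #    Station_Name     Date  Value
    # 0             2  2006-01   15.0
    # 1            45  2006-12   46.5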
How to group by and take the mean of specific columns in Python
In the dataframe below:

    import pandas as pd
    import numpy as np

    df = {
        'Gen': ['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
        'Site': ['FRX','FX','FRX','FRX','FRX','FX','FRX','FX','FX','FX','FX','FRX','FRX','FRX','FRX','FRX'],
        'Type': ['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
        'AIC': ['<1','<1','<1','<1',1,1,1,1,2,2,2,2,'>2','>2','>2','>2'],
        'AIC_TRX': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
        'diff': [-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
        'series': [1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
        'Grwth_Time1': [150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,244.73,269.91,249.19],
        'Grwth_Time2': [250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
        'Grwth_Time3': [240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
        'Grwth_Time4': [270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
        'Grwth_Time5': [25.78,22.34,28.53,27.69,30.07,17.7,29.81,33.15,34.87,32.54,36.59,39.97,29.28,34.73,36.91,34.12],
        'Grwth_Time6': [240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
        'Grwth_Time7': [27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
    }
    df = pd.DataFrame(df, columns=['Gen','Site','Type','AIC','AIC_TRX','diff','series',
                                   'Grwth_Time1','Grwth_Time2','Grwth_Time3','Grwth_Time4',
                                   'Grwth_Time5','Grwth_Time6','Grwth_Time7'])
    df.info()

I want to do the following:

1. Find the average of each unique series per AIC_TRX for each Grwth_Time (Grwth_Time1, Grwth_Time2, ..., Grwth_Time7).
2. Export all the outputs as one xlsx file (refer to the figure below).

The desired outputs look like the figure below (note: the numbers in this output are not the actual average values; they were randomly generated).

My attempt:

    # Select the columns -> AIC_TRX, series, Grwth_Time1, Grwth_Time2, ..., Grwth_Time7
    df1 = df[['AIC_TRX', 'diff', 'series', 'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
              'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']]

    # Below is where I need help: I want to group by 'series' and 'AIC_TRX'
    # for all of Grwth_Time1 to Grwth_Time7
    df1.groupby('series').Grwth_Time1.agg(['mean'])

Thanks in advance
You have to group by two columns, ['series', 'AIC_TRX'], and find the mean of each Grwth_Time:

    df.groupby(['series', 'AIC_TRX'])[['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
                                       'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6',
                                       'Grwth_Time7']].mean().unstack().to_excel("output.xlsx")

Output:

    AIC_TRX       1       2       3       4
    series
    1        150.78  208.07  146.87  229.28
    2        162.34  217.76  182.54  244.73
    4        188.53  229.48  189.57  269.91
    8        197.69  139.51  199.97  249.19

    AIC_TRX       1       2       3       4
    series
    1        250.78  308.07  346.87  329.28
    2        262.34  317.70  382.54  347.73
    4        288.53  329.81  369.59  369.91
    8        297.69  339.15  399.97  349.12

    AIC_TRX       1       2       3       4
    series
    1        240.18  338.07  365.87  429.08
    2        232.14  307.74  392.48  448.39
    4        258.53  359.16  399.97  465.15
    8        276.69  339.25  410.75  469.33

    AIC_TRX       1       2       3       4
    series
    1        270.84  318.73  398.75  494.85
    2        282.14  327.47  432.18  509.39
    4        298.53  369.63  449.78  515.52
    8        306.69  389.59  473.55  539.23

    AIC_TRX       1       2       3       4
    series
    1         25.78   30.07   34.87   29.28
    2         22.34   17.70   32.54   34.73
    4         28.53   29.81   36.59   36.91
    8         27.69   33.15   39.97   34.12

    AIC_TRX       1       2       3       4
    series
    1        240.18  338.07  365.87  429.08
    2        232.14  307.74  392.48  448.39
    4        258.53  359.16  399.97  465.15
    8        276.69  339.25  410.75  469.33

    AIC_TRX       1       2       3       4
    series
    1         27.84   18.73   38.75   13.85
    2         28.14   27.47   24.18    9.39
    4         29.53   36.63   24.78   15.52
    8         30.69   38.59   21.55   39.23
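If you would rather have each Grwth_Time table on its own sheet (closer to the figure in the question), one possible variation is sketched below; it reuses df from the question and assumes an xlsx engine such as openpyxl is installed:

    import pandas as pd

    # one mean per (series, AIC_TRX) pair, for all seven Grwth_Time columns
    means = df.groupby(['series', 'AIC_TRX'])[[f'Grwth_Time{i}' for i in range(1, 8)]].mean()

    with pd.ExcelWriter('output.xlsx') as writer:
        for col in means.columns:
            # one 4x4 (series x AIC_TRX) table per sheet
            means[col].unstack().to_excel(writer, sheet_name=col)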
Just use the apply method to average across the columns of each row, within each series and AIC_TRX grouping:

    result = df1.groupby(['series', 'AIC_TRX']).apply(np.mean, axis=1)

Result:

    series  AIC_TRX
    1       1        0     120.738
            2        4     156.281
            3        8     170.285
            4        12    196.270
    2       1        1     122.358
            2        5     152.758
            3        9     184.494
            4        13    205.175
    4       1        2     135.471
            2        6     171.968
            3        10    187.825
            4        14    214.907
    8       1        3     142.183
            2        7     162.849
            3        11    196.851
            4        15    216.455
    dtype: float64
How to calculate the expanding mean of all the columns across the DataFrame and add them to the DataFrame
I am trying to calculate the means of all previous rows for each column of the DataFrame and add the calculated mean columns to the DataFrame. I am using a set of NBA games data that contains 20+ feature columns that I want the means for. An example of the dataset is below ("...." represents the rest of the feature columns):

    Team  TeamPoints  OpponentPoints ....  TeamPoints_mean  OpponentPoints_mean
    ATL          102             109 ....              nan                  nan
    ATL          102              92 ....              102                  109
    ATL           92              94 ....              102                100.5
    BOS          119             122 ....            98.67                98.33
    BOS          103              96 ....           103.75               104.25

Example for calculating two of the columns:

    dataset = pd.read_csv('nba.games.stats.csv')
    df = dataset
    df['Game_mean'] = (df.groupby('Team')['TeamPoints']
                         .apply(lambda x: x.shift().expanding().mean()))
    df['TeamPoints_mean'] = (df.groupby('Team')['OpponentsPoints']
                               .apply(lambda x: x.shift().expanding().mean()))

Again, this code only calculates the mean and adds the column to the DataFrame one at a time. Is there a way to get the column means and add them to the DataFrame without doing it one at a time? A for loop? An example of what I am looking for is below ("..." = mean columns for the rest of the feature columns):

    Team  TeamPoints  OpponentPoints ....  TeamPoints_mean  OpponentPoints_mean  ...
    ATL          102             109 ....              nan                  nan
    ATL          102              92 ....              102                  109
    ATL           92              94 ....              102                100.5
    BOS          119             122 ....            98.67                98.33
    BOS          103              96 ....           103.75               104.25
Try this one:

(0) sample input:

    >>> df
           col1      col2      col3
    0  1.490977  1.784433  0.852842
    1  3.726663  2.845369  7.766797
    2  0.042541  1.196383  6.568839
    3  4.784911  0.444671  8.019933
    4  3.831556  0.902672  0.198920
    5  3.672763  2.236639  1.528215
    6  0.792616  2.604049  0.373296
    7  2.281992  2.563639  1.500008
    8  4.096861  0.598854  4.934116
    9  3.632607  1.502801  0.241920

Then processing:

(1) a side table to get all the means (I didn't find a cumulative mean function, so I went with cumsum + count):

    >>> df_side = df.assign(col_temp=1).cumsum()
    >>> df_side
            col1       col2       col3  col_temp
    0   1.490977   1.784433   0.852842       1.0
    1   5.217640   4.629801   8.619638       2.0
    2   5.260182   5.826184  15.188477       3.0
    3  10.045093   6.270855  23.208410       4.0
    4  13.876649   7.173527  23.407330       5.0
    5  17.549412   9.410166  24.935545       6.0
    6  18.342028  12.014215  25.308841       7.0
    7  20.624021  14.577855  26.808849       8.0
    8  24.720882  15.176708  31.742965       9.0
    9  28.353489  16.679509  31.984885      10.0
    >>> for el in df.columns:
    ...     df_side["{}_mean".format(el)] = df_side[el] / df_side.col_temp
    >>> df_side = df_side.drop([el for el in df.columns] + ["col_temp"], axis=1)
    >>> df_side
       col1_mean  col2_mean  col3_mean
    0   1.490977   1.784433   0.852842
    1   2.608820   2.314901   4.309819
    2   1.753394   1.942061   5.062826
    3   2.511273   1.567714   5.802103
    4   2.775330   1.434705   4.681466
    5   2.924902   1.568361   4.155924
    6   2.620290   1.716316   3.615549
    7   2.578003   1.822232   3.351106
    8   2.746765   1.686301   3.526996
    9   2.835349   1.667951   3.198489

(2) joining back, on index:

    >>> df_final = df.join(df_side)
    >>> df_final
           col1      col2      col3  col1_mean  col2_mean  col3_mean
    0  1.490977  1.784433  0.852842   1.490977   1.784433   0.852842
    1  3.726663  2.845369  7.766797   2.608820   2.314901   4.309819
    2  0.042541  1.196383  6.568839   1.753394   1.942061   5.062826
    3  4.784911  0.444671  8.019933   2.511273   1.567714   5.802103
    4  3.831556  0.902672  0.198920   2.775330   1.434705   4.681466
    5  3.672763  2.236639  1.528215   2.924902   1.568361   4.155924
    6  0.792616  2.604049  0.373296   2.620290   1.716316   3.615549
    7  2.281992  2.563639  1.500008   2.578003   1.822232   3.351106
    8  4.096861  0.598854  4.934116   2.746765   1.686301   3.526996
    9  3.632607  1.502801  0.241920   2.835349   1.667951   3.198489
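As an aside, pandas does ship a cumulative mean: expanding().mean(). A minimal sketch of the same side-table idea using it (random sample data):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(10, 3), columns=['col1', 'col2', 'col3'])

    # expanding().mean() is the running mean of each column over all rows so far,
    # so the cumsum/count bookkeeping above can be replaced by a single call
    df_final = df.join(df.expanding().mean().add_suffix('_mean'))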
"I am trying to calculate the means of all previous rows for each column of the DataFrame"

To get all of the columns, you can do:

    df_means = df.join(df.cumsum() / df.applymap(lambda x: 1).cumsum(), rsuffix="_mean")

However, if Team is a column rather than the index, you'd want to get rid of it first:

    df_data = df.drop('Team', axis=1)
    df_means = df.join(df_data.cumsum() / df_data.applymap(lambda x: 1).cumsum(), rsuffix="_mean")

You could also do:

    import numpy as np
    df_data = df[[col for col in df.columns if np.issubdtype(df[col].dtype, np.number)]]

or manually define a list of the columns you want to take the mean of, cols_for_mean, and then do:

    df_data = df[cols_for_mean]
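For the per-team, previous-rows-only version the question actually asks for, here is a sketch that handles every numeric column at once (the tiny frame below is made-up stand-in data):

    import pandas as pd

    df = pd.DataFrame({
        'Team': ['ATL', 'ATL', 'ATL', 'BOS', 'BOS'],
        'TeamPoints': [102, 102, 92, 119, 103],
        'OpponentPoints': [109, 92, 94, 122, 96],
    })

    num_cols = list(df.select_dtypes('number').columns)

    # per team: mean of all previous rows, for every numeric column in one go
    means = (df.groupby('Team')[num_cols]
               .transform(lambda x: x.shift().expanding().mean())
               .add_suffix('_mean'))

    df = df.join(means)
    print(df)
    #   Team  TeamPoints  OpponentPoints  TeamPoints_mean  OpponentPoints_mean
    # 0  ATL         102             109              NaN                  NaN
    # 1  ATL         102              92            102.0                109.0
    # 2  ATL          92              94            102.0                100.5
    # 3  BOS         119             122              NaN                  NaN
    # 4  BOS         103              96            119.0                122.0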
Data sorting from Excel sheet
I tried to sort the values of a particular column in a data frame. The values are sorted, but the index values do not change; I want the index values to change as well, according to the sorted data.

    rld = pd.read_excel(r"C:\Users\DELL\nagrajun sagar reservoir data - Copy.xlsx")
    rl = rld.iloc[:, 1].sort_values()
    rl

Output:

    15    0.043
    3     0.370
    17    0.391
    2     0.823
    16    1.105
    1     1.579
    0     2.070
    12    2.235
    4     2.728
    18    4.490
    9     4.905
    13    5.036
    14    5.074
    11    6.481
    10    6.613
    6     6.806
    7     6.807
    8     6.824
    5     6.841
    Name: 2 October, dtype: float64

    rl[0]

Output:

    2.07

I expected rl[0] to be 0.043, but the actual result is 2.07, which is the value at index 0 of the list before sorting...
I suppose you can try reset_index() with drop=True. Something like:

    rl = rl.reset_index(drop=True)

in your case, or you can do it while sorting:

    rl = rld.iloc[:, 1].sort_values().reset_index(drop=True)
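Alternatively, if the goal is just "the first value after sorting" and you want to keep the original index, positional access sidesteps the problem entirely. A small sketch with made-up numbers:

    import pandas as pd

    rl = pd.Series([2.070, 1.579, 0.823, 0.370], name='2 October').sort_values()

    rl[0]       # label-based lookup: still 2.07, the value originally at index 0
    rl.iloc[0]  # position-based lookup: 0.37, the smallest value after sorting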