I am attempting to apply several operations that I usually do easily in R to the sample dataset below, using Python/Pandas.
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10
QUER.MAC 9 8 3 5 6 0 5 0 0 0
QUER.VEL 8 9 8 7 0 0 0 0 0 0
CARY.OVA 6 6 2 7 0 2 0 0 0 0
PRUN.SER 3 5 6 6 6 4 5 0 4 1
QUER.ALB 5 4 9 9 7 7 4 6 0 2
JUGL.NIG 2 0 0 0 3 5 6 4 3 0
QUER.RUB 3 4 0 6 9 8 7 6 4 3
JUGL.CIN 0 0 5 0 2 0 0 2 0 2
ULMU.AME 2 2 4 5 6 0 5 0 2 5
TILI.AME 0 0 0 0 2 7 6 6 7 6
ULMU.RUB 4 0 2 2 5 7 8 8 8 7
CARY.COR 0 0 0 0 0 5 6 4 0 3
OSTR.VIR 0 0 0 0 0 0 7 4 6 5
ACER.SAC 0 0 0 0 0 5 4 8 8 9
After reading the data from a text file with
import numpy as np
import pandas as pd
df = pd.read_csv("sample.txt", header=0, index_col=0, delimiter=' ')
I want to: (1) get the frequency of values larger than zero for each column; (2) get the sum of values in each column; (3) find the maximum value in each column.
I managed to obtain (2) using
N = df.apply(lambda x: np.sum(x))
But could not figure out how to achieve (1) and (3).
I need generic solutions, that are not dependent on the names of the columns, because I want to apply these operations on any number of similar matrices (which of course will have different labels and numbers of columns/rows).
Thanks in advance for any hints and suggestions.
Your 1st:
df.gt(0).sum()
Your 2nd:
df.sum()
Your 3rd:
df.max()
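If you want all three at once, a small sketch that collects them into a single summary frame; it stays generic because each statistic is computed per column, whatever the column labels are:
summary = pd.DataFrame({'freq_gt0': df.gt(0).sum(),  # (1) count of values > 0 per column
                        'sum': df.sum(),             # (2) column sums
                        'max': df.max()})            # (3) column maxima
print(summary)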
You can use mask and describe to get a bunch of stats by column.
df.mask(df <= 0).describe().T
Output:
count mean std min 25% 50% 75% max
S1 9.0 4.666667 2.549510 2.0 3.00 4.0 6.00 9.0
S2 7.0 5.428571 2.439750 2.0 4.00 5.0 7.00 9.0
S3 8.0 4.875000 2.642374 2.0 2.75 4.5 6.50 9.0
S4 8.0 5.875000 2.031010 2.0 5.00 6.0 7.00 9.0
S5 9.0 5.111111 2.368778 2.0 3.00 6.0 6.00 9.0
S6 9.0 5.555556 1.878238 2.0 5.00 5.0 7.00 8.0
S7 11.0 5.727273 1.272078 4.0 5.00 6.0 6.50 8.0
S8 9.0 5.333333 2.000000 2.0 4.00 6.0 6.00 8.0
S9 8.0 5.250000 2.314550 2.0 3.75 5.0 7.25 8.0
S10 10.0 4.300000 2.540779 1.0 2.25 4.0 5.75 9.0
The reason to use mask is that count counts all non-NaN values, so masking anything that is less than or equal to 0 turns those values into NaN, which count then ignores.
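A quick way to see this on the same df (a minimal check, just comparing count with and without the mask):
print(df.count())                # counts every non-NaN value, zeros included
print(df.mask(df <= 0).count())  # zeros become NaN first, so only values > 0 are counted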
And, finally, we can add "sum" too, using assign:
df.mask(df<=0).describe().T.assign(sum=df.sum())
Output:
count mean std min 25% 50% 75% max sum
S1 9.0 4.666667 2.549510 2.0 3.00 4.0 6.00 9.0 42
S2 7.0 5.428571 2.439750 2.0 4.00 5.0 7.00 9.0 38
S3 8.0 4.875000 2.642374 2.0 2.75 4.5 6.50 9.0 39
S4 8.0 5.875000 2.031010 2.0 5.00 6.0 7.00 9.0 47
S5 9.0 5.111111 2.368778 2.0 3.00 6.0 6.00 9.0 46
S6 9.0 5.555556 1.878238 2.0 5.00 5.0 7.00 8.0 50
S7 11.0 5.727273 1.272078 4.0 5.00 6.0 6.50 8.0 63
S8 9.0 5.333333 2.000000 2.0 4.00 6.0 6.00 8.0 48
S9 8.0 5.250000 2.314550 2.0 3.75 5.0 7.25 8.0 42
S10 10.0 4.300000 2.540779 1.0 2.25 4.0 5.75 9.0 43
I have the following dataset:
my_df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9],
'machine':['A','A','A','B','B','A','B','B','A'],
'prod':['button','tack','pin','button','tack','pin','clip','clip','button'],
'qty':[100,50,30,70,60,15,200,180,np.nan],
'hours':[4,3,1,3,2,0.5,5,6,np.nan],
'day':[1,1,1,1,1,1,2,2,2]})
my_df['prod_rate']=my_df['qty']/my_df['hours']
my_df
id machine prod qty hours day prod_rate
0 1 A button 100.0 4.0 1 25.000000
1 2 A tack 50.0 3.0 1 16.666667
2 3 A pin 30.0 1.0 1 30.000000
3 4 B button 70.0 3.0 1 23.333333
4 5 B tack 60.0 2.0 1 30.000000
5 6 A pin 15.0 0.5 1 30.000000
6 7 B clip 200.0 5.0 2 40.000000
7 8 B clip 180.0 6.0 2 30.000000
8 9 A button NaN NaN 2 NaN
And I want to count the daily activities of each machine, except when there is a NaN (which means that the machine was down due to a failure).
I tried this code:
my_df['activities']=my_df.groupby(['day','machine'])['machine']\
.transform(lambda x: x['machine'].count() if x['qty'].notna() else np.nan)
But it gives me an error: KeyError: 'qty'
This is the expected result:
id machine prod qty hours day prod_rate activities
0 1 A button 100.0 4.0 1 25.000000 4
1 2 A tack 50.0 3.0 1 16.666667 4
2 3 A pin 30.0 1.0 1 30.000000 4
3 4 B button 70.0 3.0 1 23.333333 2
4 5 B tack 60.0 2.0 1 30.000000 2
5 6 A pin 15.0 0.5 1 30.000000 4
6 7 B clip 200.0 5.0 2 40.000000 2
7 8 B clip 180.0 6.0 2 30.000000 2
8 9 A button NaN NaN 2 NaN NaN
Please, could you help me fix my lambda expression? It will help me for this question and for other operations too.
Although I prefer the solution from @steele-farnsworth, here is what the OP requested. For the lambda to work:
my_df['activities'] = my_df.groupby(['day','machine'])['qty']\
.transform(lambda x: x.count() if x.notna().all() else np.nan)
print(my_df)
Prints
id machine prod qty hours day prod_rate activities
0 1 A button 100.0 4.0 1 25.000000 4.0
1 2 A tack 50.0 3.0 1 16.666667 4.0
2 3 A pin 30.0 1.0 1 30.000000 4.0
3 4 B button 70.0 3.0 1 23.333333 2.0
4 5 B tack 60.0 2.0 1 30.000000 2.0
5 6 A pin 15.0 0.5 1 30.000000 4.0
6 7 B clip 200.0 5.0 2 40.000000 2.0
7 8 B clip 180.0 6.0 2 30.000000 2.0
8 9 A button NaN NaN 2 NaN NaN
You can do the calculation as normal, and then fill in the NaNs where they are wanted afterwards.
>>> my_df['activities'] = my_df.groupby(['day', 'machine'])['machine'].transform('count')
>>> my_df.loc[my_df['qty'].isna(), 'activities'] = np.NaN
>>> my_df
id machine prod qty hours day prod_rate activities
0 1 A button 100.0 4.0 1 25.000000 4.0
1 2 A tack 50.0 3.0 1 16.666667 4.0
2 3 A pin 30.0 1.0 1 30.000000 4.0
3 4 B button 70.0 3.0 1 23.333333 2.0
4 5 B tack 60.0 2.0 1 30.000000 2.0
5 6 A pin 15.0 0.5 1 30.000000 4.0
6 7 B clip 200.0 5.0 2 40.000000 2.0
7 8 B clip 180.0 6.0 2 30.000000 2.0
8 9 A button NaN NaN 2 NaN NaN
You should avoid using lambdas as much as possible in the context of Pandas, as they are not vectorized (and will therefore run slower) and are less communicative than using existing, idiomatic Pandas methods.
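If you want to verify the speed difference on your own data, a rough timing sketch (the numbers vary with data size and machine, and my_df above is tiny, so the gap only becomes meaningful on larger frames; note the vectorized version still needs the NaN fix shown above):
import timeit
lambda_way = lambda: my_df.groupby(['day', 'machine'])['qty'] \
    .transform(lambda x: x.count() if x.notna().all() else np.nan)
vector_way = lambda: my_df.groupby(['day', 'machine'])['machine'].transform('count')
print(timeit.timeit(lambda_way, number=100))   # lambda-based transform
print(timeit.timeit(vector_way, number=100))   # built-in 'count' transform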
I would like to subtract [a groupby mean of subset] from the [original] dataframe:
I have a pandas DataFrame data whose index is a monthly datetime (say 100 years = 100 yr * 12 mo) and whose 10 columns are station IDs (i.e., a 1200-row * 10-column DataFrame).
1)
I would like to first take a subset of the above data, e.g. the first 50 years (i.e., 50 yr * 12 mo),
data_sub = data_org[data_org.index.year <= top_50_year]
and calculate the monthly mean for each month and each station (column), e.g.,
mean_sub = data_sub.groupby(data_sub.index.month).mean()
or
mean_sub = data_sub.groupby(data_sub.index.month).transform('mean')
which seem to do the job.
2)
Now I want to subtract the above from the [original], NOT from the [subset], e.g.,
data_org - mean_sub
which I do not know how to do. So in summary, I would like to calculate the monthly mean from a subset of the original data (e.g., using only 50 years), and subtract that monthly mean from the original data month by month.
It was easy to subtract when I was using the full [original] data to calculate the mean (i.e., .transform('mean') or .apply(lambda x: x - x.mean()) do the job), but what should I do when the mean is calculated from a [subset] of the data?
Could you share your insight for this problem? Thank you in advance!
@mozway
The input (and also the output) shape looks like the following:
[image: input shape with random values]
Only the values of the output are the anomalies from the [subset]'s monthly mean. Thank you.
One idea is to replace the non-matching values with NaN using DataFrame.where; after GroupBy.transform you then get the same index as the original DataFrame, so the subtraction is possible:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10,3)),
index=pd.date_range('2000-01-01',periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data1 = data_org.where(data_org.index.to_series().dt.year <= top_50_year)
print (data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 NaN NaN NaN
2001-04-30 NaN NaN NaN
2001-07-31 NaN NaN NaN
2001-10-31 NaN NaN NaN
2002-01-31 NaN NaN NaN
2002-04-30 NaN NaN NaN
mean_data1 = data1.groupby(data1.index.month).transform('mean')
print (mean_data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 2.0 2.0 6.0
2001-04-30 1.0 3.0 9.0
2001-07-31 6.0 1.0 0.0
2001-10-31 1.0 9.0 0.0
2002-01-31 2.0 2.0 6.0
2002-04-30 1.0 3.0 9.0
df = data_org - mean_data1
print (df)
0 1 2
2000-01-31 0.0 0.0 0.0
2000-04-30 0.0 0.0 0.0
2000-07-31 0.0 0.0 0.0
2000-10-31 0.0 0.0 0.0
2001-01-31 -2.0 7.0 -3.0
2001-04-30 3.0 -3.0 -9.0
2001-07-31 -2.0 0.0 7.0
2001-10-31 2.0 -7.0 4.0
2002-01-31 5.0 0.0 -2.0
2002-04-30 7.0 -3.0 -2.0
Another idea with filtering:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10,3)),
index=pd.date_range('2000-01-01',periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]
print (data_sub)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
mean_sub = data_sub.groupby(data_sub.index.month).mean()
print (mean_sub)
0 1 2
1 2 2 6
4 1 3 9
7 6 1 0
10 1 9 0
Create a new column m for the months:
data_org['m'] = data_org.index.month
print (data_org)
0 1 2 m
2000-01-31 2 2 6 1
2000-04-30 1 3 9 4
2000-07-31 6 1 0 7
2000-10-31 1 9 0 10
2001-01-31 0 9 3 1
2001-04-30 4 0 0 4
2001-07-31 4 1 7 7
2001-10-31 3 2 4 10
2002-01-31 7 2 4 1
2002-04-30 8 0 7 4
Then merge mean_sub onto this column with DataFrame.join:
mean_data1 = data_org[['m']].join(mean_sub, on='m')
print (mean_data1)
m 0 1 2
2000-01-31 1 2 2 6
2000-04-30 4 1 3 9
2000-07-31 7 6 1 0
2000-10-31 10 1 9 0
2001-01-31 1 2 2 6
2001-04-30 4 1 3 9
2001-07-31 7 6 1 0
2001-10-31 10 1 9 0
2002-01-31 1 2 2 6
2002-04-30 4 1 3 9
df = data_org - mean_data1
print (df)
0 1 2 m
2000-01-31 0 0 0 0
2000-04-30 0 0 0 0
2000-07-31 0 0 0 0
2000-10-31 0 0 0 0
2001-01-31 -2 7 -3 0
2001-04-30 3 -3 -9 0
2001-07-31 -2 0 7 0
2001-10-31 2 -7 4 0
2002-01-31 5 0 -2 0
2002-04-30 7 -3 -2 0
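If you don't want the helper column in the final result, you can drop m after the subtraction (a small follow-up to the above):
df = (data_org - mean_data1).drop(columns='m')   # the m column subtracts to 0, so it is safe to drop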
Given df
df = pd.DataFrame({'distance': [0,1,2,np.nan,3,4,5,np.nan,np.nan,6]})
distance
0 0.0
1 1.0
2 2.0
3 NaN
4 3.0
5 4.0
6 5.0
7 NaN
8 NaN
9 6.0
I want to replace the NaNs with the in-between mean.
Expected output:
distance
0 0.0
1 1.0
2 2.0
3 2.5
4 3.0
5 4.0
6 5.0
7 5.5
8 5.5
9 6.0
I have seen this_answer, but it's for a grouping, which isn't my case, and I couldn't find anything else.
If you don't want df.interpolate, you can compute the mean of the surrounding values manually with df.bfill and df.ffill:
(df.ffill() + df.bfill()) / 2
Out:
distance
0 0.0
1 1.0
2 2.0
3 2.5
4 3.0
5 4.0
6 5.0
7 5.5
8 5.5
9 6.0
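To write the result back into the frame, assign it to the column (the same expression works on the single Series):
df['distance'] = (df['distance'].ffill() + df['distance'].bfill()) / 2  # each NaN becomes the mean of its nearest filled neighbours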
How about using linear interpolation?
print(df.distance.interpolate())
0 0.000000
1 1.000000
2 2.000000
3 2.500000
4 3.000000
5 4.000000
6 5.000000
7 5.333333
8 5.666667
9 6.000000
Name: distance, dtype: float64
This is my dataframe:
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
id value
0 1 5
1 1 6
2 1 NaN
3 2 NaN
4 2 8
5 2 4
6 2 NaN
7 2 10
8 3 NaN
This is my expected output:
id value
0 1 5
1 1 6
2 1 7
3 2 NaN
4 2 8
5 2 4
6 2 2
7 2 10
8 3 NaN
This is my current output using this code:
df.value.interpolate(method="krogh")
0 5.000000
1 6.000000
2 9.071429
3 10.171429
4 8.000000
5 4.000000
6 2.357143
7 10.000000
8 36.600000
Basically, I want to do two important things here:
Group by id, then interpolate using only the values above each row, not the values below it.
This should do the trick:
df["value_interp"]=df.value.combine_first(df.groupby("id")["value"].apply(lambda y: y.expanding().apply(lambda x: x.interpolate(method="krogh").to_numpy()[-1], raw=False)))
Outputs:
id value value_interp
0 1.0 5.0 5.0
1 1.0 6.0 6.0
2 1.0 NaN 7.0
3 2.0 NaN NaN
4 2.0 8.0 8.0
5 2.0 4.0 4.0
6 2.0 NaN 0.0
7 2.0 10.0 10.0
8 3.0 NaN NaN
(It interpolates based only on the previous values within the group, hence index 6 returns 0, not 2.)
You can group by id and then loop over the groups to interpolate each one. For id = 2, the interpolation will not give you the value 2:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
data = []
for name, group in df.groupby('id'):
group_interpolation = group.interpolate(method='krogh', limit_direction='forward', axis=0)
data.append(group_interpolation)
df = (pd.concat(data)).round(1)
Output:
id value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 4.7
7 2.0 10.0
8 3.0 NaN
The current pandas.Series.interpolate does not support what you want, so to achieve your goal you need two groupbys that account for your desire to use only previous rows. The idea is to put each missing value into a group together with only the rows before it (this may have limitations if you have several missing values in a row, but it works well for your toy example).
Suppose we have a df:
print(df)
ID Value
0 1 5.0
1 1 6.0
2 1 NaN
3 2 NaN
4 2 8.0
5 2 4.0
6 2 NaN
7 2 10.0
8 3 NaN
Then we will combine any missing values within a group with previous rows:
df["extrapolate"] = df.groupby("ID")["Value"].apply(lambda grp: grp.isnull().cumsum().shift().bfill())
print(df)
ID Value extrapolate
0 1 5.0 0.0
1 1 6.0 0.0
2 1 NaN 0.0
3 2 NaN 1.0
4 2 8.0 1.0
5 2 4.0 1.0
6 2 NaN 1.0
7 2 10.0 2.0
8 3 NaN NaN
You can see that, when grouped by ["ID", "extrapolate"], each missing value falls into the same group as the non-null values from the previous rows.
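If you want to inspect the grouping yourself, a quick sketch (purely for checking, not part of the solution):
for key, grp in df.groupby(["ID", "extrapolate"]):
    print(key, grp.index.tolist())   # each missing value shares its key with the rows before it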
Now we are ready to do the extrapolation (with a spline of order=1):
df.groupby(["ID","extrapolate"], as_index=False).apply(lambda grp:grp.interpolate(method="spline",order=1)).drop("extrapolate", axis=1)
ID Value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 0.0
7 2.0 10.0
8 NaN NaN
Hope this helps.
I have a pandas dataframe with 11 columns. I want to add the sums of all values in column 9 and column 10 to the end of the table. So far I have tried 2 methods:
Assigning the data to a cell with dataframe.iloc[rownumber, 8]. This results in an out-of-bounds error.
Creating a vector padded with blank ' ' entries, using the following code:
total = ['', '', '', '', '', '', '', '', dataframe['Column 9'].sum(), dataframe['Column 10'].sum(), '']
dataframe = dataframe.append(total)
The result was not nice, as it added the totals as a vertical vector at the end rather than as a horizontal row. What can I do to solve this?
You need to use pandas.DataFrame.append with ignore_index=True,
so use:
dataframe=dataframe.append(dataframe[['Column 9','Column 10']].sum(),ignore_index=True).fillna('')
Example:
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['col1']=[1,2,3,4]
df['col2']=[2,3,4,5]
df['col3']=[5,6,7,8]
df['col4']=[5,6,7,8]
Using Append:
df=df.append(df[['col2','col3']].sum(),ignore_index=True)
print(df)
col1 col2 col3 col4
0 1.0 2.0 5.0 5.0
1 2.0 3.0 6.0 6.0
2 3.0 4.0 7.0 7.0
3 4.0 5.0 8.0 8.0
4 NaN 14.0 26.0 NaN
Without NaN values:
df=df.append(df[['col2','col3']].sum(),ignore_index=True).fillna('')
print(df)
col1 col2 col3 col4
0 1 2.0 5.0 5
1 2 3.0 6.0 6
2 3 4.0 7.0 7
3 4 5.0 8.0 8
4 14.0 26.0
Create a new DataFrame with the sums. This example DataFrame has columns 'a' and 'b'; df1 is the DataFrame that needs to be summed up and df3 is a one-line DataFrame containing only the sums:
data = [[df1.a.sum(),df1.b.sum()]]
df3 = pd.DataFrame(data,columns=['a','b'])
Then append it to the end:
df1.append(df3)
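A self-contained version of the same idea, with df1 as a hypothetical two-column frame (note that DataFrame.append is deprecated in recent pandas; pd.concat([df1, df3]) does the same job there):
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})                  # stand-in data
df3 = pd.DataFrame([[df1.a.sum(), df1.b.sum()]], columns=['a', 'b'])  # one row of sums
print(df1.append(df3))   # or: pd.concat([df1, df3])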
Simply try this (replace test with your dataframe name).
Row-wise sum (which you asked for):
test['Total'] = test[['col9','col10']].sum(axis=1)
print(test)
Column-wise sum:
test.loc['Total'] = test[['col9','col10']].sum()
test.fillna('',inplace=True)
print(test)
IIUC, this is what you need (change the numbers 8 and 9 to suit your needs):
df['total']=df.iloc[ : ,[8,9]].sum(axis=1) #horizontal sum
df['total1']=df.iloc[ : ,[8,9]].sum().sum() #Vertical sum
df.loc['total2']=df.iloc[ : ,[8,9]].sum() # vertical sum in rows for only columns 8 & 9
Example
a=np.arange(0, 11, 1)
b=np.random.randint(10, size=(5,11))
df=pd.DataFrame(columns=a, data=b)
0 1 2 3 4 5 6 7 8 9 10
0 0 5 1 3 4 8 6 6 8 1 0
1 9 9 8 9 9 2 3 8 9 3 6
2 5 7 9 0 8 7 8 8 7 1 8
3 0 7 2 8 8 3 3 0 4 8 2
4 9 9 2 5 2 2 5 0 3 4 1
Output:
0 1 2 3 4 5 6 7 8 9 10 total total1
0 0.0 5.0 1.0 3.0 4.0 8.0 6.0 6.0 8.0 1.0 0.0 9.0 48.0
1 9.0 9.0 8.0 9.0 9.0 2.0 3.0 8.0 9.0 3.0 6.0 12.0 48.0
2 5.0 7.0 9.0 0.0 8.0 7.0 8.0 8.0 7.0 1.0 8.0 8.0 48.0
3 0.0 7.0 2.0 8.0 8.0 3.0 3.0 0.0 4.0 8.0 2.0 12.0 48.0
4 9.0 9.0 2.0 5.0 2.0 2.0 5.0 0.0 3.0 4.0 1.0 7.0 48.0
total2 NaN NaN NaN NaN NaN NaN NaN NaN 31.0 17.0 NaN NaN NaN