Pandas: add column based on groupby with condition - python

I have a dataframe with four columns: id1, id2, age, stime. For example
df = pd.DataFrame(np.array([[1, 1, 3, pd.to_datetime('2020-01-10 00:30:16')],
                            [2, 1, 10, pd.to_datetime('2020-01-27 00:20:20')],
                            [3, 1, 60, pd.to_datetime('2020-01-26 00:10:08')],
                            [4, 2, 1, pd.to_datetime('2020-01-13 00:20:19')],
                            [5, 2, 2, pd.to_datetime('2020-01-12 00:40:17')],
                            [6, 2, 3, pd.to_datetime('2020-01-10 00:10:53')],
                            [7, 3, 20, pd.to_datetime('2020-01-21 00:20:57')],
                            [8, 3, 40, pd.to_datetime('2020-01-20 00:10:38')],
                            [9, 3, 60, pd.to_datetime('2020-01-01 00:30:38')],
                            ]),
                  columns=['id1', 'id2', 'age', 'stime'])
I want to add a column whose value is the maximum age among rows with a matching id2 and an stime within the two weeks before that row's stime. So for the above example I want to get
df2 = pd.DataFrame(np.array([[1, 1, 3, pd.to_datetime('2020-01-10 00:30:16'), 3],
                             [2, 1, 10, pd.to_datetime('2020-01-27 00:20:20'), 60],
                             [3, 1, 60, pd.to_datetime('2020-01-26 00:10:08'), 60],
                             [4, 2, 1, pd.to_datetime('2020-01-13 00:20:19'), 3],
                             [5, 2, 2, pd.to_datetime('2020-01-12 00:40:17'), 3],
                             [6, 2, 3, pd.to_datetime('2020-01-10 00:10:53'), 3],
                             [7, 3, 20, pd.to_datetime('2020-01-21 00:20:57'), 40],
                             [8, 3, 40, pd.to_datetime('2020-01-20 00:10:38'), 40],
                             [9, 3, 60, pd.to_datetime('2020-01-01 00:30:38'), 60]
                             ]),
                   columns=['id1', 'id2', 'age', 'stime', 'max_age_last_2w'])
As the df I want to do this on is very large, any help on how to do this efficiently would be greatly appreciated - thanks in advance!

Try:
df['max_age_last_2w'] = df.groupby(['id2', pd.Grouper(key='stime', freq='2W', closed='right')])['age'].transform('max')
Output:
id1 id2 age stime max_age_last_2w
0 1 1 3 2020-01-10 00:30:16 3
1 2 1 10 2020-01-27 00:20:20 60
2 3 1 60 2020-01-26 00:10:08 60
3 4 2 1 2020-01-13 00:20:19 3
4 5 2 2 2020-01-12 00:40:17 3
5 6 2 3 2020-01-10 00:10:53 3
6 7 3 20 2020-01-21 00:20:57 40
7 8 3 40 2020-01-20 00:10:38 40
8 9 3 60 2020-01-01 00:30:38 60
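The 2W Grouper above buckets rows into fixed two-week calendar bins. If what is needed is a true trailing 14-day window ending at each row's own stime, a per-group time-based rolling max is one possible sketch (assuming age is numeric and stime is a datetime column; the sample df above is object-typed because it was built from a mixed np.array, hence the explicit conversions):
# make dtypes usable for a time-based rolling window
df['age'] = df['age'].astype(int)
df['stime'] = pd.to_datetime(df['stime'])

# sort so each id2 group is in time order, then take a trailing 14-day max
df = df.sort_values(['id2', 'stime'])
df['max_age_last_2w'] = (
    df.set_index('stime')
      .groupby('id2')['age']
      .rolling('14D')
      .max()
      .to_numpy()  # rows come back in the same (id2, stime) order as df
)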

Pandas data frame index

If I have a Series
s = pd.Series(1, index=[1,2,3,5,6,9,10])
But I need the full index [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], with the values at labels 4, 7, and 8 equal to zero.
So I expect the updated series will be
s = pd.Series([1,1,1,0,1,1,0,0,1,1], index=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
How should I update the series?
Thank you in advance!
Try this:
s.reindex(range(1, s.index.max() + 1), fill_value=0)
Output:
1 1
2 1
3 1
4 0
5 1
6 1
7 0
8 0
9 1
10 1
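Note that reindex returns a new Series rather than modifying s in place, so assign the result back; a minimal runnable sketch:
import pandas as pd

s = pd.Series(1, index=[1, 2, 3, 5, 6, 9, 10])
s = s.reindex(range(1, s.index.max() + 1), fill_value=0)  # missing labels get 0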

Appending columns to other columns in Pandas

Given the dataframe:
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
What is the easiest way to append the third column to the first and the fourth column to the second?
The result should look like:
d = {'col1': [1, 2, 3, 4, 7, 7, 8, 12, 1, 11], 'col2': [4, 5, 6, 9, 5, 12, 13, 14, 15, 16]}
I need to use this in a script with different column names, so referencing columns by name is not possible. I have tried something along the lines of df.iloc[:, x] to achieve this.
You can use:
out = pd.concat([subdf.set_axis(['col1', 'col2'], axis=1)
                 for _, subdf in df.groupby(pd.RangeIndex(df.shape[1]) // 2, axis=1)])
print(out)
# Output
col1 col2
0 1 4
1 2 5
2 3 6
3 4 9
4 7 5
0 7 12
1 8 13
2 12 14
3 1 15
4 11 16
You can change the column names and concat:
pd.concat([df[['col1', 'col2']],
           df[['col3', 'col4']].set_axis(['col1', 'col2'], axis=1)])
Add ignore_index=True to reset the index in the process.
Output:
col1 col2
0 1 4
1 2 5
2 3 6
3 4 9
4 7 5
0 7 12
1 8 13
2 12 14
3 1 15
4 11 16
Or, using numpy:
N = 2  # number of columns per block

# reshape to (rows, blocks, N), then stack the blocks under each other
pd.DataFrame(
    df.values.reshape((-1, df.shape[1] // 2, N))
      .reshape(-1, N, order='F'),
    columns=df.columns[:N]
)
This may not be the most efficient solution, but you can do it using the pd.concat() function in pandas.
First convert the initial dict d into a pandas DataFrame and then apply the concat function.
d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5], 'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)
d_2 = {'col1': pd.concat([df.iloc[:, 0], df.iloc[:, 2]]),
       'col2': pd.concat([df.iloc[:, 1], df.iloc[:, 3]])}
d_2 is your required dict. Convert it to a DataFrame if you need to:
df_2 = pd.DataFrame(d_2)
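If the column names vary from run to run, a purely positional variant of the same idea (a sketch assuming the second half of the columns should be appended under the first half) could be:
import pandas as pd

d = {'col1': [1, 2, 3, 4, 7], 'col2': [4, 5, 6, 9, 5],
     'col3': [7, 8, 12, 1, 11], 'col4': [12, 13, 14, 15, 16]}
df = pd.DataFrame(d)

half = df.shape[1] // 2
left = df.iloc[:, :half]                                  # first half of the columns
right = df.iloc[:, half:].set_axis(left.columns, axis=1)  # second half, renamed to match
out = pd.concat([left, right], ignore_index=True)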

Multiply Data in Two Dataframes of Different Sizes and Keep Non-Similar Data

I am trying to multiply data column-wise in two dataframes with different numbers of columns, but I don't want NaNs for the columns that exist only in the bigger dataframe. Say the dataframes are:
import pandas as pd
data1 = {
    'A': [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    'B': ['D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D', 'D'],
    'C': [2, 4, 1, 0, 2, 1, 3, 0, 7, 8],
    '428': [1, 10, 5, 8, 2, 7, 10, 0, 3, 5],
    '424': [9, 2, 6, 8, 9, 1, 7, 3, 8, 6],
    '425': [4, 2, 8, 7, 9, 6, 10, 5, 9, 9]
}
data2 = {
    '428': [1, 10, 5, 8, 2, 7, 10, 0, 3, 5],
    '424': [9, 2, 6, 8, 9, 1, 7, 3, 8, 6],
    '425': [4, 2, 8, 7, 9, 6, 10, 5, 9, 9]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
When I do df1.mul(df2) I get the following result:
424 425 428 A B C
0 81 16 1 NaN NaN NaN
1 4 4 100 NaN NaN NaN
2 36 64 25 NaN NaN NaN
3 64 49 64 NaN NaN NaN
4 81 81 4 NaN NaN NaN
5 1 36 49 NaN NaN NaN
6 49 100 100 NaN NaN NaN
7 9 25 0 NaN NaN NaN
8 64 81 9 NaN NaN NaN
9 36 81 25 NaN NaN NaN
However, what I want to achieve is like the data below:
424 425 428 A B C
0 81 16 1 3 D 2
1 4 4 100 3 D 4
2 36 64 25 3 D 1
3 64 49 64 3 D 0
4 81 81 4 3 D 2
5 1 36 49 3 D 1
6 49 100 100 3 D 3
7 9 25 0 3 D 0
8 64 81 9 3 D 7
9 36 81 25 3 D 8
I know I can achieve what I want by doing
df1['428'] = df1['428'] * df2['428']
df1['424'] = df1['424'] * df2['424']
df1['425'] = df1['425'] * df2['425']
However, because there are a lot of columns that need to be multiplied, this is not a practical solution. Also, I cannot do df1.mul(df2, fill_value=1) because column B in df1 contains strings.
Try:
col = ['428', '424', '425']
df1[col] = df1[col].mul(df2[col], axis=0, fill_value=1)
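If listing the shared columns by hand is impractical, they can be derived as the intersection of the two frames' columns (a sketch assuming every shared column is numeric):
# columns present in both frames; everything else in df1 is left untouched
col = df1.columns.intersection(df2.columns)
df1[col] = df1[col].mul(df2[col], axis=0)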

Is there any method to append test data with predicted data?

I have one array of test data, like array = [[5, 6, 7, 1], [5, 6, 7, 4], [5, 6, 7, 3]], and one array of predicted data, like array_pred = [10, 3, 4], both of equal length. I want to append each prediction to its row, so the result is res_array = [[5, 6, 7, 1, 10], [5, 6, 7, 4, 3], [5, 6, 7, 3, 4]]. I then need to store this in a dataframe and generate an Excel file from it. Is that possible?
Use numpy.hstack (or numpy.column_stack) to join the arrays, convert to a Series, and then write to Excel:
a = np.hstack((array, np.array(array_pred)[:, None]))
#thank you #Ch3steR
a = np.column_stack([array, array_pred])
print(a)
[[ 5  6  7  1 10]
 [ 5  6  7  4  3]
 [ 5  6  7  3  4]]
s = pd.Series(a.tolist())
print(s)
0 [5, 6, 7, 1, 10]
1 [5, 6, 7, 4, 3]
2 [5, 6, 7, 3, 4]
dtype: object
s.to_excel(file, index=False)
Or, if you need flattened values, convert to a DataFrame and a Series and use concat:
df = pd.concat([pd.DataFrame(array), pd.Series(array_pred)], axis=1, ignore_index=True)
print(df)
0 1 2 3 4
0 5 6 7 1 10
1 5 6 7 4 3
2 5 6 7 3 4
And then:
df.to_excel(file, index=False)
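A one-step variant that also gives the Excel file readable headers (the column names and output path here are only illustrative):
import numpy as np
import pandas as pd

array = [[5, 6, 7, 1], [5, 6, 7, 4], [5, 6, 7, 3]]
array_pred = [10, 3, 4]

# stack the predictions as an extra column, then label the columns before writing
res = pd.DataFrame(np.column_stack([array, array_pred]),
                   columns=['f1', 'f2', 'f3', 'f4', 'prediction'])
res.to_excel('result.xlsx', index=False)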

pandas grouped with aggregation stats across all dataframe columns

I am grouping data in a pandas dataframe and using some aggregation functions to generate results data. Input data:
A B C D E F
0 aa 5 3 2 2 2
1 aa 3 2 2 3 3
2 ac 2 0 2 7 7
3 ac 9 2 3 8 8
4 ac 2 3 7 3 3
5 ad 0 0 0 1 1
6 ad 9 9 9 9 9
7 ad 6 6 6 6 6
8 ad 3 3 3 3 3
The pandas groupby object seems to operate on only one column at a time, but I want to generate the statistics on all columns in my df. For example, I can use grouped['C'].agg([np.mean, len]) to generate the statistics for column 'C', but what if I want to generate these statistics for all of columns B through F?
The output from this is:
A count_C mean_C
0 aa 2 2.500000
1 ac 3 1.666667
2 ad 4 4.500000
But what I want is:
A count_B mean_B count_C mean_C count_D mean_D etc...
0 aa 2 4.000000 2 2.500000 2 2.0 etc...
1 ac 3 4.333333 3 1.666667 3 4.0
2 ad 4 4.500000 4 4.500000 4 4.5
Is there any easy way to do the group by with aggregation in a single command? If not, is there an easy way to iterate over all columns and merge in new aggregation statistics results for each column?
Here's my full code so far:
import pandas as pd
import numpy as np
import pprint as pp
test_dataframe = pd.DataFrame({
    'A': ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
    'B': [5, 3, 2, 9, 2, 0, 9, 6, 3],
    'C': [3, 2, 0, 2, 3, 0, 9, 6, 3],
    'D': [2, 2, 2, 3, 7, 0, 9, 6, 3],
    'E': [2, 3, 7, 8, 3, 1, 9, 6, 3],
    'F': [2, 3, 7, 8, 3, 1, 9, 6, 3]
})
#group, aggregate, convert object to df, sort index
grouped = test_dataframe.groupby(['A'])
grouped_stats = grouped['C'].agg([np.mean, len])
grouped_stats = pd.DataFrame(grouped_stats).reset_index()
grouped_stats.rename(columns = {'mean':'mean_C', 'len':'count_C'}, inplace=True)
grouped_stats.sort_index(axis=1, inplace=True)
print("Input: ")
pp.pprint(test_dataframe)
print("Output: ")
pp.pprint(grouped_stats)
You don't have to call grouped['B'], grouped['C'], etc. one by one; simply aggregate the entire groupby object and pandas will apply the aggregation functions to all columns.
import pandas as pd
test_dataframe = pd.DataFrame({
    'A': ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
    'B': [5, 3, 2, 9, 2, 0, 9, 6, 3],
    'C': [3, 2, 0, 2, 3, 0, 9, 6, 3],
    'D': [2, 2, 2, 3, 7, 0, 9, 6, 3],
    'E': [2, 3, 7, 8, 3, 1, 9, 6, 3],
    'F': [2, 3, 7, 8, 3, 1, 9, 6, 3]
})
agg_funcs = ['count', 'mean']
test_dataframe = test_dataframe.groupby(['A']).agg(agg_funcs)

# flatten the resulting (column, statistic) MultiIndex into single-level names
columns = 'B C D E F'.split()
names = [y + '_' + x for x in columns for y in agg_funcs]
test_dataframe.columns = names
Out[89]:
count_B mean_B count_C mean_C count_D mean_D count_E mean_E count_F mean_F
A
aa 2 4.0000 2 2.5000 2 2.0 2 2.50 2 2.50
ac 3 4.3333 3 1.6667 3 4.0 3 6.00 3 6.00
ad 4 4.5000 4 4.5000 4 4.5 4 4.75 4 4.75
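A variant that builds the flat names directly from the aggregated MultiIndex, so the column list never has to be typed out by hand (a self-contained sketch using the same data and statistics):
import pandas as pd

df = pd.DataFrame({
    'A': ['aa', 'aa', 'ac', 'ac', 'ac', 'ad', 'ad', 'ad', 'ad'],
    'B': [5, 3, 2, 9, 2, 0, 9, 6, 3],
    'C': [3, 2, 0, 2, 3, 0, 9, 6, 3],
    'D': [2, 2, 2, 3, 7, 0, 9, 6, 3],
    'E': [2, 3, 7, 8, 3, 1, 9, 6, 3],
    'F': [2, 3, 7, 8, 3, 1, 9, 6, 3]
})

out = df.groupby('A').agg(['count', 'mean'])
# flatten the (column, statistic) MultiIndex into names like 'count_B', 'mean_B'
out.columns = [f'{stat}_{col}' for col, stat in out.columns]
out = out.reset_index()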
