I have a dataset that looks like this:
ID date
1 01-01-2012
1 05-02-2012
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 19-05-2012
2 07-08-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 15-04-2013
3 17-05-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I am working with Python and I would like to select the last 3 dates for each ID. Here is the dataset I would like to have:
ID date
1 25-06-2013
1 14-12-2013
1 10-04-2014
2 10-09-2014
2 27-11-2015
2 01-12-2015
3 22-05-2015
3 30-10-2016
3 02-11-2016
I used this code to select the very last date for each ID:
df_2=df.sort_values(by=['date']).drop_duplicates(subset='ID',keep='last')
But how can I select more than one date (for example the last 3 dates, or the last 4 dates, etc.)?
You might use groupby and tail in the following way to get the last 2 items from each group:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,1,2,2,2,3,3,3],'value':['A','B','C','D','E','F','G','H','I']})
df2 = df.groupby('ID').tail(2)
print(df2)
Output:
ID value
1 1 B
2 1 C
4 2 E
5 2 F
7 3 H
8 3 I
Note that for simplicity's sake I used different (already sorted) data for building df.
You can try this:
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])
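Note that in the question the dates are strings in day-month-year form, so sort_values would sort them lexically rather than chronologically. A minimal sketch of parsing them first (the format is assumed from the sample data):
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')  # parse dd-mm-yyyy strings as real dates
df.sort_values(by=['date']).groupby('ID').tail(3).sort_values(['ID', 'date'])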
I tried this, but with a non-datetime data type:
import pandas as pd
import numpy as np

a = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
b = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
arr = np.array([a, b])                        # 2 x 15 array of strings
df = pd.DataFrame(arr.T, columns=['ID', 'Date'])
# tail gives you the last n elements of each group
df_ = df.groupby('ID').tail(3)
df_
Output:
ID Date
2 1 c
3 1 d
4 1 e
7 2 h
8 2 i
9 2 j
12 3 m
13 3 n
14 3 o
Related
I would like to drop the rows containing [] in a given df
df=pd.DataFrame(dict(a=[1,2,4,[],5]))
Such that the expected output will be
a
0 1
1 2
2 4
3 5
Edit:
or to make things more interesting, what if we have two columns and some of the cells contain [] to be dropped?
df=pd.DataFrame(dict(a=[1,2,4,[],5],b=[2,[],1,[],6]))
One way is to get the string repr and filter:
df = df[df['a'].map(repr)!='[]']
Output:
a
0 1
1 2
2 4
4 5
For multiple columns, we could apply the above:
out = df[df.apply(lambda c: c.map(repr)).ne('[]').all(axis=1)]
Output:
a b
0 1 2
2 4 1
4 5 6
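If you are on pandas 2.1 or later (an assumption), DataFrame.map applies a function elementwise and expresses the same idea a bit more directly:
# Assumes pandas >= 2.1, where DataFrame.map replaced the older applymap
out = df[df.map(repr).ne('[]').all(axis=1)]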
You can't use equality directly as pandas will try to align a Series and a list, but you can use isin:
df[~df['a'].isin([[]])]
Output:
a
0 1
1 2
2 4
4 5
To act on all columns:
df[~df.isin([[]]).any(axis=1)]
Output:
a b
0 1 2
2 4 1
4 5 6
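For illustration, here is what the direct equality comparison mentioned above does (a minimal sketch; the exact error message may vary by pandas version):
import pandas as pd

df = pd.DataFrame(dict(a=[1, 2, 4, [], 5]))
try:
    _ = (df['a'] == [])   # pandas compares element-wise and requires equal lengths
except ValueError as err:
    print(err)            # e.g. "Lengths must match to compare"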
Given a df
a b ngroup
0 1 3 0
1 1 4 0
2 1 1 0
3 3 7 2
4 4 4 2
5 1 1 4
6 2 2 4
7 1 1 4
8 6 6 5
I would like to compute the sum of multiple columns (i.e., a and b) grouped by the column ngroup.
In addition, I would like to count the number of elements in each group.
Based on these two conditions, the expected output is as below:
a b nrow_same_group ngroup
3 8 3 0
7 11 2 2
4 4 3 4
6 6 1 5
The following code does the job:
import pandas as pd
df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6],
                           [3,4,1,7,4,1,2,1,6],
                           [0,0,0,2,2,4,4,4,5])), columns=['a','b','ngroup'])
grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a','b']].agg('sum').reset_index()
df2 = df['ngroup'].value_counts().reset_index()
df2.sort_values('index', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index':'ngroup','ngroup':'nrow_same_group'},inplace=True)
df= pd.merge(df1, df2, on=['ngroup'])
However, I wonder whether pandas has a built-in that achieves something similar in a single line.
You can do it using only groupby + agg.
import pandas as pd
df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6],
                           [3,4,1,7,4,1,2,1,6],
                           [0,0,0,2,2,4,4,4,5])), columns=['a','b','ngroup'])
res = (
df.groupby('ngroup', as_index=False)
.agg(a=('a','sum'), b=('b', 'sum'),
nrow_same_group=('a', 'size'))
)
Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.
Output:
>>> res
ngroup a b nrow_same_group
0 0 3 8 3
1 2 7 11 2
2 4 4 4 3
3 5 6 6 1
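The tuples are shorthand for pd.NamedAgg; the same aggregation spelled explicitly:
res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=pd.NamedAgg(column='a', aggfunc='sum'),
           b=pd.NamedAgg(column='b', aggfunc='sum'),
           nrow_same_group=pd.NamedAgg(column='a', aggfunc='size'))
)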
First aggregate a and b with sum, then compute the size of each group and assign it to the nrow_same_group column:
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())
a b nrow_same_group
ngroup
0 3 8 3
2 7 11 2
4 4 4 3
5 6 6 1
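If you want ngroup back as a regular column, as in the expected output, chain a reset_index():
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size()).reset_index()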
I am using apply to leverage one dataframe to manipulate a second dataframe and return results. Here is a simplified example that I realize could be more easily answered with "in" logic, but for now let's keep the use of .apply() as a constraint:
import pandas as pd
df1 = pd.DataFrame({'Name':['A','B'],'Value':range(1,3)})
df2 = pd.DataFrame({'Name':['A']*3+['B']*4+['C'],'Value':range(1,9)})
def filter_df(x, df):
    return df[df['Name'] == x['Name']]
df1.apply(filter_df, axis=1, args=(df2, ))
Which is returning:
0 Name Value
0 A 1
1 A 2
2 ...
1 Name Value
3 B 4
4 B 5
5 ...
dtype: object
What I would like to see instead is one formatted DataFrame with Name and Value headers. All advice appreciated!
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
In my opinion, this cannot be done solely with apply; you need pandas.concat:
result = pd.concat(df1.apply(filter_df, axis=1, args=(df2,)).to_list())
print(result)
Output:
Name Value
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 B 7
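For comparison, outside the stated .apply() constraint, a plain merge produces the same table (a sketch):
# Keep only the Name column of df1 so the merge pulls Value from df2
result = df1[['Name']].merge(df2, on='Name')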
This sounds a bit weird, but I think it's exactly what I need right now:
I have several pandas dataframes that contain columns with float numbers, for example:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
Now I want to add a column with only one row, whose value equals the average of column 'a', which in this case is 3.0. So the new dataframe will look like this:
a b c average
0 0 1 2 3.0
1 3 4 5
2 6 7 8
And all the rows below are empty.
I've tried things like df['average'] = np.mean(df['a']) but that gives me a whole column of 3.0. Any help will be appreciated.
Assign a Series; this is cleaner:
df['average'] = pd.Series(df['a'].mean(), index=df.index[[0]])
Or, even better, assign with loc:
df.loc[df.index[0], 'average'] = df['a'].mean().item()
Filling NaNs is straightforward; you can do
df['average'] = df['average'].fillna('')
df
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
You can do something like:
df['average'] = [np.mean(df['a'])]+['']*(len(df)-1)
Here is a full example:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    [(0, 1, 2), (3, 4, 5), (6, 7, 8)],
    columns=['a', 'b', 'c'])
print(df)
a b c
0 0 1 2
1 3 4 5
2 6 7 8
df['average'] = ''
df.loc[0, 'average'] = df['a'].mean()  # .loc avoids the chained assignment df['average'][0] = ...
print(df)
a b c average
0 0 1 2 3
1 3 4 5
2 6 7 8
I want to make a table with all available products for every customer. However, I only have a table with the product-customer combinations that were actually bought. I want to make a new table that also includes the products that were not bought by a customer. The current table looks as follows:
The table I want to end up with is:
Could anyone help me do this in pandas?
One way to do this is to use pd.MultiIndex and reindex:
df = pd.DataFrame({'Product':list('ABCDEF'),
'Customer':[1,1,2,3,3,3],
'Amount':[4,5,3,1,1,2]})
indx = pd.MultiIndex.from_product([df['Product'].unique(),
df['Customer'].unique()],
names=['Product','Customer'])
df.set_index(['Product','Customer'])\
.reindex(indx, fill_value=0)\
.reset_index()\
.sort_values(['Customer','Product'])
Output:
Product Customer Amount
0 A 1 4
3 B 1 5
6 C 1 0
9 D 1 0
12 E 1 0
15 F 1 0
1 A 2 0
4 B 2 0
7 C 2 3
10 D 2 0
13 E 2 0
16 F 2 0
2 A 3 0
5 B 3 0
8 C 3 0
11 D 3 1
14 E 3 1
17 F 3 2
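An equivalent way to fill in the missing combinations, sketched with unstack/stack instead of an explicit MultiIndex:
out = (df.set_index(['Product', 'Customer'])['Amount']
         .unstack(fill_value=0)   # widen to the full Product x Customer grid
         .stack()                 # back to long format, now including the zeros
         .reset_index(name='Amount'))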
You can also create a pivot to do what you want in one line. Note that the output shape is different: pandas.DataFrame.pivot produces a wide table rather than the long format above. But if you're not especially fussed about that (it depends on how you intend to use the final table), the following code does the job.
df = pd.DataFrame({'Product':['A','B','C','D','E','F'],
'Customer':[1,1,2,3,3,3],
'Amount':[4,5,3,1,1,2]})
pivot_df = df.pivot(index='Product',
columns='Customer',
values='Amount').fillna(0).astype('int')
Output:
Customer 1 2 3
Product
A 4 0 0
B 5 0 0
C 0 3 0
D 0 0 1
E 0 0 1
F 0 0 2
df.pivot creates NaN values when there are no corresponding entries in the original df (it creates a NaN value for Product A and Customer 2, for instance). NaNs are float values, so all the 'Amounts' in the pivot are implicitly converted into floats. This is why I use fillna(0) to convert the NaN values into 0s, and then finally change the dtype back to int.
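If you later need the long format back, the pivot can be melted, for example (a sketch):
long_df = (pivot_df.reset_index()
                   .melt(id_vars='Product', value_name='Amount')  # var_name defaults to 'Customer'
                   .sort_values(['Customer', 'Product']))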