Groupby to count the number of calls on different days by id - python

Given a dataframe like the one below:
import pandas as pd

df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20', '2013-04-20', '2013-04-19'],
                   'id': [1, 2, 2, 3, 1]})
I need to create another dataframe containing only the id and the number of calls made on different days. An example of output is as follows:
Id | Count
1 | 1
2 | 2
3 | 1
What I've tried so far:
df2 = df.groupby(['id','date']).size().reset_index().rename(columns={0:'COUNT'})
df2
However, the output is far from what I want. Can anyone help?
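For reference, the attempt above yields one row per (id, date) pair, counting calls within each day rather than distinct days per id:

>>> df2
   id        date  COUNT
0   1  2013-04-19      2
1   2  2013-04-19      1
2   2  2013-04-20      1
3   3  2013-04-20      1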

You can make use of .nunique() [pandas-doc] to count the unique days per id:
df.groupby('id').date.nunique()
This gives us a series:
>>> df.groupby('id').date.nunique()
id
1    1
2    2
3    1
Name: date, dtype: int64
You can make use of .to_frame() [pandas-doc] to convert it to a dataframe:
>>> df.groupby('id').date.nunique().to_frame('count')
   count
id
1      1
2      2
3      1
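If you also want the exact layout from the question (Id and Count as ordinary columns), a small follow-up sketch:

>>> df.groupby('id').date.nunique().to_frame('Count').reset_index().rename(columns={'id': 'Id'})
   Id  Count
0   1      1
1   2      2
2   3      1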

You can use the pd.DataFrame constructor to convert the result into a dataframe and then rename the columns as you like.
import pandas as pd
df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20', '2013-04-20', '2013-04-19'],
                   'id': [1, 2, 2, 3, 1]})
x = pd.DataFrame(df.groupby('id').date.nunique().reset_index())
x.columns = ['Id', 'Count']
print(x)
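Note that .reset_index() on the grouped series already returns a dataframe, so the pd.DataFrame(...) wrapper above is optional; an equivalent, slightly shorter sketch:

x = df.groupby('id').date.nunique().reset_index()
x.columns = ['Id', 'Count']
print(x)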

Related

Initialize dataframe with two columns, one of them all zeros

I have a list that I would like to convert to a pandas dataframe. In the second column I want all zeros, but I got an "object of type 'int' has no len()" error. This is what I did:
df = pd.DataFrame([all_equal_timestamps['A'], 0], columns=['data','label'])
How can I add a second column of all zeros to this dataframe in the easiest manner, and why did the code above give me this error?
Not sure what is in all_equal_timestamps, so I presume it's a list of elements. (The error occurs because pd.DataFrame([your_list, 0]) treats each element of the outer list as a row and calls len() on it, and the integer 0 has no length.) Do you mean to get this result?
import pandas as pd
all_equal_timestamps = {'A': ['1234', 'aaa', 'asdf']}
df = pd.DataFrame(all_equal_timestamps['A'], columns=['data']).assign(label=0)
# df['label'] = 0
print(df)
Output:
   data  label
0  1234      0
1   aaa      0
2  asdf      0
If you're creating a DataFrame from a list of lists, each element of the outer list is treated as a row, so you'd get something like this (note that '0'*len(...) builds the string '000', whose characters fill the second row):
df = pd.DataFrame([ all_equal_timestamps['A'], '0'*len(all_equal_timestamps['A']) ], columns=['data', 'label', 'anothercol'])
print(df)
Output:
   data label anothercol
0  1234   aaa       asdf
1     0     0          0
You can add a column named "new" with all zeros by using:
df['new'] = 0
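A minimal, self-contained sketch (assuming all_equal_timestamps['A'] is the flat list from above); the scalar 0 is broadcast to every row:

import pandas as pd

all_equal_timestamps = {'A': ['1234', 'aaa', 'asdf']}  # assumed sample data
df = pd.DataFrame(all_equal_timestamps['A'], columns=['data'])
df['new'] = 0  # the scalar is broadcast down the whole column
print(df)
#    data  new
# 0  1234    0
# 1   aaa    0
# 2  asdf    0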
You can do it all in one line with assign:
timestamps = [1,0,3,5]
pd.DataFrame({"Data":timestamps}).assign(new=0)
Output:
   Data  new
0     1    0
1     0    0
2     3    0
3     5    0

Pandas unique count of number of records in a column where multiple values are present in another column

I'm trying to count the unique number of Customer_Key values where the Category column has both the values A and B, grouped by the values in the Month column. The sample dataframe is as follows:
Customer_Key  Category  Month
ck123         A         2
ck234         A         2
ck234         B         2
ck680         A         3
ck123         B         3
ck123         A         3
ck356         B         3
ck345         A         4
The expected outcome is:

Month  Unique Customers
2      1
3      1
4      0
I'm not able to come up with anything here. Any lead/help will be appreciated. Thanks in advance.
Here is one way to accomplish it.
First, group by Month and Customer_Key; for each customer within a month this gives the count of their categories. That result is further grouped by Month, and we take the maximum count.
Decrementing the count gives the required number of unique customers belonging to both categories.
Hope it helps.
df2 = df.groupby(['Month','Customer_Key']).count().reset_index().groupby(['Month'])['Category'].max().reset_index()
df2['Category'] = df2['Category'] - 1
df2.rename(columns={'Category': 'Unique Customers'}, inplace=True)
df2
   Month  Unique Customers
0      2                 1
1      3                 1
2      4                 0
Try something like this:
df.groupby(['Customer_Key', 'Month']) \
  .sum() \
  .query("Category in ('AB','BA')") \
  .groupby('Month') \
  .count() \
  .rename(columns={'Category': 'Unique Customers'})
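For context: Category is a string column, so .sum() concatenates the letters within each (Customer_Key, Month) group; a customer holding both categories in a month shows up as 'AB' or 'BA', which is exactly what the query filters for. The intermediate result, assuming the sample data above:

>>> df.groupby(['Customer_Key', 'Month']).sum()
                   Category
Customer_Key Month
ck123        2            A
             3           BA
ck234        2           AB
ck345        4            A
ck356        3            B
ck680        3            A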
Edit...
The issue with this solution is that it does not count months with 0. I have prepared a fix:
import pandas as pd
from io import StringIO
data = StringIO("""ck123 A 2
ck234 A 2
ck234 B 2
ck680 A 3
ck123 B 3
ck123 A 3
ck356 B 3
ck345 A 4""")
# read the sample data into a DataFrame (whitespace-separated, no header row)
df = pd.read_csv(data, sep=r'\s+', names=['Customer_Key', 'Category', 'Month'])
df1 = df.groupby(['Customer_Key', 'Month']) \
    .sum() \
    .reset_index()
def map_categories(row):
    if row['Category'] in ('AB', 'BA'):
        return 1
    else:
        return 0

df1['Unique Customers'] = df1.apply(map_categories, axis=1)
df1 = df1.groupby('Month')['Unique Customers'].sum().reset_index()
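A quick check of the fix, assuming df was read from the whitespace-separated data above; months with no qualifying customer now appear with 0:

>>> print(df1)
   Month  Unique Customers
0      2                 1
1      3                 1
2      4                 0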

Pandas DataFrames: Extract Information and Collapse Columns

I have a pandas DataFrame containing information in its columns that I would like to extract into a new column.
It is best explained visually:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number Type 1': [1, 2, np.nan],
                   'Number Type 2': [np.nan, 3, 4],
                   'Info': list('abc')})
The table shows the initial DataFrame with the Number Type 1 and Number Type 2 columns.
I would like to extract the types and create a new Type column, refactoring the DataFrame accordingly.
Basically, the numbers are collapsed into a single Number column and the types are extracted into a new Type column. The information in the Info column stays bound to the numbers (e.g. 2 and 3 share the same information, b).
What is the best way to do this in Pandas?
Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print(df)
  Info Type  Number
0    a    1       1
1    b    1       2
4    b    2       3
5    c    2       4
Another solution with set_index and stack:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print(df)
  Info Type  Number
0    a    1       1
1    b    1       2
2    b    2       3
3    c    2       4

Grouping data from multiple columns in data frame into summary view

I have a data frame as below and would like to create the summary information shown in the result further down. Can you please help with how this can be done in pandas?
Data-frame:
import pandas as pd
ds = pd.DataFrame(
    [{"id": "1", "owner": "A", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
     {"id": "2", "owner": "A", "delivery": "2-Jan", "priority": "Medium", "exception": ""},
     {"id": "3", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
     {"id": "4", "owner": "B", "delivery": "1-Jan", "priority": "High", "exception": "No Bill"},
     {"id": "5", "owner": "C", "delivery": "1-Jan", "priority": "High", "exception": ""},
     {"id": "6", "owner": "C", "delivery": "2-Jan", "priority": "High", "exception": ""},
     {"id": "7", "owner": "C", "delivery": "", "priority": "High", "exception": ""}]
)
Result: a per-owner summary with per-date delivery counts, a high-priority count, an exception count, and the joined ids (matching the output at the end of the answer below).
Use:
# crosstab and rename the empty-string column
df = pd.crosstab(ds['owner'], ds['delivery']).rename(columns={'': 'No delivery Date'})
# change the column order - move the first column to the end
df = df[df.columns[1:].tolist() + df.columns[:1].tolist()]
# get counts by comparing and summing the True values
df['high_count'] = ds['priority'].eq('High').groupby(ds['owner']).sum().astype(int)
df['exception_count'] = ds['exception'].eq('No Bill').groupby(ds['owner']).sum().astype(int)
# convert id to string and join with ','
df['ids'] = ds['id'].astype(str).groupby(ds['owner']).agg(','.join)
# index to column
df = df.reset_index()
# remove the index name 'delivery'
df.columns.name = None
print(df)
  owner  1-Jan  2-Jan  No delivery Date  high_count  exception_count    ids
0     A      1      1                 0           1                1    1,2
1     B      2      0                 0           2                2    3,4
2     C      1      1                 1           3                0  5,6,7

Efficiently flattening a large MultiIndex in pandas

I have a very large DataFrame that looks like this:
                   A  B
SPH2008 3/21/2008  1  2
        3/21/2008  1  2
        3/21/2008  1  2
SPM2008 6/21/2008  1  2
        6/21/2008  1  2
        6/21/2008  1  2
And I have the following code which is intended to flatten the index and acquire the unique pairs of the two index levels into a new DataFrame:
indices = [df.index.get_level_values(0), df.index.get_level_values(1)]
tmp = pd.DataFrame(data=indices).T.drop_duplicates()
tmp.columns = ['ID', 'ExpirationDate']
tmp.sort_values('ExpirationDate', inplace=True)
However, this operation takes a remarkably long amount of time. Is there a more efficient way to do this?
Use pandas.Index.drop_duplicates:
pd.DataFrame([*df.index.drop_duplicates()], columns=['ID', 'ExpirationDate'])
        ID ExpirationDate
0  SPH2008      3/21/2008
1  SPM2008      6/21/2008
With older versions of Python that can't unpack in that way:
pd.DataFrame(df.index.drop_duplicates().tolist(), columns=['ID', 'ExpirationDate'])
IIUC, You can also groupby the levels of your multiindex, then create a dataframe from that with your desired columns:
>>> pd.DataFrame(df.groupby(level=[0,1]).groups.keys(), columns=['ID', 'ExpirationDate'])
        ID ExpirationDate
0  SPH2008      3/21/2008
1  SPM2008      6/21/2008
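On pandas 0.24+, MultiIndex.to_frame offers another route; a hedged sketch that deduplicates first and then renames the columns:

out = df.index.drop_duplicates().to_frame(index=False)
out.columns = ['ID', 'ExpirationDate']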
