Groupby followed by counts as new columns in pandas - python

I want to group by an account ID, count the values within each group, and show those counts as new columns.
How can I do this in pandas?
E.g.:
Account Id Values
1 Open
2 Closed
1 Open
3 Closed
2 Open
Output must be:
Account Id Open Closed
1 2 0
2 1 1
3 0 1

Use a groupby and value_counts to get the initial counts you want. Then unstack the multiindex to get a DataFrame and set null values to 0 to get the final results:
import pandas as pd
# Defining DataFrame
df = pd.DataFrame(index=range(5))
df['Account Id'] = [1, 2, 1, 3, 2]
df['Values'] = ['Open', 'Closed', 'Open', 'Closed', 'Open']
grouped = df.groupby('Account Id')['Values'].value_counts()
# Remove the multiindex present
grouped = grouped.unstack()
# Set null values to 0 and cast back to int (unstack leaves floats where NaN appeared)
result = grouped.fillna(0).astype(int)
Output of result:
Closed Open
Account Id
1 0 2
2 1 1
3 1 0
(Sorry, I'm not sure how to properly represent the DataFrame)

Grouping on both columns and calling size() also gives the counts, as a long-format DataFrame with one row per (Account Id, Values) pair:
grouped_df = df.groupby(["Account Id","Values"])
grouped_df.size().reset_index(name = "Count")
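As an alternative to the groupby/unstack route above, the same wide table can be sketched with pd.crosstab. This is a minimal example that rebuilds the question's sample data; the variable name counts is only illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Account Id": [1, 2, 1, 3, 2],
    "Values": ["Open", "Closed", "Open", "Closed", "Open"],
})

# Rows are account ids, columns are the distinct values,
# cells are occurrence counts (missing combinations become 0)
counts = pd.crosstab(df["Account Id"], df["Values"])
print(counts)
```

crosstab fills absent combinations with 0 by itself, so no separate null-handling step is needed here.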

Related

pandas combine multiple row into one, and update other columns [duplicate]

I have this dataframe and I need to drop all duplicates but I need to keep first AND last values
For example:
1 0
2 0
3 0
4 0
output:
1 0
4 0
I tried df.column.drop_duplicates(keep=("first","last")) but it doesn't work; it raises
ValueError: keep must be either "first", "last" or False
Does anyone know a workaround for this?
Thanks
You could use the pandas concat function to create a dataframe with both the first and last values.
pd.concat([
df['X'].drop_duplicates(keep='first'),
df['X'].drop_duplicates(keep='last'),
])
You can't keep both the first and the last in a single call, so the trick is to concat the frame that keeps the first with the frame that keeps the last.
When you concat, rows that were never duplicated appear in both frames, so only add the rows of the second frame whose index is not already in the first. (Not sure if merge/join would work better.)
import pandas as pd
d = {1:0,2:0,10:1, 3:0,4:0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
print(df)
cnt
1 0
2 0
10 1
3 0
4 0
Then do this:
d1 = df.drop_duplicates(keep="first")
d2 = df.drop_duplicates(keep="last")
# select with an Index; recent pandas rejects a set as an indexer
d3 = pd.concat([d1, d2.loc[d2.index.difference(d1.index)]])
d3
Out[60]:
cnt
1 0
10 1
4 0
Use a groupby on your column (here named column), keep the first and last row of each group, then reset the index. If you ever want to check for duplicate values in more than one column, you can extend the columns you include in your groupby.
df = pd.DataFrame({'column':[0,0,0,0]})
Input:
column
0 0
1 0
2 0
3 0
df.groupby('column', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[0, -1]]).reset_index(level=0, drop=True)
Output:
column
0 0
3 0
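Another way to express "keep the first and the last of each set of duplicates" is a boolean mask built from DataFrame.duplicated. The sketch below reuses the cnt sample frame from the concat answer; the names is_first, is_last and result are illustrative only.

```python
import pandas as pd

# Same sample data as in the concat answer above
d = {1: 0, 2: 0, 10: 1, 3: 0, 4: 0}
df = pd.DataFrame.from_dict(d, orient="index", columns=["cnt"])

# A row is the first of its kind if it is not marked as a duplicate
# when the first occurrence is the one kept, and likewise for the last
is_first = ~df.duplicated(keep="first")
is_last = ~df.duplicated(keep="last")

# Keep rows that are either the first or the last occurrence
result = df[is_first | is_last]
print(result)
```

Because this filters the original frame in place, the row order is preserved and a row that is both first and last (a non-duplicate) appears only once.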

How to create a dataframe with a named-index and a unnamed-default-subindex

I want to create a dataframe with index of dates. But in one date there would be one record or more.
so I wanna create a dataframe like :
A B
2021-11-12 1 0 0
2 1 1
2021-11-13 1 0 0
2 1 0
3 0 1
so could I append any row with the same date into this dataframe, and the subindex would be auto-increased?
Or is there any other way to save records with the same date index in one dataframe?
Use:
# remove the counter level
df = df.reset_index(level=1, drop=True)
# add the new row
# your code
# sort so the new row ends up next to the other rows of its date
df = df.sort_index()
# add the subindex back
df = df.set_index(df.groupby(level=0).cumcount().add(1), append=True)
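To make the steps above concrete, here is a minimal sketch under some assumptions: the frame starts with a plain date index (no counter level yet), the new row is added with pd.concat, and a stable sort (kind="mergesort") is used so the appended row stays last within its date. The sample values and the name new_row are made up for illustration.

```python
import pandas as pd

# Hypothetical starting data: two records per date, plain date index
df = pd.DataFrame(
    {"A": [0, 1, 0, 1], "B": [0, 1, 0, 0]},
    index=pd.to_datetime(
        ["2021-11-12", "2021-11-12", "2021-11-13", "2021-11-13"]
    ),
)

# Add one more record for an existing date
new_row = pd.DataFrame({"A": [0], "B": [1]}, index=pd.to_datetime(["2021-11-13"]))
df = pd.concat([df, new_row])

# Stable sort keeps the appended row after the older rows of the same date
df = df.sort_index(kind="mergesort")

# Number the rows within each date (1, 2, 3, ...) and append that as a subindex
df = df.set_index(df.groupby(level=0).cumcount().add(1), append=True)
print(df)
```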

Groupby to count the number of calls on different days by id

Given a dataframe like the one below:
df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20', '2013-04-20', '2013-04-19'],
'id': [1,2,2,3,1]})
I need to create another dataframe containing only the id and the number of calls made on different days. An example of output is as follows:
Id | Count
1 | 1
2 | 2
3 | 1
What I'm trying so far:
df2 = df.groupby(['id','date']).size().reset_index().rename(columns={0:'COUNT'})
df2
However, the output is far from what I want. Can anyone help?
You can make use of .nunique() [pandas-doc] to count the unique days per id:
df.groupby('id').date.nunique()
This gives us a series:
>>> df.groupby('id').date.nunique()
id
1 1
2 2
3 1
Name: date, dtype: int64
You can make use of .to_frame() [pandas-doc] to convert it to a dataframe:
>>> df.groupby('id').date.nunique().to_frame('count')
count
id
1 1
2 2
3 1
You can also wrap the result in the pd.DataFrame constructor and rename the columns as you like.
import pandas as pd
df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20', '2013-04-20', '2013-04-19'],
'id': [1,2,2,3,1]})
x = pd.DataFrame(df.groupby('id').date.nunique().reset_index())
x.columns = ['Id', 'Count']
print(x)
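The same result can also be sketched in one step with named aggregation, assuming a pandas version that supports it (0.25 or later). The output column name Count is chosen here to mirror the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2013-04-19", "2013-04-19", "2013-04-20", "2013-04-20", "2013-04-19"],
    "id": [1, 2, 2, 3, 1],
})

# as_index=False keeps id as a regular column;
# Count=("date", "nunique") counts the distinct dates per id
counts = df.groupby("id", as_index=False).agg(Count=("date", "nunique"))
print(counts)
```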

how to detect value changed in python, pandas in each object

180762508,1268510763,374723980,293,20180402035748,198,25,1,1
180762508,1268503685,374717256,307,20180402035758,225,38,1,1
180762508,1268492506,374708540,236,20180402035808,222,52,1,1
180762508,1268485868,374697563,248,20180402035818,197,47,1,1
180762508,1268482430,374688520,272,20180402035828,196,31,1,1
180707764,1270608366,374988433,246,20180402035925,66,37,1,0
180707764,1270620899,374992366,222,20180402035935,68,49,1,0
The first column is the id and the last column is the one I am interested in.
I want to find where the last column changes from 0 to 1.
I built a really big data frame from this dataset in pandas:
import glob
import pandas as pd
path = r"1\1"
allFiles = glob.glob(path + r"\*.DAT")
frames = []  # avoid shadowing the built-in list
for filename in allFiles:
    df = pd.read_csv(filename, header=None)
    frames.append(df)
a = pd.concat(frames)
a.head()
This is all I have done so far.
I don't get an error, but I want to know an algorithm for finding where the last column's value changes within each unique id.
My goal is a data frame in which the first column is the unique id, the second and third columns are the latitude and longitude (the third and second columns of my dataset), and the last column is the timestamp (the 5th column of my dataset) at which the last column's value changed from 0 to 1.
If I understood you, you need to get the 5th row, where the change from 0 to 1, in the last column, takes place.
I made a dataframe with your first and last column (by the way, you said the 1st column is some kind of unique id, but I see repeated numbers), anyway based on your sample data, one possible solution is:
import pandas as pd
data = [[180762508,1],[180762508,1],[180762508,1],[180762508,1],[180762508,1],[180707764,0],[180707764,0]]
df = pd.DataFrame(data,columns=['my_id','interest'])
#new dataframe to compare the column interest
df2 = df.loc[df['interest'] != df['interest'].shift(-1)]
#output:
# my_id interest
# 4 180762508 1
# 6 180707764 0
imax = df2.index.max() #last index in df2 (here, the final row)
imin = df2.index.min() #index of the last row before the change
for i in range(imin,imax,1):
    pass #after the loop, i == imax - 1
#the row after the change in the original dataframe
print(df.loc[i])
Hi and thanks for posting. It looks like the first column doesn't have unique values, so I'm guessing you want the index or the timestamp returned?
In any case, here's a sample of what might work for you if you want to find when the interest column for an ID changes from 0 to 1:
import pandas as pd
# Provided data
raw_str = """
180762508,1268510763,374723980,293,20180402035748,198,25,1,1 180762508,1268503685,374717256,307,20180402035758,225,38,1,1 180762508,1268492506,374708540,236,20180402035808,222,52,1,1 180762508,1268485868,374697563,248,20180402035818,197,47,1,1 180762508,1268482430,374688520,272,20180402035828,196,31,1,1 180707764,1270608366,374988433,246,20180402035925,66,37,1,0 180707764,1270620899,374992366,222,20180402035935,68,49,1,0
"""
# Replace newline and split on single whitespace
chunks = raw_str.replace('\n', '').split(' ')
# Create simple dictionary for ID, timestamp, and interest columns
ddict = {}
ddict['id'] = [i.split(',')[0] for i in chunks]
ddict['timestamp'] = [i.split(',')[4] for i in chunks]
ddict['interest'] = [i.split(',')[-1] for i in chunks]
# Convert dictionary to pandas DataFrame
df = pd.DataFrame(ddict)
# Create dictionary for sample data
# This is an existing ID with timestamp in the future and 1 as interest
tdict = {
'id': '180707764',
'timestamp': '20180402035945',
'interest': '1',
}
What df looks like once the sample row has been appended and the frame sorted (the step that follows):
id timestamp interest
0 180707764 20180402035925 0
1 180707764 20180402035935 0
2 180707764 20180402035945 1
3 180762508 20180402035748 1
4 180762508 20180402035758 1
5 180762508 20180402035808 1
6 180762508 20180402035818 1
7 180762508 20180402035828 1
Continuing on:
# Append that dictionary to your dataframe and sort by id, timestamp
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, pd.DataFrame([tdict])], ignore_index=True)
df = df.sort_values(['id', 'timestamp']).reset_index(drop=True)
# Shift dataframe back 1 period by rows
df2 = pd.DataFrame(df.shift(periods=-1, axis=0))
# Merge that dataframe with our original dataframe by index values
# We're dropping an extra id column and renaming our primary id column for aesthetics
df3 = df.merge(df2, left_index=True, right_index=True, suffixes=('_prev', '_curr'))
df3 = df3.drop('id_curr', axis=1).rename(columns={'id_prev': 'id'})
What df3 looks like:
id timestamp_prev interest_prev timestamp_curr interest_curr
0 180707764 20180402035925 0 20180402035935 0
1 180707764 20180402035935 0 20180402035945 1
2 180707764 20180402035945 1 20180402035748 1
3 180762508 20180402035748 1 20180402035758 1
4 180762508 20180402035758 1 20180402035808 1
5 180762508 20180402035808 1 20180402035818 1
6 180762508 20180402035818 1 20180402035828 1
7 180762508 20180402035828 1 NaN NaN
Now we can just create a conditional to return the row where interest changed from 0 to 1:
In[0]: df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')]
Which returns:
id timestamp_prev interest_prev timestamp_curr interest_curr
1 180707764 20180402035935 0 20180402035945 1
You can also return specific columns by adding those onto the end of the result set:
df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')]['timestamp_curr']
df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')][['id', 'timestamp_curr']]
Or use the original dataframe (df) and .iloc to get specified data:
df.iloc[df3[(df3['interest_prev'] == '0') & (df3['interest_curr'] == '1')].index, :]
Out:
id timestamp interest
1 180707764 20180402035935 0
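The shift-and-compare idea can also be applied per id with groupby, so that the last row of one id is never compared with the first row of the next. The sketch below is only an illustration: it uses a shortened version of the sample data, includes the same hypothetical extra row (timestamp 20180402035945, interest 1) that the answer appends, and the names prev and changed are made up.

```python
import pandas as pd

# Shortened sample; the third row is the hypothetical 0 -> 1 change
df = pd.DataFrame({
    "id": [180707764, 180707764, 180707764, 180762508, 180762508],
    "timestamp": [20180402035925, 20180402035935, 20180402035945,
                  20180402035748, 20180402035758],
    "interest": [0, 0, 1, 1, 1],
})

# Order rows in time within each id
df = df.sort_values(["id", "timestamp"])

# Previous interest value within the same id (NaN for each id's first row)
prev = df.groupby("id")["interest"].shift(1)

# Rows where the value went from 0 to 1 inside one id
changed = df[(prev == 0) & (df["interest"] == 1)]
print(changed)
```

The first row of each id has no predecessor, so its comparison is False and it is never reported as a change.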

Sort a column within groups in Pandas

I am new to pandas. I'm trying to sort a column within each group. So far, I was able to group first and second column values together and calculate the mean value in third column. But I am still struggling to sort 3rd column.
This is my input dataframe
This is my dataframe after applying groupby and mean function
I used the following line of code to group input dataframe,
df_o=df.groupby(by=['Organization Group','Department']).agg({'Total Compensation':np.mean})
Please let me know how to sort the last column for each group in 1st column using pandas.
It seems you need sort_values:
# add the parameter as_index=False to get a DataFrame back
df_o=df.groupby(['Organization Group','Department'],
as_index=False)['Total Compensation'].mean()
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
Sample:
df = pd.DataFrame({'Organization Group':['a','b','a','a'],
'Department':['d','f','a','a'],
'Total Compensation':[1,8,9,1]})
print (df)
Department Organization Group Total Compensation
0 d a 1
1 f b 8
2 a a 9
3 a a 1
df_o=df.groupby(['Organization Group','Department'],
as_index=False)['Total Compensation'].mean()
print (df_o)
Organization Group Department Total Compensation
0 a a 5
1 a d 1
2 b f 8
df_o = df_o.sort_values(['Total Compensation','Organization Group'])
print (df_o)
Organization Group Department Total Compensation
1 a d 1
0 a a 5
2 b f 8
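If the goal is specifically to order the means inside each Organization Group, one possible variation (an assumption about the intent, not part of the answer above) is to put the group column first in sort_values, so that rows are grouped together and then ordered by compensation within each group. The name df_sorted is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Organization Group": ["a", "b", "a", "a"],
    "Department": ["d", "f", "a", "a"],
    "Total Compensation": [1, 8, 9, 1],
})

# Mean compensation per (group, department), returned as a flat DataFrame
df_o = df.groupby(["Organization Group", "Department"],
                  as_index=False)["Total Compensation"].mean()

# Sort by group first, then by the mean within each group
df_sorted = df_o.sort_values(["Organization Group", "Total Compensation"])
print(df_sorted)
```

Passing ascending=[True, False] to sort_values would instead list the highest mean first within each group.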
