Suppose I have a dataframe that looks like this
d = {'User' : ['A', 'A', 'B', 'C', 'C', 'C'],
'time':[1,2,3,4,4,4],
'state':['CA', 'CA', 'ID', 'OR','OR','OR']}
df = pd.DataFrame(data = d)
Now suppose I want to create a new dataframe that takes the average and median of time, grabs the user's state, and also generates a new column that counts the number of times that user appears in the User column, i.e.
d = {'User' : ['A', 'B', 'C'],
'avg_time':[1.5,3,4],
'median_time':[1.5,3,4],
'state':['CA','ID','OR'],
'user_count':[2,1,3]}
df_res = pd.DataFrame(data=d)
I know that I can do a group by mean statement like this
df.groupby(['User'], as_index=False).mean().groupby('User')['time'].mean()
This gives me a pandas Series, and I assume I could turn it into a dataframe if I wanted to, but how would I do the above for all the other columns I am interested in at the same time?
Try using pd.NamedAgg:
df.groupby('User').agg(avg_time=('time','mean'),
                       median_time=('time','median'),
                       state=('state','first'),
                       user_count=('time','count')).reset_index()
Output:
  User  avg_time  median_time state  user_count
0    A       1.5          1.5    CA           2
1    B       3.0          3.0    ID           1
2    C       4.0          4.0    OR           3
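Named aggregation has been available since pandas 0.25. As a sketch of the equivalent, more verbose form, the same thing can also be spelled with explicit pd.NamedAgg objects:
df.groupby('User').agg(
    avg_time=pd.NamedAgg(column='time', aggfunc='mean'),
    median_time=pd.NamedAgg(column='time', aggfunc='median'),
    state=pd.NamedAgg(column='state', aggfunc='first'),
    user_count=pd.NamedAgg(column='time', aggfunc='count'),
).reset_index()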
You can even pass multiple aggregate functions per column in the form of a dictionary, something like this:
out = df.groupby('User').agg({'time': ['mean', 'median'], 'state': ['first']})
time state
mean median first
User
A 1.5 1.5 CA
B 3.0 3.0 ID
C 4.0 4.0 OR
This gives multi-level columns; you can either drop a level or just join them:
>>> out.columns = ['_'.join(col) for col in out.columns]
>>> out
time_mean time_median state_first
User
A 1.5 1.5 CA
B 3.0 3.0 ID
C 4.0 4.0 OR
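Alternatively, before joining, the top level could simply be dropped (a quick sketch; the remaining names are then just the aggregation names, e.g. 'first' rather than 'state_first'):
out.columns = out.columns.droplevel(0)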
Suppose I have a dataframe like this
d = {'User':['A', 'A', 'B'],
'time':[1,2,3],
'state':['CA', 'CA', 'OR'],
'type':['cd', 'dvd', 'cd']}
df = pd.DataFrame(data=d)
I want to create a function to which I will pass a single user's dataframe, so for example
user_df = df[df['User'] == 'A']
Then the function will return a single-row dataframe that will look like this
d = {'User':['A'],
'avg_time':[1.5],
'state':['CA'],
'cd':[1],
'dvd':[1]}
res_df = pd.DataFrame(data=d)
Then that function will be applied across the entire dataframe of users, so I will have something like
def some_function(user_df):
    ...
and I will write df.groupby('User').apply(some_function). The result will then be this new dataframe:
d = {'User':['A','B'],
'avg_time':[1.5, 3],
'state':['CA', 'OR'],
'cd':[1, 1],
'dvd':[1, 0]}
final_df = pd.DataFrame(data=d)
I know I can grab values for the df like this
avg_time = user_df['time'].mean()
state = user_df['state'].iloc[0]
type_counts = user_df['type'].value_counts().to_dict()
But I am not sure how to transform this into a one-row results dataframe. Any help is appreciated. The reason I want to do it this way instead of using .agg() is that I am going to parallelize this function to make it run faster, since I will have a very large dataframe.
IIUC,
def aggUser(df):
    a = pd.DataFrame({'avg_time': [df['time'].mean()],
                      'state': [df['state'].iloc[0]]})
    b = df['type'].value_counts().to_frame().T.reset_index(drop=True)
    return pd.concat([a, b], axis=1).set_axis(df['User'].iloc[[0]])
pd.concat([aggUser(df.query('User == "A"')),
aggUser(df.query('User == "B"'))])
Output:
avg_time state cd dvd
User
A 1.5 CA 1 1.0
B 3.0 OR 1 NaN
Then apply the same function across every user at once:
df.groupby('User', group_keys=False).apply(aggUser)
Output:
avg_time state cd dvd
User
A 1.5 CA 1 1.0
B 3.0 OR 1 NaN
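Since the stated goal is to parallelize this per-user function, here is a rough sketch (not part of the original answer) of how that could look with the standard library's multiprocessing, assuming aggUser is defined at module level so it can be pickled:
from multiprocessing import Pool

if __name__ == '__main__':   # the guard matters on platforms that spawn worker processes
    groups = [g for _, g in df.groupby('User')]
    with Pool() as pool:
        parts = pool.map(aggUser, groups)   # run aggUser on each user's sub-frame in parallel
    result = pd.concat(parts)
Whether this is actually faster depends on the group sizes; for many small groups the pickling overhead can easily outweigh the parallelism.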
I have tried to search on Stackoverflow for the answer to this and while there are similar answers, I have tried to adapt the accepted answers and I'm struggling to achieve the result I want.
I have a dataframe:
df = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D'],
                   'Sales': [100, 200, 300, 400],
                   'Cost': [2.25, 2.50, 2.10, 3.00]})
and another one:
split = pd.DataFrame({'Customer': ['B', 'D']})
I want to create two new dataframes from the original dataframe df, one containing the data from the split dataframe and the other one containing data, not in the split. I need the original structure of df to remain in both of the newly created dataframes.
I have explored isin, merge, drop and loops, but there must be an elegant way to achieve what appears to be a simple task?
Use Series.isin with boolean indexing for filtering; ~ inverts the boolean mask:
mask = df['Customer'].isin(split['Customer'])
df1 = df[mask]
print (df1)
Customer Sales Cost
1 B 200 2.5
3 D 400 3.0
df2 = df[~mask]
print (df2)
Customer Sales Cost
0 A 100 2.25
2 C 300 2.10
Another solution, which also works if you need to match on multiple columns, uses DataFrame.merge (if no on parameter is given, it joins on all common columns) with an outer join and the indicator parameter:
df4 = df.merge(split, how='outer', indicator=True)
print (df4)
Customer Sales Cost _merge
0 A 100 2.25 left_only
1 B 200 2.50 both
2 C 300 2.10 left_only
3 D 400 3.00 both
Then filter by the _merge column with different masks:
df11 = df4[df4['_merge'] == 'both']
print (df11)
Customer Sales Cost _merge
1 B 200 2.5 both
3 D 400 3.0 both
df21 = df4[df4['_merge'] == 'left_only']
print (df21)
Customer Sales Cost _merge
0 A 100 2.25 left_only
2 C 300 2.10 left_only
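As a compact variant (a sketch, not part of the original answer), grouping on the boolean mask yields both halves in a single pass; .get returns None if one side happens to be empty:
parts = dict(list(df.groupby(df['Customer'].isin(split['Customer']))))
df1, df2 = parts.get(True), parts.get(False)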
I have two dataframes: d, containing (date, name), and d1, containing (date, name, rank). I need to join these two on name such that each row in the first dataframe is assigned the latest rank as of its date,
i.e. d.name = d1.name and d1.date is the latest date that is on or before d.date.
What is the easiest way of doing this?
import pandas as pd
d = pd.DataFrame({'date' : ['20070105', '20130105', '20150102',
'20170106', '20190106'], 'name': ['a', 'b', 'a', 'b', 'a']})
d
date name
0 20070105 a
1 20130105 b
2 20150102 a
3 20170106 b
4 20190106 a
d1 = pd.DataFrame({'date' : ['20140105', '20160105', '20180103',
'20190106'], 'rank' : [1, 2, 2, 1], 'name': ['a', 'b', 'a', 'b']})
d1
date name rank
0 20140105 a 1
1 20160105 b 2
2 20180103 a 2
3 20190106 b 1
I'm expecting 'rank' to be added to 'd' and have output like this:
       date name  rank
0  20070105    a   NaN
1  20130105    b   NaN
2  20150102    a     1
3  20170106    b     2
4  20190106    a     2
I assume you require something like this:
Sort your second dataframe in ascending order by date, then use drop_duplicates with keep='last'. Now merge the first dataframe with the processed second dataframe.
df2 = df2.sort_values(by='date')
temp=df2.drop_duplicates(subset=['name'], keep='last')
print (pd.merge(df1,temp, on=['name'], how='left'))
Note: as you did not post sample input and output, I assumed column names and variables as above. For an exact result, provide sample input and output.
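For the "latest rank as of date" requirement specifically, pd.merge_asof looks like a natural fit. A sketch, assuming the frames are named d and d1 as in the question and converting the date strings to datetimes first:
d['date'] = pd.to_datetime(d['date'])
d1['date'] = pd.to_datetime(d1['date'])

# for each row of d, take the most recent d1 row (per name) with d1.date <= d.date
res = pd.merge_asof(d.sort_values('date'), d1.sort_values('date'),
                    on='date', by='name', direction='backward')
Both frames must be sorted by the on key, which is why the sort_values calls are there.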
I am struggling to figure out how to develop a square matrix given a format like
a a 0
a b 3
a c 4
a d 12
b a 3
b b 0
b c 2
...
To something like:
a b c d e
a 0 3 4 12 ...
b 3 0 2 7 ...
c 4 3 0 .. .
d 12 ...
e . ..
in pandas. I developed a method which I think works, but it takes forever to run because it has to iterate through every column and row for each value, starting from the beginning each time, using for loops. I feel like I'm definitely reinventing the wheel here. This also isn't realistic for my dataset given how many columns and rows there are. Is there something similar to R's cast function in python which can do this significantly faster?
You could use df.pivot:
import pandas as pd
df = pd.DataFrame([['a', 'a', 0],
['a', 'b', 3],
['a', 'c', 4],
['a', 'd', 12],
['b', 'a', 3],
['b', 'b', 0],
['b', 'c', 2]], columns=['X','Y','Z'])
print(df.pivot(index='X', columns='Y', values='Z'))
yields
Y a b c d
X
a 0.0 3.0 4.0 12.0
b 3.0 0.0 2.0 NaN
Here, index='X' tells df.pivot to use the column labeled 'X' as the index, and columns='Y' tells it to use the column labeled 'Y' as the column index.
See the docs for more on pivot and other reshaping methods.
Alternatively, you could use pd.crosstab:
print(pd.crosstab(index=df.iloc[:,0], columns=df.iloc[:,1],
values=df.iloc[:,2], aggfunc='sum'))
Unlike df.pivot which expects each (a1, a2) pair to be unique, pd.crosstab
(with aggfunc='sum') will aggregate duplicate pairs by summing the associated
values. Although there are no duplicate pairs in your posted example, specifying
how duplicates are supposed to be aggregated is required when the values
parameter is used.
Also, whereas df.pivot is passed column labels, pd.crosstab is passed
array-likes (such as whole columns of df). df.iloc[:, i] is the ith column
of df.
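If the result needs to be truly square (the same labels on both axes), with the gaps filled from the mirrored entries, here is one follow-up sketch (not part of the original answer):
m = df.pivot(index='X', columns='Y', values='Z')
labels = sorted(set(m.index) | set(m.columns))
m = m.reindex(index=labels, columns=labels)
m = m.combine_first(m.T)   # fill each missing cell from its mirrored entry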
I usually use value_counts() to get the number of occurrences of a value. However, I now deal with large database tables (they cannot be loaded fully into RAM) and query the data in chunks of one month.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of
user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
Up until now, I just have found a method using groupby and sum:
# count users actions and remember them in new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# delete not necessary columns
df1 = df1[['userId', 'count']]
# delete not necessary rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest "add" and specify a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two DataFrames have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument handles any NaN values that would otherwise arise, in this example the 'd' that appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b,fill_value=0)
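Note that the result typically comes back as float64, because index alignment introduces NaN before fill_value is applied; chaining .astype(int), e.g. a.add(b, fill_value=0).astype(int), converts it back to integer counts if needed.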
You can sum the series generated by the value_counts method directly:
#create frames
df= pd.DataFrame({'User_id': ['a','a','b','c','c'],'a':[1,1,2,3,3]})
df1= pd.DataFrame({'User_id': ['a','a','b','b','c','c','c'],'a':[1,1,2,2,3,3,4]})
Sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
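Note that the plain + version returns NaN for any user that appears in only one of the frames; that is exactly the case the add(..., fill_value=0) approach above handles.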
This is known as "split-apply-combine". It can be done in one line and three quick steps, using a lambda function as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace the three occurrences of label with the name of the column whose values you are counting (case-sensitive)
3️⃣ print df.head() to check it has worked correctly