I have df below:
import pandas as pd

df = pd.DataFrame({
'ID': ['a', 'a', 'a', 'b', 'c', 'c'],
'V1': [False, False, True, True, False, True],
'V2': ['A', 'B', 'C', 'B', 'B', 'C']
})
I want to achieve the following. For each unique ID, the last row has V1 == True. I want to count how many times each unique value of V2 occurs where V1==True. This part would be achieved by something like:
df.groupby('V2').V1.sum()
However, I also want to add, for each unique value of V2, a column indicating how many times that value occurred after the point where V1==True for the V2 value indicated by the row. I understand this might sound confusing; here is what the output would look like in this example:
df
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 2 0
It is important that the solution is general enough to be applicable on a similar case with more unique values than just A, B and C.
UPDATE
As a bonus, I am also interested in how one can return, instead of the count, the sum of some value column under the same conditions, divided by the corresponding "count" in the rows. Example: suppose we now start from the df below instead:
df = pd.DataFrame({
'ID': ['a', 'a', 'a', 'b', 'c', 'c'],
'V1': [False, False, True, True, False, True],
'V2': ['A', 'B', 'C', 'B', 'B', 'C'],
'V3': [1, 2, 3, 4, 5, 6],
})
The output would need to sum V3 for the cases indicated by the counts in the solution by #jezrael, and divide each sum by the corresponding count, i.e. take the mean (for the C row, the B entry becomes (2 + 5) / 2 = 3.5). The output would instead look like:
df
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 3.5 0
First, aggregate the sum of V1 per V2:
df1 = df.groupby('V2').V1.sum().astype(int).reset_index()
print (df1)
V2 V1
0 A 0
1 B 1
2 C 2
Then group by ID and create a helper column holding each group's last V2 value with GroupBy.transform('last'), remove the last row of each ID with DataFrame.duplicated, use crosstab for the counts, reindex to add all possible unique values of V2, and finally append to df1 with DataFrame.join:
val = df['V2'].unique()
df['new'] = df.groupby('ID').V2.transform('last')
df = df[df.duplicated('ID', keep='last')]
df = pd.crosstab(df['new'], df['V2']).reindex(columns=val, index=val, fill_value=0)
df = df1.join(df, on='V2')
print (df)
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 2 0
UPDATE
The updated part of the question can be achieved by replacing the crosstab step with a pivot table:
df = df.pivot_table(
    index='new',
    columns='V2',
    values='V3',
    aggfunc='mean',
    fill_value=0
).reindex(columns=val, index=val, fill_value=0)
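The snippet above still relies on the helper column and row filtering created in the count solution, so here is a self-contained sketch of the bonus part under the same assumptions (the toy frame with the V3 column):

import pandas as pd

df = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'b', 'c', 'c'],
    'V1': [False, False, True, True, False, True],
    'V2': ['A', 'B', 'C', 'B', 'B', 'C'],
    'V3': [1, 2, 3, 4, 5, 6],
})

# sum of V1 per V2, as before
df1 = df.groupby('V2').V1.sum().astype(int).reset_index()

# helper column: last V2 value of each ID
val = df['V2'].unique()
df['new'] = df.groupby('ID').V2.transform('last')

# drop the last row of each ID, then take the mean of V3
# per (last value, V2) pair instead of a plain count
df = df[df.duplicated('ID', keep='last')]
df = df.pivot_table(index='new', columns='V2', values='V3',
                    aggfunc='mean', fill_value=0
                    ).reindex(columns=val, index=val, fill_value=0)

print(df1.join(df, on='V2'))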
Related
I have a pandas data frame like this:
df = pd.DataFrame({"Id": [1, 1, 1, 2, 2, 2, 2],
"Letter": ['A', 'B', 'C', 'A', 'D', 'B', 'C']})
How can I efficiently add a new column "Merge" that concatenates all the values from the "Letter" column by "Id", so that the final data frame looks like the output shown in the answer below?
You can group by the Id column and then transform:
df['Merge'] = df.groupby('Id').transform(lambda x: '-'.join(x))
print(df)
Id Letter Merge
0 1 A A-B-C
1 1 B A-B-C
2 1 C A-B-C
3 2 A A-D-B-C
4 2 D A-D-B-C
5 2 B A-D-B-C
6 2 C A-D-B-C
Thanks to sammywemmy for pointing out that the lambda is needless here, so you can use a simpler form:
df['Merge'] = df.groupby('Id').transform('-'.join)
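As a small aside (not part of the original answer): if the frame had more columns than Id and Letter, selecting the column explicitly keeps the join restricted to Letter; a minimal sketch on the same toy frame:

# restrict the transform to the 'Letter' column only
df['Merge'] = df.groupby('Id')['Letter'].transform('-'.join)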
I have a dataframe with a unique index and columns 'users', 'tweet_times' and 'tweet_id'.
I want to count the number of duplicate tweet_time values per user.
users = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
tweet_times = ['01-01-01 01:00', '02-02-02 02:00', '03-03-03 03:00', '09-09-09 09:00',
'04-04-04 04:00', '04-04-04 04:00', '05-05-05 05:00', '09-09-09 09:00',
'06-06-06 06:00', '06-06-06 06:00', '07-07-07 07:00', '07-07-07 07:00']
d = {'users': users, 'tweet_times': tweet_times}
df = pd.DataFrame(data=d)
Desired Output
A: 0
B: 1
C: 2
I managed to get the desired output (except for the A: 0) using the code below, but is there a more pythonic / efficient way to do this?
# group by both columns
df2 = pd.DataFrame(df.groupby(['users', 'tweet_times']).tweet_id.count())
# filter out values < 2
df3 = df2[df2.tweet_id > 1]
# turn multi-index level 1 into column
df3.reset_index(level=[1], inplace=True)
# final groupby
df3.groupby('users').tweet_times.count()
We can use crosstab to create a frequency table, then check for counts greater than 1 to create a boolean mask, and finally sum this mask along axis=1:
pd.crosstab(df['users'], df['tweet_times']).gt(1).sum(1)
users
A 0
B 1
C 2
dtype: int64
This works:
df1 = pd.DataFrame(df.groupby(['users'])['tweet_times'].value_counts()).reset_index(level = 0)
df1.groupby('users')['tweet_times'].apply(lambda x: sum(x>1))
users
A 0
B 1
C 2
Name: tweet_times, dtype: int64
You can use a custom boolean with your groupby.
The keep=False option returns True when a value is duplicated and False if not.
# df['tweet_times'] = pd.to_datetime(df['tweet_times'],errors='coerce')
df.groupby([df.duplicated(subset=['tweet_times'],keep=False),'users']
).nunique().loc[True]
tweet_times
users
A 0
B 1
C 2
There might be a simpler way, but this is all I can come up with for now :)
df.groupby("users")["tweet_times"].agg(lambda x: x.count() - x.nunique()).rename("count_dupe")
Output:
users
A 0
B 1
C 2
Name: count_dupe, dtype: int64
This looks quite pythonic to me:
df.groupby("users")["tweet_times"].count() - df.groupby("users")["tweet_times"].nunique()
Output:
users
A 0
B 1
C 2
Name: tweet_times, dtype: int64
I have the given dataframe
Id Direction Load Unit
1 CN05059815 LoadFWD 0,0 NaN
2 CN05059815 LoadBWD 0,0 NaN
4 ....
....
and the given list:
list =['CN05059830','CN05059946','CN05060010','CN05060064' ...]
I would like to sort or group the data by the order given in the list.
For example,
The new data will have exactly the same order as the list. The first rows would start with CN05059815, which doesn't belong to the list; then come CN05059830, CN05059946, ..., which belong to the list, with the remaining data following.
One way is to use Categorical Data. Here's a minimal example:
# sample dataframe
df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
# required ordering
lst = ['D', 'E', 'A', 'B']
# convert to categorical
df['col'] = df['col'].astype('category')
# set order, adding values not in lst to the front
order = list(set(df['col']) - set(lst)) + lst
# attach ordering information to categorical series
df['col'] = df['col'].cat.reorder_categories(order)
# apply ordering
df = df.sort_values('col')
print(df)
col
2 C
5 F
3 D
4 E
0 A
1 B
Consider the approach and example below:
df = pd.DataFrame({
'col': ['a', 'b', 'c', 'd', 'e']
})
list_ = ['d', 'b', 'a']
print(df)
Output:
col
0 a
1 b
2 c
3 d
4 e
Then, to sort the df according to the list and its ordering:
df.reindex(
    df.assign(dummy=df['col'])['dummy']
    .apply(lambda x: list_.index(x) if x in list_ else -1)
    .sort_values()
    .index
)
Output:
col
2 c
4 e
3 d
1 b
0 a
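As a side note (not from the original answers), on pandas 1.1 or newer a similar ordering can be sketched with the key argument of sort_values, mapping values missing from list_ to -1 so they come first:

# assumes pandas >= 1.1 for the key argument of sort_values
df.sort_values('col', key=lambda s: s.map(lambda x: list_.index(x) if x in list_ else -1))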
I am not sure if this is possible. I have two dataframes df1 and df2 which are presented like this:
df1          df2
id  value    id  value
a   5        a   3
c   9        b   7
d   4        c   6
f   2        d   8
             e   2
             f   1
They will have many more entries in reality than presented here. I would like to create a third dataframe df3 based on the values in df1 and df2. Any values in df1 would take precedence over values in df2 when writing to df3 (if the same id is present in both df1 and df2) so in this example I would return:
df3
id value
a 5
b 7
c 9
d 4
e 2
f 2
I have tried using df2 as the base (df2 will have all of the id's present for the whole universe) and then overwriting the value for id's that are present in df1, but cannot find the merge syntax to do this.
You could use combine_first, provided that you first set id as the index of each DataFrame (so that the values get aligned by id):
In [80]: df1.set_index('id').combine_first(df2.set_index('id')).reset_index()
Out[80]:
id value
0 a 5.0
1 b 7.0
2 c 9.0
3 d 4.0
4 e 2.0
5 f 2.0
Since you mentioned merging, you might be interested in seeing that
you could merge df1 and df2 on id, and then use fillna to replace NaNs in df1's value column with values from df2's value column:
df1 = pd.DataFrame({'id': ['a', 'c', 'd', 'f'], 'value': [5, 9, 4, 2]})
df2 = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'], 'value': [3, 7, 6, 8, 2, 1]})
result = pd.merge(df2, df1, on='id', how='left', suffixes=('_x', ''))
result['value'] = result['value'].fillna(result['value_x'])
result = result[['id', 'value']]
print(result)
yields the same result, though the first method is simpler.
I usually use value_counts() to get the number of occurrences of a value. However, I now deal with large database tables (which cannot be loaded fully into RAM) and query the data in chunks of one month.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of
user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
Up until now, I have only found a method using groupby and sum:
# count users actions and remember them in new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# delete not necessary columns
df1 = df1[['userId', 'count']]
# delete not necessary rows
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1, df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest "add" and specify a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two DataFrames have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument will handle any NaN values that would arise; in this example, the 'd' that appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b,fill_value=0)
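For this toy example, the combined counts should come out roughly as follows (a sketch; the dtype becomes float because 'd' is missing from df2 and is filled before adding):

a    4.0
b    3.0
c    5.0
d    1.0
dtype: float64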
You can sum the series generated by the value_counts method directly:
#create frames
df= pd.DataFrame({'User_id': ['a','a','b','c','c'],'a':[1,1,2,3,3]})
df1= pd.DataFrame({'User_id': ['a','a','b','b','c','c','c'],'a':[1,1,2,2,3,3,4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
This is known as "Split-Apply-Combine". It can be done in one line, using a lambda function as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace the three occurrences of label with the name of the column whose values you are counting (case-sensitive)
3️⃣ print(df.head()) to check it has worked correctly
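As a side note (not part of the original answer), the lambda can be avoided here as well, since transform accepts built-in aggregation names; a minimal sketch, assuming the same placeholder column name label:

# 'size' counts all rows per group; use 'count' instead to skip NaNs
df['total_for_this_label'] = df.groupby('label')['label'].transform('size')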