I have a dataframe that looks like this:
user_id article_id set_tags
1 31 true
1 32 false
1 35 false
2 11 false
2 11 true
3 56 true
I want to get a result like this:
user_id total_articles set_tags_true set_tags_false
1 3 1 2
2 2 1 1
3 1 1 0
I'm new to this; how can I do this?
I tried to use groupby.count(), but the result doesn't seem correct.
import pandas as pd

df = pd.DataFrame(
    data=[[1, 31, True], [1, 32, False], [1, 35, False], [2, 11, False], [2, 11, True], [3, 56, True]],
    columns=['user_id', 'article_id', 'set_tags']
)
df
user_id article_id set_tags
0 1 31 True
1 1 32 False
2 1 35 False
3 2 11 False
4 2 11 True
5 3 56 True
# nunique counts distinct articles; 'sum' counts the True tags; the lambda counts the False tags
output_df = df.groupby('user_id').agg({'article_id': 'nunique',
                                       'set_tags': ['sum', lambda x: (~x).sum()]})
output_df.columns = ['total_articles', 'set_tags_True', 'set_tags_False']
output_df
total_articles set_tags_True set_tags_False
user_id
1 3 1 2
2 1 1 1
3 1 1 0
If you want the total_articles entry for user_id 2 to be 2 instead of 1, just replace nunique with count.
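On a recent pandas (this sketch assumes pandas >= 0.25), the same aggregation reads more cleanly with named aggregation; count is used here so the duplicated article_id for user 2 is counted twice:
output_df = df.groupby('user_id').agg(
    total_articles=('article_id', 'count'),             # counts duplicate articles too
    set_tags_true=('set_tags', 'sum'),                  # each True adds 1
    set_tags_false=('set_tags', lambda s: (~s).sum()),  # counts the False rows
).reset_index()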
I am trying to create a column 'Count' on a pandas DataFrame that cumulatively counts while the field 'Boolean' is True, but resets and stays at 0 while 'Boolean' is False. It also needs to be grouped by the ID column, so the count resets when a new ID starts. No loops please, as I am working with a big data set.
I used the code from the following question, which works, but I need to add a group by on the ID column:
Pandas Dataframe - Row Iteration with Resetting Count-Value by Condition without loop
Expected output below (the ID and Boolean columns already exist; I just need to create Count):
ID Boolean Count
1 True 1
1 True 2
1 True 3
1 True 4
1 True 5
1 False 0
1 False 0
1 False 0
1 False 0
1 True 1
1 True 2
1 True 3
2 True 1
2 True 2
2 True 3
2 True 4
2 False 0
2 False 0
2 False 0
2 True 1
2 True 2
2 True 3
Identify blocks by using cumsum on the inverted boolean mask, then group the dataframe by ID and block and use cumsum on Boolean to create the counter:
# each False row bumps the counter, so every run of True rows gets a single block label
b = (~df['Boolean']).cumsum()
# within each (ID, block) group, cumsum over the booleans counts the True rows; False rows stay 0
df['Count'] = df.groupby(['ID', b])['Boolean'].cumsum()
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3
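Details: a quick look at the intermediate block key b for the data above shows how each False row bumps the counter, so every run of True rows within an ID shares one label:
print(b.tolist())
# [0, 0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 6, 7, 7, 7, 7]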
Another way: diff marks the rows where Boolean changes within an ID, and a cumsum over those change markers labels the runs:
df['Count'] = df.groupby('ID')['Boolean'].diff()               # True where Boolean changed
df = df.fillna(False)                                          # the first row of each ID has no previous value
df['Count'] = df.groupby('ID')['Count'].cumsum()               # label each run of equal values
df['Count'] = df.groupby(['ID', 'Count'])['Boolean'].cumsum()  # count True rows within each run
df
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3
You can compare the ID and Boolean columns with their shifted values to identify the groups to group by, then take a cumsum of Boolean within each of those groups:
# a new group starts wherever ID or Boolean differs from the previous row
groups = ((df['ID'] != df['ID'].shift()) | (df['Boolean'] != df['Boolean'].shift())).cumsum()
df.assign(Count2=df.groupby(groups)['Boolean'].cumsum())
Result:
ID Boolean Count Count2
0 1 True 1 1
1 1 True 2 2
2 1 True 3 3
3 1 True 4 4
4 1 True 5 5
5 1 False 0 0
6 1 False 0 0
7 1 False 0 0
8 1 False 0 0
9 1 True 1 1
10 1 True 2 2
11 1 True 3 3
12 2 True 1 1
13 2 True 2 2
14 2 True 3 3
15 2 True 4 4
16 2 False 0 0
17 2 False 0 0
18 2 False 0 0
19 2 True 1 1
20 2 True 2 2
21 2 True 3 3
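As a variant of the same idea (a sketch, equivalent to the expression above), the grouping key can be built by comparing both columns to their shifted values in one shot:
# True wherever ID or Boolean changes from the previous row; cumsum labels the runs
groups = df[['ID', 'Boolean']].ne(df[['ID', 'Boolean']].shift()).any(axis=1).cumsum()
df.assign(Count2=df.groupby(groups)['Boolean'].cumsum())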
I have a pandas dataframe like below.
id A B C
0 1 1 1 1
1 1 5 7 2
2 2 6 9 3
3 3 1 5 4
4 3 4 6 2
After evaluating the conditions, it looks like this:
id A B C a_greater_than_b b_greater_than_c c_greater_than_a
0 1 1 1 1 False False False
1 1 5 7 2 False True False
2 2 6 9 3 False True False
3 3 1 5 4 False True True
4 3 4 6 2 False True False
Then I want to aggregate the results per id:
id a_greater_than_b b_greater_than_c c_greater_than_a
1 False False False
2 False True False
3 False True False
The logic is not fully clear, but you can aggregate the boolean columns per group (here I am assuming min, as your example shows that a group containing any False row comes out False, but you can use other logics, e.g. last if you want the last row per group after sorting by date):
out = (df
       .groupby('id')[['a_greater_than_b', 'b_greater_than_c', 'c_greater_than_a']]
       .min()
       .reset_index())
print(out)
Output:
   id  a_greater_than_b  b_greater_than_c  c_greater_than_a
0   1             False             False             False
1   2             False              True             False
2   3             False              True             False
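For completeness, a minimal sketch (assuming the column names from the question) of building the boolean columns before aggregating; on booleans, .all() is equivalent to min:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 2, 3, 3],
                   'A': [1, 5, 6, 1, 4],
                   'B': [1, 7, 9, 5, 6],
                   'C': [1, 2, 3, 4, 2]})
# evaluate the row-wise conditions
df['a_greater_than_b'] = df['A'] > df['B']
df['b_greater_than_c'] = df['B'] > df['C']
df['c_greater_than_a'] = df['C'] > df['A']
# a group is True only if every row in it is True
out = (df.groupby('id')[['a_greater_than_b', 'b_greater_than_c', 'c_greater_than_a']]
         .all()
         .reset_index())
print(out)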
Sorry, I have a bit of trouble explaining the problem in the title.
By accident we pivoted our pandas DataFrame to this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 1, 2], [1, 2, 1], [2, 1, 2], [2, 2, 2], [3, 1, 3]]),
                  columns=['id', '3s', 'score'])
id 3s score
1 1 2
1 2 1
2 1 2
2 2 2
3 1 3
But we need to unstack this so df looks like the original version below. The '3s' column 'unpivots' into sets of 3 ordered columns of 0s and 1s that fill in order: for example, '3s' = 2 with 'score' = 2 becomes [1, 1, 0] (2 out of 3, in order) in columns ['4', '5', '6'] (the second set of 3) for the corresponding id.
df2 = pd.DataFrame(np.array([[1, 1, 1, 0, 1, 0, 0], [2, 1, 1, 0, 1, 1, 0], [3, 1, 1, 1, np.nan, np.nan, np.nan]]),
                   columns=['id', '1', '2', '3', '4', '5', '6'])
id 1 2 3 4 5 6
1 1 1 0 1 0 0
2 1 1 0 1 1 0
3 1 1 1
Any help greatly appreciated!
(please save me)
Use:
n = 3
# repeat every row n times so each (id, 3s) pair expands into n column slots
df2 = df.reindex(index=df.index.repeat(n))
# score: 1 while the position inside its (id, 3s) group is below the original score, else 0
# columns: the running position per id yields the output column labels 1..6
new_df = (df2.assign(score=df2['score'].gt(df2.groupby(['id', '3s']).id.cumcount()).astype(int),
                     columns=df2.groupby('id').cumcount().add(1))
             .pivot_table(index='id', values='score', columns='columns', fill_value='')
             .rename_axis(columns=None)
             .reset_index())
print(new_df)
Output:
id 1 2 3 4 5 6
0 1 1.0 1.0 0.0 1 0 0
1 2 1.0 1.0 0.0 1 1 0
2 3 1.0 1.0 1.0
If you prefer, you can use fill_value = 0:
id 1 2 3 4 5 6
0 1 1 1 0 1 0 0
1 2 1 1 0 1 1 0
2 3 1 1 1 0 0 0
This should do the trick:
for gr in df.groupby('3s').groups:
    for i in range(1, 4):
        # column label: 3s group 1 -> '1'..'3', group 2 -> '4'..'6'
        df[str(i + (gr - 1) * 3)] = np.where((df['3s'].eq(gr)) & (df['score'].ge(i)), 1, 0)
df = df.drop(['3s', 'score'], axis=1).groupby('id').max().reset_index()
Output:
id 1 2 3 4 5 6
0 1 1 1 0 1 0 0
1 2 1 1 0 1 1 0
2 3 1 1 1 0 0 0
Let's take an example of a pandas dataframe.
ID Age Bp
1 22 1
1 22 1
1 22 0
1 22 1
2 21 0
2 21 1
2 21 0
In the above data, the last n rows of column Bp per ID group (let's take n to be 2) should be left alone, and the rest of Bp should be changed to 0. I have tried it with tail but could not make it work.
It should look like this.
ID Age BP
1 22 0
1 22 0
1 22 0
1 22 1
2 21 0
2 21 1
2 21 0
Use cumcount with ascending=False to build a counter from the back of each group, then assign 0 with numpy.where:
import numpy as np

n = 2
# True for the last n rows of each ID group
mask = df.groupby('ID').cumcount(ascending=False) < n
# keep Bp in the last n rows, set it to 0 everywhere else
df['Bp'] = np.where(mask, df['Bp'], 0)
Alternatives:
df.loc[~mask, 'Bp'] = 0
df['Bp'] = df['Bp'].where(mask, 0)
print (df)
ID Age Bp
0 1 22 0
1 1 22 0
2 1 22 0
3 1 22 1
4 2 21 0
5 2 21 1
6 2 21 0
Details:
print (df.groupby('ID').cumcount(ascending=False))
0 3
1 2
2 1
3 0
4 2
5 1
6 0
dtype: int64
print (mask)
0 False
1 False
2 True
3 True
4 False
5 True
6 True
dtype: bool
For example, I have a dataframe:
cond value1 value2
0 True 1 1
1 False 3 5
2 True 34 2
3 True 23 23
4 False 4 2
I want to replace value1 with value2*2 when cond is True, so the result is:
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
I can achieve it with the following code:
def convert(x):
    if x.cond:
        x.value1 = x.value2 * 2
    return x

data = data.apply(convert, axis=1)
I think this will be slow when the data is big. I tried .loc, but I don't know how to set the values that way.
How can I do it with .loc or another simple way? Thanks in advance.
Create a boolean mask and multiply only the filtered rows:
mask = df.cond
df.loc[mask, 'value1'] = df.loc[mask, 'value2'] * 2
print (df)
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
You can use where/mask:
df.value1 = df.value1.mask(df.cond, df.value2*2)
# Or,
# df.value1 = df.value1.where(~df.cond, df.value2*2)
print(df)
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
Using np.where:
import numpy as np

df['value1'] = np.where(df.cond, df.value2 * 2, df.value1)
print(df)
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2