For example, I have a dataframe:
cond value1 value2
0 True 1 1
1 False 3 5
2 True 34 2
3 True 23 23
4 False 4 2
I want to replace value1 with value2*2 wherever cond is True, so the desired result is:
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
I can achieve it with the following code:
def convert(x):
    if x.cond:
        x.value1 = x.value2 * 2
    return x
data = data.apply(convert, axis=1)
I think this will be slow when the data is big. I tried .loc, but I don't know how to set the value with it. How can I achieve this with .loc or some other simple way? Thanks in advance.
Create a boolean mask and multiply only the filtered rows:
mask = df.cond
df.loc[mask, 'value1'] = df.loc[mask, 'value2'] * 2
print (df)
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
You can use where/mask:
df.value1 = df.value1.mask(df.cond, df.value2*2)
# Or,
# df.value1 = df.value1.where(~df.cond, df.value2*2)
print(df)
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
Using np.where:
import numpy as np
df['value1'] = np.where(df.cond, df.value2*2, df.value1)
print(df)
cond value1 value2
0 True 2 1
1 False 3 5
2 True 4 2
3 True 46 23
4 False 4 2
I am trying to create a column 'Count' on a pandas DataFrame that counts cumulatively while the field 'Boolean' is True, but resets to (and stays at) 0 while 'Boolean' is False. The count also needs to be grouped by the ID column, so it resets when a new ID starts. No loops please, as I am working with a big data set.
I used the code from the following question, which works, but I need to add a group-by on the ID column:
Pandas Dataframe - Row Iteration with Resetting Count-Value by Condition without loop
Expected output below (the ID and Boolean columns already exist; I just need to create Count):
ID Boolean Count
1 True 1
1 True 2
1 True 3
1 True 4
1 True 5
1 False 0
1 False 0
1 False 0
1 False 0
1 True 1
1 True 2
1 True 3
2 True 1
2 True 2
2 True 3
2 True 4
2 False 0
2 False 0
2 False 0
2 True 1
2 True 2
2 True 3
Identify the blocks by applying cumsum to the inverted boolean mask, then group the dataframe by ID and those blocks, and use cumsum on Boolean to create the counter:
b = (~df['Boolean']).cumsum()
df['Count'] = df.groupby(['ID', b])['Boolean'].cumsum()
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3
Another option is to derive the block labels with diff:
df['Count'] = df.groupby('ID')['Boolean'].diff()               # True where Boolean changed from the previous row within an ID
df = df.fillna(False)                                          # the first row of each ID has no previous value; treat it as no change
df['Count'] = df.groupby('ID')['Count'].cumsum()               # running block label per ID
df['Count'] = df.groupby(['ID', 'Count'])['Boolean'].cumsum()  # counter within each block; False blocks sum to 0
df
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3
You can use a column shift on the ID and Boolean columns to identify the groups to group by, then take a cumsum within each of those groups.
groups = ((df['ID'] != df['ID'].shift()) | (df['Boolean'] != df['Boolean'].shift())).cumsum()
df.assign(Count2=df.groupby(groups)['Boolean'].cumsum())
Result
ID Boolean Count Count2
0 1 True 1 1
1 1 True 2 2
2 1 True 3 3
3 1 True 4 4
4 1 True 5 5
5 1 False 0 0
6 1 False 0 0
7 1 False 0 0
8 1 False 0 0
9 1 True 1 1
10 1 True 2 2
11 1 True 3 3
12 2 True 1 1
13 2 True 2 2
14 2 True 3 3
15 2 True 4 4
16 2 False 0 0
17 2 False 0 0
18 2 False 0 0
19 2 True 1 1
20 2 True 2 2
21 2 True 3 3
I have a pandas dataframe like below.
id A B C
0 1 1 1 1
1 1 5 7 2
2 2 6 9 3
3 3 1 5 4
4 3 4 6 2
After evaluating the conditions, the frame looks like this:
id A B C a_greater_than_b b_greater_than_c c_greater_than_a
0 1 1 1 1 False False False
1 1 5 7 2 False True False
2 2 6 9 3 False True False
3 3 1 5 4 False True True
4 3 4 6 2 False True False
And I want to aggregate the results per id:
id a_greater_than_b b_greater_than_c c_greater_than_a
1 False False False
2 False True False
3 False True False
The logic is not fully clear, but you can combine pandas.get_dummies with an aggregation per group. Here I am assuming min, since your example shows that 1/1/0 -> 0 and 1/1/1 -> 1, but you can use other logic, e.g. last if you want the last row per group after sorting by date:
out = (pd
    .get_dummies(df[['color', 'size']])
    .groupby(df['id'])
    .min()
)
print(out)
Output:
color_blue color_yellow size_l
id
A1 0 0 1
I have a dataframe that looks like this:
user_id article_id set_tags
1 31 true
1 32 false
1 35 false
2 11 false
2 11 true
3 56 true
I want to get the result like this:
user_id total_articles set_tags_true set_tags_false
1 3 1 2
2 2 1 1
3 1 1 0
I'm new to this; how can I do it? I tried groupby.count(), but the result doesn't seem correct.
import pandas as pd
df = pd.DataFrame(
    data=[[1, 31, True], [1, 32, False], [1, 35, False],
          [2, 11, False], [2, 11, True], [3, 56, True]],
    columns=['user_id', 'article_id', 'set_tags'],
)
df
user_id article_id set_tags
0 1 31 True
1 1 32 False
2 1 35 False
3 2 11 False
4 2 11 True
5 3 56 True
output_df = df.groupby('user_id').agg({'article_id':'nunique', 'set_tags':['sum', (lambda x:sum(~x))]})
output_df.columns = ['total_articles','set_tags_True','set_tags_False']
output_df
total_articles set_tags_True set_tags_False
user_id
1 3 1 2
2 1 1 1
3 1 1 0
If you want the total_articles entry for user_id 2 to be 2 instead of 1, just replace nunique with count.
Following is what my dataframe looks like. Expected_Output is my desired/target column.
Group Value1 Value2 Expected_Output
0 1 3 9 True
1 1 7 6 True
2 1 9 7 True
3 2 3 8 False
4 2 8 5 False
5 2 7 6 False
If any Value1 == 7 AND if any Value2 == 9 within a given Group, then I want to return True.
I tried the following, to no avail:
df['Expected_Output']= df.groupby('Group').Value1.isin(7) & df.groupby('Group').Value2.isin(9)
N.B.: Either True/False or 1/0 is fine as output.
Use groupby on the Group column, then transform with a lambda function:
g = df.groupby('Group')
df['Expected_Output'] = g['Value1'].transform(lambda x: x.eq(7).any()) & g['Value2'].transform(lambda x: x.eq(9).any())
Or using groupby, apply and merge with how='left':
df.merge(df.groupby('Group').apply(lambda x: x['Value1'].eq(7).any() & x['Value2'].eq(9).any()).reset_index(), how='left').rename(columns={0: 'Expected_Output'})
Or using groupby, apply and map:
df['Expected_Output'] = df['Group'].map(df.groupby('Group').apply(lambda x: x['Value1'].eq(7).any() & x['Value2'].eq(9).any()))
print(df)
Group Value1 Value2 Expected_Output
0 1 3 9 True
1 1 7 6 True
2 1 9 7 True
3 2 3 8 False
4 2 8 5 False
5 2 7 6 False
You can create a dataframe of the expected result by group and then merge it back to the original dataframe.
expected = (
    df.groupby('Group')
    .apply(lambda x: x['Value1'].eq(7).any()
                     & x['Value2'].eq(9).any())
    .to_frame('Expected_Output'))
>>> expected
Expected_Output
Group
1 True
2 False
>>> df.merge(expected, left_on='Group', right_index=True)
Group Value1 Value2 Expected_Output
0 1 3 9 True
1 1 7 6 True
2 1 9 7 True
3 2 3 8 False
4 2 8 5 False
5 2 7 6 False
I have the following dataframe:
df1 = pd.DataFrame({1:[1,2,3,4], 2:[1,2,4,5], 3:[8,1,5,6]})
df1
Out[7]:
1 2 3
0 1 1 8
1 2 2 1
2 3 4 5
3 4 5 6
and I would like to create a new column showing the distance of the last column holding a particular value (2 in this case) from the reference column (3 in this example), or NaN if no such value is found in the row. The output would be something like:
df1
Out[11]:
1 2 3 dist
0 1 1 8 NaN
1 2 2 1 1
2 3 4 5 NaN
3 4 5 6 NaN
What would be an effective way of accomplishing this task?
I think you need to subtract the column name of the last 2 from the reference column name 3 (the last column):
df1.columns = df1.columns.astype(int)
print((df1.columns.max() - df1.eq(2).iloc[:,::-1].idxmax(axis=1)).mask(lambda x: x == 0))
0 NaN
1 1.0
2 NaN
3 NaN
dtype: float64
Details:
Compare with 2:
print (df1.eq(2))
1 2 3
0 False False False
1 True True False
2 False False False
3 False False False
Invert the order of the columns:
print (df1.eq(2).iloc[:,::-1])
3 2 1
0 False False False
1 False True True
2 False False False
3 False False False
Get the column name of the first True (because the columns are inverted, this is actually the last one):
print (df1.eq(2).iloc[:,::-1].idxmax(axis=1))
0 3
1 2
2 3
3 3
dtype: int64
Subtract from the max column name; note this also returns 0 when the 2 is in the reference column itself or when no value matches:
print (df1.columns.max() - df1.eq(2).iloc[:,::-1].idxmax(1))
0 0
1 1
2 0
3 0
dtype: int64