I have the following df; the last column is the desired output. Thanks!
group date value desired_first_nonzero
1 jan2019 0 2
1 jan2019 2 2
1 feb2019 3 2
1 mar2019 4 2
1 mar2019 5 2
2 feb2019 0 4
2 feb2019 0 4
2 mar2019 0 4
2 mar2019 4 4
2 apr2019 5 4
I want to group by "group" and find the first non-zero value
You can use GroupBy.transform with a custom function that gets the first non-zero value via idxmax (which returns the index of the first True value here):
df['desired_first_nonzero'] = (df.groupby('group')['value']
.transform(lambda s: s[s.ne(0).idxmax()])
)
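To see what the lambda does on a single group, here is a minimal illustration using group 1's values from the example:
import pandas as pd

s = pd.Series([0, 2, 3, 4, 5])  # group 1's values
mask = s.ne(0)                  # [False, True, True, True, True]
print(mask.idxmax())            # 1 -> label of the first True, i.e. the first non-zero
print(s[mask.idxmax()])         # 2 -> the first non-zero value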
Alternatively, using an intermediate Series:
s = df.set_index('group')['value']
df['desired_first_nonzero'] = df['group'].map(s[s.ne(0)].groupby(level=0).first())
Output:
group date value desired_first_nonzero
0 1 jan2019 0 2
1 1 jan2019 2 2
2 1 feb2019 3 2
3 1 mar2019 4 2
4 1 mar2019 5 2
5 2 feb2019 0 4
6 2 feb2019 0 4
7 2 mar2019 0 4
8 2 mar2019 4 4
9 2 apr2019 5 4
This should do the job:
# the given example
import pandas as pd

d = {'group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'value': [0, 2, 3, 4, 5, 0, 0, 0, 4, 5]}
df = pd.DataFrame(data=d)
# head(1) already returns a DataFrame, so no extra wrapping is needed
first_non_zero = df[df['value'] != 0].groupby('group').head(1)
print(first_non_zero)
Output:
group value
1 1 2
8 2 4
Then you can distribute those values back to each group's rows as needed, for example with a map on the group column (see the sketch below).
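One possible way to broadcast those per-group values back onto every row (a sketch using Series.map on the frame built above):
mapping = first_non_zero.set_index('group')['value']
df['desired_first_nonzero'] = df['group'].map(mapping)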
Related
I have a pandas dataframe that represents the trips I have taken for work. Each row is a single trip, with a column for the date and the number of kilometers traveled.
I get reimbursed on a per kilometer basis for every trip besides the first and the last of each day (these are considered ordinary travel to and from work).
So my data frame looks something like this:
day, distance
1, 5
1, 2
1, 7
2, 11
2, 11
3, 4
3, 10
3, 5
3, 12
I would like to add a column in here that flags all but the first and last trips of the day. Such as:
day, distance, claimable
1, 5, 0
1, 2, 1
1, 7, 0
2, 11, 0
2, 11, 0
3, 4, 0
3, 10, 1
3, 5, 1
3, 12, 0
Assuming I have a dataframe with the columns above, is there a way to do something like this:
import pandas as pd
df = pd.DataFrame({'day':(1,1,1,2,2,3,3,3,3),
'dist':(5,2,7,11,11,4,10,5,12),
},)
df['claim'] = 0
# set the value of the "claimable" column to 1 on all
# but the first and last trip of the day
df.groupby("day").nth(slice(1,-1)).loc[:, "claim"] = 1
You can use transform to flag the first and last positions:
g = df.reset_index().groupby('day')['index']
con = (df.index == g.transform('first')) | (df.index == g.transform('last'))
df['new'] = (~con).astype(int)
df
Out[117]:
day dist new
0 1 5 0
1 1 2 1
2 1 7 0
3 2 11 0
4 2 11 0
5 3 4 0
6 3 10 1
7 3 5 1
8 3 12 0
If your dataframe is sorted by the day column:
df['claim'] = (df['day'].eq(df['day'].shift()) &
df['day'].eq(df['day'].shift(-1))).astype(int)
Diff version (credit: @Rodalm):
(df['day'].diff().eq(0) & df['day'].diff(-1).eq(0)).astype(int)
Or use GroupBy.cumcount, then compare with 0 and with the next value:
c = df.groupby('day').cumcount()
df['claim'] = (c.ne(0) & c.lt(c.shift(-1))).astype(int)
#df['claim'] = (c.gt(c.shift()) & c.lt(c.shift(-1))).astype(int)
print(df)
day dist claim
0 1 5 0
1 1 2 1
2 1 7 0
3 2 11 0
4 2 11 0
5 3 4 0
6 3 10 1
7 3 5 1
8 3 12 0
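For reference, this is the cumcount helper on the example data: the first trip of each day is 0, and the last trip is where the next cumcount does not increase (it resets to 0, or is NaN at the very end):
print(df.groupby('day').cumcount().tolist())
# [0, 1, 2, 0, 1, 0, 1, 2, 3]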
You can use transform:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'day': (1, 1, 1, 2, 2, 3, 3, 3, 3),
    'dist': (5, 2, 7, 11, 11, 4, 10, 5, 12),
})
def is_claimable(group):
    # flag every row, then clear the first and last row of the group
    claim = np.ones(len(group), dtype='int8')
    claim[[0, -1]] = 0
    return claim
df['claim'] = df.groupby("day")['dist'].transform(is_claimable)
Output:
>>> df
day dist claim
0 1 5 0
1 1 2 1
2 1 7 0
3 2 11 0
4 2 11 0
5 3 4 0
6 3 10 1
7 3 5 1
8 3 12 0
One option is to pivot and use first_valid_index and last_valid_index; of course this will fail if there are duplicates in the combination of index and columns:
positions = df.pivot(columns='day', values='distance')
first = positions.apply(pd.Series.first_valid_index)
last = positions.apply(pd.Series.last_valid_index)
endpoints = np.concatenate([first.to_numpy(), last.to_numpy()])
df.assign(claimable = np.where(df.index.isin(endpoints), 0, 1))
day distance claimable
0 1 5 0
1 1 2 1
2 1 7 0
3 2 11 0
4 2 11 0
5 3 4 0
6 3 10 1
7 3 5 1
8 3 12 0
Using transform, as in the other answers, avoids this duplicate problem; a cumcount-based sketch follows.
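For instance, a duplicate-safe sketch using cumcount and a per-group size (assuming the day/distance columns from the question):
c = df.groupby('day').cumcount()
n = df.groupby('day')['distance'].transform('size')
df['claimable'] = ((c > 0) & (c < n - 1)).astype(int)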
Suppose the dataframe:
df = pd.DataFrame({
"id": [1, 1, 1, 2, 2, 3],
"day": [1, 2, 3, 1, 3, 2],
"value": [1, 2, 3, 4, 5, 6],
})
I need all ids to have the same set of days. How to add rows with missing days?
IIUC
df = df.pivot(index='id', columns='day', values='value').fillna(0).stack().reset_index()
id day 0
0 1 1 1.0
1 1 2 2.0
2 1 3 3.0
3 2 1 4.0
4 2 2 0.0
5 2 3 5.0
6 3 1 0.0
7 3 2 6.0
8 3 3 0.0
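Note that the stacked value column comes back named 0, and as float because the pivot introduced NaN; a possible cleanup (a sketch):
df = df.rename(columns={0: 'value'}).astype({'value': int})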
For solutions where multiple columns should be filled with 0 for the new missing rows, use DataFrame.unstack with DataFrame.stack, working on a MultiIndex created by DataFrame.set_index; finally, convert back to columns with DataFrame.reset_index:
df = df.set_index(['id','day']).unstack(fill_value=0).stack().reset_index()
print (df)
id day value
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 4
4 2 2 0
5 2 3 5
6 3 1 0
7 3 2 6
8 3 3 0
Another solution uses DataFrame.reindex with a MultiIndex.from_product built from the index created by DataFrame.set_index; finally, convert back to columns with DataFrame.reset_index:
df = df.set_index(['id','day'])
m = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
df = df.reindex(m, fill_value=0).reset_index()
print (df)
id day value
0 1 1 1
1 1 2 2
2 1 3 3
3 2 1 4
4 2 2 0
5 2 3 5
6 3 1 0
7 3 2 6
8 3 3 0
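If some day is missing from every id, it will not appear in df.index.levels at all; in that case, build the product from an explicit range instead (a sketch replacing the m = ... line above, assuming days should run from 1 to 3):
m = pd.MultiIndex.from_product([df.index.levels[0], range(1, 4)],
                               names=df.index.names)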
If I have an array [1, 2, 3, 4, 5] and a Pandas DataFrame
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I iterate through the Pandas DataFrame adding my array to each previous row?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
The array is added n times to the nth row; you can build those increments with np.arange(len(df))[:, None] * a and then add the first row:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
import numpy as np

a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
import pandas as pd

df = pd.DataFrame([
[1,1,1],
[0,0,0],
[0,0,0],
])
s = pd.Series([1,2,3])
# add s to every row, restore the first row, then take the cumulative sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7
Using cumsum and assignment:
l = [1, 2, 3, 4, 5]  # the array to add
df[1:] = (df + l).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df+l).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
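Note that the concat version keeps the index labels of the shifted frame (hence the 0, 0, 1, 2 above); to get a clean RangeIndex (a sketch):
pd.concat((df[:1], (df + l).cumsum()[:-1])).reset_index(drop=True)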
After cumsum, you can shift and add back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)
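For the example frame this yields (float dtype, since shift introduces NaN before the fillna):
     0    1     2     3     4
0  1.0  1.0   1.0   1.0   1.0
1  2.0  3.0   4.0   5.0   6.0
2  3.0  5.0   7.0   9.0  11.0
3  4.0  7.0  10.0  13.0  16.0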
I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'crit_1': [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
                   'crit_2': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'a', 'a', 'a'],
                   'value': [3, 4, 3, 5, 1, 2, 4, 6, 2, 3]},
                  columns=['id', 'crit_1', 'crit_2', 'value'])
df
Out[41]:
id crit_1 crit_2 value
0 1 0 a 3
1 1 0 a 4
2 1 1 b 3
3 1 0 b 5
4 2 0 a 1
5 2 0 b 2
6 2 1 a 4
7 3 0 a 6
8 3 0 a 2
9 3 1 a 3
I pull a subset out of this frame based on crit_1
df_subset = df[(df['crit_1']==1)]
Then I perform a complex operation (the nature of which is unimportant for this question) on that subset, producing a new column:
df_subset['some_new_val'] = [1, 4, 2]
df_subset
Out[42]:
id crit_1 crit_2 value some_new_val
2 1 1 b 3 1
6 2 1 a 4 4
9 3 1 a 3 2
Now, I want to add some_new_val back into my original dataframe, onto the column value. However, I only want to add it where there is a match on id and crit_2.
The result should look like this
id crit_1 crit_2 value new_value
0 1 0 a 3 3
1 1 0 a 4 4
2 1 1 b 3 4
3 1 0 b 5 6
4 2 0 a 1 1
5 2 0 b 2 6
6 2 1 a 4 4
7 3 0 a 6 8
8 3 0 a 2 4
9 3 1 a 3 5
You can use merge with a left join and then add:
# keep only the columns needed for the join and the appended value
cols = ['id','crit_2', 'some_new_val']
df = pd.merge(df, df_subset[cols], on=['id','crit_2'], how='left')
print (df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 NaN
1 1 0 a 4 NaN
2 1 1 b 3 1.0
3 1 0 b 5 1.0
4 2 0 a 1 4.0
5 2 0 b 2 NaN
6 2 1 a 4 4.0
7 3 0 a 6 2.0
8 3 0 a 2 2.0
9 3 1 a 3 2.0
df['some_new_val'] = df['some_new_val'].add(df['value'], fill_value=0)
print (df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 3.0
1 1 0 a 4 4.0
2 1 1 b 3 4.0
3 1 0 b 5 6.0
4 2 0 a 1 5.0
5 2 0 b 2 2.0
6 2 1 a 4 8.0
7 3 0 a 6 8.0
8 3 0 a 2 4.0
9 3 1 a 3 5.0
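An equivalent one-step variant (a sketch, assuming df and df_subset as defined in the question): fill the non-matches with 0 and add value directly.
out = df.merge(df_subset[['id', 'crit_2', 'some_new_val']],
               on=['id', 'crit_2'], how='left')
out['some_new_val'] = out['some_new_val'].fillna(0) + out['value']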
I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the values in column x repeat in the same order N times.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Or, if you can really rely on the values in x being repeated in order N times,
N = 3
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it
is less robust since it relies on the values in y appearing in the right order.
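If you go the reshape route, a quick sanity check (a sketch) can confirm that assumption before trusting the result:
import numpy as np

N = 3
x = df['x'].to_numpy()
# every block of N values in x must equal the first block, in order
assert len(x) % N == 0 and (x.reshape(-1, N) == x[:N]).all()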