How do you add an array to each previous row in pandas?

If I have an array [1, 2, 3, 4, 5] and a pandas DataFrame
df = pd.DataFrame([[1,1,1,1,1], [0,0,0,0,0], [0,0,0,0,0], [0,0,0,0,0]])
0 1 2 3 4
0 1 1 1 1 1
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
How do I iterate through the pandas DataFrame, adding my array to each previous row?
The expected result would be:
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16

The array is added n times to the nth row. You can build those multiples with np.arange(len(df))[:,None] * a and then add the first row on top:
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 0 0 0 0 0
#2 0 0 0 0 0
#3 0 0 0 0 0
a = np.array([1, 2, 3, 4, 5])
np.arange(len(df))[:,None] * a
#array([[ 0, 0, 0, 0, 0],
# [ 1, 2, 3, 4, 5],
# [ 2, 4, 6, 8, 10],
# [ 3, 6, 9, 12, 15]])
df[:] = df.iloc[0].values + np.arange(len(df))[:,None] * a
df
# 0 1 2 3 4
#0 1 1 1 1 1
#1 2 3 4 5 6
#2 3 5 7 9 11
#3 4 7 10 13 16
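The same multiples can also be built by cumulatively summing stacked copies of a; a quick sketch of the equivalence, starting from the original df and a as above:
import numpy as np

# len(df)-1 stacked copies of `a`, cumulatively summed, are the same
# multiples as np.arange(len(df))[:,None] * a (minus the zero row);
# adding them onto the first row reproduces the result above.
steps = np.cumsum(np.tile(a, (len(df) - 1, 1)), axis=0)
df.iloc[1:] = df.iloc[0].to_numpy() + steps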

Using a smaller 3x3 example:
df = pd.DataFrame([
[1,1,1],
[0,0,0],
[0,0,0],
])
s = pd.Series([1,2,3])
# add s to every row, restore the original first row, then cumulative-sum
result = df.add(s, axis=1)
result.iloc[0] = df.iloc[0]
result.cumsum()
Or if you want a one-liner:
pd.concat([df[:1], df[1:].add(s, axis=1)]).cumsum()
Either way, result:
0 1 2
0 1 1 1
1 2 3 4
2 3 5 7

Using cumsum and assignment, with a = [1, 2, 3, 4, 5] as before:
df[1:] = (df + a).cumsum()[:-1].values
0 1 2 3 4
0 1 1 1 1 1
1 2 3 4 5 6
2 3 5 7 9 11
3 4 7 10 13 16
Or using concat:
pd.concat((df[:1], (df + a).cumsum()[:-1]))
0 1 2 3 4
0 1 1 1 1 1
0 2 3 4 5 6
1 3 5 7 9 11
2 4 7 10 13 16
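Note that concat keeps each piece's original index (hence the repeated 0 above); chain .reset_index(drop=True) onto it if a clean RangeIndex matters:
pd.concat((df[:1], (df + a).cumsum()[:-1])).reset_index(drop=True)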

After cumsum, you can shift and add back to the original df:
a = [1,2,3,4,5]
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
df.add(updated)
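Spelled out end to end on the sample frame, as a minimal sketch (the result comes back as floats, since shift introduces NaNs):
import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1, 1]] + [[0, 0, 0, 0, 0]] * 3)
a = [1, 2, 3, 4, 5]
# (row + a) cumulated, shifted down one row, holes filled with 0,
# then added back onto the original frame.
updated = df.add(pd.Series(a), axis=1).cumsum().shift().fillna(0)
print(df.add(updated))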

Fill gaps between 1's in Pandas dataframe column with increment values that reset when next 1 is reached

Apparently this is a more complicated problem than I thought.
All I want to do is fill the zeros with +1 increments until the next 1.
My dataset is 1m+ rows, so I'm trying to vectorize this operation if possible.
Here's a sample column:
# Define the input dataframe
df = pd.DataFrame({'col': [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0]})
    col
0     1
1     0
2     1
3     0
4     1
5     1
6     0
7     0
8     0
9     0
10    1
11    0
12    1
13    1
14    0
Goal Result:
    col
0     1
1     2
2     1
3     2
4     1
5     1
6     2
7     3
8     4
9     5
10    1
11    2
12    1
13    1
14    2
I've tried a number of different methods with ffill() and cumsum(), but the issue with cumsum() tends to be that it doesn't reset the increment.
Group by the cumulative sum of column col and apply cumcount:
df['col'] = df.groupby(df['col'].cumsum())['col'].cumcount() + 1
col
0 1
1 2
2 1
3 2
4 1
5 1
6 2
7 3
8 4
9 5
10 1
11 2
12 1
13 1
14 2
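To see why this works, print the grouping key before the assignment above overwrites col (a minimal sketch on the same sample):
groups = df['col'].cumsum()
print(groups.tolist())
# [1, 1, 2, 2, 3, 4, 4, 4, 4, 4, 5, 5, 6, 7, 7]
# Each 1 opens a new group; cumcount() numbers the rows inside each
# group from 0, and the +1 shifts the count to start at 1.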
Temporarily replace each 0 with 1, group each real 1 together with the consecutive 0s that follow it (via the cumulative sum of the original column), then take the cumulative sum within each group:
df['col2'] = df['col'].replace(0, 1).groupby(df['col'].cumsum()).cumsum()
print(df)
# Output
col col2
0 1 1
1 0 2
2 1 1
3 0 2
4 1 1
5 1 1
6 0 2
7 0 3
8 0 4
9 0 5
10 1 1
11 0 2
12 1 1
13 1 1
14 0 2

df.loc behavior when assigning a dict to a column

Let's say we have a df like the one below:
df = pd.DataFrame({'A': [3, 9, 3, 4], 'B': [7, 1, 6, 0], 'C': [9, 0, 3, 4], 'D': [1, 8, 0, 0]})
Starting df:
A B C D
0 3 7 9 1
1 9 1 0 8
2 3 6 3 0
3 4 0 4 0
If we wanted to assign new values to column A, I would expect the following to work:
d = {0:10,1:20,2:30,3:40}
df.loc[:,'A'] = d
Output:
A B C D
0 0 7 9 1
1 1 1 0 8
2 2 6 3 0
3 3 0 4 0
The values that are assigned instead are the keys of the dictionary.
If, however, we assign the dictionary to a new column instead of an existing one, the first run produces the same (wrong) result, but running the same code a second time gives the expected values. From then on, assigning to any column produces the expected output.
First time running df.loc[:,'E'] = {0:10,1:20,2:30,3:40}
Output:
A B C D E
0 0 7 9 1 0
1 1 1 0 8 1
2 2 6 3 0 2
3 3 0 4 0 3
Second time running df.loc[:,'E'] = {0:10,1:20,2:30,3:40}
A B C D E
0 0 7 9 1 10
1 1 1 0 8 20
2 2 6 3 0 30
3 3 0 4 0 40
Then if we run the same code as we did at first, we get a different result:
df.loc[:,'A'] = {0:10,1:20,2:30,3:40}
Output:
A B C D E
0 10 7 9 1 10
1 20 1 0 8 20
2 30 6 3 0 30
3 40 0 4 0 40
Is this the intended behavior? (I am running pandas version 1.4.2)
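For what it's worth, wrapping the dict in a Series makes the index alignment explicit and sidesteps the ambiguity; a minimal sketch, not an explanation of the internals:
import pandas as pd

df = pd.DataFrame({'A': [3, 9, 3, 4], 'B': [7, 1, 6, 0], 'C': [9, 0, 3, 4], 'D': [1, 8, 0, 0]})
d = {0: 10, 1: 20, 2: 30, 3: 40}
# A Series aligns on the index, so the dict's values (not its keys)
# land in column A, even on the first run.
df.loc[:, 'A'] = pd.Series(d)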

Find the days lag and replace 0 with last day lag pandas

I have a df containing employee, worked_days, and sold columns.
Some employees sold only on the first day and then sold again five days later.
My data looks like this:
data = {'id':[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
'days':[1, 3, 3, 8, 8,8, 3, 8, 8, 9, 9, 12],
'sold':[1, 0, 1, 1, 1, 0, 0, 1, 1, 2, 0, 1]}
df = pd.DataFrame(data)
df['days_lag'] = df.groupby('id')['days'].diff().fillna(0).astype('int16')
Gives me this
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 0
3 1 8 1 5
4 1 8 1 0
5 1 8 0 0
6 2 3 0 0
7 2 8 1 5
8 2 8 1 0
9 2 9 2 1
10 2 9 0 0
11 2 12 1 3
I want the results to be like below
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 2
3 1 8 1 5
4 1 8 1 5
5 1 8 0 5
6 2 3 0 0
7 2 8 1 5
8 2 8 1 5
9 2 9 2 1
10 2 9 0 1
11 2 12 1 3
How can I achieve this?
Thanks
Use GroupBy.transform:
In [92]: df['days_lag'] = df.groupby('id')['days'].diff().fillna(0).astype('int16')
In [96]: df['days_lag'] = df.groupby(['id', 'days'])['days_lag'].transform('max')
In [97]: df
Out[97]:
id days sold days_lag
0 1 1 1 0
1 1 3 0 2
2 1 3 1 2
3 1 8 1 5
4 1 8 1 5
5 1 8 0 5
6 2 3 0 0
7 2 8 1 5
8 2 8 1 5
9 2 9 2 1
10 2 9 0 1
11 2 12 1 3
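The two steps can also be fused into a single chain; a sketch on the same df, grouping the intermediate diff by both id and days:
df['days_lag'] = (
    df.groupby('id')['days'].diff().fillna(0).astype('int16')
      .groupby([df['id'], df['days']])
      .transform('max')
)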

Numpy Array to Pandas Data Frame of X Y Coordinates

I have a two dimensional numpy array:
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
How would I go about converting this into a pandas DataFrame that has the x coordinate, y coordinate, and the corresponding array value at that index, like this:
x y val
0 0 1
0 1 4
0 2 7
1 0 2
1 1 5
1 2 8
...
With stack and reset index:
df = pd.DataFrame(arr).stack().rename_axis(['y', 'x']).reset_index(name='val')
df
Out:
y x val
0 0 0 1
1 0 1 2
2 0 2 3
3 1 0 4
4 1 1 5
5 1 2 6
6 2 0 7
7 2 1 8
8 2 2 9
If ordering is important:
df.sort_values(['x', 'y'])[['x', 'y', 'val']].reset_index(drop=True)
Out:
x y val
0 0 0 1
1 0 1 4
2 0 2 7
3 1 0 2
4 1 1 5
5 1 2 8
6 2 0 3
7 2 1 6
8 2 2 9
Here's a NumPy method -
>>> arr
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
>>> shp = arr.shape
>>> r,c = np.indices(shp)
>>> pd.DataFrame(np.c_[r.ravel(), c.ravel(), arr.ravel('F')],
...              columns=['x', 'y', 'val'])
x y val
0 0 0 1
1 0 1 4
2 0 2 7
3 1 0 2
4 1 1 5
5 1 2 8
6 2 0 3
7 2 1 6
8 2 2 9
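For reference, melt can produce the same long format; a sketch under the same convention as the stack answer (y = row index, x = column index):
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = (pd.DataFrame(arr)
        .rename_axis('y')      # name the row index so melt can keep it
        .reset_index()
        .melt(id_vars='y', var_name='x', value_name='val')
        .sort_values(['x', 'y'])
        .reset_index(drop=True)[['x', 'y', 'val']])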

Pandas dataframe: how to group by values in a column and create new columns out of grouped values

I have a dataframe with two columns:
x y
0 1
1 1
2 2
0 5
1 6
2 8
0 1
1 8
2 4
0 1
1 7
2 3
What I want is:
x val1 val2 val3 val4
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
I know that the values in column x are repeated, in that order, N times.
You could use groupby/cumcount to assign column numbers and then call pivot:
import pandas as pd
df = pd.DataFrame({'x': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
'y': [1, 1, 2, 5, 6, 8, 1, 8, 4, 1, 7, 3]})
df['columns'] = df.groupby('x')['y'].cumcount()
# x y columns
# 0 0 1 0
# 1 1 1 0
# 2 2 2 0
# 3 0 5 1
# 4 1 6 1
# 5 2 8 1
# 6 0 1 2
# 7 1 8 2
# 8 2 4 2
# 9 0 1 3
# 10 1 7 3
# 11 2 3 3
result = df.pivot(index='x', columns='columns')
print(result)
yields
y
columns 0 1 2 3
x
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Or, if you can really rely on the values in x being repeated in order N times,
N = 3
result = pd.DataFrame(df['y'].values.reshape(-1, N).T)
yields
0 1 2 3
0 1 5 1 1
1 1 6 8 7
2 2 8 4 3
Using reshape is quicker than calling groupby/cumcount and pivot, but it is less robust, since it relies on the values in y appearing in the right order.
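If you go the reshape route, a cheap sanity check on that assumption may be worth it; a sketch on the df from above, where the assertion just encodes the "repeated in order" requirement:
import numpy as np

N = df['x'].nunique()
# Verify that x really cycles 0..N-1 in order before trusting reshape.
assert (df['x'].to_numpy() == np.tile(np.arange(N), len(df) // N)).all()
result = pd.DataFrame(df['y'].to_numpy().reshape(-1, N).T)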
