In the following pandas dataframe, I want to replace each "-1" value with the value of the previous row. This is the original df:
position
0 0
1 -1
2 1
3 1
4 -1
5 0
And I want to transform it in:
position
0 0
1 0
2 1
3 1
4 1
5 0
I'm doing it in the following way, but I think there should be a faster way, probably by vectorizing it (although I wasn't able to manage that).
for i, row in self.df.iterrows():
    if row["position"] == -1:
        self.df.loc[i, "position"] = self.df.loc[i - 1, "position"]
The code works, but it seems slow. Is there any way to speed it up?
Use replace + ffill:
df.replace(-1, np.nan).ffill()
position
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 0.0
Replace will convert each -1 to NaN. ffill will then replace each NaN with the value just above it.
Use .astype for an integer result:
df.replace(-1, np.nan).ffill().astype(int)
position
0 0
1 0
2 1
3 1
4 1
5 0
Don't forget to assign the result back. You can perform the same operation on just the position column if need be:
df['position'] = df['position'].replace(-1, np.nan).ffill().astype(int)
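An equivalent spelling uses mask, which sets values matching a condition to NaN directly; a minimal sketch with the same behaviour as the replace version above (like that version, the astype(int) step assumes the first row is not -1, since a leading NaN cannot be cast to int):

# mask() turns every -1 into NaN, ffill() carries the previous value forward
df['position'] = df['position'].mask(df['position'].eq(-1)).ffill().astype(int)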
Solution using np.where:
c = df['position']
df['position'] = np.where(c == -1, c.shift(), c)
df
position
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 0.0
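One caveat with the np.where/shift approach: it only looks one row back, so a run of consecutive -1 values is not fully filled, while replace + ffill handles it. A small sketch illustrating the difference (the data here is made up for the demonstration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'position': [0, -1, -1, 1]})
c = df['position']
print(np.where(c == -1, c.shift(), c))
# [ 0.  0. -1.  1.]  -- the second -1 survives, because its previous row was also -1
print(c.replace(-1, np.nan).ffill().tolist())
# [0.0, 0.0, 0.0, 1.0]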
Suppose I have the following dataframe:
A B C D Count
0 0 0 0 0 12.0
1 0 0 0 1 2.0
2 0 0 1 0 4.0
3 0 0 1 1 0.0
4 0 1 0 0 3.0
5 0 1 1 0 0.0
6 1 0 0 0 7.0
7 1 0 0 1 9.0
8 1 0 1 0 0.0
... (truncated for readability)
And an array: [1, 0, 0, 1]
I would like to access the Count value given a value for each of the other columns. In this case, that would be row 7, with Count = 9.0.
I can use iloc or at by deconstructing each value in the array, but that seems inefficient. I'm wondering if there's a way to map the values in the array to the matching row directly.
You can index the DataFrame with a list of the key column names and compare the resulting view to the array, using NumPy broadcasting to do it for each line at once. Then collapse the resulting Boolean DataFrame to a Boolean row index with all() and use that to index the Count column.
If df is the DataFrame and a is the array (or a list):
df.Count.loc[(df[list('ABCD')] == a).all(axis=1)]
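For completeness, a runnable sketch using the rows shown above (the frame is truncated in the question, so this only rebuilds the visible part):

import pandas as pd

df = pd.DataFrame(
    [[0, 0, 0, 0, 12.0], [0, 0, 0, 1, 2.0], [0, 0, 1, 0, 4.0],
     [0, 0, 1, 1, 0.0], [0, 1, 0, 0, 3.0], [0, 1, 1, 0, 0.0],
     [1, 0, 0, 0, 7.0], [1, 0, 0, 1, 9.0], [1, 0, 1, 0, 0.0]],
    columns=['A', 'B', 'C', 'D', 'Count'])
a = [1, 0, 0, 1]

# Compare the key columns to the array row-wise (broadcasting), reduce to
# a Boolean row mask with all(), and use it to index the Count column.
print(df.Count.loc[(df[list('ABCD')] == a).all(axis=1)])
# 7    9.0
# Name: Count, dtype: float64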
You can try with tuple:
out = df.loc[df[list('ABCD')].apply(tuple, axis=1) == (1, 0, 0, 1), 'Count']
Out[333]:
7 9.0
Name: Count, dtype: float64
I just used the .loc command, and searched for the multiple conditions like this:
f = [1, 0, 0, 1]
result = df['Count'].loc[(df['A'] == f[0]) &
                         (df['B'] == f[1]) &
                         (df['C'] == f[2]) &
                         (df['D'] == f[3])].values
print(result)
OUTPUT:
[9.]
However, I like Arne's answer better :)
I am looking to add a column to a pandas dataframe that counts consecutive positive numbers and resets the counter on finding a negative. I could probably loop through it with a 'for' statement, but I just know there is a better solution. I have looked at various similar posts that ask almost the same thing, but I cannot get those solutions to work on my problem.
I have:
Slope
-25
-15
17
6
0.1
5
-3
5
1
3
-0.1
-0.2
1
-9
What I want:
Slope Count
-25 0
-15 0
17 1
6 2
0.1 3
5 4
-3 0
5 1
1 2
3 3
-0.1 0
-0.2 0
1 1
-9 0
Please keep in mind that this is a low-skill-level question. If there are multiple steps in your proposed solution, please explain each one. I would like an answer, but I would prefer to understand the 'how'.
You first want to mark the positions where new segments (i.e., groups) start:
>>> df['Count'] = df.Slope.lt(0)
>>> df.head(7)
Slope Count
0 -25.0 True
1 -15.0 True
2 17.0 False
3 6.0 False
4 0.1 False
5 5.0 False
6 -3.0 True
Now you need to label each group using the cumulative sum: since True is evaluated as 1 in arithmetic operations, the cumulative sum will label each segment with an incrementing integer. (This is a very powerful concept in pandas!)
>>> df['Count'] = df.Count.cumsum()
>>> df.head(7)
Slope Count
0 -25.0 1
1 -15.0 2
2 17.0 2
3 6.0 2
4 0.1 2
5 5.0 2
6 -3.0 3
Now you can use groupby to access each segment, and all you need to do is generate an incrementing sequence starting at zero for each group. There are many ways to do that; I'd just use the reset index of each group, i.e., reset the index, take the fresh RangeIndex starting at 0, and turn it into a series:
>>> df.groupby('Count').apply(lambda x: x.reset_index().index.to_series())
Count
1 0 0
2 0 0
1 1
2 2
3 3
4 4
3 0 0
1 1
2 2
3 3
4 0 0
5 0 0
1 1
6 0 0
This results in the expected counts, but note that the final index doesn't match the original dataframe, so we need another reset_index() with drop=True to discard the grouped index before putting this into our original dataframe:
>>> df['Count'] = df.groupby('Count').apply(lambda x:x.reset_index().index.to_series()).reset_index(drop=True)
Et voilà:
>>> df
Slope Count
0 -25.0 0
1 -15.0 0
2 17.0 1
3 6.0 2
4 0.1 3
5 5.0 4
6 -3.0 0
7 5.0 1
8 1.0 2
9 3.0 3
10 -0.1 0
11 -0.2 0
12 1.0 1
13 -9.0 0
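For reference, the same grouping idea can be written more compactly with groupby.cumcount. A sketch, with one assumption to note: like the sample data, the frame must start with a negative value, otherwise a leading run of positives would be numbered from 0 instead of 1:

reset = df.Slope.lt(0)
# Each negative value starts a new group; within a group, cumcount()
# numbers the rows 0, 1, 2, ..., so the negative row itself gets 0 and
# the positives after it count up from 1.
df['Count'] = df.groupby(reset.cumsum()).cumcount()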
We can solve the problem by looping through all the rows and using the loc feature in pandas, assuming that you already have a dataframe named df with a column called slope. The idea is that we sequentially add one to the previous row's count, but whenever slope_i < 0 the result is multiplied by 0, which resets the counter.
df['new_col'] = 0  # preset everything to zero
for i in range(1, len(df)):
    # (df.loc[i, 'slope'] >= 0) evaluates to 1 for non-negative slopes
    # and 0 otherwise, so the running count resets on a negative slope
    df.loc[i, 'new_col'] = (df.loc[i-1, 'new_col'] + 1) * (df.loc[i, 'slope'] >= 0)
You can do this using the groupby command. It takes a few steps, which could probably be shortened, but it works.
First, you create a reset column by finding negative numbers:
# create reset condition
df['reset'] = df.slope.lt(0)
Then you create groups by applying cumsum() to these resets; at this point every run of positives gets a unique group value. The last line here assigns all negative numbers to group 0:
# create groups of positive values
df['group'] = df.reset.cumsum()
df.loc[df['reset'], 'group'] = 0
Now you take the groups of positives and cumsum a column of ones (there MUST be a better solution than that) to get your result. The last line again cleans up the results for the negative values:
# sum ones :-D
df['count'] = 1
df['count'] = df.groupby('group')['count'].cumsum()
df.loc[df['reset'], 'count'] = 0
It is not a neat one-liner, but especially for larger datasets it should be faster than iterating through the whole dataframe.
For easier copy & paste, here is the whole thing (including some commented lines which replace the lines before them; that makes it shorter but harder to understand):
import pandas as pd
## create data
slope = [-25, -15, 17, 6, 0.1, 5, -3, 5, 1, 3, -0.1, -0.2, 1, -9]
df = pd.DataFrame(data=slope, columns=['slope'])
## create reset condition
df['reset'] = df.slope.lt(0)
## create groups of positive values
df['group'] = df.reset.cumsum()
df.loc[df['reset'], 'group'] = 0
# df['group'] = df.reset.cumsum().mask(df.reset, 0)
## sum ones :-D
df['count'] = 1
df['count'] = df.groupby('group')['count'].cumsum()
df.loc[df['reset'], 'count'] = 0
# df['count'] = df.groupby('group')['count'].cumsum().mask(df.reset, 0)
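Running the script on the sample data should reproduce the counts requested in the question:

print(df['count'].tolist())
# [0, 0, 1, 2, 3, 4, 0, 1, 2, 3, 0, 0, 1, 0]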
IMO, solving this problem iteratively is the only way, because there is a condition that has to be met at each step. You can use any iterative construct, like for or while. Solving this problem with map would be troublesome, since each element depends on the previous, already-modified element.
I want to get the delta between consecutive values in a dataframe. The data is supposed to be monotonically increasing, but it sometimes gets reset. To account for that, when the value of the diff is negative I want to keep the previous delta instead:
Let's say I've got a dataframe like:
val:
1
2
3
1
2
4
then after a diff().fillna(0) operation I get:
val:
0
1
1
-2
1
2
But I would like to:
val:
0
1
1
1
1
2
Any easy way of doing it?
You could take the diff, then use where with ffill to replace those negatives with the previous delta:
d = df['val'].diff()
d.where(d.gt(0)).ffill().fillna(0)
0    0.0
1    1.0
2    1.0
3    1.0
4    1.0
5    2.0
Name: val, dtype: float64
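As a self-contained sketch (the 'delta' column name here is just for illustration):

import pandas as pd

df = pd.DataFrame({'val': [1, 2, 3, 1, 2, 4]})
d = df['val'].diff()
# Keep positive diffs, carry the previous delta over resets, and fill the
# first row (which has no previous value) with 0 as in the question.
df['delta'] = d.where(d.gt(0)).ffill().fillna(0)
print(df['delta'].tolist())
# [0.0, 1.0, 1.0, 1.0, 1.0, 2.0]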
Assume that we have the following pandas dataframe:
df = pd.DataFrame({'x':[0,0,1,0,0,0,0],'y':[1,1,1,1,1,1,0],'z':[0,1,1,1,0,0,1]})
x y z
0 0 1 0
1 0 1 1
2 1 1 1
3 0 1 1
4 0 1 0
5 0 1 0
6 0 0 1
The whole dataframe is filled with either 1 or 0. Looking at each column separately, if the current row's value differs from the previous value, I need to count the number of previous consecutive values:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6  6  2
I tried to write a lambda function and apply it to the entire dataframe, but I failed. Any ideas?
Let's try this:
def f(col):
    x = col != col.shift().bfill()   # True where the value changes
    s = x.cumsum()                   # label each run of equal values
    return s.groupby(s).transform('count').shift().where(x)

df.apply(f).fillna('')
Output:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6  6  2
Details:
Use apply to run a custom function on each column of the dataframe.
Find the change spots in the column, then use cumsum to label the groups of consecutive values, then use groupby and transform to attach each run's length to every record, then shift the counts down one row and keep them only at the change spots using where.
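To make those steps concrete, here are the intermediates for column z (a walk-through of the function above):

import pandas as pd

col = pd.Series([0, 1, 1, 1, 0, 0, 1], name='z')  # column z from the frame above
x = col != col.shift().bfill()             # change spots: rows 1, 4, 6
s = x.cumsum()                             # run labels: 0, 1, 1, 1, 2, 2, 3
counts = s.groupby(s).transform('count')   # run lengths: 1, 3, 3, 3, 2, 2, 1
print(counts.shift().where(x).tolist())
# [nan, 1.0, nan, nan, 3.0, nan, 2.0]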
You can try the following, where you identify the "runs" first and compute the run lengths. An entry is only made where the value switches, so what gets written are the lengths of the runs except the last one.
import pandas as pd
import numpy as np
def func(x, missing=np.nan):
    runs = np.cumsum(np.append(0, np.diff(x) != 0))   # label each run
    switches = np.where(np.diff(x) != 0)[0] + 1       # rows where the value changes
    out = np.repeat(missing, len(x))
    out[switches] = np.bincount(runs)[:-1]            # run lengths, except the last
    # thanks to Scott, see comments below:
    # out[switches] = pd.value_counts(runs, sort=False)[:-1]
    return out
df.apply(func)
x y z
0 NaN NaN NaN
1 NaN NaN 1.0
2 2.0 NaN NaN
3 1.0 NaN NaN
4 NaN NaN 3.0
5 NaN NaN NaN
6 NaN 6.0 2.0
It might be faster with a good implementation of run-length encoding, but I am not too familiar with that in Python.
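For the curious, a run-length encoder is short to write in NumPy. A sketch (the rle helper is hypothetical, not a library function); its lengths output is exactly what func places at the switch positions:

import numpy as np

def rle(a):
    # Run-length encode a 1-D array: return (values, lengths) of each run.
    a = np.asarray(a)
    starts = np.concatenate(([0], np.flatnonzero(a[1:] != a[:-1]) + 1))
    lengths = np.diff(np.concatenate((starts, [len(a)])))
    return a[starts], lengths

values, lengths = rle([0, 1, 1, 1, 0, 0, 1])  # column z from above
print(values, lengths)  # [0 1 0 1] [1 3 2 1]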
In my dataframe I want to know whether the ordonnee value is decreasing, increasing, or not changing compared with the preceding value (the row before), grouped by the temps column.
I already tried the method from this post:
stackoverflow post
And I tried groupby, but it is not working. Do you have any ideas?
entry = pd.DataFrame([['1',0,0],['1',1,1],['1',2,1],['1',3,1],['1',3,-2],['2',1,2],['2',1,3]],columns=['temps','abcisse','ordonnee'])
output = pd.DataFrame([['1',0,0,'--'],['1',1,1,'increase'],['1',2,1,'--'],['1',3,1,'--'],['1',3,-2,'decrease'],['2',1,2,'--'],['2',1,3,'increase']],columns=['temps','abcisse','ordonnee','variation'])
Use
In [5537]: s = entry.groupby('temps').ordonnee.diff().fillna(0)
In [5538]: entry['variation'] = np.where(s.eq(0), '--',
np.where(s.gt(0), 'increase',
'decrease'))
In [5539]: entry
Out[5539]:
temps abcisse ordonnee variation
0 1 0 0 --
1 1 1 1 increase
2 1 2 1 --
3 1 3 1 --
4 1 3 -2 decrease
5 2 1 2 --
6 2 1 3 increase
Also, as pointed out in jezrael's comment, you can use np.select instead of np.where:
In [5549]: entry['variation'] = np.select([s>0, s<0], ['increase', 'decrease'],
default='--')
Details
In [5541]: s
Out[5541]:
0 0.0
1 1.0
2 0.0
3 0.0
4 -3.0
5 0.0
6 1.0
Name: ordonnee, dtype: float64
Use np.where with a groupby transform, i.e.:
entry['new'] = entry.groupby(['temps'])['ordonnee'].transform(
    lambda x: np.where(x.diff() > 0, 'increase',
              np.where(x.diff() < 0, 'decrease', '--')))
Output :
temps abcisse ordonnee new
0 1 0 0 --
1 1 1 1 increase
2 1 2 1 --
3 1 3 1 --
4 1 3 -2 decrease
5 2 1 2 --
6 2 1 3 increase