I am looking to add a column to a pandas dataframe that counts consecutive positive numbers and resets the counter on finding a negative. I could probably loop through it with a 'for' statement, but I just know there is a better solution. I have looked at various similar posts that ask almost the same thing, but I just cannot get those solutions to work on my problem.
I have:
Slope
-25
-15
17
6
0.1
5
-3
5
1
3
-0.1
-0.2
1
-9
What I want:
Slope Count
-25 0
-15 0
17 1
6 2
0.1 3
5 4
-3 0
5 1
1 2
3 3
-0.1 0
-0.2 0
1 1
-9 0
Please keep in mind that this is a low-skill-level question. If there are multiple steps in your proposed solution, please explain each. I would like an answer, but I would prefer to understand the 'how'.
You first want to mark the positions where new segments (i.e., groups) start:
>>> df['Count'] = df.Slope.lt(0)
>>> df.head(7)
Slope Count
0 -25.0 True
1 -15.0 True
2 17.0 False
3 6.0 False
4 0.1 False
5 5.0 False
6 -3.0 True
Now you need to label each group using the cumulative sum: since True evaluates to 1 in arithmetic operations, the cumulative sum labels each segment with an incrementing integer. (This is a very powerful concept in pandas!)
>>> df['Count'] = df.Count.cumsum()
>>> df.head(7)
Slope Count
0 -25.0 1
1 -15.0 2
2 17.0 2
3 6.0 2
4 0.1 2
5 5.0 2
6 -3.0 3
Now you can use groupby to access each segment; then all you need to do is generate an incrementing sequence starting at zero for each group. There are many ways to do that; I'd just use the reset index of each group, i.e., reset the index, take the fresh RangeIndex starting at 0, and turn it into a series:
>>> df.groupby('Count').apply(lambda x: x.reset_index().index.to_series())
Count
1 0 0
2 0 0
1 1
2 2
3 3
4 4
3 0 0
1 1
2 2
3 3
4 0 0
5 0 0
1 1
6 0 0
This results in the expected counts, but note that the resulting index doesn't match the original dataframe, so we need another reset_index() with drop=True to discard the group level before putting this back into our original dataframe:
>>> df['Count'] = df.groupby('Count').apply(lambda x:x.reset_index().index.to_series()).reset_index(drop=True)
Et voilà:
>>> df
Slope Count
0 -25.0 0
1 -15.0 0
2 17.0 1
3 6.0 2
4 0.1 3
5 5.0 4
6 -3.0 0
7 5.0 1
8 1.0 2
9 3.0 3
10 -0.1 0
11 -0.2 0
12 1.0 1
13 -9.0 0
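For reference, the three steps above can also be chained into a single assignment (a sketch of the same idea, assuming the dataframe has a plain RangeIndex as in the example):
>>> groups = df.Slope.lt(0).cumsum()
>>> df['Count'] = df.groupby(groups).apply(lambda g: g.reset_index().index.to_series()).reset_index(drop=True)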
We can solve the problem by looping through all the rows and using the loc feature in pandas, assuming that you already have a dataframe named df with a column called slope. The idea is that we sequentially add one to the previous row's count, but whenever we hit a row where slope_i < 0, the count is multiplied by 0, which resets it.
df['new_col'] = 0  # just preset everything to be zero
df.loc[0, 'new_col'] = int(df.loc[0, 'slope'] >= 0)  # the first row has no predecessor, so set it directly
for i in range(1, len(df)):
    df.loc[i, 'new_col'] = (df.loc[i-1, 'new_col'] + 1) * (df.loc[i, 'slope'] >= 0)
You can do this by using the groupby command. It requires some steps, which probably could be shortened, but it works this way.
First, you create a reset column by flagging the negative numbers:
# create reset condition
df['reset'] = df.slope.lt(0)
Then you create groups by applying cumsum() to these resets: at this point every run of positives gets a unique group value. The last line here assigns all negative numbers to group 0.
# create groups of positive values
df['group'] = df.reset.cumsum()
df.loc[df['reset'], 'group'] = 0
Now you take the groups of positives and cumsum a column of ones (there MUST be a better solution than that) to get your result. The last line again cleans up the results for negative values.
# sum ones :-D
df['count'] = 1
df['count'] = df.groupby('group')['count'].cumsum()
df.loc[df['reset'], 'count'] = 0
It is not that fine one-liner, but especially for larger datasets it should be faster than iterating through the whole dataframe.
For easier copy & paste, here is the whole thing (including some commented lines which replace the lines before them; shorter, but harder to understand):
import pandas as pd
## create data
slope = [-25, -15, 17, 6, 0.1, 5, -3, 5, 1, 3, -0.1, -0.2, 1, -9]
df = pd.DataFrame(data=slope, columns=['slope'])
## create reset condition
df['reset'] = df.slope.lt(0)
## create groups of positive values
df['group'] = df.reset.cumsum()
df.loc[df['reset'], 'group'] = 0
# df['group'] = df.reset.cumsum().mask(df.reset, 0)
## sum ones :-D
df['count'] = 1
df['count'] = df.groupby('group')['count'].cumsum()
df.loc[df['reset'], 'count'] = 0
# df['count'] = df.groupby('group')['count'].cumsum().mask(df.reset, 0)
IMO, solving this problem iteratively is the only way because there is a condition that has to be met. You can use any iterative approach, like for or while. Solving this problem with map will be troublesome, since this problem still needs the previous element to be modified and assigned to the current element.
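To make that concrete, a minimal iterative sketch of that idea (a plain Python loop over the Slope column from the question):
counts = []
current = 0
for s in df['Slope']:
    # add one for a non-negative slope, reset to zero on a negative one
    current = current + 1 if s >= 0 else 0
    counts.append(current)
df['Count'] = counts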
Related
I'm trying to create 2-dimensional bins from a pandas DataFrame based on 3 columns. Here a snippet from my DataFrame:
Scatters N z Dist_first
---------------------------------------
0 0 0 0.096144 2.761508
1 1 0 -8.229910 17.403039
2 2 0 0.038125 21.466233
3 3 0 -2.050480 29.239867
4 4 0 -1.620470 NaN
5 5 0 -1.975930 NaN
6 6 0 -11.672200 NaN
7 7 0 -16.629000 26.554049
8 8 0 0.096002 NaN
9 9 0 0.176049 NaN
10 10 0 0.176005 NaN
11 11 0 0.215408 NaN
12 12 0 0.255889 NaN
13 13 0 0.301834 27.700308
14 14 0 -29.593600 9.155065
15 15 1 -2.582290 NaN
16 16 1 0.016441 2.220946
17 17 1 -17.329100 NaN
18 18 1 -5.442320 34.520919
19 19 1 0.001741 39.579189
For my result, each Dist_first should be binned with all "z <= 0" of lower index within a group "N" than the distance itself. "Scatters" is a copy of the index, left over from an operation in an earlier stage of my code which is not relevant here; nonetheless I came to use it instead of the index in the example below. The bins for the distances and z's are in 10 m and 0.1 m steps, respectively, and I can obtain a result by looping through groups of the DataFrame:
# create new column for maximal possible distances per group N
for j in range(N.groupby('N')['Dist_first'].count().max()):
    N[j+1] = N.loc[N[N['Dist_first'].notna()].groupby('N')['Scatters'].nlargest(j+1).groupby('N').min()]['Dist_first']
    # fill nans with zeros to allow the subtraction below
    N[j+1] = N[j+1].fillna(0)
    # make sure no value is repeated
    if j+1 > 1:
        N[j+1] = N[j+1]-N[list(np.arange(j)+1)].sum(axis=1)

# and set all values <= 0 to NaN
N[N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] <= 0] = np.nan
# backwards fill to make sure every distance gets all necessary depths
N[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)] = N.set_index('N').groupby('N').bfill().set_index('Scatters')[list(np.arange(N.groupby('N')['Dist_first'].count().max())+1)]

# bin the result(s)
for j in range(N.groupby('N')['Dist_first'].count().max()):
    binned = N[N['z'] >= 0].groupby([pd.cut(N[N['z'] >= 0]['z'], bins_v, include_lowest=True), pd.cut(N[N['z'] >= 0][j+1], bins_h, include_lowest=True)])
    binned = binned.size().unstack()
    ## rename
    binned.index = N_v.index; binned.columns = N_h.index
    ## and sum up with earlier chunks
    V = V+binned
This bit of code works just fine and the result for the small snippet of the data I've shared looks like this:
Distance [m] 0.0 10.0 20.0 30.0 40.0
Depth [m]
----------------------------------------------------
0.0 1 1 1 4 2
0.1 1 2 2 4 0
0.2 0 3 0 3 0
0.3 0 2 0 2 0
0.4 0 0 0 0 0
However, the whole datasets are excessively large (> 300 million rows each) and looping through all rows is not an option. Therefore I'm looking for a vectorized solution.
I suggest you calculate the criteria in extra columns and then use a standard pandas binning function, like qcut. It can be applied separately along the 2 binning dimensions. Not the most elegant, but definitely vectorized.
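For example, the binning step itself could look roughly like this (a sketch only: it uses pd.cut instead of qcut, the bin edges and the new column names dist_bin/z_bin are assumptions, and the criteria columns would still have to be computed beforehand):
import numpy as np
import pandas as pd

bins_h = np.arange(0, 60, 10)    # assumed distance bins, 10 m steps
bins_v = np.arange(0, 0.6, 0.1)  # assumed depth bins, 0.1 m steps

# bin each dimension into an extra (categorical) column, then count per 2-D bin
N['dist_bin'] = pd.cut(N['Dist_first'], bins=bins_h, include_lowest=True)
N['z_bin'] = pd.cut(N['z'], bins=bins_v, include_lowest=True)
V = N.groupby(['z_bin', 'dist_bin']).size().unstack()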
Assume that we have the following pandas dataframe:
df = pd.DataFrame({'x':[0,0,1,0,0,0,0],'y':[1,1,1,1,1,1,0],'z':[0,1,1,1,0,0,1]})
x y z
0 0 1 0
1 0 1 1
2 1 1 1
3 0 1 1
4 0 1 0
5 0 1 0
6 0 0 1
The whole dataframe is filled with either 1 or 0. Looking at each column separately, if the current row value is different from the previous value, I need to count the number of previous consecutive values:
x y z
0
1 1
2 2
3 1
4 3
5
6 6 2
I tried to write a lambda function and apply it to the entire dataframe, but I failed. Any idea?
Let's try this:
def f(col):
    x = (col != col.shift().bfill())
    s = x.cumsum()
    return s.groupby(s).transform('count').shift().where(x)

df.apply(f).fillna('')
Output:
x y z
0
1 1
2 2
3 1
4 3
5
6 6 2
Details:
Use apply, to apply a custom function on each column of the dataframe.
Find the difference spots in the column, then use cumsum to create groups of consecutive values, then groupby and transform to get a count for each record, and finally mask the values in the column using where, keeping only the difference spots.
You can try the following, where you identify the "runs" first and then get the run lengths. You only make an entry where the value switches, so what is reported are the lengths of the runs, except the last one.
import pandas as pd
import numpy as np

def func(x, missing=np.nan):
    runs = np.cumsum(np.append(0, np.diff(x) != 0))
    switches = np.where(np.diff(x != 0))[0] + 1
    out = np.repeat(missing, len(x))
    out[switches] = np.bincount(runs)[:-1]
    # thanks to Scott, see comments below
    ##out[switches] = pd.value_counts(runs, sort=False)[:-1]
    return out

df.apply(func)
x y z
0 NaN NaN NaN
1 NaN NaN 1.0
2 2.0 NaN NaN
3 1.0 NaN NaN
4 NaN NaN 3.0
5 NaN NaN NaN
6 NaN 6.0 2.0
It might be faster with a good implementation of run-length encoding, but I am not too familiar with it in Python.
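For what it's worth, a bare-bones run-length encoding could be built on itertools.groupby (just to show the idea; not a drop-in replacement for the function above):
from itertools import groupby

def run_lengths(values):
    # (value, run length) pairs for consecutive equal values
    return [(key, sum(1 for _ in group)) for key, group in groupby(values)]

run_lengths([0, 0, 1, 1, 1, 0, 1])  # [(0, 2), (1, 3), (0, 1), (1, 1)]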
I have a table in pandas, df1:
id value
1 1500
2 -1000
3 0
4 50000
5 50
Also, I have another table in dataframe df2 that contains the upper boundaries of groups, so essentially every row represents an interval from the previous boundary to the current one (the first interval is "<0"):
group upper
0 0
1 1000
2 NaN
How should I get the relevant group for each value in df1, using the intervals from df2? I can't use join, merge, etc., because the rule for this join should be "if value is between the previous upper and the current upper" and not "if value equals something". The only way that I've found is using a predefined function with df.apply() (there is also a case of categorical values in it, with interval_flag==False):
import math

def values_to_group(x, interval_flag, groups_def):
    if interval_flag == True:
        for ind, gr in groups_def.sort_values(by='group').iterrows():
            if x < gr[1]:
                return gr[0]
            elif math.isnan(gr[1]) == True:
                return gr[0]
    else:
        for ind, gr in groups_def.sort_values(by='group').iterrows():
            if x in gr[1]:
                return gr[0]
Is there an easier/more optimal way to do it?
The expected output should be this:
id value group
1 1500 2
2 -1000 0
3 0 1
4 50000 2
5 50 1
I suggest using cut with df2 sorted by upper and replacing the last NaN with np.inf:
df2 = pd.DataFrame({'group':[0,1,2], 'upper':[0,1000,np.nan]})
df2 = df2.sort_values('upper')
df2['upper'] = df2['upper'].replace(np.nan, np.inf)
print (df2)
group upper
0 0 0.000000
1 1 1000.000000
2 2 inf
#added first bin -np.inf
bins = np.insert(df2['upper'].values, 0, -np.inf)
df1['group'] = pd.cut(df1['value'], bins=bins, labels=df2['group'], right=False)
print (df1)
id value group
0 1 1500 2
1 2 -1000 0
2 3 0 1
3 4 50000 2
4 5 50 1
Here's a solution using numpy.digitize. Your only task is to construct bins and names input lists, which should be possible via an input dataframe.
import pandas as pd, numpy as np
df = pd.DataFrame({'val': [99, 53, 71, 84, 84]})
df['ratio'] = df['val']/ df['val'].shift() - 1
bins = [-np.inf, 0, 0.2, 0.4, 0.6, 0.8, 1.0, np.inf]
names = ['<0', '0.0-0.2', '0.2-0.4', '0.4-0.6', '0.6-0.8', '0.8-1.0', '>1']
d = dict(enumerate(names, 1))
df['Bucket'] = list(map(d.get, np.digitize(df['ratio'], bins)))
print(df)
val ratio Bucket
0 99 NaN None
1 53 -0.464646 <0
2 71 0.339623 0.2-0.4
3 84 0.183099 0.0-0.2
4 84 0.000000 0.0-0.2
In the following pandas dataframe, I want to replace each row that has a "-1" value with the value of the previous row. So this is the original df:
position
0 0
1 -1
2 1
3 1
4 -1
5 0
And I want to transform it into:
position
0 0
1 0
2 1
3 1
4 1
5 0
I'm doing it in the following way, but I think there should be faster ways, probably by vectorizing it or something like that (although I wasn't able to do it).
for i, row in self.df.iterrows():
    if row["position"] == -1:
        self.df.loc[i, "position"] = self.df.loc[i-1, "position"]
So, the code works, but it seems slow, is there any way to speed it up?
Use replace + ffill:
df.replace(-1, np.nan).ffill()
position
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 0.0
Replace will convert -1 to NaN values. ffill will replace NaNs with the value just above it.
Use .astype for an integer result:
df.replace(-1, np.nan).ffill().astype(int)
position
0 0
1 0
2 1
3 1
4 1
5 0
Don't forget to assign the result back. You could perform the same operation on just the position column if need be:
df['position'] = df['position'].replace(-1, np.nan).ffill().astype(int)
Solution using np.where:
c = df['position']
df['position'] = np.where(c == -1, c.shift(), c)
df
position
0 0.0
1 0.0
2 1.0
3 1.0
4 1.0
5 0.0
Someone asked how to select the first observation per group in a pandas df. I am interested in both the first and the last, and I don't know an efficient way of doing it except writing a for loop.
I am going to modify his example to show what I am looking for.
Basically, there is a df like this:
group_id
1
1
1
2
2
2
3
3
3
I would like to have a variable that indicates the last observation in a group:
group_id indicator
1 0
1 0
1 1
2 0
2 0
2 1
3 0
3 0
3 1
Using pandas.shift, you can do something like:
df['group_indicator'] = df.group_id != df.group_id.shift(-1)
(or
df['group_indicator'] = (df.group_id != df.group_id.shift(-1)).astype(int)
if it's actually important for you to have it as an integer.)
Note:
For large datasets, this should be much faster than a list comprehension (not to mention loops).
As Alexander notes, this assumes the DataFrame is sorted as it is in the example.
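If it isn't, a one-line sort beforehand is enough (a sketch):
df = df.sort_values('group_id').reset_index(drop=True)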
First, we'll create a list of the index locations containing the last element of each group. You can see the elements of each group as follows:
>>> df.groupby('group_id').groups
{1: [0, 1, 2], 2: [3, 4, 5], 3: [6, 7, 8]}
We use a list comprehension to extract the last index location (idx[-1]) of each of these group index values.
We assign the indicator to the dataframe by using a list comprehension and a ternary operator (i.e. 1 if condition else 0), iterating across each element in the index and checking if it is in the idx_last_group list.
idx_last_group = [idx[-1] for idx in df.groupby('group_id').groups.values()]
df['indicator'] = [1 if idx in idx_last_group else 0 for idx in df.index]
>>> df
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Use the .tail method:
df=df.groupby('group_id').tail(1)
You can groupby 'group_id' and call nth(-1) to get the last entry for each group, then use this to mask the df and set the 'indicator' to 1, and then fill the rest with 0 using fillna:
In [21]:
df.loc[df.groupby('group_id')['group_id'].nth(-1).index,'indicator'] = 1
df['indicator'].fillna(0, inplace=True)
df
Out[21]:
group_id indicator
0 1 0
1 1 0
2 1 1
3 2 0
4 2 0
5 2 1
6 3 0
7 3 0
8 3 1
Here is the output from the groupby:
In [22]:
df.groupby('group_id')['group_id'].nth(-1)
Out[22]:
2 1
5 2
8 3
Name: group_id, dtype: int64
One line:
data['indicator'] = (data.groupby('group_id').cumcount() == data.groupby('group_id')['any_other_column'].transform('size') - 1).astype(int)
What we do is check whether the cumulative count (which returns a vector the same size as the dataframe) is equal to the "size of the group - 1", which we calculate using transform so that it also returns a vector the same size as the dataframe.
We need to use some other column for the transform because it won't let you transform the .groupby() variable, but this can literally be any other column, and it won't be affected since it is only used to calculate the new indicator. Use .astype(int) to make it binary, and done.