Efficient Partitioning of Pandas DataFrame rows between sandwiched indicator variables - python

Suppose I have a pandas df with an indicator column whose 1s close off each period. Ex.
In [9]: pd.DataFrame({'col1':np.arange(1,11),'indicator':[0,1,0,0,0,1,0,0,1,1]})
Out[9]:
   col1  indicator
0     1          0
1     2          1
2     3          0
3     4          0
4     5          0
5     6          1
6     7          0
7     8          0
8     9          1
9    10          1
What I want to do is use groupby to select the partitions separated by the indicators, e.g.:
Group 1
   col1  indicator
0     1          0
1     2          1
Group 2
2     3          0
3     4          0
4     5          0
5     6          1
Group 3
6     7          0
7     8          0
8     9          1
Group 4
9    10          1
The naive solution would be to take the indicator column out as a list, run a for-loop over it, and label each part. But suppose the dataset is really big and you want to avoid the for-loop. Is there something more clever that can be done here to separate out the different groups?
Thanks!

Just assign another column as the cumsum of indicator, then apply groupby; this should do the trick:
# reverse the order as you have indicator at end of group, then reverse back
df['grouped'] = df['indicator'].loc[::-1].cumsum().loc[::-1]
for g in df.groupby('grouped', sort=False):
print(g)
(4,    col1  indicator  grouped
0     1          0        4
1     2          1        4)
(3,    col1  indicator  grouped
2     3          0        3
3     4          0        3
4     5          0        3
5     6          1        3)
(2,    col1  indicator  grouped
6     7          0        2
7     8          0        2
8     9          1        2)
(1,    col1  indicator  grouped
9    10          1        1)
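If ascending group labels are preferred, the double reversal can be avoided by shifting the indicator down one row before taking the cumulative sum, so the row after each 1 starts a new group. A self-contained sketch of that variant:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.arange(1, 11),
                   'indicator': [0, 1, 0, 0, 0, 1, 0, 0, 1, 1]})

# A trailing 1 closes a group, so shift it down one row before cumsum:
# each row *after* an indicator then begins a new group label.
df['grouped'] = df['indicator'].shift(fill_value=0).cumsum()

sizes = df.groupby('grouped').size().tolist()
print(sizes)  # [2, 4, 3, 1]
```

The labels come out as 0, 1, 2, 3 in row order, matching Group 1 through Group 4 above.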

Related

regrouping similar column values in pandas

I have a dataframe with many rows and columns, looking like this:
index  col1  col2
1         0     1
2         5     1
3         5     4
4         5     4
5         3     4
6         2     4
7         2     1
8         2     2
I would like to keep only the values that differ from the previous row and replace the others with 0. On the example dataframe, it would be:
index  col1  col2
1         0     1
2         5     0
3         0     4
4         0     0
5         3     0
6         2     0
7         0     1
8         0     2
What is a solution that works for any number of rows/columns?
So you'd like to keep the values where the difference to the previous row is not equal to 0 (i.e., they're not the same), and put 0 everywhere else:
>>> df.where(df.diff().ne(0), other=0)
       col1  col2
index
1         0     1
2         5     0
3         0     4
4         0     0
5         3     0
6         2     0
7         0     1
8         0     2
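A self-contained version of the above, rebuilding the example frame from the question; note that diff() is NaN on the first row, and NaN is not equal to 0, so the first row is always kept:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 5, 5, 5, 3, 2, 2, 2],
                   'col2': [1, 1, 4, 4, 4, 4, 1, 2]},
                  index=pd.Index(range(1, 9), name='index'))

# Keep a value only where it differs from the previous row; put 0 elsewhere.
out = df.where(df.diff().ne(0), other=0)
print(out)
```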

Find first non-zero element within a group in pandas

I have a dataframe that you can see how it is in the following. The column named target is my desired column:
group  value  target
1      1      0
1      2      0
1      3      2
1      4      0
1      5      1
2      1      0
2      2      0
2      3      0
2      4      1
2      5      3
Now I want to find the first non-zero value in the target column for each group and remove rows before that row in each group. So the output should be like this:
group  value  target
1      3      2
1      4      0
1      5      1
2      4      1
2      5      3
I have seen this post, but I don't know how to change the code to get my desired result.
How can I do this?
In the groupby, set sort to False, get the cumsum, then filter for rows not equal to 0:
df.loc[df.groupby(["group"], sort=False).target.cumsum() != 0]
   group  value  target
2      1      3       2
3      1      4       0
4      1      5       1
8      2      4       1
9      2      5       3
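A self-contained sketch of this cumsum trick, rebuilding the example frame: within each group the running sum stays at 0 until the first non-zero target, so those leading rows are dropped (this assumes target values are non-negative, as in the question):

```python
import pandas as pd

df = pd.DataFrame({'group':  [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value':  [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                   'target': [0, 0, 2, 0, 1, 0, 0, 0, 1, 3]})

# Running sum of target per group is 0 exactly on the rows before
# the first non-zero target; keep everything else.
out = df.loc[df.groupby('group', sort=False)['target'].cumsum() != 0]
print(out)
```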
This should do. I'm sure you can do it with fewer reset_index() calls, but it shouldn't affect the speed too much if your dataframe isn't too big:
# use ['index'] rather than .index: attribute access would return the
# DataFrame's index, not the column created by reset_index()
idx = dff[dff.target.ne(0)].reset_index().groupby('group')['index'].first()
mask = (dff.reset_index().set_index('group')['index'].ge(idx.to_frame()['index'])).values
df_final = dff[mask]
Output:
   group  value  target
3      1      3       2
4      1      4       0
5      1      5       1
9      2      4       1
10     2      5       3

Filtering pandas dataframe groups based on groups comparison

I am trying to remove corrupted data from my pandas dataframe. I want to remove groups from the dataframe whose value differs from the last group's value by more than one. Here is an example:
   Value
0      1
1      1
2      1
3      2
4      2
5      2
6      8   <- if I groupby by Value, this group's number is larger than
7      8      the last group's number by 6, so I want to remove this
8      3      group from the dataframe
9      3
Expected result:
   Value
0      1
1      1
2      1
3      2
4      2
5      2
6      3
7      3
Edit:
jezrael's solution is great, but in my case it is possible that there will be duplicate group values:
   Value
0      1
1      1
2      1
3      3
4      3
5      3
6      1
7      1
Sorry if I was not clear about this.
First remove duplicates to get the unique values, then compare the difference with the shifted values, and finally filter by boolean indexing:
s = df['Value'].drop_duplicates()
v = s[s.diff().gt(s.shift())]
df = df[~df['Value'].isin(v)]
print(df)
   Value
0      1
1      1
2      1
3      2
4      2
5      2
8      3
9      3
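A self-contained run of this first approach on the original example frame; the flagged set v here comes out as {8}, and isin then drops every row carrying a flagged value, not just the first run:

```python
import pandas as pd

df = pd.DataFrame({'Value': [1, 1, 1, 2, 2, 2, 8, 8, 3, 3]})

# One representative per distinct value, in order of first appearance.
s = df['Value'].drop_duplicates()
# Flag values whose jump from the previous unique value exceeds
# that previous value itself (8 - 2 = 6 > 2 here).
v = s[s.diff().gt(s.shift())]
# Remove all rows whose value was flagged.
out = df[~df['Value'].isin(v)]
print(out['Value'].tolist())  # [1, 1, 1, 2, 2, 2, 3, 3]
```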
Maybe:
df2 = df.drop_duplicates()
print(df[df['Value'].isin(df2.loc[~df2['Value'].gt(df2['Value'].shift(-1)), 'Value'].tolist())])
Output:
   Value
0      1
1      1
2      1
3      2
4      2
5      2
8      3
9      3
We can check whether the difference is less than or equal to 5, or NaN. After that we check which rows are duplicated and keep those rows:
s = df[df['Value'].diff().le(5) | df['Value'].diff().isna()]
s[s.duplicated(keep=False)]
   Value
0      1
1      1
2      1
3      2
4      2
5      2
8      3
9      3

Pandas: how to add row values by index value

I'm having trouble working out how to add the index value of a pandas dataframe to each value at that index. For example, if I have a dataframe of zeroes, the row with index 1 should have a value of 1 for all columns. The row at index 2 should have values of 2 for each column, and so on.
Can someone enlighten me please?
You can use pd.DataFrame.add with axis=0. Just remember, as below, to convert your index to a series first.
df = pd.DataFrame(np.random.randint(0, 10, (5, 5)))
print(df)
   0  1  2  3  4
0  3  4  2  2  2
1  9  6  1  8  0
2  2  9  0  5  3
3  3  1  1  7  0
4  2  6  3  6  6
df = df.add(df.index.to_series(), axis=0)
print(df)
    0   1  2   3   4
0   3   4  2   2   2
1  10   7  2   9   1
2   4  11  2   7   5
3   6   4  4  10   3
4   6  10  7  10  10
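Applied to the question's literal case of a frame of zeroes, each row ends up equal to its index (a minimal sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((4, 3), dtype=int))

# Broadcast the index down the rows: axis=0 aligns the Series
# of index values with each row before adding.
df = df.add(df.index.to_series(), axis=0)
print(df.values.tolist())  # [[0, 0, 0], [1, 1, 1], [2, 2, 2], [3, 3, 3]]
```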

Sort pandas DataFrame by multiple columns and duplicated index

I have a pandas DataFrame with duplicated indices. There are 3 rows with each index, and they correspond to a group of items. There are two columns, a and b.
df = pandas.DataFrame([{'i': b % 4, 'a': abs(b - 6) , 'b': b}
for b in range(12)]).set_index('i')
I want to sort the DataFrame so that:
1. All of the rows with the same indices are adjacent (all of the groups are together).
2. The groups are in reverse order by the lowest value of a within the group.
For example, in the above df, the first three items should be the ones with index 0, because the lowest a value for those three rows is 2, and all of the other groups have at least one row with an a value lower than 2. The second three items could be either group 3 or group 1, because the lowest a value in both of those groups is 1. The last group of items should be group 2, because it has a row with an a value of 0.
3. Within each group, the items are sorted in ascending order by b.
Desired output:
   a   b
i
0  6   0
0  2   4
0  2   8
3  3   3
3  1   7
3  5  11
1  5   1
1  1   5
1  3   9
2  4   2
2  0   6
2  4  10
I've been trying something like:
df.groupby('i')[['a']].transform(min).sort(['a', 'b'], ascending=[0, 1])
But it gives me a KeyError, and it only gets that far if I make i a column instead of an index anyway.
The most straightforward way I see is moving your index to a column, and calculating a new column with the group min.
In [43]: df = df.reset_index()
In [45]: df['group_min'] = df.groupby('i')['a'].transform('min')
Then you can sort by your conditions:
In [49]: df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True])
Out[49]:
    i  a   b  group_min
0   0  6   0          2
4   0  2   4          2
8   0  2   8          2
3   3  3   3          1
7   3  1   7          1
11  3  5  11          1
1   1  5   1          1
5   1  1   5          1
9   1  3   9          1
2   2  4   2          0
6   2  0   6          0
10  2  4  10          0
To get back to your desired frame, drop the tracking variable and reset the index.
In [50]: df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True]).drop('group_min', axis=1).set_index('i')
Out[50]:
   a   b
i
0  6   0
0  2   4
0  2   8
3  3   3
3  1   7
3  5  11
1  5   1
1  1   5
1  3   9
2  4   2
2  0   6
2  4  10
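A self-contained version of this answer, rebuilding the question's frame and using the modern sort_values API (DataFrame.sort was removed in later pandas versions):

```python
import pandas as pd

df = pd.DataFrame([{'i': b % 4, 'a': abs(b - 6), 'b': b}
                   for b in range(12)]).set_index('i')

df = df.reset_index()
# Each group's minimum 'a' decides the ordering of whole groups.
df['group_min'] = df.groupby('i')['a'].transform('min')
out = (df.sort_values(['group_min', 'i', 'b'], ascending=[False, False, True])
         .drop('group_min', axis=1)
         .set_index('i'))
print(out)
```

Ties on group_min (groups 3 and 1 both have a minimum of 1) are broken by i descending, which is one valid choice under the question's requirements.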
You can first sort by a in descending order and then sort your index (sort_values replaces the old df.sort, which was removed from pandas):
>>> df.sort_values(['a', 'b'], ascending=[False, True]).sort_index()
   a   b
i
0  6   0
0  2   4
0  2   8
1  5   1
1  3   9
1  1   5
2  4   2
2  4  10
2  0   6
3  5  11
3  3   3
3  1   7
