How to create sub-DataFrame with minimal values count - python

I have a DataFrame of the form:
a b Class
0 1 10 A
1 2 12 A
2 3 2 A
3 12 5 B
4 5 7 A
5 6 8 B
6 7 17 A
7 1 1 B
8 5 0 B
From this DataFrame I want to get another DataFrame that has at least N rows for each of the values of column Class (here at least N rows of class 'A' and N rows of class 'B').
The new DataFrame should include all the rows starting from the end of the DataFrame and down to the row where the condition is met.
In the data above with N=2 I expect to get:
a b Class
4 5 7 A
5 6 8 B
6 7 17 A
7 1 1 B
8 5 0 B
Thanks.

You can extract the last 2 rows of each Class and take the first index label of the result.
Then slice your original dataframe from that label onwards.
idx = df.groupby('Class').tail(2).index[0]
res = df.loc[idx:]
print(res)
a b Class
4 5 7 A
5 6 8 B
6 7 17 A
7 1 1 B
8 5 0 B
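As a hedged sketch, the same idea can be wrapped in a small helper so that N becomes a parameter. The name last_rows_with_min_count is hypothetical (it is not from the answer), the sample frame is rebuilt from the question, and the sketch assumes a unique index and at least N rows in every class. If a class has fewer than N rows, tail simply returns all of them and the condition cannot be met.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "a": [1, 2, 3, 12, 5, 6, 7, 1, 5],
    "b": [10, 12, 2, 5, 7, 8, 17, 1, 0],
    "Class": ["A", "A", "A", "B", "A", "B", "A", "B", "B"],
})


def last_rows_with_min_count(frame, col, n):
    """Hypothetical helper: trailing rows holding the last n rows of every class."""
    # tail(n) keeps the last n rows of every group in their original order,
    # so the first label of the result is the earliest row that must be kept.
    start = frame.groupby(col).tail(n).index[0]
    # .loc slices by label and includes the start label itself
    return frame.loc[start:]


res = last_rows_with_min_count(df, "Class", 2)
print(res)
```

Slicing with .loc rather than plain df[idx:] keeps the lookup label-based, which matters once the index is no longer the default 0..n-1 range.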

Related

How to identify one column with continuous number and same value of another column?

I have a DataFrame with two columns A and B.
I want to create a new column named C to identify the continuous A with the same B value.
Here's an example
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,5,6,10,11,12,13,18], 'B':[1,1,2,2,3,3,3,3,4,4]})
I found a similar question, but that method only identifies the continuous A regardless of B.
df['C'] = df['A'].diff().ne(1).cumsum().sub(1)
I have tried to groupby B and apply the function like this:
df['C'] = df.groupby('B').apply(lambda x: x['A'].diff().ne(1).cumsum().sub(1))
However, it doesn't work: TypeError: incompatible index of inserted column with frame index.
The expected output is
A B C
1 1 0
2 1 0
3 2 1
5 2 2
6 3 3
10 3 4
11 3 4
12 3 4
13 4 5
18 4 6
Let's create a sequential counter using groupby, diff and cumsum, then factorize to re-encode the counter so it starts at 0:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().factorize()[0]
Result
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
Use SeriesGroupBy.diff, compare the result for not equal to 1 with Series.ne, take Series.cumsum, and finally subtract 1:
df['C'] = df.groupby('B')['A'].diff().ne(1).cumsum().sub(1)
print (df)
A B C
0 1 1 0
1 2 1 0
2 3 2 1
3 5 2 2
4 6 3 3
5 10 3 4
6 11 3 4
7 12 3 4
8 13 4 5
9 18 4 6
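To see why the chain gives the expected result, here is a hedged step-by-step sketch of the same technique with the intermediate values kept in named variables. The names step and new_run are only illustrative and do not appear in the answers.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 5, 6, 10, 11, 12, 13, 18],
                   'B': [1, 1, 2, 2, 3, 3, 3, 3, 4, 4]})

# Difference to the previous A within the same B group;
# the first row of every group has no predecessor, so it is NaN.
step = df.groupby('B')['A'].diff()

# True wherever a new run starts: either the B group changed (NaN != 1)
# or A did not increase by exactly 1.
new_run = step.ne(1)

# Running count of run starts, shifted so the first run is 0
df['C'] = new_run.cumsum().sub(1)

print(df.assign(step=step, new_run=new_run))
```

Because groupby(...).diff() returns a Series aligned with the original index, it can be assigned straight back to df, which is what the groupby(...).apply(...) attempt in the question could not do.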

How to calculate totals of all possible combinations of columns

I have the following df:
df = pd.DataFrame({'a': [1,2,3,4,2], 'b': [3,4,1,0,4], 'c':[1,2,3,1,0], 'd':[3,2,4,1,4]})
I want to generate a combination of totals from these 4 columns, which equals 4 x 3 x 2 = 24 total combinations minus duplicates. I want the results in the same df.
I want something that looks like this (partial results shown):
A combo of a_b is the same as b_a, and therefore I wouldn't want such a calculation since it's a duplicate.
Is there a way to calculate all combinations and exclude duplicate totals?
import itertools as it
orig_cols = df.columns
for r in range(2, df.shape[1] + 1):
    for cols in it.combinations(orig_cols, r):
        # select the columns of this combination and sum them row-wise
        df["_".join(cols)] = df.loc[:, list(cols)].sum(axis=1)
This needs some looping, though not over the dataframe itself but over the combinations. We take the combinations of size 2, 3, ..., N of the column names, where N is the number of columns, and form each new _-joined column as the row-wise sum of the columns in that combination.
In [11]: df
Out[11]:
a b c d a_b a_c a_d b_c b_d c_d a_b_c a_b_d a_c_d b_c_d a_b_c_d
0 1 3 1 3 4 2 4 4 6 4 5 7 5 7 8
1 2 4 2 2 6 4 4 6 6 4 8 8 6 8 10
2 3 1 3 4 4 6 7 4 5 7 7 8 10 8 11
3 4 0 1 1 4 5 5 1 1 2 5 5 6 2 6
4 2 4 0 4 6 2 6 4 8 4 6 10 6 8 10
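As a hedged variant of the loop above, the sketch below collects the combinations first and attaches all totals with a single concat instead of adding columns one by one. The names combos, totals and result are illustrative. Since itertools.combinations yields unordered selections, a_b and b_a can never both appear; for four columns that gives 6 pairs, 4 triples and 1 quadruple, so 11 new columns rather than 24.

```python
import itertools as it

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 2], 'b': [3, 4, 1, 0, 4],
                   'c': [1, 2, 3, 1, 0], 'd': [3, 2, 4, 1, 4]})

orig_cols = list(df.columns)

# Every unordered combination of 2..N column names
combos = [cols
          for r in range(2, len(orig_cols) + 1)
          for cols in it.combinations(orig_cols, r)]

# Build every total as a Series keyed by its "_"-joined name,
# then attach them all to the original frame in one step.
totals = pd.concat(
    {"_".join(cols): df[list(cols)].sum(axis=1) for cols in combos},
    axis=1,
)
result = pd.concat([df, totals], axis=1)

print(len(combos))
print(result)
```

Leaving df untouched and returning a new frame also makes it easy to rerun the snippet without the combination columns feeding back into themselves.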

I want to generate a new column in a pandas dataframe, counting "edges" in another column

I have a dataframe that looks like this:
A B....X
1 1 A
2 2 B
3 3 A
4 6 K
5 7 B
6 8 L
7 9 M
8 1 N
9 7 B
1 6 A
7 7 A
that is, some "rising edges" occur from time to time in the column X (in this example the edge is x==B)
What I need is, a new column Y which increments every time a value of B occurs in X:
A B....X Y
1 1 A 0
2 2 B 1
3 3 A 1
4 6 K 1
5 7 B 2
6 8 L 2
7 9 M 2
8 1 N 2
9 7 B 3
1 6 A 3
7 7 A 3
In SQL I would use some trick like sum(case when x=B then 1 else 0) over ... rows between first and previous. How can I do it in Pandas?
Use cumsum on the boolean comparison; each True counts as 1:
df['Y'] = (df.X == 'B').cumsum()
Out[8]:
A B X Y
0 1 1 A 0
1 2 2 B 1
2 3 3 A 1
3 4 6 K 1
4 5 7 B 2
5 6 8 L 2
6 7 9 M 2
7 8 1 N 2
8 9 7 B 3
9 1 6 A 3
10 7 7 A 3
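For completeness, a self-contained sketch of the same one-liner, with the frame rebuilt from the question's table (the construction itself is an assumption, since the question only shows the printed data):

```python
import pandas as pd

# Data rebuilt from the table in the question
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 7],
    'B': [1, 2, 3, 6, 7, 8, 9, 1, 7, 6, 7],
    'X': ['A', 'B', 'A', 'K', 'B', 'L', 'M', 'N', 'B', 'A', 'A'],
})

# The comparison yields a boolean Series; cumsum treats True as 1 and
# False as 0, so the running total goes up by one on every 'B' row.
df['Y'] = (df['X'] == 'B').cumsum()

print(df)
```

The counter includes the current row, which matches the expected output in the question, where the row containing B already shows the incremented value.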

How to set value to a cell filtered by rows in python DataFrame?

import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],columns=['A','B','C'])
df[df['B']%2 ==0]['C'] = 5
I am expecting this code to change the value of column C to 5 wherever B is even, but it is not working.
It returns the table as follows:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I am expecting it to return
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
If you need to change values of a column in a DataFrame, use DataFrame.loc with the condition and the column name:
df.loc[df['B']%2 ==0, 'C'] = 5
print (df)
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
Your original attempt is a nice example of chained indexing (see the pandas docs): df[mask] returns a copy, so the assignment to ['C'] never reaches df.
You could just change the order to:
df['C'][df['B']%2 == 0] = 5
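This also works in older pandas versions, but it is still chained assignment: it can raise SettingWithCopyWarning, and with Copy-on-Write enabled it no longer modifies df, so df.loc is the safer choice.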
Using numpy where
import numpy as np
df['C'] = np.where(df['B']%2 == 0, 5, df['C'])
Output
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
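The non-chained approaches can be compared side by side. The sketch below is only illustrative: the names is_even, via_loc, via_where and via_mask are made up here, and Series.mask is an extra option not mentioned in the answers. Each variant works on its own copy, so the original df stays unchanged.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                  columns=['A', 'B', 'C'])
is_even = df['B'] % 2 == 0  # boolean mask: True where B is even

# 1) Label-based assignment: row mask and column name in a single .loc call
via_loc = df.copy()
via_loc.loc[is_even, 'C'] = 5

# 2) np.where: take 5 where the mask is True, otherwise keep the old value
via_where = df.copy()
via_where['C'] = np.where(is_even, 5, via_where['C'])

# 3) Series.mask: replace values where the mask is True
via_mask = df.copy()
via_mask['C'] = via_mask['C'].mask(is_even, 5)

print(via_loc)
```

All three write to the column in one indexing step, which is what avoids the copy problem of df[mask]['C'] = 5.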

Get all dataframes based on a certain value in a dataframe column

I have a DataFrame that looks something like this:
import numpy as np
import pandas as pd
df=pd.DataFrame([['d',5,6],['a',6,6],['index',5,8],['b',3,1],['b',5,6],['index',6,7],
['e',2,3],['c',5,6],['index',5,8]],columns=['A','B','C'])
I want to select the lines around each row where A is 'index' and create several dataframes from them.
I want to obtain the following:
dataframe1:
A B C
1 a 6 6
2 index 5 8
3 3 b 3
dataframe 2
A B C
4 b 5 6
5 index 6 7
6 c 2 3
dataframe3:
A B C
7 c 5 6
8 index 5 8
9 4 3 1
dataframe4 :
A B C
11 5 2 3
12 index 4 2
13 1 2 5
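index_list = df.index[df['A'] == 'index'].tolist()  # list of the index labels where df['A'] == 'index'
new_df = []  # empty list for the dataframes
for i in index_list:
    # row before, the 'index' row and the row after; max() keeps the start
    # from going negative (and wrapping around) if 'index' is the first row
    new_df.append(df.iloc[max(i - 1, 0):i + 2])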
This creates a list of dataframes. You can access them with new_df[0], new_df[1], and so on, or use a loop to print them out:
for i in range(len(new_df)):
    print(f'{new_df[i]}\n')
A B C
1 a 6 6
2 index 5 8
3 b 3 1
A B C
4 b 5 6
5 index 6 7
6 e 2 3
A B C
7 c 5 6
8 index 5 8
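The loop above uses index labels as positions, which is fine for the default 0..n-1 index. A hedged sketch that works from positions instead, so it does not depend on the index labels, could look like this (positions and windows are illustrative names, not from the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['d', 5, 6], ['a', 6, 6], ['index', 5, 8], ['b', 3, 1], ['b', 5, 6],
                   ['index', 6, 7], ['e', 2, 3], ['c', 5, 6], ['index', 5, 8]],
                  columns=['A', 'B', 'C'])

# Integer positions (not labels) of the rows where A == 'index'
positions = np.flatnonzero(df['A'].eq('index').to_numpy())

# One window per hit: the row before, the hit itself and the row after.
# max() clips the start at 0; iloc clips the end at the last row by itself.
windows = [df.iloc[max(p - 1, 0):p + 2] for p in positions]

for w in windows:
    print(f'{w}\n')
```

As in the output above, the last window has only two rows because the final 'index' row has no row after it.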
