There is a dataframe df with 2 columns, col1 and col2. Both columns have randomly spread 0s and 1s, with more zeros than ones.
If col1 has a 1 at some index, the program should look for the next 1 in col2 and return the difference between the two row indices.
The distribution is different every time, as is the sequence length.
Try with idxmax:

id1 = df.col1.idxmax()
id2 = df.loc[id1:, 'col2'].idxmax()

>>> id2 - id1
2
>>> id2
4
>>> id1
2
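For completeness, a minimal runnable sketch of this idxmax approach, with made-up data (the exact 0/1 sequence here is an assumption, not from the question):

```python
import pandas as pd

# Made-up data: first 1 in col1 is at index 2, first 1 in col2 at/after that is at index 4
df = pd.DataFrame({'col1': [0, 0, 1, 0, 0, 1, 0, 0],
                   'col2': [0, 0, 0, 0, 1, 0, 0, 1]})

id1 = df.col1.idxmax()               # index of the first 1 in col1
id2 = df.loc[id1:, 'col2'].idxmax()  # first 1 in col2 at or after id1
print(id1, id2, id2 - id1)           # 2 4 2
```

One caveat: idxmax returns the position of the first maximum, so this assumes col1 (and the col2 slice) actually contains a 1; on an all-zero column it would silently return the first index.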
I cannot see your posted image.
How about this.
import random
import pandas as pd
numrows = 10
df = pd.DataFrame({'c1': [random.randint(0, 1) for _ in range(numrows)], 'c2': [random.randint(0, 1) for _ in range(numrows)]})
print(df)
col1_index = None
for index, row in df.iterrows():
    if col1_index is not None:
        if row['c2'] == 1:
            diff = col1_index - index
            print(f'first occurrence of 1 at c2 is at index {index}, the index diff is {diff}')
            col1_index = None
    elif row['c1'] == 1:
        col1_index = index
        print(f'this index {index} has value 1 at c1')
Typical output
c1 c2
0 1 0
1 0 0
2 0 0
3 1 1
4 0 1
5 0 0
6 1 1
7 0 1
8 0 1
9 1 1
this index 0 has value 1 at c1
first occurrence of 1 at c2 is at index 3, the index diff is -3
this index 6 has value 1 at c1
first occurrence of 1 at c2 is at index 7, the index diff is -1
this index 9 has value 1 at c1
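For large frames, the row loop can be replaced with a vectorized sketch using numpy.searchsorted. Note the semantics differ slightly from the loop above: each 1 in c1 is paired with the nearest 1 in c2 at or after the same row, matches are not "consumed", and the diff comes out non-negative. The data below is one made-up example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': [1, 0, 0, 1, 0, 0, 1, 0, 0, 1],
                   'c2': [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]})

ones_c1 = df.index[df.c1.eq(1)].to_numpy()   # rows where c1 == 1
ones_c2 = df.index[df.c2.eq(1)].to_numpy()   # rows where c2 == 1

# For each 1 in c1, position of the first 1 in c2 at or after that row
pos = np.searchsorted(ones_c2, ones_c1)
valid = pos < len(ones_c2)                   # drop c1 hits with no c2 hit after them
diffs = ones_c2[pos[valid]] - ones_c1[valid]
print(diffs)                                 # [3 0 0 0]
```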
I have a DataFrame with a 3-level MultiIndex, for example:
df = pd.DataFrame({
    'col0': [0,8,3,1,2,2,0,0],
    'col1': range(8),
}, index=pd.MultiIndex.from_product([[0,1]] * 3, names=['idx0', 'idx1', 'idx2']))
>>> df
                col0  col1
idx0 idx1 idx2
0    0    0        0     0
          1        8     1
     1    0        3     2
          1        1     3
1    0    0        2     4
          1        2     5
     1    0        0     6
          1        0     7
For each idx0, I want to find the idx1 that has the lowest mean of col0. This gives me idx0, idx1 pairs. Then I'd like to select all the rows matching those pairs.
In the example above, the pairs are [(0, 1), (1, 1)] (with means 2 and 0, respectively) and the desired result is:
                col0  col1
idx0 idx1 idx2
0    1    0        3     2
          1        1     3
1    1    0        0     6
          1        0     7
What I have tried
Step 1: Group by idx0, idx1 and calculate the mean of col0:
mean_col0 = df.groupby(['idx0', 'idx1'])['col0'].mean()
>>> mean_col0
idx0  idx1
0     0       4.0
      1       2.0
1     0       2.0
      1       0.0
Name: col0, dtype: float64
Step 2: Select the indexmin (idx1) by group of idx0:
level_idxs = mean_col0.groupby('idx0').idxmin()
>>> level_idxs
idx0
0    (0, 1)
1    (1, 1)
Name: col0, dtype: object
Step 3: Use that to filter the original dataframe.
That's the main problem. When I simply try df.loc[level_idxs], I get a ValueError due to a shape mismatch. I would need the third index value, or a wildcard.
I think I have a solution. Putting it all together with the steps above:
mean_col0 = df.groupby(['idx0', 'idx1'])['col0'].mean()
level_idxs = mean_col0.groupby(["idx0"]).idxmin()
result = df[df.index.droplevel(2).isin(level_idxs)]
But it seems quite complicated. Is there a better way?
You can use .apply().
For each group of idx0: query only those idx1-s which have the smallest mean in col0:
df.groupby('idx0').apply(lambda g:
    g.query(f"idx1 == {g.groupby('idx1')['col0'].mean().idxmin()}")
).droplevel(0)
The same can be written in this (hopefully more readable) way:
def f(df):
    chosen_idx1 = df.groupby('idx1')['col0'].mean().idxmin()
    return df.query('idx1 == @chosen_idx1')

df.groupby('idx0').apply(f).droplevel(0)
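As a sketch of an apply-free alternative (my own variant, not part of the answer above): two transforms broadcast the per-(idx0, idx1) mean and the per-idx0 minimum of those means back to every row, and a boolean mask keeps the matching rows:

```python
import pandas as pd

df = pd.DataFrame({
    'col0': [0, 8, 3, 1, 2, 2, 0, 0],
    'col1': range(8),
}, index=pd.MultiIndex.from_product([[0, 1]] * 3, names=['idx0', 'idx1', 'idx2']))

# Mean of col0 per (idx0, idx1), broadcast back to every row
means = df.groupby(['idx0', 'idx1'])['col0'].transform('mean')
# Smallest such mean within each idx0, broadcast back as well
best = means.groupby('idx0').transform('min')

result = df[means == best]
```

This keeps the original index intact, so no droplevel is needed.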
Consider a DataFrame such as
df = pd.DataFrame({'a': [1,-2,0,3,-1,2],
'b': [-1,-2,-5,-7,-1,-1],
'c': [-1,-2,-5,4,5,3]})
For each column, how can I replace any negative value with the last positive value or zero above it? "Last" here means scanning from top to bottom within each column. The closest solution I have found is something like df[df < 0] = 0.
The expected result would be a DataFrame such as
df_res = pd.DataFrame({'a': [1,1,0,3,3,2],
'b': [0,0,0,0,0,0],
'c': [0,0,0,4,5,3]})
You can use DataFrame.mask to convert all values < 0 to NaN then use ffill and fillna:
df = df.mask(df.lt(0)).ffill().fillna(0).convert_dtypes()
a b c
0 1 0 0
1 1 0 0
2 0 0 0
3 3 0 4
4 3 0 5
5 2 0 3
Use pandas where (with ge(0), so existing zeros are kept rather than forward-filled over):

df.where(df.ge(0)).ffill().fillna(0).astype(int)

   a  b  c
0  1  0  0
1  1  0  0
2  0  0  0
3  3  0  4
4  3  0  5
5  2  0  3
The expected result may be obtained with these manipulations:

import numpy as np

mask = df >= 0                    # boolean mask for non-negative values
df_res = (df.where(mask, np.nan)  # replace negative values with NaN
            .ffill()              # forward-fill the NaNs
            .fillna(0))           # fill the remaining NaNs with zeros
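Putting the mask/ffill idea into a self-contained snippet against the question's data, checking that it reproduces df_res:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, -2, 0, 3, -1, 2],
                   'b': [-1, -2, -5, -7, -1, -1],
                   'c': [-1, -2, -5, 4, 5, 3]})

# Negatives -> NaN, carry the last non-negative value down, zeros for leading NaNs
res = df.mask(df.lt(0)).ffill().fillna(0).astype(int)

df_res = pd.DataFrame({'a': [1, 1, 0, 3, 3, 2],
                       'b': [0, 0, 0, 0, 0, 0],
                       'c': [0, 0, 0, 4, 5, 3]})
print(res.equals(df_res))  # True
```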
ID col1 col2 col3
I1 1 0 1
I2 1 0 1
I3 0 1 0
I4 0 1 0
I5 0 0 1
This is my dataframe. I want to aggregate the ID values, grouping by col1, col2 and col3, and I also want a count column along with this.
Expected output :
ID_List Count
[I1,I2] 2
[I3,I4] 2
[I5] 1
My code
cols_to_group = ['col1','col2','col3']
data = pd.DataFrame(df.groupby(cols_to_group)['ID'].nunique()).reset_index(drop=True)
data.head()
ID
0 2
1 2
2 1
You can do a groupby.agg():
df.groupby(['col1','col2','col3'], sort=False).ID.agg([list,'count'])
Output:
                    list  count
col1 col2 col3
1    0    1     [I1, I2]      2
0    1    0     [I3, I4]      2
     0    1         [I5]      1
You need to pass an aggregation function such as sum or count; in this case, count together with list. Try the code below.
df.groupby(['col1','col2','col3']).ID.agg([list,'count']).reset_index(drop=True)
Output:
list count
0 [I1, I2] 2
1 [I3, I4] 2
2 [I5] 1
Here you go:
grouped = df.groupby(['col1', 'col2', 'col3'], sort=False).ID
df = pd.DataFrame({
'ID_List': grouped.aggregate(list),
'Count': grouped.count()
}).reset_index(drop=True)
print(df)
Output:
ID_List Count
0 [I1, I2] 2
1 [I3, I4] 2
2 [I5] 1
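If you want the output column names directly (ID_List, Count) without building a second DataFrame, named aggregation (available since pandas 0.25) is one more option, sketched here against the question's data:

```python
import pandas as pd

df = pd.DataFrame({'ID':   ['I1', 'I2', 'I3', 'I4', 'I5'],
                   'col1': [1, 1, 0, 0, 0],
                   'col2': [0, 0, 1, 1, 0],
                   'col3': [1, 1, 0, 0, 1]})

# Named aggregation: each output column is (source column, aggregation function)
out = (df.groupby(['col1', 'col2', 'col3'], sort=False)
         .agg(ID_List=('ID', list), Count=('ID', 'count'))
         .reset_index(drop=True))
print(out)
```

sort=False keeps the groups in order of first appearance, matching the expected output.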
This is my csv file:
A B C D J
0 1 0 0 0
0 0 0 0 0
1 1 1 0 0
0 0 0 0 0
0 0 7 0 7
Each time I need to select two columns and check this condition: if a row has two 0s, I delete it. So, for example, I select A and B.
Input
A B
0 1
0 0
1 1
0 0
0 0
Output
A B
0 1
1 1
And then I select A and C, and so on.
I used this code for A and B but it returns errors:
import pandas as pd
df = pd.read_csv('Book1.csv')
a = df['A']
b = df['B']
indexes_to_drop = []
for i in df.index:
    if df[(a==0) & (b==0)]:
        indexes_to_drop.append(i)
df.drop(df.index[indexes_to_drop], inplace=True)
Any help please!
First we make your desired combinations of column A with all the rest, then we use iloc to select the correct rows per column combination:
idx_ranges = [[0,i] for i in range(1, len(df.columns))]
dfs = [df[df.iloc[:, idx].ne(0).any(axis=1)].iloc[:, idx] for idx in idx_ranges]
print(dfs[0], '\n')
print(dfs[1], '\n')
print(dfs[2], '\n')
print(dfs[3])
A B
0 0 1
2 1 1
A C
2 1 1
4 0 7
A D
2 1 0
A J
2 1 0
4 0 7
Do not iterate. Create a Boolean Series to slice your DataFrame:
cols = ['A', 'B']
m = df[cols].ne(0).any(axis=1)
df.loc[m]
A B C D J
0 0 1 0 0 0
2 1 1 1 0 0
You can get all combinations and store them in a dict with itertools.combinations. Use .loc to select both the rows and columns you care about.
from itertools import combinations
d = {c: df.loc[df[list(c)].ne(0).any(axis=1), list(c)]
     for c in combinations(df.columns, 2)}
d[('A', 'B')]
# A B
#0 0 1
#2 1 1
d[('C', 'J')]
# C J
#2 1 0
#4 7 7
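For reference, the same dict-of-combinations idea as a fully self-contained snippet against the question's data:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 0, 0],
                   'B': [1, 0, 1, 0, 0],
                   'C': [0, 0, 1, 0, 7],
                   'D': [0, 0, 0, 0, 0],
                   'J': [0, 0, 0, 0, 7]})

# One filtered two-column frame per pair: keep rows where either column is non-zero
d = {c: df.loc[df[list(c)].ne(0).any(axis=1), list(c)]
     for c in combinations(df.columns, 2)}
```

With 5 columns this yields 10 pairs, each keyed by a tuple such as ('A', 'B').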
I have a dataframe:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1]}
df = pd.DataFrame(data=d)
which looks like-
A B class
0 0 4 0
1 4 1 1
2 8 0 1
3 1 0 0
4 0 3 1
5 0 1 0
I want to calculate, for each column, the corresponding a, b, c, d, where:
a = number of non-zeros in the column where class is 1
b = number of non-zeros in the column where class is 0
c = number of zeros in the column where class is 1
d = number of zeros in the column where class is 0
For example, for column A the a, b, c, d are 2, 1, 1, 2.
Explanation: in column A we see that where column 'class' is 1, the number of non-zero values in A is 2, therefore a = 2 (indices 1, 2). Similarly b = 1 (index 3).
My attempt (when the dataframe had an equal number of class-0 and class-1 rows):

dataset = pd.read_csv('aaf.csv')
n = len(dataset.columns)  # no. of columns
X = dataset.iloc[:, 1:n].values
l = len(X)                # no. of rows
score = []
for i in range(n-1):
    X_column = X[:, i]
    neg_array, pos_array = np.hsplit(X_column, 2)  # hardcoded
    a = np.count_nonzero(pos_array)
    b = np.count_nonzero(neg_array)
    c = l/2 - a
    d = l/2 - b
Use:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1]}
df = pd.DataFrame(data=d)
df = (df.set_index('class')
        .ne(0)
        .stack()
        .groupby(level=[0,1])
        .value_counts()
        .unstack(1)
        .sort_index(level=1, ascending=False)
        .T)
print (df)
class     1     0      1      0
       True  True  False  False
A         2     1      1      2
B         2     2      1      1
df.columns = list('abcd')
print (df)
a b c d
A 2 1 1 2
B 2 2 1 1
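A more explicit (if less compact) sketch of the same counts, using one boolean mask per class; variable names here are my own:

```python
import pandas as pd

df = pd.DataFrame({'class': [0, 1, 1, 0, 1, 0],
                   'A': [0, 4, 8, 1, 0, 0],
                   'B': [4, 1, 0, 0, 3, 1]})

cls = df.pop('class')
nz = df.ne(0)                       # True where the value is non-zero

out = pd.DataFrame({
    'a': nz[cls.eq(1)].sum(),       # non-zeros where class == 1
    'b': nz[cls.eq(0)].sum(),       # non-zeros where class == 0
    'c': (~nz)[cls.eq(1)].sum(),    # zeros where class == 1
    'd': (~nz)[cls.eq(0)].sum(),    # zeros where class == 0
})
print(out)
#    a  b  c  d
# A  2  1  1  2
# B  2  2  1  1
```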