Find the next higher value in a Python DataFrame column

We all know how to find the maximum value of a DataFrame column.
But how can I find the next higher value in a column? For example, I have the following DataFrame:
import pandas as pd

d = {'col1': [3, 5, 2], 'col2': [3, 4, 3]}
df = pd.DataFrame(data=d)

   col1  col2
0     3     3
1     5     4
2     2     3
Basic question:
When I want to find the next higher value in col1 above 0, the outcome should be 2. Is there something similar to df.loc[df['col1'].idxmax()], which leads to:
col1  col2
   5     4
And my outcome should be:
col1  col2
   2     3
Background: I am using an if-condition to filter this DataFrame, as I need to prepare it for further filtering, and not all values that I will put in exist:
v = 0
if len(df[(df['col1'] == v)]) == 0:
    df2 = df[(df['col1'] == v+1)]
else:
    df2 = df[(df['col1'] == v)]
This would lead to an empty DataFrame.
But I would like to go to the next existing entry, not v+1=1; in this case I want to insert 2, because it is the next higher value that has an entry after 0. So the condition would be:
v = 0
if len(df[(df['col1'] == v)]) == 0:
    df2 = df[(df['col1'] == 2)]  # the 2 has to be found automatically, as the next value does not have a fixed distance
else:
    df2 = df[(df['col1'] == v)]
How can I achieve that automatically?
So my desired outcome is:
When I put in v=0:
df2
col1  col2
   2     3
When I put in v=2, it jumps to v=3:
df2
col1  col2
   3     3
If I put in v=3, it stays (else-condition):
df2
col1  col2
   3     3

Check out searchsorted from numpy:
import numpy as np

df = df.sort_values('col1')
df.iloc[np.searchsorted(df.col1.values, [0])]
   col1  col2
2     2     3
df.iloc[np.searchsorted(df.col1.values, [3, 5])]
   col1  col2
0     3     3
1     5     4
Add-on (from the asker): this also makes the if-condition unnecessary.
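Wrapping that idea into a small helper covers both the exact-match case and the "next higher" case. This is only a sketch; the function name next_or_equal is my own and not part of the answer, and it assumes v is not larger than the column's maximum:

import numpy as np
import pandas as pd

def next_or_equal(df, v, col='col1'):
    # Sort so searchsorted can locate the insertion point for v.
    df = df.sort_values(col)
    if (df[col] == v).any():
        # else-branch of the original condition: v exists, keep those rows
        return df[df[col] == v]
    # v is missing: searchsorted gives the position of the next higher value
    pos = np.searchsorted(df[col].values, v)
    return df[df[col] == df[col].values[pos]]

d = {'col1': [3, 5, 2], 'col2': [3, 4, 3]}
df = pd.DataFrame(data=d)
print(next_or_equal(df, 0))   # row with col1 == 2
print(next_or_equal(df, 3))   # row with col1 == 3 (exact match)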

Related

Pandas: Split dataframe with duplicate values into dataframe with unique values

I have a dataframe in Pandas with duplicate values in Col1:
Col1
a
a
b
a
a
b
What I want to do is split this df into different DataFrames with unique Col1 values in each.
DF1:
Col1
a
b
DF2:
Col1
a
b
DF3:
Col1
a
DF4:
Col1
a
Any suggestions?
I don't think you can achieve this in a vectorized way.
One possibility is to use a custom function to iterate the items and keep track of the unique ones. Then use this to split with groupby:
def cum_uniq(s):
    i = 0
    seen = set()
    out = []
    for x in s:
        if x in seen:
            i += 1
            seen = set()
        out.append(i)
        seen.add(x)
    return pd.Series(out, index=s.index)
out = [g for _,g in df.groupby(cum_uniq(df['Col1']))]
output:
[ Col1
0 a,
Col1
1 a
2 b,
Col1
3 a,
Col1
4 a
5 b]
intermediate:
cum_uniq(df['Col1'])
0 0
1 1
2 1
3 2
4 3
5 3
dtype: int64
If order doesn't matter
Let's add a Col2 to the example:
Col1 Col2
0 a 0
1 a 1
2 b 2
3 a 3
4 a 4
5 b 5
The previous code gives:
[ Col1 Col2
0 a 0,
Col1 Col2
1 a 1
2 b 2,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4
5 b 5]
If order does not matter, you can vectorize it:
out = [g for _,g in df.groupby(df.groupby('Col1').cumcount())]
output:
[ Col1 Col2
0 a 0
2 b 2,
Col1 Col2
1 a 1
5 b 5,
Col1 Col2
3 a 3,
Col1 Col2
4 a 4]
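For reference, a minimal setup to reproduce the snippets above, assuming the DataFrame is built from the values shown in the question:

import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'a', 'b', 'a', 'a', 'b'],
                   'Col2': [0, 1, 2, 3, 4, 5]})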

How to enter the value of one index and column into a new cell with +1 in the iteration?

I have the following DataFrame named df1:
   col1  col2  col3
0     5     3    50
1    10     4     3
2     2     0     1
I would like to create a loop that adds a new column called "Total", which takes the value of col1 at index 0 (5) and enters that value under the column "Total" at index 0. The next iteration will take col2 at index 1 (4), and that value will go under column "Total" at index 1. This continues until all columns and rows are completed.
The ideal output will be the following:
df1
   col1  col2  col3  Total
0     5     3    50      5
1    10     4     3      4
2     2     0     1      1
I have the following code but I would like to find a more efficient way of doing this as I have a large DataFrame:
df1.iloc[0,3] = df1.iloc[0,0]
df1.iloc[1,3] = df1.iloc[1,1]
df1.iloc[2,3] = df1.iloc[2,2]
Thank you!
NumPy has a built-in diagonal function:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [5, 10, 2], 'col2': [3, 4, 0], 'col3': [50, 3, 1]})
df['Total'] = np.diag(df)
print(df)
Output
col1 col2 col3 Total
0 5 3 50 5
1 10 4 3 4
2 2 0 1 1
You can try apply on rows:
df['Total'] = df.apply(lambda row: row.iloc[row.name], axis=1)
col1 col2 col3 Total
0 5 3 50 5
1 10 4 3 4
2 2 0 1 1
Hope this logic helps:
length = len(df1["col1"])
total = pd.Series([df1.iloc[i, i % 3] for i in range(length)])
# in i % 3, 3 is the number of columns (col1, col2, col3)
# add this total Series to df1
df1["Total"] = total
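If df1 has many more rows than columns, a vectorized variant of the same modulo idea avoids the Python-level loop. This is a sketch that assumes the wrap-around behaviour of i % 3 above is what is wanted:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': [5, 10, 2], 'col2': [3, 4, 0], 'col3': [50, 3, 1]})

# Row i reads column i % n_cols, i.e. the diagonal with wrap-around.
vals = df1.to_numpy()
rows = np.arange(len(df1))
df1['Total'] = vals[rows, rows % vals.shape[1]]
print(df1)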

How to return the column and row indices of cells with a certain value

index col1 col2 col3
0 0 1 0
1 1 0 1
2 1 1 0
I am stuck on a task: find the locations (indices) of all cells that equal 1.
I was trying to use the following:
column_result = []
row_result = []
for column in df:
    column_result = column_result.append(df.index[df[i] != 0])
for row in df:
    row_result = row_result.append(df.index[df[i] != 0])
My logic is to use loops to traverse the columns and rows separately and concatenate them later.
However, it returns 'NoneType' object has no attribute 'append'.
Would you please help me debug and complete this task?
Use numpy.where to get the positions of the index and columns, then select them to build the idx and cols lists:
import numpy as np

i, c = np.where(df.ne(0))
cols = df.columns[c].tolist()
idx = df.index[i].tolist()
print (idx)
[0, 1, 1, 2, 2]
print (cols)
['col2', 'col1', 'col3', 'col1', 'col2']
Or use DataFrame.stack with filtering to get the final DataFrame:
s = df.stack()
df1 = s[s.ne(0)].rename_axis(['idx','cols']).index.to_frame(index=False)
print (df1)
idx cols
0 0 col2
1 1 col1
2 1 col3
3 2 col1
4 2 col2
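If you need (row, column) pairs rather than two separate lists, the idx and cols lists built by the numpy.where approach above can simply be zipped together (a small follow-up, not part of the original answer):

locations = list(zip(idx, cols))
print(locations)
# [(0, 'col2'), (1, 'col1'), (1, 'col3'), (2, 'col1'), (2, 'col2')]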

Replace values in pandas based on aggregation and condition

I have a dataframe like this:
I want to replace values in col1 with a specific value (e.g. with "b"). I should count the records of each group based on col1 and col2. For example, the count of col1 = a, col2 = t is 3, and the count of col1 = a, col2 = u is 1.
If the count is greater than 2, then replace the value of col1 with 'b'. For this example, I want to replace all "a" values with "b" where col2 = t.
I tried the code below, but it did not change all of the "a" values with this condition.
import pandas as pd
df = pd.read_excel('c:/test.xlsx')
df.loc[df[(df['col1'] == 'a') & (df['col2'] == 't')].agg("count")["ID"] >2, 'col1'] = 'b'
I want this result:
You can use numpy.where and check whether all your conditions are satisfied. If yes, replace the values in col1 with b, and otherwise leave the values as is:
import numpy as np

df['col1'] = np.where((df['col1'] == 'a') &
                      (df['col2'] == 't') &
                      (df.groupby('col1')['ID'].transform('count') > 2), 'b', df['col1'])
prints:
ID col1 col2
0 1 b t
1 2 b t
2 3 b t
3 4 a u
4 5 b t
5 6 b t
6 7 b u
7 8 c t
8 9 c u
9 10 c w
Using transform('count') checks whether the ID column, grouped by col1, has more than 2 values.
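Note that the question describes counting per (col1, col2) pair rather than per col1 alone; if that grouping is what is meant, the same np.where pattern can use a two-column groupby (a sketch, not the answer's code):

df['col1'] = np.where((df['col1'] == 'a') &
                      (df['col2'] == 't') &
                      (df.groupby(['col1', 'col2'])['ID'].transform('count') > 2),
                      'b', df['col1'])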

How can I match two rows in a pyspark dataframe when the value in a column in a row matches the value in another column in another row?

I have a Spark DataFrame like the one below. If the value in col2 is found in other rows in col1, I want to get the values of col3 for those rows as a list in a new column. And I would rather not use a self-join.
input:
col1 col2 col3
A B 1
B C 2
B A 3
output:
col1 col2 col3 col4
A B 1 [2,3]
B C 2 []
B A 3 [1]
You need to create a mapping using groupby and then use merge.
mapper = df.groupby('col1', as_index=False).agg({'col3': list}).rename(columns={'col3':'col4', 'col1': 'col2'})
df.merge(mapper, on='col2', how='left')
Output:
col1 col2 col3 col4
0 A B 1 [2, 3]
1 B C 2 NaN
2 B A 3 [1]
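To get an empty list instead of NaN where there is no match, as in the expected output, the unmatched rows can be filled after the merge (a small follow-up sketch building on the df and mapper defined above):

out = df.merge(mapper, on='col2', how='left')
# Replace NaN (no match in the mapping) with an empty list.
out['col4'] = out['col4'].apply(lambda x: x if isinstance(x, list) else [])
print(out)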
