How to find smallest positive integer in data frame row - python

I have looked everywhere for this answer which must exist. I am trying to find the smallest positive integer per row in a data frame.
Imagine a dataframe
df = pd.DataFrame({'lat':[-120, -90, -100, -100],
                   'long':[20, 21, 19, 18],
                   'dist1':[2, 6, 8, 1],
                   'dist2':[1, 3, 10, 5]})
The following gives me the minimum value per row, but it includes the negatives, i.e. the df['lat'] column.
df.min(axis = 1)
Obviously, I could drop the lat column, or convert to string or something, but I will need it later. The lat column is the only column with negative values. I am trying to return a new column such as
df['min_dist'] = [1,3,8,1]
I hope this all makes sense. Thanks in advance for any help.

In general you can use DataFrame.where to mark negative values as null and exclude them from the min calculation:
df['min_dist'] = df.where(df > 0).min(1)
df
lat long dist1 dist2 min_dist
0 -120 20 2 1 1.0
1 -90 21 6 3 3.0
2 -100 19 8 10 8.0
3 -100 18 1 5 1.0
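Note that the result comes back as float because where() fills the masked cells with NaN. If an integer column is preferred, a small follow-up sketch (safe here because every row has at least one positive value):
df['min_dist'] = df.where(df > 0).min(axis=1).astype(int)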

Filter for just the dist columns and apply the minimum function:
df.assign(min_dist = df.iloc[:, -2:].min(1))
Out[205]:
lat long dist1 dist2 min_dist
0 -120 20 2 1 1
1 -90 21 6 3 3
2 -100 19 8 10 8
3 -100 18 1 5 1
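If the distance columns are not guaranteed to be the last two positionally, they can be selected by name instead. A sketch, assuming all of them share the "dist" prefix:
df['min_dist'] = df.filter(like='dist').min(axis=1)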

Just use:
df['min_dist'] = df[df > 0].min(1)

Related

How to assign conditional value if I want to use pct_change method on some negative values?

I have a dataframe which contains some negative and positive values
I've used the following code to get pct_change on the row values
df_gp1 = df_gp1.pct_change(periods=4, axis=1) * 100
and here I want to assign some specific number, depending on how the values change from negative to positive or vice versa
for example, if the value turns from
positive to negative, return -100
negative to positive, return 100
negative to negative, return -100,
positive to positive, ordinary pct_change
for example my current dataframe could look like the following
DATA   D-4   D-3   D-2   D-1   D-0
A      -20   -15   -13   -10    -5
B      -30   -15   -10    10    25
C       40    25    30    41    30
D       25    25    10    15   -10
I want a new output (dataframe) that gives me the following return:
DATA   D-0
A     -100
B      100
C      -25
D     -100
As you can see, the 4th period must provide the pct_change (i.e. D-0 against D-4), but if it stays negative, return -100.
If it turns from positive to negative, still return -100.
If it turns from negative to positive, return 100.
If it is a change from one positive value to another positive value, then apply pct_change.
My original dataframe is about 4000 rows and 300 columns.
Thus my desired output will have 4000 rows and 296 columns (since it eliminates the columns D-4, D-3, D-2 and D-1).
I tried to build a condition list and a choice list and use the np.select method, but I just don't know how to apply it across the whole dataframe and create a new one that returns the percentage changes.
Any help is deeply appreciated.
Use:
import numpy as np
import pandas as pd

#convert column DATA to index if necessary
df = df.set_index('DATA')
#compare for less than 0
m1 = df.lt(0)
#compare the values shifted by 4 columns for less than 0
m2 = df.shift(4, axis=1).lt(0)
#pass to np.select
arr = np.select([m1, ~m1 & m2, ~m1 & ~m2],
                [-100, 100, df.pct_change(periods=4, axis=1) * 100])
#create DataFrame, remove first 4 columns
df = pd.DataFrame(arr, index=df.index, columns=df.columns).iloc[:, 4:].reset_index()
print(df)
DATA D-0
0 A -100.0
1 B 100.0
2 C -25.0
3 D -100.0
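For reference, the sample frame used above can be rebuilt from the table in the question (a minimal reconstruction, so the snippet is runnable end to end):
import pandas as pd

df = pd.DataFrame({'DATA': ['A', 'B', 'C', 'D'],
                   'D-4': [-20, -30, 40, 25],
                   'D-3': [-15, -15, 25, 25],
                   'D-2': [-13, -10, 30, 10],
                   'D-1': [-10, 10, 41, 15],
                   'D-0': [-5, 25, 30, -10]})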
Given:
D-4 D-3 D-2 D-1 D-0
DATA
A -20 -15 -13 -10 -5
B -30 -15 -10 10 25
C 40 25 30 41 30
D 25 25 10 15 -10
Doing:
def stuff(row):
    if row['D-0'] < 0:
        return -100
    elif row['D-4'] < 0:
        return 100
    else:
        return (row.pct_change(periods=4) * 100)['D-0']

print(df.apply(stuff, axis=1))
Output:
A -100.0
B 100.0
C -25.0
D -100.0
dtype: float64
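If the desired output shape is a single D-0 column keyed by DATA, the Series returned by apply can be wrapped back into a frame (a small follow-up sketch):
result = df.apply(stuff, axis=1).to_frame('D-0')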

How to select the rows with same absolute value in a column

I want to select rows 0, 1, 3, and 4, and any other rows whose values share the same absolute value. Note that we should assume we don't know the values in advance (there could be -25, 25, -2356, 2356, etc.).
test = pd.DataFrame({'id':[1, 2, 3, 4, 5],
                     'quantity':[20, 30, 40, -30, -20]})
id quantity
0 1 20
1 2 30
2 3 40
3 4 -30
4 5 -20
.....
What is the best way of doing this?
IIUC, you want to keep the rows whose value, in absolute form, appears at least twice. You could use groupby on the absolute value:
test[test.groupby(test['quantity'].abs())['quantity'].transform('size').ge(2)]
If you want to ensure that you have both the negative and positive value, make it a set and check that there are 2 elements (the positive and negative):
test[test.groupby(test['quantity'].abs())['quantity'].transform(lambda g: len(set(g))==2)]
output:
id quantity
0 1 20
1 2 30
3 4 -30
4 5 -20
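An equivalent sketch without groupby is to flag absolute values that occur more than once with duplicated (keep=False marks every occurrence). This mirrors the size-based filter above; it does not check that both signs are actually present:
test[test['quantity'].abs().duplicated(keep=False)]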

Replicating data in the same dataframe

I want to replicate the data from the same dataframe when a certain condition is fulfilled.
Dataframe:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
I want to replicate the dataframe when going through a loop and there is a difference greater than 4 in row.hour.
Expected Output:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
2,17
4,20
I want to replicate the rows when, iterating through all the rows, there is a difference greater than 4 in row.hour.
row.hour[0] = 1 and row.hour[1] = 2: here the difference is 1. But between row.hour[2] = 4 and row.hour[3] = 10 the difference is 6, which is greater than 4. I want to replicate the data above the index where this condition (greater than 4) is fulfilled.
I can replicate the data with df = pd.concat([df]*2, ignore_index=False), but it does not replicate when I run it with an if statement.
I tried the code below but nothing is happening.
for i in range(0, len(df)-1):
    if (df.iloc[i,0] - df.iloc[i+1,0]) > 4:
        df = pd.concat([df]*2, ignore_index=False)
My understanding is: you want to compare the 'Hour' values of two successive rows.
If the difference is > 4, you want to add the previous row to the DF.
If that is what you want, try this:
Create a DF:
j = pd.DataFrame({'Hour':[1, 2, 4, 10, 15, 16, 17, 19],
                  'Wage':[15, 17, 20, 25, 26, 30, 40, 15]})
Define a function:
def f1(d):
    dn = d.copy()
    for x in range(len(d)-2):
        if abs(d.iloc[x+1].Hour - d.iloc[x+2].Hour) > 4:
            idx = x + 0.5
            dn.loc[idx] = d.iloc[x]['Hour'], d.iloc[x]['Wage']
    dn = dn.sort_index().reset_index(drop=True)
    return dn
Call the function passing your DF:
nd = f1(j)
Hour Wage
0 1 15
1 2 17
2 2 17
3 4 20
4 4 20
5 10 25
6 15 26
7 16 30
8 17 40
9 19 15
In the line
if df.iloc[i,0] - df.iloc[i+1,0] > 4
you calculate 4-10 instead of 10-4, so you check -6 > 4 instead of 6 > 4.
You have to swap the items:
if df.iloc[i+1,0] - df.iloc[i,0] > 4
or use abs() if you want to replicate in both situations, > 4 and < -4:
if abs(df.iloc[i+1,0] - df.iloc[i,0]) > 4
If you used print(df.iloc[i,0] - df.iloc[i+1,0]) (or a debugger), you would see it.
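If a loop is not needed at all, here is a vectorized sketch of the same idea. The shift offsets are an assumption based on the expected output, where the two rows just before the 4-to-10 jump get appended:
import pandas as pd

j = pd.DataFrame({'Hour':[1, 2, 4, 10, 15, 16, 17, 19],
                  'Wage':[15, 17, 20, 25, 26, 30, 40, 15]})

# Hour difference between the next-but-one row and the next row, aligned to each row
jump = j['Hour'].shift(-2) - j['Hour'].shift(-1)
dup = j[jump > 4]                      # rows to replicate (here Hour 2 and Hour 4)
out = pd.concat([j, dup], ignore_index=True)
print(out)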

How to extract mean and fluctuation by equal index?

I have a CSV file like the below (after sorted the dataframe by iy):
iy,u
1,80
1,90
1,70
1,50
1,60
2,20
2,30
2,35
2,15
2,25
I'm trying to compute the mean and the fluctuation when iy are equal. For example, for the CSV above, what I want is something like this:
iy,u,U,u'
1,80,70,10
1,90,70,20
1,70,70,0
1,50,70,-20
1,60,70,-10
2,20,25,-5
2,30,25,5
2,35,25,10
2,15,25,-10
2,25,25,0
Where U is the average of u when iy are equal, and u' is simply u-U, the fluctuation. I know that there's a function called groupby.mean() in pandas, but I don't want to group the dataframe, just take the mean, put the values in a new column, and then calculate the fluctuation.
How can I proceed?
Use groupby with transform to calculate the mean for each group and assign that value to a new column 'U', then subtract the two columns with plain pandas arithmetic:
df['U'] = df.groupby('iy')['u'].transform('mean')
df["u'"] = df['u'] - df['U']
df
Output:
iy u U u'
0 1 80 70 10
1 1 90 70 20
2 1 70 70 0
3 1 50 70 -20
4 1 60 70 -10
5 2 20 25 -5
6 2 30 25 5
7 2 35 25 10
8 2 15 25 -10
9 2 25 25 0
You could get fancy and do it in one line:
df.assign(U=df.groupby('iy').transform('mean')).eval("u_prime = u-U")
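Since the data comes from a CSV file, a complete sketch might look like this (the file name is hypothetical):
import pandas as pd

df = pd.read_csv('data.csv')                       # columns: iy, u
df['U'] = df.groupby('iy')['u'].transform('mean')  # group mean broadcast to every row
df["u'"] = df['u'] - df['U']                       # fluctuation
print(df)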

How to multiply a specific row in pandas dataframe by a condition

I have a column of 10th marks, but some specific rows are not scaled properly, i.e. they are out of 10. I want to create a function that will help me detect which values are <= 10 and then multiply to 100. I tried creating a function but it failed.
Following is the Column:
data['10th']
0 0
1 0
2 0
3 10.00
4 0
...
2163 0
2164 0
2165 0
2166 76.50
2167 64.60
Name: 10th, Length: 2168, dtype: object
I am not sure what you mean by "multiply to 100", but you should be able to use apply with a lambda similar to this:
df = pd.DataFrame({"a": [1, 3, 5, 23, 76, 43 ,12, 3 ,5]})
df['a'] = df['a'].apply(lambda x: x*100 if x < 10 else x)
print(df)
0 100
1 300
2 500
3 23
4 76
5 43
6 12
7 300
8 500
If I did not understand you correctly, you could adapt the action and the condition in the lambda function to your purpose.
It looks like you need to change the data type first: data["10th"] = pd.to_numeric(data["10th"])
I assume you want to multiply by 10, not 100, to scale it with the other out-of-100 scores. You can try this: np.where(data["10th"]<10, data["10th"]*10, data["10th"])
Assign it back to the dataframe using: data["10th"] = np.where(data["10th"]<10, data["10th"]*10, data["10th"])
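Putting both suggestions together, a combined sketch might be the following. The <= 10 threshold follows the question's wording, data is the question's own dataframe, and errors='coerce' is an assumption in case the object column contains non-numeric entries:
import numpy as np
import pandas as pd

# convert the object column to numbers first (non-numeric entries become NaN)
data['10th'] = pd.to_numeric(data['10th'], errors='coerce')
# values on a 10-point scale get multiplied by 10; the 0 placeholders are unaffected
data['10th'] = np.where(data['10th'] <= 10, data['10th'] * 10, data['10th'])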
