Finding minimum value of a column between two entries in another column - python

I have a data frame with two columns and more than 1000 rows. Column A can take the values X, Y, or None. Column B contains random numbers from 50 to 100.
Each non-None occurrence in column A is considered occurrence4; the previous non-None occurrence in column A is occurrence3, the one before that is occurrence2, and the one before that is occurrence1. I want to find the minimum value of column B between occurrence4 and occurrence3 and check whether it is greater than the minimum value of column B between occurrence2 and occurrence1. The result can be stored in a new column in the data frame as "YES" or "NO".
SAMPLE INPUT
ROWNUM A B
1 None 68
2 None 83
3 X 51
4 None 66
5 None 90
6 Y 81
7 None 81
8 None 100
9 None 83
10 None 78
11 X 68
12 None 53
13 None 83
14 Y 68
15 None 94
16 None 50
17 None 71
18 None 71
19 None 52
20 None 67
21 None 82
22 X 76
23 None 66
24 None 92
For example, I need to find the minimum value of column B between ROWNUM 14 and ROWNUM 11 and check if it is GREATER THAN the minimum value of column B between ROWNUM 6 and ROWNUM 3. Next, I need to find the minimum value between ROWNUM 22 and ROWNUM 14 and check if it is GREATER THAN the minimum value between ROWNUM 11 and ROWNUM 6, and so on.
EDIT:
In the sample data, we start the calculation at row 14, since that is where we have the fourth non-None occurrence in column A. The minimum value between row 14 and row 11 is 53. The minimum value between row 6 and row 3 is 51. Since 53 > 51, the minimum value of column B between occurrence4 and occurrence3 is GREATER THAN the minimum value of column B between occurrence2 and occurrence1, so the output at row 14 would be "YES" or 1.
Next, at row 22, the minimum value between row 22 and row 14 is 50. The minimum value between row 11 and row 6 is 68. Since 50 < 68, the minimum between occurrence4 and occurrence3 is NOT GREATER THAN the minimum between occurrence2 and occurrence1, so the output at row 22 would be "NO" or 0.
I have the following code.
import numpy as np
import pandas as pd
df = pd.DataFrame([[0, 0]]*100, columns=list('AB'), index=range(1, 101))
df.loc[[3, 6, 11, 14, 22, 26, 38, 51, 64, 69, 78, 90, 98], 'A'] = 1
df['B'] = np.random.randint(50, 100, size=len(df))
df['result'] = df.index[df['A'] != 0].to_series().rolling(4).apply(
    lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)
print(df)
This code works when column A contains [0, 1]. But I need code that works when column A contains [None, X, Y]. Also, this code produces output as [0, 1]; I need [YES, NO] instead.

I read your sample data as follows:
df = pd.read_fwf('input.txt', widths=[7, 6, 3], na_values=['None'])
Note na_values=['None'], which ensures that the string None in the input is read as NaN.
This way the DataFrame is:
ROWNUM A B
0 1 NaN 68
1 2 NaN 83
2 3 X 51
3 4 NaN 66
4 5 NaN 90
5 6 Y 81
6 7 NaN 81
7 8 NaN 100
8 9 NaN 83
9 10 NaN 78
10 11 X 68
11 12 NaN 53
12 13 NaN 83
13 14 Y 69
14 15 NaN 94
15 16 NaN 50
16 17 NaN 71
17 18 NaN 71
18 19 NaN 52
19 20 NaN 67
20 21 NaN 82
21 22 X 76
22 23 NaN 66
23 24 NaN 92
The code to do your task is:
res = df.index[df.A.notnull()].to_series().rolling(4).apply(
    lambda x: df.loc[x[2]:x[3], 'B'].min() > df.loc[x[0]:x[1], 'B'].min(), raw=True)\
    .dropna().map(lambda x: 'YES' if x > 0 else 'NO').rename('Result')
df = df.join(res)
df['Result'] = df['Result'].fillna('')
As you can see, it is largely a slight change to your code, with some additions.
The result is:
ROWNUM A B Result
0 1 NaN 68
1 2 NaN 83
2 3 X 51
3 4 NaN 66
4 5 NaN 90
5 6 Y 81
6 7 NaN 81
7 8 NaN 100
8 9 NaN 83
9 10 NaN 78
10 11 X 68
11 12 NaN 53
12 13 NaN 83
13 14 Y 69 YES
14 15 NaN 94
15 16 NaN 50
16 17 NaN 71
17 18 NaN 71
18 19 NaN 52
19 20 NaN 67
20 21 NaN 82
21 22 X 76 NO
22 23 NaN 66
23 24 NaN 92
The advantages of my solution over the other answer are that:
the content is either YES or NO, just as you want,
and this content shows up only for non-null values in the A column,
ignoring the first 3, which don't have enough "predecessors".

Here's my approach:
def is_incr(x):
    return x[:2].min() > x[2:].min()
# replace with s = df['A'] == 'None' if needed
s = df['A'].isna()
df['new_col'] = df.loc[s, 'B'].rolling(4).apply(is_incr)
Output:
ROWNUM A B new_col
0 1 NaN 68 NaN
1 2 NaN 83 NaN
2 3 X 51 NaN
3 4 NaN 66 NaN
4 5 NaN 90 1.0
5 6 Y 81 NaN
6 7 NaN 81 0.0
7 8 NaN 100 0.0
8 9 NaN 83 0.0
9 10 NaN 78 1.0
10 11 X 68 NaN
11 12 NaN 53 1.0
12 13 NaN 83 1.0
13 14 Y 68 NaN
14 15 NaN 94 0.0
15 16 NaN 50 1.0
16 17 NaN 71 1.0
17 18 NaN 71 0.0
18 19 NaN 52 0.0
19 20 NaN 67 1.0
20 21 NaN 82 0.0
21 22 X 76 NaN
22 23 NaN 66 0.0
23 24 NaN 92 1.0
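If you need the YES/NO labels the question asks for instead of 1.0/0.0, a small follow-up sketch (assuming the new_col produced above) maps the flags and blanks out the NaN rows:
df['new_col'] = df['new_col'].map({1.0: 'YES', 0.0: 'NO'}).fillna('')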

Related

Sum up previous rows upto 3 and multiply with value from another column using pandas

I have 2 dataframes. For every row, I want the weighted sum of the previous 3 rows within its unique_id group, where each previous value is multiplied by the corresponding weight from the other dataframe.
For example:
dataframe A
unique_id value out_value
1 1 45
2 1 33
3 1 18
4 1 26 20.70
5 2 66
6 2 44
7 2 22
8 2 19 28.38
dataframe B
num_values
0.15
0.30
0.18
expected out_value column:
4th row = 18*0.15 + 33*0.30 + 45*0.18 = 2.7 + 9.9 + 8.1 = 20.7
8th row = 22*0.15 + 44*0.30 + 66*0.18 = 3.3 + 13.2 + 11.88 = 28.38
Based on unique_id, each value should be calculated from the previous 3 values of its own group, and for every such row the previous 3 rows are available.
import pandas as pd
import numpy as np
df_a = pd.DataFrame({
    'uni_id': [1, 1, 1, 1, 2, 2, 2, 2, 152, 152, 152, 152, 152],
    'value': [45, 33, 18, 26, 66, 44, 22, 19, 36, 27, 45, 81, 90]
}, index=range(1, 14))
df_b = pd.DataFrame({
    'num_values': [0.15, 0.30, 0.18]
})
df_a
###
uni_id value
1 1 45
2 1 33
3 1 18
4 1 26
5 2 66
6 2 44
7 2 22
8 2 19
9 152 36
10 152 27
11 152 45
12 152 81
13 152 90
df_b
###
num_values
0 0.15
1 0.30
2 0.18
# main calculation: for each row, gather the previous 3 values, most recent first
arr = [df_a['value'].shift(x+1).values[::-1][:3] for x in range(len(df_a['value']))[::-1]]
# dot product of each 3-value window with the weights in df_b
arr_b = pd.Series(np.inner(arr, df_b['num_values']))
# filter and clean
mask = df_a.groupby('uni_id').cumcount()+1 > 3
output = arr_b * mask
output[output == 0] = np.nan
# concat result to df_a
df_a['out_value'] = output
df_a
###
uni_id value out_value
1 1 45 NaN
2 1 33 NaN
3 1 18 NaN
4 1 26 20.70
5 2 66 NaN
6 2 44 NaN
7 2 22 NaN
8 2 19 28.38
9 152 36 NaN
10 152 27 NaN
11 152 45 NaN
12 152 81 21.33
13 152 90 30.51
If you want to keep only the rows with a non-null out_value, filter with:
df_a.query('out_value.notnull()')
###
uni_id value out_value
4 1 26 20.70
8 2 19 28.38
12 152 81 21.33
13 152 90 30.51
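As a side note, since each out_value is just a dot product of the previous 3 values with fixed weights, a rolling-window formulation should give the same result (a sketch, assuming the df_a and df_b above; the weights are reversed so that the oldest value in each window meets 0.18):
weights = df_b['num_values'].to_numpy()[::-1]  # chronological order: oldest weight first
df_a['out_value'] = df_a.groupby('uni_id')['value'].transform(
    lambda s: s.shift(1).rolling(3).apply(lambda w: w @ weights, raw=True))
Grouping first also guarantees that a window never crosses a uni_id boundary, so no separate mask step is needed.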
Grouping on uni_id and Year_Month
Data preparation:
# create a date range series with 5-day frequency, converted to monthly periods
import pandas as pd
import numpy as np
rng = np.random.default_rng(42)
rng.integers(10, 100, 26)  # note: this throwaway call advances the RNG state, so the values below depend on it
date_range = pd.Series(pd.date_range(start='01.30.2020', periods=27, freq='5D')).dt.to_period('M')
df_a = pd.DataFrame({
    'uni_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 152, 152, 152, 152, 152, 152, 152, 152, 152, 152],
    'Year_Month': date_range,
    'value': rng.integers(10, 100, 26)
}, index=range(1, 27))
df_b = pd.DataFrame({
    'num_values': [0.15, 0.30, 0.18]
})
df_a
###
uni_id Year_Month value
1 1 2020-02 46
2 1 2020-02 84
3 1 2020-02 59
4 1 2020-02 49
5 1 2020-02 50
6 1 2020-02 30
7 1 2020-03 18
8 1 2020-03 59
9 2 2020-03 89
10 2 2020-03 15
11 2 2020-03 87
12 2 2020-03 84
13 2 2020-04 34
14 2 2020-04 66
15 2 2020-04 24
16 2 2020-04 78
17 152 2020-04 73
18 152 2020-04 41
19 152 2020-05 16
20 152 2020-05 97
21 152 2020-05 50
22 152 2020-05 90
23 152 2020-05 71
24 152 2020-05 80
25 152 2020-06 78
26 152 2020-06 27
Processing:
# same windowed dot product as above
arr = [df_a['value'].shift(x+1).values[::-1][:3] for x in range(len(df_a['value']))[::-1]]
arr_b = pd.Series(np.inner(arr, df_b['num_values']))
# filter and clean
mask = df_a.groupby(['uni_id','Year_Month']).cumcount()+1 > 3
output = arr_b * mask
output[output == 0] = np.nan
# concat result to df_a
df_a['out_value'] = output
df_a
###
uni_id Year_Month value out_value
1 1 2020-02 46 NaN
2 1 2020-02 84 NaN
3 1 2020-02 59 NaN
4 1 2020-02 49 40.17
5 1 2020-02 50 32.82
6 1 2020-02 30 28.32
7 1 2020-03 18 NaN
8 1 2020-03 59 NaN
9 2 2020-03 89 NaN
10 2 2020-03 15 NaN
11 2 2020-03 87 NaN
12 2 2020-03 84 41.4
13 2 2020-04 34 NaN
14 2 2020-04 66 NaN
15 2 2020-04 24 NaN
16 2 2020-04 78 30.78
17 152 2020-04 73 NaN
18 152 2020-04 41 NaN
19 152 2020-05 16 NaN
20 152 2020-05 97 NaN
21 152 2020-05 50 NaN
22 152 2020-05 90 45.96
23 152 2020-05 71 46.65
24 152 2020-05 80 49.5
25 152 2020-06 78 NaN
26 152 2020-06 27 NaN

Pandas: Reshaping Long Data to Wide with duplicated columns

I need to pivot a long pandas dataframe to wide. The issue is that for some id there are multiple values for the same parameter, and some parameters are present in only a few ids.
df = pd.DataFrame({'indx': [11, 11, 11, 11, 12, 12, 12, 13, 13, 13, 13],
                   'param': ['a', 'b', 'b', 'c', 'a', 'b', 'd', 'a', 'b', 'c', 'c'],
                   'value': [100, 54, 65, 65, 789, 24, 98, 24, 27, 75, 35]})
indx param value
11 a 100
11 b 54
11 b 65
11 c 65
12 a 789
12 b 24
12 d 98
13 a 24
13 b 27
13 c 75
13 c 35
Want to receive something like this:
indx a b c d
11 100 `54,65` 65 None
12 789 None 98 24
13 24 27 `75,35` None
or
indx a b b1 c c1 d
11 100 54 65 65 None None
12 789 None None 98 None 24
13 24 27 None 75 35 None
So, obviously, a direct df.pivot() is not a solution.
Thanks for any help.
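For context, the reason a direct pivot fails is exactly the duplicated (indx, param) pairs; a quick check:
df.pivot(index='indx', columns='param', values='value')
# ValueError: Index contains duplicate entries, cannot reshape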
Option 1:
df.astype(str).groupby(['indx', 'param'])['value'].agg(','.join).unstack()
Output:
param a b c d
indx
11 100 54,65 65 NaN
12 789 24 NaN 98
13 24 27 75,35 NaN
Option 2:
df_out = df.set_index(['indx', 'param', df.groupby(['indx','param']).cumcount()])['value'].unstack([1,2])
df_out.columns = [f'{i}_{j}' if j != 0 else f'{i}' for i, j in df_out.columns]
df_out.reset_index()
Output:
indx a b b_1 c d c_1
0 11 100.0 54.0 65.0 65.0 NaN NaN
1 12 789.0 24.0 NaN NaN 98.0 NaN
2 13 24.0 27.0 NaN 75.0 NaN 35.0
OK, found a solution: there is a method df.pivot_table for such cases, which allows different aggregation functions:
df.pivot_table(index='indx', columns='param', values='value', aggfunc=lambda x: ','.join(x.astype(str)))
indx a b c d
11 100 54,65 65 NaN
12 789 24 NaN 98
13 24 27 75,35 NaN
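If you would rather keep the duplicated values as Python lists instead of comma-joined strings, a variant of Option 1 above (a sketch on the same df):
df.groupby(['indx', 'param'])['value'].agg(list).unstack()
This yields e.g. [54, 65] in column b for indx 11, and NaN where a parameter is missing.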

Create new Pandas columns using the value from previous row

I need to create two new Pandas columns using the logic and value from the previous row.
I have the following data:
Day Vol Price Income Outgoing
1 499 75
2 3233 90
3 1812 70
4 2407 97
5 3474 82
6 1057 53
7 2031 68
8 304 78
9 1339 62
10 2847 57
11 3767 93
12 1096 83
13 3899 88
14 4090 63
15 3249 52
16 1478 52
17 4926 75
18 1209 52
19 1982 90
20 4499 93
My challenge is to come up with logic where both the Income and Outgoing columns (which are currently empty) get the value of (Vol * Price).
The Income column should carry this value when the previous day's Price is lower than the present day's. The Outgoing column should carry this value when the previous day's Price is higher than the present day's. The rest of the Income and Outgoing cells should just be NaN. If the Price is unchanged, that day's row is to be dropped.
The entire logic should start at day n + 1: the first row is skipped and the logic applies from row 2 onwards.
I have tried using shift in my code, for example:
if sample_data['Price'].shift(1) < sample_data['Price'].shift(2):
    sample_data['Income'] = sample_data['Vol'] * sample_data['Price']
else:
    sample_data['Outgoing'] = sample_data['Vol'] * sample_data['Price']
But it isn't working.
I feel there would be a simpler and comprehensive tactic to go about this, could someone please help ?
Update: in the final output, day 16's row is deleted, because days 15 and 16 have the same price.
I'd calculate the product and the mask separately, and then update the cols:
In [11]: vol_price = df["Vol"] * df["Price"]
In [12]: incoming = df["Price"].diff() < 0
In [13]: df.loc[incoming, "Income"] = vol_price
In [14]: df.loc[~incoming, "Outgoing"] = vol_price
In [15]: df
Out[15]:
Day Vol Price Income Outgoing
0 1 499 75 NaN 37425.0
1 2 3233 90 NaN 290970.0
2 3 1812 70 126840.0 NaN
3 4 2407 97 NaN 233479.0
4 5 3474 82 284868.0 NaN
5 6 1057 53 56021.0 NaN
6 7 2031 68 NaN 138108.0
7 8 304 78 NaN 23712.0
8 9 1339 62 83018.0 NaN
9 10 2847 57 162279.0 NaN
10 11 3767 93 NaN 350331.0
11 12 1096 83 90968.0 NaN
12 13 3899 88 NaN 343112.0
13 14 4090 63 257670.0 NaN
14 15 3249 52 168948.0 NaN
15 16 1478 52 NaN 76856.0
16 17 4926 75 NaN 369450.0
17 18 1209 52 62868.0 NaN
18 19 1982 90 NaN 178380.0
19 20 4499 93 NaN 418407.0
Or perhaps it's the other way around:
In [21]: incoming = df["Price"].diff() > 0
In [22]: df.loc[incoming, "Income"] = vol_price
In [23]: df.loc[~incoming, "Outgoing"] = vol_price
In [24]: df
Out[24]:
Day Vol Price Income Outgoing
0 1 499 75 NaN 37425.0
1 2 3233 90 290970.0 NaN
2 3 1812 70 NaN 126840.0
3 4 2407 97 233479.0 NaN
4 5 3474 82 NaN 284868.0
5 6 1057 53 NaN 56021.0
6 7 2031 68 138108.0 NaN
7 8 304 78 23712.0 NaN
8 9 1339 62 NaN 83018.0
9 10 2847 57 NaN 162279.0
10 11 3767 93 350331.0 NaN
11 12 1096 83 NaN 90968.0
12 13 3899 88 343112.0 NaN
13 14 4090 63 NaN 257670.0
14 15 3249 52 NaN 168948.0
15 16 1478 52 NaN 76856.0
16 17 4926 75 369450.0 NaN
17 18 1209 52 NaN 62868.0
18 19 1982 90 178380.0 NaN
19 20 4499 93 418407.0 NaN
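Note that neither variant drops the day whose price is unchanged from the previous day, which the question also asks for. A minimal sketch for that step (assuming the same df; Price.diff() is NaN on the first row, so that row is kept):
unchanged = df["Price"].diff() == 0
df = df[~unchanged]
Running this before computing Income/Outgoing removes day 16 from the sample data.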

How to rank data, but give the same ranking for data that is equal

I have a csv of daily maximum temperatures. I am trying to assign a "rank" for my data. I first sorted my daily maximum temperature from lowest to highest. I then created a new column called rank.
# Sort data smallest to largest
ValidFullData_Sorted = ValidFullData.sort_values(by="TMAX")
# count total obs
n = ValidFullData_Sorted.shape[0]
# add a numbered column 1 -> n to use in return calculation for rank
ValidFullData_Sorted.insert(0, 'rank', range(1, 1 + n))
How can I make the rank the same for values of daily maximum temperature that are the same? (i.e. every time the daily maximum temperature reaches 95° the rank for each of those instances should be the same)
Here is some sample data (it's daily temperature data, so it's thousands of lines long):
Date TMAX TMIN
1/1/00 22 11
1/2/00 26 12
1/3/00 29 14
1/4/00 42 7
1/5/00 42 21
And I want to add a TMAXrank column that would look like this:
Date TMAX TMIN TMAXRank
1/1/00 22 11 4
1/2/00 26 12 3
1/3/00 29 14 2
1/4/00 42 7 1
1/5/00 42 21 1
ValidFullData['TMAXRank'] = ValidFullData[ValidFullData['TMAX'] < 95]['TMAX'].rank(ascending=False, method='dense')
Output:
Unnamed: 0 TMAX TMIN TMAXRank
17 17 88 14 1.0
16 16 76 12 2.0
15 15 72 11 3.0
14 14 64 21 4.0
8 8 62 7 5.0
7 7 58 14 6.0
13 13 58 7 6.0
18 18 55 7 7.0
3 3 42 7 8.0
4 4 42 21 8.0
6 6 41 12 9.0
12 12 37 14 10.0
5 5 36 11 11.0
2 2 29 14 12.0
1 1 26 12 13.0
0 0 22 11 14.0
9 9 98 21 NaN
10 10 112 11 NaN
11 11 98 12 NaN
19 19 95 21 NaN
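The key here is rank's method argument: method='dense' gives tied values the same rank and leaves no gaps afterwards. A small sketch on the question's sample data reproduces the desired TMAXRank column:
import pandas as pd
df = pd.DataFrame({'Date': ['1/1/00', '1/2/00', '1/3/00', '1/4/00', '1/5/00'],
                   'TMAX': [22, 26, 29, 42, 42],
                   'TMIN': [11, 12, 14, 7, 21]})
# both days with TMAX 42 get rank 1; method='min' would also tie them but then skip rank 2
df['TMAXRank'] = df['TMAX'].rank(ascending=False, method='dense').astype(int)
print(df)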

pandas drop row below each row containing an 'na'

I have a dataframe with, say, 4 columns [['a','b','c','d']], to which I add another column ['total'] containing the sum of all the other columns for each row. I then add another column ['growth of total'] with the growth rate of the total.
Some of the values in [['a','b','c','d']] are blank, rendering the ['total'] column invalid for those rows. I can easily get rid of these rows with df.dropna(how='any').
However, my growth rate will be invalid not only for rows with missing values in [['a','b','c','d']], but also for the following row. How do I drop all of these rows?
IIUC you can use notnull with all to mask off any rows with NaN and any rows that follow NaN rows:
In [43]:
df = pd.DataFrame({'a':[0,np.NaN, 2, 3,np.NaN], 'b':[np.NaN, 1,2,3,4], 'c':[0, np.NaN,2,3,4]})
df
Out[43]:
a b c
0 0 NaN 0
1 NaN 1 NaN
2 2 2 2
3 3 3 3
4 NaN 4 4
In [44]:
df[df.notnull().all(axis=1) & df.shift().notnull().all(axis=1)]
Out[44]:
a b c
3 3 3 3
Here's one option that I think does what you're looking for:
In [76]: df = pd.DataFrame(np.arange(40).reshape(10,4))
In [77]: df.loc[1, 2] = np.nan
In [78]: df.loc[6, 1] = np.nan
In [79]: df['total'] = df.sum(axis=1, skipna=False)
In [80]: df
Out[80]:
0 1 2 3 total
0 0 1 2 3 6
1 4 5 NaN 7 NaN
2 8 9 10 11 38
3 12 13 14 15 54
4 16 17 18 19 70
5 20 21 22 23 86
6 24 NaN 26 27 NaN
7 28 29 30 31 118
8 32 33 34 35 134
9 36 37 38 39 150
In [81]: df['growth'] = df['total'].iloc[1:] - df['total'].values[:-1]
In [82]: df
Out[82]:
0 1 2 3 total growth
0 0 1 2 3 6 NaN
1 4 5 NaN 7 NaN NaN
2 8 9 10 11 38 NaN
3 12 13 14 15 54 16
4 16 17 18 19 70 16
5 20 21 22 23 86 16
6 24 NaN 26 27 NaN NaN
7 28 29 30 31 118 NaN
8 32 33 34 35 134 16
9 36 37 38 39 150 16
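For reference, the same "drop NaN rows and the row after them" idea can be written as a single boolean mask (a sketch, assuming the value columns are ['a', 'b', 'c', 'd'] as in the question):
bad = df[['a', 'b', 'c', 'd']].isna().any(axis=1)    # rows with any missing value
df_clean = df[~(bad | bad.shift(fill_value=False))]  # also drop the row right after each bad row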
