Identify a code by quantity intervals in a pandas DataFrame - python

Given the following DataFrame in pandas:
avg_time_1  avg_time_2  avg_time_3
1200        34          1
90          45          3600
0           4           1
0           4           50
80          4           60
82          40          65
I want to derive a new DataFrame from this one that assigns one of the following codes to each row, depending on the values in the three time columns (avg_time_1, avg_time_2, avg_time_3):
CODE-1: All values are less than 5.
CODE-2: Some value is between 5 and 100.
CODE-3: All values are between 5 and 100.
CODE-4: Some value is higher than 1000.
Applying the function, we will obtain the following DataFrame.
avg_time_1  avg_time_2  avg_time_3  codes
1200        34          1           4
90          45          3600        4
0           4           1           1
0           4           50          2
80          4           60          2
82          40          65          3
Thank you for your response in advance.

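You can try np.select. Conditions are checked in order and the first match wins, so put the higher-priority condition first.
import numpy as np
# axis is passed by keyword for compatibility with recent pandas versions
df['codes'] = np.select(
    [df.lt(5).all(axis=1), df.gt(1000).any(axis=1),
     df.apply(lambda col: col.between(5, 100)).all(axis=1),
     df.apply(lambda col: col.between(5, 100)).any(axis=1)],
    [1, 4, 3, 2],
    default=0
)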
print(df)
avg_time_1 avg_time_2 avg_time_3 codes
0 1200 34 1 4
1 90 45 3600 4
2 0 4 1 1
3 0 4 50 2
4 80 4 60 2
5 82 40 65 3
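For reference, here is a self-contained sketch of the same np.select approach that rebuilds the sample data from the question. The names time_cols, times and in_range are illustrative and not part of the original answer. The conditions are computed on the three time columns only, so the snippet can be re-run after the codes column exists.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "avg_time_1": [1200, 90, 0, 0, 80, 82],
    "avg_time_2": [34, 45, 4, 4, 4, 40],
    "avg_time_3": [1, 3600, 1, 50, 60, 65],
})

# Hypothetical helper names, introduced only for this sketch
time_cols = ["avg_time_1", "avg_time_2", "avg_time_3"]
times = df[time_cols]
in_range = times.ge(5) & times.le(100)  # 5 <= value <= 100, bounds inclusive

# np.select uses the first condition that is True for each row,
# so the order of this list encodes the priority of the codes.
conditions = [
    times.lt(5).all(axis=1),     # CODE-1
    times.gt(1000).any(axis=1),  # CODE-4
    in_range.all(axis=1),        # CODE-3
    in_range.any(axis=1),        # CODE-2
]
df["codes"] = np.select(conditions, [1, 4, 3, 2], default=0)
print(df)
```

The first row (1200, 34, 1) satisfies both "some value is higher than 1000" and "some value is between 5 and 100"; it receives code 4 only because that condition is listed earlier.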

Related

How to create new column with all the values from another column starting from 2nd row in dataframe? [duplicate]

I want to create a new column that contains the values of another column starting from the second row. This means that the last value of the new column will be NaN.
Dataset example.
A B C
10 20 30
40 50 60
70 80 90
100 110 120
New column needed as:
A B C D
10 20 30 50
40 50 60 80
70 80 90 110
100 110 120 NaN
The column D needs values to be extracted from column B starting from 2nd row.
I tried the code as:
df['D'] = [df['B'][i] for i in range(0, len(df))]
This obviously just reproduces B; I could not work out how to take the rows from 1 to len(df) instead.
Simply use shift(-1):
>>> df['D'] = df['B'].shift(-1)
>>> df
A B C D
0 10 20 30 50.0
1 40 50 60 80.0
2 70 80 90 110.0
3 100 110 120 NaN
Use shift:
df['D'] = df['B'].shift(-1)
output:
A B C D
0 10 20 30 50.0
1 40 50 60 80.0
2 70 80 90 110.0
3 100 110 120 NaN
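Both answers produce a float column because NaN cannot be stored in a plain integer column. If integer values are preferred, one option is to convert the shifted column to the nullable "Int64" dtype. This is a small sketch on the question's sample data, not something either answer proposed.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "A": [10, 40, 70, 100],
    "B": [20, 50, 80, 110],
    "C": [30, 60, 90, 120],
})

# shift(-1) moves every value up by one row; the last row becomes missing.
# Casting to the nullable Int64 dtype keeps the values as integers.
df["D"] = df["B"].shift(-1).astype("Int64")
print(df)
```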

Pandas conditional lookup based on columns from a different dataframe

I have searched but found no answers for my problem. My first dataframe looks like:
df1
Item Value
1 23
2 3
3 45
4 65
5 17
6 6
7 18
… …
500 78
501 98
and the second lookup table looks like
df2
L1 H1 L2 H2 L3 H3 L4 H4 L5 H5 Name
1 3 5 6 11 78 86 88 90 90 A
4 4 7 10 79 85 91 99 110 120 B
89 89 91 109 0 0 0 0 0 0 C
...
What I am trying to do is to get Name from df2 to df1 when Item in df1 falls between the Low (L) and High (H) columns. Something (which does not work) like:
df1[Name]=np.where((df1['Item']>=df2['L1'] & df1['Item']<=df2['H1'])|
(df1['Item']>=df2['L2'] & df1['Item']<=df2['H2']) |
(df1['Item']>=df2['L3'] & df1['Item']<=df2['H3']) |
(df1['Item']>=df2['L4'] & df1['Item']<=df2['H4']) |
(df1['Item']>=df2['L5'] & df1['Item']<=df2['H5']) |
(df1['Item']>=df2['L6'] & df1['Item']<=df2['H6']), df2['Name'], "Other")
So that the result would be like:
Item Value Name
1 23 A
2 3 A
3 45 A
4 65 B
5 17 A
6 6 A
7 18 A
… … …
500 78 K
501 98 Other
If you have any guidance for my problem to share, I would much appreciate it! Thank you in advance!
Try:
1. Transform df2 using wide_to_long.
2. Create lists of numbers from "L" to "H" for each row using apply and range.
3. explode to have one value in each row.
4. map each "Item" in df1 using a dict created from the ranges with the structure {value: name}.
ranges = pd.wide_to_long(df2, ["L","H"], i="Name", j="Subset")
ranges["values"] = ranges.apply(lambda x: list(range(x["L"], x["H"]+1)), axis=1)
ranges = ranges.explode("values").reset_index()
df1["Name"] = df1["Item"].map(dict(zip(ranges["values"], ranges["Name"])))
>>> df1
Item Value Name
0 1 23 A
1 2 3 A
2 3 45 A
3 4 65 B
4 5 17 A
5 6 6 A
6 7 18 B
7 500 78 NaN
8 501 98 NaN
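The question asks for "Other" when no range matches, whereas the result above leaves NaN. A minimal sketch of one way to get there, using only the three df2 rows shown in the question (the rows hidden behind "..." are not reconstructed, so Item 500 also ends up as "Other" here). The lookup dict is built with a plain loop instead of apply and explode; that is a variation for illustration, not the answer's exact code.

```python
import pandas as pd

# Only the rows visible in the question; elided rows are left out
df1 = pd.DataFrame({
    "Item": [1, 2, 3, 4, 5, 6, 7, 500, 501],
    "Value": [23, 3, 45, 65, 17, 6, 18, 78, 98],
})
df2 = pd.DataFrame({
    "L1": [1, 4, 89], "H1": [3, 4, 89],
    "L2": [5, 7, 91], "H2": [6, 10, 109],
    "L3": [11, 79, 0], "H3": [78, 85, 0],
    "L4": [86, 91, 0], "H4": [88, 99, 0],
    "L5": [90, 110, 0], "H5": [90, 120, 0],
    "Name": ["A", "B", "C"],
})

# One row per (Name, Subset) pair with columns L and H
ranges = pd.wide_to_long(df2, ["L", "H"], i="Name", j="Subset").reset_index()

# Map every integer inside a range to its Name.
# If ranges overlap, a later entry overwrites an earlier one.
lookup = {}
for name, low, high in zip(ranges["Name"], ranges["L"], ranges["H"]):
    for value in range(low, high + 1):
        lookup[value] = name

# Items without a matching range come back as NaN and are replaced by "Other"
df1["Name"] = df1["Item"].map(lookup).fillna("Other")
print(df1)
```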
A possibly faster option (benchmarks would have to confirm or refute this) is conditional_join from pyjanitor, which uses binary search under the hood:
# pip install pyjanitor
import pandas as pd
import janitor
temp = (pd.wide_to_long(df2,
                        stubnames=['L', 'H'],
                        i='Name',
                        j='Num')
          .reset_index('Name')
       )
# the `Num` index is sorted already
(df1.conditional_join(
        temp,
        # left column, right column, join operator
        ('Item', 'L', '>='),
        ('Item', 'H', '<='),
        how='left')
    .loc[:, ['Item', 'Value', 'Name']]
)
Item Value Name
0 1 23 A
1 2 3 A
2 3 45 A
3 4 65 B
4 5 17 A
5 6 6 A
6 7 18 B
7 500 78 NaN
8 501 98 NaN

Pandas Python highest 2 rows of every 3 and tabling the results

Suppose I have the following dataframe:
. Column1 Column2
0 25 1
1 89 2
2 59 3
3 78 10
4 99 20
5 38 30
6 89 100
7 57 200
8 87 300
I'm not sure whether what I want to do is possible. I want to compare every three rows of Column1, take the highest two of the three, and assign the corresponding two Column2 values to a new column. It does not matter whether the values in Column3 are joined or how they are ordered, because I know that every two values of Column3 belong to one group of three rows of Column1.
. Column1 Column2 Column3
0 25 1 2
1 89 2 3
2 59 3
3 78 10 20
4 99 20 10
5 38 30
6 89 100 100
7 57 200 300
8 87 300
You can use np.arange with np.repeat to create a grouping array that groups every 3 rows.
Then use GroupBy.nlargest, extract the indices of those values with pd.Index.get_level_values, and assign the matching Column2 values to Column3; pandas handles the index alignment.
n_grps = len(df) // 3  # assumes len(df) is a multiple of 3
g = np.repeat(np.arange(n_grps), 3)
idx = df.groupby(g)['Column1'].nlargest(2).index.get_level_values(1)
vals = df.loc[idx, 'Column2']
vals
# 1 2
# 2 3
# 4 20
# 3 10
# 6 100
# 8 300
# Name: Column2, dtype: int64
df['Column3'] = vals
df
Column1 Column2 Column3
0 25 1 NaN
1 89 2 2.0
2 59 3 3.0
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 NaN
8 87 300 300.0
To get the output shown in the question, add one more step that sorts Column3 within each group so that NaN is pushed to the end:
df['Column3'] = df.groupby(g)['Column3'].apply(lambda x:x.sort_values()).values
Column1 Column2 Column3
0 25 1 2.0
1 89 2 3.0
2 59 3 NaN
3 78 10 10.0
4 99 20 20.0
5 38 30 NaN
6 89 100 100.0
7 57 200 300.0
8 87 300 NaN
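As an alternative sketch, and not the method used in the answer above, the same selection can be written with GroupBy.rank and Series.where, which avoids the MultiIndex step. The grouping key np.arange(len(df)) // 3 is an assumption of this sketch; it also works when the number of rows is not a multiple of 3.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Column1": [25, 89, 59, 78, 99, 38, 89, 57, 87],
    "Column2": [1, 2, 3, 10, 20, 30, 100, 200, 300],
})

# 0,0,0,1,1,1,2,2,2: one label per block of three rows
g = np.arange(len(df)) // 3

# Rank Column1 within each block, largest value first; ties keep row order
rank = df.groupby(g)["Column1"].rank(method="first", ascending=False)

# Keep Column2 only where Column1 is among the two largest of its block
df["Column3"] = df["Column2"].where(rank <= 2)
print(df)
```

Like the first version of the answer's result, this leaves NaN in the row of each block that is not selected; the sorting step is still needed to push the NaN values to the end of each block.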

Checking if values of a row are consecutive

I have a df like this:
1 2 3 4 5 6
0 5 10 12 35 70 80
1 10 11 23 40 42 47
2 5 26 27 38 60 65
Where all the values in each row are different and have an increasing order.
I would like to create a new column with 1 or 0 if there are at least 2 consecutive numbers.
For example the second and third row have 10 and 11, and 26 and 27. Is there a more pythonic way than using an iterator?
Thanks
Use DataFrame.diff to take the difference along each row, compare with 1, check whether at least one value per row is True, and finally cast to integers:
df['check'] = df.diff(axis=1).eq(1).any(axis=1).astype(int)
print (df)
1 2 3 4 5 6 check
0 5 10 12 35 70 80 0
1 10 11 23 40 42 47 1
2 5 26 27 38 60 65 1
To improve performance, use numpy:
arr = df.values
df['check'] = np.any(((arr[:, 1:] - arr[:, :-1]) == 1), axis=1).astype(int)
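A self-contained sketch that runs both variants on the question's data and confirms they agree. It uses np.diff in place of the manual slicing, which computes the same row-wise differences, and it takes the array before the check column is added so that the new column does not enter the comparison. The names check_pandas and check_numpy are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    [[5, 10, 12, 35, 70, 80],
     [10, 11, 23, 40, 42, 47],
     [5, 26, 27, 38, 60, 65]],
    columns=[1, 2, 3, 4, 5, 6],
)

# Take the raw values first, before any extra column is added
arr = df.to_numpy()

# pandas version: difference to the left neighbour in each row
check_pandas = df.diff(axis=1).eq(1).any(axis=1).astype(int)

# numpy version: np.diff(arr, axis=1) equals arr[:, 1:] - arr[:, :-1]
check_numpy = np.any(np.diff(arr, axis=1) == 1, axis=1).astype(int)

df["check"] = check_numpy
print(df)
```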

Comparing two consecutive rows and creating a new column based on a specific logical operation

I have a data frame with two columns
df = ['xPos', 'lineNum']
import pandas as pd
data = '''\
xPos lineNum
40 1
50 1
75 1
90 1
42 2
75 2
110 2
45 3
70 3
95 3
125 3
38 4
56 4
74 4'''
I have created the aggregate data frame for this with the command
aggrDF = df.describe(include='all')
and I am interested in the minimum of the xPos values, which I get with
minxPos = aggrDF.loc['min', 'xPos']
Desired output
data = '''\
xPos lineNum xDiff
40 1 2
50 1 10
75 1 25
90 1 15
42 2 4
75 2 33
110 2 35
45 3 7
70 3 25
95 3 25
125 3 30
38 4 0
56 4 18
74 4 18'''
The logic
I want to compare each pair of consecutive rows of the data frame and calculate a new column based on this logic:
if( df['lineNum'] != df['lineNum'].shift(1) ):
    df['xDiff'] = df['xPos'] - minxPos
else:
    df['xDiff'] = df['xPos'] - df['xPos'].shift(1)
Essentially, I want the new column to have the difference of the two consecutive rows in the df, as long as the line number is the same.
If the line number changes, then, the xDiff column should have the difference with the minimum xPos value that I have from the aggregate data frame.
Can you please help? thanks,
These two lines should do it:
df['xDiff'] = df.groupby('lineNum').diff()['xPos']
df.loc[df['xDiff'].isnull(), 'xDiff'] = df['xPos'] - minxPos
>>> df
xPos lineNum xDiff
0 40 1 2.0
1 50 1 10.0
2 75 1 25.0
3 90 1 15.0
4 42 2 4.0
5 75 2 33.0
6 110 2 35.0
7 45 3 7.0
8 70 3 25.0
9 95 3 25.0
10 125 3 30.0
11 38 4 0.0
12 56 4 18.0
13 74 4 18.0
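Put together with the question's data, the two lines can be tried as the following self-contained sketch. Here min_xpos is computed directly with Series.min rather than through describe, and the result is cast back to integers; both are choices made for this sketch only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "xPos": [40, 50, 75, 90, 42, 75, 110, 45, 70, 95, 125, 38, 56, 74],
    "lineNum": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4],
})

# Global minimum of xPos (38 for this data)
min_xpos = df["xPos"].min()

# Difference to the previous row within each line; NaN on each first row
df["xDiff"] = df.groupby("lineNum")["xPos"].diff()

# First row of each line: distance to the global minimum instead
df["xDiff"] = df["xDiff"].fillna(df["xPos"] - min_xpos).astype(int)
print(df)
```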
You just need to group by lineNum and apply the condition you already wrote down:
df['xDiff']=np.concatenate(df.groupby('lineNum').apply(lambda x : np.where(x['lineNum'] != x['lineNum'].shift(1),x['xPos'] - x['xPos'].min(),x['xPos'].shift(1)).astype(int)).values)
df
Out[76]:
xPos lineNum xDiff
0 40 1 0
1 50 1 40
2 75 1 50
3 90 1 75
4 42 2 0
5 75 2 42
6 110 2 75
7 45 3 0
8 70 3 45
9 95 3 70
10 125 3 95
11 38 4 0
12 56 4 38
13 74 4 56
