What is the equivalent of PROC FORMAT in SAS in Python (pandas)?

I want
proc format;
    value RNG
        low - 24   = '1'
        24< - 35   = '2'
        35< - 44   = '3'
        44< - high = '4';
run;
I need the equivalent in Python pandas.

If you are looking for an equivalent of the mapping function, you can use something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(100, size=5), columns=['score'])
print(df)
output:
   score
0     73
1     90
2     83
3     40
4     76
Now let's apply a binning function to the score column and create a new column in the same DataFrame. Note that the upper bounds in the SAS format are inclusive, so the comparisons use <=:
def format_fn(x):
    if x <= 24:
        return '1'
    elif x <= 35:
        return '2'
    elif x <= 44:
        return '3'
    else:
        return '4'

df['binned_score'] = df['score'].apply(format_fn)
print(df)
output:
   score binned_score
0     73            4
1     90            4
2     83            4
3     40            3
4     76            4
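A closer one-to-one analogue of PROC FORMAT is pd.cut, which maps value ranges to labels in a single vectorized call. This is a sketch assuming the same cut points as the SAS format; pd.cut bins are right-inclusive by default, matching SAS's low-24, 24<-35, ... ranges:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [73, 90, 83, 40, 76]})

# (-inf, 24] -> '1', (24, 35] -> '2', (35, 44] -> '3', (44, inf) -> '4'
bins = [-np.inf, 24, 35, 44, np.inf]
df['binned_score'] = pd.cut(df['score'], bins=bins, labels=['1', '2', '3', '4'])
print(df)
```

Unlike apply, pd.cut operates on the whole column at once and returns a Categorical, which keeps the bin order for sorting and grouping.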

Related

Subtracting a column from an unmatched DataFrame by mapping it to a matching column

I have two DataFrames, as listed below:
plusMinusOne = pd.DataFrame({0: [2459650, 2459650, 2459650, 2459654, 2459654, 2459654, 2459660], 1: [100, 90, 80, 14, 15, 16, 2]}, index=[3, 4, 5, 12, 13, 14, 27])
bias = pd.DataFrame({0: [2459651, 2459652, 2459653, 2459655, 2459656, 2459658, 2459659], 1: [10, 20, 30, 40, 50, 60, 70]})
I have to subtract bias's column 1 from plusMinusOne's column 1 by matching bias's column 0 against plusMinusOne's column 0.
Since 2459650 is not present in the bias DataFrame, I have to check for 2459651/2459649 in bias and subtract either one's value. For every row I have to look 1 above or 1 below in bias and then subtract that value.
I was trying this:
for i in plusMinusOne[0]:
    if i + 1 in bias[0].values:
        plusMinusOne[1] = plusMinusOne[1].sub(plusMinusOne[0].map(
            bias.assign(key=bias[0] - 1).set_index('key')[1]), fill_value=0)
        break
    elif i - 1 in bias[0].values:
        plusMinusOne[1] = plusMinusOne[1].sub(plusMinusOne[0].map(
            bias.assign(key=bias[0] + 1).set_index('key')[1]), fill_value=0)
        break
My expected output is:
plusMinusOne
2459650   90
2459650   80
2459650   70
2459654  -26
2459654  -25
2459654  -24
2459660  -68
A row-wise solution using apply (note that apply is not truly vectorized):
def bias_diff(row):
    value = 0
    if (row[0] == bias[0]).any():
        value = row[1] - bias[row[0] == bias[0]].iloc[0, 1]
    elif ((row[0] + 1) == bias[0]).any():
        value = row[1] - bias[(row[0] + 1) == bias[0]].iloc[0, 1]
    else:
        value = row[1] - bias[(row[0] - 1) == bias[0]].iloc[0, 1]
    return value

plusMinusOne[1] = plusMinusOne.apply(bias_diff, axis=1)
print(plusMinusOne)
Output:
          0   1
3   2459650  90
4   2459650  80
5   2459650  70
12  2459654 -26
13  2459654 -25
14  2459654 -24
27  2459660 -68
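A genuinely vectorized alternative (a sketch of my own, not from the answers above) is to build a lookup Series from bias and try the exact key first, then key+1, then key-1, via Series.map and combine_first. Keys with no match within ±1 would produce NaN here rather than keeping their original value:

```python
import pandas as pd

plusMinusOne = pd.DataFrame({0: [2459650, 2459650, 2459650, 2459654, 2459654, 2459654, 2459660],
                             1: [100, 90, 80, 14, 15, 16, 2]}, index=[3, 4, 5, 12, 13, 14, 27])
bias = pd.DataFrame({0: [2459651, 2459652, 2459653, 2459655, 2459656, 2459658, 2459659],
                     1: [10, 20, 30, 40, 50, 60, 70]})

# lookup Series: bias key -> bias value
lookup = bias.set_index(0)[1]

# try the exact key, then key+1, then key-1
adjust = (plusMinusOne[0].map(lookup)
          .combine_first(plusMinusOne[0].add(1).map(lookup))
          .combine_first(plusMinusOne[0].sub(1).map(lookup)))

plusMinusOne[1] = plusMinusOne[1] - adjust
print(plusMinusOne)
```

This avoids a Python-level loop over rows entirely, which matters once the DataFrames grow beyond a few thousand rows.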
This is not efficient code, but it works for your case. It also works for whatever difference you want, by changing the diff variable:
import pandas as pd

df1 = pd.DataFrame({0: [2459650, 2459650, 2459650, 2459654, 2459654, 2459654, 2459660], 1: [100, 90, 80, 14, 15, 16, 2]})
df2 = pd.DataFrame({0: [2459651, 2459652, 2459653, 2459655, 2459656, 2459658, 2459659], 1: [10, 20, 30, 40, 50, 60, 70]})
diff = 3

def data_process(df1, df2, i, diff):
    # returns None when no exact, +diff, or -diff match exists in df2
    data = None
    for j in range(len(df2)):
        if df1[0][i] == df2[0][j]:
            data = df1[1][i] - df2[1][j]
        elif df1[0][i] + diff == df2[0][j]:
            data = df1[1][i] - df2[1][j]
        elif df1[0][i] - diff == df2[0][j]:
            data = df1[1][i] - df2[1][j]
    return data

processed_data = []
for i in range(len(df1)):
    if data_process(df1, df2, i, diff) is None:
        processed_data.append(df1[1][i])
    else:
        processed_data.append(data_process(df1, df2, i, diff))
df1[2] = processed_data
print(df1[[0, 2]])
The output DataFrame for diff = 1 is:
         0   2
0  2459650  90
1  2459650  80
2  2459650  70
3  2459654 -26
4  2459654 -25
5  2459654 -24
6  2459660 -68
The output DataFrame for diff = 3 is:
         0     2
0  2459650  70.0
1  2459650  60.0
2  2459650  50.0
3  2459654   4.0
4  2459654   5.0
5  2459654   6.0
6  2459660     2
2459660 has no +3 or -3 match (i.e. 2459657 or 2459663) in the second DataFrame, so its value is returned as-is; otherwise it would be NaN instead of 2.

How to name the column when using the value_counts function in pandas?

I was counting the number of occurrences of angle and dist with the code below:
g = new_df.value_counts(subset=['Current_Angle','Current_dist'], sort=False)
The output:
current_angle  current_dist   0
-50            30             1
-50            40             2
-50            41             6
-50            45             4
Try 1: g.columns = ['angle','Distance','count','Percentage Missed'] resulted in no change to the column names.
Try 2: when I printed the columns using print(g.columns), it ended with the error AttributeError: 'Series' object has no attribute 'columns'.
I want to rename column 0 as count and add a new column to g, percent missed, calculated as 100 minus the value in column 0.
Expected output:
current_angle  current_dist  count  percent missed
-50            30            1      99
-50            40            2      98
-50            41            6      94
-50            45            4      96
1. How do I modify the code? Instead of value_counts, is there another function that can give the expected output?
2. How do I get the expected output with the current method?
EDIT 1 (exceptional case)
Data:
angle  distance  velocity
0      124       -3
50     24        -25
50     34        25
Expected output (count is calculated based on distance):
angle  distance  velocity  count  percent missed
0      124       -3        1      99
50     24        -25       1      99
50     34        25        1      99
First add Series.reset_index, because DataFrame.value_counts returns a Series; its name parameter renames column 0 to count. Then subtract from 100 into a new column with Series.rsub (subtraction from the right side, like 100 - df['count']):
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index(name='count')
            .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
Or, if you also need to set new column names, use DataFrame.set_axis:
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index(name='count')
            .set_axis(['angle', 'Distance', 'count'], axis=1)
            .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
If you need to assign new column names, here is an alternative solution:
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index())
df.columns = ['angle', 'Distance', 'count']
df['percent missed'] = df['count'].rsub(100)
Assuming a DataFrame as input (if not, use reset_index first), simply use rename and a subtraction:
df = df.rename(columns={'0': 'count'}) # assuming string '0' here, else use 0
df['percent missed'] = 100 - df['count']
output:
   current_angle  current_dist  count  percent missed
0            -50            30      1              99
1            -50            40      2              98
2            -50            41      6              94
3            -50            45      4              96
Alternative, using groupby.size:
(new_df
 .groupby(['current_angle', 'current_dist']).size()
 .reset_index(name='count')
 .assign(**{'percent missed': lambda d: 100 - d['count']})
)
output:
   current_angle  current_dist  count  percent missed
0            -50            30      1              99
1            -50            40      2              98
2            -50            41      6              94
3            -50            45      4              96
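Putting the first approach together end to end on a small mock DataFrame (the raw rows below are invented to reproduce the counts in the question):

```python
import pandas as pd

# mock data: one (-50, 30) row, two (-50, 40), six (-50, 41), four (-50, 45)
new_df = pd.DataFrame({
    'Current_Angle': [-50] * 13,
    'Current_dist':  [30] + [40] * 2 + [41] * 6 + [45] * 4,
})

# value_counts returns a Series; reset_index(name=...) names the counts column
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index(name='count'))
df['percent missed'] = df['count'].rsub(100)
print(df)
```

Note that DataFrame.value_counts requires pandas 1.1 or later; on older versions the groupby.size alternative above gives the same result.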

How to edit column values based on another value?

My data frame looks like this:
val  type
10   new
70   new
61   new
45   old
32   new
77   mid
11   mid
For values in the "type" column: if the value is new, I want to edit it depending on the val column. If the value in val is <= 20, it must be new-1; if > 20 and < 50, it must be new-2; if >= 50, it must be new-3. So the desired result is:
val  type
10   new-1
70   new-3
61   new-3
45   old
32   new-2
77   mid
11   mid
How to do that?
I'd use pd.cut:
>>> df['type'] = df['type'].where(df['type'].ne('new'), df['type'] + '-' + pd.cut(df['val'], [0, 20, 50, float('inf')], labels=[1, 2, 3]).astype(str))
>>> df
   val   type
0   10  new-1
1   70  new-3
2   61  new-3
3   45    old
4   32  new-2
5   77    mid
6   11    mid
>>>
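One caveat: pd.cut with these bins is right-inclusive, so a val of exactly 50 would land in bin 2 (new-2) rather than bin 3, while the question asks for >= 50 to be new-3. A sketch using np.select (my own variant, with the DataFrame reconstructed from the question) matches the stated boundaries exactly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val':  [10, 70, 61, 45, 32, 77, 11],
                   'type': ['new', 'new', 'new', 'old', 'new', 'mid', 'mid']})

# first matching condition wins: <= 20 -> '-1', then < 50 -> '-2', else '-3'
suffix = np.select([df['val'] <= 20, df['val'] < 50], ['-1', '-2'], default='-3')

# only rows whose type is 'new' get the suffix appended
df['type'] = np.where(df['type'].eq('new'), df['type'] + suffix, df['type'])
print(df)
```

np.select evaluates the conditions in order, so the open/closed ends of each range can be stated explicitly instead of being inherited from pd.cut's bin closure.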

Python - Group by time periods with tolerance

This data set has three columns: ID (unique employee identification), WorkComplete (indicating whether all work has been completed), and DaysDiff (number of days from their start date). I am looking to group the DaysDiff column into certain time periods, with an added layer of tolerance or leniency. For my mock data, the time periods are spaced 30 days apart.
Group 0: 0-30 DaysDiff (with a 30-day extra window if 'Y' is not found)
Group 1: 31-60 DaysDiff (with a 30-day extra window if 'Y' is not found)
Group 2: 61-90 DaysDiff (with a 30-day extra window if 'Y' is not found)
I was able to create very basic code and assign the groupings, but I am having trouble with the extra 30-day window. If an employee completed their work (Y) during the time periods above, they receive the attributed grouping. For ID 111 below, the person did not complete their work within the first 30 days, so I am giving them an additional 30 days. If they complete their work in that window, the first instance of a 'Y' is grouped into the previous grouping.
df = pd.DataFrame({'ID': [111, 111, 111, 111, 111, 111, 123, 123, 123],
                   'WorkComplete': ['N', 'N', 'Y', 'N', 'N', 'N', 'N', 'Y', 'Y'],
                   'DaysDiff': [0, 29, 45, 46, 47, 88, 1, 12, 89]})
Input:
ID   WorkComplete  DaysDiff
111  N             0
111  N             29
111  Y             45
111  N             46
111  N             47
111  N             88
123  N             1
123  Y             12
123  Y             89
Output:
ID   WorkComplete  DaysDiff  Group
111  N             0         0
111  N             29        0
111  Y             45        0  <---- note here the grouping is 0 to provide extra time
111  N             46        1  <---- back to normal
111  N             47        1
111  N             88        2
123  N             1         0
123  Y             12        0
123  Y             89        2
minQ1 = 0
highQ1 = 30
minQ2 = 31
highQ2 = 60
minQ3 = 61
highQ3 = 90

def Group_df(df):
    if minQ1 <= df['DaysDiff'] <= highQ1:
        return '0'
    elif minQ2 <= df['DaysDiff'] <= highQ2:
        return '1'
    elif minQ3 <= df['DaysDiff'] <= highQ3:
        return '2'

df['Group'] = df.apply(Group_df, axis=1)
The trouble I am having is allowing for the additional 30 days if the person did not complete the work. My above attempt is partial at trying to resolve the issue.
You can use np.select for the primary conditions.
Then use mask for the specific condition you mention. s is the first index location of a 'Y' value per group. I temporarily assign s as a new column so I can check rows against df.index and return the rows that meet the condition. The second condition is that the group number is 1 from the previous line of code:
df['Group'] = np.select([df['DaysDiff'].between(0, 30),
                         df['DaysDiff'].between(31, 60),
                         df['DaysDiff'].between(61, 90)],
                        [0, 1, 2])
s = df[df['WorkComplete'] == 'Y'].groupby('ID')['DaysDiff'].transform('idxmin')
df['Group'] = df['Group'].mask((df.assign(s=s)['s'].eq(df.index)) & (df['Group'].eq(1)), 0)
df
df
Out[1]:
    ID WorkComplete  DaysDiff  Group
0  111            N         0      0
1  111            N        29      0
2  111            Y        45      0
3  111            N        46      1
4  111            N        47      1
5  111            N        88      2
6  123            N         1      0
7  123            Y        12      0
8  123            Y        89      2

Getting the previous and next values from a DataFrame and adding a new column

I am new to Python and pandas. I have a DataFrame like this:
Id  Offset  feature
0   0       2
0   5       2
0   11      0
0   21      22
0   28      22
1   32      0
1   38      21
1   42      21
1   52      21
1   55      0
1   58      0
1   62      1
1   66      1
1   70      1
2   73      0
2   78      1
2   79      1
This df has a feature column on which I am trying to do some operations. The column contains various values, including 0. I want to replace each 0 based on the three previous and three next values around it.
For the first 0, the previous values are [2, 2] (since it is near the start, there is no third one) and the next three are [22, 22, 0].
I am trying to get the following DataFrame.
Expected output:
Offset  feature  previous     Next         NewFeature
0       2        -            -            2
5       2        -            -            2
11      0        [2,2]        [22,22,0]    0
21      22       -            -            22
28      22       -            -            22
32      0        [22,22,0]    [21,21,21]   0
38      21       -            -            21
42      21       -            -            21
52      21       -            -            21
55      0        [21,21,21]   [0,1,1]      0
58      0        [0,21,21]    [1,1,1]      0
62      1        -            -            1
66      1        -            -            1
70      1        -            -            1
73      0        [1,1,1]      [1,1]        1
78      1        -            -            1
79      1        -            -            1
With this I am checking whether the previous and next values are the same.
Is there any way I can get this DataFrame? How do I get the previous and next values? Any help would be great.
Thanks.
The logic for getting NewFeature is this. I have a list of feature codes:
1, 2, 16, 15, 26, 25
If the previous and next arrays contain values from (1, 16, 15), they count the same as 1, and values from (2, 26, 25) count the same as 2.
For example, if the previous values are [1, 16, 2] and the next are [1, 26, 1], then after collapsing they become [1, 1, 2] and [1, 2, 1]; 1 occurs more often than 2, so the 0 is replaced by 1.
The given data can be handled the same way.
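The replacement rule described above can be sketched as a small helper. The EQUIV mapping and the function name are my own illustration of the described logic, not from the question:

```python
# (1, 16, 15) all count as 1; (2, 26, 25) all count as 2
EQUIV = {16: 1, 15: 1, 26: 2, 25: 2}

def majority_replacement(prev_vals, next_vals):
    # collapse equivalent codes, ignore the 0s, and take a majority vote
    votes = [EQUIV.get(v, v) for v in prev_vals + next_vals if v != 0]
    if not votes:
        return 0
    return max(set(votes), key=votes.count)

# the example from the question: previous [1, 16, 2], next [1, 26, 1]
# collapses to [1, 1, 2] and [1, 2, 1]; 1 wins the vote
print(majority_replacement([1, 16, 2], [1, 26, 1]))
```

Once the previous/Next columns are built (as in the answers below), this helper could be applied row-wise to the rows where feature is 0 to fill NewFeature.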
This should get you in the right direction:
import pandas as pd
# create a dummy df
df = pd.DataFrame()
df['feature'] = range(100)
df = df.sample(frac=1)
# create shifted columns
df['shift1'] = df['feature'].shift()
df['shift2'] = df['feature'].shift(2)
# concat the previous values
values = df.loc[:, ['shift1', 'shift2']].values
df['prev'] = values.tolist()
# you just want the zeros, right?
df.query('feature == 0')
You can use list comprehensions:
x = df['feature'].tolist()
y = x[::-1]
df['previous'] = [y[-i:][:3] for i in range(1, len(x)+1)]
df['Next'] = [x[i: i + 3] for i in range(1, len(x) + 1)]
df['previous'] = df['previous'].shift(1).where(df['feature'] == 0, '-')
df['Next'] = df['Next'].where(df['feature'] == 0, '-')
print (df)
    Offset  feature      previous          Next
0        0        2             -             -
1        5        2             -             -
2       11        0        [2, 2]   [22, 22, 0]
3       21       22             -             -
4       28       22             -             -
5       32        0   [22, 22, 0]  [21, 21, 21]
6       38       21             -             -
7       42       21             -             -
8       52       21             -             -
9       55        0  [21, 21, 21]     [0, 1, 1]
10      58        0   [0, 21, 21]     [1, 1, 1]
11      62        1             -             -
12      66        1             -             -
13      70        1             -             -
14      73        0     [1, 1, 1]        [1, 1]
15      78        1             -             -
16      79        1             -             -
