How to edit column values based on another value? - python

My data frame looks like this:
val type
10 new
70 new
61 new
45 old
32 new
77 mid
11 mid
For values in column "type", if the value is new, I want to edit it depending on the val column: if val is <= 20, it must be new-1; if val is > 20 and < 50, it must be new-2; if val is >= 50, it must be new-3. So the desired result is:
val type
10 new-1
70 new-3
61 new-3
45 old
32 new-2
77 mid
11 mid
How to do that?

I'd use pd.cut:
>>> df['type'] = df['type'].where(
...     df['type'].ne('new'),
...     df['type'] + '-' + pd.cut(df['val'], [0, 20, 50, float('inf')], labels=[1, 2, 3]).astype(str)
... )
>>> df
val type
0 10 new-1
1 70 new-3
2 61 new-3
3 45 old
4 32 new-2
5 77 mid
6 11 mid
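For what it's worth, the same column can be built with numpy.select if the boundary at exactly 50 has to follow the stated rules (pd.cut with the default right=True puts a value of exactly 50 into the second bin); a minimal sketch, assuming the sample frame above:
import numpy as np
import pandas as pd
df = pd.DataFrame({'val': [10, 70, 61, 45, 32, 77, 11],
                   'type': ['new', 'new', 'new', 'old', 'new', 'mid', 'mid']})
# conditions are checked in order: <= 20 -> '1', then < 50 -> '2', else '3'
suffix = pd.Series(np.select([df['val'].le(20), df['val'].lt(50)], ['1', '2'], default='3'),
                   index=df.index)
df.loc[df['type'].eq('new'), 'type'] = 'new-' + suffix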

Related

1 to 2 matching in two dataframes with different sizes in Python/R

I've been struggling with this all day; a solution in either Python or R is fine.
I have two dataframes: df1 has 44 rows and df2 has 100 rows. They both have these columns:
ID, status (0,1), Age, Gender, Race, Ethnicity, Height, Weight
for each row in df1, I need to find an age match in df2:
it can be an exact age match, but the criterion to use is df2[age] - 5 <= df1[age] <= df2[age] + 5
I need a list/dictionary to store which df2 rows are age matches for each df1 row, and their IDs
Then I need to randomly select 2 IDs from df2 as the final match for the df1 age
I also need to make sure the 2 df2 matches share the same gender and race as the df1 row
I have tried R and Python, and got stuck on the nested loops in both.
I'm not sure how to loop through each record in both df1 and df2, compare the df1 age with df2 age - 5 and df2 age + 5, and store the matches.
Here is the sample data format for df1 and df2:
| ID | sex | age | race |
| -------- | -------------- |--------|-------|
| 284336 | female | 42.8 | 2 |
| 294123 | male | 48.5 | 1 |
Here is what I've attempted in R:
id_match <- NULL
for (i in 1:nrow(gwi_case)){
  age <- gwi_case$age[i]
  gender <- gwi_case$gender[i]
  ethnicity <- gwi_case$hispanic_non[i]
  race <- gwi_case$race[i]
  x <- which(gwi_control$gender==gender & gwi_control$age>=age-5 & gwi_control$age<=age+5 & gwi_control$hispanic_non==ethnicity & gwi_control$race==race)
  y <- sample(x, min(2, length(x)))
  id_match <- c(id_match, y)
}
id_match <- id_match[!duplicated(id_match)]
length(id_match)
The question asks this:
for each row in df1, find an age match in df2 such that df2[age] - 5 <= df1[age] <= df2[age] + 5
create a list/dictionary to hold age matches and IDs for df1
randomly select 2 IDs from df2 as the final match for df1 age
Here is some Python code that:
uses the criterion to populate a list of lists, ageMatches, with the unique df2 ages matching each unique df1 age
calls DataFrame.query() on df2 for each df1 age to populate idMatches with the df2 IDs whose age matches each unique df1 age
populates age1ToID2 with unique df1 age keys and, as values, lists of 2 (or fewer, if fewer than 2 are available) randomly selected df2 IDs of matching age
adds a column to df1 containing the pair of selected df2 IDs corresponding to each row's age (i.e., the values in age1ToID2)
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'ID': list(range(101, 145)), 'Age': [v % 11 + 21 for v in range(44)], 'Height': [67] * 44})
df2 = pd.DataFrame({'ID': list(range(1, 101)), 'Age': [v % 10 + 14 for v in range(50)] + [v % 20 + 25 for v in range(0, 100, 2)], 'Height': [67] * 100})
ages1 = np.sort(df1['Age'].unique())
ages2 = np.sort(df2['Age'].unique())
ageMatches = [[] for _ in ages1]
j1, j2 = 0, 0
for i, age1 in enumerate(ages1):
    # slide a window [j1, j2) over the sorted df2 ages covering [age1 - 5, age1 + 5]
    while j1 < len(ages2) and ages2[j1] < age1 - 5:
        j1 += 1
    if j2 <= j1:
        j2 = j1 + 1
    while j2 < len(ages2) and ages2[j2] <= age1 + 5:
        j2 += 1
    ageMatches[i] += list(ages2[j1:j2])
idMatches = [df2.query('Age in @m')['ID'].to_list() for m in ageMatches]
# select random pair of df2 IDs for each unique df1 age and put them into a new df1 column
from random import sample
age1ToID2 = {ages1[i]: m if len(m) < 2 else sample(m, 2) for i, m in enumerate(idMatches)}
df1['df2_matches'] = df1['Age'].apply(lambda x: age1ToID2[x])
print(df1)
Output:
ID Age Height df2_matches
0 101 21 67 [24, 30]
1 102 22 67 [50, 72]
2 103 23 67 [10, 37]
3 104 24 67 [63, 83]
4 105 25 67 [83, 49]
5 106 26 67 [20, 52]
6 107 27 67 [49, 84]
7 108 28 67 [54, 55]
8 109 29 67 [91, 55]
9 110 30 67 [65, 51]
10 111 31 67 [75, 72]
11 112 21 67 [24, 30]
...
42 143 30 67 [65, 51]
43 144 31 67 [75, 72]
This hopefully provides the result and intermediate collections that OP is asking for, or something close enough to get to the desired result.
Alternatively, to have the random selection be different for each row in df1, we can do this:
# select a random pair of df2 IDs for each df1 row and put them into a new df1 column
from random import sample
age1ToID2 = {ages1[i]: m for i, m in enumerate(idMatches)}
def foo(x):
    m = age1ToID2[x]
    return m if len(m) < 2 else sample(m, 2)
df1['df2_matches'] = df1['Age'].apply(foo)
print(df1)
Output:
ID Age Height df2_matches
0 101 21 67 [71, 38]
1 102 22 67 [71, 5]
2 103 23 67 [9, 38]
3 104 24 67 [49, 61]
4 105 25 67 [27, 93]
5 106 26 67 [40, 20]
6 107 27 67 [9, 19]
7 108 28 67 [53, 72]
8 109 29 67 [82, 53]
9 110 30 67 [74, 62]
10 111 31 67 [52, 62]
11 112 21 67 [71, 39]
...
42 143 30 67 [96, 66]
43 144 31 67 [63, 83]
Not sure I fully understand the requirement, but... in Python you can use apply on the dataframe with a lambda function to perform some funky things:
df1['age_matched_ids'] = df1.apply(lambda x: list(df2.loc[(df2['Age'] >= x['Age'] - 5) & (df2['Age'] <= x['Age'] + 5), 'ID']), axis=1)
This stores in column 'age_matched_ids' the list of df2 IDs whose Age falls within Age +/- 5. You can do #2 and #3 from here, as sketched below.
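A minimal, hypothetical continuation for #2 and #3, assuming the age_matched_ids column built above ('final_match' is a made-up column name; the gender/race condition from #3 would be added to the .loc filter the same way):
from random import sample
# sample up to 2 of the matched IDs per row (fewer if fewer candidates exist)
df1['final_match'] = df1['age_matched_ids'].apply(lambda ids: ids if len(ids) < 2 else sample(ids, 2))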

How to name the column when using the value_counts function in pandas?

I was counting the number of occurrences of angle and dist with the code below:
g = new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
the output:
current_angle current_dist 0
-50 30 1
-50 40 2
-50 41 6
-50 45 4
try 1:
g.columns = ['angle','Distance','count','Percentage Missed'] resulted in no change to the column names.
try 2:
printing the columns with print(g.columns) ended with the error AttributeError: 'Series' object has no attribute 'columns'
I want to rename the 0 column to count, and add a new column to g, percent missed, calculated as 100 minus the value in the 0 column.
Expected output
current_angle current_dist count percent missed
-50 30 1 99
-50 40 2 98
-50 41 6 94
-50 45 4 96
1. How should I modify the code? Instead of value_counts, is there any other function that can give the expected output?
2. How do I get the expected output with the current method?
EDIT 1 (exceptional case)
data:
angle  distance  velocity
0      124       -3
50     24        -25
50     34        25
expected output (count is calculated based on distance):
angle  distance  velocity  count  percent missed
0      124       -3        1      99
50     24        -25       1      99
50     34        25        1      99
First add Series.reset_index, because DataFrame.value_counts returns a Series; its name parameter renames the 0 column to count. Then subtract from 100 into a new column with Series.rsub, which subtracts from the right side, i.e. 100 - df['count']:
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index(name='count')
            .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
Or, if you also need to set new column names, use DataFrame.set_axis:
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index(name='count')
            .set_axis(['angle', 'Distance', 'count'], axis=1)
            .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
If you need to assign the new column names separately, here is an alternative solution:
df = (new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)
            .reset_index())
df.columns = ['angle','Distance','count']
df['percent missed'] = df['count'].rsub(100)
Assuming a DataFrame as input (if not, reset_index first), simply use rename and a subtraction:
df = df.rename(columns={'0': 'count'}) # assuming string '0' here, else use 0
df['percent missed'] = 100 - df['count']
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
Alternative: using groupby.size:
(new_df
.groupby(['current_angle','current_dist']).size()
.reset_index(name='count')
.assign(**{'percent missed': lambda d: 100-d['count']})
)
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96

determine the range of a value using a look up table

I have a df with numbers:
numbers = pd.DataFrame(columns=['number'], data=[
    50,
    65,
    75,
    85,
    90
])
and a df with ranges (look up table):
ranges = pd.DataFrame(
    columns=['range', 'range_min', 'range_max'],
    data=[
        ['A', 90, 100],
        ['B', 85, 95],
        ['C', 70, 80]
    ]
)
I want to determine what range (in second table) a value (in the first table) falls in. Please note ranges overlap, and limits are inclusive.
Also please note that while the dataframe above has 3 ranges, this dataframe is generated dynamically and could have anywhere from 2 to 7 ranges.
Desired result:
numbers = pd.DataFrame(columns=['number', 'detected_range'], data=[
    [50, 'out_of_range'],
    [65, 'out_of_range'],
    [75, 'C'],
    [85, 'B'],
    [90, 'overlap']  # could be A or B
])
I solved this with a for loop, but it doesn't scale well to the big dataset I am using, and the code is extensive and inelegant. See below:
numbers['detected_range'] = np.nan
for i, row1 in numbers.iterrows():
    for j, row2 in ranges.iterrows():
        if row1.number >= row2.range_min and row1.number <= row2.range_max:
            numbers.loc[i, 'detected_range'] = row2['range']
        elif (other cases...):
            ...and so on...
How could I do this?
You can use a few numpy vectorized operations to generate masks, and use them to select your labels:
import numpy as np
a = numbers['number'].values                  # numpy array of numbers
r = ranges.set_index('range')                 # dataframe of min/max with labels as index
m1 = (a >= r['range_min'].values[:, None]).T  # is number above each min
m2 = (a <= r['range_max'].values[:, None]).T  # is number below each max (limits are inclusive)
m3 = (m1 & m2)                                # combine both conditions above
# NB. the two operations could be done without the intermediate variables m1/m2
m4 = m3.sum(1)  # how many matches?
# 0  -> out_of_range
# 2+ -> overlap
# 1  -> get column name
# now we select the label according to the conditions
numbers['detected_range'] = np.select([m4 == 0, m4 >= 2],  # out_of_range and overlap
                                      ['out_of_range', 'overlap'],
                                      # otherwise get the matching label
                                      default=np.take(r.index, m3.argmax(1))
                                      )
output:
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
edit:
It works with any number of intervals in ranges.
Example output with an extra row ['D', 50, 51]:
number detected_range
0 50 D
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
Pandas IntervalIndex fits in here; however, since your data has overlapping intervals, a for loop is the approach I'll use here (for unique, non-overlapping intervals, IntervalIndex.get_indexer is a fast approach):
intervals = pd.IntervalIndex.from_arrays(ranges.range_min,
                                         ranges.range_max,
                                         closed='both')
box = []
for num in numbers.number:
    bools = intervals.contains(num)
    if bools.sum() == 1:
        box.append(ranges.range[bools].item())
    elif bools.sum() > 1:
        box.append('overlap')
    else:
        box.append('out_of_range')
numbers.assign(detected_range=box)
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
First, explode the ranges (using range_max + 1 so the upper limit is inclusive, as the question requires; this assumes integer values):
df1 = ranges.assign(col1=ranges.apply(lambda ss: range(ss.range_min, ss.range_max + 1), axis=1)).explode('col1')
df1
range range_min range_max col1
0 A 90 100 90
0 A 90 100 91
0 A 90 100 92
0 A 90 100 93
0 A 90 100 94
0 A 90 100 95
0 A 90 100 96
0 A 90 100 97
0 A 90 100 98
0 A 90 100 99
0 A 90 100 100
1 B 85 95 85
1 B 85 95 86
1 B 85 95 87
1 B 85 95 88
1 B 85 95 89
1 B 85 95 90
Second, judge whether each of the numbers in the first df falls in a range:
def function1(x):
    df11 = df1.loc[df1.col1 == x]
    if len(df11) == 0:
        return 'out_of_range'
    if len(df11) > 1:
        return 'overlap'
    return df11.iloc[0, 0]
numbers.assign(col2=numbers.number.map(function1))
number col2
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
The logic is simple and clear.

Pandas first 5 and last 5 rows in single iloc operation

I need to check df.head() and df.tail() many times, and calling df.head() and df.tail() separately makes Jupyter Notebook display two ugly outputs.
Is there a single-line command to select just the first 5 and last 5 rows,
something like:
df.iloc[:5 | -5:] ?
Test example:
df = pd.DataFrame(np.random.rand(20,2))
df.iloc[:5]
Update
Ugly but working ways:
df.iloc[(np.where((df.index < 5) | (df.index >= len(df) - 5)))[0]]
or,
df.iloc[np.r_[np.arange(5), np.arange(df.shape[0]-5, df.shape[0])]]
Take a look at numpy.r_:
df.iloc[np.r_[0:5, -5:0]]
Out[358]:
0 1
0 0.899673 0.584707
1 0.443328 0.126370
2 0.203212 0.206542
3 0.562156 0.401226
4 0.085070 0.206960
15 0.082846 0.548997
16 0.435308 0.669673
17 0.426955 0.030303
18 0.327725 0.340572
19 0.250246 0.162993
Also head + tail is not a bad solution
df.head(5).append(df.tail(5))
Out[362]:
0 1
0 0.899673 0.584707
1 0.443328 0.126370
2 0.203212 0.206542
3 0.562156 0.401226
4 0.085070 0.206960
15 0.082846 0.548997
16 0.435308 0.669673
17 0.426955 0.030303
18 0.327725 0.340572
19 0.250246 0.162993
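Note that DataFrame.append was removed in pandas 2.0; pd.concat gives the same head + tail result there:
import pandas as pd
# pandas >= 2.0 replacement for df.head(5).append(df.tail(5))
pd.concat([df.head(5), df.tail(5)])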
df.query("index<5 | index>"+str(len(df)-5))
Here's a way to query the index. You can change the values to whatever you want.
Another approach (per this SO post) uses only Pandas .isin().
Generate some dummy/demo data:
df = pd.DataFrame({'a':range(10,100)})
print(df.head())
a
0 10
1 11
2 12
3 13
4 14
print(df.tail())
a
85 95
86 96
87 97
88 98
89 99
print(df.shape)
(90, 1)
Generate list of required indexes
ls = list(range(5)) + list(range(len(df)-5, len(df)))
print(ls)
[0, 1, 2, 3, 4, 85, 86, 87, 88, 89]
Slice DataFrame using list of indexes
df_first_last_5 = df[df.index.isin(ls)]
print(df_first_last_5)
a
0 10
1 11
2 12
3 13
4 14
85 95
86 96
87 97
88 98
89 99

What is the equivalent of SAS's proc format in Python?

I want
proc format;
    value RNG
        low - 24 = '1'
        24< - 35 = '2'
        35< - 44 = '3'
        44< - high = '4';
I need this in python pandas.
If you are looking for an equivalent mapping function, you can use something like this.
df = pd.DataFrame(np.random.randint(100,size=5), columns=['score'])
print(df)
output:
score
0 73
1 90
2 83
3 40
4 76
Now let's apply the binning function to the score column and create a new column in the same dataframe.
def format_fn(x):
    # upper limits are inclusive, matching the SAS ranges ('low - 24' includes 24)
    if x <= 24:
        return '1'
    elif x <= 35:
        return '2'
    elif x <= 44:
        return '3'
    else:
        return '4'
df['binned_score'] = df['score'].apply(format_fn)
print(df)
output:
score binned_score
0 73 4
1 90 4
2 83 4
3 40 3
4 76 4
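For completeness, the same binning can be written without an explicit function using pd.cut; a minimal sketch (the bin edges mirror the SAS ranges, and pd.cut's default right-inclusive bins match the SAS 'x< - y' convention):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(100, size=5), columns=['score'])
# (-inf, 24], (24, 35], (35, 44], (44, inf) mirror low-24, 24<-35, 35<-44, 44<-high
df['binned_score'] = pd.cut(df['score'], bins=[-np.inf, 24, 35, 44, np.inf], labels=['1', '2', '3', '4'])
print(df)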
