How to shorten my code with a lambda statement in Python?

I am having trouble shortening my code, ideally with a lambda. bp is the name of my DataFrame.
My data looks like this:
user  label
   1      b
   2      b
   3      s
I expect to get:
user  label  Y
   1      b  1
   2      b  1
   3      s  0
Here is my code:
counts = bp['Label'].value_counts()

def score_to_numeric(x):
    if counts['b'] > counts['s']:
        if x == 'b':
            return 1
        else:
            return 0
    else:
        if x == 'b':
            return 0
        else:
            return 1

bp['Y'] = bp['Label'].apply(score_to_numeric)  # apply above function to convert data
It is a function that converts the categorical values 'b' and 's' in the column 'Label' into numeric data: 0 or 1. The line counts = bp['Label'].value_counts() counts the occurrences of 'b' and 's' in the 'Label' column. Then, in score_to_numeric, if the count of 'b' is greater than that of 's', 'b' gets the value 1 in a new column called 'Y', and vice versa.
I would like to shorten my code to 3-4 lines at most. I think a lambda might do it, but I'm not familiar enough with lambdas.

Since True and False evaluate to 1 and 0, respectively, you can simply return the Boolean expression, converted to integer.
def score_to_numeric(x):
    return int((counts['b'] > counts['s']) == (x == 'b'))
It returns 1 iff both expressions have the same Boolean value.
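For instance, a quick check of the logic with hypothetical counts (a plain dict stands in for the value_counts result; the numbers are illustrative, with 'b' as the majority label):
counts = {'b': 2, 's': 1}  # hypothetical counts; 'b' is the majority label

def score_to_numeric(x):
    return int((counts['b'] > counts['s']) == (x == 'b'))

print(score_to_numeric('b'))  # 1 -> majority label gets 1
print(score_to_numeric('s'))  # 0 -> minority label gets 0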

I don't think you need to use the apply method. Something simple like this should work:
value_counts = bp['Label'].value_counts()
bp.loc[bp['Label'] == 'b', 'Y'] = 1 if value_counts['b'] > value_counts['s'] else 0
bp.loc[bp['Label'] == 's', 'Y'] = 1 if value_counts['s'] > value_counts['b'] else 0

You could do the following:
counts = bp['Label'].value_counts()
t = 1 if counts['b'] > counts['s'] else 0
bp['Y'] = bp['Label'].apply(lambda x: t if x == 'b' else 1 - t)
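As a quick end-to-end check, here is the three-liner run on a small frame mirroring the question's data (assuming labels 'b' and 's'):
import pandas as pd

bp = pd.DataFrame({'user': [1, 2, 3], 'Label': ['b', 'b', 's']})
counts = bp['Label'].value_counts()
t = 1 if counts['b'] > counts['s'] else 0
bp['Y'] = bp['Label'].apply(lambda x: t if x == 'b' else 1 - t)
print(bp)
#    user Label  Y
# 0     1     b  1
# 1     2     b  1
# 2     3     s  0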

Related

Creating a function to standardize categorical variables (python)

I don't know if "standardize" is the right word for recoding a categorical string variable, but basically I want to create a function that sets every F or f in the column below to 0, and every M or m to 1:
> df['gender']
gender
f
F
f
M
M
m
I tried this:
def padroniza_genero(x):
    if(x == 'f' or x == 'F'):
        replace(['f', 'F'], 0)
    else:
        replace(1)

df1['gender'] = df1['gender'].apply(padroniza_genero)
But I got an error:
NameError: name 'replace' is not defined
Any ideas? Thanks!
There is no replace function defined in your code, which is why the NameError is raised.
Back to your goal: use a vectorized approach.
Convert to lowercase and map f->0, m->1:
df['gender_num'] = df['gender'].str.lower().map({'f': 0, 'm': 1})
Or use a comparison (not equal to f) and conversion from boolean to integer:
df['gender_num'] = df['gender'].str.lower().ne('f').astype(int)
Output:
  gender  gender_num
0      f           0
1      F           0
2      f           0
3      M           1
4      M           1
5      m           1
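Note that map returns NaN for any value missing from the dict, while the ne version silently codes everything that isn't 'f' as 1. A small sketch of guarding against unexpected labels (the error message is illustrative, not from the original):
mapped = df['gender'].str.lower().map({'f': 0, 'm': 1})
if mapped.isna().any():
    # surface unexpected categories instead of silently coding them
    raise ValueError(f"unmapped labels: {df.loc[mapped.isna(), 'gender'].unique()}")
df['gender_num'] = mapped.astype(int)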
Generalization
You can generalize to any number of categories using pandas.factorize. Advantage: you get integer codes together with the mapping back to the original labels.
NB: the numeric codes are assigned according to whichever value appears first, or in lexicographic order if sort=True:
s, key = pd.factorize(df['gender'].str.lower(), sort=True)
df['gender_num'] = s
key = dict(enumerate(key))
# {0: 'f', 1: 'm'}
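As a possible follow-up, the key dict makes it easy to decode back to the original labels (the column name gender_decoded is illustrative):
# recover the original labels from the integer codes (inverse of the factorize step)
df['gender_decoded'] = df['gender_num'].map(key)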

CASE statement in Python based on Regex

So I have a data frame like this:
FileName
01011RT0TU7
11041NT4TU8
51391RST0U2
01011645RT0TU9
11311455TX0TU8
51041545ST3TU9
What I want is another column in the DataFrame like this:
FileName       | RdwyId
01011RT0TU7    | 01011000
11041NT4TU8    | 11041000
51391RST0U2    | 51391000
01011645RT0TU9 | 01011645
11311455TX0TU8 | 11311455
51041545ST3TU9 | 51041545
Essentially: if the first 5 characters are digits, concatenate them with "000"; if the first 8 characters are digits, simply move them to the RdwyId column.
I am a noob, so I have been playing with this:
Test 1:
rdwyre1=re.compile(r'\d\d\d\d\d')
rdwyre2=re.compile(r'\d\d\d\d\d\d\d\d')
rdwy1=rdwyre1.findall(str(thous["FileName"]))
rdwy2=rdwyre2.findall(str(thous["FileName"]))
thous["RdwyId"]=re.sub(r'\d\d\d\d\d', str(thous["FileName"].loc[:4])+"000",thous["FileName"])
Test 2:
thous["RdwyId"]=np.select(
[
re.search(r'\d\d\d\d\d',thous["FileName"])!="None",
rdwyre2.findall(str(thous["FileName"]))!="None"
],
[
rdwyre1.findall(str(thous["FileName"]))+"000",
rdwyre2.findall(str(thous["FileName"])),
],
default="Unknown"
)
Test 3:
thous=thous.assign(RdwyID=lambda x: str(rdwyre1.search(x).group())+"000" if bool(rdwyre1.search(x))==True else str(rdwyre2.search(x).group()))
None of the above has worked. Could anyone help me figure out where I am going wrong and how to fix it?
You can use numpy select, which replicates CASE WHEN for multiple conditions, and Pandas' str.isnumeric method:
cond1 = df.FileName.str[:8].str.isnumeric() # first condition
choice1 = df.FileName.str[:8] # result if first condition is met
cond2 = df.FileName.str[:5].str.isnumeric() # second condition
choice2 = df.FileName.str[:5] + "000" # result if second condition is met
condlist = [cond1, cond2]
choicelist = [choice1, choice2]
df.loc[:, "RdwyId"] = np.select(condlist, choicelist)
df
FileName RdwyId
0 01011RT0TU7 01011000
1 11041NT4TU8 11041000
2 51391RST0U2 51391000
3 01011645RT0TU9 01011645
4 11311455TX0TU8 11311455
5 51041545ST3TU9 51041545
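One detail worth knowing: rows that match neither condition are filled with np.select's default, which is 0 when unspecified. Mirroring the asker's own Test 2, you could pass an explicit sentinel instead:
# make non-matching rows explicit instead of relying on np.select's default of 0
df.loc[:, "RdwyId"] = np.select(condlist, choicelist, default="Unknown")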
def filt(list1):
    for i in list1:
        if i[:8].isdigit():
            print(i[:8])
        else:
            print(i[:5] + "000")
# output
01011000
11041000
51391000
01011645
11311455
51041545
I mean, if your case is very specific, you can tweak it and apply it to your dataframe.
Applied to a dataframe:
def filt(i):
    if i[:8].isdigit():
        return i[:8]
    else:
        return i[:5] + "000"

d = pd.DataFrame({"names": list_1})
d["filtered"] = d.names.apply(lambda x: filt(x))  # .apply(filt) also works; I'm just used to lambdas
#output
names filtered
0 01011RT0TU7 01011000
1 11041NT4TU8 11041000
2 51391RST0U2 51391000
3 01011645RT0TU9 01011645
4 11311455TX0TU8 11311455
5 51041545ST3TU9 51041545
Using regex:
c1 = re.compile(r'\d{5}')
c2 = re.compile(r'\d{8}')

rdwyId = []
for f in thous['FileName']:
    m = re.match(c2, f)
    if m:
        rdwyId.append(m[0])
        continue
    m = re.match(c1, f)
    if m:
        rdwyId.append(m[0] + "000")

thous['RdwyId'] = rdwyId
Edit: replaced re.search with re.match as it's more efficient, since we are only looking for matches at the beginning of the string.
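One caveat with this loop: if a FileName matches neither pattern, nothing is appended and the final column assignment fails with a length mismatch. A hedged variant that keeps the list aligned (the None placeholder is an assumption, not part of the original answer):
rdwyId = []
for f in thous['FileName']:
    m = c2.match(f)
    if m:
        rdwyId.append(m[0])
        continue
    m = c1.match(f)
    # fall back to None so len(rdwyId) always equals len(thous)
    rdwyId.append(m[0] + "000" if m else None)
thous['RdwyId'] = rdwyId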
Let us try findall with ljust: grab the leading run of digits (the part before the first letter) and left-justify it to width 8, padding with '0':
df['new'] = df.FileName.str.findall(r"(\d+)[A-Za-z]").str[0].str.ljust(8, '0')
Out[226]:
0 01011000
1 11041000
2 51391000
3 01011645
4 11311455
5 51041545
Name: FileName, dtype: object

Retrieve certain value located in dataframe in any row or column and keep it in separate column without forloop

I have a dataframe like below:
df
           A          B          C
0          0          1  TRANSIT_1
1  TRANSIT_3       None       None
2          0  TRANSIT_5       None
And I want to change it to below:
Resulting DF
           A          B          C          D
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
So I tried to use str.contains, and once I received the Series of True/False values, I put it in the eval function to somehow get the table I want.
Code I tried:
series_index = pd.DataFrame()
series_index = df.columns.str.contains("^TRANSIT_", case=True, regex=True)
print(type(series_index))
series_index.index[series_index].tolist()
I thought of using the eval function to write it to a separate column, like:
df = eval(df[result]=the index)  # I don't know, but the eval function does evaluation and puts it in a separate column
I couldn't find a simple one-liner, but this works:
idx = list(df1[df1.where(df1.applymap(lambda x: 'TRA' in x if isinstance(x, str) else False)).notnull()].stack().index)

a, b = [], []
for sublist in idx:
    a.append(sublist[0])
    b.append(sublist[1])

df1['ans'] = df1.lookup(a, b)
Output
           A          B          C        ans
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
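Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A minimal sketch of the same row/column pickup with plain positional indexing, assuming the same a (row labels) and b (column labels) lists as above:
rows = df1.index.get_indexer(a)    # positional row indices from the label list
cols = df1.columns.get_indexer(b)  # positional column indices from the label list
df1['ans'] = df1.to_numpy()[rows, cols]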

Loop and arrays of strings in python

I have the following data set:
column1
HL111
PG3939HL11
HL339PG
RC--HL--PG
I am attempting to write a function that does the following:
Loop through each row of column1
Pull out only the alphabetic characters and put them into an array
If the array contains "HL", remove it from the array UNLESS "HL" is the only word in the array
Take the first word in the array and output the result
So for the above example, my array (step2) would look like this:
[HL]
[PG,HL]
[HL,PG]
[RC,HL,PG]
and my desired final output (step4) would look like this:
desired_column
HL
PG
PG
RC
I have the code for step 2, and it seems to work fine
df['array_column'] = (df.column1.str.extractall('([A-Z]+)')
                        .unstack()
                        .values.tolist())
But I don't know how to get from here to my final output (step4).
You may achieve what you need by replacing all non-letters first, then extracting pairs of letters and then applying some custom logic to extract the necessary value from the array:
>>> df['array_column'].str.replace('[^A-Z]+', '').str.findall('([A-Z]{2})').apply(lambda d: [''] if len(d) == 0 else d).apply(lambda x: 'HL' if len(x) == 1 and x[0] == 'HL' else [m for m in x if m != 'HL'][0])
0 HL
1 PG
2 PG
3 RC
Name: array_column, dtype: object
>>>
Details
.replace('[^A-Z]+', '') - remove all chars other than uppercase letters
.str.findall('([A-Z]{2})') - extract pairs of letters
.apply(lambda d: [''] if len(d) == 0 else d) will add an empty item if there is no regex match in the previous step
.apply(lambda x: 'HL' if len(x) == 1 and x[0] == 'HL' else [m for m in x if m != 'HL'][0]) - custom logic: if the list length is 1 and it is equal to HL, keep it, else remove all HL and get the first element
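One compatibility note: since pandas 2.0, Series.str.replace defaults to regex=False, so the first step needs an explicit flag there (a sketch of just that step, assuming the column holds strings as in the chain above):
# under pandas >= 2.0 the pattern is treated literally unless regex=True
cleaned = df['array_column'].str.replace('[^A-Z]+', '', regex=True)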
This is one approach using apply
Demo:
import re
import pandas as pd

def checkValue(value):
    value = re.findall(r"[A-Z]{2}", value)
    if (len(value) > 1) and ("HL" in value):
        return [i for i in value if i != "HL"][0]
    else:
        return value[0]

df = pd.DataFrame({"column1": ["HL111", "PG3939HL11", "HL339PG", "RC--HL--PG"]})
print(df.column1.apply(checkValue))
Output:
0 HL
1 PG
2 PG
3 RC
Name: column1, dtype: object
You can do something like this (or probably something more elegant); what you have already gets you to a fairly nice structure, where you can use groupby to complete your solution:
def extract_relevant_str(grp):
    ret_val = None
    if "HL" in grp[0].tolist() and len(grp) == 1:
        ret_val = "HL"
    elif len(grp) >= 1:
        ret_val = grp.loc[grp[0] != "HL", 0].iloc[0]
    return ret_val

items = df.column1.str.extractall('([A-Z]+)')
items.reset_index().groupby("level_0").apply(extract_relevant_str)
Output:
level_0
0 HL
1 PG
2 PG
3 RC
dtype: object
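If you want this back on the original frame, a possible follow-up (the column name desired_column echoes the question's wording; the result's level_0 index aligns with df's default index, so the assignment lines up):
df["desired_column"] = items.reset_index().groupby("level_0").apply(extract_relevant_str)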

Compound inequality in if statement

This is a generalized function I want to use to check if each row of a dataframe follows a specific trend in column values.
def follows_trend(row):
    trend = None
    if row[("col_5" < "col_6" < "col_4" < "col_1" < "col_2" < "col_3")]:
        trend = True
    else:
        trend = False
    return trend
I'll apply it like this
df_trend = df.apply(follows_trend, axis=1)
When I do, it returns all True when there are clearly some rows that should return False. I'm not sure if there is something wrong with the inequality I used or the function itself.
The compound comparison doesn't "expand out of" the dict lookup. "col_5" < "col_6" < "col_4" < "col_1" < "col_2" < "col_3" is evaluated first, producing False because those strings aren't in ascending lexicographic order, so your if statement is actually if row[(False)]:. You need to do this:
if row["col_5"] < row["col_6"] < row["col_4"] < row["col_1"] < row["col_2"] < row["col_3"]:
If you have a lot of these expressions, you should probably extract this to a method that takes row and a list of the column names, and uses a loop for the comparisons (a sketch follows the snippet below). If you only have one, but want a somewhat nicer-looking version, try this:
a, b, c, d, e, f = (row[c] for c in ("col_5", "col_6", "col_4", "col_1", "col_2", "col_3"))
if a < b < c < d < e < f:
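And a minimal sketch of the loop-based helper mentioned above (the function name and keyword passing are illustrative, not from the original):
def follows_ordering(row, cols):
    # True iff row[cols[0]] < row[cols[1]] < ... < row[cols[-1]]
    return all(row[a] < row[b] for a, b in zip(cols, cols[1:]))

df_trend = df.apply(follows_ordering, axis=1,
                    cols=["col_5", "col_6", "col_4", "col_1", "col_2", "col_3"])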
Also, you can reorder the column names, use the diff function to check the difference along the rows, and compare the result with 0:
(df[["col_5", "col_6", "col_4", "col_1", "col_2", "col_3"]]
   .diff(axis=1).drop(columns='col_5').gt(0).all(1))
Example:
import pandas as pd
df = pd.DataFrame({"A": [1,2], "B": [3,1], "C": [4,2]})
df
# A B C
#0 1 3 4
#1 2 1 2
df.diff(axis=1).drop(columns='A').gt(0).all(1)
#0 True
#1 False
#dtype: bool
You could use query for this. See the example below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
print(df)
print(df.query('col2 > col3 > col1'))  # query can accept a string with chained comparisons
This results in:
col1 col2 col3
0 -0.788909 1.591521 1.709402
1 -1.563310 1.188993 2.295683
2 -1.572323 -0.600015 -1.518411
3 1.786051 0.303291 -0.344720
4 0.756029 -0.393941 1.059874
col1 col2 col3
2 -1.572323 -0.600015 -1.518411
