CASE statement in Python based on Regex

So I have a data frame like this:
FileName
01011RT0TU7
11041NT4TU8
51391RST0U2
01011645RT0TU9
11311455TX0TU8
51041545ST3TU9
What I want is another column in the DataFrame like this:
FileName |RdwyId
01011RT0TU7 |01011000
11041NT4TU8 |11041000
51391RST0U2 |51391000
01011645RT0TU9|01011645
11311455TX0TU8|11311455
51041545ST3TU9|51041545
Essentially: if only the first 5 characters are digits, concatenate them with "000"; if the first 8 characters are digits, simply move them to the RdwyId column.
I am a noob, so I have been playing with this:
Test 1:
rdwyre1=re.compile(r'\d\d\d\d\d')
rdwyre2=re.compile(r'\d\d\d\d\d\d\d\d')
rdwy1=rdwyre1.findall(str(thous["FileName"]))
rdwy2=rdwyre2.findall(str(thous["FileName"]))
thous["RdwyId"]=re.sub(r'\d\d\d\d\d', str(thous["FileName"].loc[:4])+"000",thous["FileName"])
Test 2:
thous["RdwyId"]=np.select(
[
re.search(r'\d\d\d\d\d',thous["FileName"])!="None",
rdwyre2.findall(str(thous["FileName"]))!="None"
],
[
rdwyre1.findall(str(thous["FileName"]))+"000",
rdwyre2.findall(str(thous["FileName"])),
],
default="Unknown"
)
Test 3:
thous=thous.assign(RdwyID=lambda x: str(rdwyre1.search(x).group())+"000" if bool(rdwyre1.search(x))==True else str(rdwyre2.search(x).group()))
None of the above have worked. Could anyone help me figure out where I am going wrong, and how to fix it?

You can use numpy select, which replicates CASE WHEN for multiple conditions, and Pandas' str.isnumeric method:
cond1 = df.FileName.str[:8].str.isnumeric() # first condition
choice1 = df.FileName.str[:8] # result if first condition is met
cond2 = df.FileName.str[:5].str.isnumeric() # second condition
choice2 = df.FileName.str[:5] + "000" # result if second condition is met
condlist = [cond1, cond2]
choicelist = [choice1, choice2]
df.loc[:, "RdwyId"] = np.select(condlist, choicelist)
df
FileName RdwyId
0 01011RT0TU7 01011000
1 11041NT4TU8 11041000
2 51391RST0U2 51391000
3 01011645RT0TU9 01011645
4 11311455TX0TU8 11311455
5 51041545ST3TU9 51041545
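As a side note, np.select returns 0 for rows that match no condition unless you pass default; a self-contained version of the approach above with an explicit fallback (the "Unknown" label is just illustrative) might look like:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"FileName": ["01011RT0TU7", "11041NT4TU8", "51391RST0U2",
                                "01011645RT0TU9", "11311455TX0TU8", "51041545ST3TU9"]})

condlist = [
    df.FileName.str[:8].str.isnumeric(),   # first 8 chars all digits
    df.FileName.str[:5].str.isnumeric(),   # first 5 chars all digits
]
choicelist = [
    df.FileName.str[:8],                   # use the 8 digits as-is
    df.FileName.str[:5] + "000",           # pad the 5 digits with "000"
]
# conditions are evaluated in order; "Unknown" marks rows matching neither
df["RdwyId"] = np.select(condlist, choicelist, default="Unknown")
print(df["RdwyId"].tolist())
```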

def filt(list1):
    for i in list1:
        if i[:8].isdigit():
            print(i[:8])
        else:
            print(i[:5]+"000")
# output
01011000
11041000
51391000
01011645
11311455
51041545
I mean, if your case is very specific, you can tweak it and apply it to your dataframe.
Applied to a DataFrame:
def filt(i):
    if i[:8].isdigit():
        return i[:8]
    else:
        return i[:5]+"000"
d = pd.DataFrame({"names": list_1})
d["filtered"] = d.names.apply(lambda x: filt(x))  # .apply(filt) also works; I'm just used to lambdas
#output
names filtered
0 01011RT0TU7 01011000
1 11041NT4TU8 11041000
2 51391RST0U2 51391000
3 01011645RT0TU9 01011645
4 11311455TX0TU8 11311455
5 51041545ST3TU9 51041545

Using regex:
c1 = re.compile(r'\d{5}')
c2 = re.compile(r'\d{8}')
rdwyId = []
for f in thous['FileName']:
    m = re.match(c2, f)
    if m:
        rdwyId.append(m[0])
        continue
    m = re.match(c1, f)
    if m:
        rdwyId.append(m[0] + "000")
thous['RdwyId'] = rdwyId
Edit: replaced re.search with re.match as it's more efficient, since we are only looking for matches at the beginning of the string.
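For what it's worth, the same anchored-match logic can also be vectorized with str.extract (a sketch, not part of the original answer; the ^ anchor reproduces re.match behavior):

```python
import pandas as pd

thous = pd.DataFrame({"FileName": ["01011RT0TU7", "11041NT4TU8", "51391RST0U2",
                                   "01011645RT0TU9", "11311455TX0TU8", "51041545ST3TU9"]})

# try 8 leading digits first; fall back to 5 leading digits padded with "000"
eight = thous["FileName"].str.extract(r"^(\d{8})")[0]
five = thous["FileName"].str.extract(r"^(\d{5})")[0] + "000"
thous["RdwyId"] = eight.fillna(five)
```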

Let us try findall with ljust (using [A-Za-z] rather than [A-z], which would also match the punctuation characters that sit between the two letter blocks in ASCII):
df['new'] = df.FileName.str.findall(r"(\d+)[A-Za-z]").str[0].str.ljust(8,'0')
Out[226]:
0 01011000
1 11041000
2 51391000
3 01011645
4 11311455
5 51041545
Name: FileName, dtype: object

Related

Keep only some elements in a column of string

I have a df like:
df = pd.DataFrame({'Temp' : ['ko1234', 'ko1234|ko445|ko568', 'map123', 'ko895', 'map123|ko889|ko665', 'ko635|map789|map777', 'ko985']})
(out) >>>
ko1234
ko1234|ko445|ko568
map123
ko895
map123|ko889|ko665
ko635|map789|map777
ko985
I need two things:
I want to keep only the words starting with ko, but keep the remaining rows (leaving an empty entry where nothing matches), so:
ko1234
ko1234|ko445|ko568
ko895
ko889|ko665
ko635
ko985
In another case I would like to do this:
if there is only one word keep it
if there are more words divided by a "|" keep only the second one, so:
ko1234
ko445
map123
ko895
ko889
map789
ko985
What is the best way to do this?
Here is how to do it using .apply (or .transform - the result will be the same).
The functions are applied to each element of the Series lists, which contains a list of words (that were separated by "|" in the column Temp):
lists = df['Temp'].str.split('|')
def starting_with_ko(lst):
    ko = [word for word in lst if word.startswith('ko')]
    return '|'.join(ko) if ko else ''

def choose_element(lst):
    if len(lst) == 1:
        return lst[0]
    else:
        return lst[1]
out1 = lists.apply(starting_with_ko)
out2 = lists.apply(choose_element)
Results:
>>> out1
0 ko1234
1 ko1234|ko445|ko568
2
3 ko895
4 ko889|ko665
5 ko635
6 ko985
dtype: object
>>> out2
0 ko1234
1 ko445
2 map123
3 ko895
4 ko889
5 map789
6 ko985
dtype: object
We can do split then explode and remove the unwanted items with startswith (where s refers to the Temp column, i.e. s = df['Temp']):
out = s.str.split('|').explode().str.strip()
out1 = out[out.str.startswith('ko')].groupby(level=0).agg('|'.join).reindex(s.index)
out2 = s.str.split('|').str[1].fillna(s)
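For reference, the split/explode approach above runs end to end like this (assuming s is the Temp column from the question):

```python
import pandas as pd

df = pd.DataFrame({'Temp': ['ko1234', 'ko1234|ko445|ko568', 'map123', 'ko895',
                            'map123|ko889|ko665', 'ko635|map789|map777', 'ko985']})
s = df['Temp']

# explode the split lists to one row per word, keep ko-words, rejoin per original row
out = s.str.split('|').explode().str.strip()
out1 = out[out.str.startswith('ko')].groupby(level=0).agg('|'.join).reindex(s.index)

# second task: single word -> keep it; several words -> keep the second
out2 = s.str.split('|').str[1].fillna(s)
```

Note that rows with no ko-word come back as NaN after the reindex, where the .apply version returned an empty string.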

Replace a pandas column by splitting the text based on "_"

I have a pandas dataframe as below
import pandas as pd
df = pd.DataFrame({'col':['abcfg_grp_202005', 'abcmn_abc_202009', 'abcgd_xyz_8976', 'abcgd_lmn_1']})
df
col
0 abcfg_grp_202005
1 abcmn_abc_202009
2 abcgd_xyz_8976
3 abcgd_lmn_1
I want to replace 'col' with the first token before "_" in "col". If the third token after splitting on "_" is a single digit, then append it to the end of "col", as below:
col
0 abcfg
1 abcmn
2 abcgd
3 abcgd_1
You can use df.apply:
In [1441]: df['col'] = df.col.str.split('_', expand=True).apply(lambda x: (x[0] + '_' + x[2]) if len(x[2]) == 1 else x[0], axis=1)
In [1442]: df
Out[1442]:
col
0 abcfg
1 abcmn
2 abcgd
3 abcgd_1
Split on the underscores, then add the strings. Here we can use the trick that False multiplied by a string returns the empty string to deal with the conditional addition. The check is a 1 character string that is a digit.
df1 = df['col'].str.split('_', expand=True)
df['col'] = df1[0] + ('_' + df1[2])*(df1[2].str.len().eq(1) & df1[2].str.isdigit())
print(df)
col
0 abcfg
1 abcmn
2 abcgd
3 abcgd_1
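The multiplication trick can be checked in isolation: bool is a subclass of int in Python, so multiplying a string by a boolean repeats it zero or one times:

```python
# bool is an int subclass: True == 1, False == 0
suffix = "_1"
print(suffix * True)   # repeated once  -> "_1"
print(suffix * False)  # repeated zero times -> ""
```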
You can apply a custom function.
import pandas as pd
df = pd.DataFrame({'col':['abcfg_grp_202005', 'abcmn_abc_202009', 'abcgd_xyz_8976', 'abcgd_lmn_1']})
def func(x):
    ar = x.split('_')
    if len(ar[2]) == 1 and ar[2].isdigit():
        return ar[0]+"_"+ar[2]
    else:
        return ar[0]
df['col'] = df['col'].apply(lambda x: func(x))
df
col
0 abcfg
1 abcmn
2 abcgd
3 abcgd_1
Here's another way to do it:
df['col'] = np.where(df['col'].str.contains(r'[a-zA-Z0-9]+_[a-zA-Z0-9]+_[0-9]\b', regex=True),
df['col'].str.split('_').str[0] + '_' + df['col'].str.split('_').str[2],
df['col'].str.split('_').str[0])
print(df)
col
0 abcfg
1 abcmn
2 abcgd
3 abcgd_1
Perhaps not the most elegant answer, but I would recommend using str.replace twice here:
df["col"] = (df["col"]
    .str.replace(r"^([^_]+)_.*?(?!_\d$).{2}$", r"\1", regex=True)
    .str.replace(r"_[^_]+(?=_)", "", regex=True))
The first regex targets inputs of the form abcfg_grp_202005 which do not end in underscore followed by a digit. In this case, we would be left with abcfg. The second regex removes the middle underscore term, should it still exist, which would only be true for inputs like abcgd_lmn_1 ending in underscore followed by a digit.
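Run end to end on the sample data (assuming the source column is named col), the two replacements give the expected result:

```python
import pandas as pd

df = pd.DataFrame({'col': ['abcfg_grp_202005', 'abcmn_abc_202009',
                           'abcgd_xyz_8976', 'abcgd_lmn_1']})

# step 1: values NOT ending in "_<digit>" collapse to the first token
# step 2: drop the middle "_xxx" term from any value that survived step 1
out = (df['col']
       .str.replace(r'^([^_]+)_.*?(?!_\d$).{2}$', r'\1', regex=True)
       .str.replace(r'_[^_]+(?=_)', '', regex=True))
print(out.tolist())
```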
Try this
for i in range(len(df)):
    x = df.loc[i,"col"].split('_')
    if(len(x[2])==1):
        df.loc[i,"col"] = x[0]+"_"+x[2]
    else:
        df.loc[i,"col"] = x[0]
Split the data, then check the length of the value at index 2. If its length is 1, the result is splitted[0] + "_" + splitted[2]; otherwise it is just splitted[0].
I wrote a function, then used the built-in .apply() method to apply it to each value.
def editcols(col_value):
    splitted_col_value = col_value.split('_')
    if len(splitted_col_value[2])==1:
        return f'{splitted_col_value[0]}_{splitted_col_value[2]}'
    else:
        return splitted_col_value[0]
df['col'] = df['col'].apply(editcols)
I hope it is clear. Please let me know if it worked

Multi String search in Data frame in Multiple column with AND or OR Option

I can do a single-word search in each column, but I am unable to search a user-provided number of strings with an "and"/"or" option.
0 1 2 3
0 [OH-] [Na+] NAN CCO
1 [OH-] [Na+] CCO Cl
This one works
search = 'CCO'
df.loc[df.isin([search]).any(axis=1)].index.tolist()
For multi-search I tried
import re
terms = ['C1', 'CCO']
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df[df['col'].str.contains(p)]
Gives me KeyError: 'col'
Expected output
Search='C1' AND '[Na+]'
Results 1
Search='CCO' OR 'C1'
Results 0 1
I created your dataframe this way:
df = pd.DataFrame( { 0 : ["[OH-]","[Na+]","NAN","CCO" ], 1 : ["[OH-]","[Na+]","CCO","Cl"] } ).transpose()
Yielding this df:
0 1 2 3
0 [OH-] [Na+] NAN CCO
1 [OH-] [Na+] CCO Cl
I observed that you can do your OR logic with the isin() function on the df:
df.isin(['CCO','C1'])
Yields:
0 1 2 3
0 False False False True
1 False False True False
And so you can figure out which rows match using any(axis=1) as you are doing:
df[df.isin(['CCO','C1']).any(axis=1)].index.tolist()
Yields:
[0, 1]
Logic for AND:
The snippet below looks for each term individually and accumulates them in the results dataframe. After finding matching columns, the number of matches in each row is checked to see if it matches the number of terms.
results = pd.DataFrame()
terms = [ 'Cl', '[Na+]' ]
for term in terms:
    if results.empty:
        results = df.isin( [ term ] )
    else:
        results |= df.isin( [ term ] )
results['count'] = results.sum(axis=1)
print( results[ results['count'] == len( terms ) ].index.tolist() )
I know there is a better way - but this way works (I think)
The above code yields [1] for terms = [ 'Cl', '[Na+]' ] and [0,1] for terms = [ "[OH-]","[Na+]" ] .
The KeyError occurs because there isn't a column named col. Try this:
df[df.apply(lambda col: col.str.contains(p)).any(axis=1)]
col is now the name of an input parameter to the lambda.
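Putting both modes together, here is a self-contained sketch (the search_rows helper and its mode flag are illustrative names, not from the original answers):

```python
import pandas as pd

df = pd.DataFrame({0: ["[OH-]", "[Na+]", "NAN", "CCO"],
                   1: ["[OH-]", "[Na+]", "CCO", "Cl"]}).transpose()

def search_rows(df, terms, mode="or"):
    # one boolean Series per term: True where any cell in the row equals that term
    hits = [df.isin([term]).any(axis=1) for term in terms]
    combined = hits[0]
    for h in hits[1:]:
        combined = (combined | h) if mode == "or" else (combined & h)
    return df[combined].index.tolist()

print(search_rows(df, ["CCO", "C1"], mode="or"))     # rows containing either term
print(search_rows(df, ["Cl", "[Na+]"], mode="and"))  # rows containing both terms
```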

Retrieve certain value located in dataframe in any row or column and keep it in separate column without forloop

I have a dataframe like below
df
            A          B          C
0           0          1  TRANSIT_1
1   TRANSIT_3       None       None
2           0  TRANSIT_5       None
And I want to change it to below:
Resulting DF
            A          B          C          D
0           0          1  TRANSIT_1  TRANSIT_1
1   TRANSIT_3       None       None  TRANSIT_3
2           0  TRANSIT_5       None  TRANSIT_5
So I tried to use str.contains, and once I receive the series with True or False, I put it in an eval function to somehow get me the table I want.
Code I tried:
series_index = pd.DataFrame()
series_index = df.columns.str.contains("^TRANSIT_", case=True, regex=True)
print(type(series_index))
series_index.index[series_index].tolist()
I thought to use eval function to write it to separate column,like
df = eval(df[result]=the index) # I don't know, but the eval function does evaluation and puts it in a separate column
I couldn't find a simple one-liner, but this works:
idx = list(df1[df1.where(df1.applymap(lambda x: 'TRA' in x if isinstance(x, str) else False)).notnull()].stack().index)
a, b = [], []
for sublist in idx:
    a.append(sublist[0])
    b.append(sublist[1])
df1['ans'] = df1.lookup(a,b)
Output
A B C ans
0 0 1 TRANSIT_1 TRANSIT_1
1 TRANSIT_3 None None TRANSIT_3
2 0 TRANSIT_5 None TRANSIT_5
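One caveat: DataFrame.lookup was deprecated and later removed in pandas 2.0, so on newer versions an equivalent result can be sketched with where and stack (a possible alternative, not the original answer; the sample frame is reconstructed from the output above):

```python
import pandas as pd

# reconstructed sample frame (values taken from the question's output)
df1 = pd.DataFrame({'A': [0, 'TRANSIT_3', 0],
                    'B': [1, None, 'TRANSIT_5'],
                    'C': ['TRANSIT_1', None, None]})

# True wherever a cell is a TRANSIT_* string
mask = df1.apply(lambda col: col.map(
    lambda x: isinstance(x, str) and x.startswith('TRANSIT')))

# keep only matching cells, drop the rest, take the first match per row
df1['ans'] = df1.where(mask).stack().groupby(level=0).first()
```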

How to add a specific number of characters to the end of string in Pandas?

I am using the Pandas library within Python, and I am trying to pad every value in a text column so they all have the same length. I am doing this by appending a specific character (normally whitespace; in this example I will use "_") until each value reaches the maximum length in that column.
For example:
Col1_Before
A
B
A1R
B2
AABB4
Col1_After
A____
B____
A1R__
B2___
AABB4
So far I have got this far (using the above table as the example). It is the next part, actually appending the characters, that I am stuck on.
df['Col1_Max'] = df.Col1.map(lambda x: len(x)).max()
df['Col1_Len'] = df.Col1.map(lambda x: len(x))
df['Difference_Len'] = df['Col1_Max'] - df['Col1_Len']
I may have not explained myself well as I am still learning. If this is confusing let me know and I will clarify.
consider the pd.Series s
s = pd.Series(['A', 'B', 'A1R', 'B2', 'AABB4'])
solution
use str.ljust
m = s.str.len().max()
s.str.ljust(m, '_')
0 A____
1 B____
2 A1R__
3 B2___
4 AABB4
dtype: object
for your case
m = df.Col1.str.len().max()
df.Col1 = df.Col1.str.ljust(m, '_')
It isn't the most pandas-like solution, but you can try the following:
col = np.array(["A", "B", "A1R", "B2", "AABB4"])
data = pd.DataFrame(col, columns=["Before"])
Now compute the maximum length, the list of individual lengths, and the differences:
max_ = data.Before.map(lambda x: len(x)).max()
lengths_ = data.Before.map(lambda x: len(x))
diffs_ = max_ - lengths_
Create a new column called After adding the underscores, or any other character:
data["After"] = data["Before"] + ["_"*i for i in diffs_]
All this gives:
Before After
0 A A____
1 B B____
2 A1R A1R__
3 B2 B2___
4 AABB4 AABB4
Without creating extra columns:
In [63]: data
Out[63]:
Col1
0 A
1 B
2 A1R
3 B2
4 AABB4
In [64]: max_length = data.Col1.map(len).max()
In [65]: data.Col1 = data.Col1.apply(lambda x: x + '_'*(max_length - len(x)))
In [66]: data
Out[66]:
Col1
0 A____
1 B____
2 A1R__
3 B2___
4 AABB4
