I have a df like:
df = pd.DataFrame({'Temp' : ['ko1234', 'ko1234|ko445|ko568', 'map123', 'ko895', 'map123|ko889|ko665', 'ko635|map789|map777', 'ko985']})
(out) >>>
ko1234
ko1234|ko445|ko568
map123
ko895
map123|ko889|ko665
ko635|map789|map777
ko985
I need two things:
I want to keep only the words starting with ko, while keeping the rows in place (the row that is only map123 becomes an empty string), so:
ko1234
ko1234|ko445|ko568
ko895
ko889|ko665
ko635
ko985
In another case, I would like to do this:
if there is only one word keep it
if there are more words divided by a "|" keep only the second one, so:
ko1234
ko445
map123
ko895
ko889
map789
ko985
What is the best way to do this?
Here is how to do it using .apply (or .transform; the result will be the same).
The functions are applied to each element of the Series lists, which contains, for every row, the list of words that were separated by "|" in the Temp column:
lists = df['Temp'].str.split('|')
def starting_with_ko(lst):
    # keep only the words that start with 'ko'; empty string if none do
    ko = [word for word in lst if word.startswith('ko')]
    return '|'.join(ko) if ko else ''

def choose_element(lst):
    # a single word is kept as-is; otherwise take the second word
    if len(lst) == 1:
        return lst[0]
    else:
        return lst[1]
out1 = lists.apply(starting_with_ko)
out2 = lists.apply(choose_element)
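For reference, this is what the intermediate lists Series looks like on the sample data:
>>> lists
0                    [ko1234]
1      [ko1234, ko445, ko568]
2                    [map123]
3                     [ko895]
4      [map123, ko889, ko665]
5     [ko635, map789, map777]
6                     [ko985]
Name: Temp, dtype: object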
Results:
>>> out1
0 ko1234
1 ko1234|ko445|ko568
2
3 ko895
4 ko889|ko665
5 ko635
6 ko985
dtype: object
>>> out2
0 ko1234
1 ko445
2 map123
3 ko895
4 ko889
5 map789
6 ko985
dtype: object
We can do split, then explode, and remove the unwanted items with startswith:
s = df['Temp']
out = s.str.split('|').explode().str.strip()
# keep only the 'ko' words and glue each row back together; reindex restores rows with no match
out1 = out[out.str.startswith('ko')].groupby(level=0).agg('|'.join).reindex(s.index)
# .str[1] is NaN for single-word rows, so fillna falls back to the original value
out2 = s.str.split('|').str[1].fillna(s)
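For reference, the exploded intermediate repeats the original index, which is what lets groupby(level=0) stitch each row back together:
>>> s.str.split('|').explode().head(7)
0    ko1234
1    ko1234
1     ko445
1     ko568
2    map123
3     ko895
4    map123
Name: Temp, dtype: object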
So I have a data frame like this:
FileName
01011RT0TU7
11041NT4TU8
51391RST0U2
01011645RT0TU9
11311455TX0TU8
51041545ST3TU9
What I want is another column in the DataFrame like this:
FileName |RdwyId
01011RT0TU7 |01011000
11041NT4TU8 |11041000
51391RST0U2 |51391000
01011645RT0TU9|01011645
11311455TX0TU8|11311455
51041545ST3TU9|51041545
Essentially, if the first 8 characters are digits, move them to the RdwyId column as-is; if only the first 5 characters are digits, concatenate them with "000".
I am a noob, so I have been playing with this:
Test 1:
rdwyre1=re.compile(r'\d\d\d\d\d')
rdwyre2=re.compile(r'\d\d\d\d\d\d\d\d')
rdwy1=rdwyre1.findall(str(thous["FileName"]))
rdwy2=rdwyre2.findall(str(thous["FileName"]))
thous["RdwyId"]=re.sub(r'\d\d\d\d\d', str(thous["FileName"].loc[:4])+"000",thous["FileName"])
Test 2:
thous["RdwyId"]=np.select(
[
re.search(r'\d\d\d\d\d',thous["FileName"])!="None",
rdwyre2.findall(str(thous["FileName"]))!="None"
],
[
rdwyre1.findall(str(thous["FileName"]))+"000",
rdwyre2.findall(str(thous["FileName"])),
],
default="Unknown"
)
Test 3:
thous=thous.assign(RdwyID=lambda x: str(rdwyre1.search(x).group())+"000" if bool(rdwyre1.search(x))==True else str(rdwyre2.search(x).group()))
None of the above have worked. Could anyone help me figure out where I am going wrong, and how to fix it?
You can use numpy select, which replicates CASE WHEN for multiple conditions, and Pandas' str.isnumeric method:
import numpy as np

cond1 = df.FileName.str[:8].str.isnumeric()  # first condition: 8 leading digits
choice1 = df.FileName.str[:8]                # result if first condition is met
cond2 = df.FileName.str[:5].str.isnumeric()  # second condition: 5 leading digits
choice2 = df.FileName.str[:5] + "000"        # result if second condition is met
condlist = [cond1, cond2]
choicelist = [choice1, choice2]
df.loc[:, "RdwyId"] = np.select(condlist, choicelist)
df
FileName RdwyId
0 01011RT0TU7 01011000
1 11041NT4TU8 11041000
2 51391RST0U2 51391000
3 01011645RT0TU9 01011645
4 11311455TX0TU8 11311455
5 51041545ST3TU9 51041545
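Note that np.select evaluates the conditions in order (hence the 8-digit test comes first, since an 8-digit prefix also passes the 5-digit test) and falls back to 0 where no condition matches; if your full data can contain such rows (an assumption beyond this sample), pass an explicit default:
df.loc[:, "RdwyId"] = np.select(condlist, choicelist, default="Unknown")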
def filt(list1):
    # list1: an iterable of file names, e.g. thous["FileName"]
    for i in list1:
        if i[:8].isdigit():
            print(i[:8])
        else:
            print(i[:5] + "000")
# output
01011000
11041000
51391000
01011645
11311455
51041545
If your case is very specific, you can tweak it and apply it to your dataframe. Applied to a DataFrame:
def filt(i):
    if i[:8].isdigit():
        return i[:8]
    else:
        return i[:5] + "000"

d = pd.DataFrame({"names": list_1})  # list_1: the raw file names
d["filtered"] = d.names.apply(lambda x: filt(x))  # d.names.apply(filt) also works; I'm just used to lambdas
#output
names filtered
0 01011RT0TU7 01011000
1 11041NT4TU8 11041000
2 51391RST0U2 51391000
3 01011645RT0TU9 01011645
4 11311455TX0TU8 11311455
5 51041545ST3TU9 51041545
Using regex:
import re

c1 = re.compile(r'\d{5}')
c2 = re.compile(r'\d{8}')
rdwyId = []
for f in thous['FileName']:
    m = re.match(c2, f)  # try the 8-digit prefix first
    if m:
        rdwyId.append(m[0])
        continue
    m = re.match(c1, f)  # fall back to the 5-digit prefix
    if m:
        rdwyId.append(m[0] + "000")
thous['RdwyId'] = rdwyId
Edit: replaced re.search with re.match as it's more efficient, since we are only looking for matches at the beginning of the string.
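A quick illustration of the difference:
>>> import re
>>> re.search(r'\d{5}', 'XX01011RT')  # scans anywhere in the string
<re.Match object; span=(2, 7), match='01011'>
>>> re.match(r'\d{5}', 'XX01011RT')   # anchored at the start, so no match here
>>>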
Let us try findall with ljust
df['new'] = df.FileName.str.findall(r"(\d+)[A-Za-z]").str[0].str.ljust(8, '0')
Out[226]:
0 01011000
1 11041000
2 51391000
3 01011645
4 11311455
5 51041545
Name: FileName, dtype: object
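For reference, the findall step alone extracts the leading digit run (the capture group stops at the first letter), and ljust then right-pads it to width 8 with '0':
>>> df.FileName.str.findall(r"(\d+)[A-Za-z]").str[0]
0       01011
1       11041
2       51391
3    01011645
4    11311455
5    51041545
Name: FileName, dtype: object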
I have a pandas dataframe as below
import pandas as pd
df = pd.DataFrame({'col':['abcfg_grp_202005', 'abcmn_abc_202009', 'abcgd_xyz_8976', 'abcgd_lmn_1']})
df
col
0 abcfg_grp_202005
1 abcmn_abc_202009
2 abcgd_xyz_8976
3 abcgd_lmn_1
I want to keep only the first token before "_" in "col". If the third token (after splitting on "_") is a single digit, append it to the end of "col", as below:
col
0 abcfg
1 abcmn
2 abcgd
3 abcgd_1
You can use df.apply:
In [1441]: df['col'] = df.col.str.split('_', expand=True).apply(lambda x: (x[0] + '_' + x[2]) if len(x[2]) == 1 else x[0], axis=1)
In [1442]: df
Out[1442]:
col
0 abcfg
1 abcmn
2 abcgd
3 abcgd_1
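For reference, the expand=True split (run on the original column, before the assignment above) produces the frame that x[0] and x[2] index into:
df.col.str.split('_', expand=True)
       0    1       2
0  abcfg  grp  202005
1  abcmn  abc  202009
2  abcgd  xyz    8976
3  abcgd  lmn       1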
Split on the underscores, then add the strings. Here we can use the trick that False multiplied by a string returns the empty string to deal with the conditional addition. The condition checks that the third piece is a one-character string that is a digit.
df1 = df['col'].str.split('_', expand=True)
df['col'] = df1[0] + ('_' + df1[2])*(df1[2].str.len().eq(1) & df1[2].str.isdigit())
print(df)
col
0 abcfg
1 abcmn
2 abcgd
3 abcgd_1
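The multiplication trick works because bool is a subclass of int in Python, so a string times a boolean repeats it once or zero times:
>>> True * '_1'
'_1'
>>> False * '_1'
''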
You can apply a custom function.
import pandas as pd
df = pd.DataFrame({'col':['abcfg_grp_202005', 'abcmn_abc_202009', 'abcgd_xyz_8976', 'abcgd_lmn_1']})
def func(x):
    ar = x.split('_')
    if len(ar[2]) == 1 and ar[2].isdigit():
        return ar[0] + "_" + ar[2]
    else:
        return ar[0]
df['col'] = df['col'].apply(lambda x: func(x))
df
col
0 abcfg
1 abcmn
2 abcgd
3 abcgd_1
Here's another way to do it:
import numpy as np

df['col'] = np.where(df['col'].str.contains(r'[a-zA-Z0-9]+_[a-zA-Z0-9]+_[0-9]\b', regex=True),
                     df['col'].str.split('_').str[0] + '_' + df['col'].str.split('_').str[2],
                     df['col'].str.split('_').str[0])
print(df)
col
0 abcfg
1 abcmn
2 abcgd
3 abcgd_1
Perhaps not the most elegant answer, but I would recommend using str.replace twice here:
df["col"]= df["Team"]
.str.replace("^([^_]+)_.*?(?!_\d$).{2}$", "\\1")
.str.replace("_[^_]+(?=_)", "")
The first regex targets inputs of the form abcfg_grp_202005 which do not end in underscore followed by a digit. In this case, we would be left with abcfg. The second regex removes the middle underscore term, should it still exist, which would only be true for inputs like abcgd_lmn_1 ending in underscore followed by a digit.
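A quick sanity check of the two patterns with plain re (same semantics as str.replace with regex=True):
>>> import re
>>> re.sub(r"^([^_]+)_.*?(?!_\d$).{2}$", r"\1", "abcfg_grp_202005")
'abcfg'
>>> re.sub(r"^([^_]+)_.*?(?!_\d$).{2}$", r"\1", "abcgd_lmn_1")  # no match: ends in _digit
'abcgd_lmn_1'
>>> re.sub(r"_[^_]+(?=_)", "", "abcgd_lmn_1")
'abcgd_1'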
Try this
for i in range(len(df)):
    x = df.loc[i, "col"].split('_')
    if len(x[2]) == 1:
        df.loc[i, "col"] = x[0] + "_" + x[2]
    else:
        df.loc[i, "col"] = x[0]
Split the data, then check the length of the value at index 2. If its length is 1, the column value becomes x[0] + "_" + x[2]; otherwise it is just x[0].
I wrote a function, then used the .apply() method to apply it to each value.
def editcols(col_value):
    splitted_col_value = col_value.split('_')
    if len(splitted_col_value[2]) == 1:
        return f'{splitted_col_value[0]}_{splitted_col_value[2]}'
    else:
        return splitted_col_value[0]
df['col'] = df['col'].apply(editcols)
I hope it is clear; please let me know if it worked.
I am trying to find a way to search for substrings in strings for a problem like this
findin = pd.Series({1:'abcab', 2: 'abab',3: 'abcdaa', 4:'cabca'})
what = pd.Series({1:'b',2: 'a',3: 'bc',4: 'abc'})
where "what" is what I am seeking and "findin" is the values I want to search
I would like the output to be something like
1 4
0 2
1
1
Every method I have tried chokes on the varying number of values that come back; for example, I keep getting "Data must be 1-dimensional" with approaches like
list(map(lambda x, y: x.find(y), findin, what))
I feel like expand needs to be here, but where does it go?
You can use a regex in a function and apply it on the findin Series:
import re

c = iter(range(1, 5))
def func(x):
    ind = next(c)  # step through what's keys in lockstep with findin's rows
    return [i.start() for i in re.finditer(what[ind], x)]
findin.apply(func)
Out:
1 [1, 4]
2 [0, 2]
3 [1]
4 [1]
dtype: object
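The external counter is fragile (anything that calls func more than four times exhausts the iterator). A minimal alternative sketch, assuming findin and what share the same keys, zips the two Series directly:
pd.Series([[m.start() for m in re.finditer(w, f)] for f, w in zip(findin, what)],
          index=findin.index)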
I am using the Pandas library within Python and I am trying to pad every value in a text column to the same length. I want to do this by appending a specific character (normally this will be whitespace; in this example I will use "_") until each value reaches the maximum length in that column.
For example:
Col1_Before
A
B
A1R
B2
AABB4
Col1_After
A____
B____
A1R__
B2___
AABB4
So far I have got this far (using the above table as the example); it is the next part, actually appending the characters, that I am stuck on.
df['Col1_Max'] = df.Col1.map(lambda x: len(x)).max()
df['Col1_Len'] = df.Col1.map(lambda x: len(x))
df['Difference_Len'] = df['Col1_Max'] - df['Col1_Len']
I may have not explained myself well as I am still learning. If this is confusing let me know and I will clarify.
consider the pd.Series s
s = pd.Series(['A', 'B', 'A1R', 'B2', 'AABB4'])
solution
use str.ljust
m = s.str.len().max()
s.str.ljust(m, '_')
0 A____
1 B____
2 A1R__
3 B2___
4 AABB4
dtype: object
for your case
m = df.Col1.str.len().max()
df.Col1 = df.Col1.str.ljust(m, '_')
It isn't the most pandas-like solution, but you can try the following:
col = np.array(["A", "B", "A1R", "B2", "AABB4"])
data = pd.DataFrame(col, columns=["Before"])
Now compute the maximum length, the list of individual lengths, and the differences:
max_ = data.Before.map(lambda x: len(x)).max()
lengths_ = data.Before.map(lambda x: len(x))
diffs_ = max_ - lengths_
Create a new column called After adding the underscores, or any other character:
data["After"] = data["Before"] + ["_"*i for i in diffs_]
All this gives:
  Before  After
0      A  A____
1      B  B____
2    A1R  A1R__
3     B2  B2___
4  AABB4  AABB4
Without creating extra columns:
In [63]: data
Out[63]:
Col1
0 A
1 B
2 A1R
3 B2
4 AABB4
In [64]: max_length = data.Col1.map(len).max()
In [65]: data.Col1 = data.Col1.apply(lambda x: x + '_'*(max_length - len(x)))
In [66]: data
Out[66]:
Col1
0 A____
1 B____
2 A1R__
3 B2___
4 AABB4
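For what it's worth, the same padding can be done in one vectorized line with str.pad, the generalization of the str.ljust approach shown in the earlier answer:
data.Col1 = data.Col1.str.pad(max_length, side='right', fillchar='_')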
I can't figure out how to apply a simple function to every row of a column in a Pandas data frame.
Example:
def delLastThree(x):
    x = x.strip()
    x = x[:-3]
    return x
arr = ['test123','test234','test453']
arrDF = pandas.DataFrame(arr)
arrDF.columns = ['colOne']
arrDF['colOne'].apply(delLastThree)
print arrDF
I would expect the code above to return 'test' for every row. Instead it prints the original values.
How do I apply the delLastThree function to every row in the DF?
You are creating a pd.Series when selecting with single brackets, df['colOne'].
You can use .apply(func, axis=1) on a DataFrame, i.e. either when selecting with [['colOne']] or without selecting any columns. However, with axis=1 the function receives each row as a pd.Series, so string operations need the .str accessor.
With the pd.Series resulting from selecting with ['colOne'], you can use either .apply() or .map().
def delLastThree_series(x):
    # x is a single string
    x = x.strip()
    x = x[:-3]
    return x

def delLastThree_df(x):
    # x is a row (a pd.Series), hence the .str accessor
    x = x.str.strip()
    x = x.str[:-3]
    return x
arr = ['test123','test234','test453']
arrDF = pd.DataFrame(arr)
arrDF.columns = ['colOne']
Now use either
arrDF.apply(delLastThree_df, axis=1)
arrDF[['colOne']].apply(delLastThree_df, axis=1)
or
arrDF['colOne'].apply(delLastThree_series)
arrDF['colOne'].map(delLastThree_series)
to get:
colOne
0 test
1 test
2 test
You could of course also just:
arrDF['colOne'].str.strip().str[:-3]
Use the map() function for a Series (a single column):
In [15]: arrDF['colOne'].map(delLastThree)
Out[15]:
0 test
1 test
2 test
Name: colOne, dtype: object
or if you want to change it:
In [16]: arrDF['colOne'] = arrDF['colOne'].map(delLastThree)
In [17]: arrDF
Out[17]:
colOne
0 test
1 test
2 test
but as @Stefan said, this will be much faster, more efficient, and more "Pandonic":
arrDF['colOne'] = arrDF['colOne'].str.strip().str[:-3]
or if you want to strip all trailing spaces and numbers:
arrDF['colOne'] = arrDF['colOne'].str.replace(r'[\s\d]+$', '', regex=True)
test:
In [21]: arrDF['colOne'].str.replace(r'[\s\d]+$', '', regex=True)
Out[21]:
0 test
1 test
2 test
Name: colOne, dtype: object