replace text string in entire column after first occurrence - python

I'm trying to replace all but the first occurrence of a text string in an entire column. My specific case is replacing underscores with periods in data that looks like client_19_Aug_21_22_2022 and I need this to be client_19.Aug.21.22.2022
If I use [1], I get this error: string index out of range.
But [:1] replaces all occurrences (it doesn't skip the first one),
and [1:] inserts . after every character instead of finding and replacing _.
df1['Client'] = df1['Client'].str.replace('_'[:1],'.')

Not the simplest, but a solution:
import re
df1['Client'] = df1['Client'].apply(lambda s: re.sub(r'^(.*?)\.', r'\1_', s.replace('_', '.')))
Here, inside the lambda, we first replace every _ with a period, then re.sub turns the first period back into _. Finally, apply runs the lambda on each value of the column.
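A quick check of the inner expression on the sample value from the question (plain Python, no pandas involved):
import re

s = "client_19_Aug_21_22_2022"
# replace every underscore with a period, then turn the first period back into an underscore
print(re.sub(r'^(.*?)\.', r'\1_', s.replace('_', '.')))
# client_19.Aug.21.22.2022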

Pandas Series have a .map method that you can use to apply an arbitrary function to every row in the Series.
In your case you can write your own replace_underscores_except_first
function, looking something like:
def replace_underscores_except_first(s):
    newstring = ''
    # Some logic here to handle replacing all but first.
    # You probably want a for loop with some conditional checking.
    return newstring
and then pass that to .map like:
df1['Client'] = df1['Client'].map(replace_underscores_except_first)
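One way to fill in that skeleton, using a for loop with a flag as the comment hints (a sketch, not the original author's implementation):
def replace_underscores_except_first(s):
    newstring = ''
    seen_first = False
    for ch in s:
        if ch == '_' and seen_first:
            newstring += '.'       # every underscore after the first becomes a period
        else:
            if ch == '_':
                seen_first = True  # keep the first underscore and remember we saw it
            newstring += ch
    return newstring

df1['Client'] = df1['Client'].map(replace_underscores_except_first)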

An example using map: in the function, check whether the string contains an underscore; if it does, split on it and join all parts except the first back together with dots.
import pandas as pd

items = [
    "client_19_Aug_21_22_2022",
    "client123"
]

def replace_underscore_with_dot_except_first(s):
    if "_" in s:
        parts = s.split("_")
        return f"{parts[0]}_{'.'.join(parts[1:])}"
    return s

df1 = pd.DataFrame(items, columns=["Client"])
df1['Client'] = df1['Client'].map(replace_underscore_with_dot_except_first)
print(df1)
Output
Client
0 client_19.Aug.21.22.2022
1 client123

Related

Remove everything after second caret regex and apply to pandas dataframe column

I have a dataframe with a column that looks like this:
0 EIAB^EIAB^6
1 8W^W844^A
2 8W^W844^A
3 8W^W858^A
4 8W^W844^A
...
826136 EIAB^EIAB^6
826137 SICU^6124^A
826138 SICU^6124^A
826139 SICU^6128^A
826140 SICU^6128^A
I just want to keep everything before the second caret, e.g. 8W^W844. What regex would I use in Python? Similarly, PACU^SPAC^06 would become PACU^SPAC. And I need to apply it to the whole column.
I tried r'[\^].+$' since I thought it would take the last caret and everything after, but it didn't work.
You can negate the character class to match everything except ^ and put it in a capture group. You don't need to escape the ^ inside the character class, but you do need to escape the one outside it.
re.match(r"([^^]+\^[^^]+)", "8W^W844^A").group(1)
This is quite useful in a pandas dataframe. Assuming you want to do this on a single column you can extract the string you want with
df['col'].str.extract(r'^([^^]+\^[^^]+)', expand=False)
NOTE
Originally, I used replace, but the extract solution suggested in the comments executed in 1/4 the time.
import pandas as pd
import numpy as np
from timeit import timeit

df = pd.DataFrame({"foo": np.arange(1_000_000)})
df["bar"] = "8W^W844^A"
df2 = df.copy()

def t1():
    df.bar.str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)

def t2():
    df.bar.str.extract(r'^([^^]+\^[^^]+)', expand=False)

print("replace", timeit("t1()", globals=globals(), number=20))
print("extract", timeit("t2()", globals=globals(), number=20))
output
replace 39.73989862400049
extract 9.910304663004354
I don't think regex is really necessary here, just slice the string up to the position of the second caret:
>>> s = 'PACU^SPAC^06'
>>> s[:s.find("^", s.find("^") + 1)]
'PACU^SPAC'
Explanation: str.find accepts a second argument specifying where to start the search; place it just after the position of the first caret.
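To run this over the whole column, the same slicing can go through map (a sketch; the column name col is assumed, and every value is assumed to contain at least two carets like the sample data):
df['col'] = df['col'].map(lambda s: s[:s.find("^", s.find("^") + 1)])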

Regex lambda doesn't iterate through df - prints first row's result for all

def split_it(email):
    return re.findall(r"[^#]+#[^#]+(?:\.com|\.se|\.br|\.org)", s)
df['email_list'] = df['email'].apply(lambda x: split_it(x))
This code seems to work for the first row of the df, but then will print the result of the first row on all other rows.
Is it not iterating through all rows? Or does it print the result of row 1 on all rows?
The reason every row shows the first row's result is that split_it calls re.findall on the global variable s rather than on its email argument, so whatever s holds (apparently the first row's value) gets processed every time. In any case, you do not need apply here; use Series.str.findall directly:
df['email_list'] = df['email'].str.findall(r"[^#]+#[^#]+(?:\.com|\.se|\.br|\.org)")
If there are several emails per row, you can join the results:
df['email_list'] = df['email'].str.findall(r"[^#]+#[^#]+(?:\.com|\.se|\.br|\.org)").str.join(", ")
Note that the email pattern can be enhanced in many ways, but I would add \s into the negated character classes to exclude whitespace matching, and move \. outside the group to avoid repetition:
r"[^\s#]+#[^\s#]+\.(?:com|se|br|org)"

How can I split a string and take only one from the separated string in Python?

I have a string which includes ":"; it looks like this:
: SHOES
and I want to separate the colon from SHOES, then make a variable that contains only "SHOES".
I have split them using df.split(':'), but then how should I create a variable with "SHOES" only?
You can take the last element of the split result, and then use lstrip and rstrip to remove the excess spaces before and after the word.
df=": shoes"
d=df.split(":")[-1].lstrip().rstrip()
print(d)
You can use the 'apply' method to loop over the whole dataset and split each value of the column with 'split()'.
This is an example:
import pandas as pd
df=pd.DataFrame({'A':[':abd', ':cda', ':vfe', ':brg']})
# First, we assign the result to a new column -> df['new_column']
# Second, we loop over the dataset with apply
# Third, we execute a lambda with the split function, keeping only the text after ':'
df['new_column'] = df['A'].apply(lambda x: x.split(':')[1])
df
A new_column
0 :abd abd
1 :cda cda
2 :vfe vfe
3 :brg brg
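If you prefer to avoid apply, pandas' vectorized string methods give the same result (a sketch using the same df):
df['new_column'] = df['A'].str.split(':').str[1]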
If your original strings always start with ": " then you could just remove the first two characters using:
myString[2:]
Here is a small working sample. Both stripValue[1] and newString hold the same value; it is a matter of cleaner versus more verbose code:
# set initial string
myString = "string : value"
# split it, which returns a list indexed [0, 1, 2, ...]
stripValue = myString.split(":")
# you can create a new var with the value you want/need from the list
newString = stripValue[1]
# or you can short hand it
print(stripValue[1])
# calling the new var
print(newString)

How to partial split and take the first portion of string in Python?

Have a scenario where I wanted to split a string partially and pick up the 1st portion of the string.
Say the string could be like aloha_maui_d0_b0 or new_york_d9_b10. Note: after d it's numerical and it could be any size.
I wanted to partially strip any string before _d* i.e. wanted only _d0_b0 or _d9_b10.
I tried the code below, but obviously it removes the split term as well.
print(("aloha_maui_d0_b0").split("_d"))
#Output is : ['aloha_maui', '0_b0']
#But Wanted : _d0_b0
Is there any other way to get the partial portion? Do I need to try out in regexp?
How about just
stArr = "aloha_maui_d0_b0".split("_d")
st2 = '_d' + stArr[1]
This should do the trick if the string always has a '_d' in it
You can use index() to split in 2 parts:
s = 'aloha_maui_d0_b0'
idx = s.index('_d')
l = [s[:idx], s[idx:]]
# l = ['aloha_maui', '_d0_b0']
Edit: You can also use this if you have multiple _d in your string:
s = 'aloha_maui_d0_b0_d1_b1_d2_b2'
idxs = [n for n in range(len(s)) if n == 0 or s.find('_d', n) == n]
parts = [s[i:j] for i,j in zip(idxs, idxs[1:]+[None])]
# parts = ['aloha_maui', '_d0_b0', '_d1_b1', '_d2_b2']
I have two suggestions.
partition()
Use the method partition() to get a tuple containing the delimiter as one of the elements and use the + operator to get the String you want:
teste1 = 'aloha_maui_d0_b0'
partitiontest = teste1.partition('_d')
print(partitiontest)
print(partitiontest[1] + partitiontest[2])
Output:
('aloha_maui', '_d', '0_b0')
_d0_b0
The partition() method returns a tuple with the first element being what comes before the delimiter, the second being the delimiter itself, and the third being what comes after the delimiter.
The method does this on the first occurrence of the delimiter it finds in the string, so you can't use it to split into more than three parts without extra work. For that, my second suggestion would be better.
replace()
Use the method replace() to insert an extra character (or characters) right before your delimiter (_d) and use these as the delimiter on the split() method.
teste2 = 'new_york_d9_b10'
replacetest = teste2.replace('_d', '|_d')
print(replacetest)
splitlist = replacetest.split('|')
print(splitlist)
Output:
new_york|_d9_b10
['new_york', '_d9_b10']
Since it replaces every occurrence of _d in the string with |_d, there is no problem using it to split into more than two parts.
Problem?
A situation to be careful about is unwanted splits caused by _d being present in more places than anticipated.
Following the apparent logic of your examples with city names and numericals, you might have something like this:
teste3 = 'rio_de_janeiro_d3_b32'
replacetest = teste3.replace('_d', '|_d')
print(replacetest)
splitlist = replacetest.split('|')
print(splitlist)
Output:
rio|_de_janeiro|_d3_b32
['rio', '_de_janeiro', '_d3_b32']
Assuming the numerical part is always at the end of the string and _d won't occur inside it, rpartition() could be a solution:
rpartitiontest = teste3.rpartition('_d')
print(rpartitiontest)
print(rpartitiontest[1] + rpartitiontest[2])
Output:
('rio_de_janeiro', '_d', '3_b32')
_d3_b32
Since rpartition() starts the search from the end of the string and only uses the first match it finds to separate the terms into a tuple, you won't have to worry about the first term (the city's name?) causing unexpected splits.
Use regex's split and keep delimiters capability:
import re
patre = re.compile(r"(_d\d)")
#👆 👆
#note the surrounding parentheses - they're what drives "keep"
for line in """aloha_maui_d0_b0 new_york_d9_b10""".split():
    parts = patre.split(line)
    print("\n", line)
    print(parts)
    p1, p2 = parts[0], "".join(parts[1:])
    print(p1, p2)
output:
aloha_maui_d0_b0
['aloha_maui', '_d0', '_b0']
aloha_maui _d0_b0
new_york_d9_b10
['new_york', '_d9', '_b10']
new_york _d9_b10
credit due: https://stackoverflow.com/a/15668433

using str.replace() to remove nth character from a string in a pandas dataframe

I have a pandas dataframe that consists of strings. I would like to remove the n-th character from the end of the strings. I have the following code:
DF = pandas.DataFrame({'col': ['stri0ng']})
DF['col'] = DF['col'].str.replace('(.)..$','')
Instead of removing the third to the last character (0 in this case), it removes 0ng. The result should be string but it outputs stri. Where am I wrong?
You may want to rather replace a single character followed by n-1 characters at the end of the string:
DF['col'] = DF['col'].str.replace(r'.(?=.{2}$)', '', regex=True)
col
0 string
If you want to make sure you're only removing digits (so that 'string' in one special row doesn't get changed to 'strng'), then use something like '[0-9](?=.{2}$)' as pattern.
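For example, a small sketch (regex=True is passed explicitly because newer pandas versions no longer treat the pattern as a regex by default):
import pandas as pd

DF = pd.DataFrame({'col': ['stri0ng', 'string']})
# only remove the third-to-last character when it is a digit
DF['col'] = DF['col'].str.replace(r'[0-9](?=.{2}$)', '', regex=True)
print(DF)
#       col
# 0  string
# 1  string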
Another way using pd.Series.str.slice_replace:
df['col'].str.slice_replace(4,5,'')
Output:
0 string
Name: col, dtype: object
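Because the question is about counting from the end of the string, negative positions also appear to work with slice_replace, which avoids depending on the string's length (a sketch, worth verifying on your pandas version):
DF = pd.DataFrame({'col': ['stri0ng']})
print(DF['col'].str.slice_replace(-3, -2, ''))
# 0    string
# Name: col, dtype: object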
