How to replace string in Python string with specific character?

For example, I have a column named Children in a pandas data frame.
A few of the names are [tom (peter), lily, fread, gregson (jaeson 123)], etc.
What code should I write to remove the part of each name that starts with a bracket '('? From my example, tom (peter) should become tom and gregson (jaeson 123) should become gregson. There are thousands of names with a bracketed part, and I want to remove the part of the string that starts at '(' and ends at ')'. The data frame has many columns, but I want to do this editing only in one specific column named CHILDREN in my data frame named DF.

As suggested by @Ruslan S., you can use pandas.Series.str.replace, or you could use re.sub (there are other methods as well):
import pandas as pd
df = pd.DataFrame({"name": ["tom (peter)", "lily", "fread", "gregson (jaeson 123)"]})
# OPTION 1: with str.replace (regex=True is required since pandas 2.0, where the default became regex=False)
df["name"] = df["name"].str.replace(r"\([a-zA-Z0-9\s]+\)", "", regex=True).str.strip()
# OPTION 2: with re.sub
import re
r = re.compile(r"\([a-zA-Z0-9\s]+\)")
df["name"] = df["name"].apply(lambda x: r.sub("", x).strip())
And the result in both cases:
      name
0      tom
1     lily
2    fread
3  gregson
Note that I also use strip to remove leading and trailing whitespace here. For more information on the regular expression syntax, see the documentation of the re module.
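The character class [a-zA-Z0-9\s] above only covers letters, digits and whitespace, so a bracketed part containing, say, a hyphen would be left in place. A minimal sketch of a more permissive pattern, using only the re module; the helper name strip_parenthetical and the extra sample name "anna (jean-luc)" are made up for illustration:

```python
import re

# Hypothetical helper: remove every "(...)" group together with the
# whitespace in front of it, whatever characters the group contains.
PAREN_GROUP = re.compile(r"\s*\([^)]*\)")

def strip_parenthetical(name):
    return PAREN_GROUP.sub("", name).strip()

# The last name is an invented example with a hyphen inside the brackets.
names = ["tom (peter)", "lily", "fread", "gregson (jaeson 123)", "anna (jean-luc)"]
cleaned = [strip_parenthetical(n) for n in names]
print(cleaned)
```

The same pattern should also work on a column through df["name"].str.replace(r"\s*\([^)]*\)", "", regex=True).str.strip().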

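You can try:
# to remove text between ( and ); .*? is non-greedy, so each pair of brackets is handled separately
df['columnname'] = df['columnname'].str.replace(r'\(.*?\)', '', regex=True)
# to remove text between % and %
df['columnname'] = df['columnname'].str.replace(r'%.*?%', '', regex=True)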
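The difference between a greedy .* and a non-greedy .*? only shows up when a value contains more than one bracketed part. A small sketch with re.sub on an invented sample string:

```python
import re

# Invented sample with two bracketed parts in one value.
text = "tom (peter) and ann (mary)"

# Greedy: matches from the first "(" to the LAST ")".
greedy = re.sub(r"\(.*\)", "", text)

# Non-greedy: matches each "(...)" pair on its own.
lazy = re.sub(r"\(.*?\)", "", text)

print(repr(greedy))
print(repr(lazy))
```

With the greedy version everything between the first opening and the last closing bracket is lost, including "and ann".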

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

For example:
Arun,Mishra,108,23,34,45,56,Mumbai
The output I want is:
Arun,Mishra,108.23,34,45,56,Mumbai
I tried text.replace(',','.'), but that replaces every delimiter comma with a dot.
You can use a regex for this kind of task:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, count=1)
# new_str == 'Arun,Mishra,108.23,34,45,56,Mumbai'
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. There are three capture groups, so they can be used in the replacement r'\1.\3' (\1 and \3 are the first and third groups). old_str is the input string, and count=1 tells re.sub to replace only the first occurrence (thus keeping 34,45).
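To make the effect of the count argument concrete, here is a short sketch comparing a single replacement with an unlimited one. It uses a two-group variant of the pattern, which is an equivalent simplification rather than the answer's exact regex:

```python
import re

line = "Arun,Mishra,108,23,34,45,56,Mumbai"
pattern = r"(\d+),(\d+)"  # a comma between two runs of digits

# Replace only the first digit,digit comma.
first_only = re.sub(pattern, r"\1.\2", line, count=1)

# Without count, every non-overlapping match is replaced.
every_match = re.sub(pattern, r"\1.\2", line)

print(first_only)
print(every_match)
```

Because matches do not overlap, the unlimited version pairs up 108,23 and 34,45 and leaves 56 alone.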
It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for commas. Once the index of a comma has been identified, examine the characters on either side (checking for digits). If both are digits, modify the string accordingly.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
# find() stops before the last character, so s[pos+1] is always valid
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break  # only the first digit,digit comma is replaced
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai
Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. A generic regex that replaces any comma between two consecutive runs of digits might give false matches on other data, so I think explicitly matching the exact columns you have is the safest way to do it.
You can take the above regex and put it into your Python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(
    r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)',
    r'\1,\2,\3.\4,\5,\6,\7,\8',
    inLine,
)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.
First split the string using s.split(','), then join the 3rd and 4th elements (indexes 2 and 3) with a dot. After that, join the list back into a string.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])  # '108' + '.' + '23'
ls.pop(3)                         # drop the element that was merged in
s = ','.join(ls)
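Since the question is about a whole text file, the split-and-join idea can be wrapped in a function and applied line by line. A minimal sketch; the function name merge_fields, its parameters, and the second sample row are invented for illustration:

```python
def merge_fields(line, index, sep=",", joiner="."):
    """Join field `index` and the field after it with `joiner`."""
    fields = line.split(sep)
    # Replace the two neighbouring fields by their joined value.
    fields[index:index + 2] = [joiner.join(fields[index:index + 2])]
    return sep.join(fields)

lines = [
    "Arun,Mishra,108,23,34,45,56,Mumbai",
    "Asha,Rao,99,5,12,45,56,Pune",  # made-up second row
]
fixed = [merge_fields(line, 2) for line in lines]
for row in fixed:
    print(row)
```

This assumes every line has the same layout, with the number to repair always split across the fields at indexes 2 and 3.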
This changes every comma to a dot if the comma has a digit immediately before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # Only replace when the comma is not at either end and sits between two digits
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex + 1)
print(txt)

python regex for filename

I am trying to apply a regex to a column of a dataframe.
For example, a value will be: ia wt template - tdct-c15-c5.doc
The best logic I can think of is to take everything after the " - " up to the last digit in the string.
I am trying to trim it to tdct-c15-c5.
Any help would be appreciated.
Components
To stay flexible, assume your input filenames contain these chunks:
(1) a fixed extension .doc (denoting Word documents)
(2) some important key (here tdct-c15-c5)
(3) a separator: a hyphen, possibly surrounded by spaces (here " - ")
(4) some prefix that does not matter currently (here ia wt template)
All of this is contained in ia wt template - tdct-c15-c5.doc.
Decomposition steps
Chunks (1) and (3) in particular seem to be stable, fixed constants.
So let's work with them:
- strip the extension (1) off the right-hand side, since it is ignored
- split the remaining basename at the separator (3) into two parts: prefix (4) and key (2)
The last part (2) is what we want to extract.
Implementation (pure Python only)
def extract_key(filename):
    # removesuffix (Python 3.9+) drops the exact suffix; rstrip('.doc') would strip any trailing '.', 'd', 'o', 'c' characters
    basename = filename.removesuffix('.doc')
    (prefix, key) = basename.split(' - ')  # or use lenient regex r'\ ?-\ ?'
    return key

filename = 'ia wt template - tdct-c15-c5.doc'
print('extracted key:', extract_key(filename))
Prints:
extracted key: tdct-c15-c5
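The code comment above hints at a more lenient, regex-based separator. One way this could look is sketched below. The function name extract_key_regex and the exact pattern are assumptions: the pattern requires whitespace on at least one side of the hyphen, so that the hyphens inside the key itself are not treated as separators.

```python
import re

def extract_key_regex(filename, extension=".doc"):
    # Drop the extension only if it really is the suffix.
    if filename.endswith(extension):
        filename = filename[:-len(extension)]
    # Split once on a hyphen with whitespace on at least one side.
    parts = re.split(r"\s+-\s*|\s*-\s+", filename, maxsplit=1)
    return parts[-1]

print(extract_key_regex("ia wt template - tdct-c15-c5.doc"))
print(extract_key_regex("ia wt template -tdct-c15-c5.doc"))
```

If no separator is found, the whole basename is returned, which may or may not be the behavior you want.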
Applied to pandas
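Use the function, as suggested by C.Nivis, inside apply() on the column that holds the filenames (the column name mystr is taken from the example below):
df['mystr'].apply(extract_key)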
I don't know if a regex is the better option here. An apply is pretty readable:
import pandas as pd
mystr = "ia wt template - tdct-c15-c5.doc"
df = pd.DataFrame([[mystr] for i in range(4)], columns=['mystr'])
# removesuffix (Python 3.9+) is used because rstrip('.doc') strips characters, not a suffix
df.mystr.apply(lambda x: x.split(' ')[-1].removesuffix('.doc'))
0    tdct-c15-c5
1    tdct-c15-c5
2    tdct-c15-c5
3    tdct-c15-c5
Name: mystr, dtype: object
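A side note on stripping the extension: str.rstrip('.doc') removes any trailing run of the characters '.', 'd', 'o' and 'c', not the literal suffix ".doc". It happens to work for tdct-c15-c5.doc because the key ends in a digit. A small sketch with a made-up filename whose key ends in one of those letters:

```python
# Invented filename: the part before the extension ends in "c".
name = "report - abc.doc"

# rstrip treats its argument as a set of characters to remove.
stripped = name.rstrip(".doc")

# Removing the suffix only when it is really there.
suffix_removed = name[:-len(".doc")] if name.endswith(".doc") else name

print(stripped)
print(suffix_removed)
```

On Python 3.9 and later, name.removesuffix(".doc") does the same as the conditional slice.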

How to use str.replace by replacing from a specific character and on/forward

This is an extract from a table that I want to clean.
What I've tried to do:
df_sb['SB'] = df_sb['SB'].str.replace('-R*', '', df_sb['SB'].shape[0])
I expected this (without -Rxx):
But I got this (only the dash [-] and the character "R" were replaced):
Could you please help me get the desired result?
str.replace works here, you just need to use a regular expression. So your original attempt was very close!
import pandas as pd
df = pd.DataFrame({"EO": ["A33X-22EO-06690"] * 2, "SB": ["A330-22-3123-R01", "A330-22-3123-R02"]})
print(df)
                EO                SB
0  A33X-22EO-06690  A330-22-3123-R01
1  A33X-22EO-06690  A330-22-3123-R02
# regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
df["new_SB"] = df["SB"].str.replace(r"-R\d+$", "", regex=True)
print(df)
                EO                SB        new_SB
0  A33X-22EO-06690  A330-22-3123-R01  A330-22-3123
1  A33X-22EO-06690  A330-22-3123-R02  A330-22-3123
What the regular expression means:
r"-R\d+$" matches the characters "-R" followed by one or more digits (\d+), but only when that pattern occurs at the very end of the string ($). This way we don't accidentally replace an occurrence of -R(digits) in the middle of the SB string: in "A330-22-R101-R20" we would not remove "-R101", only "-R20". If you would actually like to remove both "-R101" and "-R20", remove the "$" from the regular expression.
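The effect of the "$" anchor described above can be checked with plain re.sub on the example string from the explanation:

```python
import re

sb = "A330-22-R101-R20"

# Anchored: only a trailing -R<digits> is removed.
anchored = re.sub(r"-R\d+$", "", sb)

# Unanchored: every -R<digits> occurrence is removed.
unanchored = re.sub(r"-R\d+", "", sb)

print(anchored)
print(unanchored)
```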
An example using str.partition():
s = ['A330-22-3123-R-01','A330-22-3123-R-02']
for e in s:
    print(e.partition('-R')[0])
OUTPUT:
A330-22-3123
A330-22-3123
EDIT:
Not tested, but in your case:
df_sb['SB'] = df_sb['SB'].str.partition('-R')[0]

Pandas to match column contents to keywords (with spaces and brackets )

A column in my data frame contains the keywords I want to match.
I want to check whether each row of that column contains any of the keywords. If yes, print them.
Tried below:
import pandas as pd
import re
Keywords = [
    "Caden(S, A)",
    "Caden(a",
    "Caden(.A))",
    "Caden.Q",
    "Caden.K",
    "Caden"
]
data = {'People' : ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]}
df = pd.DataFrame(data)
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
print(df["found"])
It returns NaN. I guess the challenge lies in the spaces and brackets in the keywords.
What's the right way to get the desired output below? Thank you.
Caden(S, A); Caden.K
Caden(a
Caden(.A))
Caden.Q
Since you do not need to find every keyword, only the longest one where keywords overlap, you may use a regex with the findall approach.
The point here is that you need to sort the keywords by length in descending order first (the regex engine takes the first alternative that matches, so a short keyword such as Caden would otherwise win over Caden.Q). Then you need to escape these values, as they contain special characters. Finally, you must replace \b with the unambiguous word boundaries (?<!\w) and (?!\w) (note that \b is context dependent).
Use
pat = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
See the Python test:
import re
Keywords = ["Caden(S, A)", "Caden(a","Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
rx = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
# => (?<!\w)(?:Caden\(S,\ A\)|Caden\(\.A\)\)|Caden\(a|Caden\.Q|Caden\.K|Caden)(?!\w)
strs = ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]
for s in strs:
    print(re.findall(rx, s))
Output
['Caden(S, A)', 'Caden.K']
['Caden(a']
['Caden(.A))']
['Caden.Q']
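To see why \b is called context dependent here, the following sketch compares both kinds of boundary on the first sample string. \b needs a word character on exactly one side, so it cannot match between ")" and a space:

```python
import re

text = "Caden(S, A) Charlotte.A, Caden.K;"
keyword = re.escape("Caden(S, A)")

# \b after ")" fails: ")" and the following space are both non-word characters.
with_b = re.findall(r"\b" + keyword + r"\b", text)

# The lookarounds only require that no word character is adjacent.
with_lookarounds = re.findall(r"(?<!\w)" + keyword + r"(?!\w)", text)

print(with_b)
print(with_lookarounds)
```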
Hey, I don't know if this solution is optimal, but it works. I just replaced '.' with 8, '(' with 6 and ')' with 9. I don't know why those characters are ignored by str.findall (most likely because they are special characters in a regex). Note that this assumes the data contains no 6, 8 or 9 digits.
A kind of bijection between {8,6,9} and {'.','(',')'}
for i in range(len(Keywords)):
    Keywords[i] = Keywords[i].replace('(','6').replace(')','9').replace('.','8')
for i in range(len(df['People'])):
    df.loc[i, 'People'] = df.loc[i, 'People'].replace('(','6').replace(')','9').replace('.','8')
And then you apply your function
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
Final step: get back the {'.','(',')'}
for i in range(len(df['found'])):
    df.loc[i, 'found'] = df.loc[i, 'found'].replace('6','(').replace('9',')').replace('8','.')
    df.loc[i, 'People'] = df.loc[i, 'People'].replace('6','(').replace('9',')').replace('8','.')
Voilà

Pandas Split rows based on different delimiters

So I currently have this:
s = final_df['Column Name'].str.split(';').apply(pd.Series, 1).stack()
which splits a row when it finds the ; delimiter. However, I will not always have the semicolon as my delimiter. Is there a way to incorporate re.split or other delimiters into str.split? Basically, there could be ':', ';' or '|' as my delimiters, but I won't know which.
I tried to just do split(';', '|') but I knew that wouldn't work.
str.split accepts a regex just like re.split does, so you do not need to use the latter. The following should do:
s = final_df['Column Name'].str.split(r'[;:|]').apply(pd.Series, 1).stack()
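The character class [;:|] behaves the same way in re.split, which makes it easy to try out on a plain string first. A minimal sketch; the helper name split_on_any and the sample string are invented for illustration:

```python
import re

def split_on_any(value, delimiters=";:|"):
    # Build a character class such as [;:\|] from the given delimiters.
    pattern = "[" + re.escape(delimiters) + "]"
    return [part.strip() for part in re.split(pattern, value)]

parts = split_on_any("hello:world|123;here")
print(parts)
```

Escaping the delimiters keeps characters such as "|" or "^" from being read as regex syntax if the delimiter set changes.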
If the starting file contains those delimiters, you could actually provide the regular expression pattern to the sep parameter of the read_table function and set its engine parameter to "python". The following uses the io module and a random string to illustrate the point:
import io
import pandas as pd
mystring = u"hello:world|123;here|we;go,again"
with io.StringIO(mystring) as f:
    df = pd.read_table(f, sep=r"[;:|,]", engine="python", header=None)
df
#        0      1    2     3   4   5      6
# 0  hello  world  123  here  we  go  again
This one splits on :, ;, | and ,.
I hope this proves useful.
