Say I've got a column in my Pandas Dataframe that looks like this:
s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
I would like to use this column for fuzzy matching, so I want to remove the characters ('.', '/', '-'), but only at the end of each string, so that it looks like this:
s = pd.Series(["ab-cd", "abc", "abc-def", "ab.cde", "abcd"])
So far I started out easy: instead of generating a list of characters I want removed, I just repeated commands for different characters, like:
if s.str[-1] == '.':
s.str[-1].replace('.', '')
But this simply produces an error. How do I get the result I want, i.e. strings without these trailing characters (characters elsewhere in the string must be preserved)?
A replace with a regex will get you the output:
s.replace(r'[./-]$','',regex=True)
or, with the help of apply, in case you are looking for an alternative:
s.apply(lambda x: x[:-1] if x[-1] in ('.', '-', '/') else x)
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
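If the goal is simply to drop any trailing run of these characters, `Series.str.rstrip` is another option worth sketching (note it strips *every* trailing character from the set, not just one):

```python
import pandas as pd

s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])

# rstrip removes all trailing characters found in the given set,
# so "abc-def//" would also become "abc-def"
cleaned = s.str.rstrip("./-")
print(cleaned.tolist())  # ['ab-cd', 'abc', 'abc-def', 'ab.cde', 'abcd']
```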
You can use str.replace with a regex:
>>> s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
>>> s.str.replace(r"\.$|/$|\-$", "", regex=True)
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
>>>
which can be reduced to this:
>>> s.str.replace(r"[./-]$", "", regex=True)
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
>>>
You can use str.replace with a regular expression
s.str.replace(r'[./-]$', '', regex=True)
Substitute inside [./-] any characters you want to replace. $ means the match should be at the end of the string.
To replace "in-place" use Series.replace
s.replace(r'[./-]$','', inplace=True, regex=True)
I was able to remove characters from the end of strings in a column in a pandas DataFrame with the following line of code:
s.replace(r'[./-]$','',regex=True)
where everything between the brackets ([./-]) indicates a character to be removed, and $ indicates it should only be removed from the end of the string.
This is an example of a bigger dataframe. Imagine I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({"ID":["4SSS50FX","2TT1897FA"],
"VALUE":[13, 56]})
df
Out[2]:
ID VALUE
0 4SSS50FX 13
1 2TT1897FA 56
I would like to insert "-" in the strings from df["ID"] everytime it changes from number to text and from text to number. So the output should be like:
ID VALUE
0 4-SSS-50-FX 13
1 2-TT-1897-FA 56
I could create specific conditions for each case, but I would like to automate it for all the samples. Anyone could help me?
You can use a regular expression with lookarounds.
df['ID'] = df['ID'].str.replace(r'(?<=\d)(?=[A-Z])|(?<=[A-Z])(?=\d)', '-', regex=True)
The regexp matches an empty string that's either preceded by a digit and followed by a letter, or vice versa. This empty string is then replaced with -.
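A minimal runnable sketch of that lookaround replacement:

```python
import pandas as pd

df = pd.DataFrame({"ID": ["4SSS50FX", "2TT1897FA"], "VALUE": [13, 56]})

# insert "-" at every zero-width boundary between a digit and an
# uppercase letter, in either direction
df["ID"] = df["ID"].str.replace(r"(?<=\d)(?=[A-Z])|(?<=[A-Z])(?=\d)", "-", regex=True)
print(df["ID"].tolist())  # ['4-SSS-50-FX', '2-TT-1897-FA']
```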
Use a regex.
>>> df['ID'].str.replace(r'(\d+(?=\D)|\D+(?=\d))', r'\1-', regex=True)
0 4-SSS-50-FX
1 2-TT-1897-FA
Name: ID, dtype: object
\d+(?=\D) means digits followed by non-digit.
\D+(?=\d) means non-digits followed by digit.
Either of those are replaced with themselves plus a - character.
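The same capture-group idea can be checked outside pandas with plain `re.sub`:

```python
import re

# digits followed by a non-digit, or non-digits followed by a digit;
# the match itself is kept (\1) and a "-" is appended after it
pattern = r"(\d+(?=\D)|\D+(?=\d))"
print(re.sub(pattern, r"\1-", "4SSS50FX"))   # 4-SSS-50-FX
print(re.sub(pattern, r"\1-", "2TT1897FA"))  # 2-TT-1897-FA
```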
I am new to pandas and I have an issue with strings. I have the string s = "'hi'+'bikes'-'cars'>=20+'rangers'" and I want only the words from the string, not the symbols or the integers. How can I do it?
My input:
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
Expected Output:
s = "'hi','bikes','cars','rangers'"
Try this using a regex:
import re

s = "'hi'+'bikes'-'cars'>=20+'rangers'"
samp = re.compile('[a-zA-Z]+')
word = samp.findall(s)
Not sure about pandas, but you can do it with a regex as well; here is the solution:
import re
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
words = re.findall(r"'.+?'", s)
output = ','.join(words)
print(output)
For pandas I would convert the column in the dataframe to string first:
df
a b
0 'hi'+'bikes'-'cars'>=20+'rangers' 1
1 random_string 'with'+random,# 4
2 more,weird/stuff=wrong 6
df["a"] = df["a"].astype("string")
df["a"]
0 'hi'+'bikes'-'cars'>=20+'rangers'
1 random_string 'with'+random,#
2 more,weird/stuff=wrong
Name: a, dtype: string
Now you can see that the dtype is string, which means you can do string operations on it,
including translate and split (pandas strings). But first you have to build a translation table, using punctuation and digits imported from the string module (string docs):
from string import digits, punctuation
Then make a dictionary mapping each of the digits and punctuation to whitespace
from itertools import chain
t = {k: " " for k in chain(punctuation, digits)}
Create the translation table using str.maketrans (no import necessary in Python 3.8, though it may differ slightly in other versions), then apply translate and split (via the .str accessor) to the column:
t = str.maketrans(t)
df["a"] = df["a"].str.translate(t).str.split()
df
a b
0 [hi, bikes, cars, rangers] 1
1 [random, string, with, random] 4
2 [more, weird, stuff, wrong] 6
As you can see you only have the words now.
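Putting those steps together, a self-contained sketch (using the sample frame from above):

```python
import pandas as pd
from string import digits, punctuation
from itertools import chain

df = pd.DataFrame({"a": ["'hi'+'bikes'-'cars'>=20+'rangers'",
                         "random_string 'with'+random,#",
                         "more,weird/stuff=wrong"],
                   "b": [1, 4, 6]})

# map every punctuation character and digit to a space, then split on whitespace
table = str.maketrans({k: " " for k in chain(punctuation, digits)})
df["a"] = df["a"].astype("string").str.translate(table).str.split()
print(df["a"].tolist())
```

Note that `_` counts as punctuation, which is why "random_string" splits into two words in the output above.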
This is an extract from a table that I want to clean.
What i've tried to do:
df_sb['SB'] = df_sb['SB'].str.replace('-R*', '', df_sb['SB'].shape[0])
I expected this (without -Rxx):
But I got this (only the dash [-] and the character "R" were replaced):
Could you please help me get the desired result from item 4?
str.replace works here, you just need to use a regular expression. So your original answer was very close!
df = pd.DataFrame({"EO": ["A33X-22EO-06690"] * 2, "SB": ["A330-22-3123-R01", "A330-22-3123-R02"]})
print(df)
EO SB
0 A33X-22EO-06690 A330-22-3123-R01
1 A33X-22EO-06690 A330-22-3123-R02
df["new_SB"] = df["SB"].str.replace(r"-R\d+$", "", regex=True)
print(df)
EO SB new_SB
0 A33X-22EO-06690 A330-22-3123-R01 A330-22-3123
1 A33X-22EO-06690 A330-22-3123-R02 A330-22-3123
What the regular expression means:
r"-R\d+$" means: find the characters "-R" followed by one or more digits (\d+), but only if that pattern occurs at the very end of the string ($). This way we don't accidentally replace an occurrence of -R(digits) that happens to be in the middle of the SB string (e.g. in "A330-22-R101-R20" we don't remove the "-R101" in the middle; we only remove "-R20"). If you would actually like to remove both "-R101" and "-R20", remove the "$" from the regular expression.
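A quick way to see the effect of the `$` anchor, using plain `re.sub`:

```python
import re

sb = "A330-22-R101-R20"

# anchored: only the trailing "-R20" is removed
print(re.sub(r"-R\d+$", "", sb))  # A330-22-R101
# unanchored: every "-R<digits>" occurrence is removed
print(re.sub(r"-R\d+", "", sb))   # A330-22
```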
An example using str.partition():
s = ['A330-22-3123-R-01','A330-22-3123-R-02']
for e in s:
    print(e.partition('-R')[0])
OUTPUT:
A330-22-3123
A330-22-3123
EDIT:
Not tested, but in your case:
df_sb['SB'] = df_sb['SB'].str.partition('-R')[0]
Below is a list of unique values from a column df
aa 2
aaa 10
aaaa 14
aaaaa 2
aaaaaa 1
aableasing 25
yy 1
yyy 6
überimexcars 1
üüberimexcars 1
üüüüüüüüü 2
The aim is to 'clean' the data by grouping on Name.
Thus:
aa = aaa = aaaa
ü = üüü = üüüüüü
...
The desired output would be as shown below
a 29
aableasing 25
y 7
überimexcars 2
üüüüüüüüü 2
I was thinking of something like
df['name'] = df['name'].astype(str).str.replace('aaa', 'a')
However, I would have to do it for each letter, and that's not really an efficient way of doing it.
Using Regular Expression in this case might be a better option?
Thanks anyone who is helping!
This should do the trick:
df['name']=df['name'].replace(r"^(.)\1*$", r"\1", regex=True)
Some explanation:
It tries to match the whole cell (from the beginning, ^, to the end, $): any character (.) captured as the first group (denoted by the parentheses), followed by zero or more repetitions of that same character (the backreference \1*). If the cell matches, all of it is replaced with just that first group, \1.
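The backreference can be checked with plain `re.sub` before applying it to the column:

```python
import re

for name in ["aaaa", "üüüüüüüüü", "aableasing", "y"]:
    # ^(.)\1*$ matches only cells made of a single repeated character;
    # such a cell is replaced by that one character (\1), others are untouched
    print(re.sub(r"^(.)\1*$", r"\1", name))
```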
if t contains a string, e.g. 'aaaaa', try the following:
''.join(sorted(set(t), key=t.index))
you'll get 'a'.
Now run this on your dataframe and group
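Applying the regex collapse to the column and then grouping might look like this (a sketch with assumed column names `name`/`count` and a subset of the sample data):

```python
import pandas as pd

df = pd.DataFrame({"name": ["aa", "aaa", "aaaa", "aableasing", "yy", "yyy"],
                   "count": [2, 10, 14, 25, 1, 6]})

# collapse cells made of a single repeated character, then sum counts per name
df["name"] = df["name"].replace(r"^(.)\1*$", r"\1", regex=True)
out = df.groupby("name", as_index=False)["count"].sum()
print(out)
```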
I have a column Name with data in format below:
Name Name2
0 MORR1223ldkeha12 ldkeha12
1 FRAN2771yetg4fq1 yetg4fq1
2 MORR56333gft4tsd1 gft4tsd1
I wanted to separate name as per column Name2. There is a pattern of 4 upper case chars, followed by 4-5 digits, and I'm interested in what follows these 4-5 digits.
Is there any way to achieve this?
You can try the logic below:
import re
_names = ['MORR1223ldkeha12', 'FRAN2771yetg4fq1', 'MORR56333gft4tsd1']
result = []
for _name in _names:
    m = re.search('^[A-Z]{4}[0-9]{4,5}(.+)', _name)
    result.append(m.group(1))
print(result)
Using str.extract
import pandas as pd
df = pd.DataFrame({"Name": ['MORR1223ldkeha12', 'FRAN2771yetg4fq1', 'MORR56333gft4tsd1']})
df["Name2"] = df["Name"].str.extract(r"\d{4,5}(.*)")
print(df)
Output:
Name Name2
0 MORR1223ldkeha12 ldkeha12
1 FRAN2771yetg4fq1 yetg4fq1
2 MORR56333gft4tsd1 gft4tsd1
You could use a regex to find out if there are 4 or 5 digits and then remove either the first 8 or 9 letters. So if the pattern ^[A-Z]{4}[0-9]{5}.* matches, there are 5 digits, else 4.
If you change your re like this: '(^[A-Z]{4})([0-9]{4,5})(.+)', you can access the different parts using the submatches of the match result.
So in Anil's code, group(0) will return the whole match, group(1) the first group, group(2) the second, and group(3) the rest.
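A sketch of that three-group pattern with `re.search`:

```python
import re

m = re.search(r"(^[A-Z]{4})([0-9]{4,5})(.+)", "MORR56333gft4tsd1")
print(m.group(0))  # MORR56333gft4tsd1  (the whole match)
print(m.group(1))  # MORR               (four uppercase letters)
print(m.group(2))  # 56333              (four or five digits, greedy)
print(m.group(3))  # gft4tsd1           (the rest)
```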