Below is a list of unique values, with their counts, from a column in df:
aa 2
aaa 10
aaaa 14
aaaaa 2
aaaaaa 1
aableasing 25
yy 1
yyy 6
überimexcars 1
üüberimexcars 1
üüüüüüüüü 2
The aim is to 'clean' the data by grouping on Name.
Thus:
aa = aaa = aaaa
ü = üüü = üüüüüü
...
The desired output would be as shown below
a 29
aableasing 25
y 7
überimexcars 2
üüüüüüüüü 2
I was thinking of something like
df['name'] = df['name'].astype(str).str.replace('aaa', 'a')
However, I would have to do that for each letter, and it's not really an efficient way of doing it.
Would using a regular expression be a better option in this case?
Thanks to anyone who can help!
This should do the trick:
df['name'] = df['name'].replace(r"^(.)\1*$", r"\1", regex=True)
Some explanation:
It will try to match the whole cell (from the beginning, ^, to the end, $) against any single character (.) repeated zero or more times after that (\1* is a backreference to the first group, denoted by the parentheses), and the whole match, if there is one, is replaced with that first group, \1.
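A minimal sketch of the full round trip on a few of the sample values (the 'name' and 'count' column names are assumptions, since the question only shows the values):
import pandas as pd

df = pd.DataFrame({"name": ["aa", "aaa", "aaaa", "aableasing", "yy", "yyy"],
                   "count": [2, 10, 14, 25, 1, 6]})

# Collapse names that consist of a single repeated character, then re-aggregate.
df["name"] = df["name"].replace(r"^(.)\1*$", r"\1", regex=True)
print(df.groupby("name", as_index=False)["count"].sum())
#          name  count
# 0           a     26
# 1  aableasing     25
# 2           y      7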
If t contains a string, e.g. 'aaaaa', try the following:
''.join(sorted(set(t), key=t.index))
and you'll get 'a'.
Now run this on your dataframe column and group.
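Applied column-wise, that would look roughly like the sketch below. Note that the set-based dedup removes every repeated character, not just consecutive runs, so it matches the regex answer only for names built from a single repeated character:
collapse = lambda t: ''.join(sorted(set(t), key=t.index))
print(collapse('aaaaa'))        # a
print(collapse('üüüüüüüüü'))    # ü
print(collapse('aableasing'))   # ablesing  (every duplicate letter is dropped, not just runs)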
This is an example of a bigger dataframe. Imagine I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({"ID":["4SSS50FX","2TT1897FA"],
"VALUE":[13, 56]})
df
Out[2]:
ID VALUE
0 4SSS50FX 13
1 2TT1897FA 56
I would like to insert "-" into the strings in df["ID"] every time they change from a number to text and from text to a number. So the output should look like:
ID VALUE
0 4-SSS-50-FX 13
1 2-TT-1897-FA 56
I could create specific conditions for each case, but I would like to automate it for all the samples. Could anyone help me?
You can use a regular expression with lookarounds.
df['ID'] = df['ID'].str.replace(r'(?<=\d)(?=[A-Z])|(?<=[A-Z])(?=\d)', '-', regex=True)
The regexp matches an empty string that's either preceded by a digit and followed by a letter, or vice versa. This empty string is then replaced with -.
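A quick check of that replacement on the sample frame from the question (a sketch):
import pandas as pd

df = pd.DataFrame({"ID": ["4SSS50FX", "2TT1897FA"], "VALUE": [13, 56]})
df["ID"] = df["ID"].str.replace(r"(?<=\d)(?=[A-Z])|(?<=[A-Z])(?=\d)", "-", regex=True)
print(df["ID"].tolist())  # ['4-SSS-50-FX', '2-TT-1897-FA']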
Use a regex.
>>> df['ID'].str.replace(r'(\d+(?=\D)|\D+(?=\d))', r'\1-', regex=True)
0 4-SSS-50-FX
1 2-TT-1897-FA
Name: ID, dtype: object
\d+(?=\D) means digits followed by a non-digit.
\D+(?=\d) means non-digits followed by a digit.
Either of those are replaced with themselves plus a - character.
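The same substitution can be checked with plain re.sub outside pandas (a sketch):
import re

print(re.sub(r"(\d+(?=\D)|\D+(?=\d))", r"\1-", "4SSS50FX"))    # 4-SSS-50-FX
print(re.sub(r"(\d+(?=\D)|\D+(?=\d))", r"\1-", "2TT1897FA"))   # 2-TT-1897-FA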
This is an extract from a table that I want to clean.
What I've tried to do:
df_sb['SB'] = df_sb['SB'].str.replace('-R*', '', df_sb['SB'].shape[0])
I expected this (without -Rxx):
But I got this (only the dash "-" and the character "R" were replaced):
Could you please help me get the desired result shown in item 4?
str.replace works here; you just need to use a regular expression. So your original attempt was very close!
df = pd.DataFrame({"EO": ["A33X-22EO-06690"] * 2, "SB": ["A330-22-3123-R01", "A330-22-3123-R02"]})
print(df)
EO SB
0 A33X-22EO-06690 A330-22-3123-R01
1 A33X-22EO-06690 A330-22-3123-R02
df["new_SB"] = df["SB"].str.replace(r"-R\d+$", "")
print(df)
EO SB new_SB
0 A33X-22EO-06690 A330-22-3123-R01 A330-22-3123
1 A33X-22EO-06690 A330-22-3123-R02 A330-22-3123
What the regular expression means:
r"-R\d+$" means find anywhere in the string we see that characters "-R" followed by 1 or more digits (\d+). Then we constrain this to ONLY work if that pattern occurs at the very end of the string. This way we don't accidentally replace an occurrence of -R(digits) that happens to be in the middle of the SB string (e.g. we don't remove "-R101" in the middle of: "A330-22-R101-R20". We would only remove "-R20"). If you would actually like to remove both "-R101" and "-R20", remove the "$" from the regular expression.
An example using str.partition():
s = ['A330-22-3123-R-01', 'A330-22-3123-R-02']
for e in s:
    print(e.partition('-R')[0])
OUTPUT:
A330-22-3123
A330-22-3123
EDIT:
Not tested, but in your case:
df_sb['SB'] = df_sb['SB'].str.partition('-R')[0]
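Applied to the sample data from the earlier answer, that sketch would give (str.partition returns three columns; [0] keeps everything before the first '-R'):
import pandas as pd

df = pd.DataFrame({"SB": ["A330-22-3123-R01", "A330-22-3123-R02"]})
df["new_SB"] = df["SB"].str.partition("-R")[0]
print(df)
#                  SB        new_SB
# 0  A330-22-3123-R01  A330-22-3123
# 1  A330-22-3123-R02  A330-22-3123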
I have a column Name with data in format below:
Name Name2
0 MORR1223ldkeha12 ldkeha12
1 FRAN2771yetg4fq1 yetg4fq1
2 MORR56333gft4tsd1 gft4tsd1
I want to separate the name as shown in column Name2. The pattern is 4 upper-case characters followed by 4-5 digits, and I'm interested in what follows those digits.
Is there any way to achieve this?
You can try the logic below:
import re
_names = ['MORR1223ldkeha12', 'FRAN2771yetg4fq1', 'MORR56333gft4tsd1']
result = []
for _name in _names:
    m = re.search('^[A-Z]{4}[0-9]{4,5}(.+)', _name)
    result.append(m.group(1))
print(result)
Using str.extract
import pandas as pd
df = pd.DataFrame({"Name": ['MORR1223ldkeha12', 'FRAN2771yetg4fq1', 'MORR56333gft4tsd1']})
df["Name2"] = df["Name"].str.extract(r"\d{4,5}(.*)")
print(df)
Output:
Name Name2
0 MORR1223ldkeha12 ldkeha12
1 FRAN2771yetg4fq1 yetg4fq1
2 MORR56333gft4tsd1 gft4tsd1
You could use a regex to find out whether there are 4 or 5 digits and then remove either the first 8 or 9 characters, as in the sketch below. If the pattern ^[A-Z]{4}[0-9]{5}.* matches, there are 5 digits; otherwise there are 4.
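A minimal sketch of that idea in plain Python (the 8/9 split assumes the prefix is always exactly 4 letters plus 4 or 5 digits; strip_prefix is just an illustrative helper name):
import re

def strip_prefix(name):
    # If 4 letters + 5 digits match, drop 9 characters; otherwise drop 8.
    cut = 9 if re.match(r'^[A-Z]{4}[0-9]{5}', name) else 8
    return name[cut:]

print(strip_prefix('MORR56333gft4tsd1'))  # gft4tsd1
print(strip_prefix('MORR1223ldkeha12'))   # ldkeha12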
If you change your regex to '(^[A-Z]{4})([0-9]{4,5})(.+)' you can access the different parts through the submatches of the match result.
So in Anil's code, group(0) will return the whole match, group(1) the first group, group(2) the second one, and group(3) the rest.
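A quick look at those groups with plain re, using a sample name from the question:
import re

m = re.match(r'(^[A-Z]{4})([0-9]{4,5})(.+)', 'MORR56333gft4tsd1')
print(m.group(0))  # MORR56333gft4tsd1 (the whole match)
print(m.group(1))  # MORR
print(m.group(2))  # 56333
print(m.group(3))  # gft4tsd1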
Say I've got a column in my Pandas Dataframe that looks like this:
s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
I would like to use this column for fuzzy matching, and therefore I want to remove the characters ('.', '/', '-') but only at the end of each string, so it looks like this:
s = pd.Series(["ab-cd", "abc", "abc-def", "ab.cde", "abcd"])
So far I started out simple: instead of generating a list of characters I want removed, I just repeated the command for different characters, like:
if s.str[-1] == '.':
    s.str[-1].replace('.', '')
But this simply produces an error. How do I get the result I want, that is, strings without those characters at the end (characters in the rest of the string need to be preserved)?
Replace with a regex will get you the output you want:
s.replace(r'[./-]$','',regex=True)
or, with the help of apply, in case you're looking for an alternative:
s.apply(lambda x: x[:-1] if x[-1] in './-' else x)
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
You can use str.replace with a regex:
>>> s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
>>> s.str.replace("\.$|/$|\-$","")
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
>>>
which can be reduced to this:
>>> s.str.replace("[./-]$","")
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
>>>
You can use str.replace with a regular expression
s.str.replace(r'[./-]$', '', regex=True)
Put any characters you want to remove inside [./-]. The $ means the match must be at the end of the string.
To replace "in-place" use Series.replace
s.replace(r'[./-]$','', inplace=True, regex=True)
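A quick check of that in-place variant on the sample series (a sketch):
import pandas as pd

s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
s.replace(r'[./-]$', '', inplace=True, regex=True)
print(s.tolist())
# ['ab-cd', 'abc', 'abc-def', 'ab.cde', 'abcd']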
I was able to remove characters from the end of strings in a column in a pandas DataFrame with the following line of code:
s.replace(r'[./-]$','',regex=True)
where the characters inside the brackets ([./-]) are the ones to be removed, and $ indicates they should only be removed from the end.
I have a pandas DataFrame with one column of prices that contains strings of various forms such as US$250.00, MYR35.50, and S$50, and I have had trouble developing a suitable regex to split the non-numerical portion from the numerical portion. The end result I would like is to split this single column of prices into two new columns: one holding the alphabetical part as a string, named "Currency", and the other holding the numbers, named "Price".
The only possible alphabetical parts I would encounter in the strings, prepended to the numerical parts, are of the forms US$, BAHT, MYR, and S$. Sometimes there might be whitespace between the alphabetical part and the numerical part, sometimes there might not be. All the help I need here is to figure out the right regex for this job.
Help please! Thank you so much!
If you want to extend @Tristan's answer to pandas, you can use the extractall method of the str accessor.
First create some data
s=pd.Series(['US$250.00', 'MYR35.50','&*', 'S$ 50', '50'])
0 US$250.00
1 MYR35.50
2 &*
3 S$ 50
4 50
Then use extractall. Notice that this method skips rows that do not have a match.
s.str.extractall(r'([A-Z$]+)\s*([\d.]+)')
0 1
match
0 0 US$ 250.00
1 0 MYR 35.50
3 0 S$ 50
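To land directly on the two column names the question asks for, a variant with named groups can be used (a sketch; str.extract keeps one row per input and returns NaN where nothing matches):
import pandas as pd

s = pd.Series(['US$250.00', 'MYR35.50', '&*', 'S$ 50', '50'])
out = s.str.extract(r'(?P<Currency>[A-Z$]+)\s*(?P<Price>[\d.]+)')
print(out)
#   Currency   Price
# 0      US$  250.00
# 1      MYR   35.50
# 2      NaN     NaN
# 3       S$      50
# 4      NaN     NaN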
You can use re.match on each cell with a regex like this:
import re
cell = 'US$50.00'
result = re.match(r'([A-Z$]+)\s*([\d.]+)', cell)
print(result.groups()[0], result.groups()[1])
The relevant different parts are captured in groups and can be accessed separately, while the optional whitespace is ignored.
The trick is to use '\$* *' in your search pattern.
Since $ is a metacharacter in regex, it needs to be escaped to be treated as a literal $. So the '\$*' part tells the regex that a $ sign may appear zero or more times. Similarly, ' *' tells the regex that a space may appear zero or more times.
Hope this helps.
>>> import re
>>> string = 'Rs50 US$56 MYR83 S$102 Baht 105 Us$77'
>>> M = re.findall(r'[A-Za-z]+\$*', string)
>>> M
['Rs', 'US$', 'MYR', 'S$', 'Baht', 'Us$']
>>> C = re.findall(r'[A-Za-z]+\$* *([0-9]+)', string)
>>> C
['50', '56', '83', '102', '105', '77']
With this regex
^([^0-9]+)([0-9]+\.?[0-9]*)$
Group 1 will be the currency part and Group 2 will be the numerical part:
https://regex101.com/delete/MjfCYY4H8g1uCfCywL0TFImZ
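A sketch of how that regex could produce the two requested columns with str.extract (the "price" column name is an assumption; .str.strip() removes the optional whitespace left on the currency part):
import pandas as pd

df = pd.DataFrame({"price": ["US$250.00", "MYR35.50", "S$ 50"]})  # "price" is an assumed column name
extracted = df["price"].str.extract(r"^([^0-9]+)([0-9]+\.?[0-9]*)$")
df["Currency"] = extracted[0].str.strip()
df["Price"] = extracted[1]
print(df)
#        price Currency   Price
# 0  US$250.00      US$  250.00
# 1   MYR35.50      MYR   35.50
# 2      S$ 50       S$      50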