I have a dataframe column with names:
df = pd.DataFrame({'Names': ['ROS-053', 'ROS-54', 'ROS-51', 'ROS-051B', 'ROS-051A', 'ROS-52']})
df.replace(to_replace=r'[a-zA-Z]{3}-\d{2}$', value='new', regex=True)
The format needs to be three letters followed by - then three numbers. So ROS-51 should be replaced with ROS-051, and ROS-051B should become ROS-051. I have tried numerous things but can't seem to figure it out.
Any help would be highly appreciated:)
You can do:
df['Names'] = df.Names.replace(r'^([a-zA-Z]{3})-0?(\d{2})(.*)$', r'\1-0\2', regex=True)
Output:
Names
0 ROS-053
1 ROS-054
2 ROS-051
3 ROS-051
4 ROS-051
5 ROS-052
Here is one option using regex replacement with a callback:
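repl = lambda m: m.group(1) + ('00' + m.group(2))[-3:] + m.group(3)
df['Names'] = df['Names'].str.replace(r'^([A-Z]{3}-)(\d+)(.*)$', repl, regex=True)
Note: this left-pads a one- or two-digit number with zeroes to three digits. It keeps any trailing suffix (ROS-051B stays ROS-051B); drop m.group(3) from the callback if the suffix should be removed as well.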
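As a minimal sketch of the callback idea applied to the question's exact requirement (pad to three digits and discard anything after the number), something like the following should work. The helper name pad_name and the result variable padded are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['ROS-053', 'ROS-54', 'ROS-51', 'ROS-051B', 'ROS-051A', 'ROS-52']})

def pad_name(m):
    # Hypothetical helper: group(1) is the 'ROS-' style prefix, group(2) the digits.
    # Whatever follows the digits is matched by '.*' and simply not re-emitted.
    return m.group(1) + m.group(2).zfill(3)

# A callable replacement requires regex=True.
padded = df['Names'].str.replace(r'^([A-Za-z]{3}-)(\d+).*$', pad_name, regex=True)
print(padded.tolist())
```

Using zfill(3) instead of slicing means numbers that already have more than three digits are left untouched rather than truncated.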
Here's another way to do it:
df = pd.DataFrame({'Names': ['ROS-053', 'ROS-54', 'ROS-51', 'ROS-051B', 'ROS-051A', 'ROS-52']})
df['Names'] = df['Names'].str.replace(r'[A-Z]$', '', regex=True)
df['Names'] = df['Names'].str.split('-').str[0] + '-' + df['Names'].str.split('-').str[1].apply(lambda x: x.zfill(3))
print(df)
Output:
Names
0 ROS-053
1 ROS-054
2 ROS-051
3 ROS-051
4 ROS-051
5 ROS-052
Related
In pandas, I have a DataFrame column with the following strings. I am looking to pad with 0 when the number within the string is less than 100:
Freq
XXX100KHz
XYC200KHz
YYY80KHz
YYY50KHz
to:
Freq
XXX100KHz
XYC200KHz
YYY080KHz
YYY050KHz
The following call doesn't work: \1 followed by 0 is read as \10, and group 10 doesn't exist.
df.replace({'Freq': r'^([A-Za-z]+)(\d\d[A-Za-z]*)$'}, {'Freq': r'\10\2'}, regex=True, inplace=True)
Try:
df["Freq"] = df["Freq"].str.replace(
r"(?<=\D)\d{1,2}(?=KHz)",
lambda g: "{:0>3}".format(g.group()),
regex=True,
)
print(df)
Prints:
Freq
0 XXX100KHz
1 XYC200KHz
2 YYY080KHz
3 YYY050KHz
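The \10 ambiguity from the question can also be avoided directly: Python's re replacement syntax accepts \g<1> as an unambiguous way to write group 1, so a literal 0 can follow it. A small sketch reusing the question's pattern on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'Freq': ['XXX100KHz', 'XYC200KHz', 'YYY80KHz', 'YYY50KHz']})

# \g<1> refers to group 1, so the following '0' is a literal zero, not part of '\10'.
# The pattern only matches when exactly two digits follow the letters, so
# three-digit values are left unchanged.
df['Freq'] = df['Freq'].str.replace(
    r'^([A-Za-z]+)(\d\d[A-Za-z]*)$',
    r'\g<1>0\2',
    regex=True,
)
print(df)
```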
Amtindoccurre
1008.59 (right space is there after this value)
1008.59-
I need to format those values properly, like 1008.59 (without the trailing space) and -1008.59, in a pandas DataFrame.
Can anyone tell me how to do this?
Thanks
DS
You can use .str.replace() with regex to parse the float numbers with trailing sign and space and put the sign and space at the front. Then use .str.strip() to remove any space remaining:
data = {'val': ['1008.59 ', '1008.59-', '57,039.54 ', '4,232.49 ', '4,191.59-', '1,257,039.54 ', '2,257,039.54-']}
df = pd.DataFrame(data)
df['val'] = df['val'].str.replace(r'(\d+(?:,\d+)*(?:\.\d+)?)(\s|-)', r'\2\1', regex=True).str.strip()
Result:
print(df)
val
0 1008.59
1 -1008.59
2 57,039.54
3 4,232.49
4 -4,191.59
5 1,257,039.54
6 -2,257,039.54
If you also want to remove the thousands separators (,), you can use:
df['val'] = df['val'].str.replace(r'(\d+(?:,\d+)*(?:\.\d+)?)(\s|-)', r'\2\1', regex=True).str.strip().str.replace(',', '', regex=False)
Result:
print(df)
val
0 1008.59
1 -1008.59
2 57039.54
3 4232.49
4 -4191.59
5 1257039.54
6 -2257039.54
If you want to further convert the numbers from string to numeric values (float type), you can use:
df['val'] = df['val'].str.replace(r'(\d+(?:,\d+)*(?:\.\d+)?)(\s|-)', r'\2\1', regex=True).str.strip().str.replace(',', '', regex=True).astype(float)
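For comparison, here is a sketch of the same conversion without a regex, assuming the only irregularities are a trailing space, a trailing minus sign, and comma thousands separators (as in the sample data). The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(['1008.59 ', '1008.59-', '57,039.54 ', '4,191.59-'])

cleaned = s.str.strip()                      # drop the trailing space
negative = cleaned.str.endswith('-')         # remember which values carry a trailing sign
numbers = pd.to_numeric(
    cleaned.str.rstrip('-').str.replace(',', '', regex=False)
)
result = numbers.where(~negative, -numbers)  # negate the rows that had a trailing '-'
print(result)
```

This goes straight to floats, so it is only appropriate when the string form with the sign moved to the front is not needed.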
dfcolumn = [PUEF2CarmenXFc034DpEd, PUEF2BalulanFc034CamH, CARF1BalulanFc013Baca, ...]
My output should be:
dfnewcolumn1 = [PUEF2, PUEF2 , CARF1]
dfnewcolumn2 = [CarmenXFc034DpEd, BalulanFc034CamH, BalulanFc013Baca]
Assuming your split criterion is a fixed number of characters (e.g. 5 here), you can use:
df['dfnewcolumn1'] = df['dfcolumn'].str[:5]
df['dfnewcolumn2'] = df['dfcolumn'].str[5:]
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
If your split criterion is the first digit in the string (the first part ends at that digit), you can use:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Using the following modified original data with more test cases:
dfcolumn
0 PUEF2CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH
2 CARF1BalulanFc013Baca
3 CAF1BalulanFc013Baca
4 PUEFA2BalulanFc034CamH
Run code:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
3 CAF1BalulanFc013Baca CAF1 BalulanFc013Baca
4 PUEFA2BalulanFc034CamH PUEFA2 BalulanFc034CamH
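The same "split after the first digit" rule can probably be expressed in one step with Series.str.extract and two capturing groups, which avoids the temporary column. This is a sketch of an alternative, not the answer's original method:

```python
import pandas as pd

df = pd.DataFrame({'dfcolumn': [
    'PUEF2CarmenXFc034DpEd',
    'PUEF2BalulanFc034CamH',
    'CARF1BalulanFc013Baca',
    'CAF1BalulanFc013Baca',
    'PUEFA2BalulanFc034CamH',
]})

# Group 1: any non-digits followed by the first digit. Group 2: the rest of the string.
df[['dfnewcolumn1', 'dfnewcolumn2']] = df['dfcolumn'].str.extract(r'^(\D*\d)(.*)$')
print(df)
```

Rows that contain no digit at all would get NaN in both new columns.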
Assuming your prefix consists of a sequence of letters followed by a sequence of digits, both of variable length, a regex-based split function can be written and applied to each cell.
Solution
import pandas as pd
import re
# data
df = pd.DataFrame()
df["dfcolumn"] = ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]
def f_split(s: str):
    """Split into two parts by regex"""
    # letter(s) followed by digit(s)
    o = re.match(r"^([A-Za-z]+\d+)(.*)$", s)
    # may add exception handling here if there is no match
    return o.group(1), o.group(2)
df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].apply(f_split).to_list()
Note the .to_list(): it turns the Series of tuples into a plain list of tuples, which is required for the two-column assignment to work.
Result
print(df)
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
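The comment about handling a missing match could be filled in along these lines. This is only one possible choice (returning None for both parts); the name f_split_safe and the second sample value are made up for illustration.

```python
import re
import pandas as pd

PREFIX_RE = re.compile(r"^([A-Za-z]+\d+)(.*)$")

def f_split_safe(s: str):
    """Like f_split, but returns (None, None) instead of raising when nothing matches."""
    m = PREFIX_RE.match(s)
    if m is None:
        # Assumption: missing values are acceptable for rows without a valid prefix.
        return None, None
    return m.group(1), m.group(2)

df = pd.DataFrame({"dfcolumn": ["PUEF2CarmenXFc034DpEd", "no_prefix_here"]})
df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].apply(f_split_safe).to_list()
print(df)
```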
How about this compact solution:
import pandas as pd
df = pd.DataFrame({"original": ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]})
df2 = pd.DataFrame(df.original.str.split(r"(\d)", n=1).to_list(), columns=["part1", "separator", "part2"])
df2.part1 = df2.part1 + df2.separator.astype(str)
df2
part1 separator part2
0 PUEF2 2 CarmenXFc034DpEd
1 PUEF2 2 BalulanFc034CamH
2 CARF1 1 BalulanFc013Baca
I use:
Series.str.split with a regex pattern and n=1, so that it splits only on the first match
in the regex pattern, a capturing group (the parentheses in (\d)) to keep the separating character
to_list() to output the split as a list of lists
the DataFrame constructor to build a new DataFrame from that list
string concatenation of two columns
Can anyone help me understand this piece of code?
def remove_digit(data):
    newData = ''.join([i for i in data if not i.isdigit()])
    i = newData.find('(')
    if i>-1: newData = newData[:i]
    return newData.strip()
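Why don't you use a regex? The character class [0-9()] matches any digit from 0 to 9 as well as ( and ). Note that this is not quite equivalent to the function above, which also discards everything after the first ( and strips surrounding whitespace.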
newData = re.sub('[0-9()]', '', data)
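To make the behaviour concrete, here is the function from the question again with a comment on each step, next to a regex-based version that is intended to behave the same way. The name remove_digit_regex and the sample string are made up for illustration.

```python
import re

def remove_digit(data):
    # 1. Keep every character that is not a digit.
    new_data = ''.join(ch for ch in data if not ch.isdigit())
    # 2. Look for the first '(' (str.find returns -1 when there is none).
    i = new_data.find('(')
    # 3. If one was found, cut the string off just before it.
    if i > -1:
        new_data = new_data[:i]
    # 4. Trim leading and trailing whitespace.
    return new_data.strip()

def remove_digit_regex(data):
    # Hypothetical equivalent: drop digits, keep the text before the first '(', then strip.
    return re.sub(r'\d', '', data).split('(', 1)[0].strip()

sample = 'Total 2019 (in USD) 42'
print(remove_digit(sample))
print(remove_digit_regex(sample))
# Removing only the characters keeps the text that was inside the parentheses:
chars_only = re.sub('[0-9()]', '', sample)
print(repr(chars_only))
```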
Given this df:
data
0 a43
1 b((
2 cr3r3
3 d
You can remove digits and parentheses from the column in this way:
df['data'] = df['data'].str.replace(r'\d|\(|\)', '', regex=True)
Output:
data
0 a
1 b
2 crr
3 d
I have a column named "KL" with for example:
sem_0405M4209F2057_1.000
sem_A_0103M5836F4798_1.000
Now I want to extract the four digits after "M" and the four digits after "F". But with df["KL"].str.extract I can't get it to work.
Locations of M and F vary, thus just using the slice [9:13] won't work for the complete column.
If you want to use str.extract, here's how:
>>> df['KL'].str.extract(r'M(?P<M>[0-9]{4})F(?P<F>[0-9]{4})')
M F
0 4209 2057
1 5836 4798
Here, M(?P<M>[0-9]{4}) matches the character 'M' and then captures 4 digits following it (the [0-9]{4} part). This is put in the column M (specified with ?P<M> inside the capturing group). The same thing is done for F.
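As a usage sketch, the extracted columns can be attached back to the original frame with DataFrame.join. The third sample row is made up to show what happens when the pattern does not match.

```python
import pandas as pd

df = pd.DataFrame({'KL': [
    'sem_0405M4209F2057_1.000',
    'sem_A_0103M5836F4798_1.000',
    'no_codes_here',  # hypothetical row without an M/F block
]})

# Named groups become the column names of the extracted frame.
parts = df['KL'].str.extract(r'M(?P<M>\d{4})F(?P<F>\d{4})')
df = df.join(parts)
print(df)
```

Rows without a match get NaN in both M and F, and the extracted values are strings, not integers.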
You could use split to achieve this, probably a better way exists:
In [147]:
s = pd.Series(['sem_0405M4209F2057_1.000','sem_A_0103M5836F4798_1.000'])
s
Out[147]:
0 sem_0405M4209F2057_1.000
1 sem_A_0103M5836F4798_1.000
dtype: object
In [153]:
m = s.str.split('M').str[1].str.split('F').str[0].str[:4]
f = s.str.split('M').str[1].str.split('F').str[1].str[:4]
print(m)
print(f)
0 4209
1 5836
dtype: object
0 2057
1 4798
dtype: object
You can also use regex:
import re
def get_data(x):
    data = re.search(r'M(\d{4})F(\d{4})', x)
    if data:
        m = data.group(1)
        f = data.group(2)
        return m, f
df = pd.DataFrame(data={'a': ['sem_0405M4209F2057_1.000', 'sem_0405M4239F2027_1.000']})
df['data'] = df['a'].apply(get_data)
Output:
a data
0 sem_0405M4209F2057_1.000 (4209, 2057)
1 sem_0405M4239F2027_1.000 (4239, 2027)