One Column of my dataset is like this:
0 10,000+
1 500,000+
2 5,000,000+
3 50,000,000+
4 100,000+
Name: Installs, dtype: object
and I want to change these 'xxx,yyy,zzz+' strings to integers.
first I tried this function:
df['Installs'] = pd.to_numeric(df['Installs'])
and I got this error:
ValueError: Unable to parse string "10,000" at position 0
and then I tried to remove '+' and ',' with this method:
df['Installs'] = df['Installs'].str.replace('+','',regex = True)
df['Installs'] = df['Installs'].str.replace(',','',regex = True)
but nothing changed!
How can I convert these strings to integers?
With regex=True, the + (plus) character is interpreted specially, as a regex quantifier. You can either disable regular-expression replacement (regex=False) or, even better, change your regular expression to match + or , and remove both at once:
df['Installs'] = df['Installs'].str.replace('[+,]', '', regex=True).astype(int)
Output:
>>> df['Installs']
0 10000
1 500000
2 5000000
3 50000000
4 100000
Name: 0, dtype: int64
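To make the two options concrete, here is a minimal sketch on a small made-up Series (the sample values and the variable names are assumptions, not the asker's full data). It shows literal replacement with regex=False next to the single character-class replacement; both should give the same integers.

```python
import pandas as pd

# Hypothetical sample mirroring the 'Installs' column from the question
s = pd.Series(["10,000+", "500,000+", "5,000,000+"], name="Installs")

# Option 1: literal (non-regex) replacement, one character at a time
literal = (
    s.str.replace("+", "", regex=False)
     .str.replace(",", "", regex=False)
     .astype(int)
)

# Option 2: one regex character class that matches either '+' or ','
# (inside [...] the '+' is an ordinary character, so no escaping is needed)
by_class = s.str.replace(r"[+,]", "", regex=True).astype(int)

print(literal.tolist())
print(by_class.tolist())
```

The character-class version does the work in one pass, which is why it is usually the tidier choice.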
+ on its own is not a valid regex (it is a quantifier with nothing to repeat). Use:
df['Installs'] = pd.to_numeric(df['Installs'].str.replace(r'\D', '', regex=True))
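One thing worth keeping in mind with \D: it removes every non-digit character, not only + and ,. The sketch below uses made-up values (an assumption for illustration) to show that a decimal point or a minus sign would be stripped as well, so this pattern is best suited to columns that hold non-negative whole numbers.

```python
import pandas as pd

# Hypothetical values: one like the question's data, one decimal, one negative
s = pd.Series(["10,000+", "1,234.5", "-7"])

# \D matches any character that is not a digit, so '.', '-' and '+' all go
digits_only = s.str.replace(r"\D", "", regex=True)
numbers = pd.to_numeric(digits_only)

print(digits_only.tolist())
print(numbers.tolist())
```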
Amtindoccurre
1008.59 (there is a trailing space after this value)
1008.59-
I need to convert those values to a proper format, like 1008.59 (without the trailing space) and -1008.59, in a pandas DataFrame.
Can anyone tell me how to do this?
Thanks
DS
You can use .str.replace() with regex to parse the float numbers with trailing sign and space and put the sign and space at the front. Then use .str.strip() to remove any space remaining:
data = {'val': ['1008.59 ', '1008.59-', '57,039.54 ', '4,232.49 ', '4,191.59-', '1,257,039.54 ', '2,257,039.54-']}
df = pd.DataFrame(data)
df['val'] = df['val'].str.replace(r'(\d+(?:,\d+)*(?:\.\d+)?)(\s|-)', r'\2\1', regex=True).str.strip()
Result:
print(df)
val
0 1008.59
1 -1008.59
2 57,039.54
3 4,232.49
4 -4,191.59
5 1,257,039.54
6 -2,257,039.54
If you also want to remove the thousands separators (,), you can use:
df['val'] = df['val'].str.replace(r'(\d+(?:,\d+)*(?:\.\d+)?)(\s|-)', r'\2\1', regex=True).str.strip().str.replace(',', '', regex=True)
Result:
print(df)
val
0 1008.59
1 -1008.59
2 57039.54
3 4232.49
4 -4191.59
5 1257039.54
6 -2257039.54
If you want to further convert the numbers from string to numeric values (float type), you can use:
df['val'] = df['val'].str.replace(r'(\d+(?:,\d+)*(?:\.\d+)?)(\s|-)', r'\2\1', regex=True).str.strip().str.replace(',', '', regex=True).astype(float)
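The chained call above can be easier to follow one string at a time. Below is a minimal sketch using the standard re module with the same pattern; the helper name normalize is hypothetical and only there for illustration. It swaps the trailing sign or space to the front with the \2\1 backreferences, strips whitespace, drops the commas, and converts to float.

```python
import re

# Same pattern as above: group 1 is the number, group 2 the trailing space or '-'
PATTERN = r"(\d+(?:,\d+)*(?:\.\d+)?)(\s|-)"

def normalize(text):
    """Hypothetical helper: '1,008.59-' -> -1008.59, '1008.59 ' -> 1008.59."""
    swapped = re.sub(PATTERN, r"\2\1", text)  # move the sign/space to the front
    cleaned = swapped.strip().replace(",", "")  # drop spaces and separators
    return float(cleaned)

for raw in ["1008.59 ", "1008.59-", "2,257,039.54-"]:
    print(repr(raw), "->", normalize(raw))
```

A value with no trailing space or sign does not match the pattern and is left untouched by re.sub, so it is still converted correctly.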
dfcolumn = [PUEF2CarmenXFc034DpEd, PUEF2BalulanFc034CamH, CARF1BalulanFc013Baca, ...]
My output should be:
dfnewcolumn1 = [PUEF2, PUEF2 , CARF1]
dfnewcolumn2 = [CarmenXFc034DpEd, BalulanFc034CamH, BalulanFc013Baca]
Assuming you split at a fixed number of characters (e.g. 5 here), you can use:
df['dfnewcolumn1'] = df['dfcolumn'].str[:5]
df['dfnewcolumn2'] = df['dfcolumn'].str[5:]
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
If you instead split right after the first digit in the string, you can use:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Using the following modified original data with more test cases:
dfcolumn
0 PUEF2CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH
2 CARF1BalulanFc013Baca
3 CAF1BalulanFc013Baca
4 PUEFA2BalulanFc034CamH
Run code:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
3 CAF1BalulanFc013Baca CAF1 BalulanFc013Baca
4 PUEFA2BalulanFc034CamH PUEFA2 BalulanFc034CamH
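As an alternative to the two split calls and the temporary column, the same "split after the first digit" rule can be sketched with Series.str.extract and two named groups. This is a different technique from the one above, offered only as a sketch; the sample rows are taken from the test data and the named groups reuse the column names from the question.

```python
import pandas as pd

df = pd.DataFrame({"dfcolumn": [
    "PUEF2CarmenXFc034DpEd",
    "CAF1BalulanFc013Baca",
    "PUEFA2BalulanFc034CamH",
]})

# Group 1: any run of non-digits plus the first digit; group 2: the rest
parts = df["dfcolumn"].str.extract(
    r"^(?P<dfnewcolumn1>\D*\d)(?P<dfnewcolumn2>.*)$"
)

# Named groups become column names, so a plain join attaches them
df = df.join(parts)
print(df)
```

Like the split-based version, this cuts after the first digit only, so a prefix ending in several digits would need \d+ instead of \d.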
Assuming your prefix consists of a sequence of letters followed by a sequence of digits, both of variable length, a regex-based split function can be written and applied to each cell.
Solution
import pandas as pd
import re
# data
df = pd.DataFrame()
df["dfcolumn"] = ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]
def f_split(s: str):
    """Split a string into two parts by regex."""
    # alphabet(s) followed by digit(s)
    o = re.match(r"^([A-Za-z]+\d+)(.*)$", s)
    # may add exception handling here if there is no match
    return o.group(1), o.group(2)
df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].apply(f_split).to_list()
Note the .to_list(): it converts the Series of tuples returned by apply into a plain list of tuples, which is required for the assignment to the two new columns to work.
Result
print(df)
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
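The comment in f_split mentions handling the no-match case. One possible way to do that is sketched below; the function name split_prefix, the extra "noprefix" row, and the choice to return None for the prefix and keep the whole string as the remainder are all assumptions made for illustration.

```python
import re
import pandas as pd

# Letters followed by digits, then everything else
PREFIX = re.compile(r"^([A-Za-z]+\d+)(.*)$")

def split_prefix(s):
    """Hypothetical variant of f_split that tolerates non-matching strings."""
    m = PREFIX.match(s)
    if m is None:
        # Assumption: no prefix found, so keep the full string as the remainder
        return None, s
    return m.group(1), m.group(2)

df = pd.DataFrame({"dfcolumn": ["PUEF2CarmenXFc034DpEd", "noprefix"]})

# Build the two new columns explicitly, aligned on the original index
parts = pd.DataFrame(
    df["dfcolumn"].map(split_prefix).tolist(),
    columns=["dfnewcolumn1", "dfnewcolumn2"],
    index=df.index,
)
df = df.join(parts)
print(df)
```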
How about this compact solution:
import pandas as pd
df = pd.DataFrame({"original": ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]})
df2 = pd.DataFrame(df.original.str.split(r"(\d)", n=1).to_list(), columns=["part1", "separator", "part2"])
df2.part1 = df2.part1 + df2.separator.astype(str)
df2
part1 separator part2
0 PUEF2 2 CarmenXFc034DpEd
1 PUEF2 2 BalulanFc034CamH
2 CARF1 1 BalulanFc013Baca
I use:
Series.str.split with a regex pattern and a kwarg to specify that it should only split on the first match.
in the regex pattern, I use a group (the parentheses in (\d)) to capture the separating character
to_list() to output the split as a list of lists
DataFrame constructor to build a new DataFrame from that list
string concat of two columns
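The role of the capture group is easiest to see on a single string with the standard re module, which follows the same rule: when the pattern contains a group, the matched separator is kept in the result. A small sketch (the variable names are only for illustration):

```python
import re

text = "PUEF2CarmenXFc034DpEd"

# With a capture group, the separating digit is returned as its own element
with_group = re.split(r"(\d)", text, maxsplit=1)

# Without the group, the separating digit is discarded
without_group = re.split(r"\d", text, maxsplit=1)

print(with_group)
print(without_group)
```

That kept separator is what makes it possible to glue the digit back onto the first part afterwards.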
For this dataframe, what is the best way to get rid of the * in "Stad Brussel*"? In the real dataframe, the * is also on the upside. Please refer to the pic. Thanks.
Dutch name postcode Population
0 Anderlecht 1070 118241
1 Oudergem 1160 33313
2 Sint-Agatha-Berchem 1082 24701
3 Stad Brussel* 1000 176545
4 Etterbeek 1040 47414
Desired results:
Dutch name postcode Population
0 Anderlecht 1070 118241
1 Oudergem 1160 33313
2 Sint-Agatha-Berchem 1082 24701
3 Stad Brussel 1000 176545
4 Etterbeek 1040 47414
You can try:
df['Dutch name'] = df['Dutch name'].replace({r'\*': ''}, regex=True)
This removes all * characters in the 'Dutch name' column. If you need to remove the character from multiple columns, use:
df = df.replace({r'\*': ''}, regex=True)
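If escaping feels error-prone, a literal (non-regex) replacement avoids the question of special characters altogether. The following is a minimal sketch on a cut-down copy of the table from the question; the trailing .str.strip() is an extra assumption in case stray whitespace is left behind.

```python
import pandas as pd

df = pd.DataFrame({
    "Dutch name": ["Anderlecht", "Stad Brussel*", "Etterbeek"],
    "postcode": [1070, 1000, 1040],
})

# regex=False treats '*' as a plain character, so no backslash is needed
df["Dutch name"] = (
    df["Dutch name"].str.replace("*", "", regex=False).str.strip()
)
print(df)
```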
If you manipulate only strings you can use regular expression matching. See here.
Something like :
import re
txt = 'Your file as a string here'
out = re.sub(r'\*', '', txt)
out now contains what you want.
For a dataframe, first define the column(s) to be checked:
cols_to_check = ['4']
Then replace the escaped character (a bare '*' is not a valid regex):
df[cols_to_check] = df[cols_to_check].replace({r'\*': ''}, regex=True)
can anyone make me understand this piece of code.
def remove_digit(data):
    newData = ''.join([i for i in data if not i.isdigit()])
    i = newData.find('(')
    if i>-1: newData = newData[:i]
    return newData.strip()
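Why not use a regex? The character class [0-9()] matches any digit, ( or ). Note that, unlike the original function, this keeps any text that follows the first ( instead of cutting it off.
newData = re.sub(r'[0-9()]', '', data)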
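To see what the function from the question does, and where the regex one-liner differs, here is a small side-by-side sketch. The sample string and the names regex_version and closer_equivalent are made up for illustration; closer_equivalent is only one possible way to mimic the original steps, assuming ASCII digits.

```python
import re

def remove_digit(data):
    # Step 1: drop every digit character
    no_digits = "".join(ch for ch in data if not ch.isdigit())
    # Step 2: cut the string at the first '(' if there is one
    i = no_digits.find("(")
    if i > -1:
        no_digits = no_digits[:i]
    # Step 3: trim surrounding whitespace
    return no_digits.strip()

def regex_version(data):
    # Removes digits and parentheses, but keeps the text inside them
    return re.sub(r"[0-9()]", "", data)

def closer_equivalent(data):
    # Drop digits, keep only what precedes the first '(', then trim
    return re.sub(r"\d", "", data).split("(", 1)[0].strip()

sample = "abc123 (def)"
print(remove_digit(sample))
print(regex_version(sample))
print(closer_equivalent(sample))
```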
Given this df:
data
0 a43
1 b((
2 cr3r3
3 d
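You can remove digits and parentheses from the column in this way (regex=True must be passed explicitly in pandas 2.0 and later, where the default is regex=False):
df['data'] = df['data'].str.replace(r'\d|\(|\)', '', regex=True)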
Output:
data
0 a
1 b
2 crr
3 d
I have a column named "KL" with for example:
sem_0405M4209F2057_1.000
sem_A_0103M5836F4798_1.000
Now I want to extract the four digits after "M" and the four digits after "F". But with df["KL"].str.extract I can't get it to work.
Locations of M and F vary, thus just using the slice [9:13] won't work for the complete column.
If you want to use str.extract, here's how:
>>> df['KL'].str.extract(r'M(?P<M>[0-9]{4})F(?P<F>[0-9]{4})')
M F
0 4209 2057
1 5836 4798
Here, M(?P<M>[0-9]{4}) matches the character 'M' and then captures 4 digits following it (the [0-9]{4} part). This is put in the column M (specified with ?P<M> inside the capturing group). The same thing is done for F.
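A short sketch of how the extracted columns might be used afterwards. The third row, "no_match", is an invented value added to show that rows which do not fit the pattern come back as NaN; dropping them before converting to integers is one possible choice, not the only one.

```python
import pandas as pd

df = pd.DataFrame({"KL": [
    "sem_0405M4209F2057_1.000",
    "sem_A_0103M5836F4798_1.000",
    "no_match",  # hypothetical row that does not fit the pattern
]})

# Named groups M and F become the column names of the result
codes = df["KL"].str.extract(r"M(?P<M>[0-9]{4})F(?P<F>[0-9]{4})")

# Extracted values are strings; drop non-matching rows, then convert
codes_int = codes.dropna().astype(int)

print(codes)
print(codes_int)
```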
You could use split to achieve this, though a better way probably exists:
In [147]:
s = pd.Series(['sem_0405M4209F2057_1.000','sem_A_0103M5836F4798_1.000'])
s
Out[147]:
0 sem_0405M4209F2057_1.000
1 sem_A_0103M5836F4798_1.000
dtype: object
In [153]:
m = s.str.split('M').str[1].str.split('F').str[0].str[:4]
f = s.str.split('M').str[1].str.split('F').str[1].str[:4]
print(m)
print(f)
0 4209
1 5836
dtype: object
0 2057
1 4798
dtype: object
You can also use regex:
import re
def get_data(x):
    data = re.search(r'M(\d{4})F(\d{4})', x)
    if data:
        m = data.group(1)
        f = data.group(2)
        return m, f
    # implicitly returns None when there is no match
df = pd.DataFrame(data={'a': ['sem_0405M4209F2057_1.000', 'sem_0405M4239F2027_1.000']})
df['data'] = df['a'].apply(lambda x: get_data(x))
>>
a data
0 sem_0405M4209F2057_1.000 (4209, 2057)
1 sem_0405M4239F2027_1.000 (4239, 2027)