Replacing Special Symbols in Pandas Dataframe - python

Can anyone help me remove extra characters from a dataframe column? Should I use the string replace method?
For example,
`$7.99` -> `7.99`
`$16.99` -> `16.99`
`$0.99` -> `0.99`

You can remove the first character with .str[1:]:
df["Column1"] = df["Column1"].str[1:].astype(float)
print(df)
Prints:
Column1
0 7.99
1 16.99
2 0.99
Dataframe used:
Column1
0 $7.99
1 $16.99
2 $0.99

IIUC: consider the following dataframe:
df = pd.DataFrame(['$7.99', '$3.45', '$56.99'])
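You can use str.replace with regex=False, so that '$' is treated as a literal character rather than as the regex end-of-string anchor: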
df[0].str.replace('$', '', regex=False)
Output:
0 7.99
1 3.45
2 56.99
Name: 0, dtype: object
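The two answers above can be combined so that the symbol is removed and the result is numeric in one pass. This is only a minimal sketch, assuming every value is a string whose only non-numeric character is the '$' sign; the column name Column1 is taken from the first answer.

```python
import pandas as pd

df = pd.DataFrame({"Column1": ["$7.99", "$16.99", "$0.99"]})

# regex=False treats "$" as a literal character, not the end-of-string anchor.
cleaned = df["Column1"].str.replace("$", "", regex=False)

# Convert the remaining text to floats.
df["Column1"] = cleaned.astype(float)
print(df["Column1"].tolist())
```

Unlike .str[1:], this does not drop the first digit of a value that happens to have no '$' prefix.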

Related

Left justify pandas string column with pattern

I have a large pandas dataset with a messy string column which contains for example:
72.1
61
25.73.20
33.12
I'd like to fill the gaps so that every value matches the pattern XX.XX.XX (where each X is a digit):
72.10.00
61.00.00
25.73.20
33.12.00
thank you!
How about defining base_str = '00.00.00' and filling in the missing tail of each row's string from it:
base_str = '00.00.00'
df = pd.DataFrame({'ms_str':['72.1','61','25.73.20','33.12']})
print(df)
df['ms_str'] = df['ms_str'].apply(lambda x: x+base_str[len(x):])
print(df)
Output:
ms_str
0 72.1
1 61
2 25.73.20
3 33.12
ms_str
0 72.10.00
1 61.00.00
2 25.73.20
3 33.12.00
Here is a vectorized solution that works for this particular pattern (the column is called col here). First pad with zeros on the right to 8 characters, then replace every third character with a dot:
df['col'].str.ljust(8, fillchar='0').str.replace(r'(..).', r'\1.', regex=True)
Output:
0 72.10.00
1 61.00.00
2 25.73.20
3 33.12.00
Name: col, dtype: object
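Both answers above work on character positions. As an alternative, here is a rough sketch that pads piece by piece instead. The helper name pad_to_pattern and its parameters parts and width are hypothetical, and it assumes that a short piece such as '1' should be padded on the right to '10', as in the expected output of the question.

```python
def pad_to_pattern(value, parts=3, width=2):
    """Pad a dotted string such as '72.1' to the form XX.XX.XX."""
    pieces = value.split(".")
    # Right-pad every existing piece with zeros ('1' -> '10').
    pieces = [piece.ljust(width, "0") for piece in pieces]
    # Append all-zero pieces until there are `parts` of them.
    pieces += ["0" * width] * (parts - len(pieces))
    return ".".join(pieces)


messy = ["72.1", "61", "25.73.20", "33.12"]
print([pad_to_pattern(s) for s in messy])
```

On a dataframe, the same helper could be applied with df['ms_str'].apply(pad_to_pattern).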

Parse Out Last Sequence Of Numbers From Pandas Column to create new column

I have a dataframe with codes like the following and would like to create a new column that contains the last sequence of digits parsed out of each code.
array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
So the new column would contain the following:
array([2,2,2,12,None])
Sample data
df:
codes
0 K9ADXXL2
1 K9ADXL2
2 K9ADXS2
3 IVERMAXSCM12
4 HPDMUDOGDRYL
Use str.extract to get the digits at the end of each string and pass the result to pd.to_numeric:
pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce')
Out[11]:
0 2.0
1 2.0
2 2.0
3 12.0
4 NaN
Name: 0, dtype: float64
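The result above is float64 because NaN cannot be stored in a plain integer column. If integers with a missing-value marker are preferred, one option is the nullable Int64 dtype. A minimal sketch, in which the column name last_number is hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {"codes": ["K9ADXXL2", "K9ADXL2", "K9ADXS2", "IVERMAXSCM12", "HPDMUDOGDRYL"]}
)

# expand=False returns a Series instead of a one-column DataFrame.
digits = df["codes"].str.extract(r"(\d+)$", expand=False)

# "Int64" (capital I) is the nullable integer dtype: missing values become <NA>.
df["last_number"] = pd.to_numeric(digits, errors="coerce").astype("Int64")
print(df)
```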
If you want the value as a string of digits, you can use str.findall or str.extract as follows:
df.codes.str.findall(r'\d+$').str[0]
or
df.codes.str.extract(r'(\d+$)')[0]
Out[20]:
0 2
1 2
2 2
3 12
4 NaN
Name: codes, dtype: object
import re
import pandas as pd
def get_trailing_digits(s):
    match = re.search("[0-9]+$", s)
    return match.group(0) if match else None
original_column = pd.array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
new_column = pd.array([get_trailing_digits(s) for s in original_column])
# ['2', '2', '2', '12', None]
[0-9] means any digit
+ means one or more times
$ means only at the end of the string
You can use the apply function of a Series/DataFrame with get_trailing_digits as the function, e.g.
my_df["new column"] = my_df["old column"].apply(get_trailing_digits)

How to convert a string in float with a space - pandas

When I import an Excel file, some numbers in a column are float and some are not. How can I convert all to float? The space in 3 000,00 is causing me problems.
df['column']:
column
0 3 000,00
1 156.00
2 0
I am trying:
df['column'] = df['column'].str.replace(' ','')
but it's not working. I would then call .astype(float), but I cannot get that far.
Any solutions? Row 1 is already a float, but row 0 is a string.
Just cast them all to strings first:
df['column'] = [float(str(val).replace(' ','').replace(',','.')) for val in df['column'].values]
Example:
>>> df = pd.DataFrame({'column':['3 000,00', 156.00, 0]})
>>> df['column2'] = [float(str(val).replace(' ','').replace(',','.')) for val in df['column'].values]
>>> df
column column2
0 3 000,00 3000.0
1 156 156.0
2 0 0.0
import re
df['column'] = df['column'].apply(lambda x: re.sub("[^0-9.]", "", str(x).replace(',','.'))).astype(float)
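The attempt in the question fails because the .str methods only operate on string values, so the cells that are already numbers come back as NaN. A vectorized sketch of the same idea as the answers above, assuming a space is the thousands separator and a comma the decimal mark:

```python
import pandas as pd

df = pd.DataFrame({"column": ["3 000,00", 156.00, 0]})

# Cast every value to text first, so that the .str methods apply to all rows.
as_text = df["column"].astype(str)

# Drop any whitespace (thousands separator), then use "." as the decimal mark.
as_text = as_text.str.replace(r"\s+", "", regex=True).str.replace(",", ".", regex=False)

df["column"] = as_text.astype(float)
print(df["column"].tolist())
```

The \s class also matches a non-breaking space, which may be what a spreadsheet export actually contains instead of a plain space.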

Remove certain words from string

I have a dataframe df with a column containing strings. I have another dataframe df2 with one column (so it could be a Series) that contains one word per row.
I would like to remove from df all the words that appear in df2.
Example:
df:
ColString
0 I would like to buy apples.
df2:
Wordlist
0 like
1 apples
Result:
df:
ColString
0 I would to buy .
Any ideas? Thanks for the help!
You can use replace with regex=True (df1.col here stands for the string column):
df1.col.replace(df2.Wordlist.str.cat(sep='|'),'',regex=True)
Out[510]:
0 I would to buy .
Name: col, dtype: object
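The joined pattern above matches the words anywhere, including inside longer words, and would misbehave if a word contained regex metacharacters. A hedged sketch that escapes each word and adds word boundaries; the second row ("It is likely fine.") is a made-up example added only to show the boundary behavior:

```python
import re

import pandas as pd

df = pd.DataFrame({"ColString": ["I would like to buy apples.", "It is likely fine."]})
df2 = pd.DataFrame({"Wordlist": ["like", "apples"]})

# Escape each word, join with "|", and require word boundaries on both sides.
pattern = r"\b(?:" + "|".join(re.escape(word) for word in df2["Wordlist"]) + r")\b"

df["ColString"] = df["ColString"].str.replace(pattern, "", regex=True)
print(df["ColString"].tolist())
```

The removed words leave their surrounding spaces behind, so a further whitespace clean-up may be wanted.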

Python, pandas: how to remove greater than sign

Let's say I have the following example DataFrame
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
I would like to convert column A from string to integer. In the case of '<2', I'd like to simply take off the '<' sign and put 1 (the closest integer less than 2) in the second row. What's the most efficient way to do that? This is just an example. The actual data that I'm working on has hundreds of thousands of rows.
Thanks for your help in advance.
You could use Series.apply:
import pandas as pd
df = pd.DataFrame({'A':['1', '<2', '3']})
df['A'] = df['A'].apply(lambda x: int(x[1:])-1 if x.startswith('<') else int(x))
print(df.dtypes)
# A int64
# dtype: object
yields
print(df)
A
0 1
1 1
2 3
[3 rows x 1 columns]
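Since the question asks about efficiency on a large column, the same rule can also be written without a Python-level lambda. A minimal sketch, assuming every value is either a plain integer string or an integer string prefixed with '<':

```python
import pandas as pd

df = pd.DataFrame({"A": ["1", "<2", "3"]})

# Boolean mask marking the rows that carry a "<" prefix.
is_less_than = df["A"].str.startswith("<")

# Strip the prefix and parse the remaining digits.
values = df["A"].str.lstrip("<").astype(int)

# Subtract 1 only where the "<" was present (True counts as 1).
df["A"] = values - is_less_than.astype(int)
print(df["A"].tolist())
```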
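You can use applymap on the DataFrame to remove the "<" character wherever it appears in a string (applymap is deprecated since pandas 2.1 in favor of DataFrame.map). Note that this leaves the values as strings: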
df.applymap(lambda x: x.replace('<',''))
Here is the output:
A
0 1
1 2
2 3
Here are two other ways of doing this that may be helpful going forward.
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
Strip the leading '<' and convert to integers:
df.A.str.strip('<').astype(int)
Out[1]:
0 1
1 2
2 3
And this way would be helpful if you were trying to remove a character in the middle of your number (e.g. if you had a comma or something).
df = DataFrame({'A':['1', '1,002', '3']})
df.A.str.replace(',', '').astype(int)
Outputs
Out[11]:
0 1
1 1002
2 3
Name: A, dtype: int64
>>> import re
>>> df.applymap(lambda x: int(re.sub(r'[^0-9.]', '', x)))
A
0 1
1 2
2 3
