How to replace string separator(,) in Numerical Columns? - python

I'm trying to convert the "Quantity" column to int.
The quantity column has a string(,) divider or separator for the numerical values
using code
data['Quantity'] = data['Quantity'].astype('int')
data['Quantity'] = data['Quantity'].astype('float')
I am getting this error:
ValueError: could not convert string to float: '16,000'
ValueError: invalid literal for int() with base 10: '16,000'
Data
Date Quantity
2019-06-25 200
2019-03-30 100
2019-11-02 250
2018-10-23 100
2018-07-17 150
2018-05-31 150
2018-07-05 100
2018-10-04 100
2018-02-23 100
2019-09-16 204
2019-09-16 315
2019-11-09 113
2019-08-29 5
2019-08-23 4
2019-06-18 78
2019-12-06 4
2019-12-06 2
2019-10-03 16,000
2019-07-03 8,000
2018-12-12 32
Name: Quantity, dtype: object
It's a pandas dataframe with 124964 rows. I added the head and tail of the data
What can I do to fix this problem?

Solution
# Replace string "," with ""
data["Quantity"] = data["Quantity"].apply(lambda x: str(x.replace(',','')))
data['Quantity'] = data['Quantity'].astype('float')

'16,000' is not a valid representation of either an int or a float, and the format is actually ambiguous: depending on the locale, it could mean either 16.0 (float) or 16000 (int).
You first need to decide how this data should be interpreted, then fix the string so that it's a valid representation of either a float or an int, then apply astype() with the correct type.
To make '16,000' a valid float representation, you just have to replace the comma with a dot:
val = '16,000'
val = val.replace(",", ".")
To make it an int (with value 16000) you just remove the comma:
val = '16,000'
val = val.replace(",", "")
I don't use pandas so I can't tell how best to do this with a dataframe, but this is surely documented.
As a general rule: when working on data coming from the wild outside world (anything outside your own code), never trust the data, always make sure you validate and sanitize it before use.
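Applied to a pandas column, the comma-removal step above might look like the following. This is a minimal sketch, assuming the comma is a thousands separator (so '16,000' means 16000); the sample values and the `csv_text` string are made up for illustration. If the data comes from a CSV file, the `thousands` parameter of `pd.read_csv` may be able to do the same job at load time.

```python
import io

import pandas as pd

# Hypothetical sample resembling the question's column (all values are strings).
data = pd.DataFrame({"Quantity": ["200", "16,000", "8,000", "32"]})

# Strip the thousands separator, then let pandas pick a numeric dtype.
data["Quantity"] = pd.to_numeric(
    data["Quantity"].astype(str).str.replace(",", "", regex=False)
)

# Alternative at load time: tell the CSV parser about the separator.
# csv_text is an assumed stand-in for the real file.
csv_text = 'Date,Quantity\n2019-10-03,"16,000"\n2018-12-12,32\n'
from_csv = pd.read_csv(io.StringIO(csv_text), thousands=",")
```

Using `pd.to_numeric` rather than `astype(int)` keeps the option of passing `errors="coerce"` later, which turns unparseable values into NaN instead of raising.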

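number = '16,000'
act_num = ''
for char in number:
    try:
        int(char)  # raises ValueError for anything that is not a digit
        act_num += char
    except ValueError:
        # keep the sign and the decimal point, drop every other character
        if char == '-' or char == '.':
            act_num += char
print(float(act_num))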

data.Quantity = data.Quantity.astype(str).str.replace(',', '', regex=False).astype(int)

Related

Convert the string into a float value

I have copied a table with three columns from a pdf file. I am attaching the screenshot from the PDF here:
The values in the column padj are exponential values, however, when you copy from the pdf to an excel and then open it with pandas, these are strings or object data types. Hence, these values cannot be parsed as floats or numeric values. I need these values as floats, not as strings. Can someone help me with some suggestions?
So far this is what I have tried.
The Excel or CSV file is then opened in Python using the unicode_escape encoding in order to circumvent the UnicodeDecodeError
## open the file
df = pd.read_csv("S2_GSE184956.csv",header=0,sep=',',encoding='unicode_escape')[["DEGs","LFC","padj"]]
df.head()
DEGs padj LFC
0 JUNB 1.5 ×10-8 -1.273329
1 HOOK2 2.39×10-7 -1.109320
2 EGR1 3.17×10-6 -4.187828
3 DUSP1 3.95×10-6 -3.251030
4 IL6 3.95×10-6 -3.415500
5 ARL4C 5.06×10-6 -2.147519
6 NR4A2 2.94×10-4 -3.001167
7 CCL3L1 4.026×10-4 -5.293694
# Convert the string to float by replacing the x10- with exponential sign
df['padj'] = df['padj'].apply(lambda x: (unidecode(x).replace('x10-','x10-e'))).astype(float)
That threw an error,
ValueError: could not convert string to float: '1.5 x10-e8'
Any suggestions would be appreciated. Thanks
With the dataframe shared in the question on this last edit, the following using pandas.Series.str.replace and pandas.Series.astype will do the work:
df['padj'] = df['padj'].str.replace('×10','e').str.replace(' ', '').astype(float)
The goal is to get the cells to look like the following: 1.500000e-08.
Notes:
Depending on the rest of the dataframe, additional adjustments might still be required, such as removing stray apostrophes (') that might exist in some of the cells. For that one can use pandas.Series.str.replace as follows
df['padj'] = df['padj'].str.replace("'", '')
Considering your sample (column padj), the code below should work:
f_value = float(str_float.replace('x10', 'e').replace(' ', ''))
Updated based on the data you provided above. The most significant thing being that the x is actually a times symbol:
import pandas as pd
DEGs = ["JUNB", "HOOK2", "EGR1", "DUSP1", "IL6", "ARL4C", "NR4A2", "CCL3L1"]
padj = ["1.5 ×10-8", "2.39×10-7", "3.17×10-6", "3.95×10-6", "3.95×10-6", "5.06×10-6", "2.94×10-4", "4.026×10-4"]
LFC = ["-1.273329", "-1.109320", "-4.187828", "-3.251030", "-3.415500", "-2.147519", "-3.001167", "-5.293694"]
df = pd.DataFrame({'DEGs': DEGs, 'padj': padj, 'LFC': LFC})
# change to python-friendly float format
df['padj'] = df['padj'].str.replace(' ×10-', 'e-', regex=False)
df['padj'] = df['padj'].str.replace('×10-', 'e-', regex=False)
# convert padj from string to float
df['padj'] = df['padj'].astype(float)
will give you this dataframe:
If you want a numerical vectorial solution, you can use:
df['float'] = (df['padj'].str.extract(r'(\d+(?:\.\d+))\s*×10(.?\d+)')
.apply(pd.to_numeric).pipe(lambda d: d[0].mul(10.**d[1]))
)
output:
DEGs padj LFC float
0 JUNB 1.5 ×10-8 -1.273329 1.500000e-08
1 HOOK2 2.39×10-7 -1.109320 2.390000e-07
2 EGR1 3.17×10-6 -4.187828 3.170000e-06
3 DUSP1 3.95×10-6 -3.251030 3.950000e-06
4 IL6 3.95×10-6 -3.415500 3.950000e-06
5 ARL4C 5.06×10-6 -2.147519 5.060000e-06
6 NR4A2 2.94×10-4 -3.001167 2.940000e-04
7 CCL3L1 4.026×10-4 -5.293694 4.026000e-04
Intermediate:
df['padj'].str.extract(r'(\d+(?:\.\d+))\s*×10(.?\d+)')
0 1
0 1.5 -8
1 2.39 -7
2 3.17 -6
3 3.95 -6
4 3.95 -6
5 5.06 -6
6 2.94 -4
7 4.026 -4
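The extraction pattern above requires a decimal point in the mantissa, so a cell such as "3×10-5" would not match. A slightly more tolerant variant could look like the sketch below. The function name `sci_to_float` is hypothetical, and the extra cases it accepts (an integer mantissa, a plain "x", a Unicode minus sign, optional spaces) are assumptions about what a PDF copy might produce, not something confirmed by the question's data.

```python
import pandas as pd

def sci_to_float(col: pd.Series) -> pd.Series:
    """Convert strings like '1.5 ×10-8' to floats; unmatched cells become NaN."""
    # Group 0: mantissa with optional decimal part.
    # Group 1: exponent with optional sign (ASCII '+', '-' or Unicode minus).
    parts = col.str.extract(r"(\d+(?:\.\d+)?)\s*[×x]\s*10\s*([+\-\u2212]?\d+)")
    exponent = parts[1].str.replace("\u2212", "-", regex=False)
    # Build '1.5e-8' style strings and let pandas parse them.
    return pd.to_numeric(parts[0].str.cat(exponent, sep="e"), errors="coerce")

padj = pd.Series(["1.5 ×10-8", "4.026×10-4", "3×10\u22125", "2 x 10+3", "n/a"])
converted = sci_to_float(padj)
```

Building the "e" notation string and parsing it avoids the small rounding differences that multiplying by a power of ten can introduce.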

pandas: convert column with multiple datatypes to int, ignore errors

I have a column with data that needs some massaging. The column may contain strings or floats, and some strings are in exponential form. I'd like to format all data in this column as whole numbers where possible, expanding any exponential notation to an integer. Here is an example:
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].astype(int, errors = 'ignore')
The above code does not seem to do a thing. I know I can convert the exponential notation and decimals simply by using the int function, and I would think the above astype would do the same, but it does not. For example, the following code works in Python:
int(1170E1), int(1.17E+04), int(11700.0)
> (11700, 11700, 11700)
Any help in solving this would be appreciated. What i'm expecting the output to look like is:
0 '11700'
1 '11700'
2 '11700'
3 '24477G'
4 '124601'
5 '247602'
You may check with pd.to_numeric
df.code = pd.to_numeric(df.code,errors='coerce').fillna(df.code)
Out[800]:
0 11700.0
1 11700.0
2 11700.0
3 24477G
4 124601.0
5 247602.0
Name: code, dtype: object
Update
df['code'] = df['code'].astype(object)
s = pd.to_numeric(df['code'],errors='coerce')
df.loc[s.notna(),'code'] = s.dropna().astype(int)
df
Out[829]:
code
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
BENY's answer should work, although you potentially leave yourself open to catching exceptions and filling that you don't want to. This will also do the integer conversion you are looking for.
def convert(x):
    try:
        return str(int(float(x)))
    except ValueError:
        return x
df = pd.DataFrame({'code': ['1170E1', '1.17E+04', 11700.0, '24477G', '124601', 247602.0]})
df['code'] = df['code'].apply(convert)
outputs
0 11700
1 11700
2 11700
3 24477G
4 124601
5 247602
where each element is a string.
I will be the first to say, I'm not proud of that triple cast.

while filtering the data from dataframe TypeError: must be real number, not str

we have a dataframe as
print(df)
Empld EmpName Date
1234 Ram 2020-01-01 01:01:01
2332 Andy 2010-11-11 01:01:01
2233 Jim 2009-01-11 01:01:01
when i try to filter the data in the dataframe
dfemp = df[df['Empld'] == '1234']
print(dfemp)
Empld EmpName Date
1234 Ram 2020-01-01 01:01:01
My code is below. I am trying to assign only the date value to a variable, as the dataframe will always have only one record with id '1234':
if dfemp.empty :
    EmDt = "2000-01-01"
else :
    EmDt = dfemp['Date'].values[0].replace("[","").replace("]","")[:10]
i am getting below error
Error: TypeError: must be real number, not str
Is there any way to overcome this error, i am trying to get final output to a variable
EmDt=2020-01-01(if it has value then "2020-01-01" if not "2000-01-01" static value)
I assume that all columns in df are of string type.
When you create dfemp, it is a DataFrame, and you want to read
Date column from the first row, also as a string.
To do it run:
if dfemp.empty:
    EmDt = "2000-01-01"
else:
    EmDt = dfemp.iloc[0].Date[:10]
replace is not needed here.
Another detail to check:
print(type(dfemp.iloc[0].Date).__name__)
The result should be "str". If the result is anything else, there is
something wrong / unexpected with your source data.
According to what I understood from your question, this should work.
dfemp['Date'] = pd.to_datetime(dfemp['Date'])
if dfemp.empty:
    EmDt = "2000-01-01"
else:
    # select by position: the label 0 may not exist after filtering
    EmDt = dfemp['Date'].dt.date.iloc[0]
print(EmDt)
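The two answers above can be combined into one helper that works whether the Date column holds strings or datetimes. This is only a sketch: the function name `first_date_or_default` and its parameters are hypothetical, and it assumes the column names from the question (`Empld`, `Date`) and that the id column is stored as strings.

```python
import pandas as pd

def first_date_or_default(df, emp_id, default="2000-01-01"):
    """Return the first matching Date as 'YYYY-MM-DD', or default if no row matches."""
    match = df.loc[df["Empld"] == emp_id, "Date"]
    if match.empty:
        return default
    # pd.to_datetime accepts both strings and datetime-like scalars.
    return pd.to_datetime(match.iloc[0]).strftime("%Y-%m-%d")

df = pd.DataFrame({
    "Empld": ["1234", "2332", "2233"],
    "EmpName": ["Ram", "Andy", "Jim"],
    "Date": ["2020-01-01 01:01:01", "2010-11-11 01:01:01", "2009-01-11 01:01:01"],
})

EmDt = first_date_or_default(df, "1234")
missing = first_date_or_default(df, "9999")
```

Selecting with `.iloc[0]` rather than `[0]` matters here because the filtered rows keep their original index labels.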

How to Convert Panda Strings Containing " $ - , " Characters to Float

I have a df where some object columns contain $, ,, negative numbers and .:
Date Person Salary Change
0 11/1/15 Mike $100.52 ($20)
1 11/1/15 Bill $300.11 ($300.22)
2 11/1/15 Jake - ($1,100)
3 11/1/15 Jack $411.43 $500
4 11/1/15 Faye NaN $1,000.12
5 11/1/15 Clay $122.00 $100
6 11/1/15 Dick $1,663.33 -
I want to convert them to float, but when I try:
df['Salary'] = df['Salary'].str.replace(',', '').str.replace('$', '').str.replace('-', '').astype(float)
I get a ValueError with an empty value: could not convert string to float:. It seems like the - is causing the issue, so is there an elegant way of handling it?
I would use a plain Python function because it is easier to write and test:
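import numpy as np

def conv(txt):
    txt = str(txt).strip()
    # accounting style: a value that ends with ')' is negative
    neg = txt.endswith(')')
    try:
        val = float(txt.strip('$()-,').replace(',', ''))
    except ValueError:
        # a lone '-' becomes '' after stripping and cannot be parsed
        val = np.nan
    return -val if neg else val

df['Salary'] = df['Salary'].apply(conv)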
Try:
df['Salary'] = df['Salary'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).str.replace('-', '0', regex=False).astype(float)
Your issue is most likely trying to convert blank strings to float. Python does not treat '' as a float. You are better off replacing it with 0.
Or a better solution:
df['Salary'] = df['Salary'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).str.replace('-', '0', regex=False)
df['Salary'] = pd.to_numeric(df['Salary'], errors = 'coerce', downcast = 'float')
With errors='coerce', pd.to_numeric returns NaN for any value it cannot parse, so you can see which rows are causing the issue by looking for NaN in the result.
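Neither approach above treats the parenthesised values in the Change column, such as ($1,100), as negative in a vectorized way. A sketch of one way to do that is below. The function name `currency_to_float` is hypothetical, and it assumes that parentheses mean a negative amount and that a lone "-" means a missing value (NaN) rather than zero.

```python
import pandas as pd

def currency_to_float(col: pd.Series) -> pd.Series:
    """Parse strings like '$1,000.12' or '($20)' into floats; unparseable cells become NaN."""
    s = col.astype(str).str.strip()
    # Accounting notation: a value wrapped in parentheses is negative.
    negative = s.str.startswith("(") & s.str.endswith(")")
    # Drop currency symbol, thousands separators and parentheses.
    cleaned = s.str.replace(r"[$,()]", "", regex=True)
    values = pd.to_numeric(cleaned, errors="coerce")
    # Flip the sign only where the original value was parenthesised.
    return values.where(~negative, -values)

change = pd.Series(["($20)", "($300.22)", "($1,100)", "$500", "$1,000.12", "$100", "-"])
parsed = currency_to_float(change)
```

Whether a lone "-" should become NaN or 0 depends on what it means in the source data, so that choice is worth checking before relying on the result.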

print index and value if value is str in a numeric data type column pandas dataframe

I am new to data science and currently I'm exploring a bit further. I have over 600,000 columns of a data set and I'm currently cleaning and checking it for inconsistency or outliers. I came across a problem which I am not sure how to solve it. I have some solutions in mind but I am not sure how to do it with pandas.
I have converted the data types of some columns from object to int. I got no errors and checked whether it's in int and it was. I checked the values of one column to check for the factual data. This involves age and I got an error saying my column has a string. so I checked it using this method:
print('if there is string in numeric column',np.any([isinstance(val, str) for val in homicide_df['Perpetrator Age']]))
Now, I wanted to print all indices and with their values and type only on this column which has the string data type.
currently I came up with this solution that works fine:
def check_type(homicide_df):
    for age in homicide_df['Perpetrator Age']:
        if type(age) is str:
            print(age, type(age))
check_type(homicide_df)
Here are some of the questions I have:
is there a pandas way to do the same thing?
how should I convert these elements to int?
why did some elements in the column not convert to int?
I would appreciate any help. Thank you very much
You can use items (called iteritems in older pandas versions; iteritems was removed in pandas 2.0):
def check_type(homicide_df):
    for i, age in homicide_df['Perpetrator Age'].items():
        if type(age) is str:
            print(i, age, type(age))
homicide_df = pd.DataFrame({'Perpetrator Age':[10, '15', 'aa']})
print (homicide_df)
Perpetrator Age
0 10
1 15
2 aa
def check_type(homicide_df):
    for i, age in homicide_df['Perpetrator Age'].items():
        if type(age) is str:
            print(i, age, type(age))
check_type(homicide_df)
1 15 <class 'str'>
2 aa <class 'str'>
If the values are mixed (numeric and non-numeric), it is better to filter with a boolean mask:
def check_type(homicide_df):
    return homicide_df.loc[homicide_df['Perpetrator Age'].apply(type)==str,'Perpetrator Age']
print (check_type(homicide_df))
1 15
2 aa
Name: Perpetrator Age, dtype: object
If all values are numeric, but all types are str:
print ((homicide_df['Perpetrator Age'].apply(type)==str).all())
True
homicide_df = pd.DataFrame({'Perpetrator Age':['10', '15']})
homicide_df['Perpetrator Age'] = homicide_df['Perpetrator Age'].astype(int)
print (homicide_df)
Perpetrator Age
0 10
1 15
print (homicide_df['Perpetrator Age'].dtypes)
int32
But if some values are numeric and some are strings:
To convert to int, use to_numeric, which replaces non-numeric values with NaN. Then it is necessary to replace NaN with some number such as 0, and finally cast to int:
homicide_df = pd.DataFrame({'Perpetrator Age':[10, '15', 'aa']})
homicide_df['Perpetrator Age']=pd.to_numeric(homicide_df['Perpetrator Age'], errors='coerce')
print (homicide_df)
Perpetrator Age
0 10.0
1 15.0
2 NaN
homicide_df['Perpetrator Age'] = homicide_df['Perpetrator Age'].fillna(0).astype(int)
print (homicide_df)
Perpetrator Age
0 10
1 15
2 0
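Before filling with 0, it can help to see exactly which entries are strings and which of them cannot be parsed at all. The sketch below reuses the small sample frame from this answer; the variable names `is_str`, `report` and `unparseable` are illustrative only.

```python
import pandas as pd

homicide_df = pd.DataFrame({"Perpetrator Age": [10, "15", "aa"]})
col = homicide_df["Perpetrator Age"]

# Boolean mask: True where the stored Python object is a str.
is_str = col.map(lambda v: isinstance(v, str))

# Index, value and type name of every string entry.
report = pd.DataFrame({
    "value": col[is_str],
    "type": col[is_str].map(lambda v: type(v).__name__),
})

# Entries that are present but cannot be parsed as numbers.
unparseable = col[pd.to_numeric(col, errors="coerce").isna() & col.notna()]

print(report)
print(unparseable)
```

Here "15" shows up in `report` because it is stored as a string, but not in `unparseable`, since it converts cleanly; only "aa" would be lost to NaN.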
