I have a df where some object columns contain $, commas, periods, and negative numbers written in parentheses:
Date Person Salary Change
0 11/1/15 Mike $100.52 ($20)
1 11/1/15 Bill $300.11 ($300.22)
2 11/1/15 Jake - ($1,100)
3 11/1/15 Jack $411.43 $500
4 11/1/15 Faye NaN $1,000.12
5 11/1/15 Clay $122.00 $100
6 11/1/15 Dick $1,663.33 -
I want to convert them to float, but when I try:
df['Salary'] = df['Salary'].str.replace(',', '').str.replace('$', '').str.replace('-', '').astype(float)
I get ValueError: could not convert string to float: with an empty message. It seems like the '-' is causing the issue, so is there an elegant way of handling it?
I would use a plain Python function because it is easier to write and test:
import numpy as np

def conv(txt):
    txt = str(txt)
    txt = txt.strip()
    neg = txt.endswith(')')  # accounting notation: ($20) means -20
    try:
        val = float(txt.strip('$()-,').replace(',', ''))
    except ValueError:  # placeholders like '-' or ''
        val = np.nan
    return -val if neg else val
df['Salary'] = df['Salary'].apply(conv)
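A vectorized sketch of the same cleanup, assuming (as above) that parentheses mean a negative amount and '-' is a placeholder:
cleaned = df['Salary'].str.replace(r'[$,]', '', regex=True)
cleaned = cleaned.str.replace(r'^\((.*)\)$', r'-\1', regex=True)
df['Salary'] = pd.to_numeric(cleaned, errors='coerce')  # '-' and blanks become NaN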
Try:
df['Salary'] = df['Salary'].str.replace(',', '').str.replace('$', '').str.replace('-', '0').astype(float)
Your issue is most likely trying to convert blank strings to float: your replace('-', '') turns the '-' placeholder into '', and Python cannot convert '' to float. You are better off replacing it with '0'.
Or a better solution:
df['Salary'] = df['Salary'].str.replace(',', '').str.replace('$', '').str.replace('-', '0')
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce', downcast='float')
Since pd.to_numeric with errors='coerce' returns NaN for anything it cannot parse, this also lets you see which rows are causing the issue.
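To inspect those rows, a minimal sketch: coerce a cleaned copy and compare against the original:
cleaned = df['Salary'].str.replace(r'[$,]', '', regex=True)
bad = df[pd.to_numeric(cleaned, errors='coerce').isna() & df['Salary'].notna()]
print(bad)  # rows whose strings could not be parsed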
Loading in the data:
In: import pandas as pd
In: df = pd.read_csv('name', sep=';', encoding='unicode_escape')
In: df.dtypes
Out: amount    object
I have an object column with amounts like 150,01 and 43,69. There are about 5,000 rows.
df['amount']
0 31
1 150,01
2 50
3 54,4
4 32,79
...
4950 25,5
4951 39,5
4952 75,56
4953 5,9
4954 43,69
Name: amount, Length: 4955, dtype: object
Naturally, I tried to convert the series using the locale module, which is supposed to turn it into a float. I came back with the following error:
In: import locale
In: locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')
Out: 'en_US.UTF-8'
In: df['amount'].apply(locale.atof)
Out: ValueError: could not convert string to float: ' - '
Now that I'm aware that there are non-numeric values in the list, I tried to use isnumeric methods to turn the non-numeric values into NaN.
Unfortunately, due to the comma separated structure, all the values would turn into -1.
0 -1
1 -1
2 -1
3 -1
4 -1
..
4950 -1
4951 -1
4952 -1
4953 -1
4954 -1
Name: amount, Length: 4955, dtype: int64
How do I turn the ',' values into '.' after first removing the '-' values? I tried .drop() and .truncate(), but they did not help. If I simply replace the ',', it also causes trouble, since there are non-integer values.
Please help!
Documentation that I came across
- https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas
- https://stackoverflow.com/questions/56315468/replace-comma-and-dot-in-pandas
p.s. This is my first post, please be kind
Sounds like you have a European-style CSV similar to the following. If your format is different, provide actual sample data, as many commenters asked:
data.csv
thing;amount
thing1;31
thing2;150,01
thing3;50
thing4;54,4
thing5;1.500,22
To read it, specify the column, decimal and thousands separator as needed:
import pandas as pd
df = pd.read_csv('data.csv', sep=';', decimal=',', thousands='.')
print(df)
Output:
thing amount
0 thing1 31.00
1 thing2 150.01
2 thing3 50.00
3 thing4 54.40
4 thing5 1500.22
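If the column is already loaded as an object dtype, a sketch of the same conversion after the fact (this assumes ',' is the decimal separator, '.' the thousands separator, and that placeholders like ' - ' should become NaN):
df['amount'] = pd.to_numeric(
    df['amount'].str.replace('.', '', regex=False)
                .str.replace(',', '.', regex=False),
    errors='coerce')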
Posting as an answer since it contains multi-line code, despite not truly answering your question (yet):
Try using chardet. pip install chardet to get the package, then in your import block, add import chardet.
When importing the file, do something like:
with open("C:/path/to/file.csv", 'r') as f:
data = f.read()
result = chardet.detect(data.encode())
charencode = result['encoding']
# now re-set the handler to the beginning and re-read the file:
f.seek(0, 0)
data = pd.read_csv(f, delimiter=';', encoding=charencode)
Alternatively, for reasons I cannot fathom, passing engine='python' as a parameter often works. You'd just do
data = pd.read_csv('C:/path/to/file.csv', engine='python')
@Mark Tolonen has a more elegant approach to standardizing the actual data, but my (hacky) way of doing it was to just write a function:
def stripThousands(df_column):
    df_column = df_column.replace(',', '', regex=True)
    df_column = df_column.apply(pd.to_numeric, errors='coerce')
    return df_column
If you don't care about the entries that are just hyphens, you could use a function like
import numpy as np

def screw_hyphens(column):
    column.replace(['-'], np.nan, inplace=True)
or, if np.nan values will be a problem, you can just replace them with column.replace('-', '', inplace=True)
EDIT: there was a typo in the block outlining the usage of chardet. It should be correct now (previously the end of the last line was encoding=charenc).
I am extracting a number value from a csv column like:
column=[None, you earn 5%]
It would be great if it could store None as 0 and simply '5%' for the second one.
I tried to extract the % with the following code, but it raises the error
"TypeError: expected string or bytes-like object":
data.loc[(data['column'] == re.findall(r'([\w]+)', data['column'])), 'disc'] = re.findall(r'([0-9]+\%)',data['column'])
I also tried a for loop, but it didn't seem to help:
def fs(a):
    for i in a:
        if i == 'None':
            a[i] = 0
        else:
            a[i] = re.search(r'(?<=\().+?(?=\))', a[i])
If you have a DataFrame that has a string column and you want to replace the string 'None' with 0 while keeping numbers and %, then do:
df.textColumn.str.replace("None", "0").str.replace("[^0-9.%]", "", regex=True)
Example:
import pandas as pd
df = pd.DataFrame({'n':[1,2,3,4], 'text':["None","you earn 5%", "this is 3.4%", "5.5"]})
df['text'] = df.text.str.replace("None", "0").str.replace("[^0-9.%]", "", regex=True)
df
n text
0 1 0
1 2 5%
2 3 3.4%
3 4 5.5
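If you want actual numbers rather than strings like '5%', a hedged alternative is to extract the numeric part and treat everything else as 0 ('disc' here is the target column name from the question):
pct = df.text.str.extract(r'(\d+(?:\.\d+)?)', expand=False)
df['disc'] = pd.to_numeric(pct, errors='coerce').fillna(0)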
I am new to data science and currently I'm exploring a bit further. I have over 600,000 rows in a data set and I'm currently cleaning and checking it for inconsistencies or outliers. I came across a problem which I am not sure how to solve. I have some solutions in mind but I am not sure how to do them with pandas.
I converted the data types of some columns from object to int, got no errors, and confirmed they were int. Then I checked the values of one column, which involves age, for factual validity, and got an error saying my column has a string. So I checked it using this method:
print('if there is string in numeric column', np.any([isinstance(val, str) for val in homicide_df['Perpetrator Age']]))
Now I want to print all indices, with their values and types, only for entries in this column that are strings.
Currently I came up with this solution, which works fine:
def check_type(homicide_df):
    for age in homicide_df['Perpetrator Age']:
        if type(age) is str:
            print(age, type(age))

check_type(homicide_df)
Here are some of the questions I have:
is there a pandas way to do the same thing?
how should I convert these elements to int?
why did some elements in the column not convert to int?
I would appreciate any help. Thank you very much.
You can use iteritems (renamed to items in pandas 2.0):
def check_type(homicide_df):
    for i, age in homicide_df['Perpetrator Age'].iteritems():
        if type(age) is str:
            print(i, age, type(age))
homicide_df = pd.DataFrame({'Perpetrator Age': [10, '15', 'aa']})
print (homicide_df)
  Perpetrator Age
0              10
1              15
2              aa

def check_type(homicide_df):
    for i, age in homicide_df['Perpetrator Age'].iteritems():
        if type(age) is str:
            print(i, age, type(age))

check_type(homicide_df)
1 15 <class 'str'>
2 aa <class 'str'>
If the values are mixed (numeric with non-numeric), it is better to check with:
def check_type(homicide_df):
    return homicide_df.loc[homicide_df['Perpetrator Age'].apply(type) == str, 'Perpetrator Age']
print (check_type(homicide_df))
1 15
2 aa
Name: Perpetrator Age, dtype: object
If all values are numeric, but all types are str:
print ((homicide_df['Perpetrator Age'].apply(type)==str).all())
True
homicide_df = pd.DataFrame({'Perpetrator Age':['10', '15']})
homicide_df['Perpetrator Age'] = homicide_df['Perpetrator Age'].astype(int)
print (homicide_df)
Perpetrator Age
0 10
1 15
print (homicide_df['Perpetrator Age'].dtypes)
int32
But if numeric values are mixed with strings:
Convert with to_numeric, which replaces non-numeric values with NaN; then it is necessary to replace NaN with some numeric value like 0 and finally cast to int:
homicide_df = pd.DataFrame({'Perpetrator Age':[10, '15', 'aa']})
homicide_df['Perpetrator Age'] = pd.to_numeric(homicide_df['Perpetrator Age'], errors='coerce')
print (homicide_df)
Perpetrator Age
0 10.0
1 15.0
2 NaN
homicide_df['Perpetrator Age'] = homicide_df['Perpetrator Age'].fillna(0).astype(int)
print (homicide_df)
Perpetrator Age
0 10
1 15
2 0
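As a hedged alternative for recent pandas (1.0+), the nullable Int64 extension dtype keeps the missing values as NA without filling them with 0:
homicide_df['Perpetrator Age'] = pd.to_numeric(
    homicide_df['Perpetrator Age'], errors='coerce').astype('Int64')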
I am fairly new to Pandas and I am working on a project where I have a column that looks like the following:
AverageTotalPayments
$7064.38
$7455.75
$6921.90
ETC
I am trying to get the cost factor out of it, where the cost could be anything above 7000. First, this column is an object. Thus, I know that I probably cannot compare it to a number directly. My code looks like the following:
import pandas as pd
health_data = pd.read_csv("inpatientCharges.csv")
state = input("What is your state: ")
issue = input("What is your issue: ")
#This line of code will create a new dataframe based on the two letter state code
state_data = health_data[(health_data.ProviderState == state)]
#With the new data set I search it for the injury the person has.
issue_data=state_data[state_data.DRGDefinition.str.contains(issue.upper())]
#I then make it replace the $ sign with a '' so I have a number. I also believe at this point my code may be starting to break down.
issue_data = issue_data['AverageTotalPayments'].str.replace('$', '')
#Since the previous line took out the $ I convert it from an object to a float
issue_data = issue_data[['AverageTotalPayments']].astype(float)
#I attempt to print out the values.
cost = issue_data[(issue_data.AverageTotalPayments >= 10000)]
print(cost)
When I run this code I simply get nan back. Not exactly what I want. Any help with what is wrong would be great! Thank you in advance.
Try this:
In [83]: df
Out[83]:
AverageTotalPayments
0 $7064.38
1 $7455.75
2 $6921.90
3 aaa
In [84]: df.AverageTotalPayments.str.extract(r'.*?(\d+\.*\d*)', expand=False).astype(float) > 7000
Out[84]:
0 True
1 True
2 False
3 False
Name: AverageTotalPayments, dtype: bool
In [85]: df[df.AverageTotalPayments.str.extract(r'.*?(\d+\.*\d*)', expand=False).astype(float) > 7000]
Out[85]:
AverageTotalPayments
0 $7064.38
1 $7455.75
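One caveat: the extract pattern stops at the first non-digit, so a value like '$1,234.56' would yield just '1'. A hedged fix, assuming ',' only ever appears as a thousands separator, is to strip commas first:
df.AverageTotalPayments.str.replace(',', '', regex=False).str.extract(r'.*?(\d+\.?\d*)', expand=False).astype(float)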
Consider the pd.Series s
s
0 $7064.38
1 $7455.75
2 $6921.90
Name: AverageTotalPayments, dtype: object
This gets the float values
pd.to_numeric(s.str.replace('$', '', regex=False), errors='ignore')
0 7064.38
1 7455.75
2 6921.90
Name: AverageTotalPayments, dtype: float64
Filter s
s[pd.to_numeric(s.str.replace('$', '', regex=False), errors='ignore') > 7000]
0 $7064.38
1 $7455.75
Name: AverageTotalPayments, dtype: object
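Applied to the question's code, a minimal sketch that keeps the whole DataFrame while filtering (the original reassigned issue_data to a single column, losing the rest of the frame):
payments = pd.to_numeric(
    issue_data['AverageTotalPayments'].str.replace('$', '', regex=False),
    errors='coerce')
cost = issue_data[payments >= 10000]
print(cost)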
I want to read a dataframe from a fixed width flat file. This is a somewhat performance sensitive operation.
I would like all blank whitespace to be stripped from column value. After that whitespace is stripped, I want blank strings to be converted to NaN or None values. Here are the two ideas I had:
pd.read_fwf(path, colspecs=markers, names=columns,
            converters=create_convert_dict(columns))

def create_convert_dict(columns):
    convert_dict = {}
    for col in columns:
        convert_dict[col] = null_convert
    return convert_dict

def null_convert(value):
    value = value.strip()
    if value == "":
        return None
    else:
        return value
or:
pd.read_fwf(path, colspecs=markers, names=columns, na_values='',
            converters=create_convert_dict(columns))

def create_convert_dict(columns):
    convert_dict = {}
    for col in columns:
        convert_dict[col] = col_strip
    return convert_dict

def col_strip(value):
    return value.strip()
The second option depends on the converter (which strips whitespace) being evaluated before na_values.
I was wondering if the second one would work. The reason I am curious is that it seems better to retain NaN as the null value, as opposed to None.
I am also open to any other suggestions for how I might perform this operation (stripping whitespace and then converting blank strings to NaN).
I do not have access to a computer with pandas installed at the moment, which is why I cannot test this myself.
In the case of a fixed-width file, there is no need to do anything special to strip white space or handle missing fields. Below is a small example of a fixed-width file (read with pandas.read_fwf and io.StringIO): three columns, each of width 5, with trailing and leading white space plus missing data.
In [57]: data = """\
A B C
0 foo
3 bar 2.0
1 3.0
"""
In [58]: df = pandas.read_fwf(StringIO(data), widths=[5, 5, 5])
In [59]: df
Out[59]:
A B C
0 0 foo NaN
1 3 bar 2
2 1 NaN 3
In [60]: df.dtypes
Out[60]:
A int64
B object
C float64
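If a column still comes through as padded strings (for example when the colspecs are slightly off), a hedged post-read cleanup is to strip and convert empties yourself; 'B' here is just an illustrative column name:
import numpy as np
df['B'] = df['B'].str.strip().replace('', np.nan)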