Pandas Sales Analysis Help - ValueError: could not convert string to float: '' - python

I'm currently running a sales analysis on an Excel file with roughly 500 transactions. I have a column called "Sale Price" which should be read in as a float. Pandas read it in with dtype object, and when I try to change the dtype to float using:
df['Sale Price'].fillna(0).astype(float)
I get the following error:
ValueError: could not convert string to float: ''
I've tried mixing in various command combinations such as:
df.loc[pd.to_numeric(df['Sale Price'], errors='coerce').isnull()]
and:
pd.to_numeric(df['Sale Price']).astype(int)
in order to convert the column to a float, but now I'm thinking the issue is in how the data is being read in. I used the basic:
df = pd.read_excel('...')
Hopefully someone can help clarify where the issue is coming from, as I've been stuck for a while. Thank you!

You could replace your empty strings with 0 before changing it to float:
df["Sale Price"] = df["Sale Price"].astype(str).str.strip().replace("",0).astype(float)

You have an empty string somewhere in your Sale Price column, as indicated by the error:
ValueError: could not convert string to float: ''
To fix this, first run:
df['Sale Price'] = df['Sale Price'].where(df['Sale Price'] != '', 0)
This replaces any empty strings with zero.
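Both answers above turn blank prices into 0, which can skew sums and averages. As a minimal sketch of an alternative, assuming blanks should count as missing rather than zero, pd.to_numeric with errors='coerce' converts empty or whitespace-only strings to NaN instead of raising. The sample values below are made up for illustration and stand in for the Excel column.

```python
import pandas as pd

# Hypothetical sample standing in for the object-dtype column read from Excel.
prices = pd.Series(["19.99", "", "  ", "5", None], dtype=object)
df = pd.DataFrame({"Sale Price": prices})

# Entries that cannot be parsed (empty or whitespace-only strings) become NaN.
df["Sale Price"] = pd.to_numeric(df["Sale Price"], errors="coerce")

print(df["Sale Price"].dtype)         # float64
print(df["Sale Price"].isna().sum())  # count of blank or missing prices
```

If zero really is the right value for a blank price, a .fillna(0) after the conversion gives the same result as the answers above.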

Related

Pandas converting string to numeric - getting invalid literal for int() with base 10 error

I am trying to convert data from a csv file to a numeric type so that I can find the greatest and least value in each category. This is a short view of the data I am referencing:
Course    Grades_Recieved
098321    A,B,D
324323    C,B,D,F
213323    A,B,D,F
I am trying to convert the grades_received to numeric types so that I can create new categories that list the highest grade received and the lowest grade received in each course.
This is my code so far:
import pandas as pd
df = pd.read_csv('grades.csv')
df.astype({'Grades_Recieved': 'int64'}).dtypes
I have tried the code above, I have tried using to_numeric, but I keep getting an error: invalid literal for int() with base 10: 'A,B,D' and I am not sure how to fix this. I have also tried getting rid of the ',' but the error remains the same.
You can't convert a list of non-numeric strings into int/float, but you can get the desired result doing something like this:
df['Highest_Grade'] = df['Grades_Recieved'].str.split(',').apply(lambda x: min(x))
df['Lowest_Grade'] = df['Grades_Recieved'].str.split(',').apply(lambda x: max(x))
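The approach above works because 'A' sorts before 'F' alphabetically, so the minimum letter is the best grade. A small sketch on the sample rows from the question (the DataFrame is built by hand here rather than read from grades.csv):

```python
import pandas as pd

df = pd.DataFrame({
    "Course": ["098321", "324323", "213323"],
    "Grades_Recieved": ["A,B,D", "C,B,D,F", "A,B,D,F"],
})

# Split each comma-separated string into a list of letter grades.
grades = df["Grades_Recieved"].str.split(",")

# Alphabetical minimum is the best grade, alphabetical maximum the worst.
df["Highest_Grade"] = grades.apply(min)
df["Lowest_Grade"] = grades.apply(max)

print(df)
```

This relies on plain single-letter grades. Grades with modifiers such as A+ or A- would not sort correctly as strings and would need an explicit ranking.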

Unable to format double values in pyspark

I have a CSV data like the below:
time_value,annual_salary
5/01/19 01:02:16,120.56
06/01/19 2:02:17,12800
7/01/19 03:02:18,123.00
08/01/19 4:02:19,123isdhad
I want to keep only the numeric values, including decimals. Basically, I want to ignore the last record, since its annual_salary is alphanumeric, and I was able to do that. But when I tried to convert the column to proper decimal values, I got incorrect results. Below is my code:
df = df.withColumn("annual_salary", regexp_replace(col("annual_salary"), "\.", ""))
df = df.filter(~col("annual_salary").rlike("[^0-9]"))
df.show(truncate=False)
df.withColumn("annual_salary", col("annual_salary").cast("double")).show(truncate=False)
But it gives me records like the below:
which is incorrect.
Expected output:
annual_salary
120.56
12800.00
123.00
What could be wrong here? Do I need to implement a UDF for this type of conversion?
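The regexp_replace call strips the decimal point, so 120.56 becomes 12056 before the cast. Instead, filter out the rows that contain letters and cast to DecimalType:
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType
df.where(~col('annual_salary').rlike('[A-Za-z]')).withColumn('annual_salary', col('annual_salary').cast(DecimalType(38,2))).show()
+----------------+-------------+
|      time_value|annual_salary|
+----------------+-------------+
|5/01/19 01:02:16|       120.56|
|06/01/19 2:02:17|     12800.00|
|7/01/19 03:02:18|       123.00|
+----------------+-------------+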

cleaning numeric columns in pandas

I'm having trouble working with a scraped CSV file in pandas.
I have several columns; one of them contains prices such as '1 800 €'.
After importing the CSV as a DataFrame, I cannot convert this column to integer.
I removed the euro symbol without any problem:
data['prix']= data['prix'].str.strip('€')
I tried to remove the space with the same approach, but the space remained:
data['prix']= data['prix'].str.strip()
or
data['prix']= data['prix'].str.strip(' ')
or
data['prix']= data['prix'].str.replace(' ', '')
I tried to force the conversion to int:
data['prix']= pd.to_numeric(data['prix'], errors='coerce')
My column was filled with NaN values.
I also tried converting the dtypes before replacing the spaces in the strings:
data = data.convert_dtypes(convert_string=True)
But the result is the same: I cannot achieve my goal.
The spaces are still present and I cannot convert the column to integer.
I inspected the dataset in Excel and could not identify any particular problem in the data.
I also tried changing the encoding in read_csv, with the same result.
In this same dataset I had the same kind of values for the mileage, such as '15 256 km',
and I had no problem cleaning and converting them to int.
I would like to use a regex to copy only the digits of the field into a new column.
How should I proceed?
I am also open to other ideas.
Thank you.
Since you want to use a regex to copy only the digits of the field into a new column, use str.findall:
data['prix2'] = data['prix'].str.findall(r'\d+').str.join('').astype(int)
# Or if it raises an exception
data['prix2'] = pd.to_numeric(data['prix'].str.findall(r'(\d+)').str.join(''), errors='coerce')
To delete the white space, use this line:
data['prix']= data['prix'].str.replace(" ","")
and to convert the strings to int, use this line:
data['prix'] = [int(i) for i in data['prix']]
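Since the question reports that replacing ' ' had no effect, one possible cause is that the separator is not an ordinary space but a no-break space (U+00A0) or narrow no-break space (U+202F), which look identical on screen. This is an assumption, as the actual file is not shown. The sketch below uses made-up values to show how to reveal such characters and how to strip every non-digit regardless of which kind of space is present.

```python
import pandas as pd

# Hypothetical values: the thousands separators are U+00A0 and U+202F,
# not the ordinary space that str.replace(' ', '') looks for.
raw = pd.Series(["1\u00a0800 €", "12\u202f500 €", "950 €"], dtype=object)
data = pd.DataFrame({"prix": raw})

# repr() makes hidden characters visible as escape sequences.
print(repr(data["prix"].iloc[0]))

# \D matches any non-digit, so all kinds of spaces and the euro sign are dropped.
digits_only = data["prix"].str.replace(r"\D", "", regex=True)

# Int64 is the nullable integer dtype, so unparseable rows can stay missing.
data["prix2"] = pd.to_numeric(digits_only, errors="coerce").astype("Int64")

print(data)
```

Removing all non-digits is only safe for whole-number prices; a column with decimal separators would need the separator kept.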

Problem with converting a string that contains numbers to a numbers data type

I'm working with a column called 'amount' that contains transaction amounts. The column has a string dtype and I would like to convert it to a numeric dtype.
The code I wrote does convert the strings to numbers, but removing the ',' merges the decimals into the integer part, which produces extremely high values: 100000,95 became 10000095. I used the following code:
df["amount"] = df["amount"].str.replace(',', '')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
Can someone help me with this problem?
EDIT: Not all values contain decimals. I'm looking for a solution for only the values that contain a ','.
If you need floats, replace the comma with a period instead of removing it, then call pd.to_numeric as before:
df["amount"] = df["amount"].str.replace(',', '.')
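Put together with the conversion step from the question, the fix might look like the sketch below. The sample values are invented, and it assumes the comma is only ever a decimal separator (no thousands separators in the data).

```python
import pandas as pd

# Hypothetical sample: some amounts have a decimal comma, some do not.
amounts = pd.Series(["100000,95", "250", "12,5", "n/a"], dtype=object)
df = pd.DataFrame({"amount": amounts})

# Turn the decimal comma into a period; values without a comma are unchanged.
normalized = df["amount"].str.replace(",", ".", regex=False)

# Convert to float; anything still unparseable becomes NaN.
df["amount"] = pd.to_numeric(normalized, errors="coerce")

print(df)
```

If the data comes from a CSV file, passing decimal=',' to pd.read_csv is another option, since it parses decimal commas while reading.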

Error message when multiplying two columns together

I'm trying to create a new column in my data set using pandas that is the product of two other columns. One column is a dollar value called Price and the other is a count called Installs. Running the multiplication by itself gives me the error "can't multiply sequence by non-int of type 'str'".
I tried running the following code to convert the strings into integers.
pd.to_numeric(appdata['Installs'], errors ='ignore')
pd.to_numeric(appdata['Price'], errors= 'ignore')
appdata['Income'] = appdata['Installs'] * appdata['Price']
But this gives me the same error.
What other way could I convert my data into integer format?
Thanks in advance.
pd.to_numeric() does not modify the column in place; it returns a new object, so assign the result back:
appdata['Installs'] = pd.to_numeric(appdata['Installs'], errors ='ignore')
appdata['Price'] = pd.to_numeric(appdata['Price'], errors= 'ignore')
appdata['Income']= appdata['Installs'] * appdata['Price']
# remove , and + from 'Installs' to make the cells look integer to Pandas
appdata['Installs'] = pd.to_numeric(appdata['Installs'].str.replace(r'[\,\+]+', '', regex=True))
# remove , from 'Price' to make the cells look integer to Pandas
appdata['Price'] = pd.to_numeric(appdata['Price'].str.replace(r'[\,]+', '', regex=True))
# calculate "Income"
appdata['Income'] = appdata['Installs'] * appdata['Price']
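Note that errors='ignore' returns the input unchanged when a value cannot be parsed, so strings stay strings and the multiplication still fails; the option is also deprecated as of pandas 2.2. Here is a minimal sketch of the clean-then-convert approach, assuming Installs looks like '10,000+' and Price may carry a '$' prefix. Both formats are guesses, since the question does not show the raw data.

```python
import pandas as pd

# Hypothetical sample; the real column formats are not shown in the question.
appdata = pd.DataFrame({
    "Installs": pd.Series(["10,000+", "500+", "1,000,000+"], dtype=object),
    "Price": pd.Series(["$4.99", "0", "$1.50"], dtype=object),
})

for column in ["Installs", "Price"]:
    # Keep only digits and the decimal point, dropping ',', '+' and '$'.
    stripped = appdata[column].str.replace(r"[^0-9.]", "", regex=True)
    # Values that still cannot be parsed become NaN instead of staying strings.
    appdata[column] = pd.to_numeric(stripped, errors="coerce")

appdata["Income"] = appdata["Installs"] * appdata["Price"]

print(appdata.dtypes)
print(appdata)
```

Using errors='coerce' surfaces bad rows as NaN, which can then be inspected with appdata[appdata['Income'].isna()].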
