Python: ValueError: could not convert string to float: '1,086'

I've been learning how to make a prediction model by looking at this step-by-step tutorial website: https://towardsdatascience.com/step-by-step-guide-building-a-prediction-model-in-python-ac441e8b9e8b
The data I am using are the Covid-19 cases in Peru from last Jan to this Sep, and with this data, I want to predict death cases from this Oct to Dec.
However, the “New Cases” column cannot be converted to float. So, I added this:
df = df.replace(r'^\s*$', np.nan, regex=True)
But that did not work either. What should I do?
df = df[['Date', 'New Cases']]
df.head()
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.astype({"New Cases": float})
df["Date"] = pd.to_datetime(df.Date, format="%Y/%m/%d")
df.dtypes
df.index = df['Date']
plt.plot(df["New Cases"],label='2020/1/1~2021/9/30 New Cases')

There could be one of two problems here, but it would be necessary to see the file you're loading and the code used to import it to be sure. As the error message says, Python cannot convert the string '1,086' to a float, because float() does not accept commas in numbers. Outside of Python, depending on the country, commas can either be thousands separators (in other words, the number is meant to be 1086) or decimal separators (in other words, the number is meant to be 1 + 86/1000). Python always uses a period (.) for the decimal point, and usually there's no thousands separator (though technically you can use _ but it's uncommon).
Assuming you're using pandas.read_csv to load this file, there is a solution to both problems. If the comma indicates the thousands separator, you can add the argument thousands="," to the list of arguments to pandas.read_csv, and it will remove them for you. If the comma indicates decimal places, you can instead add the argument decimal=",", and it will convert all of the commas to periods before trying to convert them to numbers.
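As a minimal sketch of the two options, assuming a small semicolon-separated sample (the sample text and its values are invented for illustration and are not the asker's file):

```python
import io
import pandas as pd

# Hypothetical sample; semicolons separate the fields so the commas
# inside the numbers are unambiguous.
csv_text = "Date;New Cases\n2021/09/29;1,086\n2021/09/30;950\n"

# Treat the comma as a thousands separator: '1,086' is read as 1086.
as_thousands = pd.read_csv(io.StringIO(csv_text), sep=";", thousands=",")

# Treat the comma as the decimal mark: '1,086' is read as 1.086.
as_decimal = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")

print(as_thousands["New Cases"].tolist())
print(as_decimal["New Cases"].tolist())
```

Which of the two is right depends on how the file was produced; for daily case counts, the thousands-separator reading is the more plausible one.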

Related

How to retain 2 decimals without rounding in python/pandas?

How can I retain only 2 decimals for each value in a Pandas series? (I'm working with latitudes and longitudes.) The dtype is float64.
series = [-74.002568, -74.003085, -74.003546]
I tried using the round function, but as the name suggests, it rounds. I looked into trunc(), but this can only remove all decimals. Then I figured why not try running a for loop. I tried the following:
for i in series:
    i = "{0:.2f}".format(i)
I was able to run the code without any errors but it didn't modify the data in any way.
Expected output would be the following:
[-74.00, -74.00, -74.00]
Does anyone know how to achieve this? Thanks!
series = [-74.002568, -74.003085, -74.003546]
["%0.2f" % (x,) for x in series]
['-74.00', '-74.00', '-74.00']
This converts your data to the string/object data type, so it is only suitable for display. If you want to use the values in calculations, you can cast them back to float, but then only one decimal digit will be visible.
[float('{0:.2f}'.format(x)) for x in series]
[-74.0, -74.0, -74.0]
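Note that the format-based approaches above round rather than truncate; the sample values just happen to give the same result either way. If truncation is what is wanted, one possible sketch scales, truncates toward zero with numpy.trunc, and scales back. The fourth value below is an added illustration, not part of the question's data:

```python
import numpy as np
import pandas as pd

# First three values are from the question; 1.239 is added to show
# where truncating and rounding differ.
s = pd.Series([-74.002568, -74.003085, -74.003546, 1.239])

# Shift two decimals left of the point, drop the rest, shift back.
truncated = np.trunc(s * 100) / 100

# For comparison: rounding to two decimals.
rounded = s.round(2)

print(truncated.tolist())
print(rounded.tolist())
```

Because of binary floating point, a value that is already at two decimals can occasionally land just below its target after scaling (for example, 0.29 * 100 evaluates to 28.999999999999996), so treat this as a sketch rather than an exact decimal truncation.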
Here is one way to do it, assuming you meant a pandas.Series:
# you indicated it's a series but defined only a list
series = [-74.002568, -74.003085, -74.003546]
s=pd.Series(series)
# use regex extract to pick the number until first two decimal places
out=s.astype(str).str.extract(r"(.*\..{2})")[0]
out
0 -74.00
1 -74.00
2 -74.00
Name: 0, dtype: object
Change the display options. This shouldn't change your underlying data.
pd.options.display.float_format = "{:,.2f}".format
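To illustrate that the display option only affects how values are printed, here is a small sketch that applies the same format temporarily with pd.option_context instead of setting it globally (using the context manager is a choice made here, not part of the answer above):

```python
import pandas as pd

s = pd.Series([-74.002568, -74.003085, -74.003546])

# Apply the two-decimal display format only inside this block.
with pd.option_context("display.float_format", "{:,.2f}".format):
    shown = repr(s)

print(shown)      # values are displayed with two decimals
print(s.iloc[0])  # the stored value still has full precision
```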

Pandas df.style.bar while maintaining rounding

When I apply the bar styling to a pandas dataframe after rounding I lose the rounding formatting, and I can't figure out how to apply the rounding formatting after because df.style.bar doesn't return a dataframe but a "Styler" object.
df = pd.DataFrame({'A': [1.23456, 2.34567,3.45678], 'B':[2,3,4]})
df['A'] = df['A'].round(2)
df.style.bar(subset='A')
This returns a styled table (image not preserved), but I don't want all of those extra zeros displayed.
You will have to treat a styler as purely a rendering of the original dataframe. This means you can use a format to display the data rounded to 2 decimal places.
The basic idea behind styling is that a user will want to modify the way the data is presented but still preserve the underlying format for further manipulation.
f = {'A':'{:.2f}'} # format column A to 2 decimals
df.style.format(f).bar(subset='A')
Read this excellent tutorial for exploring what all you can do with it and how.
EDIT: Added a formatting dict to show general use and to only apply the format to a single column.

Error Converting Pandas Dataframe to float with comma

I have a DataFrame with at least 2-3 columns of numbers running from 1 to 3000, and some of the numbers contain commas. I need to convert the numbers to float or int in all the relevant columns. This is an example of my DataFrame:
data = pd.read_csv('exampleData.csv')
data.head(10)
Out[179]:
Rank Total
1 2
20 40
1,200 1,400
NaN NaN
As you can see from the example, my DataFrame consists of numbers, numbers with commas, and some NaNs. I've read several posts here about converting to float or int, but I always get error messages such as: 'str' object has no attribute 'astype'.
My approach is as follows for several columns:
cols = ['Rank', 'Total']
data[cols] = data[cols].apply(lambda x: pd.to_numeric(x.astype(str)
.str.replace(',',''), errors='coerce'))
Use the argument thousands.
pd.read_csv('exampleData.csv', thousands=',')
John's solution won't work for numbers with multiple commas, like 1,384,496.
A more scalable solution would be to just do
data = data.replace({",":""}, regex=True)
Then convert the strings to numeric.
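A minimal sketch of those two steps on an already-loaded frame, assuming the columns arrive as strings; the sample values mirror the question plus one number with two commas, and the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the question's example.
data = pd.DataFrame({
    "Rank": ["1", "20", "1,200", np.nan],
    "Total": ["2", "40", "1,384,496", np.nan],
})
cols = ["Rank", "Total"]

# Step 1: strip every comma from the string cells (NaN cells are left alone).
cleaned = data[cols].replace({",": ""}, regex=True)

# Step 2: convert each column to numbers; unparseable cells become NaN.
numeric = cleaned.apply(pd.to_numeric, errors="coerce")

print(numeric)
```

The columns come out as float64 here because of the NaN rows.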
Pandas read_csv() takes many arguments which allow you to control how fields are converted. From the documentation:
decimal : str, default ‘.’
Character to recognize as decimal point (e.g. use ‘,’ for European data).
So, here's a crazy idea: convert the numerical fields using the keyword argument, "decimal = ',' ". Then, multiply the numerical fields by 1000.

Suppress Scientific Format in a Dataframe Column

I have a column called accountnumber with values similar to 4.11889000e+11 in a pandas dataframe. I want to suppress the scientific notation and convert the values to 4118890000. I have tried the following method, but it did not work.
df = pd.read_csv('data.csv')
pd.options.display.float_format = '{:,.3f}'.format
Please recommend.
You don't need the thousand separators "," and the 3 decimals for the account numbers.
Use the following instead.
pd.options.display.float_format = '{:.0f}'.format
I assume the exponential notation for the account numbers must come from the data file. If I create a small csv with the full account numbers, pandas will interpret them as integers.
acct_num
0 4118890000
1 9876543210
df['acct_num'].dtype
Out[51]: dtype('int64')
However, if the account numbers in the csv are represented in exponential notation then pandas will read them as floats.
acct_num
0 4.118890e+11
1 9.876543e+11
df['acct_num'].dtype
Out[54]: dtype('float64')
You have 2 options. First, correct the process that creates the csv so the account numbers are written out correctly. The second is to change the data type of the acct_num column to integer.
df['acct_num'] = df['acct_num'].astype('int64')
df
Out[66]:
acct_num
0 411889000000
1 987654321000
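The second option can be sketched end to end as follows, assuming an in-memory CSV with made-up account numbers in exponential notation. The whole-number check before the cast is an added precaution, not part of the answer above:

```python
import io
import pandas as pd

# Hypothetical CSV in which the account numbers were written in
# exponential notation.
csv_text = "acct_num\n4.11889e+11\n9.87654321e+11\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df["acct_num"].dtype)  # pandas reads these values as floats

# Cast only if every value is a whole number, so nothing is silently cut off.
if (df["acct_num"] % 1 == 0).all():
    df["acct_num"] = df["acct_num"].astype("int64")

print(df)
```

Keep in mind that float64 represents integers exactly only up to 2**53, and digits dropped when the file was written in exponential notation cannot be recovered by the cast, so fixing the export remains the safer option for long identifiers.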

Avoiding Excel's Scientific Notation Rounding when Parsing with Pandas

I have an Excel file produced automatically with occasional very large numbers like 135061808695. In the Excel file, when you click on the cell it shows the full number 135061808695; however, with the automatic "General" format the number appears visually as 1.35063E+11.
When I use ExcelFile in Pandas, it pulls the value in scientific notation, 1.350618e+11, instead of the full 135061808695. Is there any way to get Pandas to pull the full value without going in and messing with the Excel file?
Pandas might very well be pulling the full value but not showing it in its default output:
df = pd.DataFrame({ 'x':[135061808695.] })
df.x
0 1.350618e+11
Name: x, dtype: float64
Standard python format:
print("%15.0f" % df.x.iloc[0])
135061808695
Or in pandas, convert to an integer type to get integer formatting:
df.x.astype(np.int64)
0 135061808695
Name: x, dtype: int64
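If the column also contains blank cells, the plain np.int64 cast fails because NaN has no integer representation. A possible variation, assuming an invented column with one missing value, uses pandas' nullable "Int64" dtype or formats the values as text:

```python
import numpy as np
import pandas as pd

# Hypothetical column: one large number and one blank cell.
df = pd.DataFrame({"x": [135061808695.0, np.nan]})

# Text form of each value without scientific notation; blanks stay empty.
as_text = df["x"].map(lambda v: "" if pd.isna(v) else f"{v:.0f}")

# Nullable integer dtype keeps the missing value as <NA>.
as_nullable_int = df["x"].astype("Int64")

print(as_text.tolist())
print(as_nullable_int)
```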
