I have a DataFrame with at least 2-3 columns of numbers running from 1 to 3000, and the numbers contain commas. I need to convert the numbers to float or int in all the relevant columns. This is an example of my DataFrame:
data = pd.read_csv('exampleData.csv')
data.head(10)
Out[179]:
    Rank  Total
0      1      2
1     20     40
2  1,200  1,400
3    NaN    NaN
As you can see from the example, my DataFrame consists of numbers, numbers with commas, and some NaNs. I've read several posts here about converting to float or int, but I always get error messages such as: 'str' object has no attribute 'astype'.
My approach is as follows for several columns:
cols = ['Rank', 'Total']
# Cast to str so the .str accessor works on mixed types, strip the commas,
# then coerce anything unparseable (e.g. NaN) back to NaN
data[cols] = data[cols].apply(lambda x: pd.to_numeric(x.astype(str)
                              .str.replace(',', ''), errors='coerce'))
Use the thousands argument:
pd.read_csv('exampleData.csv', thousands=',')
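A quick check that the columns then parse as numbers (float64 here, because the NaNs force a float dtype):
import pandas as pd

data = pd.read_csv('exampleData.csv', thousands=',')
print(data.dtypes)  # Rank and Total should both show float64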
John's solution won't work for numbers with multiple commas, like 1,384,496.
A more scalable solution would be to just do
data = data.replace({",":""}, regex=True)
Then convert the strings to numeric.
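For example, a sketch (assuming every affected column should end up numeric; select the relevant columns first if not):
import pandas as pd

data = data.replace({",": ""}, regex=True)
# Column-wise conversion; errors='coerce' turns anything unparseable into NaN
data = data.apply(pd.to_numeric, errors='coerce')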
Pandas read_csv() takes many arguments which allow you to control how fields are converted. From the documentation:
decimal : str, default ‘.’
Character to recognize as decimal point (e.g. use ‘,’ for European data).
So here's a crazy idea: convert the numerical fields using the keyword argument decimal=','. Then multiply the numerical fields by 1000.
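A sketch of that idea, with the caveat that it only behaves if every value actually contains one comma; a plain 1 would be wrongly scaled to 1000:
import pandas as pd

data = pd.read_csv('exampleData.csv', decimal=',')  # 1,200 is parsed as 1.2
cols = ['Rank', 'Total']
data[cols] = data[cols] * 1000                      # scale back: 1.2 -> 1200.0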
I have a pandas dataset with columns whose dtype is object. The columns, however, contain numerical float values along with '?', and I'm trying to convert them to float. I want to remove the '?' from the entire column, making those values NaN (not 0), and then convert the column to float64.
The output of value_counts() for the Voltage column looks like this:
? 3771
240.67 363
240.48 356
240.74 356
240.62 356
...
227.61 1
227.01 1
226.36 1
227.28 1
227.02 1
Name: Voltage, Length: 2276, dtype: int64
What is the best way to do that when the entire dataset has "?" values mixed in with the numbers and I want to convert them all at once?
I tried something like this, but it's not working. I want to do this operation for all the columns. Thanks.
df['Voltage'] = df['Voltage'].apply(lambda x: float(x.split()[0].replace('?', '')))
One more question: how can I count the "?" values in all the columns? I tried something like this. Thanks.
counts = []
for i in df.columns:
    if '?' not in df[i].values:
        continue
    counts.append(df[i].value_counts()['?'])
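For the counting, a vectorized one-liner avoids the loop entirely:
# Element-wise comparison, then sum the True values per column
question_counts = (df == '?').sum()
print(question_counts)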
So, from your value_counts it is clear that you just have some values that are floats stored in strings, and some values that contain '?' (apparently, that simply ARE '?').
So, the one thing NOT to do is use apply or applymap.
Those are just one step below for loops and iterrows in the hierarchy of what not to do.
The only case where you should use apply is when you would otherwise have to iterate over rows with a for loop. And such cases almost never happen (in my real life, I've used apply only once, and that was when I was a beginner; I am pretty sure that if I were to review that code now, I would find another way).
In your case
df.Voltage = df.Voltage.where(~df.Voltage.str.contains(r'\?')).astype(float)
should do what you want
df.Voltage.str.contains(r'\?') is a True/False series saying whether each row contains a '?'. So ~df.Voltage.str.contains(r'\?') is the opposite: True if the row does not contain a '?'. df.Voltage.where(~df.Voltage.str.contains(r'\?')) is then a series where values matching the condition are left as-is, and the others are replaced by the second argument, or, if there is no second argument (which is our case), by NaN. So, exactly what you want. Adding .astype(float) converts everything to float, since that should now be possible: all rows contain either strings representing a float, such as 230.18, or NaN, so everything is convertible to float.
An alternative, closer to what you were trying, that replaces the '?' first, in place, would be
df.loc[df.Voltage=='?', 'Voltage']=None
# And then, df.Voltage.astype(float) converts to float, with NaN where you put None
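If the '?' placeholders can appear in several columns, the same replace-first idea scales to the whole frame; a minimal sketch, assuming every column should end up numeric:
import numpy as np

df = df.replace('?', np.nan)  # every '?' becomes NaN, in all columns
df = df.astype(float)         # now the conversion succeeds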
I've been learning how to make a prediction model by looking at this step-by-step tutorial website: https://towardsdatascience.com/step-by-step-guide-building-a-prediction-model-in-python-ac441e8b9e8b
The data I was using are the Covid-19 cases in Peru from last Jan to this Sep, and with this data, I want to predict death cases from this Oct to Dec.
However, the “New Cases” column cannot be converted to float. So I added this:
df = df.replace(r'^\s*$', np.nan, regex=True)
But that did not work either... What should I do?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = df[['Date', 'New Cases']]
df.head()
df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.astype({"New Cases": float})
df["Date"] = pd.to_datetime(df.Date, format="%Y/%m/%d")
df.dtypes
df.index = df['Date']
plt.plot(df["New Cases"], label='2020/1/1~2021/9/30 New Cases')
There could be one of two problems here, but it would be necessary to see the file you're loading and the code used to import it to be sure. As the error message says, Python cannot convert the string 1,086 to a float, as commas should never appear in numbers in Python. Outside of Python, depending on the country, commas can either be thousand separators (in other words, the number is meant to be 1086) or decimal points (in other words, the number is meant to be 1 + 86/1000). Python always uses a period . for the decimal point, and usually there's no thousands separator (though technically you can use _ but it's uncommon).
Assuming you're using pandas.read_csv to load this file, there is a solution to both problems. If the comma indicates the thousands separator, you can add the argument thousands="," to the list of arguments to pandas.read_csv, and it will remove them for you. If the comma indicates decimal places, you can instead add the argument decimal=",", and it will convert all of the commas to periods before trying to convert them to numbers.
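In code, with a hypothetical file name:
import pandas as pd

# If the commas are thousands separators: '1,086' -> 1086
df = pd.read_csv('covid_peru.csv', thousands=',')

# If the commas are decimal points: '1,086' -> 1.086
df = pd.read_csv('covid_peru.csv', decimal=',')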
Probably a trivial question: I have a pandas DataFrame and a column with mixed dtypes. I would like to run various string methods on the column items, e.g. str.upper(), str.lower(), str.capitalize(), etc. It works well for string values in the column, but for numeric values (int/float) I get NaN.
Example with str.upper():
output_table.iloc[:,0] = input_table.iloc[:,0].str.upper()
Justtext -> JUSTTEXT
Textwith500number -> TEXTWITH500NUMBER
500 -> nan
-11.6 -> nan
As the dataframe can become quite large (> 1m rows) I would like to have a fast routine to convert the input column by means of the respective string methods. How can I keep the numeric values untouched as they were (not returning nan) and only convert the string values? Something along the lines of pandas errors='ignore'.
Any help is much appreciated. Thank you!
You can use a list comprehension:
df = pd.DataFrame({'desc': ['apple', "Textwidh500number", 500, -11.6]})
df["desc"] = [i.upper() if isinstance(i, str) else i for i in df["desc"]]
print (df)
desc
0 APPLE
1 TEXTWIDH500NUMBER
2 500
3 -11.6
I just did something similar with pd.to_numeric, passing errors='coerce' and a null check. Values that fail the numeric conversion are the strings, so select those and uppercase them. Try this:
mask = pd.to_numeric(input_table['Col_Name'], errors='coerce').isnull()
input_table.loc[mask, 'Col_Name'] = input_table.loc[mask, 'Col_Name'].str.upper()
I have a column called accountnumber with values similar to 4.11889000e+11 in a pandas DataFrame. I want to suppress the scientific notation and convert the values to 411889000000. I have tried the following method and it did not work.
df = pd.read_csv('data.csv')
pd.options.display.float_format = '{:,.3f}'.format
Please recommend.
You don't need the thousand separators "," and the 3 decimals for the account numbers.
Use the following instead.
pd.options.display.float_format = '{:.0f}'.format
I assume the exponential notation for the account numbers must come from the data file. If I create a small csv with the full account numbers, pandas will interpret them as integers.
acct_num
0 4118890000
1 9876543210
df['acct_num'].dtype
Out[51]: dtype('int64')
However, if the account numbers in the csv are represented in exponential notation then pandas will read them as floats.
acct_num
0 4.118890e+11
1 9.876543e+11
df['acct_num'].dtype
Out[54]: dtype('float64')
You have 2 options. First, correct the process that creates the csv so the account numbers are written out correctly. The second is to change the data type of the acct_num column to integer.
df['acct_num'] = df['acct_num'].astype('int64')
df
Out[66]:
acct_num
0 411889000000
1 987654321000
I have a dataframe with a column with floats and NaNs.
Those are phone numbers, and they look strange as floats (they get a ".0" at the end, so a phone number looks like 5551981180099.0). I tried to use df['phone'].astype(int) to fix that; however, it chokes on the NaNs and I get a "can't convert NAs to int" error.
So I went brute force with this:
for i in range(len(df.index)):
    if not pd.isnull(df['phone'][i]):
        df['phone'][i] = int(df['phone'][i])
But when I print(type(df['phone'][i])) it tells me that it is still a class 'numpy.float64'.
I tried everything to turn that into something else to make the phone numbers look nice (turn into string and take the two last characters out, apply astype(str), astype(int), etc) but nothing seems to work.
Any suggestions?
If you have NaN values in an int column, by design all values are converted to floats.
You can replace NaN with some int, and then it is possible to convert the column to int:
df['phone'] = df['phone'].fillna(0).astype(int)
Or remove NaN rows first:
df = df.dropna(subset=['phone'])
df['phone'] = df['phone'].astype(int)
Or convert all values to str and then remove the trailing .0, but then you get the string 'nan' (not a missing value):
df['phone'] = df['phone'].astype(str).str.replace(r'\.0$', '', regex=True)
Last, if you need to remove the last 2 chars, use indexing with str:
df['phone'] = df['phone'].astype(str).str[:-2]
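A further option on recent pandas versions (0.24+) is the nullable integer dtype 'Int64' (capital I), which stores whole numbers while keeping missing values as <NA>, so the ".0" disappears without dropping or filling anything; a minimal sketch:
df['phone'] = df['phone'].astype('Int64')  # 5551981180099.0 -> 5551981180099, NaN -> <NA>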