When I import an Excel file, some numbers in a column are float and some are not. How can I convert all to float? The space in 3 000,00 is causing me problems.
df['column']:
column
0 3 000,00
1 156.00
2 0
I am trying:
df['column'] = df['column'].str.replace(' ','')
but it's not working. I would then call .astype(float), but I cannot get that far.
Any solutions? Row 1 is already a float, but row 0 is a string.
Just cast them all as a string first:
df['column'] = [float(str(val).replace(' ','').replace(',','.')) for val in df['column'].values]
Example:
>>> df = pd.DataFrame({'column':['3 000,00', 156.00, 0]})
>>> df['column2'] = [float(str(val).replace(' ','').replace(',','.')) for val in df['column'].values]
>>> df
column column2
0 3 000,00 3000.0
1 156 156.0
2 0 0.0
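The same idea can also be written with pandas string methods instead of a list comprehension. The sketch below assumes that a space is the thousands separator and a comma is the decimal mark, as in the sample data. The names cleaned and column2 are only illustrative. Casting to str first matters because .str methods return NaN for values that are not strings, which is likely why the original .str.replace attempt did not work on a mixed column.

```python
import pandas as pd

df = pd.DataFrame({'column': ['3 000,00', 156.00, 0]})

# Cast everything to str so the .str accessor handles every row,
# then drop the thousands separator and normalise the decimal mark.
cleaned = (
    df['column']
    .astype(str)
    .str.replace(' ', '', regex=False)
    .str.replace(',', '.', regex=False)
)

# pd.to_numeric raises on anything that still is not a number,
# which makes leftover dirty values easy to spot.
df['column2'] = pd.to_numeric(cleaned)
print(df)
```

If some cells may contain text that cannot be parsed at all, passing errors='coerce' to pd.to_numeric turns them into NaN instead of raising.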
import re
df['column'] = df['column'].apply(lambda x: re.sub("[^0-9.]", "", str(x).replace(',','.'))).astype(float)
I have a large pandas dataset with a messy string column which contains for example:
72.1
61
25.73.20
33.12
I'd like to fill the gaps in order to match a pattern like XX.XX.XX (X are only numbers):
72.10.00
61.00.00
25.73.20
33.12.00
thank you!
How about defining base_str = '00.00.00' and then filling each row's string with the remaining characters of base_str:
base_str = '00.00.00'
df = pd.DataFrame({'ms_str':['72.1','61','25.73.20','33.12']})
print(df)
df['ms_str'] = df['ms_str'].apply(lambda x: x+base_str[len(x):])
print(df)
Output:
ms_str
0 72.1
1 61
2 25.73.20
3 33.12
ms_str
0 72.10.00
1 61.00.00
2 25.73.20
3 33.12.00
Here is a vectorized solution, that works for this particular pattern. First fill with zeros on the right, then replace every third character by a dot:
df['col'].str.ljust(8, fillchar='0').str.replace(r'(..).', r'\1.', regex=True)
Output:
0 72.10.00
1 61.00.00
2 25.73.20
3 33.12.00
Name: col, dtype: object
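Both answers above rely on every group already being in the right position in the string. A rough alternative, assuming the groups are always separated by dots and should be right-padded with zeros (as in the expected output), is to pad each group on its own. The helper pad_code and its parameters are hypothetical names, not part of pandas.

```python
import pandas as pd

def pad_code(value, groups=3, width=2):
    """Pad a dotted code such as '72.1' to the form XX.XX.XX."""
    parts = str(value).split('.')
    # Add empty groups until there are `groups` of them.
    parts += [''] * (groups - len(parts))
    # Right-pad every group with zeros up to `width` characters.
    return '.'.join(part.ljust(width, '0') for part in parts)

df = pd.DataFrame({'ms_str': ['72.1', '61', '25.73.20', '33.12']})
df['padded'] = df['ms_str'].map(pad_code)
print(df)
```

This is not vectorized, so on a large column the str.ljust version is probably faster; the per-group padding only helps if some values have a short first or middle group.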
I would like to remove the last characters in the column and convert the column into float. The column type is object.
13.3\T
9.4\J
24.09006465036784\C
24.4140625\B
35.73069852941176\M
I tried df[column] = df[column].str[:5], but without success.
df['column'] = df['column'].str[:4]
df['column'].astype(float)
It is not dropping the last characters, and I get an error: unable to convert the string to float.
You can use Series.str.extract to get the floats or integers, then cast with Series.astype, and finally round with Series.round:
df['column'] = (df['column'].str.extract(r'(\d+\.\d+|\d+)', expand=False)
.astype(float)
.round(2))
print (df)
column
0 13.30
1 9.40
2 24.09
3 24.41
4 35.73
If always only floats:
df['column'] = df['column'].str.extract(r'(\d+\.\d+)', expand=False).astype(float).round(2)
print (df)
column
0 13.30
1 9.40
2 24.09
3 24.41
4 35.73
EDIT:
def my_round(x):
    x = x.str.extract(r'(\d+\.\d+)', expand=False)
    x = x.astype(float).round(2)
    return x
df.iloc[:, 61:64] = df.iloc[:, 61:64].astype(str).apply(my_round)
Another idea is to convert only the object (non-numeric) columns:
cols = df.iloc[:, 61:64].select_dtypes(object).columns
df[cols] = df[cols].apply(my_round)
Use the following to drop the last 2 characters and convert to float:
df[column] = df[column].str[:-2].astype(float)
You can also use the following approach:
df[column] = pd.to_numeric(df[column].str[:-2])
You can then use the following to round your data to 2 decimal places:
df = df.round(2)
print(df)
Output:
0 13.30
1 9.40
2 24.09
3 24.41
4 35.73
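For reference, the fixed-width slice in the question fails because the numbers have different lengths: with .str[:4], a short value such as 9.4\J still keeps its backslash, and the result of .astype(float) was never assigned back. A minimal sketch that does not depend on the length at all, assuming the suffix always starts at the first backslash, is to split on that backslash. The column name value is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'column': [r'13.3\T', r'9.4\J', r'24.09006465036784\C',
                              r'24.4140625\B', r'35.73069852941176\M']})

# Keep only the text before the first backslash.
numeric_part = df['column'].str.split('\\', n=1).str[0]

# errors='coerce' turns anything unparseable into NaN instead of raising.
df['value'] = pd.to_numeric(numeric_part, errors='coerce').round(2)
print(df)
```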
I have a column in a df that looks like this:
pd.DataFrame(["[u'one_element']", "[u'two_elememts', u'two_elements']", "[u'three_elements', u'three_elements', u'three_elements']"])
0
0 [u'one_element']
1 [u'two_elememts', u'two_elements']
2 [u'three_elements', u'three_elements', u'three_elements']
Those elements are strings:
type(df[0].iloc[2]) == str
The end result should look like:
0
0 one_element
1 two_elememts, two_elements
2 three_elements, three_elements, three_elements
I tried with:
df[column] = df[column].map(lambda x: x.lstrip('[u').rstrip(']').replace("u'","").replace("'",""))
But obviously this is slow when you have many rows.
Is there a better way to do it? The df has many columns of different types: strings, integers, floats.
Thanks!
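You can use strip and a regex replace, i.e. (recent pandas versions need regex=True for the pattern to be treated as a regular expression):
df[0] = df[0].str.strip("[]").str.replace("u'|'", '', regex=True)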
0 one_element
1 two_elememts, two_elements
2 three_elements, three_elements, three_elements
Name: 0, dtype: object
Using the ast module.
import pandas as pd
import ast
df = pd.DataFrame(["[u'one_element']", "[u'two_elememts', u'two_elements']", "[u'three_elements', u'three_elements', u'three_elements']"])
print(df[0].apply(lambda x: ", ".join(ast.literal_eval(x))))
Output:
0 one_element
1 two_elememts, two_elements
2 three_elements, three_elements, three_elements
Name: 0, dtype: object
You don't need map, you can use the str attribute for pandas Series:
(df[0].str.lstrip('[u')
.str.rstrip(']')
.str.replace("u'","")
.str.replace("'",""))
achieves the same result but does not use map
0 one_element
1 two_elememts, two_elements
2 three_elements, three_elements, three_elements
Name: 0, dtype: object
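Another way to stay within the .str accessor is to pull out whatever sits between the quotes and join it again. This is only a sketch and assumes that no element contains a single quote itself; the column name joined is made up for the example.

```python
import pandas as pd

df = pd.DataFrame(["[u'one_element']",
                   "[u'two_elememts', u'two_elements']",
                   "[u'three_elements', u'three_elements', u'three_elements']"])

# findall returns, per row, the list of texts captured between quotes;
# str.join then glues each list back together with ', '.
df['joined'] = df[0].str.findall(r"'([^']*)'").str.join(', ')
print(df['joined'])
```

Compared with ast.literal_eval, this does not parse the string as Python, so it will not validate the input, but it also avoids evaluating every row.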
I'm trying to assign a value to a cell, yet Pandas rounds it to zero. (I'm using Python 3.6)
in: df['column1']['row1'] = 1 / 331616
in: print(df['column1']['row1'])
out: 0
But if I try to assign this value to a standard Python dictionary key, it works fine.
in: {'column1': {'row1': 1/331616}}
out: {'column1': {'row1': 3.0155360416867704e-06}}
I've already done this, but it didn't help:
pd.set_option('precision',50)
pd.set_option('chop_threshold', .00000000005)
Please, help.
pandas appears to be presuming that your datatype is an integer (int).
There are several ways to address this, either by setting the datatype to a float when the DataFrame is constructed OR by changing (or casting) the datatype (also referred to as a dtype) to a float on the fly.
setting the datatype (dtype) during construction:
>>> import pandas as pd
In making this simple DataFrame, we provide a single example value (1) and the columns for the DataFrame are defined as containing floats during creation
>>> df = pd.DataFrame([[1]], columns=['column1'], index=['row1'], dtype=float)
>>> df['column1']['row1'] = 1 / 331616
>>> df
column1
row1 0.000003
converting the datatype on the fly:
>>> df = pd.DataFrame([[1]], columns=['column1'], index=['row1'], dtype=int)
>>> df['column1'] = df['column1'].astype(float)
>>> df['column1']['row1'] = 1 / 331616
>>> df
column1
row1 0.000003
Your column's datatype is most likely set to int. You'll need to convert it either to float or to the mixed-type object dtype before assigning the value:
df = pd.DataFrame([1,2,3,4,5,6])
df.dtypes
# 0 int64
# dtype: object
df[0][4] = 7/125
df
# 0
# 0 1
# 1 2
# 2 3
# 3 4
# 4 0
# 5 6
df[0] = df[0].astype('O')
df[0][4] = 7 / 22
df
# 0
# 0 1
# 1 2
# 2 3
# 3 4
# 4 0.318182
# 5 6
df.dtypes
# 0 object
# dtype: object
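The examples above assign with chained indexing (df['column1']['row1'] = ...), which the pandas documentation advises against because the assignment may land on a temporary copy. A minimal sketch of the same fix using a single .loc call, reusing the row and column labels from the question:

```python
import pandas as pd

# The column starts out as int64 because it only holds the integer 1.
df = pd.DataFrame([[1]], columns=['column1'], index=['row1'])

# Convert the column to float first, then assign in one indexing step.
df['column1'] = df['column1'].astype(float)
df.loc['row1', 'column1'] = 1 / 331616

print(df.dtypes)
print(df.loc['row1', 'column1'])
```

Note that printing the whole DataFrame may still show a rounded value such as 0.000003; that is only the display precision, not the stored value.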
Let's say I have the following example DataFrame
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
I would like to convert the column A from string to integer. In the case of '<2', I'd like to simply take off the '<' sign and put 1 (the closest integer less than 2) in the second row. What's the most efficient way to do that? This is just an example. The actual data that I'm working on has hundreds of thousands of rows.
Thanks for your help in advance.
You could use Series.apply:
import pandas as pd
df = pd.DataFrame({'A':['1', '<2', '3']})
df['A'] = df['A'].apply(lambda x: int(x[1:])-1 if x.startswith('<') else int(x))
print(df.dtypes)
# A int64
# dtype: object
yields
print(df)
A
0 1
1 1
2 3
[3 rows x 1 columns]
You can use applymap on the DataFrame to remove the "<" character if it appears in the string (note that this leaves the values as strings and gives 2 rather than 1 for '<2'):
df.applymap(lambda x: x.replace('<',''))
Here is the output:
A
0 1
1 2
2 3
Here are two other ways of doing this which may be helpful going forward!
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
df.A.str.strip('<').astype(int)
Outputs
Out[1]:
0 1
1 2
2 3
And this way would be helpful if you were trying to remove a character in the middle of your number (e.g. if you had a comma or something).
df = DataFrame({'A':['1', '1,002', '3']})
df.A.str.replace(',', '').astype(int)
Outputs
Out[11]:
0 1
1 1002
2 3
Name: A, dtype: int64
>>> import re
>>> df.applymap(lambda x: int(re.sub(r'[^0-9.]', '', x)))
A
0 1
1 2
2 3
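Several of the answers above only strip the '<' sign, which gives 2 for '<2' rather than the 1 the question asks for. A vectorized sketch that also subtracts one, assuming '<' only ever appears as the first character and the rest is a plain integer (is_less_than and numbers are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '<2', '3']})

# Remember which rows carried the '<' sign before removing it.
is_less_than = df['A'].str.startswith('<')
numbers = df['A'].str.lstrip('<').astype(int)

# Subtract 1 only where the original value was written as '<n'.
df['A'] = numbers - is_less_than.astype(int)
print(df)
```

Because nothing here calls a Python function per row, this should scale better than apply on hundreds of thousands of rows, though that is worth timing on the real data.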