Pandas: convert column with empty strings to float

In my application, I receive a pandas DataFrame (say, block) that has a column called est. This column can contain a mix of strings and floats. I need to convert all values in the column to floats and have the column type be float64. I do so using the following code:
block["est"] = block["est"].convert_objects(convert_numeric=True)
block["est"] = block["est"].astype('float')
This works for most cases. However, in one case, est contains all empty strings. In this case, the first statement executes without error, but the empty strings in the column remain empty strings. The second statement then causes an error: ValueError: could not convert string to float:.
How can I modify my code to handle a column with all empty strings?
Edit: I know I can just do block["est"].replace("", np.nan), but I was wondering if there's some way to do it with just convert_objects or astype that I'm missing.
Clarification: For project-specific reasons, I need to use pandas 0.16.2.
Here's an interaction with some sample data that demonstrates the failure:
>>> block = pd.DataFrame({"eps":["", ""]})
>>> block = block.convert_objects(convert_numeric=True)
>>> block["eps"]
0
1
Name: eps, dtype: object
>>> block["eps"].astype('float')
...
ValueError: could not convert string to float:
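On pandas 0.16.2, where pd.to_numeric is not yet available, a minimal workaround sketch is to combine the replace from the edit above with astype (this assumes the column only ever holds numeric strings or empty strings):
import numpy as np
import pandas as pd

block = pd.DataFrame({"eps": ["", ""]})
# Replace empty strings with NaN first; the cast to float then succeeds
block["eps"] = block["eps"].replace("", np.nan).astype(float)
print(block["eps"].dtype)  # float64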

It's easier to do it using pandas.to_numeric:
http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.to_numeric.html
import pandas as pd
df = pd.DataFrame({'eps': ['1', 1.6, '1.6', 'a', '', 'a1']})
df['eps'] = pd.to_numeric(df['eps'], errors='coerce')
errors='coerce' will convert any value that cannot be parsed to NaN:
df['eps'].astype('float')
0 1.0
1 1.6
2 1.6
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
Then you can apply other functions without getting errors:
df['eps'].round()
0 1.0
1 2.0
2 2.0
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
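Applied to the all-empty-strings case from the original question, to_numeric handles it directly (a quick check, assuming pandas >= 0.17, where to_numeric was introduced):
import pandas as pd

block = pd.DataFrame({"eps": ["", ""]})
block["eps"] = pd.to_numeric(block["eps"], errors="coerce")
print(block["eps"])
# 0   NaN
# 1   NaN
# Name: eps, dtype: float64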

import numpy as np
import pandas as pd

def convert_float(val):
    try:
        return float(val)
    except ValueError:
        return np.nan

df = pd.DataFrame({'eps': ['1', 1.6, '1.6', 'a', '', 'a1']})
df.eps.apply(convert_float)
0 1.0
1 1.6
2 1.6
3 NaN
4 NaN
5 NaN
Name: eps, dtype: float64
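A slightly more defensive variant of the helper (my own illustration, not part of the original answer) also catches TypeError, so values like None don't raise:
import numpy as np

def convert_float_safe(val):
    # float('a') raises ValueError; float(None) raises TypeError
    try:
        return float(val)
    except (ValueError, TypeError):
        return np.nan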

Related

Replace str values in series with np.nan

I have the following series
s = pd.Series(['hey', 'hey', 2, 2.14], index=[1, 2, 3, 4])
I basically want to mask the series: check whether each value is a str and, if so, replace it with np.nan. How could I achieve that?
Wanted result:
s = pd.Series([np.nan, np.nan, 2, 2.14], index=[1, 2, 3, 4])
I tried this
s.mask(isinstance(s,str))
But I got the following: ValueError: Array conditional must be same shape as self. I am kind of a newbie when it comes to these methods and would appreciate an explanation of why this happens.
You can use:
out = s.mask(s.apply(type).eq(str))
print(out)
1 NaN
2 NaN
3 2
4 2.14
dtype: object
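As for the why: isinstance(s, str) is evaluated once, on the Series object itself, and returns a single False rather than an element-wise boolean array, which is what mask expects. A quick illustration:
import pandas as pd

s = pd.Series(['hey', 'hey', 2, 2.14], index=[1, 2, 3, 4])
print(isinstance(s, str))               # False -- one scalar for the whole Series
print(s.apply(type).eq(str).tolist())   # [True, True, False, False] -- element-wise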
If you are set on using mask, you could try:
s = pd.Series(['hey', 'hey', 2, 2.14], index=[1, 2, 3, 4])
print(s.mask(s.apply(isinstance, args=[str])))
1 NaN
2 NaN
3 2
4 2.14
dtype: object
But as you can see, many roads lead to Rome...
Use to_numeric with the errors="coerce" parameter.
s = pd.to_numeric(s, errors = 'coerce')
Out[73]:
1 NaN
2 NaN
3 2.00
4 2.14
dtype: float64
IIUC, you need to create the pd.Series like below, then use isinstance like below.
import numpy as np
import pandas as pd
s = pd.Series(['hey','hey',2,2.14],index=[1,2,3,4])
s = s.apply(lambda x: np.nan if isinstance(x, str) else x)
print(s)
1 NaN
2 NaN
3 2.00
4 2.14
dtype: float64
You could use:
s[s.str.match(r'\D+').fillna(False)] = np.nan
But if you are looking to catch all str objects, not just non-numeric-looking representations like "1.23", then refer to #Ynjxsjmh's answer.
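To see that difference concretely (a small illustrative comparison, not from the answers above):
import numpy as np
import pandas as pd

s = pd.Series(['hey', '1.23', 2, 2.14])
s[s.str.match(r'\D+').fillna(False)] = np.nan
print(s.tolist())   # [nan, '1.23', 2, 2.14] -- the numeric-looking string survives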

data type conversion in DataFrame

I have a CSV file which has a column named population. In this CSV file the values of this column are shown as decimals (floats), e.g. 12345.00. I have converted the whole file to TTL (RDF) format, and the population literal is shown the same way, i.e. 12345.0, in the ttl file. I want it to show as an integer (whole number), i.e. 12345. Do I need to convert the data type of this column, or what should I do? Also, I would ask how I can check the data type of a column of a DataFrame in Python?
(A beginner in Python) - Thanks
You can try changing the column data type first.
For example:
df = pd.DataFrame([1.0, 2.0, 3.0, 4.0], columns=['A'])
df['A']
0    1.0
1    2.0
2    3.0
3    4.0
Name: A, dtype: float64
Now:
df['A'] = df['A'].astype(int)
df['A']
0    1
1    2
2    3
3    4
Name: A, dtype: int32
If you have some np.nan values in the column, you can try
df = df.astype('Int64')
This will get you:
      A
0     1
1     2
2     3
3     4
4  <NA>
where <NA> is the Int64 equivalent of np.nan. It is important to know that np.nan is a float, and that <NA> is not widely used yet and is not memory- and performance-optimized; you can read more about it here:
https://pandas.pydata.org/docs/user_guide/missing_data.html#integer-dtypes-and-missing-data
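A minimal sketch of the nullable-integer path (assuming a reasonably recent pandas; the 'Int64' dtype appeared in 0.24, and float-to-Int64 casting settled in later releases):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, np.nan]})
df['A'] = df['A'].astype('Int64')   # NaN becomes <NA>, integers stay whole
print(df['A'].dtype)                # Int64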
csv_data['theColName'] = csv_data['theColName'].fillna(0)
csv_data['theColName'] = csv_data['theColName'].astype('int64')
worked, and the column is successfully converted to int64. Thanks, everybody!
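The question also asked how to check a column's data type; a quick way (column name here matches the question's population column):
df.dtypes                 # dtypes of every column
df['population'].dtype    # dtype of a single column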

Parse Out Last Sequence Of Numbers From Pandas Column to create new column

I have a dataframe with codes like the following and would like to create a new column that has the last sequence of numbers parsed out.
array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
So the new column would contain the following:
array([2,2,2,12,None])
Sample data
df:
codes
0 K9ADXXL2
1 K9ADXL2
2 K9ADXS2
3 IVERMAXSCM12
4 HPDMUDOGDRYL
Use str.extract to get the digits at the end of the string, then pass the result to pd.to_numeric:
pd.to_numeric(df.codes.str.extract(r'(\d+$)')[0], errors='coerce')
Out[11]:
0 2.0
1 2.0
2 2.0
3 12.0
4 NaN
Name: 0, dtype: float64
If you want to get the value as a string of digits, you may use str.extract or str.findall as follows:
df.codes.str.findall(r'\d+$').str[0]
or
df.codes.str.extract(r'(\d+$)')[0]
Out[20]:
0 2
1 2
2 2
3 12
4 NaN
Name: codes, dtype: object
import re
import pandas as pd

def get_trailing_digits(s):
    match = re.search("[0-9]+$", s)
    return match.group(0) if match else None

original_column = pd.array(['K9ADXXL2', 'K9ADXL2', 'K9ADXS2', 'IVERMAXSCM12', 'HPDMUDOGDRYL'])
new_column = pd.array([get_trailing_digits(s) for s in original_column])
# ['2', '2', '2', '12', None]
[0-9] means any digit
+ means one or more times
$ means only at the end of the string
You can use the apply function of a Series/DataFrame with get_trailing_digits as the function, e.g.:
my_df["new column"] = my_df["old column"].apply(get_trailing_digits)

removing numbers from a column in python pandas

I want to remove all numbers within the entries of a certain column in a Python pandas DataFrame. Unfortunately, commands like .join() and .find() don't work across the column (when I define a function to iterate over the entries, I get a message that float values do not have .find and .join attributes). Are there any commands that take care of this in pandas?
def remove(data):
    result = ''
    for i in data:
        if not i.isdigit():
            result = result + i
    return result

myfile['column_name'] = myfile['column_name'].apply(remove)
You can remove all numbers like this:
import pandas as pd
df = pd.DataFrame({'x': ['1', '2', 'C', '4']})
df[df["x"].str.isdigit()] = "NaN"
Impossible to know for sure without a data sample, but your code implies data contains strings since you call isdigit on the elements.
Assuming the above, there are many ways to do what you want. One of them is conditional list comprehension:
import pandas as pd
s = pd.DataFrame({'x': ['p', '2', '3', 'd', 'f', '0']})
out = [x if x.isdigit() else '' for x in s['x']]
# Output: ['', '2', '3', '', '', '0']
Or look at using pd.to_numeric with errors='coerce' to cast the column as numeric and eliminate non-numeric values:
Using #Raidex's setup:
s = pd.DataFrame({'x':['p','2','3','d','f','0']})
pd.to_numeric(s['x'], errors='coerce')
Output:
0 NaN
1 2.0
2 3.0
3 NaN
4 NaN
5 0.0
Name: x, dtype: float64
EDIT to handle either situation.
s['x'].where(~s['x'].str.isdigit())
Output:
0 p
1 NaN
2 NaN
3 d
4 f
5 NaN
Name: x, dtype: object
OR
s['x'].where(s['x'].str.isdigit())
Output:
0 NaN
1 2
2 3
3 NaN
4 NaN
5 0
Name: x, dtype: object
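If the goal is literally to strip the digit characters out of mixed strings (e.g. 'a3b' -> 'ab'), a hedged alternative the answers above don't cover is str.replace with a regex (sample data is illustrative):
import pandas as pd

s = pd.DataFrame({'x': ['p1', '2', 'a3b', 'd', 'f', '0']})
s['x'] = s['x'].str.replace(r'\d+', '', regex=True)   # remove runs of digits
print(s['x'].tolist())   # ['p', '', 'ab', 'd', 'f', '']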

Pandas.read_excel sometimes incorrectly reads Boolean values as 1's/0's

I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.
As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's, instead of True and False. It appears to have something to do with the amount of empty rows and type of other data. For simplicity's sake, I'm just linking a 2-sheet Excel file where the issue is replicated.
Boolean_1.xlsx
Here's the code:
import pandas as pd
df1 = pd.read_excel('Boolean_1.xlsx','Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx','Sheet2')
print(df1, '\n' *2, df2)
Here's the printed output. Mainly note row ZBA, which has the same values in both sheets but different values in the DataFrames:
Name stuff Unnamed: 1 Unnamed: 2 Unnamed: 3
0 AFD a dsf ads
1 DFA 1 2 3
2 DFD 123.3 41.1 13.7
3 IIOP why why why
4 NaN NaN NaN NaN
5 ZBA False False True
Name adslfa Unnamed: 1 Unnamed: 2 Unnamed: 3
0 asdf 6.0 3.0 6.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 ZBA 0.0 0.0 1.0
I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.
What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?
Pandas type-casting is applied by column / series. In general, Pandas doesn't work well with mixed types, or object dtype. You should expect internalised logic to determine the most efficient dtype for a series. In this case, Pandas has chosen float dtype as applicable for a series containing float and bool values. In my opinion, this is efficient and neat.
However, as you noted, this won't work when you have a transposed input dataset. Let's set up an example from scratch:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': [True, False, True, True],
'B': [np.nan, np.nan, np.nan, False],
'C': [True, 'hello', np.nan, True]})
df = df.astype({'A': bool, 'B': float, 'C': object})
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True 0.0 True
Option 1: change "row dtype"
You can, without transposing your data, change the dtype for objects in a row. This will force series B to have object dtype, i.e. a series storing pointers to arbitrary types:
df.iloc[3] = df.iloc[3].astype(bool)
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True False True
print(df.dtypes)
A bool
B object
C object
dtype: object
Option 2: transpose and cast to Boolean
In my opinion, this is the better option, as a data type is being attached to a specific category / series of input data.
df = df.T # transpose dataframe
df[3] = df[3].astype(bool) # convert series to Boolean
print(df)
0 1 2 3
A True False True True
B NaN NaN NaN False
C True hello NaN True
print(df.dtypes)
0 object
1 object
2 object
3 bool
dtype: object
read_excel will determine the dtype for each column based on the first row in the column with a value. If the first row of that column is empty, read_excel will continue to the next row until a value is found.
In Sheet1, your first row with values in columns B, C, and D contains strings. Therefore, all subsequent rows are treated as strings for these columns; in this case, FALSE = False.
In Sheet2, your first row with values in columns B, C, and D contains integers. Therefore, all subsequent rows are treated as integers for these columns; in this case, FALSE = 0.
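If you know which columns are affected, one hedged way to keep the reads consistent (the column names here are illustrative, matching the Unnamed headers pandas generated above) is to pass dtype to read_excel so pandas doesn't coerce those columns:
import pandas as pd

# Force the value columns to object so booleans survive as True/False
df2 = pd.read_excel('Boolean_1.xlsx', 'Sheet2',
                    dtype={'Unnamed: 1': object, 'Unnamed: 2': object,
                           'Unnamed: 3': object})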
