I have a column of float values in a dataframe (so I am calling this column a float Series). I want to convert all the values to integer, or just round it up so that there are no decimals.
Let us say the dataframe is df and the column is a. I tried this:
df['a'] = round(df['a'])
I got an error saying this method can't be applied to a Series, only to individual values.
Next I tried this:
for obj in df['a']:
    obj = int(round(obj))
After this I printed df but there was no change.
Where am I going wrong?
round won't work here because it's being called on a pandas Series, which is array-like, rather than on a scalar value. There is the built-in method pd.Series.round to operate on the whole Series array, after which you can change the dtype using astype:
In [43]:
df = pd.DataFrame({'a':np.random.randn(5)})
df['a'] = df['a'] * 100
df
Out[43]:
a
0 -4.489462
1 -133.556951
2 -136.397189
3 -106.993288
4 -89.820355
In [45]:
df['a'] = df['a'].round(0).astype(int)
df
Out[45]:
a
0 -4
1 -134
2 -136
3 -107
4 -90
Also, it's unnecessary to iterate over the rows when there are vectorised methods available.
Also this:
for obj in df['a']:
    obj = int(round(obj))
Does not mutate the individual cells in the Series; it operates on a copy of each value, which is why the df is not mutated.
The code in your loop:
obj = int(round(obj))
Only changes which object the name obj refers to. It does not modify the data stored in the series. If you want to do this you need to know where in the series the data is stored and update it there.
E.g.
for i, num in enumerate(df['a']):
    df['a'].iloc[i] = int(round(num))
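If you do want the loop form, here is a minimal sketch (my own, not from the answer above) that writes through df.loc, assuming a default RangeIndex, so the assignment reaches the original frame rather than a possible copy:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(5) * 100})

# Assign through df.loc rather than df['a'].iloc[i], which is
# chained indexing and may silently write to a copy.
for i in df.index:
    df.loc[i, 'a'] = round(df.loc[i, 'a'])

# The column stays float64 during the loop; cast once at the end.
df['a'] = df['a'].astype(int)

The vectorised df['a'].round(0).astype(int) above remains the better choice.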
When converting a float to an integer, I found out using df.dtypes that the column I was trying to round off was an object, not a float. The round command won't work on objects, so to do the conversion I did:
df['a'] = pd.to_numeric(df['a'])
df['a'] = df['a'].round(0).astype(int)
or as one line:
df['a'] = pd.to_numeric(df['a']).round(0).astype(int)
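If the object column may also contain entries that can't be parsed as numbers at all, a hedged variant (a sketch, not part of the original answer) is to pass errors='coerce' and cast to pandas' nullable Int64 dtype, so bad entries become missing values instead of raising:

import pandas as pd

s = pd.Series(['1.2', '3.7', 'oops'])  # object dtype, one bad value

# Unparseable entries become NaN; 'Int64' (capital I) is the
# nullable integer dtype, so the missing value survives the cast.
out = pd.to_numeric(s, errors='coerce').round(0).astype('Int64')
# 0       1
# 1       4
# 2    <NA>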
If you specifically want to round up, as your question states, you can use np.ceil:
import numpy as np
df['a'] = np.ceil(df['a'])
See also Floor or ceiling of a pandas series in python?
Not sure there's much advantage to type converting to int; pandas and numpy love floats.
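Since "round up" is ambiguous for negative values, a small sketch contrasting the options may help:

import numpy as np
import pandas as pd

s = pd.Series([-1.5, -0.2, 0.2, 1.5])

s.round()    # round half to even: -2.0, -0.0, 0.0, 2.0
np.floor(s)  # toward -inf:        -2.0, -1.0, 0.0, 1.0
np.ceil(s)   # toward +inf:        -1.0, -0.0, 1.0, 2.0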
Related
Let's say df is a pandas DataFrame.
I would like to find all columns of numeric type.
Something like:
isNumeric = is_numeric(df)
You could use the select_dtypes method of DataFrame. It takes two parameters, include and exclude. So isNumeric would look like:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
Simple one-line answer to create a new dataframe with only numeric columns:
df.select_dtypes(include=np.number)
If you want the names of numeric columns:
df.select_dtypes(include=np.number).columns.tolist()
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(7, 10),
                   'B': np.random.rand(3),
                   'C': ['foo', 'bar', 'baz'],
                   'D': ['who', 'what', 'when']})
df
# A B C D
# 0 7 0.704021 foo who
# 1 8 0.264025 bar what
# 2 9 0.230671 baz when
df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
# A B
# 0 7 0.704021
# 1 8 0.264025
# 2 9 0.230671
colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']
You can use the undocumented function _get_numeric_data() to filter only numeric columns:
df._get_numeric_data()
Example:
In [32]: data
Out[32]:
A B
0 1 s
1 2 s
2 3 s
3 4 s
In [33]: data._get_numeric_data()
Out[33]:
A
0 1
1 2
2 3
3 4
Note that this is a "private method" (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.
df.select_dtypes(exclude=['object'])
Update:
df.select_dtypes(include=np.number)
or, with newer versions of pandas:
df.select_dtypes('number')
Simple one-liner:
df.select_dtypes('number').columns
The following code will return a list of the names of the numeric columns of a data set:
cnames = list(marketing_train.select_dtypes(exclude=['object']).columns)
Here marketing_train is my data set, select_dtypes() is the function for selecting data types using the exclude and include arguments, and columns is used to fetch the column names of the data set.
The output of the above code will be the following:
['custAge',
'campaign',
'pdays',
'previous',
'emp.var.rate',
'cons.price.idx',
'cons.conf.idx',
'euribor3m',
'nr.employed',
'pmonths',
'pastEmail']
This is another simple way to find the numeric columns in a pandas data frame:
numeric_clmns = df.dtypes[df.dtypes != "object"].index
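A quick sketch of this on a toy frame (note that it keeps every non-object column, booleans and datetimes included, not strictly the numeric ones):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5], 'C': ['x', 'y']})
numeric_clmns = df.dtypes[df.dtypes != "object"].index
# Index(['A', 'B'], dtype='object')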
We can include and exclude data types as required, as below:
train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number') #will include all the numeric types
This is taken from the select_dtypes docstring, as shown in a Jupyter Notebook:
To select all numeric types, use np.number or 'number'
To select strings you must use the object dtype, but note that this will return all object dtype columns. See the NumPy dtype hierarchy: http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html
To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
To select Pandas categorical dtypes, use 'category'
To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
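Here is a brief sketch exercising those selectors on a toy frame (the column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    'n': [1, 2],
    'f': [0.5, 1.5],
    's': ['a', 'b'],
    'c': pd.Categorical(['x', 'y']),
    't': pd.to_datetime(['2020-01-01', '2020-01-02']),
})

df.select_dtypes('number')    # n, f
df.select_dtypes(object)      # s
df.select_dtypes('category')  # c
df.select_dtypes('datetime')  # t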
Although this is an old subject, I think the following formula is easier than all the other suggestions:
df[df.describe().columns]
Since the describe() function only works on numeric columns by default, the columns of the output will only be the numeric ones.
Please see the code below:
if dataset.select_dtypes(include=[np.number]).shape[1] > 0:
    display(dataset.select_dtypes(include=[np.number]).describe())

if dataset.select_dtypes(include=['object']).shape[1] > 0:
    display(dataset.select_dtypes(include=['object']).describe())
This way you can check whether the values are numeric, such as float and int, or strings; the second if statement checks for string values, which have the object dtype.
Adapting this answer, you could do (using df.loc, since df.ix was removed from pandas):
df.loc[:, df.applymap(np.isreal).all(axis=0)]
Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks whether all values in a column are True, returning a Series of Booleans that can be used to index the desired columns.
A lot of the posted answers are inefficient. These answers either return/select a subset of the original dataframe (a needless copy) or perform needless computational statistics in the case of describe().
To just get the column names that are numeric, one can use a conditional list comprehension with the pd.api.types.is_numeric_dtype function:
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
I'm not sure when this function was introduced.
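A short demo of the comprehension on a toy frame (this only inspects dtypes, so no data is copied):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5], 'C': ['x', 'y']})

numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
# ['A', 'B']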
def is_type(df, baseType):
    import numpy as np
    import pandas as pd
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data=test, index=df.columns, columns=["test"])

def is_float(df):
    import numpy as np
    return is_type(df, np.floating)

def is_number(df):
    import numpy as np
    return is_type(df, np.number)

def is_integer(df):
    import numpy as np
    return is_type(df, np.integer)
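Assuming the definitions above are in scope, usage might look like this:

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 1.5], 'C': ['x', 'y']})
is_number(df)
#     test
# A   True
# B   True
# C  False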
In a pandas dataframe, a column with dtype=object can, in fact, contain items of mixed types, e.g. integers and strings.
In this example, column a has dtype object, but the first item is a string while all the others are ints:
import numpy as np, pandas as pd
df = pd.DataFrame()
df['a'] = np.arange(0, 9)
df.iloc[0, 0] = 'test'
print(df.dtypes)
print(type(df.iloc[0,0]))
print(type(df.iloc[1,0]))
My question is: is there a quick way to identify which columns with dtype=object contain, in fact, mixed types like above? Since pandas does not have a dtype = str, this is not immediately apparent.
However, I have had situations where, importing a large csv file into pandas, I would get a warning like:
sys:1: DtypeWarning: Columns (15,16) have mixed types. Specify dtype option on import or set low_memory=False
Is there an easy way to replicate that and explicitly list the columns with mixed types? Or do I manually have to go through them one by one, see if I can convert them to string, etc?
The background is that I am trying to export a dataframe to a Microsoft SQL Server using DataFrame.to_sql and SQLAlchemy. I get an
OverflowError: int too big to convert
but my dataframe does not contain columns with dtype int - only object and float64. I'm guessing this is because one of the object columns must have both strings and integers.
Setup
df = pd.DataFrame(np.ones((3, 3)), columns=list('WXY')).assign(Z='c')
df.iloc[0, 0] = 'a'
df.iloc[1, 2] = 'b'
df
W X Y Z
0 a 1.0 1 c
1 1 1.0 b c
2 1 1.0 1 c
Solution
Find all types and count how many unique ones per column.
df.loc[:, df.applymap(type).nunique().gt(1)]
W Y
0 a 1
1 1 b
2 1 1
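If you also want to see which types each flagged column actually mixes, one possible follow-up on the same df is:

# Collect the distinct Python types appearing in each column.
df.applymap(type).apply(set)
# W and Y each contain both str and float; X and Z are homogeneous.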
So there's a DataFrame, say:
>>> df = pd.DataFrame({
... 'A':[1,2,'Three',4],
... 'B':[1,'Two',3,4]})
>>> df
A B
0 1 1
1 2 Two
2 Three 3
3 4 4
I want to select the rows where the value in a particular column is of type str.
For example, I want to select the rows where the data in column A is a str,
so it should print something like:
A B
2 Three 3
The intuitive code would be something like:
df[type(df.A) == str]
Which obviously doesn't work!
Thanks, please help!
This works:
df[df['A'].apply(lambda x: isinstance(x, str))]
You can do something similar to what you're asking with
In [14]: df[pd.to_numeric(df.A, errors='coerce').isnull()]
Out[14]:
A B
2 Three 3
Why only similar? Because Pandas stores things in homogeneous columns (all entries in a column are of the same type). Even though you constructed the DataFrame from heterogeneous types, each column is coerced to the lowest common denominator:
In [16]: df.A.dtype
Out[16]: dtype('O')
Consequently, you can't ask which rows are of what type - they will all be of the same type. What you can do is to try to convert the entries to numbers, and check where the conversion failed (this is what the code above does).
It's generally a bad idea to use a series to hold mixed numeric and non-numeric types. This will cause your series to have dtype object, which is nothing more than a sequence of pointers, much like list; indeed, many operations on such a series can be processed more efficiently with a regular list.
With this disclaimer, you can use Boolean indexing via a list comprehension:
res = df[[isinstance(value, str) for value in df['A']]]
print(res)
A B
2 Three 3
The equivalent is possible with pd.Series.apply, but this is no more than a thinly veiled loop and may be slower than the list comprehension:
res = df[df['A'].apply(lambda x: isinstance(x, str))]
If you are certain all non-numeric values must be strings, then you can convert to numeric and look for nulls, i.e. values that cannot be converted:
res = df[pd.to_numeric(df['A'], errors='coerce').isnull()]
I constantly struggle with cleanly iterating over, or applying a function to, Pandas DataFrames of variable length, specifically a length-1 DataFrame slice (a Pandas Series).
Simple example, a DataFrame and a function that acts on each row of it. The format of the dataframe is known/expected.
def stringify(row):
    return "-".join([row["y"], str(row["x"]), str(row["z"])])
df = pd.DataFrame(dict(x=[1, 2, 3], y=["foo", "bar", "bro"], z=[-99, 1.04, 213]))
Out[600]:
x y z
0 1 foo -99.00
1 2 bar 1.04
2 3 bro 213.00
df_slice = df.iloc[0] # This is a Series
Usually, you can apply the function in one of the following ways:
stringy = df.apply(stringify,axis=1)
# or
stringy = [stringify(row) for _,row in df.iterrows()]
Out[611]: ['foo-1--99.0', 'bar-2-1.04', 'bro-3-213.0']
## Error with same syntax if Series
stringy = df_slice.apply(stringify, axis=1)
If the dataframe is empty, or if you have a length-1 slice like the one above (a Series), these methods no longer work: a Series does not have an iterrows() method, and its apply applies the function to each entry (the former columns), not to rows.
Is there a cleaner built-in method to iterate over or apply functions to DataFrames of variable length? Otherwise you have to constantly write cumbersome logic like this:
if type(df) is pd.DataFrame:
    if len(df) == 0:
        return None
    else:
        return df.apply(stringify, axis=1)
elif type(df) is pd.Series:
    return stringify(df)
I realize there are methods to ensure you form length-1 DataFrames, but what I am asking for is a clean way to apply/iterate over the various pandas data structures when they could be like-formatted dataframes or series.
There is no generic way to write a function which will seamlessly handle both DataFrames and Series. You would either need to use an if-statement to check the type, or use try..except to handle exceptions.
Instead of doing either of those things, I think it is better to make sure you create the right type of object before calling apply. For example, instead of using df.iloc[0] which returns a Series, use df.iloc[:1] to select a DataFrame of length 1. As long as you pass a slice range instead of a single value to df.iloc, you'll get back a DataFrame.
In [155]: df.iloc[0]
Out[155]:
x 1
y foo
z -99
Name: 0, dtype: object
In [156]: df.iloc[:1]
Out[156]:
x y z
0 1 foo -99
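Continuing that session, the same apply call from the question now works unchanged on the length-1 slice:

In [157]: df.iloc[:1].apply(stringify, axis=1)
Out[157]:
0    foo-1--99.0
dtype: object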
I have a Pandas Dataframe with different dtypes for the different columns. E.g. df.dtypes returns the following.
Date datetime64[ns]
FundID int64
FundName object
CumPos int64
MTMPrice float64
PricingMechanism object
Several of these columns have missing values in them. Doing group operations with NaN values in place causes problems. Getting rid of them with the .fillna() method is the obvious choice. The problem is that the obvious choice for strings is .fillna(""), while .fillna(0) is the correct choice for ints and floats. Using either method on the whole DataFrame throws an exception. Are there any elegant solutions besides doing them individually (I have about 30 columns)? I have a lot of code depending on the DataFrame, and I would prefer not to retype the columns, as that is likely to break some other logic.
Can do:
df.FundID.fillna(0)
df.FundName.fillna("")
etc
You can iterate through them and use an if statement!
for col in df:
    # get dtype for column
    dt = df[col].dtype
    # check if it is a number
    if dt == int or dt == float:
        df[col].fillna(0)
    else:
        df[col].fillna("")
When you iterate through a pandas DataFrame, you will get the names of each of the columns, so to access those columns, you use df[col]. This way you don't need to do it manually and the script can just go through each column and check its dtype!
You can grab the float64 and object columns using:
In [11]: float_cols = df.blocks['float64'].columns
In [12]: object_cols = df.blocks['object'].columns
and int columns won't have NaNs, or else they would have been upcast to float.
Now you can apply the respective fillnas; one cheeky way:
In [13]: d1 = dict((col, '') for col in object_cols)
In [14]: d2 = dict((col, 0) for col in float_cols)
In [15]: df.fillna(value=dict(d1, **d2))
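Note that df.blocks has since been removed from pandas; a present-day sketch of the same idea using select_dtypes (on a made-up two-column frame) might be:

import numpy as np
import pandas as pd

df = pd.DataFrame({'FundName': ['A', None], 'MTMPrice': [1.5, np.nan]})

float_cols = df.select_dtypes('float64').columns
object_cols = df.select_dtypes('object').columns

# fillna accepts a column -> fill-value mapping, so both dtypes
# can be handled in a single call.
df = df.fillna({**{c: '' for c in object_cols},
                **{c: 0 for c in float_cols}})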
A compact version example:
# replace NaN with '' for columns of type 'object'
df = df.select_dtypes(include='object').fillna('')
However, after the above operation, the dataframe will only contain the 'object' type columns. To keep all columns, use the solution proposed by @Ryan Saxe.
@Ryan Saxe's answer is accurate. To get it to work on my data I had to set inplace=True and pass the fill values as value=0 and value="". See the code below:
for col in df:
    # get dtype for column
    dt = df[col].dtype
    # check if it is a number
    if dt == int or dt == float:
        df[col].fillna(value=0, inplace=True)
    else:
        df[col].fillna(value="", inplace=True)
Similar to @Guddi: a bit verbose, but still more concise than @Ryan's answer, and it keeps all columns:
df[df.select_dtypes("object").columns] = df.select_dtypes("object").fillna("")
Rather than running the conversion one column at a time, which is inefficient, here is a way to grab all of the int or float columns and change them in one shot.
int_float_cols = df.select_dtypes(include=['int', 'float']).columns
df[int_float_cols] = df[int_float_cols].fillna(value=0)
It is obvious how to adapt this to handle object columns; a sketch follows at the end of this answer.
I'm aware that in older Pandas versions, NAs were not allowed in integer columns, so grabbing the "ints" is not strictly necessary, and it may accidentally promote ints to floats. However, in our use case, it is better to be safe than sorry.
I ran into this because the ordinary approach, df.fillna(0), corrupted all of the datetime variables.
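For completeness, here is a sketch of the object-column adaptation hinted at above, on a made-up frame with a datetime column to show that it stays untouched:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan],
                   'b': ['x', None],
                   'ts': pd.to_datetime(['2020-01-01', None])})

int_float_cols = df.select_dtypes(include=['int', 'float']).columns
df[int_float_cols] = df[int_float_cols].fillna(value=0)

object_cols = df.select_dtypes(include=['object']).columns
df[object_cols] = df[object_cols].fillna(value='')

# 'ts' still holds NaT, unlike with a blanket df.fillna(0).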