How to pick the numeric columns in a pd.DataFrame [duplicate] - python

Let's say df is a pandas DataFrame.
I would like to find all columns of numeric type.
Something like:
isNumeric = is_numeric(df)

You could use the select_dtypes method of DataFrame. It takes two parameters, include and exclude. So isNumeric would look like:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)

Simple one-line answer to create a new dataframe with only numeric columns:
df.select_dtypes(include=np.number)
If you want the names of numeric columns:
df.select_dtypes(include=np.number).columns.tolist()
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(7, 10),
                   'B': np.random.rand(3),
                   'C': ['foo', 'bar', 'baz'],
                   'D': ['who', 'what', 'when']})
df
#    A         B    C     D
# 0  7  0.704021  foo   who
# 1  8  0.264025  bar  what
# 2  9  0.230671  baz  when
df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
#    A         B
# 0  7  0.704021
# 1  8  0.264025
# 2  9  0.230671
colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']

You can use the undocumented function _get_numeric_data() to filter only numeric columns:
df._get_numeric_data()
Example:
In [32]: data
Out[32]:
   A  B
0  1  s
1  2  s
2  3  s
3  4  s

In [33]: data._get_numeric_data()
Out[33]:
   A
0  1
1  2
2  3
3  4
Note that this is a "private method" (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.
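If you would rather stay on the public API, the select_dtypes call from the other answers returns the same columns; a quick sanity check on a made-up frame:
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4], 'B': ['s', 's', 's', 's']})
print(data._get_numeric_data().columns.tolist())              # ['A']
print(data.select_dtypes(include='number').columns.tolist())  # ['A']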

df.select_dtypes(exclude = ['object'])
Update:
df.select_dtypes(include= np.number)
or, with a newer version of pandas,
df.select_dtypes('number')

Simple one-liner:
df.select_dtypes('number').columns

The following code will return a list of the names of the numeric columns of a data set.
cnames=list(marketing_train.select_dtypes(exclude=['object']).columns)
Here marketing_train is my data set, select_dtypes() selects columns by data type using the exclude and include arguments, and columns fetches the column names of the data set.
The output of the above code will be the following:
['custAge',
'campaign',
'pdays',
'previous',
'emp.var.rate',
'cons.price.idx',
'cons.conf.idx',
'euribor3m',
'nr.employed',
'pmonths',
'pastEmail']

Here is another simple way to find the numeric columns in a pandas DataFrame:
numeric_clmns = df.dtypes[df.dtypes != "object"].index
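For example, on a small frame like the one used earlier in this thread (reproduced here), it returns an Index you can use to subset the frame; note that it keeps anything that is not object dtype, so datetime or bool columns would also pass:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(3), 'B': np.random.rand(3), 'C': ['foo', 'bar', 'baz']})
numeric_clmns = df.dtypes[df.dtypes != "object"].index
print(numeric_clmns)      # Index(['A', 'B'], dtype='object')
print(df[numeric_clmns])  # only columns A and B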

We can include and exclude data types as required, as below:
train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number') #will include all the numeric types
The following is quoted from the select_dtypes docstring (as shown in a Jupyter notebook):
To select all numeric types, use np.number or 'number'
To select strings you must use the object dtype, but note that this will return all object dtype columns. See the NumPy dtype hierarchy: http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html
To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
To select Pandas categorical dtypes, use 'category'
To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
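A small sketch of those selectors on a made-up frame (the column names here are purely illustrative):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'num': [1.0, 2.0],
    'txt': ['a', 'b'],
    'when': pd.to_datetime(['2020-01-01', '2020-01-02']),
    'cat': pd.Categorical(['x', 'y']),
})
print(df.select_dtypes(include='number').columns.tolist())     # ['num']
print(df.select_dtypes(include='datetime').columns.tolist())   # ['when']
print(df.select_dtypes(include='category').columns.tolist())   # ['cat']
print(df.select_dtypes(exclude=[np.number]).columns.tolist())  # ['txt', 'when', 'cat']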

Although this is an old topic, I think the following approach is simpler than those in the other answers:
df[df.describe().columns]
Since describe() by default only includes numeric columns, the columns of its output (and hence of the selection) will be numeric only.
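For instance, on a small made-up frame (keep in mind that describe() computes statistics that are then thrown away, and that if the frame has no numeric columns at all, describe() falls back to describing every column, so the trick only works when at least one numeric column exists):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z'], 'C': [0.1, 0.2, 0.3]})
numeric_df = df[df.describe().columns]
print(numeric_df.columns.tolist())  # ['A', 'C']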

Please see the below code:
if dataset.select_dtypes(include=[np.number]).shape[1] > 0:
    display(dataset.select_dtypes(include=[np.number]).describe())
if dataset.select_dtypes(include=[object]).shape[1] > 0:
    display(dataset.select_dtypes(include=[object]).describe())
This way you can check whether the values are numeric (such as float and int) or strings. The second if statement checks for the string values, which have the object dtype.

Adapting this answer, you could do
df.loc[:, df.applymap(np.isreal).all(axis=0)]
Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks whether all values in a column are True, returning a Series of Booleans that can be used to index the desired columns.

A lot of the posted answers are inefficient. These answers either return/select a subset of the original dataframe (a needless copy) or perform needless computational statistics in the case of describe().
To just get the column names that are numeric, one can use a conditional list comprehension with the pd.api.types.is_numeric_dtype function:
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
I'm not sure when this function was introduced.
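A quick check on a small, made-up frame:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y'], 'c': [0.5, 1.5]})
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
print(numeric_cols)  # ['a', 'c']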

def is_type(df, baseType):
    import numpy as np
    import pandas as pd
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data=test, index=df.columns, columns=["test"])

def is_float(df):
    # np.float was simply an alias for the built-in float and has been removed from NumPy
    return is_type(df, float)

def is_number(df):
    import numpy as np
    return is_type(df, np.number)

def is_integer(df):
    import numpy as np
    return is_type(df, np.integer)
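A usage sketch with a made-up frame, assuming the definitions above are in scope; each helper returns a one-column boolean DataFrame indexed by the original column names:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3], 'B': [0.1, 0.2, 0.3], 'C': ['x', 'y', 'z']})
print(is_number(df))
#     test
# A   True
# B   True
# C  False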

Related

change dtype pandas by column number for multiple columns

I would like to change the dtype of a dataframe which I am going to read in using python pandas. I know that I can change the dtype by the column name like this:
df = pd.read_csv("blablab.csv", dtype = {"Age":int}
However, I would like to set the dtype by the column number. E.g. column 1,3,5 to "datetime" and the dtype of column 6 until the last column to dtype "float". Is there anything like:
df = pd.read_csv("blablab.csv", dtype = {1,3,5: datetime64, 6-end: float64}
Thank you very much, your help is greatly appreciated!
I would recommend building the dtype dict ahead of the import: read in one row to get the column names, build a dict comprehension with a default type, and then override the columns that need special types. I pulled in StringIO just to run a test case below.
import pandas as pd
import numpy as np
from io import StringIO
dummyCSV = """header 1,header 2,header 3
1,2,3
4,5,6
7,8,9
11,12,13
14,15,16"""
blabblab_csv = StringIO(dummyCSV, newline='\n')
limitedRead = pd.read_csv(blabblab_csv, sep=",", nrows = 1)
#set a default type and populate all column types
defaultType = np.float64
dTypes = {key: defaultType for key in list(limitedRead.columns)}
#then override the columns you want, using the integer position
dTypes[limitedRead.columns[1]] = np.int32
blabblab_csv = StringIO(dummyCSV, newline='\n') #reset virtual file
fullRead = pd.read_csv(blabblab_csv, sep=",", dtype = dTypes)
I know it's probably a little late for you, but I just had to do this for a project I'm working on, so hopefully the next person who searches for this topic will find an answer waiting for them.
One way is to change the type after creating the DataFrame like this:
import pandas as pd
df = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['c', 'd', 'e'],
                   'c': ['1', '2', '3'], 'd': ['4', '5', '6']})
df[df.columns[2:]] = df[df.columns[2:]].astype(float)
df['c']
Output:
0 1.0
1 2.0
2 3.0
Name: c, dtype: float64
Here I change the type of the last two columns to float.
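For completeness, a hedged sketch that works directly from column positions at read time (reusing the asker's placeholder file name; the positions are the ones from the question): read the header only, map positions to names for dtype, and pass the datetime positions to parse_dates, which accepts integer positions.
import pandas as pd
import numpy as np

# Read only the header row to learn the column names
cols = pd.read_csv("blablab.csv", nrows=0).columns

# Columns 6 to the end as float; columns 1, 3 and 5 parsed as datetimes
dtypes = {cols[i]: np.float64 for i in range(6, len(cols))}
df = pd.read_csv("blablab.csv", dtype=dtypes, parse_dates=[1, 3, 5])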

Pandas: combining duplicate index values

I have a pandas series that I would like to combine in three different ways. The series is as follows:
import pandas as pd
timestamps = [1,1,1,2,3,3,3,4]
quantities = [10,0,2,6,7,2,8,0]
series = pd.Series(quantities, index=timestamps)
Clearly the timestamps have 3 values of 1, 1 value of 2, 3 values of 3 and 1 value of 4. I would like to generate the following series:
1. Sum of the duplicate index values:
pd.Series([12,6,17,0], index=[1,2,3,4])
2. Median of the duplicate index values:
pd.Series([2,6,7,0], index=[1,2,3,4])
3. The number of duplicate index values:
pd.Series([3,1,3,1], index=[1,2,3,4])
In numpy I would achieve this using a unique_elements_to_indices method:
from typing import Dict
import numpy as np
def unique_elements_to_indices(array: np.array) -> Dict:
    mapping = {}
    for unique_element in np.unique(array):
        mapping[unique_element] = np.where(array == unique_element)[0]
    return mapping
... and then I would loop through the unique_elements and use np.where to locate the quantities for that given unique_element.
Is there a way to achieve this quickly in pandas, please?
Thanks.
It is possible to use the functions sum and median for separate outputs, with the parameter level=0 to aggregate by index:
print (series.sum(level=0))
print (series.median(level=0))
But more generally, aggregate by index with groupby:
print (series.groupby(level=0).sum())
print (series.groupby(level=0).median())
# the difference between count and size is that count excludes NaN values
print (series.groupby(level=0).size())
print (series.groupby(level=0).count())
If you need them all together in a new DataFrame, use GroupBy.agg with a list of aggregate functions:
print(series.groupby(level=0).agg(['sum', 'median', 'size']))
You could use .groupby for this:
import pandas as pd
timestamps = [1,1,1,2,3,3,3,4]
quantities = [10,0,2,6,7,2,8,0]
sr = pd.Series(quantities, index=timestamps)
print(sr.groupby(sr.index).sum())
print(sr.groupby(sr.index).median())
print(sr.groupby(sr.index).count())
When you are working with the pandas library, it is advisable to convert your data into a DataFrame. The easiest way is as below:
timestamps = [1,1,1,2,3,3,3,4]
quantities = [10,0,2,6,7,2,8,0]
d = {'quantities': quantities, 'timestamps': timestamps}
df = pd.DataFrame(d)
df.groupby('timestamps').sum().reset_index()
In a similar way you can use other functions as well, as sketched below. Kindly let me know if this works for you.
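If you want all three requested summaries at once from the DataFrame version, a short sketch with .agg (the aggregation names are chosen to mirror the outputs asked for above):
import pandas as pd

timestamps = [1, 1, 1, 2, 3, 3, 3, 4]
quantities = [10, 0, 2, 6, 7, 2, 8, 0]
df = pd.DataFrame({'quantities': quantities, 'timestamps': timestamps})

summary = df.groupby('timestamps')['quantities'].agg(['sum', 'median', 'count'])
print(summary)
#             sum  median  count
# timestamps
# 1            12     2.0      3
# 2             6     6.0      1
# 3            17     7.0      3
# 4             0     0.0      1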

Why does referencing a concatenated pandas dataframe return multiple entries?

When I create a dataframe using concat like this:
import pandas as pd
dfa = pd.DataFrame({'a':[1],'b':[2]})
dfb = pd.DataFrame({'a':[3],'b':[4]})
dfc = pd.concat([dfa,dfb])
And I try to reference like I would for any other DataFrame I get the following result:
>>> dfc['a'][0]
0 1
0 3
Name: a, dtype: int64
I would expect my concatenated DataFrame to behave like a normal DataFrame and return the integer that I want like this simple DataFrame does:
>>> dfa['a'][0]
1
I am just a beginner, is there a simple explanation for why the same call is returning an entire DataFrame and not the single entry that I want? Or, even better, an easy way to get my concatenated DataFrame to respond like a normal DataFrame when I try to reference it? Or should I be using something other than concat?
You've mistaken what normal behavior is. dfc['a'][0] is a label lookup and matches anything with an index value of 0, of which there are two, because you concatenated two dataframes whose indexes both include 0.
To select by position 0 instead, use
dfc['a'].iloc[0]
or you could have constructed dfc like
dfc = pd.concat([dfa,dfb], ignore_index=True)
dfc['a'][0]
Both returning
1
EDITED (thx piRSquared's comment)
Use append() instead of pd.concat() (note that DataFrame.append has since been deprecated and removed in newer pandas versions, so concat with ignore_index=True is the forward-compatible choice):
dfc = dfa.append(dfb, ignore_index=True)
dfc['a'][0]
1

Convert all elements in float Series to integer

I have a column with float values in a dataframe (so I am calling this column a float Series). I want to convert all the values to integers or just round them so that there are no decimals.
Let us say the dataframe is df and the column is a, I tried this :
df['a'] = round(df['a'])
I got an error saying this method can't be applied to a Series, only applicable to individual values.
Next I tried this :
for obj in df['a']:
    obj = int(round(obj))
After this I printed df but there was no change.
Where am I going wrong?
round won't work because it's being called on a pandas Series, which is array-like rather than a scalar value. There is the built-in method pd.Series.round to operate on the whole Series, after which you can change the dtype using astype:
In [43]:
df = pd.DataFrame({'a':np.random.randn(5)})
df['a'] = df['a'] * 100
df
Out[43]:
            a
0   -4.489462
1 -133.556951
2 -136.397189
3 -106.993288
4  -89.820355
In [45]:
df['a'] = df['a'].round(0).astype(int)
df
Out[45]:
     a
0   -4
1 -134
2 -136
3 -107
4  -90
Also, it's unnecessary to iterate over the rows when there are vectorised methods available.
Also this:
for obj in df['a']:
    obj = int(round(obj))
does not mutate the individual cells in the Series; it operates on a copy of the value, which is why the df is not mutated.
The code in your loop:
obj = int(round(obj))
only changes which object the name obj refers to. It does not modify the data stored in the series. If you want to do this you need to know where in the series the data is stored and update it there.
E.g.
for i, num in enumerate(df['a']):
    df['a'].iloc[i] = int(round(num))
When converting a float to an integer, I found out using df.dtypes that the column I was trying to round off was an object not a float. The round command won't work on objects so to do the conversion I did:
df['a'] = pd.to_numeric(df['a'])
df['a'] = df['a'].round(0).astype(int)
or as one line:
df['a'] = pd.to_numeric(df['a']).round(0).astype(int)
If you specifically want to round up as your question states, you can use np.ceil:
import numpy as np
df['a'] = np.ceil(df['a'])
See also Floor or ceiling of a pandas series in python?
Not sure there's much advantage to type converting to int; pandas and numpy love floats.

Find mixed types in Pandas columns

Every so often I get this warning when parsing data files:
WARNING:py.warnings:/usr/local/python3/miniconda/lib/python3.4/site-
packages/pandas-0.16.0_12_gdcc7431-py3.4-linux-x86_64.egg/pandas
/io/parsers.py:1164: DtypeWarning: Columns (0,2,14,20) have mixed types.
Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
But if the data is large (I have 50k rows), how can I find WHERE in the data the change of dtype occurs?
I'm not entirely sure what you're after, but it's easy enough to find the rows which contain elements which don't share the type of the first row. For example:
>>> df = pd.DataFrame({"A": np.arange(500), "B": np.arange(500.0)})
>>> df.loc[321, "A"] = "Fred"
>>> df.loc[325, "B"] = True
>>> weird = (df.applymap(type) != df.iloc[0].apply(type)).any(axis=1)
>>> df[weird]
        A     B
321  Fred   321
325   325  True
In addition to DSM's answer, with a many-column dataframe it can be helpful to find the columns that change type like so:
for col in df.columns:
weird = (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
if len(df[weird]) > 0:
print(col)
This approach uses pandas.api.types.infer_dtype to find the columns which have mixed dtypes. It was tested with Pandas 1 under Python 3.8.
Note that this answer has multiple uses of assignment expressions which work only with Python 3.8 or newer. It can however trivially be modified to not use them.
if mixed_dtypes := {c: dtype for c in df.columns if (dtype := pd.api.types.infer_dtype(df[c])).startswith("mixed")}:
    raise TypeError(f"Dataframe has one or more mixed dtypes: {mixed_dtypes}")
This approach doesn't however find a row with the changed dtype.
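If you also want to locate the offending rows, a small follow-up sketch (the "majority type" heuristic and the sample frame are my own additions, not part of the original answer):
import pandas as pd

df = pd.DataFrame({"A": list(range(5)), "B": list(range(5))})
df.loc[3, "A"] = "Fred"  # make column A a mixed int/str column

for col in df.columns:
    if pd.api.types.infer_dtype(df[col]).startswith("mixed"):
        element_types = df[col].map(type)
        majority = element_types.value_counts().idxmax()
        # rows whose element type differs from the column's most common type
        print(col, df.index[element_types != majority].tolist())
# prints: A [3]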
Create sample data with a column that has 2 data types
import seaborn
iris = seaborn.load_dataset("iris")
# Change one row to another type
iris.loc[0,"sepal_length"] = iris.loc[0,"sepal_length"].astype(str)
When columns use more than one type, print the column name and the types used:
for col in iris.columns:
    unique_types = iris[col].apply(type).unique()
    if len(unique_types) > 1:
        print(col, unique_types)
To fix the column types you can:
use df[col] = df[col].astype(str) to change the data type.
or, if the data frame was read from a CSV file, define the dtype argument as a dictionary of columns.
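For example, a minimal sketch of the second option (the file name and the chosen dtypes here are hypothetical):
import pandas as pd

# Force problematic columns to a single type at read time
df = pd.read_csv("iris.csv", dtype={"sepal_length": float, "species": "category"})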
