Pandas series - how to validate each item is categorical - python

I am importing data which should be categorical from an externally sourced csv file into a pandas dataframe.
The first thing I want to do is to validate that the values are valid for the categorical type.
My strategy is to create an instance of CategoricalDtype and then use apply to test each value.
Question: The only way I can figure out is to test whether each value is in CategoricalDtype.categories.values, but is there a "better" way? Are there any methods I can use to achieve the same? I'm new to CategoricalDtype and it doesn't feel like this is the best way to be testing the data values.
# example of what I'm doing
import pandas as pd
from pandas.api.types import CategoricalDtype
df = pd.read_csv('data.csv')
cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)
df['data_is_valid']=df['data_field'].apply(lambda x: x in cat.categories.values)

If you need to test whether the values in column data_field are among the defined categories:
df['data_is_valid']=df['data_field'].isin(cat.categories)
If you also need to test whether the column has a categorical dtype:
from pandas.api.types import is_categorical_dtype
df['data_is_valid']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
The difference can be seen in this data sample:
from pandas.api.types import CategoricalDtype
from pandas.api.types import is_categorical_dtype
df = pd.DataFrame({ "data_field": ["A", "B", "C", "D", 'E']})
cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)
# categories match, but the column is not Categorical yet
df['data_is_valid1']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
# categories match, Categorical dtype not tested
df['data_is_valid2']=df['data_field'].isin(cat.categories)
cat_type = CategoricalDtype(categories=["A", "B", "C", "D", "E"], ordered=True)
# create a Categorical column
df['data_field'] = df['data_field'].astype(cat_type)
# categories and Categorical dtype both match
df['data_is_valid3']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
# categories match, Categorical dtype not tested
df['data_is_valid4']=df['data_field'].isin(cat.categories)
print (df)
  data_field  data_is_valid1  data_is_valid2  data_is_valid3  data_is_valid4
0          A           False            True            True            True
1          B           False            True            True            True
2          C           False            True            True            True
3          D           False           False           False           False
4          E           False           False           False           False
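Another option, not from the answer above but just a sketch using the same cat dtype: cast the column to the dtype itself. Any value outside the defined categories becomes NaN after the cast, which you can then flag:
# values outside A/B/C become NaN after the cast
converted = df['data_field'].astype(cat)

# valid where the cast kept the value (rows that were already NaN also end up NaN here)
df['data_is_valid_astype'] = converted.notna()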

Related

How do I check if a value is of, or promotable to, a column type in pandas?

For example, suppose I have the following DataFrame.
import pandas as pd
df = pd.DataFrame([['a', 1.3, 10], ['b', 2, 20]], columns=['id', 'v1', 'v2'])
df = df.astype({col: 'category' for col in df.columns[df.dtypes == object]})
print(df)
print()
print(df.dtypes)
  id   v1  v2
0  a  1.3  10
1  b  2.0  20

id    category
v1     float64
v2       int64
dtype: object
Given a value and a column identifier, I need to know whether the type of the value is compatible with the column. (Of the same type or promotable.)
For category fields, I'd like to know if a value is in the category. I can do something like
'x' in df['id'].unique()
but there may be a more efficient way.
Thanks.
Suppose you have a value x = 4.3. You can simply compare the types:
df.v1.dtype gives the data type of the particular column you want to compare against (v1 in this case), so
type(x) == df.v1.dtype
# output: True
I think 'x' in df['id'].unique() will check if the value 'x' is in the column—which is not the same as checking if it is one of the categories.
Based on this answer, it looks like you can check whether a value is among the categories as follows:
'x' in df["id"].cat.categories
Test:
ids = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
assert(("c" in ids) is False)
assert(("c" in ids.categories) is True)
UPDATE:
Is this what you wanted?
def check_type(x, df, name):
    try:
        return x in df[name].cat.categories
    except AttributeError:
        return type(x) == df[name].dtype
Test (this assumes the Categorical ids from above has been added to df as a column, e.g. df["ids"] = ids):
assert(check_type('a', df, "ids") is True)
assert(check_type('c', df, "ids") is True)
assert(check_type(3.4, df, "ids") is False)
assert(check_type(3.4, df, "v1") is True)
assert(check_type(3.4, df, "v2") is False)
assert(check_type(3, df, "v1") is False)
assert(check_type(3, df, "v2") is True)
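For the "promotable" part of the question, one possibility (just a sketch, not part of the answer above) is to fall back on NumPy's casting rules instead of strict dtype equality. Note this is deliberately more permissive than the asserts above: an int would, for example, be accepted for a float64 column.
import numpy as np
import pandas as pd

def check_type_promotable(x, df, name):
    # categorical columns: the value must be one of the defined categories
    try:
        return x in df[name].cat.categories
    except AttributeError:
        pass
    # other columns: can the value's minimal dtype be safely cast
    # to the column's dtype? (e.g. int -> float64 is allowed)
    return np.can_cast(np.min_scalar_type(x), df[name].dtype)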

How to implement python custom function on dictionary of dataframes

I have a dictionary that contains 3 dataframes.
How do I apply a custom function to each dataframe in the dictionary?
In simpler terms, I want to apply the function find_outliers as seen below
# User defined function : find_outliers
# (I)
import numpy as np
import pandas as pd
from scipy import stats

outlier_threshold = 1.5
ddof = 0

def find_outliers(s: pd.Series):
    outlier_mask = np.abs(stats.zscore(s, ddof=ddof)) > outlier_threshold
    # replace boolean values with corresponding strings
    return ['background-color:blue' if val else '' for val in outlier_mask]
To the dictionary of dataframes dict_of_dfs below
# the dataset
import numpy as np
import pandas as pd
df = {
'col_A':['A_1001', 'A_1001', 'A_1001', 'A_1001', 'B_1002','B_1002','B_1002','B_1002','D_1003','D_1003','D_1003','D_1003'],
'col_X':[110.21, 191.12, 190.21, 12.00, 245.09,4321.8,122.99,122.88,134.28,148.14,161.17,132.17],
'col_Y':[100.22,199.10, 191.13,199.99, 255.19,131.22,144.27,192.21,7005.15,12.02,185.42,198.00],
'col_Z':[140.29, 291.07, 390.22, 245.09, 4122.62,4004.52,395.17,149.19,288.91,123.93,913.17,1434.85]
}
df = pd.DataFrame(df)
df
#dictionary_of_dataframes
#(II)
dict_of_dfs=dict(tuple(df.groupby('col_A')))
and lastly, flag outliers in each df of the dict_of_dfs
# end goal is to have find/flag outliers in each `df` of the `dict_of_dfs`
#(III)
desired_cols = ['col_X','col_Y','col_Z']
dict_of_dfs.style.apply(find_outliers, subset=desired_cols)
In summary, I want to apply (I) to (II) and finally flag the outliers as in (III).
Thanks for your attempt. :)
The desired output should look like this, but for each of the three dataframes.
This may not be exactly what you want, but here is how I'd approach it. You'll have to work out the details of the function, because you have it written to receive a Series rather than a DataFrame. groupby().apply() will send each subset of rows to the function, and you can perform the actions on that subset and return the result.
For consideration:
inside the function you may be able to handle all columns like so:
def find_outliers(x):
    for col in ['col_X', 'col_Y', 'col_Z']:
        outlier_mask = np.abs(stats.zscore(x[col], ddof=ddof)) > outlier_threshold
        x[col] = ['outlier' if val else '' for val in outlier_mask]
    return x
newdf = df.groupby('col_A').apply(find_outliers)
     col_A    col_X    col_Y col_Z
0   A_1001           outlier
1   A_1001
2   A_1001
3   A_1001  outlier
4   B_1002           outlier
5   B_1002  outlier
6   B_1002
7   B_1002
8   D_1003           outlier
9   D_1003
10  D_1003
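If you do want to keep the dictionary of dataframes and the Styler-based highlighting from (I), a minimal sketch (assuming find_outliers, desired_cols and dict_of_dfs as defined above) is to build one Styler per dataframe with a dict comprehension:
# one Styler per group; apply() runs find_outliers column-wise on the chosen columns
styled = {
    key: frame.style.apply(find_outliers, subset=desired_cols)
    for key, frame in dict_of_dfs.items()
}

# in a notebook, display a single styled group, e.g.
# styled['A_1001']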

python pandas dataframe filling e.g. bfill, ffill

I have two problems with filling out a very large dataframe. A section of it is shown in picture 1. I want the 1000 in E and F to be pulled down to 26 and no further. In the same way, I want the 2000 to be pulled up to -1 and down to the next 26. I thought I could do this with bfill and ffill, but unfortunately I don't know how... (picture 1)
Another problem is that sections occur in which the rows from -1 to 26 do not contain any values in E and F at all. How can I delete them or fill them with 0, so that bfill or ffill does not make wrong entries there? (picture 2)
import pandas as pd
import numpy as np
data = '/Users/Hanna/Desktop/Coding/Code.csv'
df_1 = pd.read_csv(data,usecols=["A",
"B",
"C",
"D",
"E",
"F",
],nrows=75)
base_list =[-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]
df_c = pd.MultiIndex.from_product([
[4000074],
["SP000796746","SP001811642"],
[201824, 201828, 201832, 201835, 201837, 201839, 201845, 201850, 201910, 201918, 201922, 201926, 201909, 201916, 201918, 201920],
base_list],
names=["A", "B", "C", "D"]).to_frame(index=False)
df_3 = pd.merge(df_c, df_1, how='outer')
To make it easier to understand, I have shortened the example a bit. Picture 3 shows what it looks like when it is filled, and picture 4 shows it correctly filled.
Assuming you have to find and fill values for a particular segment:
data = pd.read_csv('/Users/Hanna/Desktop/Coding/Code.csv')
for i in range(0, data.shape[0], 27):
    if i + 27 < data.shape[0]:
        data.loc[i:i+27, 'E'] = max(data['E'].iloc[i:i+27])
    else:
        data.loc[i:data.shape[0], 'E'] = max(data['E'].iloc[i:data.shape[0]])
You can replace max with whatever you want.
You could find the indexes where you have -1 and then slice/loop over the columns to fill.
just to create the sample data:
import pandas as pd
df = pd.DataFrame(columns=list('ABE'))
df['A']=list(range(-1, 26)) * 10
add random values at each section
import random
for i in df.index:
    if i % 27 == 0:
        df.loc[i, 'B'] = random.random()
    else:
        df.loc[i, 'B'] = 0
find the indexes to slice over
indx = df[df['A'] == -1].index.values
fill out data in column "E"
for i, j in zip(indx[:-1], indx[1:]):
    df.loc[i:j-1, 'E'] = df.loc[i:j-1, 'B'].max()
    if j == indx[-1]:
        df.loc[j:, 'E'] = df.loc[j:, 'B'].max()
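Coming back to the original bfill/ffill idea, a shorter route (just a sketch, assuming df_3 from the question is sorted so that column D runs from -1 to 26 within each block) is to label the blocks and fill only inside them:
# a new block starts whenever D equals -1
block = (df_3['D'] == -1).cumsum()

# forward- and backward-fill E and F only within each block, so values
# never spill past the next -1 or the previous 26
df_3[['E', 'F']] = (
    df_3.groupby(block)[['E', 'F']]
        .transform(lambda s: s.ffill().bfill())
)

# blocks with no values at all in E and F are still NaN; fill them with 0
df_3[['E', 'F']] = df_3[['E', 'F']].fillna(0)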

A more efficient way to iterate over multiple DataFrames

I am trying to create custom DataFrame that will represent all missing (NaN) values in my data.
The solution I came up with works, but it is slow and inefficient on a set with 300 rows and 50 columns.
Pandas Version = "0.24.2"
import pandas as pd
data = {
'city_code' : ['Sydney2017', 'London2017', 'Sydney2018', 'London2018'],
'population_mil': [5.441, 7.375, pd.np.nan, pd.np.nan]
}
class NaNData:
    def __init__(self, data: dict):
        self.data: dict = data

    @property
    def data_df(self) -> pd.DataFrame:
        """ Returns input data as a DataFrame. """
        return pd.DataFrame(self.data)

    def select_city(self, city_code: str) -> pd.DataFrame:
        """ Creates DataFrame where city_code column value matches
        requested city_code string. """
        df = self.data_df
        return df.loc[df['city_code'] == city_code]

    @property
    def df(self) -> pd.DataFrame:
        """ Creates custom summary DataFrame to represent missing data. """
        data_df = self.data_df
        # There are duplicates in 'city_code' column. Make sure your cities
        # are unique values only.
        all_cities = list(set(data_df['city_code']))
        # Check whether given city has any NaN values in any column.
        has_nan = [
            self.select_city(i).isnull().values.any() for i in all_cities
        ]
        data = {
            'cities': all_cities,
            'has_NaN': has_nan,
        }
        df = pd.DataFrame(data)
        return df
nan_data = NaNData(data)
print(nan_data.df)
# Output:
#        cities  has_NaN
# 0  London2018     True
# 1  London2017    False
# 2  Sydney2018     True
# 3  Sydney2017    False
I feel like the way I approach iteration in pandas is not right. Is there a proper (or common) solution for this kind of problem? Should I somehow be using groupby for these kinds of operations?
Any input is very much appreciated.
Thank you for your time.
You don't need to iterate over multiple dataframes to obtain your result; you can indeed use groupby with apply:
import pandas as pd
data = {
'city_code' : ['Sydney2017', 'London2017', 'Sydney2018', 'London2018'],
'population_mil': [5.441, 7.375, pd.np.nan, pd.np.nan],
'temp': [28, pd.np.nan, 24, 25]
}
df = pd.DataFrame(data)
df.groupby('city_code').apply(lambda x: x.isna().any()).any(axis=1)
I think you can use the isna() function to do the na check:
df = pd.DataFrame(data)
df.assign(has_NaN=df.population_mil.isna()).drop('population_mil',1)
    city_code  has_NaN
0  Sydney2017    False
1  London2017    False
2  Sydney2018     True
3  London2018     True
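If you want the result in the same cities / has_NaN shape that the class produces, here is a small sketch combining both ideas (using the df built from data above):
# one row per city, True if any value in that city's rows is NaN
nan_summary = (
    df.groupby('city_code')
      .apply(lambda g: g.isna().values.any())
      .rename('has_NaN')
      .reset_index()
)
print(nan_summary)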

Creating list of set values from a single value containing multiple value sets under one parenthesis

so I currently have a column containing values like this:
d = {'col1': [LINESTRING(174.76028 -36.80417, 174.76041 -36.80389, 175.76232 -36.82345)]}
df = pd.DataFrame(d)
and I am trying to make it so that I can:
1) apply a function to each of the numerical values and
2) end up with something like this.
d = {'col1': [LINESTRING], 'col2': [(174.76028, -36.80417), (174.76041, -36.80389), (175.76232, -36.82345)]}
df = pd.DataFrame(d)
Any thoughts?
Thanks
Here is one way. Note that LineString accepts an ordered collection of tuples as an input. See the docs for more information.
We use operator.attrgetter to access the required attributes: coords and __class__.__name__.
import pandas as pd
from operator import attrgetter
class LineString():
    def __init__(self, list_of_coords):
        self.coords = list_of_coords

df = pd.DataFrame({'col1': [LineString([(174.76028, -36.80417), (174.76041, -36.80389), (175.76232, -36.82345)])]})
df['col2'] = df['col1'].apply(attrgetter('coords'))
df['col1'] = df['col1'].apply(attrgetter('__class__')).apply(attrgetter('__name__'))
print(df)
col1 col2
0 LineString [(174.76028, -36.80417), (174.76041, -36.80389...
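If col1 actually holds shapely geometries (which is where the LINESTRING(...) notation in the question usually comes from), the same idea should carry over, since shapely exposes coords and geom_type. A sketch, assuming shapely is installed:
import pandas as pd
from shapely.geometry import LineString

line = LineString([(174.76028, -36.80417), (174.76041, -36.80389), (175.76232, -36.82345)])
df = pd.DataFrame({'col1': [line]})

# col2: list of (x, y) tuples; col1: the geometry type name
df['col2'] = df['col1'].apply(lambda geom: list(geom.coords))
df['col1'] = df['col1'].apply(lambda geom: geom.geom_type)
print(df)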
