I can validate a DataFrame index using the DataFrameSchema like this:
import pandas as pd
import pandera as pa
from pandera import Column, DataFrameSchema, Check, Index

schema = DataFrameSchema(
    columns={
        "column1": pa.Column(int),
    },
    index=pa.Index(int, name="index_name"),
)

# raises the error as expected
schema.validate(
    pd.DataFrame({"column1": [1, 2, 3]},
                 index=pd.Index([1, 2, 3], name="index_incorrect_name"))
)
Is there a way to do the same using a SchemaModel?
You can do it as follows:
import pandas as pd
import pandera as pa
from pandera.typing import Index, Series

class Schema(pa.SchemaModel):
    idx: Index[int] = pa.Field(ge=0, check_name=True)
    column1: Series[int]

df = pd.DataFrame({"column1": [1, 2, 3]},
                  index=pd.Index([1, 2, 3], name="index_incorrect_name"))
Schema.validate(df)
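With check_name=True, Schema.validate(df) raises a SchemaError here because the index is named "index_incorrect_name" rather than "idx", which mirrors the DataFrameSchema behaviour from your example.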
Found an answer on GitHub:
You can use pa.typing.Index to type-annotate an index.
class Schema(pa.SchemaModel):
    column1: pa.typing.Series[int]
    index_name: pa.typing.Index[int] = pa.Field(check_name=True)
See how you can validate a MultiIndex: https://pandera.readthedocs.io/en/stable/schema_models.html#multiindex
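For reference, the pattern from that page: annotating more than one field with Index turns the schema's index into a MultiIndex. A minimal sketch (the field names and bounds are illustrative, not taken from the docs):

import pandas as pd
import pandera as pa
from pandera.typing import Index, Series

class MultiIndexSchema(pa.SchemaModel):
    # each Index-annotated field becomes one level of the MultiIndex
    year: Index[int] = pa.Field(ge=2000, check_name=True)
    month: Index[int] = pa.Field(ge=1, le=12, check_name=True)
    passengers: Series[int]

index = pd.MultiIndex.from_arrays(
    [[2021, 2021], [1, 2]],
    names=["year", "month"],
)
MultiIndexSchema.validate(
    pd.DataFrame({"passengers": [61000, 50000]}, index=index)
)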
I'm new to tsfresh. When I use the following lines, I get the extracted features as desired:
import numpy as np
import pandas as pd
from tsfresh.feature_extraction import ComprehensiveFCParameters
from tsfresh import extract_features
df = pd.DataFrame(np.array([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]),
                  columns=['Context ID', 'Time Elapsed', 'time_serie A', 'time_serie B'])
settings = ComprehensiveFCParameters()
kind_to_fc_parameters = {
    "time_serie A": {},
    "time_serie B": {"mean": None}
}
# assigning to a different name avoids shadowing the imported function
extracted_features = extract_features(df, kind_to_fc_parameters=kind_to_fc_parameters,
                                      column_id='Context ID', column_sort="Time Elapsed")
extracted_features
However, when I replace {"mean": None} with {"absolute_maximum": None} or {"count_above": [{"t": 0.05}]}, it won't work anymore:
module 'tsfresh.feature_extraction.feature_calculators' has no
attribute 'absolute_maximum'
What am I missing?
I just had a similar issue with another calculator I chose and found that it's simply not in feature_calculators.py (you can open it from yourdirectory\Python\Python37\Lib\site-packages\tsfresh\feature_extraction). So I ran pip install tsfresh -U in the terminal to get the latest tsfresh, checked feature_calculators.py again, and my desired function was there; the code then ran fine.
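As a quick sanity check before re-running the extraction (assuming a standard pip install), you can verify which version you have and whether it ships the calculator:

import tsfresh
from tsfresh.feature_extraction import feature_calculators

print(tsfresh.__version__)
# absolute_maximum only exists in the newer releases
print(hasattr(feature_calculators, "absolute_maximum"))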
If the city has been mentioned in cities_specific I would like to create a flag in the cities_all data. It's just a minimal example and in reality I would like to create multiple of these flags based on multiple data frames. That's why I tried to solve it with isin instead of a join.
However, I am running into ValueError: Length of values (3) does not match length of index (7).
# import packages
import pandas as pd
import numpy as np

# create minimal data
cities_specific = pd.DataFrame({'city': ['Melbourne', 'Cairns', 'Sydney'],
                                'n': [10, 4, 8]})
cities_all = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
                                    'Berlin', 'Sydney'],
                           'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000]})

# get value error
# how can this be solved differently?
cities_all.assign(in_cities_specific=np.where(cities_specific.city.isin(cities_all.city), '1', '0'))

# that's the solution I would like to get
expected_solution = pd.DataFrame({'city': ['Vancouver', 'Melbourne', 'Athen', 'Vienna', 'Cairns',
                                           'Berlin', 'Sydney'],
                                  'inhabitants': [675218, 5000000, 664046, 1897000, 150041, 3769000, 5312000],
                                  'in_cities': [0, 1, 0, 0, 1, 0, 1]})
I think you have swapped the two sides of the condition: you need to check which cities_all cities appear in cities_specific, not the other way around.
Here you have some alternatives:
cities_all.assign(
    in_cities_specific=np.where(cities_all.city.isin(cities_specific.city), '1', '0')
)
or
cities_all["in_cities_specific"] = (
    cities_all["city"].isin(cities_specific["city"]).astype(int).astype(str)
)
or
condlist = [cities_all["city"].isin(cities_specific["city"])]
choicelist = ["1"]
cities_all["in_cities_specific"] = np.select(condlist, choicelist, default="0")
I have a data frame that was obtained by grouping an initial data frame by the 'hour' and 'site' columns, so it holds 'value' grouped per 'hour' and 'site'. What I want is to fill every hour that has no 'value' with zero. The 'hour' range is 0-23. How can I do this?
Left is input, right is expected output
You can try this:
import numpy as np
import pandas as pd
raw_df = pd.DataFrame(
    {
        "Hour": [1, 2, 4, 12, 0, 2, 7, 13],
        "Site": ["x", "x", "x", "x", "y", "y", "y", "y"],
        "Value": [1, 1, 1, 1, 1, 1, 1, 1],
    }
)
# build the full grid: all 24 hours for every site
full_hour = pd.DataFrame(
    {
        "Hour": np.concatenate(
            [range(24) for site_name in raw_df["Site"].unique()]
        ),
        "Site": np.concatenate(
            [[site_name] * 24 for site_name in raw_df["Site"].unique()]
        ),
    }
)
# left-join the observed values onto the grid; hours with no value become 0
result = full_hour.merge(raw_df, on=["Hour", "Site"], how="left").fillna(0)
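An equivalent variant (my own, not part of the answer above) builds the full grid with pd.MultiIndex.from_product and lets reindex fill the gaps:

# build every (Site, Hour) combination as a MultiIndex
full_index = pd.MultiIndex.from_product(
    [raw_df["Site"].unique(), range(24)], names=["Site", "Hour"]
)
# reindex inserts the missing combinations with Value = 0
result = (
    raw_df.set_index(["Site", "Hour"])
          .reindex(full_index, fill_value=0)
          .reset_index()
)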
Then you get what you want. But I suggest you paste your test data into the question as text rather than as an image: others should not have to recreate your data, and it makes it much easier for them to answer.
If you want to set the values in a column to zero wherever they fall outside the range 0-23, here is what to do. I didn't fully understand your question, so I assume this is what you want; I have used a dummy example since you did not provide your own data.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011', '12/2/2011',
                            '13/2/2011', '14/2/2011'],
                   'Product': ['Umbrella', 'Matress', 'Badminton', 'Shuttle', 'ewf'],
                   'Last_Price': [1200, 1500, 1600, 352, 'ee'],
                   'Updated_Price': [12, 24, 0, 1, np.nan],
                   'Discount': [10, 10, 10, 10, 11]})
# replace missing values with 0, then zero out anything above 23
df['Updated_Price'] = df['Updated_Price'].fillna(0)
df.loc[df['Updated_Price'] > 23, 'Updated_Price'] = 0
This replaces all NaN values with 0, and also replaces values greater than 23 with 0.
I have a dataframe, and I am trying to find percentiles of its datetimes with stats.percentileofscore.
Dataframe:
student, attempts, time
student 1, 14, 9/3/2019 12:32:32 AM
student 2,2, 9/3/2019 9:37:14 PM
student 3, 5
student 4, 16, 9/5/2019 8:58:14 PM
student2Info = [14, 4, Timestamp('2019-09-04 00:26:36')]
data['time'] = pd.to_datetime(data['time_0001'], errors='coerce')
perc1_first = stats.percentileofscore(data['time'].notnull(), student2Info[2], 'rank')
where student2Info[2] holds the datetime for a particular student. When I try and do this I get the error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Any ideas on how I can get the percentile to calculate correctly even when there are missing times in the columns?
You need to transform the Timestamps into units that percentileofscore can understand. Also, pd.DataFrame.notnull() returns a boolean mask that you may use to filter your DataFrame; it does not return the filtered rows themselves, so I've updated that for you. Here is a working example:
import pandas as pd
import scipy.stats as stats
data = pd.DataFrame.from_dict({
    "student": [1, 2, 3, 4],
    "attempts": [14, 2, 5, 16],
    "time_0001": [
        "9/3/2019 12:32:32 AM",
        "9/3/2019 9:37:14 PM",
        "",
        "9/5/2019 8:58:14 PM"
    ]
})
student2Info = [14, 4, pd.Timestamp('2019-09-04 00:26:36')]
data['time'] = pd.to_datetime(data['time_0001'], errors='coerce')
# keep only the rows with a valid time, convert each Timestamp to its day
# ordinal, and score the student's ordinal against that distribution
perc1_first = stats.percentileofscore(
    data[data['time'].notnull()].time.transform(pd.Timestamp.toordinal),
    student2Info[2].toordinal(),
    'rank'
)
print(perc1_first)  # -> 66.66666666666667
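One caveat: toordinal has day resolution, so timestamps within the same day compare as equal. If you need sub-day precision, a variant (my suggestion, not part of the answer above) is to compare nanosecond integers instead:

# datetime64[ns] values cast to int64 are nanoseconds since the epoch,
# and Timestamp.value is the same quantity for a scalar
valid_times = data.loc[data['time'].notnull(), 'time']
perc1_first = stats.percentileofscore(
    valid_times.astype('int64'), student2Info[2].value, 'rank'
)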
import pandas
from scipy.sparse import coo_matrix

data = pandas.read_table(r"D:\Machine Learning SW\MusicRec\lastfm-dataset-360K\usersha1-artmbid-artname-plays.tsv",
                         usecols=[0, 2, 3],
                         names=['user', 'artist', 'plays'])

# map each artist and user to a unique numeric value
data['user'] = data['user'].astype("category")
data['artist'] = data['artist'].astype("category")

# create a sparse matrix of all the artist/user/play triples
plays = coo_matrix((data['plays'].astype(float),
                    (data['artist'].cat.codes,
                     data['user'].cat.codes)))
The dtype of the data is object. How do I typecast it to category?
If the values in your dataset are of type object, try the dtype=object option when you read the file:
data = pandas.read_table("your_file.tsv", usecols=[0, 2, 3],
                         names=['user', 'artist', 'plays'], dtype=object)
And if it's only for a particular column:
data = pandas.read_table("your_file.tsv", usecols=[0, 2, 3],
                         names=['user', 'artist', 'plays'], dtype={col_name: object})
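Note that reading everything as object only keeps the raw strings. If the goal is actually category dtype, pandas also lets you request it directly at read time (a sketch using the same placeholder file name):

data = pandas.read_table("your_file.tsv", usecols=[0, 2, 3],
                         names=['user', 'artist', 'plays'],
                         dtype={'user': 'category', 'artist': 'category'})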