Pandera: Is cell based dataframe data validation possible?

Every row of my dataframe contains a record with a unique key combination. The data validation will be based on the columns and on the key combination. For example, in a single column, cells may have a different min/max requirement depending on the key combination.
Several questions:
Can Pandera validate on a cell basis as opposed to a column basis?
Does Pandera have a schema generator capable of this type of flexibility? Perhaps it could scan a "golden dataframe" as a starting place to create a schema based on some provided criteria. I realize the schema generator output may need a bit of tweaking.
The library does look cool, and I am interested in pursuing it further.
thanks

So you can create a validator that validates a single value at a time with the element_wise=True kwarg; you can read more in the pandera documentation.
import pandera as pa

# element_wise=True applies the check function to each value individually
check = pa.Check(lambda x: 0 <= x <= 100, element_wise=True)
The function must take an individual value as input and output a boolean.
Can you elaborate on the exact check that you want to perform? If you want to do a dataframe-level row-wise check you can use an element-wise check at the dataframe-level as a wide check.
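For the key-dependent min/max requirement described in the question, a wide check could look roughly like the sketch below. This is a minimal sketch, not confirmed against your data: the column names key and value and the bounds mapping are assumptions.
import pandas as pd
import pandera as pa

# Assumed mapping from key combination to (min, max) bounds.
bounds = {"A": (0, 10), "B": (50, 100)}

def value_within_key_bounds(row: pd.Series) -> bool:
    # With element_wise=True at the dataframe level, the check receives one row (a Series) at a time.
    lo, hi = bounds[row["key"]]
    return lo <= row["value"] <= hi

schema = pa.DataFrameSchema(
    columns={
        "key": pa.Column(str),
        "value": pa.Column(float),
    },
    checks=pa.Check(value_within_key_bounds, element_wise=True),
)

df = pd.DataFrame({"key": ["A", "B"], "value": [5.0, 75.0]})
schema.validate(df)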
Does Pandera have a schema generator capable of this type of flexibility? Perhaps it could scan a "golden dataframe" as a starting place to create a schema based on some provided criteria. I realize the schema generator output may need a bit of tweaking.
You can use the schema = pandera.infer_schema(golden_dataframe) function to bootstrap a starter schema, then write it out to a file with schema.to_script("path/to/file") to further iterate.
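As a rough illustration of that workflow (the golden dataframe and the output path here are made up):
import pandas as pd
import pandera as pa

# Stand-in for a trusted, representative "golden" dataframe.
golden_dataframe = pd.DataFrame({"key": ["A", "B"], "value": [5.0, 75.0]})

# Bootstrap a starter schema from the data, then write it out as an editable Python script.
schema = pa.infer_schema(golden_dataframe)
schema.to_script("inferred_schema.py")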

Related

Pandas df.loc with regex

I'm working with a data set consisting of several csv files of nearly the same form. Each csv describes a particular date, and labels the data by state/province. However, the format of one of the column headers in the data set was altered from Province/State to Province_State, so that all csv's created before a certain date use the first format and all csv's created after that date use the second format.
I'm trying to sum up all the entries corresponding to a particular state. At present, the code I'm working with is as follows:
daily_data.loc[daily_data[areaLabel] == location].sum()
where daily_data is the dataframe containing the csv data, location is the name of the state I'm looking for, and areaLabel is a variable storing either 'Province/State' or 'Province_State' depending on the result of a date check. I would like to eliminate the date check by e.g. conditioning on a regular expression like Province(/|_)State, but I'm having a lot of trouble finding a way to index into a pandas dataframe by regular expression. Is this doable (and in a way that would make the code more elegant rather than less)? If so, I'd appreciate it if someone could point me in the right direction.
Use filter to get the columns that match your regex
>>> df.filter(regex="Province(/|_)State").columns[0]
'Province/State'
Then use this to select only rows that match your location:
df[df[df.filter(regex="Province(/|_)State").columns[0]]==location].sum()
This however assumes that there are no other columns that would match the regex.
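If the one-liner gets hard to read, the same idea can be split into two steps; daily_data and location come from the question, and the intermediate variable name is just illustrative:
# Find whichever header variant this particular csv uses.
state_col = daily_data.filter(regex="Province(/|_)State").columns[0]

# Sum all entries for the requested state/province.
totals = daily_data.loc[daily_data[state_col] == location].sum()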

Extract pandas dataframe column names from query string

I have a dataset with a lot of fields, so I don't want to load all of it into a pd.DataFrame, but just the basic ones.
Sometimes, I would like to do some filtering upon loading, and I would like to apply the filter via the query or eval methods, which means that I need a query string of the form, e.g., "PROBABILITY > 10 and DISTANCE <= 50", but these columns need to be loaded into the dataframe.
Is it possible to extract the column names from the query string in order to load them from the dataset?
I know some magic using regex is possible, but I'm sure that it would break sooner or later, as the conditions get complicated.
So, I'm asking if there is a native pandas way to extract the column names from the query string.
I think you can use the usecols parameter when you load your dataframe. I use it when I load a CSV; I don't know whether that is possible when you use SQL or another format.
import pandas as pd

columns_to_use = ['Column1', 'Column3']
pd.read_csv(..., usecols=columns_to_use)
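For example, if you want to apply the query string from the question, the combination might look like this (the file name is a placeholder, and the column names are taken from the question):
import pandas as pd

# Load only the columns the query refers to, then filter.
cols = ['PROBABILITY', 'DISTANCE']
df = pd.read_csv('data.csv', usecols=cols)
filtered = df.query('PROBABILITY > 10 and DISTANCE <= 50')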
Thank you

Pyspark, find all the data types of a column

I am doing some data cleaning and data profiling work. So the given data could be quite messy.
I want to get all the potential data types of a column using pyspark.
Data types like:
Integer
Real number
Date/Time
String (Text)
etc
I will need to do more processing to generate the respective metadata based on what types the column has.
A column can contain more than one type. I don't mean the built-in data types. The given data are all of type string, but some are in the form of "1234" which is actually an int, and some are in the form of "2019/11/19", which is actually a date.
For example, the column number could contain values like
"123"
"123.456"
"123,456.789"
"NUMBER 123"
In the above example, the data types would be INTEGER, REAL NUMBER, STRING.
If I use df.schema[col].dataType, it simply gives me StringType.
I was thinking that I could iterate through the entire column and use regex to see which type each row belongs to, but I am curious if there is a better way to do it, since it's a relatively large dataset.
For now I have kind of solved the issue by iterating through the column and doing some type checking:
df = spark.sql('SELECT col as _col FROM table GROUP BY _col')
df.rdd.map(lambda s: typeChecker(s))
where in typeChecker I just check which type s._col belongs to.
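A hypothetical typeChecker along these lines might look like the sketch below; the regex patterns and type labels are illustrative, not something fixed by the data:
import re

INT_RE = re.compile(r'^-?\d+$')
REAL_RE = re.compile(r'^-?[\d,]*\.\d+$')
DATE_RE = re.compile(r'^\d{4}[/-]\d{1,2}[/-]\d{1,2}$')

def typeChecker(row):
    # Classify the raw string in _col into a coarse type label.
    value = (row._col or '').strip()
    if INT_RE.match(value):
        return 'INTEGER'
    if REAL_RE.match(value):
        return 'REAL NUMBER'
    if DATE_RE.match(value):
        return 'DATE'
    return 'STRING'

# Collect the set of types observed in the column.
types_present = set(df.rdd.map(typeChecker).collect())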
Thanks.

is there a way to convert an rdd to df ignoring lines that don't fit the schema?

I have a large file of JSON lines (can't post any, sorry for that) that vary in the number of keys, and also in the datatypes of values for identical keys. Is there a way to put all the lines that follow a schema (hard-coded or inferred) into a dataframe and leave all the lines that don't fit the schema in an RDD?
Eventually, I would like to iterate through such a process and get a couple of df's each one with its own schema at the end.
here is a close to reality example:
a = [['aaa', 'bbb', 'ccc']]*22
b =[['aaa', 'bbb', 'ccc', 'ddd']]*22
rdd_1 = sc.parallelize(a+b)
rdd_1.toDF().show(30)
this fails with:
Caused by: java.lang.IllegalStateException: Input row doesn't have
expected number of values required by the schema. 3 fields are
required while 4 values are provided.
In this specific case I could form a function that adds null in case of less than max fields, but I'm after a more generic try and except method that could tackle nested data with unpredictable schema changes.
any ideas would be very much appreciated.
Instead of working with an RDD, you could load this file into a dataframe (if it's present in stable storage like HDFS or Amazon S3, of course) with mode = PERMISSIVE. First prepare a generic schema of yours to work with. The code is the following:
df = sqlContext.read.schema(<your-schema>).option("mode", "PERMISSIVE").json(<file-path>)
The Spark documentation says:
PERMISSIVE : sets other fields to null when it meets a corrupted record and puts the malformed string into a new field configured by spark.sql.columnNameOfCorruptRecord. When a schema is set by user, it sets null for extra fields.
DROPMALFORMED : ignores the whole corrupted records.
FAILFAST: throws an exception when it meets corrupted records.
Find details in the Spark documentation for the JSON data source.
Hope this helps.
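As a rough sketch of how the good/bad split the question asks for could sit on top of PERMISSIVE mode (the schema, field names, corrupt-record column name and file path below are all illustrative assumptions):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Illustrative schema with an explicit corrupt-record column.
schema = StructType([
    StructField('field1', StringType(), True),
    StructField('field2', StringType(), True),
    StructField('field3', StringType(), True),
    StructField('_corrupt_record', StringType(), True),
])

df = (spark.read.schema(schema)
      .option('mode', 'PERMISSIVE')
      .option('columnNameOfCorruptRecord', '_corrupt_record')
      .json('path/to/file.json'))

# Some Spark versions require caching before filtering on the corrupt-record column.
df.cache()

good = df.filter(df['_corrupt_record'].isNull()).drop('_corrupt_record')  # rows that fit the schema
bad = df.filter(df['_corrupt_record'].isNotNull())                        # everything else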

Creating a Cross Tab Query in SQL Alchemy

I was doing some reading on Google and in the SQLAlchemy documentation but could not find any kind of built-in functionality that could take a standard SQL-formatted table and transform it into a cross tab query like Microsoft Access does.
In the past, when using Excel and Microsoft Access, I have created "cross tab" queries. Below is the SQL code from an example:
TRANSFORM Min([Fixed Day-19_Month-8_142040].VoltageAPhase) AS MinOfVoltageAPhase
SELECT [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
FROM [Fixed Day-19_Month-8_142040]
GROUP BY [Fixed Day-19_Month-8_142040].Substation, [Fixed Day-19_Month-8_142040].Feeder, [Fixed Day-19_Month-8_142040].MeterID
PIVOT [Fixed Day-19_Month-8_142040].Date;
I am very unskilled when it comes to SQL, and the only way I was able to write this was by generating it in Access.
My question is: since SQLAlchemy Python code is really just a nice way of calling or generating SQL code using Python functions/methods, is there a way I could use SQLAlchemy to call a custom query that generates the SQL code (in the block above) to make a cross tab query? Obviously, I would have to change some of the SQL code to shoehorn it in with the correct fields and names, but the keywords should be the same, right?
The other problem is that, in addition to returning the objects for each entry in the table, I would need the field names. I think this is called "metadata"? The end goal is that once I have that information, I want to output to Excel or CSV using another package.
UPDATED
Okay, so Van's suggestion to use pandas I think is the way to go, I'm currently in the process of figuring out how to create the cross tab:
def OnCSVfile(self, event):
    query = session.query(Exception).filter_by(company=self.company)
    data_frame = pandas.read_sql(query.statement, query.session.bind)  ## Get data frame in pandas
    pivot = data_frame.crosstab()
So I have been reading the pandas link you provided and have a question about the parameters.
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True)
Since I'm calling "crosstab" off the dataframe object, I assume there must be some kind of built-in way the dataframe recognizes column and row names. For index, would I pass in a list of strings that specify which fields I want tabulated in rows? For columns, would I pass in a list of strings that specify which field I want along the columns? From what I know about cross tab queries, there should only be one specification field for the column, right? For values, I want the minimum function, so I would have to pass some parameter to return the minimum value. I'm currently searching for an answer.
So suppose I have the following fields in my flat data frame (my original SQL query):
Name, Date and Rank
And I want to pivot the data as follows:
Name = Row of Crosstab
Date = Column of Crosstab
Rank = Min Value of Crosstab
Would the function call be something like:
data_frame.crosstab(['Name'], ['Date'], values=['Rank'],aggfunc = min)
I tried this code below:
query = session.query(Exception)
data_frame = pandas.read_sql(query.statement,query.session.bind)
row_list = pandas.Series(['meter_form'])
col_list = pandas.Series(['company'])
print row_list
pivot = data_frame.crosstab(row_list,col_list)
But I get this error about data_frame not having the attribute crosstab:
I guess this might be too much new information for you at once. Nonetheless, I would approach it completely differently. I would basically use the pandas Python library to do all the tasks:
Retrieve the data: since you are using sqlalchemy already, you can simply query the database for only the data you need (flat, without any CROSSTAB/PIVOT).
Transform: put it into a pandas.DataFrame. For example, like this:
import pandas as pd
query = session.query(FixedDay...)
df = pd.read_sql(query.statement, query.session.bind)
Pivot: call pivot = pd.crosstab(...) to create a pivot in memory; note that crosstab is a module-level pandas function, not a DataFrame method. See pd.crosstab for more information.
Export: save it to Excel/CSV using DataFrame.to_excel or DataFrame.to_csv.
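Putting those steps together for the Name/Date/Rank layout described in the question, a minimal sketch might look like this (the sample data is made up; aggfunc='min' matches the minimum-value requirement):
import pandas as pd

# Stand-in for the flat result of the SQL query.
df = pd.DataFrame({
    'Name': ['A', 'A', 'B'],
    'Date': ['2015-01-01', '2015-01-02', '2015-01-01'],
    'Rank': [3, 1, 2],
})

# crosstab is called as pd.crosstab(...), not as a DataFrame method.
pivot = pd.crosstab(index=df['Name'], columns=df['Date'],
                    values=df['Rank'], aggfunc='min')

pivot.to_excel('crosstab.xlsx')  # or pivot.to_csv('crosstab.csv')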
