Using Python Great Expectations to remove invalid data

Using Python Great Expectations to remove invalid data - python

I just started with Great Expectations library and I want to know if it is possible to use it to remove invalidated data from Pandas DataFrame. And how I can do that if is possible ?
Also I want to insert invalid data to PostgreSQL database.
I didn't find anything about this in the documentation and on searching the Web.
Later Edit :
To clarify: I need that in the case great expectation for example find 5 rows in a DataFrame that are invalid (for example df.expect_column_values_to_not_be_null('age') has 5 rows with null) to remove them from original DataFrame and insert them in a PostgreSQL errors table

Great Expectations is a powerful tool to validate data.
Like all powerful tools, it's not that straightforward.
You can start from here:
import great_expectations as ge
import numpy as np
import pandas as pd
# get some random numbers and create a pandas df
df_raw = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
# initialize a "great_expectations" df
df = ge.from_pandas(df_raw)
# search for invalidate data on column 'A'.
# In this case, i'm looking for any null value from column 'A'.
df.expect_column_values_to_not_be_null('A')
Results:
{
"exception_info": null,
"expectation_config": {
"expectation_type": "expect_column_values_to_not_be_null",
"kwargs": {
"column": "A",
"result_format": "BASIC"
},
"meta": {}
},
"meta": {},
"success": true,
"result": {
"element_count": 100,
"unexpected_count": 0,
"unexpected_percent": 0.0,
"partial_unexpected_list": []
}
}
Look at the response : good news !!!
There aren't null values in my df.
"unexpected_count" is equal to 0
API Reference :
https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/index.html
EDIT:
If you need simply to find some invalid values and split your df into:
Clean Dataframe
Dirty Dataframe
maybe you dont need "great_expectations". you can use a function like this:
import pandas as pd
my_df = pd.DataFrame({'A': [1,2,1,2,3,0,1,1,5,2]})
def check_data_quality(dataframe):
df = dataframe
clean_df = df[df['A'].isin([1, 2])]
dirty_df = df[df["A"].isin([1, 2]) == False]
return {'clean': clean_df,
'dirty': dirty_df}
my_df_clean = check_data_quality(my_df)['clean']
my_df_dirty = check_data_quality(my_df)['dirty']

Related

Unable to pull the key:value from dictionary in column to multiple columns

So I was using the solution in this post (Split / Explode a column of dictionaries into separate columns with pandas) but nothing changes in my df.
Here is df before code:
number status_timestamps
0 234234 {"created": "2020-11-30T19:44:42Z", "complete"...
1 2342 {"created": "2020-12-14T13:43:48Z", "complete"...
Here is a sample of the dictionary in that column:
{"created": "2020-11-30T19:44:42Z",
"complete": "2021-01-17T14:20:58Z",
"invoiced": "2020-12-16T22:55:02Z",
"confirmed": "2020-11-30T21:16:48Z",
"in_production": "2020-12-11T18:59:26Z",
"invoice_needed": "2020-12-11T22:00:09Z",
"accepted": "2020-12-01T00:00:23Z",
"assets_uploaded": "2020-12-11T17:16:53Z",
"notified": "2020-11-30T21:17:48Z",
"processing": "2020-12-11T18:49:50Z",
"classified": "2020-12-11T18:49:50Z"}
Here is what I tried and df does not change:
df_final = pd.concat([df, df['status_timestamps'].progress_apply(pd.Series)], axis = 1).drop('status_timestamps', axis = 1)
Here is what happens in a notebook:

Please provide a minimal reproducible working example of what you have tried next time.
If I follow the solution in the mentioned post, it works.
This is the code I have used:
import pandas as pd
json_data = {"created": "2020-11-30T19:44:42Z",
"complete": "2021-01-17T14:20:58Z",
"invoiced": "2020-12-16T22:55:02Z",
"confirmed": "2020-11-30T21:16:48Z",
"in_production": "2020-12-11T18:59:26Z",
"invoice_needed": "2020-12-11T22:00:09Z",
"accepted": "2020-12-01T00:00:23Z",
"assets_uploaded": "2020-12-11T17:16:53Z",
"notified": "2020-11-30T21:17:48Z",
"processing": "2020-12-11T18:49:50Z",
"classified": "2020-12-11T18:49:50Z"}
df = pd.DataFrame({"number": 2342, "status_timestamps": [json_data]})
# fastest solution proposed by your reference post
df.join(pd.DataFrame(df.pop('status_timestamps').values.tolist()))

I was able to use another answer from that post but change to a safer option of literal_eval since it was using eval
Here is working code:
import pandas as pd
from ast import literal_eval
df = pd.read_csv('c:/status_timestamps.csv')
df["status_timestamps"] = df["status_timestamps"].apply(lambda x : dict(literal_eval(x)) )
df2 = df["status_timestamps"].apply(pd.Series )
df_final = pd.concat([df, df2], axis=1).drop('status_timestamps', axis=1)
df_final

Saving pandas dataframes in formats including data descriptions or other self documenting strings in schemas? JSON?

I need to share well described data and want to do this in a modern way that avoids managing bureaucratic documentation no one will read. Fields require some description or note (eg. "values don't include ABC because XYZ") which I'd like to associate to columns that'll be saved with pd.to_<whatever>(), but I don't know of such functionality in pandas.
The format can't present security concerns, and should have a practical compromise between data integrity, performance, and file size. Looks like JSON without index might suit.
JSON documentation writes of schema annotations, which supports pairing keywords like description with strings, but I can't figure out how to use this with options described in pandas to_json documentation.
Example df:
df = pd.DataFrame({"numbers": [6, 2],"strings": ["foo", "whatever"]})
df.to_json('temptest.json', orient='table', indent=4, index=False)
We can edit the JSON to include description:
"schema":{
"fields":[
{
"name":"numbers",
"description": "example string",
"type":"integer"
},
...
We can then df = pd.read_json("temptest.json", orient='table') but descriptions seem ignored and are lost upon saving.
The only other answer I found saves separate dicts and dfs into a single JSON, but I couldn't replicate this without "ValueError: Trailing data". I need something less cumbersome and error prone, and files requiring custom instructions on how to open them aren't appropriate.
How can we can work with and save brief data descriptions with JSON or another format?

Would the following rough sketch be something you could live with:
Step 1: Create json structure out of df like you did
df.to_json('temp.json', orient='table', indent=4, index=False)
Step 2: Add the column description to the so produced json-file as you already did (could be done easily in a structured/programatically manner):
{
"schema":{
"fields":[
{
"name":"numbers",
"type":"integer",
"description": "example number"
},
{
"name":"strings",
"type":"string",
"description": "example string"
}
],
"pandas_version":"0.20.0"
},
"data":[...]
}
One way to do that would be to write a little function that uses Pandas .to_json as a base output and then adds the desired descriptions to the Pandas json-dump (this is step 1 & 2 together):
import json
def to_json_annotated(filepath, df, annotations):
df_json = json.loads(df.to_json(orient='table', index=False))
for field in df_json['schema']['fields']:
field['description'] = annotations.get(field['name'], None)
with open(filepath, 'w') as file:
json.dump(df_json, file)
As in the example above:
annotations = {'numbers': 'example number',
'strings': 'example string'}
to_json_annotated('temp.json', df, annotations)
Step 3: Reading the information back into Pandas-format:
import json
with open('temp.json', 'r') as file:
json_df = json.load(file)
df_data = pd.json_normalize(json_df, 'data') # pd.read_json('temp.json', orient='table') works too
df_meta = pd.json_normalize(json_df, ['schema', 'fields'])
with the results:
df_data:
numbers strings
0 6 foo
1 2 whatever
df_meta:
name type description
0 numbers integer example number
1 strings string example string

I didn't see that there will be such an option but I think you can just add a description inside of each variable:
schema = {'first':{'Variable':'x',"description": "example string","value": 2},"second":{"Variable":"y","description": "example string","value": 3}}
It creates a table:
first second
Variable x y
description example string example string
value 2 3

How to update a pandas dataframe, from multiple API calls

I need to do a python script to
Read a csv file with the columns (person_id, name, flag). The file has 3000 rows.
Based on the person_id from the csv file, I need to call a URL passing the person_id to do a GET
http://api.myendpoint.intranet/get-data/1234
The URL will return some information of the person_id, like example below. I need to get all rents objects and save on my csv. My output needs to be like this
import pandas as pd
import requests
ids = pd.read_csv(f"{path}/data.csv", delimiter=';')
person_rents = df = pd.DataFrame([], columns=list('person_id','carId','price','rentStatus'))
for id in ids:
response = request.get(f'endpoint/{id["person_id"]}')
json = response.json()
person_rents.append( [person_id, rent['carId'], rent['price'], rent['rentStatus'] ] )
pd.read_csv(f"{path}/data.csv", delimiter=';' )
person_id;name;flag;cardId;price;rentStatus
1000;Joseph;1;6638;1000;active
1000;Joseph;1;5566;2000;active
Response example
{
"active": false,
"ctodx": false,
"rents": [{
"carId": 6638,
"price": 1000,
"rentStatus": "active"
}, {
"carId": 5566,
"price": 2000,
"rentStatus": "active"
}
],
"responseCode": "OK",
"status": [{
"request": 345,
"requestStatus": "F"
}, {
"requestId": 678,
"requestStatus": "P"
}
],
"transaction": false
}
After save the additional data from response on csv, i need to get data from another endpoint using the carId on the URL. The mileage result must be save in the same csv.
http://api.myendpoint.intranet/get-mileage/6638
http://api.myendpoint.intranet/get-mileage/5566
The return for each call will be like this
{"mileage":1000.0000}
{"mileage":550.0000}
The final output must be
person_id;name;flag;cardId;price;rentStatus;mileage
1000;Joseph;1;6638;1000;active;1000.0000
1000;Joseph;1;5566;2000;active;550.0000
SOmeone can help me with this script?
Could be with pandas or any python 3 lib.

Code Explanation
Create dataframe, df, with pd.read_csv.
It is expected that all of the values in 'person_id', are unique.
Use .apply on 'person_id', to call prepare_data.
prepare_data expects 'person_id' to be a str or int, as indicated by the type annotation, Union[int, str]
Call the API, which will return a dict, to the prepare_data function.
Convert the 'rents' key, of the dict, into a dataframe, with pd.json_normalize.
Use .apply on 'carId', to call the API, and extract the 'mileage', which is added to dataframe data, as a column.
Add 'person_id' to data, which can be used to merge df with s.
Convert pd.Series, s to a dataframe, with pd.concat, and then merge df and s, on person_id.
Save to a csv with pd.to_csv in the desired form.
Potential Issues
If there's an issue, it's most likely to occur in the call_api function.
As long as call_api returns a dict, like the response shown in the question, the remainder of the code will work correctly to produce the desired output.
import pandas as pd
import requests
import json
from typing import Union
def call_api(url: str) -> dict:
r = requests.get(url)
return r.json()
def prepare_data(uid: Union[int, str]) -> pd.DataFrame:
d_url = f'http://api.myendpoint.intranet/get-data/{uid}'
m_url = 'http://api.myendpoint.intranet/get-mileage/'
# get the rent data from the api call
rents = call_api(d_url)['rents']
# normalize rents into a dataframe
data = pd.json_normalize(rents)
# get the mileage data from the api call and add it to data as a column
data['mileage'] = data.carId.apply(lambda cid: call_api(f'{m_url}{cid}')['mileage'])
# add person_id as a column to data, which will be used to merge data to df
data['person_id'] = uid
return data
# read data from file
df = pd.read_csv('file.csv', sep=';')
# call prepare_data
s = df.person_id.apply(prepare_data)
# s is a Series of DataFrames, which can be combined with pd.concat
s = pd.concat([v for v in s])
# join df with s, on person_id
df = df.merge(s, on='person_id')
# save to csv
df.to_csv('output.csv', sep=';', index=False)
If there are any errors when running this code:
Leave a comment, to let me know.
edit your question, and paste the entire TraceBack, as text, into a code block.
Example
# given the following start dataframe
person_id name flag
0 1000 Joseph 1
1 400 Sam 1
# resulting dataframe using the same data for both id 1000 and 400
person_id name flag carId price rentStatus mileage
0 1000 Joseph 1 6638 1000 active 1000.0
1 1000 Joseph 1 5566 2000 active 1000.0
2 400 Sam 1 6638 1000 active 1000.0
3 400 Sam 1 5566 2000 active 1000.0

There are many different ways to implement this. One of them would be, like you started in your comment:
read the CSV file with pandas
for each line take the person_id and build a call
the delivered JSON response can then be taken from the rents
the carId is then extracted for each individual rental
finally this is collected in a row_list
the row_list is then converted back to csv via pandas
A very simple solution without any error handling could look something like this:
from types import SimpleNamespace
import pandas as pd
import requests
import json
path = '/some/path/'
df = pd.read_csv(f'{path}/data.csv', delimiter=';')
rows_list = []
for _, row in df.iterrows():
rentCall = f'http://api.myendpoint.intranet/get-data/{row.person_id}'
print(rentCall)
response = requests.get(rentCall)
r = json.loads(response.text, object_hook=lambda d: SimpleNamespace(**d))
for rent in r.rents:
mileageCall = f'http://api.myendpoint.intranet/get-mileage/{rent.carId}'
print(mileageCall)
response2 = requests.get(mileageCall)
m = json.loads(response2.text, object_hook=lambda d: SimpleNamespace(**d))
state = "active" if r.active else "inactive"
rows_list.append((row['person_id'], row['name'], row['flag'], rent.carId, rent.price, state, m.mileage))
df = pd.DataFrame(rows_list, columns=('person_id', 'name', 'flag', 'carId', 'price', 'rentStatus', 'mileage'))
print(df.to_csv(index=False, sep=';'))

Speeding up with multiprocessing
You mention that you have 3000 rows, which means that you'll have to make a lot of API calls. Depending on the connection, every one of these calls might take a while. As a result, performing this in a sequential way might be too slow. The majority of the time, your program will just be waiting on a response from the server without doing anything else.
We can improve this performance by using multiprocessing.
I use all the code from Trenton his answer, but I replace the following sequential call:
# call prepare_data
s = df.person_id.apply(prepare_data)
With a parallel alternative:
from multiprocessing import Pool
n_processes=20 # Experiment with this to see what works well
with Pool(n_processes) as p:
s=p.map(prepare_data, df.person_id)
Alternatively, a threadpool might be faster, but you'll have to test that by replacing the import with
from multiprocessing.pool import ThreadPool as Pool.

Elastic-Search scroll(scan) into Pandas DataFrame

I need to get a lot of data from Elasticsearch (es), so I'm using the scan command which is a wrap-up for the native es scroll command.
As a result I will get the following generator Object: <generator object scan at 0x000001BF5A25E518>. Farther more, I'd like to insert all the data into a Pandas DataFrame object so I can easily process it.
Code goes as follows:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan as escan
import pandas as pd
es = Elasticsearch(dpl_server, verify_certs=False)
body = {
"size": 1000,
"query": {
"match_all": {}
}
}
response = escan(client=es,
index="index-*,
query=body, request_timeout=30, size=1000)
print(response)
#<generator object scan at 0x000001BF5A25E518>
What I want to do is putting all the results in Pandas DataFrame. If I print each element in the generator as follows:
for res in response:
print(res['_source'])
# { .... }
# { .... }
# { .... }
I will get a lot of dictionaries. A naive solution of mine so far is to add them 1 by 1 like so:
df = None
for res in response:
if (df is None):
df = pd.DataFrame([res['_source']])
else:
df = pd.concat([df, pd.DataFrame([res['_source']])], sort=True)
I wish to know if there's a better way in doing so (first, in terms of speed, second, in terms of clean code). For instance, would it be better to accumulate all the results from the generator into a list and then build a complete DataFrame ?

You can use panda's json_normalize.
from pandas.io.json import json_normalize
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan as escan
import pandas as pd
es = Elasticsearch(dpl_server, verify_certs=False)
body = {
"size": 1000,
"query": {
"match_all": {}
}
}
response = escan(client=es,
index="index",
query=body, request_timeout=30, size=1000)
# Initialize a double ended queue
output_all = deque()
# Extend deque with iterator
output_all.extend(response)
# Convert deque to DataFrame
output_df = json_normalize(output_all)
Here you can find more info on the double ended queue.

Easiest way to get API data into str/int format python

I have read through many articles and posts to connect to an api then format it into int/str however I did mange to make possibly the longest winded way ever its real ugly please could someone show me the shortest most efficient way to accomplish the below code any suggestions would be greatly appreciated bassically looking to print out "eos" in str format and "price" as int Thanks!
import urllib
import json
import pandas as pd
import numpy as np
import requests
r = requests.get('https://api.coinmarketcap.com/v1/ticker/eos/')
with open('events.csv','w') as fd:
fd.write(r.text)
data = pd.read_csv('events.csv', names=['Choose One'])
i = data.iloc[[6], [0]]
a = str(i)
name,price = a.split(":")
string = price[2:-1]
print(string)

It's simpler to just use pandas read_json to read the file into a data frame, read_json will automatically assign the apt datatype to each column, then use column selection to select 'name','price_usd' columns (of-course in this case there is only one row, but the same code can be used with multiple rows)
i.e.
import pandas as pd
df = pd.read_json('https://api.coinmarketcap.com/v1/ticker/eos/')
print(df[['name','price_usd']].apply(lambda row:'{}: {:.0f}'.format(ro
w['name'],row['price_usd']),axis=1))
using .0f in the format statement will display the integer part (rounded) of the price_usd value so the output will be.
0 EOS: 9
alternatively using the round function will round the float values
i.e.
In [34]: import pandas as pd
...: df = pd.read_json('https://api.coinmarketcap.com/v1/ticker/eos/')
...: print(df[['name','price_usd']].apply(lambda row:'{}: {:}'.format(row['n
...: ame'],round(row['price_usd'],2)),axis=1))
...:
...:
0 EOS: 8.99
dtype: object

Simply use json.loads(r.text) or much easier directly r.json().
Say, right now the api returns the following data:
[
{
"id": "eos",
"name": "EOS",
"symbol": "EOS",
"rank": "9",
"price_usd": "9.31992",
"price_btc": "0.00106154",
"24h_volume_usd": "596467000.0",
"market_cap_usd": "6034993504.0",
"available_supply": "647537050.0",
"total_supply": "900000000.0",
"max_supply": "1000000000.0",
"percent_change_1h": "1.3",
"percent_change_24h": "-6.81",
"percent_change_7d": "-36.4",
"last_updated": "1517755757"
}
]
If you use r.json(), you get this as a json, otherwise load it with data = json.loads(r.text) and save it to a pandas DataFrame with df = pd.DataFrame(data) which then looks like the following:
In [15]: df
Out[15]:
24h_volume_usd available_supply id last_updated market_cap_usd max_supply name percent_change_1h percent_change_24h percent_change_7d price_btc price_usd rank symbol total_supply
0 596467000.0 647537050.0 eos 1517755757 6034993504.0 1000000000.0 EOS 1.3 -6.81 -36.4 0.00106154 9.31992 9 EOS 900000000.0
Access the data with pandas indexing:
In [8]: df[['name', 'price_usd']]
Out[8]:
name price_usd
0 EOS 9.29186
Or for printing:
In [18]: print df.loc[0, 'name'], ': ', df.loc[0, 'price_usd']
EOS : 9.31992

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Using Python Great Expectations to remove invalid data - python

Related

Unable to pull the key:value from dictionary in column to multiple columns

Saving pandas dataframes in formats including data descriptions or other self documenting strings in schemas? JSON?

How to update a pandas dataframe, from multiple API calls

Elastic-Search scroll(scan) into Pandas DataFrame

Easiest way to get API data into str/int format python

Categories

Resources