Converting API output from a dictionary to a dataframe (Python)

I have fed some data into the TravelTime API (https://github.com/traveltime-dev/traveltime-python-sdk), which calculates the time it takes to drive between 2 locations. The result of this (called out) is a dictionary that looks like this:
{'results': [{'search_id': 'arrival_one_to_many',
'locations': [{'id': 'KA8 0EU', 'properties': {'travel_time': 2646}},
{'id': 'KA21 5DT', 'properties': {'travel_time': 392}}],
'unreachable': []}]}
However, I need a table that would look a bit like this:
search_id            id        Travel Time
arrival_one_to_many  KA8 0EU   2646
arrival_one_to_many  KA21 5DT  392
I've tried converting this dictionary to a dataframe using
out_2 = pd.DataFrame.from_dict(out)
This shows as a dataframe with one column called results, so I tried using out_2['results'].str.split(',', expand=True) to split it into multiple columns at the comma delimiters, but that only returned:
     0
0  NaN
Is anyone able to help me get this dictionary into a readable and usable dataframe/table?
Thanks

@MICHAELKM22 Since you are not using all the keys from the dictionary, you won't be able to convert it directly to a DataFrame.
First extract the required keys, then convert them into a DataFrame.
import pandas as pd

df_list = []
for res in data['results']:
    search_id = res['search_id']
    for loc in res['locations']:
        temp_df = {}
        temp_df['search_id'] = search_id
        temp_df['id'] = loc["id"]
        temp_df['travel_time'] = loc["properties"]['travel_time']
        df_list.append(temp_df)
df = pd.DataFrame(df_list)
search_id id travel_time
0 arrival_one_to_many KA8 0EU 2646
1 arrival_one_to_many KA21 5DT 392
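For this particular shape, pd.json_normalize can also build the table in one call. A minimal sketch, assuming the response dictionary is stored in a variable named out as in the question:

import pandas as pd

# out is the dictionary returned by the TravelTime SDK (copied from the question)
out = {'results': [{'search_id': 'arrival_one_to_many',
                    'locations': [{'id': 'KA8 0EU', 'properties': {'travel_time': 2646}},
                                  {'id': 'KA21 5DT', 'properties': {'travel_time': 392}}],
                    'unreachable': []}]}

# record_path walks results -> locations (one row per location);
# meta copies search_id down to every row
df = pd.json_normalize(out['results'], record_path='locations', meta='search_id')
df = df.rename(columns={'properties.travel_time': 'travel_time'})
print(df)
#          id  travel_time            search_id
# 0   KA8 0EU         2646  arrival_one_to_many
# 1  KA21 5DT          392  arrival_one_to_many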

First, this JSON needs to be parsed to fetch the required values. Once those values are fetched, we can store them in a DataFrame.
Below is the code to parse this JSON (PS: I have saved the JSON in a file) and add these values to a DataFrame.
import json
import pandas as pd

with open('Json_file.json') as f:
    req_file = json.load(f)

df = pd.DataFrame()
for i in req_file['results']:
    dict_new = dict()
    dict_new['searchid'] = i['search_id']
    for j in i['locations']:
        dict_new['location_id'] = j['id']
        dict_new['travel_time'] = j['properties']['travel_time']
        # note: DataFrame.append was removed in pandas 2.0; on newer pandas,
        # collect the dicts in a list and build the DataFrame once instead
        df = df.append(dict_new, ignore_index=True)
print(df)
Below is the output of the above code:
searchid location_id travel_time
0 arrival_one_to_many KA8 0EU 2646.0
1 arrival_one_to_many KA21 5DT 392.0

Related

Flatten JSON in Dataframe Column

I have data in a dataframe as seen below (BEFORE).
I am trying to parse/flatten the JSON in the site_Activity column, but I am having no luck.
I have tried some of the methods below as proof that I have tried to solve this on my own.
I have provided a DESIRED AFTER section to highlight how I would expect the data to parse.
Any help is greatly appreciated!
Not working: df = df.explode(column='site_Activity').reset_index(drop=True) (from https://stackoverflow.com/questions/54546279/how-to-normalize-json-string-type-column-of-pandas-dataframe)
Not working: pd.json_normalize(df.site_Activity[0]) (from How to convert JSON data inside a pandas column into new columns)
BEFORE
id   site_Activity
123  [{"action_time":"2022-07-05T01:53:59.000000Z","time_spent":12,"url":"cool.stuff.io/advanced"},{"action_time":"2022-07-05T00:10:20.000000Z","time_spent":0,"url":"cool.stuff.io/advanced1"},{"action_time":"2022-07-04T23:45:39.000000Z","time_spent":0,"url":"cool.stuff.io"}]
456  [{"action_time":"2022-07-04T23:00:23.000000Z","time_spent":0,"url":"cool.stuff.io/awesome"}]

DESIRED AFTER
id   action_time                   time_spent  url
123  2022-07-05T01:53:59.000000Z   12          cool.stuff.io/advanced
123  2022-07-05T00:10:20.000000Z   0           cool.stuff.io/advanced1
123  2022-07-04T23:45:39.000000Z   0           cool.stuff.io
456  2022-07-04T23:00:23.000000Z   0           cool.stuff.io/awesome
You can:
use .apply(json.loads) to transform the JSON column into a list/dict column;
use df.explode to transform the list of dicts into a Series of dicts;
use .apply(pd.Series) to 'explode' the Series of dicts into a DataFrame;
use pd.concat to 'merge' the new columns to the rest of the data.
Then it comes to:
import io
import pandas as pd
import json
TESTDATA="""id site_Activity
123 [{"action_time":"2022-07-05T01:53:59.000000Z","time_spent":12,"url":"cool.stuff.io/advanced"},{"action_time":"2022-07-05T00:10:20.000000Z","time_spent":0,"url":"cool.stuff.io/advanced1"},{"action_time":"2022-07-04T23:45:39.000000Z","time_spent":0,"url":"cool.stuff.io"}]
456 [{"action_time":"2022-07-04T23:00:23.000000Z","time_spent":0,"url":"cool.stuff.io/awesome"}]
"""
df = pd.read_csv(io.StringIO(TESTDATA), sep="\t")
df["site_Activity"] = df["site_Activity"].apply(json.loads)
df = df.explode("site_Activity")
df = pd.concat([df[["id"]], df["site_Activity"].apply(pd.Series)], axis=1)
print(df)
# result
id action_time time_spent url
0 123 2022-07-05T01:53:59.000000Z 12 cool.stuff.io/advanced
0 123 2022-07-05T00:10:20.000000Z 0 cool.stuff.io/advanced1
0 123 2022-07-04T23:45:39.000000Z 0 cool.stuff.io
1 456 2022-07-04T23:00:23.000000Z 0 cool.stuff.io/awesome
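The repeated index values (0, 0, 0, 1) are left over from explode, which keeps each original row's index; add a final df = df.reset_index(drop=True) if you want a clean 0..n-1 index.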
You can use this:
df = pd.json_normalize(json_data_var, 'site_Activity', ['id'])
'site_Activity' without the square brackets is like the 'many' part; 'id' in the square brackets is like the 'one'.
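A self-contained sketch of that call, assuming the records are already parsed Python dicts (a list named json_data_var here, mirroring the BEFORE table) rather than JSON strings inside a dataframe column:

import pandas as pd

# hypothetical pre-parsed records matching the BEFORE table
json_data_var = [
    {"id": 123, "site_Activity": [
        {"action_time": "2022-07-05T01:53:59.000000Z", "time_spent": 12, "url": "cool.stuff.io/advanced"},
        {"action_time": "2022-07-05T00:10:20.000000Z", "time_spent": 0, "url": "cool.stuff.io/advanced1"},
        {"action_time": "2022-07-04T23:45:39.000000Z", "time_spent": 0, "url": "cool.stuff.io"}]},
    {"id": 456, "site_Activity": [
        {"action_time": "2022-07-04T23:00:23.000000Z", "time_spent": 0, "url": "cool.stuff.io/awesome"}]},
]

# 'site_Activity' is the record_path (the "many"); 'id' is the meta (the "one")
df = pd.json_normalize(json_data_var, 'site_Activity', ['id'])
print(df)   # columns: action_time, time_spent, url, id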
Take this example:
import requests
import pandas as pd

url = 'https://someURL.com/burgers'
headers = {
    'X-RapidAPI-Key': 'someKey',
    'X-RapidAPI-Host': 'someHost.someAPI.com'
}
response = requests.request('GET', url, headers=headers)
api_data = response.json()
df = pd.json_normalize(api_data)
print(df)
This now gives me a dataframe where the 'ingredients' and 'addresses' columns contain nested data.
I can then focus on addresses and normalise that:
df = pd.json_normalize(api_data, 'addresses', ['name', 'restaurant', 'web', 'description'])

Parse JSON in a Pandas DataFrame

I have some data in a pandas DataFrame, but one of the columns contains multi-line JSON. I am trying to parse that JSON out into a separate DataFrame along with the CustomerId. Here you will see my DataFrame...
df
Out[1]:
Id object
CustomerId object
CallInfo object
Within the CallInfo column, the data looks like this...
[{"CallDate":"2021-06-21","CallLength":362},{"CallDate":"2021-06-24","CallLength":402}]
I want to create a new DataFrame called df_norm which contains the CustomerId, CallDate, and CallLength.
I have tried several ways but couldn't find a working solution. Can anyone help me with this?
Mock up code example...
import pandas as pd
import json
Id = [1, 2, 3]
CustomerId = [700001, 700002, 700003]
CallInfo = ['[{"CallDate":"2021-06-21","CallLength":362},{"CallDate":"2021-06-24","CallLength":402}]', '[{"CallDate":"2021-07-09","CallLength":102}]', '[{"CallDate":"2021-07-11","CallLength":226},{"CallDate":"2021-07-11","CallLength":216}]']
# Reconstruct sample DataFrame
df = pd.DataFrame({
    "Id": Id,
    "CustomerId": CustomerId,
    "CallInfo": CallInfo
})
print(df)
This should work. Create a new list of rows and then toss that into the pd.DataFrame constructor:
new_rows = [{
    'Id': row['Id'],
    'CustomerId': row['CustomerId'],
    'CallDate': item['CallDate'],
    'CallLength': item['CallLength']}
    for _, row in df.iterrows() for item in json.loads(row['CallInfo'])]
new_df = pd.DataFrame(new_rows)
print(new_df)
EDIT: to account for None values in CallInfo column:
new_rows = []
for _, row in df.iterrows():
    if row['CallInfo'] is not None:  # Or additional checks, e.g. == "" or something...
        for item in json.loads(row['CallInfo']):
            new_rows.append({
                'Id': row['Id'],
                'CustomerId': row['CustomerId'],
                'CallDate': item['CallDate'],
                'CallLength': item['CallLength']})
    else:
        # still emit one row so customers without call info are kept
        new_rows.append({
            'Id': row['Id'],
            'CustomerId': row['CustomerId'],
            'CallDate': None,
            'CallLength': None})
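An alternative sketch that avoids the explicit Python loop, using explode plus json_normalize on the mock DataFrame from the question (this assumes CallInfo holds JSON strings, as in the mock-up):

import json
import pandas as pd

tmp = df[['CustomerId', 'CallInfo']].copy()
tmp['CallInfo'] = tmp['CallInfo'].apply(json.loads)   # JSON string -> list of dicts
tmp = tmp.explode('CallInfo', ignore_index=True)      # one row per call
df_norm = pd.concat(
    [tmp[['CustomerId']], pd.json_normalize(tmp['CallInfo'].tolist())],
    axis=1)
print(df_norm)   # CustomerId, CallDate, CallLength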

Python, how to create a table from JSON data - indexing

I am trying to create a table from JSON data. I have already used json.dumps for my data.
This is what I am trying to export to the table:
label3 = json.dumps({'class': CLASSES[idx],"confidence": str(round(confidence * 100, 1)) + "%","startX": str(startX),"startY": str(startY),"EndX": str(endX),"EndY": str(endY),"Timestamp": now.strftime("%d/%m/%Y, %H:%M")})
I have tried with:
val1 = json.loads(label3)
df = pd.DataFrame(val1)
print(df.T)
The system gives me an error that I must pass an index.
And also with:
val = ast.literal_eval(label3)
val1 = json.loads(json.dumps(val))
print(val1)
val2 = val1["class"][0]["confidence"][0]["startX"][0]["startY"][0]["endX"][0]["endY"][0]["Timestamp"][0]
df = pd.DataFrame(data=val2, columns=["class", "confidence", "startX", "startY", "EndX", "EndY", "Timestamp"])
print(df)
When I try this, the error it gives is that string indices must be integers.
How can I create the index?
Thank you,
There are two ways we can tackle this issue.
Do as directed by the error and pass an index to the DataFrame constructor:
pd.DataFrame(val1, index=list(range(number_of_rows)))  # number_of_rows is 1 in your case
Or, while dumping the data with json.dumps, dump a dictionary that maps each key to a list of values instead of to a single value. For example:
json.dumps({'class': [CLASSES[idx]], "confidence": ["some confidence"]})
I have shortened your example. Notice that the values are passed as lists, even if there is only one value per key.
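A minimal sketch of both options, using stand-in values for label3 (in the question the actual class, confidence, and coordinates come from the detection loop):

import json
import pandas as pd

# stand-in values; in the question these come from CLASSES[idx], confidence, etc.
label3 = json.dumps({'class': 'dog', 'confidence': '97.1%',
                     'startX': '10', 'startY': '20',
                     'EndX': '110', 'EndY': '220',
                     'Timestamp': '01/01/2024, 12:00'})
val1 = json.loads(label3)

# Option 1: scalar values, so an explicit index is required
df1 = pd.DataFrame(val1, index=[0])

# Option 2: wrap each value in a list, so no index is needed
df2 = pd.DataFrame({k: [v] for k, v in val1.items()})

print(df1)
print(df2)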

Optimize row access and transformation in pyspark

I have a large dataset (5 GB) of JSON in an S3 bucket.
I need to transform the schema of the data and write the transformed data back to S3 using an ETL script.
So I use a crawler to detect the schema and load the data into a pyspark dataframe, then change the schema. Now I iterate over every row in the dataframe and convert it to a dictionary, remove the null columns, convert the dictionary to a string, and write it back to S3. Following is the code:
import json
import boto3

# df is the pyspark dataframe
columns = df.columns
print(columns)
s3 = boto3.resource('s3')
cnt = 1
for row in df.rdd.toLocalIterator():
    data = row.asDict(True)
    for col_name in columns:
        if data[col_name] is None:
            del data[col_name]
    content = json.dumps(data)
    object = s3.Object('write-test-transaction-transformed', str(cnt)).put(Body=content)
    cnt = cnt + 1
    print(cnt)
I have used toLocalIterator.
Does the above code execute serially? If yes, how can I optimize it? Is there a better approach for executing the above logic?
Assuming each row in the dataset is a JSON string:
import json
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def drop_null_cols(data):
    content = json.loads(data)
    for key, value in list(content.items()):
        if value is None:
            del content[key]
    return json.dumps(content)

drop_null_cols_udf = F.udf(drop_null_cols, StringType())

df = spark.createDataFrame(
    ["{\"name\":\"Ranga\", \"age\":25, \"city\":\"Hyderabad\"}",
     "{\"name\":\"John\", \"age\":null, \"city\":\"New York\"}",
     "{\"name\":null, \"age\":31, \"city\":\"London\"}"],
    "string"
).toDF("data")

df.select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)
If the input dataframe already has the columns and the output only needs the non-null columns as JSON:
df = spark.createDataFrame(
    [('Ranga', 25, 'Hyderabad'),
     ('John', None, 'New York'),
     (None, 31, 'London')],
    ['name', 'age', 'city']
)

df.withColumn(
    "data", F.to_json(F.struct([x for x in df.columns]))
).select(
    drop_null_cols_udf("data").alias("data")
).show(10, False)

# df.write.format("csv").save("s3://path/to/file/")  -- save to s3
which results in:
+-------------------------------------------------+
|data |
+-------------------------------------------------+
|{"name": "Ranga", "age": 25, "city": "Hyderabad"}|
|{"name": "John", "city": "New York"} |
|{"age": 31, "city": "London"} |
+-------------------------------------------------+
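To answer the serial-execution part of the question: toLocalIterator pulls rows back to the driver, so both the loop and the S3 puts run on a single machine. A rough sketch of a parallel alternative using foreachPartition, assuming boto3 is available on the workers and keeping the bucket name from the question (UUID keys replace the global counter, since partitions run concurrently):

import json
import uuid
import boto3

def write_partition(rows):
    # one S3 client per partition, created on the worker
    s3 = boto3.resource('s3')
    for row in rows:
        data = {k: v for k, v in row.asDict(True).items() if v is not None}
        s3.Object('write-test-transaction-transformed', str(uuid.uuid4())).put(
            Body=json.dumps(data))

df.rdd.foreachPartition(write_partition)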
I'll follow the below approach (written in Scala, but it can be implemented in Python with minimal change):
Find the dataset count and name it totalCount:
val totalcount = inputDF.count()
Find count(col) for all the dataframe columns and get a map of fields to their counts.
Here the count is computed for every column of the input dataframe.
Please note that count(anyCol) returns the number of rows for which the supplied column is non-null. For example, if a column has 10 rows and 5 of its values are null, then count(column) is 5.
Fetch the first row as a Map[colName, count(colName)], referred to as fieldToCount:
val cols = inputDF.columns.map { inputCol =>
  functions.count(col(inputCol)).as(inputCol)
}
// Returns the number of rows for which the supplied column is non-null.
// count(null) returns 0
val row = inputDF.select(cols: _*).head()
val fieldToCount = row.getValuesMap[Long](inputDF.columns)
Get the columns to be removed.
Using the map created in step 2, mark any column whose count is less than totalCount as a column to be removed.
Then select all the columns whose count == totalCount from the input dataframe and save the processed output dataframe anywhere, in any format, as required.
Please note that this approach will remove every column having at least one null value.
val fieldToBool = fieldToCount.mapValues(_ < totalcount)
val processedDF = inputDF.select(fieldToBool.filterNot(_._2).keys.toSeq.map(col): _*)
// save this processedDF anywhere in any format as per requirement
I believe this approach will perform better than the approach you currently have.
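A rough pyspark translation of the same idea, assuming the input dataframe is called inputDF as above:

from pyspark.sql import functions as F

total_count = inputDF.count()

# count(col) ignores nulls, so any column containing a null has count < total_count
counts = inputDF.select(
    [F.count(F.col(c)).alias(c) for c in inputDF.columns]
).head().asDict()

keep_cols = [c for c, n in counts.items() if n == total_count]
processedDF = inputDF.select(*keep_cols)
# save processedDF anywhere, in any format, as required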
I solved the above problem.
We can simply query the dataframe for null values, e.g. df = df.filter(df.column.isNotNull()), thereby removing all rows where a null is present.
So if there are n columns, we need 2^n queries to filter out all possible combinations. In my case there were 10 columns, so 1024 queries in total, which is acceptable as SQL queries are parallelized.

Optimize parsing file with JSON objects in pandas dataframe, where keys may be missing in some rows

I'm looking to optimize the code below which takes ~5 seconds, which is too slow for a file of only 1000 lines.
I have a large file where each line contains valid JSON, with each JSON looking like the following (the actual data is much larger and nested, so I use this JSON snippet for illustration):
{"location":{"town":"Rome","groupe":"Advanced",
"school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
"id":"145",
"Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
"Father":{"FatherName":"Peter","FatherAge":"51"},
"Teacher":["MrCrock","MrDaniel"],"Field":"Marketing",
"season":["summer","spring"]}
I need to parse this file in order to extract only some key-values from every JSON, to obtain the resulting dataframe:
Groupe Id MotherName FatherName
Advanced 56 Laure James
Middle 11 Ann Nicolas
Advanced 6 Helen Franc
But some keys I need in the dataframe are missing from some JSON objects, so I have to verify whether each key is present and, if not, fill the corresponding value with null. I use the following method:
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open(path/to/file) as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
I need to optimize the runtime over the whole 1000-row file to <= 2 seconds. In PERL the same parsing function takes < 1 second, but I need to implement it in Python.
You'll get the best performance if you can build the dataframe in a single step during initialization. DataFrame.from_records takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can parse the data faster with get, which supplies a default value when the item isn't found. I created an empty dict called dummy to pass for intermediate gets, so that a chained get always works.
I created a 1000-record dataset, and on my crappy laptop the time went from 18 seconds to .06 seconds. That's pretty good.
import numpy as np
import pandas as pd
import json
import time

def extract_data(data):
    """Convert one JSON line to a record tuple for import."""
    dummy = {}
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
df = pd.DataFrame.from_records(map(extract_data, open('file.json')),
                               columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time() - start)
#
# The original way
#
start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
print('original', time.time() - start)
The key part is not to append each row to the dataframe in the loop. You want to keep the collection in a list or dict container and then concatenate all of them at once. You can also simplify your if/else structure with a simple get that returns a default value (e.g. np.nan) if the item is not found in the dictionary.
with open(path/to/file) as f:
    d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))
df = pd.DataFrame(d)
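For completeness, pd.json_normalize can also handle the missing keys on its own, filling them with NaN. A minimal sketch assuming the same one-JSON-object-per-line file, here called file.json, and that each of the four keys appears in at least one line (otherwise the column selection below would fail):

import json
import pandas as pd

with open('file.json') as f:
    records = [json.loads(line) for line in f]

# json_normalize flattens the nested dicts; missing keys simply become NaN
flat = pd.json_normalize(records)
df = flat[['location.groupe', 'id', 'Mother.MotherName', 'Father.FatherName']]
df.columns = ['groupe', 'id', 'MotherName', 'FatherName']
print(df)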
