Extracting JSON data into a relational table - python

I have a JSON file which resulted from YouTube's iframe API and needs to be preprocessed. I want to put this JSON data into a pandas dataframe, where each JSON key will be a column, and each recorded "event" should be a new row.
I was able to load the data as a dataframe using read_json, but the keys of each event end up together in a single column instead of in separate columns.
Here is what my JSON data looks like:
{
"events":[
{
"timemillis":1563467463580,
"date":"18.7.2019",
"time":"18:31:03,580",
"name":"Player is loading",
"data":""
},
{
"timemillis":1563467463668,
"date":"18.7.2019",
"time":"18:31:03,668",
"name":"Player is loaded",
"data":"5"
}
]
}
And this is what I did to convert it to a dataframe:
data=pd.read_json("file.json")
df=pd.DataFrame(data)
print(df)
The output looks like this:
0 {'timemillis': 1563469276604, 'date': '18.7.20...
1 {'timemillis': 1563469276694, 'date': '18.7.20...
...
How can I convert this output into a table with separate columns for the keys, such as 'timemillis', 'date', 'name' and so on? I have never worked with JSON before, so I am a bit confused.

import pandas as pd
import json
data = {
"events":[
{
"timemillis":1563467463580,
"date":"18.7.2019",
"time":"18:31:03,580",
"name":"Player is loading",
"data":""
},
{
"timemillis":1563467463668,
"date":"18.7.2019",
"time":"18:31:03,668",
"name":"Player is loaded",
"data":"5"
}
]
}
# Or read the data from a file. Parse it with the json module first
# rather than reading the file directly into a pandas dataframe:
# data = pd.read_json("file.json")
with open('file.json') as json_file:
    data = json.load(json_file)
# the list under 'events' becomes one row per event
df = pd.DataFrame(data['events'])
print(df)
Result:
  data       date               name          time     timemillis
0       18.7.2019  Player is loading  18:31:03,580  1563467463580
1    5  18.7.2019   Player is loaded  18:31:03,668  1563467463668
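An alternative to indexing data['events'] by hand is pd.json_normalize, which is available as a top-level function in pandas 1.0 and later. The following is only a sketch: it embeds the sample data from the question instead of reading file.json, and the timestamp column name is an assumption added for illustration.

```python
import pandas as pd

# Same shape as the JSON in the question (values copied from it).
data = {
    "events": [
        {"timemillis": 1563467463580, "date": "18.7.2019",
         "time": "18:31:03,580", "name": "Player is loading", "data": ""},
        {"timemillis": 1563467463668, "date": "18.7.2019",
         "time": "18:31:03,668", "name": "Player is loaded", "data": "5"},
    ]
}

# record_path names the key that holds the list of row dictionaries.
df = pd.json_normalize(data, record_path="events")

# Optional: interpret the millisecond epoch as a timestamp.
# "timestamp" is a hypothetical column name, not part of the original data.
df["timestamp"] = pd.to_datetime(df["timemillis"], unit="ms")

print(df[["timestamp", "name", "data"]])
```

For a flat list like this one the result matches pd.DataFrame(data['events']); json_normalize mainly pays off when the events themselves contain nested dictionaries.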

import pandas as pd
df = pd.read_json("file.json", orient='columns')
rows = []
for i, r in df.iterrows():
    # each cell of the 'events' column holds one event dictionary
    rows.append({'eventid': i + 1, 'timemillis': r['events']['timemillis'], 'name': r['events']['name']})
df = pd.DataFrame(rows)
print(df)
Now you can insert this df into a database table.
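As a rough illustration of that last step, the sketch below writes the dataframe to a relational table with DataFrame.to_sql. It assumes an in-memory SQLite database and a table called events; both are choices made for this example, not something the question specifies.

```python
import sqlite3

import pandas as pd

# Rows in the shape produced by the loop above (values from the question).
df = pd.DataFrame([
    {"eventid": 1, "timemillis": 1563467463580, "name": "Player is loading"},
    {"eventid": 2, "timemillis": 1563467463668, "name": "Player is loaded"},
])

# In-memory SQLite database; "events" is a hypothetical table name.
conn = sqlite3.connect(":memory:")
df.to_sql("events", conn, if_exists="replace", index=False)

# Read the rows back to confirm that they were stored.
stored = pd.read_sql_query(
    "SELECT eventid, name FROM events ORDER BY eventid", conn)
print(stored)
conn.close()
```

For databases other than SQLite, to_sql expects a SQLAlchemy connectable instead of a raw DBAPI connection.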

Related

How can I write a Pandas Dataframe containing a dictionary to Bigquery?

I have JSON data from API which contain page number called 'offset' and 'items' which is a nested JSON like below:
{
"data": {
"offset": 0,
"pageSize": 20,
"items": [
{
"id": "6biewd5a",
"title": "AAAAAAAAAAAAAA"
},
{
"id": "er45ggg",
"title": "BBBBBBBBBBBBBBBB"
}
]
}
}
I am creating a dataframe to write to bigquery.
Here is my code.
import requests
from requests.auth import HTTPBasicAuth
import json
from google.cloud import bigquery
import pandas
import pandas_gbq
URL = 'xxxxxxxxxxxxxxxxxxxx'
auth = HTTPBasicAuth('name', 'password')
r = requests.get(url=URL, auth=auth)
# Extracting data in JSON format
data = json.loads(json.dumps(r.json()))
# data = r.json()
print(type(data))
offset = str(data['data']['offset'])
json_data = data['data']['items']
type(json_data)
df = pandas.DataFrame(
    {
        'offset': offset,
        'json_data': json_data
    }
)
# df['json_data'] = df['json_data'].astype('string')
client = bigquery.Client(project='ncau-data-newsquery-sit')
table_id = 'sdm_adpoint.testfapi1'
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("offset", "STRING"),
bigquery.SchemaField("json_data", "STRING", "REPEATED")
# , index=[0]
],
autodetect=False
)
df.head()
pandas_gbq.to_gbq(df, table_id, project_id='ncau-data-newsquery-sit', if_exists='append')
I built a dataframe from this, which looks like the one below:
[Image: Dataframe]
But when I write to bigquery using
pandas_gbq.to_gbq(df, table_id, project_id='ncau-data-newsquery-sit', if_exists='append')
It throws the below error:
ArrowTypeError: Could not convert {'id': '6biewd5a',......
To overcome this, I tried this:
df['json_data'] = df['json_data'].astype('string')
It’s working, but it’s merging all rows as one, and writing each letter in a separate row.
I want data to be written as a row, the same way as displayed in a Pandas dataframe.
Your JSON schema is very similar to the schema given here in a Google Cloud Platform example:
Example schema
So you will have to make the below change to your schema definition in your code.
For the items schema, you will have to define the schema as below:
bigquery.SchemaField(
    "items",
    "RECORD",
    mode="REPEATED",
    fields=[
        bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
        bigquery.SchemaField("title", "STRING", mode="NULLABLE"),
    ],
)
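If a nested RECORD column is not required, another option is to flatten the items so that each one becomes its own row with plain string columns, which avoids handing dictionaries to the Arrow conversion. This is only a sketch under that assumption: it uses the sample response from the question, the variable name flat is made up, and the upload call is left as a comment because it needs real credentials.

```python
import pandas as pd

# Sample API response from the question.
data = {
    "data": {
        "offset": 0,
        "pageSize": 20,
        "items": [
            {"id": "6biewd5a", "title": "AAAAAAAAAAAAAA"},
            {"id": "er45ggg", "title": "BBBBBBBBBBBBBBBB"},
        ],
    }
}

# One row per item; the page offset is repeated on every row via meta.
flat = pd.json_normalize(data["data"], record_path="items", meta=["offset"])

# Store the offset as a string, as in the question's code.
flat["offset"] = flat["offset"].astype(str)
print(flat)

# Not executed here: the dataframe now holds only scalar columns, so it
# could be passed to the same call as in the question, e.g.
# pandas_gbq.to_gbq(flat, table_id, project_id='ncau-data-newsquery-sit', if_exists='append')
```

The destination table would then need id, title and offset columns instead of a single json_data column.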

Python nested dict. return deepest dict to csv

I'm trying to return some values from a nested dict (based on a json) to a csv without success due to the following structure.
{
"http_method":"GET",
"results":{
"FTKMOB21xxxxD":{
"serial_number":"FTKMOB21xxxxD",
"comments":"",
"q_type":432,
"license":"EFTM123123123",
"type":"mobile",
"user":"pippo",
"user_type":"user",
"drift":0,
"status":{
"name":"activated"
}
},
"FTKMOB21xxxxF":{
"serial_number":"FTKMOB21xxxxF",
"comments":"",
"q_type":432,
"license":"EFTM123123123",
"type":"mobile",
"drift":0,
"status":{
"name":"pending"
}
}
},
"vdom":"root",
"path":"user",
"name":"fortitoken",
"action":"",
"status":"success",
"serial":"FGT_VM",
"version":"v7.0.5",
"build":304
}
What I need to return in a csv are fields "serial_number", "user", "status".
The FTKMOB21xxxxD key changes for each device, so I need to treat it as a dynamic value. I suppose a loop based on its position is needed.
Could you please help me understand how to do that?
It's straightforward with pandas (input_dict is the parsed JSON shown above):
import pandas as pd
df = pd.DataFrame(input_dict['results'])
df.T[["serial_number", "user", "status"]].to_csv('output.csv', index=False)
Your csv will then look like:
serial_number,user,status
FTKMOB21xxxxD,pippo,{'name': 'activated'}
FTKMOB21xxxxF,,{'name': 'pending'}
Edit: if you actually want status/name as status, you have to reassign df['status']:
df = pd.DataFrame.from_dict(input_dict['results'], orient='index', columns=["serial_number", "user", "status"])
df['status'] = pd.DataFrame(df['status'].to_list())['name'].to_list()
df.to_csv('output.csv', index=False)
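Another way to pull out the nested status name is pd.json_normalize (pandas 1.0 or later), which expands nested dictionaries into dotted column names. The sketch below trims the input to the fields needed and builds the CSV as a string instead of writing output.csv; input_dict stands for the parsed JSON from the question.

```python
import pandas as pd

# Trimmed version of the structure in the question.
input_dict = {
    "results": {
        "FTKMOB21xxxxD": {
            "serial_number": "FTKMOB21xxxxD",
            "user": "pippo",
            "status": {"name": "activated"},
        },
        "FTKMOB21xxxxF": {
            "serial_number": "FTKMOB21xxxxF",
            "status": {"name": "pending"},
        },
    }
}

# The device keys are dynamic, so take the values and ignore the keys.
records = list(input_dict["results"].values())

# The nested "status" dictionary becomes a "status.name" column.
df = pd.json_normalize(records)
out = df[["serial_number", "user", "status.name"]].rename(
    columns={"status.name": "status"})

csv_text = out.to_csv(index=False)
print(csv_text)
```

Passing a file path to to_csv instead of omitting it writes the same content to disk.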

convert a CSV file to JSON file

I am trying to convert CSV file to JSON file based on a column value. The csv file looks somewhat like this.
ID Name Age
CSE001 John 18
CSE002 Marie 20
ECE001 Josh 22
ECE002 Peter 23
Currently I am using the following code to obtain the JSON file.
import csv
import json
def csv_to_json(csv_file_path, json_file_path):
    data_dict = {}
    with open(csv_file_path, encoding = 'utf-8') as csv_file_handler:
        csv_reader = csv.DictReader(csv_file_handler)
        for rows in csv_reader:
            # key each row by its ID column
            key = rows['ID']
            data_dict[key] = rows
    with open(json_file_path, 'w', encoding = 'utf-8') as json_file_handler:
        json_file_handler.write(json.dumps(data_dict, indent = 4))
OUTPUT:
{
"CSE001":{
"ID":"CSE001",
"Name":"John",
"Age":18
},
"CSE002":{
"ID":"CSE002",
"Name":"Marie",
"Age":20
},
"ECE001":{
"ID":"ECE001",
"Name":"Josh",
"Age":22
},
"ECE002":{
"ID":"ECE002",
"Name":"Peter",
"Age":23
}
}
I want to generate two separate JSON files, one for CSE and one for ECE, based on the ID value. Is there a way to achieve this?
Required Output:
CSE.json:
{
"CSE001":{
"ID":"CSE001",
"Name":"John",
"Age":18
},
"CSE002":{
"ID":"CSE002",
"Name":"Marie",
"Age":20
}
}
ECE.json:
{
"ECE001":{
"ID":"ECE001",
"Name":"Josh",
"Age":22
},
"ECE002":{
"ID":"ECE002",
"Name":"Peter",
"Age":23
}
}
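I would suggest using pandas, which makes this easier.
The code may look like this:
import pandas as pd
def csv_to_json(csv_file_path):
    df = pd.read_csv(csv_file_path)
    # select rows by ID prefix
    df_CSE = df[df['ID'].str.startswith('CSE')]
    df_ECE = df[df['ID'].str.startswith('ECE')]
    # key each record by its ID so the files match the required output
    df_CSE.set_index('ID', drop=False).to_json('CSE.json', orient='index')
    df_ECE.set_index('ID', drop=False).to_json('ECE.json', orient='index')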
You can create a dataframe and then group it by the ID prefix:
import pandas as pd
df = pd.DataFrame.from_dict({
"CSE001":{
"ID":"CSE001",
"Name":"John",
"Age":18
},
"CSE002":{
"ID":"CSE002",
"Name":"Marie",
"Age":20
},
"ECE001":{
"ID":"ECE001",
"Name":"Josh",
"Age":22
},
"ECE002":{
"ID":"ECE002",
"Name":"Peter",
"Age":23
}
},orient='index')
df["id_"] = df["ID"].str[0:3] # temp column holding the three-letter prefix (CSE, ECE)
grps = df.groupby("id_")[["ID", "Name", "Age"]]
for k, v in grps:
    print(v.to_json(orient="index")) # you can write each group to a JSON file as well
You could store each row in a two-level dictionary, with the top level keyed by the first 3 characters of the ID.
Each top-level entry can then be written to a separate file, with the key forming part of the filename:
from collections import defaultdict
import csv
import json
def csv_to_json(csv_file_path, json_base_path):
    data_dict = defaultdict(dict)
    with open(csv_file_path, encoding = 'utf-8') as csv_file_handler:
        csv_reader = csv.DictReader(csv_file_handler)
        for row in csv_reader:
            # group by the three-letter prefix, e.g. 'CSE' or 'ECE'
            key = row['ID'][:3]
            data_dict[key][row['ID']] = row
    # one output file per prefix
    for key, values in data_dict.items():
        with open(f'{json_base_path}_{key}.json', 'w', encoding='utf-8') as json_file_handler:
            json_file_handler.write(json.dumps(values, indent = 4))
csv_to_json('input.csv', 'output')
The defaultdict is used to avoid needing to first test if a key is already present before using it.
This would create output_CSE.json and output_ECE.json, e.g.
{
"ECE001": {
"ID": "ECE001",
"Name": "Josh",
"Age": "22"
},
"ECE002": {
"ID": "ECE002",
"Name": "Peter",
"Age": "23"
}
}

How to parse Google Ads batch stream into pandas dataframe?

I am having trouble parsing batch response from my Google Ads request.
I am able to get the response as json that looks like this:
results {
metrics {
clicks: 200
conversions_value: 5
conversions: 40
cost_micros: 4546564
impressions: 1235
}
segments {
date: "2021-08-03"
}
landing_page_view {
resource_name: "first"
unexpanded_final_url: "https://www.soomething.com/find"
}
}
results {
metrics {
clicks: 1000
conversions_value: 10
conversions: 65
cost_micros: 654654
impressions: 8154
}
segments {
date: "2021-08-02"
}
landing_page_view {
resource_name: "customer"
unexpanded_final_url: "https://www.soomething.com/find"
}
}
This is what I tried so far:
response = ga_service.search_stream(customer_id=customer_id, query=query)
df = pd.DataFrame()
for batch in response:
    for row in batch.results:
        df = pd.DataFrame({"Date": row.segments.date,
                           "Landing page": row.landing_page_view.unexpanded_final_url,
                           "clicks": row.metrics.clicks,
                           "conversions": row.metrics.conversions,
                           "conversion value": row.metrics.conversions_value,
                           "costs": row.metrics.cost_micros,
                           "impressions": row.metrics.impressions}, index=[0])
        final = df.append(df)
final
But the result contains just one row of data in the dataframe instead of 7 days' worth of data.
If I do print(batch), I get the response I mentioned above.
How do I parse all of the data from the response into a dataframe?
Thank you in advance.
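Your approach is straightforward, but you are appending the new row to df instead of to the final DataFrame, so every iteration discards the rows collected before it.
Initialize final = pd.DataFrame() before the loop, and then do this:
final = final.append(df).reset_index(drop=True)
Instead of this:
final = df.append(df)
Note that DataFrame.append was removed in pandas 2.0; on current versions use final = pd.concat([final, df], ignore_index=True) instead.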
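A variant that avoids growing a dataframe inside the loop is to collect one dictionary per row and build the DataFrame once at the end. The sketch below assumes the row attributes used in the question; rows_to_frame, fake_row and fake_stream are hypothetical names, and the SimpleNamespace objects only stand in for the real response of ga_service.search_stream so that the example runs without the Google Ads client.

```python
from types import SimpleNamespace

import pandas as pd


def rows_to_frame(batches):
    """Collect one dict per result row, then build the DataFrame once."""
    records = []
    for batch in batches:
        for row in batch.results:
            records.append({
                "Date": row.segments.date,
                "Landing page": row.landing_page_view.unexpanded_final_url,
                "clicks": row.metrics.clicks,
                "conversions": row.metrics.conversions,
                "conversion value": row.metrics.conversions_value,
                "costs": row.metrics.cost_micros,
                "impressions": row.metrics.impressions,
            })
    return pd.DataFrame(records)


def fake_row(date, clicks, conversions, value, cost, impressions):
    """Hypothetical stand-in for one Google Ads result row."""
    return SimpleNamespace(
        segments=SimpleNamespace(date=date),
        landing_page_view=SimpleNamespace(
            unexpanded_final_url="https://www.soomething.com/find"),
        metrics=SimpleNamespace(
            clicks=clicks, conversions=conversions, conversions_value=value,
            cost_micros=cost, impressions=impressions),
    )


# One fake batch holding the two rows shown in the question.
fake_stream = [SimpleNamespace(results=[
    fake_row("2021-08-03", 200, 40, 5, 4546564, 1235),
    fake_row("2021-08-02", 1000, 65, 10, 654654, 8154),
])]

final = rows_to_frame(fake_stream)
print(final)
```

With the real client, the call would be rows_to_frame(response), where response is the stream from the question.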

How can I organize JSON data from pandas dataframe

I can't figure out how to correctly organize the JSON data that is created from my pandas dataframe. This is my code:
with open(spreadsheetName, 'rb') as spreadsheet:
    newSheet = spreadsheet.read()
newSheet = pd.read_excel(newSheet)
exportSheet = newSheet.to_json('file.json', orient = 'index')
And I'd like for the JSON data to look something like
{
"cars": [
{
"Model": "Camry",
"Year": "2015"
},
{
"Model": "Model S",
"Year": "2018"
}
]
}
But instead I'm getting a single line of JSON data from the code I have. Any ideas on how I can make each row a JSON 'object' with its own keys and values taken from the column headers (like model and year)?
Pass an indent argument to to_json so that the output is pretty-printed across multiple lines:
exportSheet = newSheet.to_json('file.json', orient='index', indent=4)
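The indent argument only changes the layout. To get the list of row objects under a "cars" key, as in the desired output, one possible approach is to convert the rows with to_dict(orient="records") and wrap them before serializing. This is a sketch with a small hand-built dataframe standing in for the spreadsheet; new_sheet and payload are names chosen for the example.

```python
import json

import pandas as pd

# Stand-in for the dataframe that read_excel would return.
new_sheet = pd.DataFrame({"Model": ["Camry", "Model S"],
                          "Year": ["2015", "2018"]})

# One dictionary per row, wrapped under the "cars" key.
payload = {"cars": new_sheet.to_dict(orient="records")}

text = json.dumps(payload, indent=4)
print(text)

# To write a file instead of printing:
# with open("file.json", "w", encoding="utf-8") as fh:
#     json.dump(payload, fh, indent=4)
```

If the sheet contains dates or other values the json module cannot serialize, they would need converting to strings first.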
