Convert JSON data to a Python DataFrame - python

This is my first time accessing an API / working with json data so if anyone can point me towards a good resource for understanding how to work with it I'd really appreciate it.
Specifically though, I have json data in this form:
{
  "result": {"code": "OK", "msg": ""},
  "report_name": "DAILY",
  "columns": ["ad", "ad_impressions", "cpm_cost_per_ad", "cost"],
  "data": [{"row": ["CP_CARS10_LR_774470", "966", "6.002019", "5.797950"]}],
  "total": ["966", "6.002019", "5.797950"],
  "row_count": 1
}
I understand this structure but I don't know how to get it into a DataFrame properly.

Looking at the structure of your JSON, you will presumably have several rows in data, so in my opinion it makes more sense to build the DataFrame yourself.
This code uses columns and data to build a DataFrame:
In [12]:
import json
import pandas as pd

with open('... path to your json file ...') as fp:
    for line in fp:
        obj = json.loads(line)
        columns = obj['columns']
        data = obj['data']
        rows = []
        for d in data:
            rows.append(d['row'])
        df = pd.DataFrame(rows, columns=columns)
df
Out[12]:
                    ad ad_impressions cpm_cost_per_ad      cost
0  CP_CARS10_LR_774470            966        6.002019  5.797950
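The row-collecting loop can also be written as a single list comprehension; a sketch using just the relevant keys from the question's payload:

```python
import pandas as pd

obj = {"columns": ["ad", "ad_impressions", "cpm_cost_per_ad", "cost"],
       "data": [{"row": ["CP_CARS10_LR_774470", "966", "6.002019", "5.797950"]}]}

# One DataFrame row per entry in obj["data"], labelled by obj["columns"]
df = pd.DataFrame([d["row"] for d in obj["data"]], columns=obj["columns"])
print(df.shape)  # (1, 4)
```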
As for the rest of the data in your JSON, you could, for example, use total and row_count to sanity-check your DataFrame:
In [14]:
sums = df.sum(axis=0)
for i in range(0, 3):
    if obj['total'][i] != sums[i + 1]:
        print("error in total")
In [15]:
if obj['row_count'] != len(df.index):
    print("error in row count")
As for the rest of the data in the json, it is difficult for me to know if anything else should be done.
Hope it helps.

Check the pandas documentation. Specifically,
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
pandas supports reading JSON directly into a DataFrame with read_json.
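A minimal sketch (note that read_json expects flat, record-oriented JSON, so the nested data/row structure from the question would still need manual unpacking or json_normalize):

```python
import io
import pandas as pd

# A flat, record-oriented JSON string that read_json can parse directly
raw = '[{"ad": "CP_CARS10_LR_774470", "ad_impressions": 966, "cost": 5.79795}]'
df = pd.read_json(io.StringIO(raw))
print(df.shape)  # (1, 3)
```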

Related

How to turn multiple rows of dictionaries from a file into a dataframe

I have a script that I use to fire orders from a CSV file to an exchange, using a for loop.
data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)
for i in range(len(df)):
    order = Client.new_order(...
                             ...)
    original_stdout = sys.stdout
    with open('orderData.txt', 'a') as f:
        sys.stdout = f
        print(order)
    sys.stdout = original_stdout
I put the response from the exchange in a txt file like this...
I want to turn the multiple responses into 1 single dataframe. I would hope it would look something like...
(I did that manually).
I tried:
data = pd.read_csv('orderData.txt', header=None)
dfData = pd.DataFrame(data)
print(dfData)
but I got:
I have also tried
data = pd.read_csv('orderData.txt', header=None)
organised = data.apply(pd.Series)
print(organised)
but I got the same output.
I can print order['symbol'] within the loop etc.
I'm not certain whether I should be populating this dataframe within the loop, or by capturing and writing the response and processing it afterwards. Appreciate your advice.
It looks like you are getting JSON strings back, so you could parse each response into a dictionary with json.loads (safer than eval, since JSON uses true/false/null) and then create a DataFrame from the list. Perhaps try something like this (no longer needs a file):
import json

data = pd.read_csv('orderparameters.csv')
df = pd.DataFrame(data)
response_data = []
for i in range(len(df)):
    order_json = Client.new_order(...
                                  ...)
    response_data.append(json.loads(order_json))
response_dataframe = pd.DataFrame(response_data)
If I understand your question correctly, you can simply do the following:
import pandas as pd
orders = pd.read_csv('orderparameters.csv')
responses = pd.DataFrame(Client.new_order(...) for _ in range(len(orders)))

Python - Parsing JSON Data through user defined function

I have a JSON Text File
Inside the JSON text file, there are columns like id, title, context, question, is_impossible, answer_start and text.
I am trying to read this into a Pandas DataFrame. I am new to Python and JSON, so I am giving it a go with a function definition.
Here is my code,
import json
import pandas as pd
from pandas import json_normalize

def squad_json_pd_df(json_dict):
    mylistsize = len(list(json_normalize(json_dict, 'data')['title']))
    row = []
    df = pd.DataFrame()
    for i in range(mylistsize):
        data = json_dict['data'][i]
        data_paragraphs = data['paragraphs']
        mytitle = data['title']
        for article_dict in data_paragraphs:
            for answers_dict in article_dict['qas']:
                for answer in answers_dict['answers']:
                    row.append((
                        answers_dict['id'],
                        mytitle,
                        article_dict['context'],
                        answers_dict['question'],
                        answers_dict['is_impossible'],
                        answer['answer_start'],
                        answer['text']
                    ))
    df = pd.concat([df, pd.DataFrame.from_records(row, columns=['id', 'title', 'context', 'question', 'is_impossible', 'answer_start', 'answer'])], axis=0, ignore_index=True)
    df.drop_duplicates(inplace=True)
    return df

with open(dev_datapath) as file:
    dev_dict = json.load(file)
dev_df = squad_json_pd_df(dev_dict)
So the problem here is: the is_impossible column has both true and false values in the text file, but after I load it into the Pandas DataFrame, I can see only false records.
My understanding of the problem is that the JSON structure could be different for true records and I am not parsing it correctly in Python.
The is_impossible false structure looks as below,
The is_impossible true structure looks as below,
The reason why you don't get the true records back is that they live under a different JSON tag: "plausible_answers" instead of "answers", I think. In your code the innermost loop only iterates over answers_dict['answers'], so you never actually visit the plausible_answers list, which is where is_impossible is set to true.
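A hedged sketch of the fix, using a hypothetical two-entry qas list (SQuAD v2 stores the answers for impossible questions under plausible_answers):

```python
qas = [
    {"id": "1", "question": "Q1?", "is_impossible": False,
     "answers": [{"text": "yes", "answer_start": 0}]},
    {"id": "2", "question": "Q2?", "is_impossible": True,
     "answers": [],
     "plausible_answers": [{"text": "maybe", "answer_start": 5}]},
]

rows = []
for qa in qas:
    # Fall back to plausible_answers when answers is empty (is_impossible is true)
    candidates = qa["answers"] or qa.get("plausible_answers", [])
    for ans in candidates:
        rows.append((qa["id"], qa["question"], qa["is_impossible"],
                     ans["answer_start"], ans["text"]))

print(len(rows))  # 2 -- the impossible question is no longer dropped
```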

Taking OpenCorporate API data into a structured CSV

I'm currently struggling with figuring out how to use pandas to scrape data off of the OpenCorporate API and insert it into a CSV file. I'm not quite sure where I'm messing up.
import pandas as pd
df = pd.read_json('https://api.opencorporates.com/companies/search?q=pwc')
data = df['companies']['company'][0]
result = {'name':data['timestamp'],
'company_number':data[0]['company_number'],
'jurisdiction_code':data[0]['jurisdiction_code'],
'incorporation_date':data[0]['incorporation_date'],
'dissolution_date':data[0]['dissolution_date'],
'company_type':data[0]['company_type'],
'registry_url':data[0]['registry_url'],
'branch':data[0]['branch'],
'opencorporates_url':data[0]['opencorporates_url'],
'previous_names':data[0]['previous_names'],
'source':data[0]['source'],
'url':data[0]['url'],
'registered_address':data[0]['registered_address'],
}
df1 = pd.DataFrame(result, columns=['name', 'company_number', 'jurisdiction_code', 'incorporation_date', 'dissolution_date', 'company_type', 'registry_url', 'branch', 'opencorporates_url', 'previous_names', 'source', 'url', 'registered_address'])
df1.to_csv('company.csv', index=False, encoding='utf-8')
Get the JSON data with requests and then use pd.io.json.json_normalize to flatten the response:
import requests
import pandas as pd
from pandas.io.json import json_normalize

json_data = requests.get('https://api.opencorporates.com/companies/search?q=pwc').json()

df = None
for row in json_data["results"]["companies"]:
    if df is None:
        df = json_normalize(row["company"])
    else:
        df = pd.concat([df, json_normalize(row["company"])])
You then write the DataFrame to a CSV using the df.to_csv() method, as described in the question.
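The per-row loop can also be collapsed into a single json_normalize call over the list of company dicts. A minimal sketch on a hypothetical miniature of the API response (the field values here are illustrative):

```python
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

# Hypothetical miniature of the OpenCorporates response, for illustration only
json_data = {"results": {"companies": [
    {"company": {"name": "PWC LLP", "company_number": "001"}},
    {"company": {"name": "PWC GmbH", "company_number": "002"}},
]}}

# Flatten every "company" dict in one call instead of concatenating per row
df = json_normalize([row["company"] for row in json_data["results"]["companies"]])
print(df.shape)  # (2, 2)
```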
It might be easier for you to access to the OpenCorporates database in bulk.
OpenCorporates provides access for commercial users under a closed licence, and as open data for journalists, academics and NGOs who are able to share the results under a share-alike, open data licence. The licence is available here: https://opencorporates.com/info/licence

Python convert dictionary to CSV

I am trying to convert a dictionary to CSV so that it is readable (each value under its respective key).
import csv
import json
from urllib.request import urlopen

x = 0
id_num = [848649491, 883560475, 431495539, 883481767, 851341658, 42842466, 173114302, 900616370, 1042383097, 859872672]
for bilangan in id_num:
    with urlopen("https://shopee.com.my/api/v2/item/get?itemid=" + str(bilangan) + "&shopid=1883827") as response:
        source = response.read()
    data = json.loads(source)
    #print(json.dumps(data, indent=2))
    data_list = {x: {'title': productName(), 'price': price(), 'description': description(), 'preorder': checkPreorder(),
                     'estimate delivery': estimateDelivery(), 'variation': variation(), 'category': categories(),
                     'brand': brand(), 'image': image_link()}}
    #print(data_list[x])
    x += 1
I store the index in x, so it loops from 0 to 1, 2 and so on. I have tried many things but still cannot find a way to make it look like this, or close to this:
https://i.stack.imgur.com/WoOpe.jpg
Using DictWriter from the csv module.
Demo:
import csv

data_list = {'x': {'title': 'productName()', 'price': 'price()', 'description': 'description()', 'preorder': 'checkPreorder()',
                   'estimate delivery': 'estimateDelivery()', 'variation': 'variation()', 'category': 'categories()',
                   'brand': 'brand()', 'image': 'image_link()'}}

filename = 'output.csv'
with open(filename, 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=data_list['x'].keys())
    writer.writeheader()
    writer.writerow(data_list['x'])
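To write one row per looped item (the x = 0, 1, 2, … index in the question), the same DictWriter can be driven by a loop; the values below are placeholders:

```python
import csv

# Hypothetical rows keyed by loop index, mirroring the question's data_list
data_list = {
    0: {"title": "item A", "price": 10, "image": "link_1"},
    1: {"title": "item B", "price": 20, "image": "link_2"},
}

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "image"])
    writer.writeheader()
    for key in sorted(data_list):
        writer.writerow(data_list[key])
```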
I think maybe you just want to merge some cells like Excel does?
If so, this is not possible in CSV, because the CSV format does not carry cell-style information the way Excel does.
Some possible solutions:
use openpyxl to generate an Excel file instead of a CSV; then you can merge cells with the worksheet.merge_cells() function.
do not try to merge cells; just repeat title, price and the other fields on each line, so the data looks like:
first line: {'title': 'test_title', 'price': 22, 'image': 'image_link_1'}
second line: {'title': 'test_title', 'price': 22, 'image': 'image_link_2'}
do not try to merge cells, but set title, price and the other repeated fields to a blank string, so they do not show in your CSV file.
use a line break to control the format; that merges multiple lines with the same title into a single line.
hope that helps.
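The third option (blanking repeated fields so the CSV reads like merged cells) can be sketched with just the csv module; the field names mirror the question's data but the values are placeholders:

```python
import csv

rows = [
    {"title": "test_title", "price": 22, "image": "image_link_1"},
    {"title": "test_title", "price": 22, "image": "image_link_2"},
]

# Blank out title/price when they repeat the previous row,
# so the CSV visually groups rows the way merged cells would
out_rows = []
prev_key = None
for r in rows:
    key = (r["title"], r["price"])
    if key == prev_key:
        out_rows.append({"title": "", "price": "", "image": r["image"]})
    else:
        out_rows.append(dict(r))
        prev_key = key

with open("items.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "image"])
    writer.writeheader()
    writer.writerows(out_rows)
```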
If I were you, I would have done this a bit differently. I do not like that you are calling so many functions when this website offers a beautiful JSON response back :) Moreover, I would use the pandas library so that I have total control over my data; I am not a CSV lover. This is a simple prototype:
import requests
import pandas as pd

# Create our dictionary with our items lists
data_list = {'title': [], 'price': [], 'description': [], 'preorder': [],
             'estimate delivery': [], 'variation': [], 'categories': [],
             'brand': [], 'image': []}

# API url
url = 'https://shopee.com.my/api/v2/item/get'

id_nums = [848649491, 883560475, 431495539, 883481767, 851341658,
           42842466, 173114302, 900616370, 1042383097, 859872672]
shop_id = 1883827

# Loop through id_nums and return the goodies
for id_num in id_nums:
    params = {
        'itemid': id_num,  # take values from id_nums
        'shopid': shop_id}
    r = requests.get(url, params=params)
    # Check if we got something :)
    if r.ok:
        data_json = r.json()
        # This web site returns a beautiful JSON we can slice :)
        product = data_json['item']
        # Let's populate data_list with the items we got. We could simply
        # create one function to do this, but for now this will do
        data_list['title'].append(product['name'])
        data_list['price'].append(product['price'])
        data_list['description'].append(product['description'])
        data_list['preorder'].append(product['is_pre_order'])
        data_list['estimate delivery'].append(product['estimated_days'])
        data_list['variation'].append(product['tier_variations'])
        data_list['categories'].append([c['display_name'] for c in product['categories']])
        data_list['brand'].append(product['brand'])
        data_list['image'].append(product['image'])
    else:
        # Do something if we hit a connection error or similar,
        # e.g. retry or ignore
        pass

# Putting the dictionary into a DataFrame and ordering the columns :)
df = pd.DataFrame(data_list)
df = df[['title', 'price', 'description', 'preorder', 'estimate delivery',
         'variation', 'categories', 'brand', 'image']]

# df.to ...? There are dozens of different ways to store your data
# that are far better than CSV, e.g. MongoDB, HDF5 or compressed pickle
df.to_csv('my_data.csv', sep=';', encoding='utf-8', index=False)

Python Response API JSON to CSV table

Below you see the code that I use to collect some data via the API of IBM. However, I have some problems with saving the output to a CSV table via Python.
These are the columns that I want (and their values):
emotion__document__emotion__anger emotion__document__emotion__joy
emotion__document__emotion__sadness emotion__document__emotion__fear
emotion__document__emotion__disgust sentiment__document__score
sentiment__document__label language entities__relevance
entities__text entities__type entities__count concepts__relevance
concepts__text concepts__dbpedia_resource usage__text_characters
usage__features usage__text_units retrieved_url
This is my code that I use to collect the data:
response = natural_language_understanding.analyze(
    url=url,
    features=[
        Features.Emotion(),
        Features.Sentiment(),
        Features.Concepts(limit=1),
        Features.Entities(limit=1)
    ]
)
data = json.load(response)

rows_list = []
cols = []
for ind, row in enumerate(data):
    if ind == 0:
        cols.append(["usage__{}".format(i) for i in row["usage"].keys()])
        cols.append(["emotion__document__emotion__{}".format(i) for i in row["emotion"]["document"]["emotion"].keys()])
        cols.append(["sentiment__document__{}".format(i) for i in row["sentiment"]["document"].keys()])
        cols.append(["concepts__{}".format(i) for i in row["concepts"].keys()])
        cols.append(["entities__{}".format(i) for i in row["entities"].keys()])
        cols.append(["retrieved_url"])
    d = OrderedDict()
    d.update(row["usage"])
    d.update(row["emotion"]["document"]["emotion"])
    d.update(row["sentiment"]["document"])
    d.update(row["concepts"])
    d.update(row["entities"])
    d.update({"retrieved_url": row["retrieved_url"]})
    rows_list.append(d)

df = pd.DataFrame(rows_list)
df.columns = [i for subitem in cols for i in subitem]
df.to_csv("featuresoutput.csv", index=False)
Changing
cols.append(["concepts__{}".format(i) for i in row["concepts"][0].keys()])
cols.append(["entities__{}".format(i) for i in row["entities"][0].keys()])
Did not solved the problem
If you get it from an API, the response will be in JSON format. You can output it to a CSV like this:
import csv
import json

response = ...  # the JSON response you get from the API
attributes = ['emotion__document__emotion__anger', 'emotion__document__emotion__joy']  # ...the attributes you want

data = json.load(response)
with open('output.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',')
    for attribute in attributes:
        writer.writerow(data[attribute][0])
Make sure data is a dict and not a string; Python 3.6 should return a dict. Print out a few rows to see how your required data is stored.
This line assigns a string to data:
data=(json.dumps(datas, indent=2))
So here you iterate over the characters of a string:
for ind,row in enumerate(data):
In this case row will be a string, and not a dictionary. So, for example, row["usage"] would give you such an error in this case.
Maybe you wanted to iterate over datas?
Update
The code has a few other issues, such as:
cols.append(["concepts__{}".format(i) for i in row["concepts"].keys()])
In this case, you would want row["concepts"][0].keys() to get the keys of the first element, because row["concepts"] is an array.
I'm not very familiar with pandas, but I would suggest you look at json_normalize, included in pandas, which can help flatten the JSON structure. An issue you might face is the concepts and entities fields, which contain arrays of documents. That means you would have to include the same document at least max(len(concepts), len(entities)) times.
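As a hedged sketch of that json_normalize suggestion, flattening a hypothetical slice of a Watson NLU response (field values are illustrative, modeled on the column names in the question):

```python
from pandas import json_normalize  # pandas >= 1.0

# Hypothetical slice of the NLU response, for illustration only
response = {
    "emotion": {"document": {"emotion": {"anger": 0.1, "joy": 0.7}}},
    "sentiment": {"document": {"score": 0.6, "label": "positive"}},
    "retrieved_url": "http://example.com",
}

# sep="__" reproduces the double-underscore column names from the question
df = json_normalize(response, sep="__")
print(sorted(df.columns))
```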
