API causing multiple JSON arrays in file - python

I am writing an API script that needs to call a list of URLs from a file and run each URL through the code, which I have working. The only downside is that I now have a single JSON file with multiple JSON arrays in it, and I cannot get it to convert to a CSV. Any help greatly appreciated.
import requests
import json
import pandas as pd
import csv
from pathlib import Path

Path("/test2/test2").mkdir(parents=True, exist_ok=True)

links = pd.read_csv('file.csv')
test = []
for url in links:
    response = requests.get(url, headers={'CERT': 'cert'}).json()
    test.append(response[:])
    json2 = json.dumps(test)
    f = open('/test2/test2/data.json', 'a')
    f.write(json2)
    f.close()

df = pd.read_json('/test2/test2/data.json', lines=True)
df.to_csv('/test2/test2/data.csv')

df = pd.read_csv('/test2/test2/data.csv')
test = df['ID']
test2 = df['Code']
test3 = df['Name']
header = ['ID', 'Code', 'Name']
df.to_csv('/test2/test2/test.csv', columns=header)
I've tried including code such as json3 = json2.replace('}][{', '}, {'), as well as trying:
testList = []
with open('/test2/test2/data.json') as f:
    for jsonObj in f:
        testDict = json.loads(jsonObj)
        testList.append(testDict)
And have had no luck. I mean, technically I can open the file in Notepad and change }][{ to }, {, but I would like to do it programmatically as this will be an automated process. Any help greatly appreciated.
EDIT:
Sample Output:
[{"ID": 5, "OldID": 1, "Code": 5, "Name": "Jeff"}][{"ID": 2, "OldID": 4, "Code": 0, "Name": "James"}]
That's a scrubbed-down version. The output goes onto one line, which works fine when running the code with just one URL, but with two URLs it causes the issue. For some reason I cannot get replace to correct the ']['. With one URL, having it as a list/array [] doesn't bother the conversion; it's just the start of each new list/array that does.

I never could get this to stop combining JSON arrays in the file, but I was able to merge those arrays into one JSON array so that I could convert and parse the file. Here is the code that did its magic and completed my process:
import fileinput

with fileinput.FileInput('data.json', inplace=True, backup='.bak') as file:
    for line in file:
        print(line.replace('][', ', '), end='')
When I added this between my first JSON write and my pandas call to read the file, it was able to find the troublemaking characters and remove them.
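For reference, the root cause is that the loop dumps the accumulated list and appends it to data.json on every pass, so the file ends up holding several complete arrays back to back. Here is a minimal sketch of an alternative that writes a single array once, assuming each response is itself a JSON array; the 'url' column name in file.csv is an assumption:

import json
import requests
import pandas as pd

# accumulate every record in one list and write the file once,
# so data.json always contains a single JSON array
records = []
for url in pd.read_csv('file.csv')['url']:  # the 'url' column name is an assumption
    records.extend(requests.get(url, headers={'CERT': 'cert'}).json())

with open('/test2/test2/data.json', 'w') as f:  # 'w' replaces the file instead of appending to it
    json.dump(records, f)

df = pd.read_json('/test2/test2/data.json')  # a single array, so lines=True is not needed
df.to_csv('/test2/test2/test.csv', columns=['ID', 'Code', 'Name'], index=False)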


Extract only ids from json files and read them into a csv

I have a folder containing multiple JSON files. Here is a sample JSON file (all JSON files have the same structure):
{
    "url": "http://www.lulu.com/shop/alfred-d-byrd/in-the-fire-of-dawn/paperback/product-1108729.html",
    "label": "true",
    "body": "SOME TEXT HERE",
    "ids": [
        "360175950098468864",
        "394147879201148929"
    ]
}
I'd like to extract only ids and write them into a CSV file. Here is my code:
import pandas as pd
import os
from os import path
import glob
import csv
import json

input_path = "TEST/True_JSON"
for file in glob.glob(os.path.join(input_path, '*.json')):
    with open(file, 'rt') as json_file:
        json_data = pd.read_json(json_file)  # reading json into a pandas dataframe
        ids = json_data[['ids']]  # select only "response_tweet_ids"
        ids.to_csv('TEST/ids.csv', encoding='utf-8', header=False, index=False)
        print(ids)
PROBLEM: The above code writes some ids into a CSV file. However, it doesn't return all ids. Also, there are some ids in the output CSV file (ids.csv) that didn't exist in any of my JSON files!
I would really appreciate it if someone helped me understand where the problem is.
Thank you,
One other way is to create a common list of all the ids in the folder and write it to the output file only once. Here is an example:
input_path = "TEST/True_JSON"

ids = []
for file in glob.glob(os.path.join(input_path, '*.json')):
    with open(file, 'rt') as json_file:
        # dtype='string' avoids the integer-precision issue described below
        json_data = pd.read_json(json_file, dtype='string')
        ids.extend(json_data['ids'].to_list())  # select only "response_tweet_ids"

pd.DataFrame(
    ids, columns=('ids',)
).to_csv('TEST/ids.csv', encoding='utf-8', header=False, index=False)
print(ids)
Please read the answer by @lemonhead to get more details.
I think you have two main issues here:
pandas seems to read in the ids off by one in some cases, probably due to internally reading them in as floats and then converting to int64 and flooring. See here for a similar issue.
To see this:
> import io
> x = '''
{
    "url": "http://www.lulu.com/shop/alfred-d-byrd/in-the-fire-of-dawn/paperback/product-1108729.html",
    "label": "true",
    "body": "SOME TEXT HERE",
    "ids": [
        "360175950098468864",
        "394147879201148929"
    ]
}
'''
> print(pd.read_json(io.StringIO(x)))
# outputs:
                                                 url label            body                 ids
0  http://www.lulu.com/shop/alfred-d-byrd/in-the-...  true  SOME TEXT HERE  360175950098468864
1  http://www.lulu.com/shop/alfred-d-byrd/in-the-...  true  SOME TEXT HERE  394147879201148928
Note the off-by-one error with 394147879201148929! AFAIK, one quick way to avoid this in your case is just to tell pandas to read everything in as strings, e.g.
pd.read_json(json_file, dtype='string')
You are looping through your json files and writing each one to the same csv file. However, by default, pandas is opening the file in 'w' mode, which will overwrite any previous data in the file. If you open in append mode ('a') instead, that should do what you intended
ids.to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False, mode='a')
In context:
for file in glob.glob(os.path.join(input_path, '*.json')):
    with open(file, 'rt') as json_file:
        json_data = pd.read_json(json_file, dtype='string')  # reading json into a pandas dataframe
        ids = json_data[['ids']]  # select only "response_tweet_ids"
        ids.to_csv('TEST/ids.csv', encoding='utf-8', header=False, index=False, mode='a')
Overall though, unless you are getting something else from pandas here, why not just use the raw json and csv libraries? The following would do the same without the pandas dependency:
import os
from os import path
import glob
import csv
import json

input_path = "TEST/True_JSON"

all_ids = []
for file in glob.glob(os.path.join(input_path, '*.json')):
    with open(file, 'rt') as json_file:
        json_data = json.load(json_file)
        ids = json_data['ids']
        all_ids.extend(ids)

print(all_ids)

# write all ids to a csv file
# you could also remove duplicates or do other post-processing at this point
with open('TEST/ids.csv', mode='wt', newline='') as fobj:
    writer = csv.writer(fobj)
    for row in all_ids:
        writer.writerow([row])
By default, dataframe.to_csv() overwrites the file. So each time through the loop you replace the file with the IDs from that input file, and the final result is the IDs from the last file.
Use the mode='a' argument to append to the CSV file instead of overwriting.
ids.to_csv(
    'TEST/ids.csv', encoding='utf-8', header=False, index=False,
    mode='a'
)

Python convert dictionary to CSV

I am trying to convert a dictionary to CSV so that it is readable (each value under its respective key).
import csv
import json
from urllib.request import urlopen

x = 0
id_num = [848649491, 883560475, 431495539, 883481767, 851341658, 42842466, 173114302, 900616370, 1042383097, 859872672]
for bilangan in id_num:
    with urlopen("https://shopee.com.my/api/v2/item/get?itemid=" + str(bilangan) + "&shopid=1883827") as response:
        source = response.read()
    data = json.loads(source)
    # print(json.dumps(data, indent=2))
    data_list = {x: {'title': productName(), 'price': price(), 'description': description(), 'preorder': checkPreorder(),
                     'estimate delivery': estimateDelivery(), 'variation': variation(), 'category': categories(),
                     'brand': brand(), 'image': image_link()}}
    # print(data_list[x])
    x += 1
I store the data under the key x, so it loops from 0 to 1, 2, and so on. I have tried many things but still cannot find a way to make it look like this, or close to this:
https://i.stack.imgur.com/WoOpe.jpg
Using DictWriter from the csv module
Demo:
import csv

data_list = {'x': {'title': 'productName()', 'price': 'price()', 'description': 'description()', 'preorder': 'checkPreorder()',
                   'estimate delivery': 'estimateDelivery()', 'variation': 'variation()', 'category': 'categories()',
                   'brand': 'brand()', 'image': 'image_link()'}}

filename = "output.csv"  # any output path
with open(filename, "w", newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=data_list["x"].keys())
    writer.writeheader()
    writer.writerow(data_list["x"])
I think maybe you just want to merge some cells, like Excel does?
If so, I think this is not possible in CSV, because the CSV format does not contain cell-style information like Excel.
Some possible solutions:
use openpyxl to generate an Excel file instead of a CSV; then you can merge cells with the "worksheet.merge_cells()" function (see the sketch after this list).
do not try to merge cells; just keep the title, price and other fields on each line, so the data format looks like:
first line: {'title': 'test_title', 'price': 22, 'image': 'image_link_1'}
second line: {'title': 'test_title', 'price': 22, 'image': 'image_link_2'}
do not try to merge cells, but set the repeated title, price and other fields to a blank string, so they will not show in your csv file.
use line breaks to control the format; that will merge multiple lines with the same title into a single line.
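For the openpyxl route, here is a minimal sketch, assuming the three-column layout from above (the output file name is illustrative):

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(['title', 'price', 'image'])
ws.append(['test_title', 22, 'image_link_1'])
ws.append(['test_title', 22, 'image_link_2'])

# merge the repeated title and price cells across the two variation rows
ws.merge_cells('A2:A3')
ws.merge_cells('B2:B3')

wb.save('merged.xlsx')  # illustrative output path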
hope that helps.
If I were you, I would have done this a bit differently. I do not like that you are calling so many functions when this website offers a beautiful JSON response back :) Moreover, I will use the pandas library so that I have total control over my data. I am not a CSV lover. This is a silly prototype:
import requests
import pandas as pd

# Create our dictionary with our items lists
data_list = {'title': [], 'price': [], 'description': [], 'preorder': [],
             'estimate delivery': [], 'variation': [], 'categories': [],
             'brand': [], 'image': []}

# API url
url = 'https://shopee.com.my/api/v2/item/get'

id_nums = [848649491, 883560475, 431495539, 883481767, 851341658,
           42842466, 173114302, 900616370, 1042383097, 859872672]

shop_id = 1883827

# Loop through id_nums and return the goodies
for id_num in id_nums:
    params = {
        'itemid': id_num,  # take values from id_nums
        'shopid': shop_id}
    r = requests.get(url, params=params)
    # Check if we got something :)
    if r.ok:
        data_json = r.json()
        # This web site returns a beautiful JSON we can slice :)
        product = data_json['item']
        # Let's populate our data_list with the items we got. We could simply
        # create one function to do this, but for now this will do
        data_list['title'].append(product['name'])
        data_list['price'].append(product['price'])
        data_list['description'].append(product['description'])
        data_list['preorder'].append(product['is_pre_order'])
        data_list['estimate delivery'].append(product['estimated_days'])
        data_list['variation'].append(product['tier_variations'])
        data_list['categories'].append([c['display_name'] for c in product['categories']])
        data_list['brand'].append(product['brand'])
        data_list['image'].append(product['image'])
    else:
        # Do something if we hit a connection error or the like;
        # maybe retry or ignore
        pass

# Putting the dictionary into a DataFrame and ordering the columns :)
df = pd.DataFrame(data_list)
df = df[['title', 'price', 'description', 'preorder', 'estimate delivery',
         'variation', 'categories', 'brand', 'image']]

# df.to ...? There are dozens of different ways to store your data
# that are far better than CSV, e.g. MongoDB, HDF5 or compressed pickle
df.to_csv('my_data.csv', sep=';', encoding='utf-8', index=False)

Read json file as pandas dataframe?

I am using Python 3.6 and trying to load a JSON file (350 MB) into a pandas dataframe using the code below. However, I get the following error:
data_json_str = "[" + ",".join(data) + "]"
TypeError: sequence item 0: expected str instance, bytes found
How can I fix the error?
import pandas as pd

# read the entire file into a python array
with open('C:/Users/Alberto/nutrients.json', 'rb') as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)

# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual business JSON objects
# separated by a comma
data_json_str = "[" + ",".join(data) + "]"

# now, load it into pandas
data_df = pd.read_json(data_json_str)
From your code, it looks like you're loading a JSON file which has JSON data on each separate line. read_json supports a lines argument for data like this:
data_df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
Note
Remove lines=True if you have a single JSON object instead of individual JSON objects on each line.
Using the json module you can parse the json into a python object, then create a dataframe from that:
import json
import pandas as pd

with open('C:/Users/Alberto/nutrients.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
If you open the file as binary ('rb'), you will get bytes. How about:
with open('C:/Users/Alberto/nutrients.json', 'rU') as f:
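Alternatively, if you do keep binary mode, you can decode each line from bytes to str before joining; a minimal sketch:

with open('C:/Users/Alberto/nutrients.json', 'rb') as f:
    data = f.readlines()

# decode so that ",".join(data) gets str items instead of bytes
data = [line.rstrip().decode('utf-8') for line in data]
data_json_str = "[" + ",".join(data) + "]"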
As noted in this answer, you can also use pandas directly:
df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
If you want to convert it into an array of JSON objects, I think this will do what you want:
import json

data = []
with open('nutrients.json', errors='ignore') as f:
    for line in f:
        data.append(json.loads(line))
print(data[0])
The easiest way to read a JSON file using pandas is:
pd.read_json("sample.json", lines=True, orient='columns')
To deal with nested JSON like this:
[[{"Value1": 1}, {"value2": 2}], [{"value3": 3}, {"value4": 4}], .....]
use Python basics:
value1 = df['column_name'][0][0].get('Value1')
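A small self-contained sketch of that access pattern (the column name and values are illustrative):

import pandas as pd

# two rows, each holding a list of single-key dicts as in the sample above
df = pd.DataFrame({'column_name': [[{"Value1": 1}, {"value2": 2}],
                                   [{"value3": 3}, {"value4": 4}]]})
value1 = df['column_name'][0][0].get('Value1')
print(value1)  # 1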
Please see the code below:
#call the pandas library
import pandas as pd
#set the file location as URL or filepath of the json file
url = 'https://www.something.com/data.json'
#load the json data from the file to a pandas dataframe
df = pd.read_json(url, orient='columns')
#display the top 10 rows from the dataframe (this is to test only)
df.head(10)
Please review the code and modify it based on your needs. I have added comments to explain each line of code. Hope this helps!

Python Response API JSON to CSV table

Below you can see the code I use to collect some data via the IBM API. However, I have some problems saving the output to a CSV table with Python.
These are the columns that I want (and their values):
emotion__document__emotion__anger emotion__document__emotion__joy
emotion__document__emotion__sadness emotion__document__emotion__fear
emotion__document__emotion__disgust sentiment__document__score
sentiment__document__label language entities__relevance
entities__text entities__type entities__count concepts__relevance
concepts__text concepts__dbpedia_resource usage__text_characters
usage__features usage__text_units retrieved_url
This is my code that I use to collect the data:
response = natural_language_understanding.analyze(
    url=url,
    features=[
        Features.Emotion(),
        Features.Sentiment(),
        Features.Concepts(limit=1),
        Features.Entities(limit=1)
    ]
)

data = json.load(response)
rows_list = []
cols = []
for ind, row in enumerate(data):
    if ind == 0:
        cols.append(["usage__{}".format(i) for i in row["usage"].keys()])
        cols.append(["emotion__document__emotion__{}".format(i) for i in row["emotion"]["document"]["emotion"].keys()])
        cols.append(["sentiment__document__{}".format(i) for i in row["sentiment"]["document"].keys()])
        cols.append(["concepts__{}".format(i) for i in row["concepts"].keys()])
        cols.append(["entities__{}".format(i) for i in row["entities"].keys()])
        cols.append(["retrieved_url"])
    d = OrderedDict()
    d.update(row["usage"])
    d.update(row["emotion"]["document"]["emotion"])
    d.update(row["sentiment"]["document"])
    d.update(row["concepts"])
    d.update(row["entities"])
    d.update({"retrieved_url": row["retrieved_url"]})
    rows_list.append(d)

df = pd.DataFrame(rows_list)
df.columns = [i for subitem in cols for i in subitem]
df.to_csv("featuresoutput.csv", index=False)
Changing
cols.append(["concepts__{}".format(i) for i in row["concepts"][0].keys()])
cols.append(["entities__{}".format(i) for i in row["entities"][0].keys()])
did not solve the problem.
If you get it from an API, the response will be in JSON format. You can output it to a CSV like this:
import csv, json

response = ...  # the json response you get from the API, as a string
attributes = ['emotion__document__emotion__anger', 'emotion__document__emotion__joy', ...]  # the attributes you want

data = json.loads(response)
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for attribute in attributes:
        writer.writerow(data[attribute][0])
Make sure data is a dict and not a string. Print out a few rows to see how your required data is stored.
This line assigns a string to data:
data=(json.dumps(datas, indent=2))
So here you iterate over the characters of a string:
for ind,row in enumerate(data):
In this case row will be a string, and not a dictionary. So, for example, row["usage"] would give you such an error in this case.
Maybe you wanted to iterate over datas?
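A small demo of the difference, with a stubbed-out response (the field values are placeholders):

import json

datas = [{"usage": {"text_units": 1}}]

data = json.dumps(datas, indent=2)  # a str: enumerating it yields single characters
rows = json.loads(data)             # a list of dicts again

for ind, row in enumerate(rows):
    print(row["usage"])  # works, because row is a dict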
Update
The code has a few other issues, such as:
cols.append(["concepts__{}".format(i) for i in row["concepts"].keys()])
In this case, you would want row["concepts"][0].keys() to get the keys of the first element, because row["concepts"] is an array.
I'm not very familiar with pandas, but I would suggest you look at json_normalize, included in pandas, which can help flatten the JSON structure. An issue you might face is the concepts and entities, which contain arrays of documents. That means you would have to include the same document at least max(len(concepts), len(entities)) times.
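A minimal json_normalize sketch, assuming a single response dict shaped like the fields described above (the values are placeholders; it is pd.json_normalize in recent pandas, pandas.io.json.json_normalize in older versions):

import pandas as pd

response = {
    "usage": {"text_characters": 1188, "features": 4, "text_units": 1},
    "sentiment": {"document": {"score": 0.42, "label": "positive"}},
    "emotion": {"document": {"emotion": {"anger": 0.1, "joy": 0.6, "sadness": 0.1,
                                         "fear": 0.1, "disgust": 0.1}}},
    "retrieved_url": "http://example.com",
}

# sep='__' reproduces the double-underscore column names from the question
flat = pd.json_normalize(response, sep='__')
print(flat.columns.tolist())
# includes 'usage__text_characters', 'sentiment__document__score',
# 'emotion__document__emotion__anger', 'retrieved_url', ...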

Convert Json data to Python DataFrame

This is my first time accessing an API / working with json data so if anyone can point me towards a good resource for understanding how to work with it I'd really appreciate it.
Specifically though, I have json data in this form:
{"result": { "code": "OK", "msg": "" },"report_name":"DAILY","columns":["ad","ad_impressions","cpm_cost_per_ad","cost"],"data":[{"row":["CP_CARS10_LR_774470","966","6.002019","5.797950"]}],"total":["966","6.002019","5.797950"],"row_count":1}
I understand this structure but I don't know how to get it into a DataFrame properly.
Looking at the structure of your json, presumably you will have several rows for your data and in my opinion it will make more sense to build the dataframe yourself.
This code uses columns and data to build a dataframe:
In [12]:
import json
import pandas as pd

with open('... path to your json file ...') as fp:
    for line in fp:
        obj = json.loads(line)
        columns = obj['columns']
        data = obj['data']

l = []
for d in data:
    l += [d['row']]
df = pd.DataFrame(l, index=None, columns=columns)
df
Out[12]:
                    ad ad_impressions cpm_cost_per_ad      cost
0  CP_CARS10_LR_774470            966        6.002019  5.797950
As for the rest of the data in your json, I guess you could e.g. use the totals to check your dataframe:
In [14]:
sums = df.sum(axis=0)
for i in range(0, 3):
    if obj['total'][i] != sums[i + 1]:
        print("error in total")
In [15]:
if obj['row_count'] != len(df.index):
    print("error in row count")
As for the rest of the data in the json, it is difficult for me to know if anything else should be done.
Hope it helps.
Check the pandas documentation, specifically:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html
Pandas supports reading JSON into a dataframe.
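For the record, here is a short sketch that builds the frame straight from the 'columns' and 'data' keys of the sample in the question:

import json
import pandas as pd

raw = '{"result": {"code": "OK", "msg": ""}, "report_name": "DAILY", "columns": ["ad", "ad_impressions", "cpm_cost_per_ad", "cost"], "data": [{"row": ["CP_CARS10_LR_774470", "966", "6.002019", "5.797950"]}], "total": ["966", "6.002019", "5.797950"], "row_count": 1}'

obj = json.loads(raw)
df = pd.DataFrame([r['row'] for r in obj['data']], columns=obj['columns'])
print(df)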
