I am trying to convert a very long JSON file to CSV. I'm currently trying to use the code below to accomplish this.
import json
import csv
with open('G:\user\jsondata.json') as json_file:
jsondata = json.load(json_file)
data_file = open('G:\user\jsonoutput.csv', 'w', newline='')
csv_writer = csv.writer(data_file)
count = 0
for data in jsondata:
if count == 0:
header = data.keys()
csv_writer.writerow(header)
count += 1
csv_writer.writerow(data.values())
data_file.close()
This code accomplishes writing all the data to a CSV, However only takes the keys for from the first JSON line to use as the headers in the CSV. This would be fine, but further in the JSON there are more keys to used. This causes the values to be disorganized. I was wondering if anyone could help me find a way to get all the possible headers and possibly insert NA when a line doesn't contain that key or values for that key.
The JSON file is similar to this:
[
{"time": "1984-11-04:4:00", "dateOfevent": "1984-11-04", "action": "TAKEN", "Country": "Germany", "Purchased": "YES", ...},
{"time": "1984-10-04:4:00", "dateOfevent": "1984-10-04", "action": "NOTTAKEN", "Country": "Germany", "Purchased": "NO", ...},
{"type": "A4", "time": "1984-11-04:4:00", "dateOfevent": "1984-11-04", "Country": "Germany", "typeOfevent": "H7", ...},
{...},
{...},
]
I've searched for possible solutions all over, but was unable to find anyone having a similar issue.
If want to use csv and json modules to do this then can do it in two passes. First pass collects the keys for the CSV file and second pass writes the rows to CSV file. Also, must use a DictWriter since the keys differ in the different records.
import json
import csv
with open('jsondata.json') as json_file:
jsondata = json.load(json_file)
# stage 1 - populate column names from JSON
keys = []
for data in jsondata:
for k in data.keys():
if k not in keys:
keys.append(k)
# stage 2 - write rows to CSV file
with open('jsonoutput.csv', 'w', newline='') as fout:
csv_writer = csv.DictWriter(fout, fieldnames=keys)
csv_writer.writeheader()
for data in jsondata:
csv_writer.writerow(data)
Could you try to read it normally, and then cobert it to csv using .to_csv like this:
df = pd.read_json('G:\user\jsondata')
#df = pd.json_normalize(df['Column Name']) #if you want to normalize it
dv.to_csv('example.csv')
Related
I have tried a few different ways using Panda to import my JSON to a csv file.
import pandas as pd
df = pd.read_json("CDMP_E2.json")
df.ts_csv("CDMP_Output.csv")
The problem is when I run that code it makes the output all in one "column".
The column header shows up as Credit-NoSQL.
Then the data in the column is everything from each "object"
'date':'2021-08-01','type':'CARD','amount':'100'
So it looks like this:
Credit-NoSQL
'date':'2021-08-01','type':'CARD','amount':'100'
I would instead expect to see date, type and amount as the headers instead.
account date type amount returneddate
ABCD 2021-08-01 CARD 100
EFGHI 2021-08-01 CARD 150 2021-08-04
My JSON file looks as such:
[
{
"Credit-NoSQL":{
"account":"ABCD"
"date":"2021-08-01",
"type":"CARD",
"amount":"100"
}
},
{
"Credit-NoSQL":{
"account":"EFGHI"
"date":"2021-08-02",
"type":"CARD",
"amount":"150"
"returneddate":"2021-08-04"
}
}
]
so I am not sure if it is the way my JSON file is set up with it's list and such or if I am missing something in my python command. I am new to python and still learning so I am at a loss at what I can do next.
No need to use pandas for this.
import json, csv
with open("CDMP_E2.json") as json_file:
data = [item['Credit-NoSQL'] for item in json.load(json_file)]
# Get the union of all dictionary keys
fieldnames = set()
for row in data:
fieldnames |= row
with open("CDMP_Output.csv", "w") as csv_file:
cwrite = csv.DictWriter(csv_file, fieldnames = fieldnames)
cwrite.writeheader()
cwrite.writerows(data)
I have a folder including multiple JSON files. Here is a sample JSON file (all JSON files have the same structure):
{
"url": "http://www.lulu.com/shop/alfred-d-byrd/in-the-fire-of-dawn/paperback/product-1108729.html",
"label": "true",
"body": "SOME TEXT HERE",
"ids": [
"360175950098468864",
"394147879201148929"
]
}
I'd like to extract only ids and write them into a CSV file. Here is my code:
import pandas as pd
import os
from os import path
import glob
import csv
import json
input_path = "TEST/True_JSON"
for file in glob.glob(os.path.join(input_path,'*.json')):
with open(file,'rt') as json_file:
json_data = pd.read_json(json_file) #reading json into a pandas dataframe
ids = json_data[['ids']] #select only "response_tweet_ids"
ids.to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False)
print(ids)
PROBLEM: The above code writes some ids into a CSV file. However, it doesn't return all ids. Also, there are some ids in the output CSV file (ids.csv) that didn't exist in any of my JSON files!
I really appreciate it if someone helps me understand where is the problem.
Thank you,
one other way is create common list for all ids in the folder and write it to the output file only once, here example:
input_path = "TEST/True_JSON"
ids = []
for file in glob.glob(os.path.join(input_path,'*.json')):
with open(file,'rt') as json_file:
json_data = pd.read_json(json_file) #reading json into a pandas dataframe
ids.extend(json_data['ids'].to_list()) #select only "response_tweet_ids"
pd.DataFrame(
ids, colums=('ids', )
).to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False)
print(ids)
Please read the answer by #lemonhead to get more details.
I think you have two main issues here:
pandas seems to read in ids off-by-1 in some cases, probably due to internally reading in as a float and then converting to an int64 and flooring. See here for a similar issue encountered
To see this:
> x = '''
{
"url": "http://www.lulu.com/shop/alfred-d-byrd/in-the-fire-of-dawn/paperback/product-1108729.html",
"label": "true",
"body": "SOME TEXT HERE",
"ids": [
"360175950098468864",
"394147879201148929"
]
}
'''
> print(pd.read_json(io.StringIO(x)))
# outputs:
url label body ids
0 http://www.lulu.com/shop/alfred-d-byrd/in-the-... true SOME TEXT HERE 360175950098468864
1 http://www.lulu.com/shop/alfred-d-byrd/in-the-... true SOME TEXT HERE 394147879201148928
Note the off by one error with 394147879201148929! AFAIK, one quick way to obviate this in your case is just to tell pandas to read everything in as a string, e.g.
pd.read_json(json_file, dtype='string')
You are looping through your json files and writing each one to the same csv file. However, by default, pandas is opening the file in 'w' mode, which will overwrite any previous data in the file. If you open in append mode ('a') instead, that should do what you intended
ids.to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False, mode='a')
In context:
for file in glob.glob(os.path.join(input_path,'*.json')):
with open(file,'rt') as json_file:
json_data = pd.read_json(json_file, dtype='string') #reading json into a pandas dataframe
ids = json_data[['ids']] #select only "response_tweet_ids"
ids.to_csv('TEST/ids.csv',encoding='utf-8', header=False, index=False, mode='a')
Overall though, unless you are getting something else from pandas here, why not just use raw json and csv libraries? The following would be do the same without the pandas dependency:
import os
from os import path
import glob
import csv
import json
input_path = "TEST/True_JSON"
all_ids = []
for file in glob.glob(os.path.join(input_path,'*.json')):
with open(file,'rt') as json_file:
json_data = json.load(json_file)
ids = json_data['ids']
all_ids.extend(ids)
print(all_ids)
# write all ids to a csv file
# you could also remove duplicates or other post-processing at this point
with open('TEST/ids.csv', mode='wt', newline='') as fobj:
writer = csv.writer(fobj)
for row in all_ids:
writer.writerow([row])
By default, dataframe.to_csv() overwrites the file. So each time through the loop you replace the file with the IDs from that input file, and the final result is the IDs from the last file.
Use the mode='a' argument to append to the CSV file instead of overwriting.
ids.to_csv(
'TEST/ids.csv', encoding='utf-8', header=False, index=False,
mode='a'
)
I am not able to generate a proper csv file using the below code. But when I query in individually, I am getting the desired result. Below is the my json file and code
{
"quiz": {
"maths": {
"q2": {
"question": "12 - 8 = ?",
"options": [
"1",
"2",
"3",
"4"
],
"answer": "4"
},
"q1": {
"question": "5 + 7 = ?",
"options": [
"10",
"11",
"12",
"13"
],
"answer": "12"
}
},
"sport": {
"q1": {
"question": "Which one is correct team name in NBA?",
"options": [
"New York Bulls",
"Los Angeles Kings",
"Golden State Warriros",
"Huston Rocket"
],
"answer": "Huston Rocket"
}
}
}
}
import json
import csv
# Opening JSON file and loading the data
# into the variable data
with open('tempjson.json', 'r') as jsonFile:
data = json.load(jsonFile)
flattenData=flatten(data)
employee_data=flattenData
# now we will open a file for writing
data_file = open('data_files.csv', 'w')
# create the csv writer object
csv_writer = csv.writer(data_file)
# Counter variable used for writing
# headers to the CSV file
count = 0
for emp in employee_data:
if count == 0:
# Writing headers of CSV file
header = emp
csv_writer.writerow(header)
count += 1
# Writing data of CSV file
#csv_writer.writerow(employee_data.get(emp))
data_file.close()
Once the above code execute, I get the information as below:
I am not getting it what I am doing wrong. I am flattenning my json file and then trying to change it to csv
You can manipulate the JSON easily with Pandas Dataframes and save it to a CSV.
I'm not sure how your desired CSV should look like, but the following code generates a CSV with columns question, options, and answers. It generates an index column with the name of the quiz and the question number in an alphabetically ordered list (your JSON was unordered). The code below will also work when more different quizzes and questions are added.
Maybe converting it natively in Python is performance-wise better, but manipulation using Pandas makes it easier.
import pandas as pd
# create Pandas dataframe from JSON for easy manipulation
df = pd.read_json("tempjson.json")
# create result dataframe
df_result = pd.DataFrame()
# Get nested dict from each dataframe row
for index, row in df.iterrows():
# Convert it into a new dataframe
df_temp = pd.DataFrame.from_dict(df.loc[index]['quiz'], orient='index')
# Add name of quiz to index
df_temp.index = index + ' ' + df_temp.index
# Append row result to final dataframe
df_result = df_result.append(df_temp)
# Optionally sort alphabetically so questions are in order
df_result.sort_index(inplace=True)
# convert dataframe to CSV
df_result.to_csv('quiz.csv')
Update on request: Export to CSV using flattened JSON:
import json
import csv
from flatten_json import flatten
import pandas
# Opening JSON file and loading the data
# into the variable data
with open("tempjson.json", 'r') as jsonFile:
data = json.load(jsonFile)
flattenData=flatten(data)
df = pd.DataFrame.from_dict(flattenData, orient='index')
# convert dataframe to CSV
df.to_csv('quiz.csv', header=False)
Results in the following CSV (Not sure what your desired outcome is since you did not provide the desired result in your question).
I have a file 'test.json' which contains an array "rows" and another sub array "allowed" in which some alphabets are there like "A","B" etc.but i want to modify the contents of subarray. how can i do??
test.json file is following:
{"rows": [
{
"Company": "google",
"allowed": ["A","B","C"]},#array containg 3 variables
{
"Company": "Yahoo",
"allowed": ["D","E","F"]#array contanig 3 variables
}
]}
But i want to modify "allowed" array . and want to update 3rd index as "LOOK" instead of "C". so that the resultant array should looks like:
{"rows": [
{
"Company": "google",
"allowed": ["A","B","LOOK"]#array containg 3 variables
},
{
"Company": "Yahoo", #array containing 3 variables
"allowed": ["D","E","F"] #array containing 3 variables
}
]}
My program:
import json
with open('test.json') as f:
data = json.load(f)
for row in data['rows']:
a_dict = {row['allowed'][1]:"L"}
with open('test.json') as f:
data = json.load(f)
data.update(a_dict)
with open('test.json', 'w') as f:
json.dump(data, f,indent=2)
There are a couple of problems with your program as it is.
The first issue is you're not looking up the last element of your 'allowed' arrays:
a_dict = {row['allowed'][1]:"L"}
Remember, array indicies start at zero. eg:
['Index 0', 'Index 1', 'Index 2']
But the main problem is when you walk over each row, you fetch the contents of
that row, but then don't do anything with it.
import json
with open('test.json') as f:
data = json.load(f)
for row in data['rows']:
a_dict = {row['allowed'][1]:"L"}
# a_dict is twiddling its thumbs here...
# until it's replaced by the next row's contents
...
It just gets replaced by the next row of the for loop, until you're left with the
final row all by itself in "a_dict", since the last one of course isn't overwritten by
anything. Which in you sample, would be:
{'E': 'L'}
Next you load the original json data again (though, you don't need to -- it's
still in your data variable, unmodified), and add a_dict to it:
with open('test.json') as f:
data = json.load(f)
data.update(a_dict)
This leaves you with this:
{
"rows": [
{
"Company": "google",
"allowed": ["A", "B", "C"]
},
{
"Company": "Yahoo",
"allowed": ["D", "E", "F"]
}
],
"E": "L"
}
So, to fix this, you need to:
Point at the correct 'allowed' index (in your case, that'll be [2]), and
Modify the rows, instead of copying them out and merging them back into data.
In your for loop, each row in data['rows'] is pointing at the value in data, so you can update the contents of row, and your work is done.
One thing I wasn't clear on was whether you meant to update all rows (implied by your looping over all rows), or just update the first row (as shown in your example desired output).
So here's a sample fix which works in either case:
import json
modify_first_row_only = True
with open('test.json', 'r') as f:
data = json.load(f)
rows = data['rows']
if modify_first_row_only:
rows[0]['allowed'][2] = 'LOOK'
else:
for row in rows:
row['allowed'][2] = 'LOOK'
with open('test.json', 'w') as f:
json.dump(data, f, indent=2)
I am trying to get some data from a JSON file. Here is the code for it -
import csv
import json
ifile = open('facebook.csv', "rb")
reader = csv.reader(ifile)
rownum = 0
for row in reader:
try:
csvfile = open('facebook.csv', 'r')
jsonfile = open('file.json', 'r+')
fieldnames = ("USState","NOFU2008","NOFU2009","NOFU2010", "12MI%", "24MI%")
reader = csv.DictReader( csvfile, fieldnames)
for row in reader:
json.dump(row, jsonfile)
jsonfile.write('\n')
data = json.load(jsonfile)
print data["USState"]
except ValueError:
continue
I am not getting any output on the console for the print statement. The JSON is in the following format
{"USState": "US State", "12MI%": "12 month increase %", "24MI%": "24 month increase %", "NOFU2010": "Number of Facebook UsersJuly 2010", "NOFU2008": "Number of Facebook usersJuly 2008", "NOFU2009": "Number of Facebook UsersJuly 2009"}
{"USState": "Alabama", "12MI%": "109.3%", "24MI%": "400.7%", "NOFU2010": "1,452,300", "NOFU2008": "290,060", "NOFU2009": "694,020"}
I want to access this like NOFU2008 for all the rows.
The problem is in the way you're creating the JSON file. You don't want to use json.dump() for each row and then append those to the JSON file.
To create a JSON file, you should first create a data structure in Python that represents the entire file the way you want it, and then call json.dump() one time only to dump out the entire structure to JSON format.
Making a single json.dump() call for your entire file will insure that it is valid JSON.
I'd also recommend wrapping your list/array of rows inside a dict/object so you have a place to put other properties that pertain to the entire JSON file as opposed to a single row.
It looks like the first couple of rows of your facebook.csv are something like this (with or without the quotes):
"US State","12 month increase %","24 month increase %","Number of Facebook UsersJuly 2010","Number of Facebook usersJuly 2008","Number of Facebook UsersJuly 2009"
"Alabama","109.3%","400.7%","1,452,300","290,060","694,020"
Let's say we want to generate this JSON file from that (indented here for clarity):
{
"rows": [
{
"USState": "US State",
"12MI%": "Number of Facebook usersJuly 2008",
"24MI%": "Number of Facebook UsersJuly 2009",
"NOFU2010": "Number of Facebook UsersJuly 2010",
"NOFU2008": "12 month increase %",
"NOFU2009": "24 month increase %"
},
{
"USState": "Alabama",
"12MI%": "290,060",
"24MI%": "694,020",
"NOFU2010": "1,452,300",
"NOFU2008": "109.3%",
"NOFU2009": "400.7%"
}
]
}
Note that the top level of the JSON file is an object (not an array), and this object has a rows property which is the array of rows.
We can create this JSON file and test it with this Python code:
import csv
import json
# Read the CSV file and convert it to a list of dicts
with open( 'facebook.csv', 'rb' ) as csvfile:
fieldnames = (
"USState", "NOFU2008", "NOFU2009", "NOFU2010",
"12MI%", "24MI%"
)
reader = csv.DictReader( csvfile, fieldnames )
rows = list( reader )
# Wrap the list inside an outer dict
wrap = {
'rows': rows
}
# Format and write the entire JSON in one fell swoop
with open( 'file.json', 'wb' ) as jsonfile:
json.dump( wrap, jsonfile )
# Now test the file by reading it and parsing it
with open( 'file.json', 'rb' ) as jsonfile:
data = json.load( jsonfile )
# For fun, convert the data back to JSON again and pretty-print it
print json.dumps( data, indent=4 )
A few notes... This code does not have the nested reader loops from the original. I have no idea what those were for. One reader should be enough.
In fact, this version doesn't use a loop at all. This line generates a list of rows from the reader object:
rows = list( reader )
Also pay close attention to use use of with where the CSV and JSON files are opened. This is a great way to open a file because the file will be automatically closed at the end of the with block.
Now having said all this, I have to wonder if this exact JSON structure is what you really want? It looks like the first row of the CSV is a header row, so you may want to skip that row? You can do that easily by adding a reader.next() call before converting the rest of the CSV data to a list:
reader.next()
rows = list( reader )
Also I'm not sure I understand how you want to access the resulting data. You wouldn't be able to use data["USState"], because USState is a property of each individual row object. So say a little more about how you want to access the data and we can sort it out.
If you want to create a list of json objects in the file then you should inform yourself about what a list in json looks like.
In case that list elements are separated by a comma you should put something like this into the code:
jsonfile.write(',\n')