JSON Lines (JSONL) generator to CSV format - Python

I have a large Jsonl file (6GB+) which I need to convert to .csv format. After running:
import json

with open(root_dir + 'filename.json') as json_file:
    for line in json_file:
        data = json.loads(line)
        print(data)
Many records of the below format are returned:
{'url': 'https://twitter.com/CHItraders/status/945958273861275648', 'date': '2017-12-27T10:03:22+00:00', 'content': 'Why #crypto currencies like $BTC #Bitcoin are set for global domination - MUST READ! - https :// t.co/C1kEhoLaHr https :// t.co/sZT43PBDrM', 'renderedContent': 'Why #crypto currencies like $BTC #Bitcoin are set for global domination - MUST READ! - BizNews.com biznews.com/wealth-buildin…', 'id': 945958273861275648, 'username': 'CHItraders', 'user': {'username': 'CHItraders', 'displayname': 'CHItraders', 'id': 185663478, 'description': 'Options trader. Market-news. Nothing posted constitutes as advice. Do your own diligence.', 'rawDescription': 'Options trader. Market-news. Nothing posted constitutes as advice. Do your own diligence.', 'descriptionUrls': [], 'verified': False, 'created': '2010-09-01T14:52:28+00:00', 'followersCount': 1196, 'friendsCount': 490, 'statusesCount': 38888, 'favouritesCount': 10316, 'listedCount': 58, 'mediaCount': 539, 'location': 'Chicago, IL', 'protected': False, 'linkUrl': None, 'linkTcourl': None, 'profileImageUrl': 'https://pbs.twimg.com/profile_images/623935252357058560/AaeCRlHB_normal.jpg', 'profileBannerUrl': 'https://pbs.twimg.com/profile_banners/185663478/1437592670'}, 'outlinks': ['http://BizNews.com', 'https://www.biznews.com/wealth-building/2017/12/27/bitcoin-rebecca-mqamelo/#.WkNv2bQ3Awk.twitter'], 'outlinksss': 'http://BizNews.com https://www.biznews.com/wealth-building/2017/12/27/bitcoin-rebecca-mqamelo/#.WkNv2bQ3Awk.twitter', 'tcooutlinks': ['https :// t.co/C1kEhoLaHr', 'https :// t.co/sZT43PBDrM'], 'tcooutlinksss': 'https :// t.co/C1kEhoLaHr https :// t.co/sZT43PBDrM', 'replyCount': 0, 'retweetCount': 0, 'likeCount': 0, 'quoteCount': 0, 'conversationId': 945958273861275648, 'lang': 'en', 'source': 'Twitter Web Client', 'media': None, 'retweetedTweet': None, 'quotedTweet': None, 'mentionedUsers': None}
Due to the size of the file, I can't use the conversion:
with open(root_dir + 'filename.json', 'r', encoding='utf-8-sig') as f:
    data = f.readlines()
    data = map(lambda x: x.rstrip(), data)
    data_json_str = "[" + ','.join(data) + "]"
    newdf = pd.read_json(StringIO(data_json_str))
    newdf.to_csv(root_dir + 'output.csv')
due to MemoryError. I am trying to use the below generator and write each line to the csv, which should negate the MemoryError issue:
def yield_line_delimited_json(path):
    """
    Read a line-delimited json file yielding each row as a record
    :param str path:
    :rtype: list[object]
    """
    with open(path, 'r') as json_file:
        for line in json_file:
            yield json.loads(line)

new = yield_line_delimited_json(root_dir + 'filename.json')

with open(root_dir + 'output.csv', 'w') as f:
    for x in new:
        f.write(str(x))
However, the data is not written in CSV format. Any advice on why the data isn't being written to the csv file correctly is greatly appreciated!

The generator seems completely superfluous.
import csv
import json

with open(root_dir + 'filename.json') as old, open(root_dir + 'output.csv', 'w', newline='') as csvfile:
    new = csv.writer(csvfile)
    for x in old:
        row = json.loads(x)
        new.writerow(row)
If one line of JSON does not simply produce an array of strings and numbers, you still need to figure out how to convert it from whatever structure is inside the JSON to something which can usefully be serialized as a one-dimensional list of strings and numbers of a fixed length.
If your JSON can be expected to reliably contain a single dictionary with a fixed set of keyword-value pairs, maybe try
from csv import DictWriter
import json
with open(jsonfile, 'r') as inp, open(csvfile, 'w', newline='') as outp:
    writer = DictWriter(outp, fieldnames=[
            'url', 'date', 'content', 'renderedContent', 'id', 'username',
            # 'user',  # see below
            'outlinks', 'outlinksss', 'tcooutlinks', 'tcooutlinksss',
            'replyCount', 'retweetCount', 'likeCount', 'quoteCount',
            'conversationId', 'lang', 'source', 'media', 'retweetedTweet',
            'quotedTweet', 'mentionedUsers'],
        extrasaction='ignore')  # silently drop keys (like 'user') not in fieldnames
    for line in inp:
        row = json.loads(line)
        writer.writerow(row)
I omitted the user field because in your example, this key contains a nested structure which cannot easily be transformed into CSV without further mangling. Perhaps you would like to extract just user.id into a new field user_id; or perhaps you would like to lift the entire user structure into a flattened list of additional columns in the main record like user_username, user_displayname, user_id, etc?
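For instance, the first option (pulling just user.id up into a flat user_id column) could be sketched like this; the helper name and the reduced field list are assumptions for illustration, not code from the question:

```python
import json

# Flat fields copied straight through (a subset of the sample record's keys)
FLAT_FIELDS = ['url', 'date', 'content', 'id', 'username']

def tweet_to_row(record):
    """Copy the flat fields and lift the nested user.id into user_id."""
    row = {key: record.get(key) for key in FLAT_FIELDS}
    row['user_id'] = (record.get('user') or {}).get('id')
    return row

line = '{"url": "u", "id": 1, "username": "n", "user": {"id": 185663478}}'
print(tweet_to_row(json.loads(line)))
```

Each resulting dict is flat, so it can go straight to DictWriter.writerow once 'user_id' is added to the fieldnames.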
In some more detail, CSV is basically a two-dimensional matrix where every row is a one-dimensional collection of columns corresponding to one record in the data, where each column can contain one string or one number. Every row needs to have exactly the same number of columns, though you can leave some of them empty.
JSON which can trivially be transformed into CSV would look like
["Adolf", "1945", 10000000]
["Joseph", "1956", 25000000]
["Donald", null, 1000000]
JSON which can be transformed to CSV by some transformation (which you'd have to specify separately, like for example with the dictionary key ordering specified above) might look like
{"name": "Adolf", "dod": "1945", "death toll": 10000000}
{"dod": "1956", "name": "Joseph", "death toll": 25000000}
{"death toll": 1000000, "name": "Donald"}
(Just to make it more interesting, one field is missing, and the dictionary order varies from one record to the next. This is not typical, but definitely within the realm of valid corner cases that Python could not possibly guess on its own how to handle.)
Most real-world JSON is significantly more complex than either of these simple examples, to the point where we can say that the problem is not possible to solve in the general case without manual work on your part to separate out and normalize the data you want into a structure which is suitable for representing as a CSV matrix.
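One common piece of that manual work is flattening nested dictionaries into dotted column names; here is a minimal sketch (the '.' separator and the decision to recurse only into dicts, leaving lists alone, are assumptions you would adapt):

```python
def flatten(record, parent_key='', sep='.'):
    """Recursively flatten nested dicts into one level of dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

print(flatten({'id': 1, 'user': {'username': 'CHItraders', 'id': 185663478}}))
```

The flattened keys (e.g. user.id) can then serve as DictWriter fieldnames.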

Related

Why is the index position outputting data from other indexes?

I'm trying to create a list of values from two categories in a dataset, called style 1 and style 2. However, when I run the code, the list it creates also contains stray catchphrases mixed in with the style 1 and style 2 values. The weird part is that it's grabbing these catchphrases from somewhere I'm not aware of. This is the dataset that I'm using, and I cannot find the catchphrases within it: https://www.kaggle.com/datasets/jessicali9530/animal-crossing-new-horizons-nookplaza-dataset?select=villagers.csv
Here's my code:
style = []
for i in lines[1:]:
    vals = i.strip().split(',')
    style.append(vals[13])
    style.append(vals[14])
print(style)
Here's a small sample of what is being printed:
[' let it go. Then chase it down. What were you thinking?"', 'Active', 'Cool', 'Cool', 'Active', 'Simple', 'Simple', 'Elegant', 'Active', 'Active', 'Simple', 'Simple', 'Cute', 'Cute', 'Gorgeous', 'Elegant', ' water', ' and shelter!"',
As you can see there are these random catchphrases mixed within the style 1 and style 2 values. Not sure why or where it's coming from.
Some lines must contain commas inside a csv field. Usually in csv processing this is handled by putting double quotes around each field, like this:
Hero,Catchphrase,Sidekick
"Lone Ranger","Hi Ho, Silver!","Tonto"
"Superman","Up, Up, and Away!","Lois Lane"
You see that there are commas inside the catchphrase column. Since they are inside the quotes, standard csv processing will handle it correctly.
However, it looks like you are not using standard csv processing. You're just treating each csv data line as a plain string. If you were to do the same with this sample data, the first data line would split into four fields, and the second into five.
Here's a short example of the right way to do it:
import csv

style = []
with open("myfile.csv") as f:
    reader = csv.reader(f)
    for row in reader:
        # row[13] and row[14] are now parsed correctly, quoted commas included
        style.append(row[13])
        style.append(row[14])
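To see the difference concretely, here is a small self-contained comparison of a naive split against the csv module on one of the sample lines above:

```python
import csv
import io

line = '"Lone Ranger","Hi Ho, Silver!","Tonto"'

naive = line.split(',')                      # splits inside the quoted field
proper = next(csv.reader(io.StringIO(line)))

print(len(naive))   # 4 -- the catchphrase was broken in two
print(proper)       # ['Lone Ranger', 'Hi Ho, Silver!', 'Tonto']
```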

Correct way to write list of json to file in Python

Using Python, I'm pulling json data from an API - every time I get a response, I add it to a list named api_responses like so:
api_responses = [{'id':'1', 'first_name':'John', 'last_name':'Smith'},{'id':'2', 'first_name':'Chris', 'last_name':'Clark'}]
I need to write the records to a file so that it's only the records and not a list:
api_responses.json:
{'id':'1', 'first_name':'John', 'last_name':'Smith'}
{'id':'2', 'first_name':'Chris', 'last_name':'Clark'}
I could for-loop the list, write each row and add '\n'.
with open('api_responses.json', 'w') as outfile:
    for response in api_responses:
        json.dump(response, outfile)
        outfile.write('\n')
I don't want to indent or pretty the json - just make it flat as possible. Is there a better way to do this? Is jsonl/jsonlines what I am searching for?
Your code already correctly outputs the JSON objects as individual lines into a file, also known as JSONL as you pointed out.
To make the code slightly more concise and readable, you can take advantage of the print function, which outputs a newline character for you by default:
with open('api_responses.json', 'w') as outfile:
    for response in api_responses:
        print(json.dumps(response), file=outfile)
With your sample input, this outputs:
{"id": "1", "first_name": "John", "last_name": "Smith"}
{"id": "2", "first_name": "Chris", "last_name": "Clark"}
You are looking for the * unpacking operator, which spreads the list into separate arguments to print:
>>> api_responses = [{'id':'1', 'first_name':'John', 'last_name':'Smith'},
...                  {'id':'2', 'first_name':'Chris', 'last_name':'Clark'}]
>>> print(*api_responses, sep="\n", file=open('filename.txt','w'))
filename.txt then contains:
{'id': '1', 'first_name': 'John', 'last_name': 'Smith'}
{'id': '2', 'first_name': 'Chris', 'last_name': 'Clark'}
Note that this produces exactly what OP is asking for: open filename.txt and you will see it matches the requested output.
Note: the request shows single quotes ', and this answer matches that requirement, because print writes each dict's repr (which is not valid JSON, but is what was asked for).
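If the file later needs to contain valid JSON (double quotes) rather than Python repr, the same print-unpacking idea works with json.dumps mapped over the list; a sketch:

```python
import json

api_responses = [{'id': '1', 'first_name': 'John'},
                 {'id': '2', 'first_name': 'Chris'}]

with open('filename.txt', 'w') as f:
    print(*map(json.dumps, api_responses), sep='\n', file=f)

with open('filename.txt') as f:
    lines = f.read().splitlines()
print(lines[0])  # {"id": "1", "first_name": "John"}
```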

"list indices must be integers or slices, not str" while manipulating data from JSON

I am trying to extract some data from JSON files, which all have the same structure, and then write the chosen data into a new JSON file. My goal is to create a new JSON file which is more or less a list of each JSON file in my folder with the data:
Filename, triggerdata, velocity {imgVel, trigVel}, coordinates.
In a further step of my programme, I will need this new splitTest1 for analysing the data of the different files.
I have the following code:
base_dir = 'mypath'

def createJsonFile():
    splitTest1 = {}
    splitTest1['20mm PSL'] = []
    for file in os.listdir(base_dir):
        # If file is a json, construct its full path and open it, append all json data to list
        if 'json' in file:
            json_path = os.path.join(base_dir, file)
            json_data = pd.read_json(json_path, lines=True)
            if splitTest1[file]['20mm PSL'] == to_find:
                splitTest1['20mm PSL'].append({
                    'filename': os.path.basename(base_dir),
                    'triggerdata': ['rawData']['adcDump']['0B'],
                    'velocity': {
                        'imgVel': ['computedData']['particleProperties']['imgVelocity'],
                        'trigVel': ['computedData']['img0Properties']['coordinates']},
                    'coordinates': ['computedData']['img1Properties']['coordinates']})
    print(len(splitTest1))
When I run the code, I get this error:
'triggerdata': ['rawData']['adcDump']['0B'], TypeError: list indices must be integers or slices, not str
What is wrong with the code? How do I fix this?
This is my previous code how I accessed that data without saving it in another JSON File:
with open('myJsonFile.json') as f0:
    d0 = json.load(f0)

y00B = d0['rawData']['adcDump']['0B']
x = np.arange(0, (2048 * 0.004), 0.004)  # in ms, 2048 Samples, 4us

def getData():
    return y00B, x

def getVel():
    imgV = d0['computedData']['particleProperties']['imgVelocity']
    trigV = d0['computedData']['trigger']['trigVelocity']
    return imgV, trigV
Basically, I am trying to put this last code snippet into a loop which is reading all my JSON files in my folder and make a new JSON file with a list of the names of these files and some other chosen data (like the ['rawData']['adcDump']['0B'], etc)
Hope this helps understanding my problem better
I assume what you want to do is to take some data from several json files, compile those into a list, and write that into a new json file.
In order to get the data from your current json file, you'll need to add a "reference" in front of the indices (otherwise the code has no idea where it should take that data from). Like so:
base_dir = 'mypath'

def createJsonFile():
    splitTest1 = {}
    splitTest1['20mm PSL'] = []
    for file in os.listdir(base_dir):
        # If file is a json, construct its full path and open it, append all json data to list
        if 'json' in file:
            json_path = os.path.join(base_dir, file)
            json_data = pd.read_json(json_path, lines=True)
            if splitTest1[file]['20mm PSL'] == to_find:
                splitTest1['20mm PSL'].append({
                    'filename': os.path.basename(base_dir),
                    'triggerdata': json_data['rawData']['adcDump']['0B'],
                    'velocity': {
                        'imgVel': json_data['computedData']['particleProperties']['imgVelocity'],
                        'trigVel': json_data['computedData']['img0Properties']['coordinates']},
                    'coordinates': json_data['computedData']['img1Properties']['coordinates']})
    print(len(splitTest1))
So basically what you need to do is to add json_data in front of the indices.
Also, I suggest writing the variable json_path rather than base_dir into the 'filename' field.
I found the solution with help of the post from Mattu475.
I had to add the reference in front of the indices and also change how I open the files found in my folder, with the following code:
with open(json_path) as f0:
    json_data = json.load(f0)
instead of pd.read_json(...)
Here the full code:
def createJsonFile():
    splitTest1 = {}
    splitTest1['20mm PSL'] = []
    for file in os.listdir(base_dir):
        # If file is a json, construct its full path and open it, append all json data to list
        if 'json' in file:
            print("filename: ", file)  # file is only the file name, the path not included
            json_path = os.path.join(base_dir, file)
            print("path : ", json_path)
            with open(json_path) as f0:
                json_data = json.load(f0)
            splitTest1['20mm PSL'].append({
                'filename': os.path.basename(json_path),
                'triggerdata': json_data['rawData']['adcDump']['0B'],
                #'imgVel': json_data['computedData']['particleProperties']['imgVelocity'],
                'trigVel': json_data['computedData']['trigger']['trigVelocity'],
                #'coordinatesImg0': json_data['computedData']['img0Properties']['coordinates'],
                #'coordinatesImg1': json_data['computedData']['img1Properties']['coordinates']
            })
    return splitTest1
A few lines (the ones commented out) do not function 100% yet, but the rest works.
Thank you for your help!
The issue is with this line
'imgVel': ['computedData']['particleProperties']['imgVelocity'],
And the two that come after that. What's happening there is you're creating a list with the string 'computedData' as the only element. And then trying to find the index 'particleProperties', which doesn't make sense. You can only index a list with integers. I can't really give you a "solution", but if you want imgVel to just be a list of those strings, then you would do
'imgVel': ['computedData', 'particleProperties', 'imgVelocity']
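The error is easy to reproduce in isolation; a one-element list accepts only integer indices:

```python
fragment = ['computedData']      # a list whose only element is a string
print(fragment[0])               # integer index works: 'computedData'

try:
    fragment['particleProperties']   # string index on a list
except TypeError as exc:
    print(exc)   # list indices must be integers or slices, not str
```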
Your dict value doesn't do what you intend:
'triggerdata': ['rawData']['adcDump']['0B']
The value doesn't make any sense: you make a list containing a single string, then you try to index it with another string. You asked for element "adcDump" of the list ['rawData'], and a list can only be indexed with integers, so this fails at runtime.
You cannot store arbitrary source code (your partial expression) as if it were a data value.
If you want help to construct a particular reference, then please post a focused question. Please review How to Ask from the intro tour.

python 2.7: iterate dictionary and map values to a file

I have a list of dictionaries which I build from an .xml file:
list_1=[{'lat': '00.6849879', 'phone': '+3002201600', 'amenity': 'restaurant', 'lon': '00.2855850', 'name': 'Telegraf'},{'lat': '00.6850230', 'addr:housenumber': '6', 'lon': '00.2844493', 'addr:city': 'XXX', 'addr:street': 'YYY.'},{'lat': '00.6860304', 'crossing': 'traffic_signals', 'lon': '00.2861978', 'highway': 'crossing'}]
My aim is to build a text file with values (not keys) in such order:
lat,lon,'addr:street','addr:housenumber','addr:city','amenity','crossing' etc...
00.6849879,00.2855850, , , ,restaurant, ,
00.6850230,00.2844493,YYY,6,XXX, , ,
00.6860304,00.2861978, , , , ,traffic_signals,
If a value does not exist, there should be an empty space.
I tried to loop with for loop:
for i in list_1:
    line = i['lat'], i['lon']
    print line
The problem occurs if I add a value which does not exist in some records:
for i in list_1:
    line = i['lat'], i['lon'], i['phone']
    print line
Also tried to loop and use the map() function, but the results seem incorrect:
for i in list_1:
    line = map(lambda x1, x2: x1 + ',' + x2 + '\n', i['lat'], i['lon'])
    print line
Also tried:
for i in list_1:
    for k, v in i.items():
        if k == 'addr:housenumber':
            print v
This time I think there might be too many if/else conditions to write. It seems like the solution is somewhere close, but I can't figure it out or find the optimal way to do it.
I would look to use the csv module, in particular DictWriter. The fieldnames dictate the order in which the dictionary information is written out. Actually writing the header is optional:
import csv

fields = ['lat','lon','addr:street','addr:housenumber','addr:city','amenity','crossing',...]

with open('<file>', 'w') as f:
    writer = csv.DictWriter(f, fields, extrasaction='ignore')  # ignore keys not listed in fields
    #writer.writeheader() # If you want a header
    writer.writerows(list_1)
If you really didn't want to use csv module then you can simple iterate over the list of the fields you want in the order you want them:
fields = ['lat','lon','addr:street','addr:housenumber','addr:city','amenity','crossing',...]

for row in list_1:
    print(','.join(row.get(field, '') for field in fields))
If you can't or don't want to use csv you can do something like
order = ['lat', 'lon', 'addr:street', 'addr:housenumber',
         'addr:city', 'amenity', 'crossing']

for entry in list_1:
    f.write(", ".join([entry.get(x, "") for x in order]) + "\n")
This will create a list with the values from the entry map in the order present in the order list, and default to "" if the value is not present in the map.
If your output is a csv file, I strongly recommend using the csv module because it will also escape values correctly and other csv file specific things that we don't think about right now.
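The escaping point matters as soon as a value itself contains a comma; a short comparison (Python 3 syntax, sample entry invented for illustration):

```python
import csv
import io

entry = {'lat': '00.68', 'name': 'Cafe, Central'}   # note the comma in the name
order = ['lat', 'name']

naive = ",".join(entry.get(x, "") for x in order)
print(naive)   # 00.68,Cafe, Central  -- reads back as three columns

buf = io.StringIO()
csv.writer(buf).writerow([entry.get(x, "") for x in order])
print(buf.getvalue().strip())   # 00.68,"Cafe, Central"  -- quoted, two columns
```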
Thanks guys, I found the solution. Maybe it is not so elegant, but it works.
I made a list of node keys, then looked each key up in the dictionaries to get the values.
key_list=['lat','lon','addr:street','addr:housenumber','amenity','source','name','operator']
list=[{'lat': '00.6849879', 'phone': '+3002201600', 'amenity': 'restaurant', 'lon': '00.2855850', 'name': 'Telegraf'},{'lat': '00.6850230', 'addr:housenumber': '6', 'lon': '00.2844493', 'addr:city': 'XXX', 'addr:street': 'YYY.'},{'lat': '00.6860304', 'crossing': 'traffic_signals', 'lon': '00.2861978', 'highway': 'crossing'}]
Solution:
final_list = []
for i in list:
    line = str()
    for ii in key_list:
        if ii in i:
            x = ii
            line = line + str(i[x]) + ','
        else:
            line = line + ' ' + ','
    final_list.append(line)

Trying to write a list of dictionaries to csv in Python, running into encoding issues

So I am running into an encoding problem stemming from writing dictionaries to csv in Python.
Here is an example code:
import csv
some_list = ['jalape\xc3\xb1o']

with open('test_encode_output.csv', 'wb') as csvfile:
    output_file = csv.writer(csvfile)
    for item in some_list:
        output_file.writerow([item])
This works perfectly fine and gives me a csv file with "jalapeño" written in it.
However, when I create a list of dictionaries with values that contain such UTF-8 characters...
import csv
some_list = [{'main': ['4 dried ancho chile peppers, stems, veins and seeds removed']},
             {'main': ['2 jalape\xc3\xb1o peppers, seeded and chopped', '1 dash salt']}]

with open('test_encode_output.csv', 'wb') as csvfile:
    output_file = csv.writer(csvfile)
    for item in some_list:
        output_file.writerow([item])
I just get a csv file with 2 rows with the following entries:
{'main': ['4 dried ancho chile peppers, stems, veins and seeds removed']}
{'main': ['2 jalape\xc3\xb1o peppers, seeded and chopped', '1 dash salt']}
I know I have my stuff written in the right encoding, but because they aren't strings, when they are written out by csv.writer, they are written as-is. This is frustrating. I searched for some similar questions on here and people have mentioned using csv.DictWriter but that wouldn't really work well for me because my list of dictionaries aren't all just with 1 key 'main'. Some have other keys like 'toppings', 'crust', etc. Not just that, I'm still doing more work on them where the eventual output is to have the ingredients formatted in amount, unit, ingredient, so I will end up with a list of dictionaries like
[{'main': {'amount': ['4'], 'unit': [''],
           'ingredient': ['dried ancho chile peppers']}},
 {'topping': {'amount': ['1'], 'unit': ['pump'],
              'ingredient': ['cool whip']},
  'filling': {'amount': ['2'], 'unit': ['cups'],
              'ingredient': ['strawberry jam']}}]
Seriously, any help would be greatly appreciated, else I'd have to use a find and replace in LibreOffice to fix all those \x** UTF-8 encodings.
Thank you!
You are writing dictionaries to the CSV file, while .writerow() expects lists with singular values that are turned into strings on writing.
Don't write dictionaries, these are turned into string representations, as you've discovered.
You need to determine how the keys and / or values of each dictionary are to be turned into columns, where each column is a single primitive value.
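For example, one possible (assumed) mapping is to pick a fixed column order and join each dictionary's list value into a single cell; sketched in Python 3 for brevity:

```python
import csv
import io

item = {'main': ['4 dried ancho chile peppers'], 'crust': ['thin']}

columns = ['main', 'toppings', 'crust']   # fixed column order, chosen by you
row = ['; '.join(item.get(col, [])) for col in columns]
print(row)   # ['4 dried ancho chile peppers', '', 'thin']

buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue().strip())
```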
If, for example, you only want to write the main key (if present) then do so:
with open('test_encode_output.csv', 'wb') as csvfile:
    output_file = csv.writer(csvfile)
    for item in some_list:
        if 'main' in item:
            output_file.writerow(item['main'])
where it is assumed that the value associated with the 'main' key is always a list of values.
If you wanted to persist dictionaries with Unicode values, then you are using the wrong tool. CSV is a flat data format, just rows and primitive columns. Use a tool that can preserve the right amount of information instead.
For dictionaries with string keys, lists, numbers and unicode text, you can use JSON, or you can use pickle if more complex and custom data types are involved. When using JSON, you do want to either decode from byte strings to Python Unicode values, or always use UTF-8-encoded byte strings, or state how the json library should handle string encoding for you with the encoding keyword:
import json
with open('data.json', 'w') as jsonfile:
    json.dump(some_list, jsonfile, encoding='utf8')
because JSON strings are always unicode values. The default for encoding is utf8 but I added it here for clarity.
Loading the data again:
with open('data.json', 'r') as jsonfile:
    some_list = json.load(jsonfile)
Note that this will return unicode strings, not strings encoded to UTF8.
The pickle module works much the same way, but the data format is not human-readable:
import pickle

# store
with open('data.pickle', 'wb') as pfile:
    pickle.dump(some_list, pfile)

# load
with open('data.pickle', 'rb') as pfile:
    some_list = pickle.load(pfile)
pickle will return your data exactly as you stored it. Byte strings remain byte strings, unicode values would be restored as unicode.
As you can see in your output, you've used a dictionary, so if you want that string to be processed you have to write this:
import csv

some_list = [{'main': ['4 dried ancho chile peppers, stems, veins', '\xc2\xa0\xc2\xa0\xc2\xa0 and seeds removed']},
             {'main': ['2 jalape\xc3\xb1o peppers, seeded and chopped', '1 dash salt']}]

with open('test_encode_output.csv', 'wb') as csvfile:
    output_file = csv.writer(csvfile)
    for item in some_list:
        output_file.writerow(item['main'])  # so instead of [item], we use item['main']
I understand that this is possibly not the code you want, as it requires every dictionary to have the key main, but at least the strings get processed now.
You might want to formulate what you want to do a bit better, as it is not really clear right now (at least to me). For example, do you want a csv file that gives you main in the first cell and then 4 dried ...
