I'm reading a file, and I want to append data to a list and then dump it to a JSON file. I'm using Python 2.7.
The problem is that it only picks up the last line of the file and populates the output with it.
I don't know if that's clear, so here is the code:
import re
import json

results = []
contact = {
    "id": "",
    "email": ""
}
source = open('zen_id.txt')
output = open('zen_id_js.json', 'w')
for line in source:
    email = re.search(r'[\w\.-]+#[\w\.-]+', line)
    contact['email'] = email.group(0)
    p = re.search(r'\d\d\d\d\d', line)
    contact['id'] = p.group(0)
    results.append(contact)
json.dump(results, output)
And the output is:
[
    {
        "id": "35148",
        "email": "****#gmail.com"
    },
    {
        "id": "35148",
        "email": "****#gmail.com"
    },
    {
        "id": "35148",
        "email": "****#gmail.com"
    },
    {
        "id": "35148",
        "email": "****#gmail.com"
    },
Does anyone know what's happening?
Thanks in advance.
By doing

contact = {
    "id": "",
    "email": ""
}

outside the loop, you have one instance of the object, and you just modify that same instance over and over again (results.append doesn't create a copy of the dictionary; it only stores a reference to it).
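A minimal demo of the aliasing, with hypothetical values:

contact = {"id": ""}
results = []
for n in ("1", "2"):
    contact["id"] = n
    results.append(contact)  # stores a reference, not a copy
print(results)  # [{'id': '2'}, {'id': '2'}]: both entries are the same dict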
One solution is to define it inside the loop, or to create a copy:
for line in source:
    email = re.search(r'[\w\.-]+#[\w\.-]+', line)
    contact = {}  # create a new, empty instance
    contact['email'] = email.group(0)
    ...
Note that it is not necessary to define the dictionary with keys and empty values, since you're overwriting them anyway; just define it empty.
Another alternative is not to use contact at all and to create the dictionary on the fly with a literal when appending to the list:
results.append({"email":email.group(0), "id":p.group(0)})
You can also skip the loop altogether and write it in one line using a list comprehension:
results = [{"email":re.search(r'[\w\.-]+#[\w\.-]+', line).group(0), "id":re.search(r'\d\d\d\d\d', line).group(0)} for line in source]
The only issue here is that you cannot easily handle lines where there isn't a match.
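If you do need to handle such lines, a sketch like this keeps the explicit loop and simply skips them; re.search returns None when nothing matches (\d{5} is just shorthand for the five-digit pattern above):

results = []
for line in source:
    email = re.search(r'[\w\.-]+#[\w\.-]+', line)
    id_match = re.search(r'\d{5}', line)
    if email is None or id_match is None:
        continue  # no usable data on this line; skip it
    results.append({"email": email.group(0), "id": id_match.group(0)})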
You have to create the contact dictionary inside the for loop.
import re
import json

results = []
source = open('zen_id.txt')
output = open('zen_id_js.json', 'w')
for line in source:
    contact = {
        "id": "",
        "email": ""
    }
    email = re.search(r'[\w\.-]+#[\w\.-]+', line)
    contact['email'] = email.group(0)
    p = re.search(r'\d\d\d\d\d', line)
    contact['id'] = p.group(0)
    results.append(contact)
json.dump(results, output)
You can also perform a deep copy.
import re
import json
import copy

results = []
contact = {
    "id": "",
    "email": ""
}
source = open('zen_id.txt')
output = open('zen_id_js.json', 'w')
for line in source:
    email = re.search(r'[\w\.-]+#[\w\.-]+', line)
    contact['email'] = email.group(0)
    p = re.search(r'\d\d\d\d\d', line)
    contact['id'] = p.group(0)
    results.append(copy.deepcopy(contact))
json.dump(results, output)
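Note that for a flat dictionary like contact, a shallow copy behaves the same way; deepcopy only matters once the dictionary holds nested mutable values:

results.append(contact.copy())  # a shallow copy is enough here, since the values are plain strings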
Related
I have two large files:
one is a text file with a lot of IDs: one ID per row;
the other one is a 6+ GB json file, containing many items.
I need to search for those IDs in a certain field of the json file and copy the whole item it refers to for later analysis (creating a new file).
Here is an example:
IDs.txt
unique_id_1
unique_id_2
...
schema.json
[
    {
        "id": "unique_id_1",
        "name": "",
        "text": "",
        "date": ""
    },
    {
        "id": "unique_id_aaa",
        "name": "",
        "text": "",
        "date": ""
    },
    {
        "id": "unique_id_2",
        "name": "",
        "text": "",
        "date": ""
    },
    ...
]
I am doing this analysis with Python/Pandas, but I am running into trouble due to the large size of the files. What is the best way to do this? I can also consider using other software/languages.
I implemented my second suggestion: this only works if the schema is flat (there are no nested objects in the JSON file). I also did not check what happens if a value in the JSON file is a dictionary; that would probably have to be handled more carefully, as I currently check for } in a line to decide whether an object is over.
You still need to load the entire IDs file, since you need some way to check whether an object is needed.
If the useful_objects list grows too large, you can easily save it periodically while parsing the file.
import json
from pathlib import Path
import re
from typing import Dict

schema_name = "schema.json"
schema_path = Path(schema_name)
ids_name = "IDs.txt"
ids_path = Path(ids_name)

# read the ids
useful_ids = set()
with ids_path.open() as id_f:
    for line in id_f:
        id_ = line.strip()
        useful_ids.add(id_)
print(useful_ids)

useful_objects = []
temp: Dict[str, str] = {}
was_useful = False
with schema_path.open() as sc_f:
    for line in sc_f:
        # remove start/end whitespace
        line = line.strip()
        print(f"Parsing line {line}")
        # skip blank lines so line[0] below cannot fail
        if not line:
            continue
        # an object is ending
        if line[0] == "}":
            # add it
            if was_useful:
                useful_objects.append(temp)
            # reset the usefulness for the next object
            was_useful = False
            # reset the temp object
            temp = {}
        # parse the line
        match = re.match(r'"(.*?)": "(.*)"', line)
        # if this did not match, skip the line
        if match is None:
            continue
        # extract the data from the regex match
        key = match.group(1)
        value = match.group(2)
        print(f"\tMatched: {key} {value}")
        # build the temp object incrementally
        temp[key] = value
        # check if this object is useful
        if key == "id" and value in useful_ids:
            was_useful = True

useful_json = json.dumps(useful_objects, indent=4)
print(useful_json)
Again, not very elegant and not very robust, but as long as you are aware of the limitations, it does the job.
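For what it's worth, a streaming parser such as ijson (which comes up in the questions below) would also handle nested objects; a minimal sketch, assuming the top-level value of schema.json is an array of objects with an "id" field:

import ijson

useful_objects = []
with open("schema.json", "rb") as f:
    # "item" matches each element of the top-level array in turn
    for obj in ijson.items(f, "item"):
        if obj.get("id") in useful_ids:
            useful_objects.append(obj)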
Cheers!
I'm trying to append to an existing JSON file. When I overwrite the entire JSON file, everything works perfectly. The problem I have been unable to resolve is the append; I'm completely at a loss at this point.
{
    "hashlist": {
        "QmVZATT8jWo6ncQM3kwBrGXBjuKfifvrE": {
            "description": "Test Video",
            "url": ""
        },
        "QmVqpEomPZU8cpNezxZHG2oc3xQi61P2n": {
            "description": "Cat Photo",
            "url": ""
        },
        "QmYdWb4CdFqWGYnPA7V12bX7hf2zxv64AG": {
            "description": "test.co",
            "url": ""
        }
    }
}
Here is the code that I'm using, where data['hashlist'].append(entry) raises AttributeError: 'dict' object has no attribute 'append':
#!/usr/bin/python
import json
import os

data = []
if os.stat("hash.json").st_size != 0:
    file = open('hash.json', 'r')
    data = json.load(file)
    # print(data)

choice = raw_input("What do you want to do? \n a)Add a new IPFS hash\n s)Search stored hashes\n >>")
if choice == 'a':
    # Add a new hash.
    description = raw_input('Enter hash description: ')
    new_hash_val = raw_input('Enter IPFS hash: ')
    new_url_val = raw_input('Enter URL: ')
    entry = {new_hash_val: {'description': description, 'url': new_url_val}}
    # search existing hash listings here
    if new_hash_val not in data['hashlist']:
        # append JSON file with new entry
        # print entry
        # data['hashlist'] = dict(entry) #overwrites whole JSON file
        data['hashlist'].append(entry)
        file = open('hash.json', 'w')
        json.dump(data, file, sort_keys=True, indent=4, ensure_ascii=False)
        file.close()
        print('IPFS Hash Added.')
        pass
    else:
        print('Hash exists!')
Usually Python errors are pretty self-explanatory, and this is a perfect example: dictionaries in Python do not have an append method. There are two ways of adding to a dictionary: either by naming a new key, value pair, or by passing a mapping (or an iterable of key, value pairs) to dictionary.update(). In your code you could do:
data['hashlist'][new_hash_val] = {'description': description, 'url': new_url_val}
or:
data['hashlist'].update({new_hash_val: {'description': description, 'url': new_url_val}})
The first one is probably superior for what you are trying to do, because the second one is more for when you are trying to add lots of key, value pairs.
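Since entry is already a one-key dictionary of the right shape, you could also pass it to update() directly; a small variant of the second option:

entry = {new_hash_val: {'description': description, 'url': new_url_val}}
data['hashlist'].update(entry)  # merges the new hash into the existing mapping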
You can read more about dictionaries in Python here.
I am trying to parse a big json file (hundreds of gigs) to extract information from its keys. For simplicity, consider the following example:
import json
import random, string

# To create a random key
def random_string(length):
    return "".join(random.choice(string.lowercase) for i in range(length))

# Create the dictionary
dummy = {random_string(10): random.sample(range(1, 1000), 10) for times in range(15)}

# Dump the dictionary into a json file
with open("dummy.json", "w") as fp:
    json.dump(dummy, fp)
Then I use ijson in Python 2.7 to parse the file:

import ijson

file_name = "dummy.json"
with open(file_name, "r") as fp:
    for key in dummy.keys():
        print "key: ", key
        parser = ijson.items(fp, str(key) + ".item")
        for number in parser:
            print number,
I was expecting to retrieve all the numbers in the lists corresponding to the keys of the dict. However, I got:
IncompleteJSONError: Incomplete JSON data
I am aware of this post: Using python ijson to read a large json file with multiple json objects, but in my case I have a single JSON file that is well formed, with a relatively simple schema. Any ideas on how I can parse it? Thank you.
ijson has an iterator interface for dealing with large JSON files, allowing the file to be read lazily. You can process the file in small chunks and save the results somewhere else.
Calling ijson.parse() yields three values: prefix, event, and value.
Some JSON:
{
    "europe": [
        {"name": "Paris", "type": "city"},
        {"name": "Rhein", "type": "river"}
    ]
}
Code:
import ijson

data = ijson.parse(open(FILE_PATH, 'r'))
for prefix, event, value in data:
    if event == 'string':
        print(value)
Output:
Paris
city
Rhein
river
Reference: https://pypi.python.org/pypi/ijson
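To pull out only a specific field rather than every string, you can match on the prefix as well; a small variant of the same idea (for the sample above, the "name" field of each array element has the prefix 'europe.item.name'):

import ijson

with open(FILE_PATH, 'r') as f:
    for prefix, event, value in ijson.parse(f):
        # prefix identifies where in the document the value was found
        if prefix == 'europe.item.name':
            print(value)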
The sample JSON file content is given below: it has records for two people, but it might as well have 2 million records.
[
    {
        "Name": "Joy",
        "Address": "123 Main St",
        "Schools": [
            "University of Chicago",
            "Purdue University"
        ],
        "Hobbies": [
            {
                "Instrument": "Guitar",
                "Level": "Expert"
            },
            {
                "percussion": "Drum",
                "Level": "Professional"
            }
        ],
        "Status": "Student",
        "id": 111,
        "AltID": "J111"
    },
    {
        "Name": "Mary",
        "Address": "452 Jubal St",
        "Schools": [
            "University of Pennsylvania",
            "Washington University"
        ],
        "Hobbies": [
            {
                "Instrument": "Violin",
                "Level": "Expert"
            },
            {
                "percussion": "Piano",
                "Level": "Professional"
            }
        ],
        "Status": "Employed",
        "id": 112,
        "AltID": "M112"
    }
]
I created a generator which returns each person's record as a JSON object. The code looks like the below. This is not the generator code itself; changing a couple of lines would make it a generator.
import json

curly_idx = []
jstr = ""
first_curly_found = False
with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    # Reading file line by line
    line = fp.readline()
    lnum = 0
    while line:
        for a in line:
            if a == '{':
                curly_idx.append(lnum)
                first_curly_found = True
            elif a == '}':
                curly_idx.pop()
        # when the right curly for every left curly is found,
        # it would mean that one complete data element was read
        if len(curly_idx) == 0 and first_curly_found:
            jstr = f'{jstr}{line}'
            jstr = jstr.rstrip()
            jstr = jstr.rstrip(',')
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
                print(jstr)
            jstr = ""
            line = fp.readline()
            lnum += 1
            continue
        if first_curly_found:
            jstr = f'{jstr}{line}'
        line = fp.readline()
        lnum += 1
        if lnum > 100:
            break
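For reference, here is a sketch of the generator variant the answer alludes to: the same brace counting, assuming the bracketed format shown above, with each parsed object yielded instead of printed:

import json

def iter_records(path):
    """Yield each top-level JSON object from the file, one at a time."""
    curly_idx = []
    jstr = ""
    first_curly_found = False
    with open(path, 'r') as fp:
        for line in fp:
            for a in line:
                if a == '{':
                    curly_idx.append(1)
                    first_curly_found = True
                elif a == '}':
                    curly_idx.pop()
            if not curly_idx and first_curly_found:
                # a complete object has been read; clean it up and parse it
                jstr = (jstr + line).rstrip().rstrip(',')
                if len(jstr) > 10:
                    yield json.loads(jstr)
                jstr = ""
            elif first_curly_found:
                jstr += line

for record in iter_records("test.json"):
    print(record["Name"])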
You are starting more than one parsing iteration with the same file object without resetting it. The first call to ijson will work, but it moves the file object to the end of the file; the second time you pass the same object to ijson, it complains because there is nothing left to read.
Try opening the file each time you call ijson; alternatively, you can seek back to the beginning of the file after each call so the file object can read the data again.
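A minimal sketch of the seek approach, applied to the dummy.json example from the question:

file_name = "dummy.json"
with open(file_name, "r") as fp:
    for key in dummy.keys():
        fp.seek(0)  # rewind so ijson can read the whole file again
        for number in ijson.items(fp, str(key) + ".item"):
            print number,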
If you are working with JSON in the following format, you can use ijson.items().
sample json:
[
    {"id":2,"cost":0,"test":0,"testid2":255909890011279,"test_id_3":0,"meeting":"daily","video":"paused"},
    {"id":2,"cost":0,"test":0,"testid2":255909890011279,"test_id_3":0,"meeting":"daily","video":"paused"}
]
import gzip
import ijson
from pathlib import Path

input_path = 'file.txt'
res = []
# ijson accepts a binary file object, so open the file in 'rb' mode
if Path(input_path).suffix[1:].lower() == 'gz':
    input_file_handle = gzip.open(input_path, mode='rb')
else:
    input_file_handle = open(input_path, 'rb')
# 'item' yields each element of the top-level array
for json_row in ijson.items(input_file_handle, 'item'):
    res.append(json_row)
I would like some help/advice on how to parse this Gene Ontology (.obo) file.
I am working to create a visualisation in D3, and need to create a "tree" file in the JSON format:
{
    "name": "flare",
    "description": "flare",
    "children": [
        {
            "name": "analytic",
            "description": "analytics",
            "children": [
                {
                    "name": "cluster",
                    "description": "cluster",
                    "children": [
                        {"name": "Agglomer", "description": "AgglomerativeCluster", "size": 3938},
                        {"name": "Communit", "description": "CommunityStructure", "size": 3812},
                        {"name": "Hierarch", "description": "HierarchicalCluster", "size": 6714},
                        {"name": "MergeEdg", "description": "MergeEdge", "size": 743}
                    ]
                }, etc..
This format seems fairly easy to replicate in a dictionary in Python, with three fields for each entry: name, description, and children[].
My problem here is actually HOW to extract the data. The file linked above has "objects" structured as:
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
I will need the id, is_a and name fields. I have tried using Python to parse this, but I can't seem to find a way to locate each object.
Any ideas?
Here's a fairly simple way to parse the objects in your '.obo' file. It saves the object data into a dict with the id as the key and the name and is_a data saved in a list. Then it pretty-prints it using the standard json module's .dumps function.
For testing purposes, I used a truncated version of the file in your link that only includes up to id: GO:0000006.
This code ignores any objects that contain the is_obsolete field. It also removes the description info from the is_a fields; I figured you probably wanted that, but it's easy enough to disable that functionality.
#!/usr/bin/env python

''' Parse object data from a .obo file
    From http://stackoverflow.com/q/32989776/4014959
    Written by PM 2Ring 2015.10.07
'''

from __future__ import print_function, division
import json
from collections import defaultdict

fname = "go-basic.obo"
term_head = "[Term]"

#Keep the desired object data here
all_objects = {}

def add_object(d):
    #print(json.dumps(d, indent = 4) + '\n')

    #Ignore obsolete objects
    if "is_obsolete" in d:
        return

    #Gather desired data into a single list,
    # and store it in the main all_objects dict
    key = d["id"][0]
    is_a = d["is_a"]
    #Remove the next line if you want to keep the is_a description info
    is_a = [s.partition(' ! ')[0] for s in is_a]
    all_objects[key] = d["name"] + is_a

#A temporary dict to hold object data
current = defaultdict(list)

with open(fname) as f:
    #Skip header data
    for line in f:
        if line.rstrip() == term_head:
            break

    for line in f:
        line = line.rstrip()
        if not line:
            #ignore blank lines
            continue
        if line == term_head:
            #end of term
            add_object(current)
            current = defaultdict(list)
        else:
            #accumulate object data into current
            key, _, val = line.partition(": ")
            current[key].append(val)

if current:
    add_object(current)

print("\nall_objects =")
print(json.dumps(all_objects, indent = 4, sort_keys=True))
output
all_objects =
{
    "GO:0000001": [
        "mitochondrion inheritance",
        "GO:0048308",
        "GO:0048311"
    ],
    "GO:0000002": [
        "mitochondrial genome maintenance",
        "GO:0007005"
    ],
    "GO:0000003": [
        "reproduction",
        "GO:0008150"
    ],
    "GO:0000006": [
        "high-affinity zinc uptake transmembrane transporter activity",
        "GO:0005385"
    ]
}
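If you then want the flare-style tree from the question, one possible sketch inverts the is_a links into children lists. This assumes the chosen root_id is present in all_objects; note that real GO data is a DAG, so a term with multiple parents will appear under each of them:

def build_tree(all_objects, root_id):
    #Build a flare-style node for root_id recursively
    name = all_objects[root_id][0]
    #children are the terms whose is_a list mentions root_id
    children = [build_tree(all_objects, term_id)
                for term_id, fields in all_objects.items()
                if root_id in fields[1:]]
    node = {"name": root_id, "description": name}
    if children:
        node["children"] = children
    return node

print(json.dumps(build_tree(all_objects, "GO:0000001"), indent = 4))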
I have a CSV file that is structured as below :
Store, Region, District, MallName, Location
1234,90,910,MallA,GMT
4567,87,902,MallB,EST
2468,90,811,MallC,PST
1357,87,902,MallD,CST
What I was able to accomplish with my iterative brow-beating was getting a format like so:
{
    "90": {
        "910": {
            "1234": {
                "name": "MallA",
                "location": "GMT"
            }
        },
        "811": {
            "2468": {
                "name": "MallC",
                "location": "PST"
            }
        }
    },
    "87": {
        "902": {
            "4567": {
                "name": "MallB",
                "location": "EST"
            },
            "1357": {
                "name": "MallD",
                "location": "CST"
            }
        }
    }
}
The code below is stripped down to match the sample data set I provided, but you get the idea as to what is happening. Again, it's very iterative and non-Pythonic, which is something I'm trying to move away from. (If anyone feels the helper procedures would be worthwhile to post, I can.)
#************
# Main()
#************
import json

# getFilePath, addStoreDetails, addStore and addDistrict are the
# helper procedures mentioned above
dictHierarchy = {}
with open(getFilePath(), 'r') as f:
    content = [line.strip('\n') for line in f.readlines()]
for data in content:
    data = data.split(",")
    myRegion = data[1]
    myDistrict = data[2]
    myName = data[3]
    myLocation = data[4]
    myStore = data[0]
    if myRegion in dictHierarchy:
        #check for District
        if myDistrict in dictHierarchy[myRegion]:
            #check for Store
            dictHierarchy[myRegion][myDistrict].update({myStore: addStoreDetails(data)})
        else:
            #add district
            dictHierarchy[myRegion].update({myDistrict: addStore(data)})
    else:
        #add region
        dictHierarchy.update({myRegion: addDistrict(data)})
with open('hierarchy.json', 'w') as outfile:
    json.dump(dictHierarchy, outfile)
Obsessive compulsive me looked at the JSON output above and thought that to someone blindly opening the file it looks like a hodge-podge. What I was hoping to do for plain-text readability is group the data and throw it into JSON format as so:
{"Regions":[
{"Region":"90", "Districts":[
{"District":"910", "Stores":[
{"Store":"1234", "name":"MallA", "location":"GMT"}]},
{"District":"811", "Stores":[
{"Store":"2468", "name":"MallC", "location":"PST"}]}]},
{"Region":"87", "Districts":[
{"District":"902", "Stores":[
{"Store":"4567", "name":"MallB", "location":"EST"},
{"Store":"1357", "name":"MallD", "location":"CST"}]}]}]}
Long story short, I wasted quite some time today trying to sort out how to actually populate the data structure in Python and essentially ended up nowhere. Is there a clean, Pythonic way to achieve this? Is it even worth the effort?
I've added headers to your input like:
Store,Region,District,name,location
1234,90,910,MallA,GMT
4567,87,902,MallB,EST
2468,90,811,MallC,PST
1357,87,902,MallD,CST
then used Python's csv reader and groupby like this:
import csv
from itertools import groupby
from operator import itemgetter

with open('in.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    regions = []
    regions_dict = sorted(list(reader), key=itemgetter('Region'))
    for region_id, region_group in groupby(regions_dict, itemgetter('Region')):
        districts = []
        regions.append({'Region': region_id, 'Districts': districts})
        districts_dict = sorted(region_group, key=itemgetter('District'))
        for district_id, district_group in groupby(districts_dict, itemgetter('District')):
            districts.append({'District': district_id, 'Stores': list(district_group)})

print regions
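Note that each store dict still carries its Region and District columns from the CSV row; a sketch of trimming those and dumping the result to JSON (assuming the headers added above):

import json

# strip the grouping columns from each store row before dumping
for region in regions:
    for district in region['Districts']:
        district['Stores'] = [{'Store': s['Store'], 'name': s['name'], 'location': s['location']}
                              for s in district['Stores']]

with open('hierarchy.json', 'w') as outfile:
    json.dump({'Regions': regions}, outfile, indent=4)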