I have about 5k JSON files structured similarly. I need to get all the values for the "loc" key from all the files and store them in a separate JSON file or two. The total number of values for the "loc" key across all files comes to 78 million. How can I get this done in the most optimized and fastest way possible?
The structure of the content in all files looks like:
{
    "urlset": {
        "#xmlns": "http://www.sitemaps.org/schemas/sitemap/0.9",
        "#xmlns:xhtml": "http://www.w3.org/1999/xhtml",
        "url": [
            {
                "loc": "https://www.example.com/a",
                "xhtml:link": {
                    "#rel": "alternate",
                    "#href": "android-app://com.example/xyz"
                },
                "lastmod": "2020-12-25",
                "priority": "0.8"
            },
            {
                "loc": "https://www.exampe.com/b",
                "xhtml:link": {
                    "#rel": "alternate",
                    "#href": "android-app://com.example/xyz"
                },
                "lastmod": "2020-12-25",
                "priority": "0.8"
            }
        ]
    }
}
I am looking for an output JSON file like:
["https://www.example.com/a","https://www.example.com/b"]
What I am currently doing is:
import glob
import json

path = r'/home/spark/'  # path to folder containing files
link_list = []  # list of required links
li = ""  # contains text of all files combined
all_files = glob.glob(path + "/*")
# Looping through each file
for i in range(0, len(all_files)):
    filename = all_files[i]
    with open(filename, "r") as f:
        li = li + f.read()
# Retrieving a link from every "loc" key
for k in range(0, 7800000):
    lk = ((li.split('"loc"', 1)[1]).split('"', 1)[1]).split(" ", 1)[0]
    link = lk.replace('",', '')
    link_list.append(link)
with open("output.json", "w") as f:
    f.write(json.dumps(link_list))
I guess this is the worst solution anyone could come up with :D, so I need to optimize it to do the job fast and efficiently.
import json
import glob

dict_results = {}
dict_results['links'] = []
for filename in glob.glob("*.json"):
    with open(filename, "r") as msg:
        data = json.load(msg)
        for url in data['urlset']['url']:
            dict_results['links'].append(url['loc'])
print(dict_results)
If you just want all the links, that should do it. Just write the result to a file, in text or binary, as you wish afterwards.
Output:
{'links': ['https://www.example.com/a', 'https://www.exampe.com/b']}
In case you just want a list (and not a JSON object):
import json
import glob

list_results = []
for filename in glob.glob("*.json"):
    with open(filename, "r") as msg:
        data = json.load(msg)
        for url in data['urlset']['url']:
            list_results.append(url['loc'])
print(list_results)
Output:
['https://www.example.com/a', 'https://www.exampe.com/b']
If you are working with plain-text JSON files, as it seems, and you know/trust those files, the fastest way would certainly be this one:
import glob

list_results = []
for filename in glob.glob("*.json"):
    with open(filename, "r") as msg:
        for line in msg:
            # assumes one "loc" entry per line, as in the pretty-printed sample
            if '"loc"' in line:
                list_results.append(line.split('"')[3])
print(list_results)
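To produce the output file the question asks for, you can dump the collected list with json.dump; with 78 million URLs you may want to split the result across two files, as the question allows. A minimal sketch, assuming list_results was filled by one of the snippets above:
import json

# split the collected links across two JSON files to keep each one smaller
half = len(list_results) // 2
with open("output1.json", "w") as f:
    json.dump(list_results[:half], f)
with open("output2.json", "w") as f:
    json.dump(list_results[half:], f)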
I am trying to delete an element from a JSON file. Here is my JSON file:
before:
{
    "names": [
        {
            "PrevStreak": false,
            "Streak": 0,
            "name": "Brody B#3719",
            "points": 0
        },
        {
            "PrevStreak": false,
            "Streak": 0,
            "name": "XY_MAGIC#1111",
            "points": 0
        }
    ]
}
after running the script:
{
    "names": [
        {
            "PrevStreak": false,
            "Streak": 0,
            "name": "Brody B#3719",
            "points": 0
        }
    ]
}
How would I do this in Python? The file is stored locally, and I decide which element to delete based on the name in each element.
Thanks
I would load the file, remove the item, and then save it again. Example:
import json

with open("filename.json") as f:
    data = json.load(f)
data["names"].pop(1)  # or iterate through the entries to find the matching name
with open("filename.json", "w") as f:
    json.dump(data, f)
You will have to read the file, convert it to a native Python data type (e.g. a dictionary), then delete the element and save the file. In your case something like this could work:
import json

filepath = 'data.json'
with open(filepath, 'r') as fp:
    data = json.load(fp)
del data['names'][1]
with open(filepath, 'w') as fp:
    json.dump(data, fp)
Try this:
# importing the module
import ast

# reading the data from the file
with open('dictionary.txt') as f:
    data = f.read()
print("Data type before reconstruction : ", type(data))
# reconstructing the data as a dictionary
# (caveat: ast.literal_eval parses Python literals, so real JSON input with
# true/false/null will fail; json.loads is the safer choice there)
a_dict = ast.literal_eval(data)
result = {"names": [a for a in a_dict["names"] if a.get("name") != "XY_MAGIC#1111"]}
import json

with open("test.json", 'r') as f:
    data = json.loads(f.read())
names = data.get('names')
for idx, name in enumerate(names):
    if name['name'] == 'XY_MAGIC#1111':
        del names[idx]
        break
print(names)
In order to read the file, the best approach is to use the with statement, after which you can use Python's json library to convert the JSON string into a Python dict. Once you have the dict, you can access the values and perform the operations you need, then convert it back to JSON using json.dumps() and save it.
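A minimal sketch of that approach, filtering out the entry by name and writing the result back (the file name is assumed):
import json

with open("test.json", "r") as f:
    data = json.load(f)

# keep every entry except the one whose name matches
data["names"] = [n for n in data["names"] if n["name"] != "XY_MAGIC#1111"]

with open("test.json", "w") as f:
    f.write(json.dumps(data, indent=4))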
This does the right thing using the Python json module, and pretty-prints the JSON back to the file afterwards:
import json

jsonpath = '/path/to/json/file.json'

with open(jsonpath) as file:
    j = json.loads(file.read())

names_to_remove = ['XY_MAGIC#1111']
# iterate over a copy of the list: removing items from a list while
# iterating over it directly would skip elements
for element in j['names'][:]:
    if element['name'] in names_to_remove:
        j['names'].remove(element)

with open(jsonpath, 'w') as file:
    file.write(json.dumps(j, indent=4))
I have a program that is supposed to do the following:
There are multiple folders, each containing a JSON file called "installed-files.json".
The program is supposed to read the JSON file from each of the sub-folders.
If the JSON files are there, convert them into xlsx format.
The xlsx file should have worksheets named after the sub-folders,
e.g. if the name of the sub-folder is CNA, the sheet name should be CNA, etc.
Below is the code snippet:
import pandas as pd
import json
import os

def traverse_dir(rootDir, file_name):
    dir_names = []
    for names in os.listdir(rootDir):
        entry_path = os.path.join(rootDir, names)
        if os.path.isdir(entry_path):
            dir_names.append(entry_path)
    for fil_name in dir_names:
        file_path = os.path.join(fil_name, file_name)
        print(file_path)
        if os.path.isfile(file_path):
            with open(file_path) as jf:
                data = json.load(jf)
            df = pd.DataFrame(data)
            df1 = pd.DataFrame(data)
            new_df = df[df.columns.difference(['SHA256'])]
            new_df1 = df1[df.columns.difference(['SHA256'])]
            with pd.ExcelWriter('abc.xlsx') as writer:
                new_df.to_excel(writer, sheet_name='BRA', index=False)
                new_df1.to_excel(writer, sheet_name='CNA', index=False)
        else:
            print("file not found")

rootDir = <Full_Path_To_Sub-dirs>
file_name = 'installed-files.json'
traverse_dir(rootDir, file_name)
Below is the sample JSON file content:
[
    {
        "SHA256": "123456",
        "Name": "/system/Home.apk",
        "Size": 99250072
    },
    {
        "SHA256": "987654",
        "Name": "/system/Setup.apk",
        "Size": 86578788
    },
    {
        "SHA256": "457457",
        "Name": "/system/SApp.apk",
        "Size": 72207922
    },
    {
        "SHA256": "747645",
        "Name": "/system/Lib.apk",
        "Size": 57960376
    },
    {
        "SHA256": "368764",
        "Name": "/system/mium.so",
        "Size": 51161376
    },
    {
        "SHA256": "34455",
        "Name": "/system/Smart.apk",
        "Size": 50944780
    },
    {
        "SHA256": "66777",
        "Name": "/system/framework/work.jar",
        "Size": 24772514
    }
]
Problem Statement:
The Excel sheets are getting created with the sub-folder names (BRA and CNA), but the data is only coming from CNA. I can confirm this because the JSON files in both sub-directories initially had the same data. To test my use cases I therefore modified the content of BRA first, but after executing the code those changes were not present in the new Excel file in either of the two tabs that got created. I then modified the JSON file in the CNA sub-folder, and now when I execute the program I can see the modified data in both tabs of the Excel file.
Any ideas why that could be happening?
Your problem is that you are writing the Excel file every time you find a file, and the data you are reading into both data frames is the same because you are getting it from the same JSON file. You should also check the line new_df1 = df1[df.columns.difference(['SHA256'])], because you are mixing df and df1; I'm not sure if that is what you wanted.
Either way, here is a working code snippet:
import pandas as pd
import json
import os

def traverse_dir(root: str, file_name: str):
    data_cna = None
    data_bra = None
    for dir in os.listdir(root):
        dir_path = os.path.join(root, dir)
        # Grabs only the directories
        if not os.path.isdir(dir_path):
            continue
        for file in os.listdir(dir_path):
            file_path = os.path.join(dir_path, file)
            # Grabs only the files within the directories and with the name passed
            if not os.path.isfile(file_path):
                continue
            if file != file_name:
                continue
            if dir == "CNA":
                with open(file_path) as freader:
                    data_cna = json.load(freader)
            elif dir == "BRA":
                with open(file_path) as freader:
                    data_bra = json.load(freader)
            else:
                # Other directory names are ignored
                continue
    if data_cna is None:
        raise ValueError(f"{file_name} not found in {os.path.join(root, 'CNA')}")
    if data_bra is None:
        raise ValueError(f"{file_name} not found in {os.path.join(root, 'BRA')}")
    df_cna = pd.DataFrame(data_cna)[pd.DataFrame(data_cna).columns.difference(['SHA256'])]
    # Shouldn't this be: df_bra = pd.DataFrame(data_bra)[pd.DataFrame(data_bra).columns.difference(['SHA256'])]?
    # I.e. replace the data_cna difference by data_bra. Check your code.
    df_bra = pd.DataFrame(data_bra)[pd.DataFrame(data_cna).columns.difference(['SHA256'])]
    with pd.ExcelWriter('abc.xlsx') as writer:
        df_cna.to_excel(writer, sheet_name='CNA', index=False)
        df_bra.to_excel(writer, sheet_name='BRA', index=False)

rootDir = "."
file_name = 'installed-files.json'
traverse_dir(rootDir, file_name)
CNA JSON:
[
    {
        "SHA256": "123456",
        "Name": "/system/Home.apk",
        "Size": 99250072
    },
    {
        "SHA256": "987654",
        "Name": "/system/Setup.apk",
        "Size": 86578788
    }
]
BRA JSON:
[
    {
        "SHA256": "66777",
        "Name": "/system/framework/work.jar",
        "Size": 24772514
    }
]
xlsx output: one sheet per folder, CNA and BRA (screenshots omitted).
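If the set of sub-folders isn't fixed to CNA and BRA, a more general sketch of the same fix keeps one DataFrame per sub-folder and opens the ExcelWriter once, using each folder's name as its sheet name (write_sheets is a made-up name, and this is untested against the asker's directory tree):
import json
import os

import pandas as pd

def write_sheets(root, file_name, out_path='abc.xlsx'):
    frames = {}
    for entry in os.listdir(root):
        dir_path = os.path.join(root, entry)
        if not os.path.isdir(dir_path):
            continue
        file_path = os.path.join(dir_path, file_name)
        if not os.path.isfile(file_path):
            continue
        with open(file_path) as fh:
            data = json.load(fh)
        df = pd.DataFrame(data)
        # drop the SHA256 column, as in the original code
        frames[entry] = df[df.columns.difference(['SHA256'])]
    # open the writer once, outside the loop, so earlier sheets aren't overwritten
    with pd.ExcelWriter(out_path) as writer:
        for sheet_name, df in frames.items():
            df.to_excel(writer, sheet_name=sheet_name, index=False)

write_sheets(".", "installed-files.json")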
I want to create files named after the dictionary keys. For instance, I have a JSON file like this:
[{"Word": ["0"], "URL": "http://www..."},
{"Word": ["10"], "URL": "http://www..."},
{"Word": ["100"], "URL": "http://www..."},
{"Word": ["1000"], "URL": "http://www..."},
{"Word": ["11"], "URL": "http://www..."},]
and I want to create a file named after each key value, like "0", "10", "100", "1000", and then download the video from the "URL" value of each dictionary.
I am trying to read the JSON file with this code:
import json

with open('filename.json') as f:
    data = json.load(f)
for words in data:
    x = words["Word"]
When I print x in that loop I get ["0"], but I want just 0 without the quotes or brackets, because I will use this value to create a file with this method:
os.mkdir('Videos')
os.makedirs('Videos/0')
or
for key_name in keys:
    os.makedirs('Videos/"key_name "')
How can I read that JSON file and create new files named after the keys in the JSON file?
Thanks
You access the value of ["Word"] at position 0 to get the first item in the list.
import os
import json

with open('file.txt') as f:
    data = json.load(f)
for words in data:
    x = words["Word"][0]
    new_dir = f'Videos/{x}'
    print(f'Create dir : {new_dir}')
    os.makedirs(new_dir)
Output:
Create dir : Videos/0
Create dir : Videos/10
Create dir : Videos/100
Create dir : Videos/1000
Create dir : Videos/11
Try to get the first element of the list:
x = words["Word"][0]
I am trying to parse a big json file (hundreds of gigs) to extract information from its keys. For simplicity, consider the following example:
import json
import random, string

# To create a random key
def random_string(length):
    return "".join(random.choice(string.lowercase) for i in range(length))

# Create the dictionary
dummy = {random_string(10): random.sample(range(1, 1000), 10) for times in range(15)}

# Dump the dictionary into a json file
with open("dummy.json", "w") as fp:
    json.dump(dummy, fp)
Then, I use ijson in python 2.7 to parse the file:
file_name = "dummy.json"
with open(file_name, "r") as fp:
for key in dummy.keys():
print "key: ", key
parser = ijson.items(fp, str(key) + ".item")
for number in parser:
print number,
I was expecting to retrieve all the numbers in the lists corresponding to the keys of the dict. However, I got:
IncompleteJSONError: Incomplete JSON data
I am aware of this post: Using python ijson to read a large json file with multiple json objects, but in my case I have a single JSON file that is well formed and has a relatively simple schema. Any ideas on how I can parse it? Thank you.
ijson has an iterator interface for dealing with large JSON files, allowing you to read the file lazily. You can process the file in small chunks and save the results somewhere else.
Calling ijson.parse() yields triples of (prefix, event, value).
Some JSON:
{
    "europe": [
        {"name": "Paris", "type": "city"},
        {"name": "Rhein", "type": "river"}
    ]
}
Code:
import ijson

data = ijson.parse(open(FILE_PATH, 'r'))
for prefix, event, value in data:
    if event == 'string':
        print(value)
Output:
Paris
city
Rhein
river
Reference: https://pypi.python.org/pypi/ijson
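Applied to the question's dummy.json, the same event stream can collect every number under each top-level key in a single pass over the file (a sketch; numbers_by_key is a made-up name):
import ijson

numbers_by_key = {}
with open("dummy.json", "r") as fp:
    for prefix, event, value in ijson.parse(fp):
        if event == 'number':
            key = prefix.split('.')[0]  # top-level key this number belongs to
            numbers_by_key.setdefault(key, []).append(value)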
The sample JSON content file is given below; it has records for two people, but it might as well have 2 million records.
[
    {
        "Name" : "Joy",
        "Address" : "123 Main St",
        "Schools" : [
            "University of Chicago",
            "Purdue University"
        ],
        "Hobbies" : [
            {
                "Instrument" : "Guitar",
                "Level" : "Expert"
            },
            {
                "percussion" : "Drum",
                "Level" : "Professional"
            }
        ],
        "Status" : "Student",
        "id" : 111,
        "AltID" : "J111"
    },
    {
        "Name" : "Mary",
        "Address" : "452 Jubal St",
        "Schools" : [
            "University of Pensylvania",
            "Washington University"
        ],
        "Hobbies" : [
            {
                "Instrument" : "Violin",
                "Level" : "Expert"
            },
            {
                "percussion" : "Piano",
                "Level" : "Professional"
            }
        ],
        "Status" : "Employed",
        "id" : 112,
        "AltID" : "M112"
    }
]
I created a generator which returns each person's record as a JSON object. The code looks like the below. This is not the generator code itself; changing a couple of lines would make it one.
import json

curly_idx = []
jstr = ""
first_curly_found = False
with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    # Reading file line by line
    line = fp.readline()
    lnum = 0
    while line:
        for a in line:
            if a == '{':
                curly_idx.append(lnum)
                first_curly_found = True
            elif a == '}':
                curly_idx.pop()
        # when the right curly for every left curly is found,
        # it would mean that one complete data element was read
        if len(curly_idx) == 0 and first_curly_found:
            jstr = f'{jstr}{line}'
            jstr = jstr.rstrip()
            jstr = jstr.rstrip(',')
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
                print(jstr)
            jstr = ""
            line = fp.readline()
            lnum += 1
            continue
        if first_curly_found:
            jstr = f'{jstr}{line}'
        line = fp.readline()
        lnum += 1
        if lnum > 100:
            break
You are starting more than one parsing iteration with the same file object without resetting it. The first call to ijson will work, but will move the file object to the end of the file; when you then pass the same object to ijson again, it will complain because there is nothing left to read from the file.
Try opening the file each time you call ijson; alternatively, you can seek to the beginning of the file after each call so the file object can read the file data again.
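A minimal sketch of the seek-based fix, reusing the dummy dict and file from the question's snippet:
import ijson

file_name = "dummy.json"
with open(file_name, "r") as fp:
    for key in dummy.keys():  # dummy comes from the question's setup code
        fp.seek(0)  # rewind so ijson gets a fresh read of the file each time
        parser = ijson.items(fp, str(key) + ".item")
        for number in parser:
            print(number)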
If you are working with JSON in the following format, you can use ijson.items().
sample json:
[
    {"id":2,"cost":0,"test":0,"testid2":255909890011279,"test_id_3":0,"meeting":"daily","video":"paused"},
    {"id":2,"cost":0,"test":0,"testid2":255909890011279,"test_id_3":0,"meeting":"daily","video":"paused"}
]
import gzip
import ijson
from pathlib import Path

input = 'file.txt'
res = []
if Path(input).suffix[1:].lower() == 'gz':
    input_file_handle = gzip.open(input, mode='rb')
else:
    input_file_handle = open(input, 'rb')

for json_row in ijson.items(input_file_handle, 'item'):
    res.append(json_row)
I have a CSV file that is structured as below:
Store, Region, District, MallName, Location
1234,90,910,MallA,GMT
4567,87,902,MallB,EST
2468,90,811,MallC,PST
1357,87,902,MallD,CST
What I was able to accomplish with my iterative brow-beating was getting a format like so:
{
    "90": {
        "910": {
            "1234": {
                "name": "MallA",
                "location": "GMT"
            }
        },
        "811": {
            "2468": {
                "name": "MallC",
                "location": "PST"
            }
        }
    },
    "87": {
        "902": {
            "4567": {
                "name": "MallB",
                "location": "EST"
            },
            "1357": {
                "name": "MallD",
                "location": "CST"
            }
        }
    }
}
The code below is stripped down to match the sample data set I provided, but you get the idea as to what is happening. Again, it's very iterative and non-Pythonic, which is what I'm trying to move away from. (If anyone feels the helper procedures would be worthwhile to post, I can.)
import json

#************
# Main()
#************
# getFilePath(), addStoreDetails(), addStore() and addDistrict() are the
# stripped-out helper procedures mentioned above
dictHierarchy = {}
with open(getFilePath(), 'r') as f:
    content = [line.strip('\n') for line in f.readlines()]
for data in content:
    data = data.split(",")
    myRegion = data[1]
    myDistrict = data[2]
    myName = data[3]
    myLocation = data[4]
    myStore = data[0]
    if myRegion in dictHierarchy:
        # check for District
        if myDistrict in dictHierarchy[myRegion]:
            # check for Store
            dictHierarchy[myRegion][myDistrict].update({myStore: addStoreDetails(data)})
        else:
            # add district
            dictHierarchy[myRegion].update({myDistrict: addStore(data)})
    else:
        # add region
        dictHierarchy.update({myRegion: addDistrict(data)})
with open('hierarchy.json', 'w') as outfile:
    json.dump(dictHierarchy, outfile)
Obsessive compulsive me looked at the JSON output above and thought that to someone blindly opening the file it looks like a hodge-podge. What I was hoping to do for plain-text readability is group the data and throw it into JSON format as so:
{"Regions":[
{"Region":"90", "Districts":[
{"District":"910", "Stores":[
{"Store":"1234", "name":"MallA", "location":"GMT"}]},
{"District":"811", "Stores":[
{"Store":"2468", "name":"MallC", "location":"PST"}]}]},
{"Region":"87", "Districts":[
{"District":"902", "Stores":[
{"Store":"4567", "name":"MallB", "location":"EST"},
{"Store":"1357", "name":"MallD", "location":"CST"}]}]}]}
Long story short, I wasted quite some time today trying to sort out how to actually populate the data structure in Python and essentially ended up nowhere. Is there a clean, Pythonic way to achieve this? Is it even worth the effort?
I've added headers to your input like:
Store,Region,District,name,location
1234,90,910,MallA,GMT
4567,87,902,MallB,EST
2468,90,811,MallC,PST
1357,87,902,MallD,CST
then used Python's csv reader and groupby like this:
import csv
from itertools import groupby
from operator import itemgetter

with open('in.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    regions = []
    rows = sorted(list(reader), key=itemgetter('Region'))
    for region_id, region_group in groupby(rows, itemgetter('Region')):
        districts = []
        regions.append({'Region': region_id, 'Districts': districts})
        districts_sorted = sorted(region_group, key=itemgetter('District'))
        for district_id, district_group in groupby(districts_sorted, itemgetter('District')):
            districts.append({'District': district_id, 'Stores': list(district_group)})
print(regions)
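To get exactly the {"Regions": [...]} document from the question and write it to disk, one more json.dump step should do it (a small, hedged addition to the snippet above):
import json

# wrap the grouped list in the top-level "Regions" key and save it
with open('hierarchy.json', 'w') as outfile:
    json.dump({'Regions': regions}, outfile, indent=4)
Note that each store dict produced by groupby still carries its Region and District columns from the CSV; drop them with a dict comprehension if you want the output to match the desired format exactly.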