Python/JSON - Check for a specific object in multiple files

I have a huge number of JSON files (about 4000) and I need to check every single one of them for a specific object. My code looks like the following:
import os
import json

files = sorted(os.listdir("my files path"))
for f in files:
    if f.endswith(".json"):
        myFile = open("my path\\" + f)
        myJson = json.load(myFile)
        if myJson["something"]["something"]["what im looking for"] == "ACTION":
            pass  # do stuff
        myFile.close()
As you can imagine this is taking a lot of execution time, and I was wondering if there is a quicker way...?

Here's a multithreaded approach that may help you:
from glob import glob
import json
from concurrent.futures import ThreadPoolExecutor
import os

BASEDIR = 'myDirectory'  # the directory containing the json files

def process(filename):
    with open(filename) as infile:
        data = json.load(infile)
        if data.get('foo', '') == 'ACTION':
            pass  # do stuff

def main():
    with ThreadPoolExecutor() as executor:
        executor.map(process, glob(os.path.join(BASEDIR, '*.json')))

if __name__ == '__main__':
    main()
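The example above checks a flat key ('foo'); to match the nested structure from the question you would adapt the lookup inside process. A minimal sketch, assuming the key names from the original post (chained .get() calls avoid a KeyError if a level is missing):
def process(filename):
    with open(filename) as infile:
        data = json.load(infile)
    # key names copied from the question; replace them with the real ones
    value = data.get("something", {}).get("something", {}).get("what im looking for")
    if value == "ACTION":
        pass  # do stuff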

Related

How to optimize the below code to read a very large number of files?

I have a folder containing about 5 million files and I have to read the content of each file so that I can form a dataframe. It takes a very long time to do that. Is there any way I can optimize the code below to speed up the process?
import os

new_list = []
file_name = []
count = 0
for root, dirs, files in os.walk('Folder_5M'):
    for file in files:
        count += 1
        file_name.append(file)
        with open(os.path.join(root, file), 'rb') as f:
            text = f.read()
            new_list.append(text)
This is an IO-bound task, so multi-threading is the tool for the job. In Python there are two common ways to implement it: one using a thread pool, and the other using asyncio, which works with an event loop. The event loop usually has better performance; the challenge is to limit the number of tasks executing at the same time. Fortunately, Andrei wrote a very good solution for this.
This code creates an event loop that reads the files concurrently. The parameter MAX_NUMBER_OF_THREADS defines how many tasks can execute at the same time. Try playing with this number for better performance, as the best value depends on the machine that runs it.
import os
import asyncio

async def read_file(file_path: str) -> str:
    with open(file_path, "r") as f:
        return f.read()

async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)

    async def sem_task(task):
        async with semaphore:
            return await task

    return await asyncio.gather(*(sem_task(task) for task in tasks))

MAX_NUMBER_OF_THREADS = 100

file_name = []
file_path = []
for path, subdirs, files in os.walk("Folder_5M"):
    for name in files:
        file_path.append(os.path.join(path, name))
        file_name.append(name)

count = len(file_name)
tasks = [read_file(file) for file in file_path]
asyncio.run(gather_with_concurrency(MAX_NUMBER_OF_THREADS, *tasks))
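Note that the plain open/read inside read_file blocks the event loop's thread while it runs. If you want the reads themselves to happen in worker threads, one option on Python 3.9+ is asyncio.to_thread; a rough variant of read_file, offered only as a sketch:
async def read_file(file_path: str) -> str:
    # hand the blocking open/read to a worker thread so the event loop stays responsive
    def _read() -> str:
        with open(file_path, "r") as f:
            return f.read()
    return await asyncio.to_thread(_read)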
Here's an idea for how you could use multiprocessing for this.
Constructing a list of files resulting from os.walk is likely to be very fast. It's the processing of those files that's going to take time. With multiprocessing you can do a lot of that work in parallel.
Each process opens the given file, processes it and creates a dataframe. When all of the parallel processing has been carried out, you then concatenate the returned dataframes. This last part will be CPU intensive, and there's no way (that I can think of) to share that load.
from pandas import DataFrame, concat
from os import walk
from os.path import join, expanduser
from multiprocessing import Pool

HOME = expanduser('~')

def process(filename):
    try:
        with open(filename) as data:
            df = DataFrame()
            # analyse your data and populate the dataframe here
            return df
    except Exception:
        return DataFrame()

def main():
    with Pool() as pool:
        filenames = []
        for root, _, files in walk(join(HOME, 'Desktop')):
            for file in files:
                filenames.append(join(root, file))
        ar = pool.map_async(process, filenames)
        master = concat(ar.get())
        print(master)

if __name__ == '__main__':
    main()

How do I fix my code so that it is automated?

I have the below code that takes my standardized .txt file and converts it into a JSON file perfectly. The only problem is that I sometimes have over 300 files, and doing this manually (i.e. changing the number at the end of the file name and re-running the script) is too much and takes too long. I want to automate this. The files, as you can see, reside in one folder/directory, and I am placing the JSON files in a different folder/directory, essentially keeping the naming convention standardized, except that the output ends with .json instead of .txt; the prefixes or file names stay the same. An example would be: CRAZY_CAT_FINAL1.TXT, CRAZY_CAT_FINAL2.TXT, and so on, all the way to file 300. How can I automate this, keep the file naming convention in place, and read and output the files to different folders/directories? I have tried, but can't seem to get this to iterate. Any help would be greatly appreciated.
import glob
import time
from glob import glob
import pandas as pd
import numpy as np
import csv
import json
csvfile = open(r'C:\Users\...\...\...\Dog\CRAZY_CAT_FINAL1.txt', 'r')
jsonfile = open(r'C:\Users\...\...\...\Rat\CRAZY_CAT_FINAL1.json', 'w')
reader = csv.DictReader(csvfile)
out = json.dumps([row for row in reader])
jsonfile.write(out)
****************************************************************************
I also have this code, using the Python library requests. How do I change it so that it uploads multiple JSON files that follow the standard naming convention? The files end with a number...
import requests

# function to post to the API
def postData(xactData):
    url = 'http link'
    headers = {
        'Content-Type': 'application/json',
        'Content-Length': str(len(xactData)),
        'Request-Timeout': '60000'
    }
    return requests.post(url, headers=headers, data=xactData)

# read data
f = open(r'filepath/file/file.json', 'r')
data = f.read()
print(data)

# post data
result = postData(data)
print(result)
Use f-strings?
for i in range(1, 301):
    # raw f-strings (rf'...') so the backslashes in the Windows paths are not treated as escapes
    csvfile = open(rf'C:\Users\...\...\...\Dog\CRAZY_CAT_FINAL{i}.txt', 'r')
    jsonfile = open(rf'C:\Users\...\...\...\Rat\CRAZY_CAT_FINAL{i}.json', 'w')
import time
from glob import glob
import csv
import json
import os

INPATH = r'C:\Users\...\...\...\Dog'
OUTPATH = r'C:\Users\...\...\...\Rat'

for csvname in glob(INPATH + r'\*.txt'):
    jsonname = OUTPATH + '/' + os.path.basename(csvname)[:-3] + 'json'
    reader = csv.DictReader(open(csvname, 'r'))
    json.dump(list(reader), open(jsonname, 'w'))
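The upload half of the question can be handled the same way: loop over the generated .json files and post each one with the postData function from the question. A minimal sketch, assuming that function is available and using a placeholder directory:
import glob

for jsonname in sorted(glob.glob(r'filepath/file/*.json')):
    with open(jsonname, 'r') as f:
        data = f.read()
    result = postData(data)              # postData as defined in the question
    print(jsonname, result.status_code)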

Python JSON import to MongoDB from ZIP files

I have a script that gets all of the .zip files from a folder, then one by one, opens the zip file, loads the content of the JSON file inside and imports this to MongoDB.
The error I am getting is the JSON object must be str, bytes or bytearray, not 'TextIOWrapper'
The code is:
import json
import logging
import logging.handlers
import os
from logging.config import fileConfig
from pymongo import MongoClient

def import_json():
    try:
        client = MongoClient('5.57.62.97', 27017)
        db = client['vuln_sets']
        coll = db['vulnerabilities']
        basepath = os.path.dirname(__file__)
        filepath = os.path.abspath(os.path.join(basepath, ".."))
        archive_filepath = filepath + '/vuln_files/'
        filedir = os.chdir(archive_filepath)
        for item in os.listdir(filedir):
            if item.endswith('.json'):
                file_name = os.path.abspath(item)
                fp = open(file_name, 'r')
                json_data = json.loads(fp)
                for vuln in json_data:
                    print(vuln)
                    coll.insert(vuln)
                os.remove(file_name)
    except Exception as e:
        logging.exception(e)
I can get this working for a single file but not for multiple files; i.e. for one file I wrote:
from zipfile import ZipFile
import json
import pymongo

archive = ZipFile("vulners_collections/cve.zip")
archived_file = archive.open(archive.namelist()[0])
archive_content = archived_file.read()
archived_file.close()

connection = pymongo.MongoClient("mongodb://localhost")
db = connection.vulnerability
vuln1 = db.vulnerability_collection

vulners_objects = json.loads(archive_content)
for item in vulners_objects:
    vuln1.insert(item)
From my comment above:
I have no experience with glob, but from skimming the docs I get the impression your archive_files is a simple list of file paths as strings, correct? You cannot perform actions like .open on a string (hence your error), so try changing your code to this:
...
archive_filepath = filepath + '/vuln_files/'
archive_files = glob.glob(archive_filepath + "/*.zip")
for file in archive_files:
    with open(file, "r") as currentFile:
        file_content = currentFile.read()
        vuln_content = json.loads(file_content)
        for item in vuln_content:
            coll.insert(item)
...
file is NOT a file object or anything else, just a simple string, so you can't call methods on it that strings don't support.
You are redefining your iterator by setting it to the result of the namelist method. You need a for loop inside the outer one to go through the contents of each zip file, and of course a new iterator variable; see the sketch below.
Also, isn't file.close wrong? The correct call is file.close().
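A rough sketch of that nested loop, assuming the same archive_files list and MongoDB collection (coll) as in the question's code:
from zipfile import ZipFile

for zip_path in archive_files:
    with ZipFile(zip_path) as archive:
        for member in archive.namelist():            # new, inner iterator variable
            with archive.open(member) as archived_file:
                vuln_content = json.loads(archived_file.read())
            for item in vuln_content:
                coll.insert(item)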
You can use json.load() to load the file directly, instead of json.loads():
fp = open(file_name, 'r')
json_data = json.load(fp)
fp.close()
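Or, as a small variant, a with block closes the file even if an exception is raised:
with open(file_name, 'r') as fp:
    json_data = json.load(fp)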

efficient way to read csv with numeric data in python

I'm trying to convert code written in MATLAB into Python.
I need to read a .dat file (it is a CSV file). The file has about 30 columns and thousands of rows containing (only!) decimal number data (in MATLAB it was read into a double matrix).
I'm asking for the fastest way to read the .dat file and the most similar object/array/... to store the data in.
I tried to read the file in both of the following ways:
my_data1 = numpy.genfromtxt('FileName.dat', delimiter=',' )
my_data2 = pd.read_csv('FileName.dat',delimiter=',')
Is there any better option?
pd.read_csv is pretty efficient as it is. To make it faster, you can try to use multiple cores to load your data in parallel. Here is a code example where I used joblib when I needed to speed up loading data with pd.read_csv and the subsequent processing.
from os import listdir
from os.path import dirname, abspath, isfile, join
import pandas as pd
import sys
import time
from datetime import datetime
# Parallel execution with joblib
from joblib import Parallel, delayed
import multiprocessing
# Garbage collector
import gc

# Number of cores
TOTAL_NUM_CORES = multiprocessing.cpu_count()
# Path of the raw data files
DATA_PATH = 'D:\\'
# Path to save the processed files
TARGET_PATH = 'C:\\'

def read_and_convert(f, num_files):
    #global i
    # Read the file
    dataframe = pd.read_csv(DATA_PATH + f, low_memory=False, header=None, names=['Symbol', 'Date_Time', 'Bid', 'Ask'], index_col=1, parse_dates=True)
    # Process the data (process_data is the author's own processing step; replace it with yours)
    data_ask_bid = process_data(dataframe)
    # Store processed data in target folder
    data_ask_bid.to_csv(TARGET_PATH + f)
    print(f)
    # Garbage collector. I needed to use this, otherwise my memory would get full after a few files, but you might not need it.
    gc.collect()

def main():
    # Counter for converted files
    global i
    i = 0
    start_time = time.time()
    # Get the paths for all the data files
    files_names = [f for f in listdir(DATA_PATH) if isfile(join(DATA_PATH, f))]
    # Load and process files in parallel
    Parallel(n_jobs=TOTAL_NUM_CORES)(delayed(read_and_convert)(f, len(files_names)) for f in files_names)
    # for f in files_names: read_and_convert(f, len(files_names))  # non-parallel
    print("\nTook %s seconds." % (time.time() - start_time))

if __name__ == "__main__":
    main()
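Independently of parallelism, read_csv itself is usually faster when you tell it the column types up front so it does not have to infer them. A hedged example for the original question, assuming all ~30 columns are plain floats, comma-separated, with no header row:
import numpy as np
import pandas as pd

# dtype skips type inference; header=None because the file is pure numeric data
my_data = pd.read_csv('FileName.dat', delimiter=',', header=None, dtype=np.float64)
matrix = my_data.values   # 2-D float array, the closest analogue of a MATLAB double matrix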

How to set useragent as QuickTime in Python script?

The server only allows access to the videos if the user agent is QuickTime. How do I add it to this script?
#!/usr/bin/env python
from os import pardir, rename, listdir, getcwd
from os.path import join
from urllib import urlopen, urlretrieve, FancyURLopener

class MyOpener(FancyURLopener):
    version = 'QuickTime/7.6.2 (verqt=7.6.2;cpu=IA32;so=Mac 10.5.8)'

def main():
    # Set up file paths.
    data_dir = 'data'
    ft_path = join(data_dir, 'titles.txt')
    fu_path = join(data_dir, 'urls.txt')

    # Open the files
    try:
        f_titles = open(ft_path, 'r')
        f_urls = open(fu_path, 'r')
    except:
        print "Make sure titles.txt and urls.txt are in the data directory."
        exit()

    # Read file contents into lists.
    titles = []
    urls = []
    for l in f_titles:
        titles.append(l)
    for l in f_urls:
        urls.append(l)

    # Create a dictionary and download the files.
    downloads = dict(zip(titles, urls))
    for title, url in downloads.iteritems():
        fpath = join(data_dir, title.strip().replace('\t', "").replace(" ", "_"))
        fpath += ".mov"
        urlretrieve(url, fpath)

if __name__ == "__main__": main()
This is actually described in the docs. Your code should look something like this:
#!/usr/bin/env python
import urllib
from os import pardir, rename, listdir, getcwd
from os.path import join

class MyOpener(urllib.FancyURLopener):
    version = 'QuickTime/7.6.2 (verqt=7.6.2;cpu=IA32;so=Mac 10.5.8)'

# This line tells urllib.urlretrieve and urllib.urlopen to use your MyOpener
# instead of the default urllib.FancyURLopener
urllib._urlopener = MyOpener()

def main():
    # lots of stuff
    for title, url in downloads.iteritems():
        fpath = join(data_dir, title.strip().replace('\t', "").replace(" ", "_"))
        fpath += ".mov"
        urllib.urlretrieve(url, fpath)
You can change it like this:
http://wolfprojects.altervista.org/changeua.php
Then try:
opener = MyOpener()
opener.retrieve(url, fpath)
instead of using urllib directly, and that should do the trick.
(I am not sure why overriding urllib's internals does not work, but they are internals and poking them is not guaranteed to work :( )
Also more info here:
http://docs.python.org/library/urllib.html#urllib.URLopener
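If overriding the opener still does not take effect, another Python 2 option is urllib2, where the User-Agent header is attached to the Request object explicitly. A rough sketch (the download loop and file naming from the original script are omitted):
import urllib2

QT_UA = 'QuickTime/7.6.2 (verqt=7.6.2;cpu=IA32;so=Mac 10.5.8)'

def fetch(url, fpath):
    # build a request that carries the QuickTime user agent
    request = urllib2.Request(url, headers={'User-Agent': QT_UA})
    response = urllib2.urlopen(request)
    with open(fpath, 'wb') as out:
        out.write(response.read())
    response.close()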
