I have a Python script that writes four JSON files (each a handful of key-value pairs) on every execution. The script runs around 100k times a day, producing 400k files daily. Each execution writes its four JSON files inside a dated directory with a run counter, for example:
/30/04/2021/run1/
/30/04/2021/run2/
...
/30/04/2021/run100k/
/01/05/2021/run1/
/02/05/2021/run2/
...
/31/05/2021/run100k/
I am exposing a read API (get_runs) for this, where callers can request the data for a given date range.
import json

def get_runs(from_date, to_date):
    files = []
    # construct the file paths for the given date range (from_date, to_date)
    # and append them to files
    return _read_files(files)

def _read_files(files):
    data = []
    for file in files:
        with open(file) as f:
            data.append(json.load(f))
    return data
As you can see, inside the get_runs call I read the files from the dated directories (for the given range) one by one in a loop, and this takes a long time to complete. Any suggestions on how to optimize this? I want it to be as fast as possible. Note that I cannot store these JSON objects in a database.
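One common direction (not from the question itself) is to parallelise the reads, since the work is I/O-bound rather than CPU-bound. A minimal sketch using a thread pool, assuming the date-range path construction is already in place; _read_file, _read_files_parallel and max_workers=32 are illustrative names and values, not part of the original API:

import json
from concurrent.futures import ThreadPoolExecutor

def _read_file(path):
    # load a single JSON file; error handling is left out of this sketch
    with open(path) as f:
        return json.load(f)

def _read_files_parallel(files, max_workers=32):
    # file reads are I/O-bound, so a thread pool can overlap many of them
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(_read_file, files))

The same shape works with ProcessPoolExecutor if JSON decoding itself becomes the bottleneck, at the cost of pickling the results between processes.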
Related
I'm working with JSON files and I've written some code that opens a single file, loads it into a pandas DataFrame, and performs some procedures on the data within. A snippet of this code follows:
response_dic=first_response.json()
print(response_dic)
base_df=pd.DataFrame(response_dic)
base_df.head()
The code then goes on to extract parts of the JSON data into dataframes, before merging and printing to CSV.
Where I want to develop the code is to have it iterate through a folder first, find the filenames that match my list of filenames I want to work on, and then perform the functions on those files. For example, if I have a folder with 1000 documents, I will only need to run the function on a sample of them.
I've created a CSV list of the account codes I want to work on, then imported the CSV and built a list of account codes as follows:
import csv

csv_file = open(r'C:\filepath', 'r')
cikas = []
cikbs = []
csv_file.readline()  # skip the header row
for a, b, c in csv.reader(csv_file, delimiter=','):
    cikas.append(a)
    cikbs.append(b)
midstring = [s for s in cikbs]
print(midstring)
My account codes are then stored in midstring, for example ['12345', '2468', '56789']. This means I can control which account codes are worked on by amending my CSV file in the future. These names will vary at different stages, hence I don't want to hard-code them at this stage.
What I would like the code to do is check the working directory and see whether there is a file that matches, for example, C:\Users*12345.json. If there is, perform the pandas procedures on it, then move to the next file. Is this possible? I've tried a number of tutorials involving glob, iglob, fnmatch etc., but I'm struggling to come up with a workable solution.
You can list all the files with a .json extension in the directory first:
import os, json
import pandas as pd
path_to_json = 'currentdir/'
json_files = [json_file for json_file in os.listdir(path_to_json) if json_file.endswith('.json')]
print(json_files)
Now iterate over json_files and check each name against your list:
# example list: json_files = ['12345.json', '2468.json', '56789.json']
# midstring = ['12345', '2468', '56789']
for file in json_files:
    if file.split('.')[0] in midstring:
        # load the JSON before handing it to pandas
        with open(os.path.join(path_to_json, file)) as f:
            df = pd.DataFrame(json.load(f))
        # perform pandas functions on df
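Since the question mentions matching patterns such as C:\Users*12345.json, an alternative sketch using glob; path_to_json is the placeholder folder from above, and the pattern assumes the account code appears at the end of the filename:

import glob
import json
import pandas as pd

path_to_json = 'currentdir/'
for code in midstring:
    # match any .json file whose name ends with this account code
    for filename in glob.glob(path_to_json + '*' + code + '.json'):
        with open(filename) as f:
            df = pd.DataFrame(json.load(f))
        # perform pandas functions on df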
Please help me with a challenge I have: I need to list files every 30 seconds and process them (processing means, for example, copying them to another location; each file is moved out of the directory once processed). When I list files again after 30 seconds, I want to skip any files that were listed previously (because they were already picked up and the for loop over them is still in progress).
In other words, I want to avoid duplicate file processing while listing the files every 30 seconds.
Here is my code:
import os
import time
from threading import Thread

def List_files():
    path = 'c:\\projects\\hc2\\'
    files = []
    for r, d, f in os.walk(path):
        for file in f:
            if '.txt' in file:
                files.append(os.path.join(r, file))
    return files  # return the collected paths so start_threading can iterate over them

class MyFilethreads:
    def __init__(self, t1):
        self.t1 = t1

    def start_threading(self):
        for file in List_files():
            # FILEPROCESS_FUNCTION is a placeholder for the actual processing function
            self.t1 = Thread(target=FILEPROCESS_FUNCTION, args=(file,))
            self.t1.start()

t1 = Thread()
myclass = MyFilethreads(t1)

while True:
    myclass.start_threading()
    time.sleep(30)
I have not included my actual file-processing function since it is big; it is started in a thread as FILEPROCESS_FUNCTION.
Problem:
If the files are large, processing time can grow (in other words, the for loop takes more than 30 seconds), but I can't adjust the 30-second timer, since the overrun is a rare case and my Python script picks up hundreds of files every minute.
Hence, I am looking for a way to skip files that were already listed previously, so that I avoid duplicate file processing.
Please help.
Thanks in advance.
Make a dictionary in your class and insert every file you have seen. Then, in start_threading, check whether the file is already in the dictionary and skip it if so.
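A minimal sketch of that idea, reusing List_files and the FILEPROCESS_FUNCTION placeholder from the question; a set is used here instead of a dictionary, since only membership is needed:

from threading import Thread

class MyFilethreads:
    def __init__(self):
        self.seen = set()  # paths already dispatched for processing

    def start_threading(self):
        for file in List_files():
            if file in self.seen:
                continue  # already picked up in an earlier pass
            self.seen.add(file)
            Thread(target=FILEPROCESS_FUNCTION, args=(file,)).start()

Since processed files are moved out of the directory, the set could be pruned periodically to keep it from growing indefinitely; that bookkeeping is omitted here.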
Within a script is a watcher algorithm which I've adapted from here:
https://www.michaelcho.me/article/using-pythons-watchdog-to-monitor-changes-to-a-directory
My aim is now to add a few lines to grab the name of any file that's modified so I can have an if statement checking for a certain file such as:
if [modified file name] == "my_file":
print("do something")
I don't have any experience with watchdog or file watching, so I am struggling to find an answer for this. How would I receive the modified file's name?
The current setup of the watchdog handler is pretty useless since it just prints; it doesn't return anything.
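For reference, watchdog passes an event object to each handler method, and its src_path attribute carries the path of the file that changed. A minimal sketch, assuming a handler along the lines of the linked article (the "my_file" comparison mirrors the question):

import os
from watchdog.events import FileSystemEventHandler

class MyHandler(FileSystemEventHandler):
    def on_modified(self, event):
        # event.src_path is the full path of the modified file or directory
        if os.path.basename(event.src_path) == "my_file":
            print("do something")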
Let me offer a different approach:
The following would get you a list of files modified in the past 12 hours:
import os
import datetime as dt

result = [os.path.join(root, f)
          for root, subfolder, files in os.walk(my_dir)
          for f in files
          if dt.datetime.fromtimestamp(os.path.getmtime(os.path.join(root, f))) >
             dt.datetime.now() - dt.timedelta(hours=12)]
Instead of a fixed delta in hours, you can save the last_search_time and use it in subsequent searches.
You can then search result to see if it contains your file:
if my_file in result:
print("Sky is falling. Take cover.")
I've written a batch script, deployed to our network via Chocolatey and FOG, that acquires the machine's serial number and then writes it to a .txt file named after the PC the serial number belongs to:
net use y: \\192.168.4.104\shared
wmic bios get serialnumber > Y:\IT\SSoftware\Serial_Numbers\%COMPUTERNAME%.txt
net use y: /delete
The Serial_Numbers folder is subsequently filled with .txt files named after every computer on campus. With this in mind, I'd like to write a Python script that goes through and grabs every .txt file's name and the second string inside it to form a dictionary, so you can look up a PC by name and have its serial number returned.
I know how I'd create the dictionary and read from it, but I'm having trouble figuring out how to properly grab each .txt file's name and the second string inside it. Any help would be greatly appreciated.
Format of .txt documents:
SerialNumber
#############
You can use os.listdir to list the directory's files and a list comprehension to filter them.
Use glob to list the files in your directory.
You can skip the header line, read the serial number, and stop using the file while populating the dictionary, and you're done:

import glob

d = {}
# loop over '.txt' files only
for filename in glob.glob('/path_to_Serial_Numbers_folder/*.txt'):
    with open(filename, 'r') as f:
        f.readline()  # skip the 'SerialNumber' header line
        file_name_no_extension = '.'.join(filename.split('.')[:-1])
        d[file_name_no_extension] = f.readline().strip()
print(d)
import glob
data = {}
for fnm in glob.glob('*.txt'):
    data[fnm[:-4]] = open(fnm).readlines()[1].strip()
or, more succinctly
import glob
data = {f[:-4]:open(f).readlines()[1].strip() for f in glob.glob('*.txt')}
In the dictionary comprehension above:
- f[:-4] is the filename without its last four characters (i.e., ".txt"),
- open(f).readlines()[1].strip() is the second line of the file, stripped of surrounding whitespace, and
- f is an element of the list of filenames returned by glob.glob().
I have a bunch of Apache log files which I need to parse and extract information from. My script works fine for a single file, but I'm wondering about the best approach for handling multiple files.
Should I:
- loop through all files and create a temporary file holding all contents
- run my logic on the "contact-ed" file
Or
- loop through every file
- run my logic file by file
- try to merge the results of every file
File-wise, I'm looking at about a year of logs, with roughly 2 million entries per day, reported for a large number of machines. My single-file script generates an object with "entries" for every machine, so I'm wondering:
Question:
Should I generate a joint temporary file, or run file by file, generate per-file objects, and merge x files containing entries for the same y machines?
You could use glob and the fileinput module to effectively loop over all of them and see it as one "large file":
import fileinput
from glob import glob
log_files = glob('/some/dir/with/logs/*.log')
for line in fileinput.input(log_files):
    pass  # do something with each line
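If you go file by file instead, merging the per-file results can be a plain dictionary merge. A sketch, assuming each per-file result maps a machine name to a list of entries; parse_file stands in for your existing single-file logic and is not defined here:

from collections import defaultdict
from glob import glob

def merge_results(log_files):
    merged = defaultdict(list)  # machine name -> entries accumulated across all files
    for path in log_files:
        # parse_file(path) is assumed to return {machine: [entries, ...]}
        for machine, entries in parse_file(path).items():
            merged[machine].extend(entries)
    return merged

results = merge_results(glob('/some/dir/with/logs/*.log'))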