I have a bunch of Apache log files, which I need to parse and extract information from. My script works fine for a single file, but I'm wondering about the best approach to handle multiple files.
Should I:
- loop through all files and create a temporary file holding all contents
- run my logic on the concatenated file
Or
- loop through every file
- run my logic file by file
- try to merge the results of every file
File-wise, I'm looking at about a year of logs, with roughly 2 million entries per day, reported for a large number of machines. My single-file script generates an object with "entries" for every machine, so I'm wondering:
Question:
Should I generate a joint temporary file, or run file by file, generate per-file objects, and merge x files with entries for the same y machines?
You could use glob and the fileinput module to effectively loop over all of them and see it as one "large file":
import fileinput
from glob import glob
log_files = glob('/some/dir/with/logs/*.log')
for line in fileinput.input(log_files):
    pass  # do something with each line
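If you also want the per-machine aggregation from your question, you can build it while streaming, without any temporary concatenated file. A minimal sketch, where parse_line() is a hypothetical stand-in for your existing single-file parsing logic:
import fileinput
from collections import defaultdict
from glob import glob

log_files = glob('/some/dir/with/logs/*.log')

# entries_by_machine maps machine name -> list of parsed entries
entries_by_machine = defaultdict(list)

for line in fileinput.input(log_files):
    # parse_line is a placeholder for your existing per-line logic
    machine, entry = parse_line(line)
    entries_by_machine[machine].append(entry)
This way there is nothing to merge afterwards: every file contributes to the same object as it is read.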
I'm trying to create a list of Excel files that are saved to a specific directory, but I'm having an issue where the generated list contains a duplicate entry for one of the file names (I am absolutely certain there is not actually a duplicate of the file).
import glob
# get data file names
path = r'D:\larvalSchooling\data'
filenames = glob.glob(path + "/*.xlsx")
output:
>>> filenames
['D:\\larvalSchooling\\data\\copy.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_70dpf_GroupA_n5_20200808_1015-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx']
You'll note 'D:\larvalSchooling\data\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx' is listed twice.
Rather than going through after the fact and removing duplicates I was hoping to figure out why it's happening to begin with.
I'm using Python 3.7 on Windows 10 Pro.
If you wrote the code to remove duplicates (which can be as simple as filenames = set(filenames)), you'd see that you still have two distinct filenames. Print them out one on top of the other to make a visual comparison easier:
'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx',
'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx'
The second one has a leading ~$ (a temporary file Excel creates while the workbook is open).
Whenever you open an Excel file, Excel creates a ghost copy that works as a temporary backup for that specific file. In this case:
Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
This means the file is open in some program, which is showing you that backup alongside it (usually that file is hidden in File Explorer as well).
Just find that program and close it. If you want to avoid this in general, you should also add validation so that files matching the "~$*.xlsx" pattern are ignored.
You can use os.path.splitext to get the file extension and loop through the directory using os.listdir. The open Excel files can be skipped using the following code:
import os

filenames = []
for file in os.listdir(r'D:\larvalSchooling\data'):
    filename, file_extension = os.path.splitext(file)
    if file_extension == '.xlsx' and not file.startswith('~$'):
        filenames.append(file)
Note: this might not be the best solution, but it'll get the job done :)
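Alternatively, if you prefer to stay with glob, you could filter the lock files out of its result. A minimal sketch:
import glob
import os

path = r'D:\larvalSchooling\data'
filenames = [f for f in glob.glob(path + "/*.xlsx")
             if not os.path.basename(f).startswith('~$')]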
I have a Python script that writes four JSON files (each a handful of key/value pairs) on every execution. This script executes around 100k times a day, producing 400k files every day. Every execution writes its four JSON files inside a dated directory with a run counter, for example:
/30/04/2021/run1/
/30/04/2021/run2/
...
/30/04/2021/run100k/
/01/05/2021/run1/
/02/05/2021/run2/
...
/31/05/2021/run100k/
I am exposing a read API (get_runs) for this, wherein people can ask for data within a given date range.
import json

def get_runs(from_date, to_date):
    files = []
    # construct file paths for the given date range (from_date, to_date)
    # append file paths to files
    # call _read_files
    return _read_files(files)

def _read_files(files):
    data = []
    for file in files:
        with open(file) as f:
            data.append(json.load(f))
    return data
As you can see, inside the get_runs call I am reading the files inside the dated directories (in the given range) one by one in a loop, and this takes a lot of time to complete. Any suggestions on how to optimize this? I want this to be super fast. Note that I cannot store these JSON objects in a database.
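One direction to try: the loop is I/O-bound, so the reads can be overlapped with a thread pool. A minimal sketch (the worker count is just an assumption to tune for your storage):
import json
from concurrent.futures import ThreadPoolExecutor

def _read_file(path):
    with open(path) as f:
        return json.load(f)

def _read_files(files):
    # Overlap the file reads; 32 workers is a guess, not a recommendation.
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(_read_file, files))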
I have multiple directories that require the same large JSON file. Rather than storing the JSON file multiple times, I would like to store it in one directory and create an alias that points to the actual file in the other directories.
How do I create a shortcut/alias for a file in Python?
You can create a symlink to the file. In Python this looks like:
import os
os.symlink("my_json_file.json", "the_link")
I am trying to rename a file that is auto-generated by some local modules, but I was wondering if using os.listdir is the only way for me to filter/narrow down this file.
This file will always be generated before it is removed, and the code will generate the next one (still in the same directory) based on the next item in the list.
Basically, whenever this file is generated, it comes in the following file path:
/user_data/.tmp/tempLibFiles/2015_03_16_192212_182096.con
I only want to rename the 2015_03_16_192212_182096 part to connectionFile while keeping the rest the same.
You can also use the glob module to narrow down the list of files to the one that matches a particular pattern. For example:
import glob
files = glob.glob('/user_data/.tmp/tempLibFiles/*.con')
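From there, renaming only the stem while keeping the directory and extension could look like this (a sketch assuming exactly one .con file is present at a time):
import glob
import os

files = glob.glob('/user_data/.tmp/tempLibFiles/*.con')
if files:
    old_path = files[0]
    new_path = os.path.join(os.path.dirname(old_path), 'connectionFile.con')
    os.rename(old_path, new_path)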
What is the most efficient and fastest way to get a single file from a directory using Python?
More details on my specific problem:
I have a directory containing a lot of pregenerated files, and I just want to pick a random one. Since I know there's no really efficient way of picking a random file from a directory other than listing all the files first, my files are generated with already-random names, so they are already randomly sorted, and I just need to pick the first file in the folder.
So my question is: how can I pick the first file from my folder without having to load the whole list of files from the directory (nor having the OS do that; ideally I would force the OS to return just a single file and then stop!)?
Note: I have a lot of files in my directory, which is why I would like to avoid listing all the files just to pick one.
Note2: each file is only picked once, then deleted to ensure that only new files are picked the next time (thus ensuring some kind of randomness).
SOLUTION
I finally chose to use an index file that will store:
the index of the current file to be picked (e.g. 1 for file1.ext, 2 for file2.ext, etc.)
the index of the last file generated (e.g. 1999 for file1999.ext)
Of course, this means that my files are not generated with random names anymore, but using a deterministic, incrementable pattern (e.g. "file%s.ext" % ID).
Thus I have near-constant time for my two main operations:
Accessing the next file in the folder
Counting the number of files that are left (so that I can generate new files in a background thread when needed).
This is a specific solution for my problem, for more generic solutions, please read the accepted answer.
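A minimal sketch of that index-file idea (file names and layout are assumptions, not the exact code used):
import os

INDEX_FILE = 'index.txt'   # two lines: next index to pick, last index generated
FILE_DIR = 'generated'     # hypothetical directory holding file1.ext, file2.ext, ...

def pick_next_file():
    with open(INDEX_FILE) as f:
        next_id, last_id = (int(line) for line in f)
    if next_id > last_id:
        return None                        # nothing left to pick
    path = os.path.join(FILE_DIR, 'file%s.ext' % next_id)
    with open(INDEX_FILE, 'w') as f:       # advance the "current" index
        f.write('%d\n%d\n' % (next_id + 1, last_id))
    return path

def files_left():
    with open(INDEX_FILE) as f:
        next_id, last_id = (int(line) for line in f)
    return last_id - next_id + 1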
Also, you might be interested in these two other solutions I've found to optimize file access and directory walking in Python:
os.walk optimized
Python FAM (File Alteration Monitor)
Don't keep a lot of pregenerated files in one directory. Divide them over subdirectories once there are more than n files in a directory.
When creating the files, add the name of the newest file to a list stored in a text file. When you want to read/process/delete a file (see the sketch after this list):
Open the text file.
Set filename to the name at the top of the list.
Delete that name from the top of the list.
Close the text file.
Process filename.
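A sketch of that text-file queue (the file name is illustrative):
QUEUE_FILE = 'queue.txt'   # one generated filename per line, oldest first

def pop_next_filename():
    with open(QUEUE_FILE) as f:
        names = f.read().splitlines()
    if not names:
        return None
    first, rest = names[0], names[1:]
    with open(QUEUE_FILE, 'w') as f:       # rewrite the queue without the popped name
        f.write('\n'.join(rest) + ('\n' if rest else ''))
    return first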
Just use random.choice() on the os.listdir() result:
import random
import os
randomfilename = random.choice(os.listdir(path_to_directory))
os.listdir() returns results in the ordering given by the OS. Using random filenames does not change that ordering; only adding items to or removing items from the directory can influence it.
If you fear that you'll have too many files, do not use a single directory. Instead, set up a tree of directories with pre-generated names, pick one of those at random, and then pick a file from there.
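A sketch of that two-level pick, assuming every entry under the root is a non-empty subdirectory:
import os
import random

root = path_to_directory   # same variable as in the snippet above
subdir = random.choice(os.listdir(root))
randomfilename = random.choice(os.listdir(os.path.join(root, subdir)))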