I have used the following code to read multiple files simultaneously
from contextlib import ExitStack

files_to_parse = [file1, file2, file3]
with ExitStack() as stack:
    files = [stack.enter_context(open(i, "r")) for i in files_to_parse]
    for rows in zip(*files):
        for r in rows:
            # do stuff
            ...
However, I have noticed that since my files don't all have the same number of lines, as soon as the shortest file reaches its end, all the files are closed.
I used the code above (which I found here on Stack Overflow) because I need to parse several files at the same time (to save time). Doing so divides the computing time by 4. However, not all of the files are parsed entirely because of the problem mentioned above.
Is there any way to solve this problem?
open can be used as a context manager, but it does not have to be. You can use it the old-fashioned way, where you take responsibility for closing the files yourself, that is:
try:
    files = [open(fname) for fname in filenames]
    # here do what you need to with files
finally:
    for file in files:
        file.close()
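Note that the early stop described in the question comes from zip() stopping at the shortest iterable, not from the files being closed too soon. As a hedged sketch (the placeholder paths are assumptions), itertools.zip_longest from the standard library keeps yielding rows until the longest file is exhausted:

from contextlib import ExitStack
from itertools import zip_longest

files_to_parse = ["a.txt", "b.txt", "c.txt"]  # placeholder paths, use your own

with ExitStack() as stack:
    files = [stack.enter_context(open(fname, "r")) for fname in files_to_parse]
    # fillvalue=None marks files that have already run out of lines
    for rows in zip_longest(*files, fillvalue=None):
        for r in rows:
            if r is None:
                continue  # this file has no more lines
            ...  # do stuff with r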
I would be very grateful indeed for some help for a frustrated and confused Python beginner.
I am trying to create a script that searches a Windows directory, containing multiple subdirectories and different file types, for a specific single string (a name) in the file contents, and if found prints the filenames as a list. There are approximately 2000 files in 100 subdirectories; the files I want to search don't necessarily have the same extension, but they are all, in essence, ASCII files.
I've been trying to do this for many many days but I just cannot figure it out.
So far I have tried using glob recursive coupled with reading the file but I'm so very bewildered. I can successfully print a list of all the files in all subdirectories, but don't know where to go from here.
import glob
files = []
files = glob.glob(r'C:\TEMP' + '/**', recursive=True)
print(files)
Can anyone please help me? I am 72 year old scientist trying to improve my skills and "automate the boring stuff", but at the moment I'm just losing the will.
Thank you very much in advance to this community.
Great to have you here!
What you have done so far is find all the file paths; now the simplest way is to go through each of the files, read them into memory one by one, and see if the name you are looking for is there.
import glob

files = glob.glob(r'C:\TEMP' + '/**', recursive=True)

target_string = 'John Smit'

# iterate over files
for file in files:
    try:
        # open file for reading
        with open(file, 'r') as f:
            # read the contents
            contents = f.read()
            # check if contents have your target string
            if target_string in contents:
                print(file)
    except (OSError, UnicodeDecodeError):
        # skip directories and files that can't be read as text
        pass
This will print the file path each time the name is found.
Please also note that I have removed the second line from your code (files = []) because it is redundant: you assign to the list on line 3 anyway.
Hope it helps!
You could do it like this, though I think there must be a better approach.
When you find all files in your directory, you iterate over them and check if they contain that specific string.
import os
import glob

files = glob.glob(r'C:\TEMP' + '/**', recursive=True)

for file in files:
    if os.path.isfile(file):
        with open(file, 'r') as f:
            if 'search_string' in f.read():
                print(file)
I'm trying to create a list of Excel files that are saved to a specific directory, but I'm having an issue: when the list is generated, it contains a duplicate entry for one of the file names (I am absolutely certain there is not actually a duplicate of the file).
import glob
# get data file names
path = r'D:\larvalSchooling\data'
filenames = glob.glob(path + "/*.xlsx")
output:
>>> filenames
['D:\\larvalSchooling\\data\\copy.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_70dpf_GroupA_n5_20200808_1015-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\Raw data-SF_Sat_84dpf_GroupABCD_n5_20200822_1440-Trial 1.xlsx', 'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx']
You'll note that 'D:\larvalSchooling\data\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx' is listed twice.
Rather than going through after the fact and removing duplicates I was hoping to figure out why it's happening to begin with.
I'm using python 3.7 on windows 10 pro
If you wrote the code to remove duplicates (which can be as simple as filenames = set(filenames)) you'd see that you still have two distinct filenames. Print them out one on top of the other to make a visual comparison easier:
'D:\\larvalSchooling\\data\\Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx',
'D:\\larvalSchooling\\data\\~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx'
The second one has a leading ~$ (it is a temporary file Excel creates while the workbook is open, not a real duplicate).
Whenever you open an Excel file, Excel creates a ghost copy that works as a temporary backup for that specific file. In this case:
Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
~$Raw data-SF_Fri_70dpf_GroupABC_n5_20200828_1140-Trial 1.xlsx
This means the file is currently open in some program, and glob is picking up that backup as well (usually that file is hidden in Explorer).
Just find that program and close the file. You should also add validation so that "~$*.xlsx" files are ignored, if this is something you want to avoid.
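A minimal sketch of that validation, reusing the path variable from the question, could be:

import glob
import os

path = r'D:\larvalSchooling\data'
filenames = [
    f for f in glob.glob(path + "/*.xlsx")
    # skip Excel's temporary ~$ files so only real workbooks are listed
    if not os.path.basename(f).startswith('~$')
]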
You can use os.path.splitext to get the file extension and loop through the directory using os.listdir. The open Excel files can be skipped using the following code:

import os

filenames = []
for file in os.listdir(r'D:\larvalSchooling\data'):
    filename, file_extension = os.path.splitext(file)
    if file_extension == '.xlsx':
        if not file.startswith('~$'):
            filenames.append(file)
Note: this might not be the best solution, but it'll get the job done :)
I am new to Python and not adept at it. I need to traverse a huge list of directories which contain zipped files within them. While this can be done via the following method,
import gzip

for file in list:
    for filename in file:
        with gzip.open(filename) as fileopen:
            for line in fileopen:
                # process each line
                ...
the time taken would be a few days. Would I be able to use any function that allows me to traverse other parts of the directory concurrently, performing the same processing without any repeats in the traversal?
Any help or direction would be greatly appreciated
Move the heavy processing to a separate program, then call that program with subprocess to keep a certain number of parallel processes running:
import subprocess
import time

todo = []
for file in list:
    for filename in file:
        todo.append(filename)

running_processes = []
while len(todo) > 0:
    # drop the processes that have already finished
    running_processes = [p for p in running_processes if p.poll() is None]
    # keep at most 8 worker processes running at a time
    if len(running_processes) < 8:
        target = todo.pop()
        running_processes.append(subprocess.Popen(['python', 'process_gzip.py', target]))
    time.sleep(1)
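The answer assumes a separate worker script; a minimal sketch of what that hypothetical process_gzip.py could look like (the per-line work shown here is only a placeholder, not something the answer specifies):

# process_gzip.py -- hypothetical worker script assumed above
import gzip
import sys

def main():
    path = sys.argv[1]
    # open the gzipped file in text mode and handle it line by line
    with gzip.open(path, "rt") as fileopen:
        for line in fileopen:
            ...  # real per-line processing goes here

if __name__ == "__main__":
    main()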
You can open many files concurrently. For instance:
import gzip

files = [gzip.open(f, "rb") for f in fileslist]
processed = [process(f) for f in files]
(By the way, don't call your list of files "list", or a single file "file": those names shadow Python built-ins and do not describe what the objects really are in your case.)
Now it is going to take about the same time, since you always process them one at a time. So, is it the processing of them that you want to parallelize? Then you want to look at threading or multiprocessing.
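As a hedged sketch of the multiprocessing route (the directory names and the per-file work are assumptions, not part of the question), something like this spreads the per-file work across CPU cores:

import glob
import gzip
import os
from multiprocessing import Pool

def process_file(path):
    # decompress one file and do the per-line work; counting lines is just a placeholder
    count = 0
    with gzip.open(path, "rt") as fileopen:
        for line in fileopen:
            count += 1
    return count

if __name__ == "__main__":
    folderslist = ["dir_a", "dir_b"]  # assumption: your directories of .gz files
    paths = [p for folder in folderslist
             for p in glob.glob(os.path.join(folder, "*.gz"))]
    with Pool() as pool:  # one worker per CPU core by default
        results = pool.map(process_file, paths)
    print(sum(results), "lines processed in total")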
Are you looking for os.walk (os.path.walk in Python 2) to traverse directories? (https://docs.python.org/2/library/os.path.html). You can also do:
for folder in folderslist:
    fileslist = os.listdir(folder)
    for file in fileslist:
        ...
Are you interested in fileinput to iterate over lines from multiple input streams? (https://docs.python.org/2/library/fileinput.html; fileinput.hook_compressed seems to handle gzip.)
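A hedged sketch of that fileinput route (the glob pattern is an assumption about where your gzipped files live):

import fileinput
import glob

gz_paths = glob.glob("logs/*.gz")  # assumption: adjust to your own directories
# hook_compressed transparently decompresses .gz (and .bz2) inputs
with fileinput.input(files=gz_paths, openhook=fileinput.hook_compressed) as stream:
    for line in stream:
        # note: lines from compressed files arrive as bytes unless an encoding is given
        ...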
Using IronPython 2.6 (I'm new), I'm trying to write a program that opens a file, saves it at a series of locations, and then opens/manipulates/re-saves those. It will be run by an upper-level program on a loop, and this entire procedure is designed to catch/preserve corrupted saves so my company can figure out why this glitch of corruption occasionally happens.
I've currently worked out the open/save-to-locations parts of the script. Now I need to build a function that opens a file, checks it for corruption, and either (if corrupted) moves it into a subfolder, with an iterative renaming applied for copies, or (if okay) modifies the file and saves a duplicate, on which the process is repeated without further duplication.
I mention all this as context for the root problem. In my situation, what is the most Pythonic, consistent, and Windows/Unix-friendly way to move a (corrupted) file into a subfolder while also renaming it based on the number of pre-existing copies of that file already in the subfolder?
In other words:
In a folder structure built as:
C:\Folder\test.txt
C:\Folder\Subfolder
C:\Folder\Subfolder\test.txt
C:\Folder\Subfolder\test01.txt
C:\Folder\Subfolder\test02.txt
C:\Folder\Subfolder\test03.txt
How do I move test.txt such that:
C:\Folder\Subfolder
C:\Folder\Subfolder\test.txt
C:\Folder\Subfolder\test01.txt
C:\Folder\Subfolder\test02.txt
C:\Folder\Subfolder\test03.txt
C:\Folder\Subfolder\test04.txt
In an automated way, so that I can loop my program overnight and have it stack up the corrupted text files I need to save? Note: they're not text files in practice, that's just an example.
Assuming you are going to use the convention of incrementally suffixing numbers to the files:
import os.path
import shutil

def store_copy(file_to_copy, destination):
    filename, extension = os.path.splitext(os.path.basename(file_to_copy))
    existing_files = [i for i in os.listdir(destination) if i.startswith(filename)]
    new_file_name = "%s%02d%s" % (filename, len(existing_files), extension)
    shutil.copy2(file_to_copy, os.path.join(destination, new_file_name))
There's a failure case if you have subdirectories or files in destination whose names overlap with the source file: if your file is named 'example.txt' and the destination contains 'example_A.txt' as well as 'example.txt' and 'example01.txt', the count will be off. If that's a possibility, you'd have to change the test in the existing_files line to something more sophisticated.
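For instance, with the folder layout from the question, a call might look like this (the paths are taken from the example above):

store_copy(r'C:\Folder\test.txt', r'C:\Folder\Subfolder')
# with test.txt and test01.txt-test03.txt already present, this writes test04.txt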
I would like to know how I can use Python to concatenate multiple Javascript files into just one file.
I am building a component based engine in Javascript, and I want to distribute it using just one file, for example, engine.js.
Alternatively, I'd like the users to get the whole source, which has a hierarchy of files and directories, and with the whole source they should get a build.py Python script, that can be edited to include various systems and components in it, which are basically .js files in components/ and systems/ directories.
How can I load files which are described in a list (paths) and combine them into one file?
For example:
toLoad = [
    "core/base.js",
    "components/Position.js",
    "systems/Rendering.js"
]
The script should concatenate these in order.
Also, this is a Git project. Is there a way for the script to read the version of the program from Git and then write it as a comment at the beginning?
This will concatenate your files:
def read_entirely(file):
    with open(file, 'r') as handle:
        return handle.read()

result = '\n'.join(read_entirely(file) for file in toLoad)
You may then output them as necessary, or write them using code similar to the following:
with open('engine.js', 'w') as handle:
    handle.write(result)
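The question also asks about embedding the Git version. One hedged approach, assuming the build script runs from inside the repository's working tree with git available on the PATH, is to prepend the output of git describe to result as a comment before writing it out:

import subprocess

# assumption: run from within the Git working tree, git available on PATH
version = subprocess.check_output(['git', 'describe', '--always', '--tags'], text=True).strip()
result = '// version: %s\n' % version + result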
How about something like this?
final_script = ''
for script_name in toLoad:
    with open(script_name, 'r') as f:
        final_script += '\n' + f.read()

with open('engine.js', 'w') as f:
    f.write(final_script)
You can do it yourself, but this is a real problem that real tools solve in more sophisticated ways. Consider JavaScript minification, e.g. using http://developer.yahoo.com/yui/compressor/