Python: split a large file into multiple files

I'm new to Python and have a small bit of code that reads a list of files into a Windows batch file to execute. The trouble is, the code is (correctly) writing up to 700 lines per batch file, which would take an age to process on one PC. I'd like to break each batch file into, say, 100 lines per batch script and then run the batch files across 7 PCs, for example, but my lack of Python knowledge is hampering me a little. I have:
import os
import fnmatch

def find_files(directory, pattern):
    #directory = raw_input("Enter a directory to search for Userlists: ")
    directory = "c:\\test\\"
    os.chdir(directory)
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename
for files in find_files('a-zA-Z0-9', '*.btm'):
    #print files
    lines = "call " + files + "\n"
    print lines
    with open("c:/Extract.bat", "a+") as infile:
        infile.write(lines)
    # the with block closes the file for you; the original trailing
    # infile.close was missing its parentheses and did nothing
I've tried using islice(f, n), but it failed to write anything to Extract.bat. Any help would, as always, be really appreciated.
Many thanks.

A quite simple way would be to put this in your loop:

i += 1  # don't forget to initialize i before the loop, though
with open("c:/Extract" + str(i // 700) + ".bat", "a+") as infile:

This gives you new files named Extract0.bat, Extract1.bat, ... every 700 lines (use // so the division stays an integer on Python 3).
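For completeness, here is a minimal, untested sketch of the full approach with a chunk size of 100, assuming the find_files generator above: it collects the call lines first, then writes them out in batches using itertools.islice (the function you mentioned trying).

import itertools

def chunked(iterable, size):
    # Yield successive lists of at most `size` items from `iterable`
    it = iter(iterable)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            break
        yield chunk

lines = ["call " + f + "\n" for f in find_files("c:\\test\\", "*.btm")]
for i, chunk in enumerate(chunked(lines, 100)):
    # Extract0.bat, Extract1.bat, ... one batch file per 100 lines
    with open("c:/Extract" + str(i) + ".bat", "w") as outfile:
        outfile.writelines(chunk)

Each resulting batch file can then be copied to a different PC and run there.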

Related

Concatenating fasta files in folder into single file in python

I have multiple FASTA sequence files stored in a folder within my current working directory (called "Sequences") and am trying to combine all the sequences into a single file to run a MUSCLE multiple sequence alignment on.
This is what I have so far, and it is functional up until the output_fas.close(), where I get the error message FileNotFoundError: [Errno 2] No such file or directory: './Sequences'
Here is the code:
import os

os.getcwd()  # current directory
DIR = input("\nInput folder path containing FASTA files to combine into one FASTA file: ")
os.chdir(DIR)
FILE_NAME = input("\nWhat would you like to name your output file (e.g. combo.fas)? Note: "
                  "Please add the .fas extension: ")
output_fas = open(FILE_NAME, 'w')
file_count = 0
for f in os.listdir(DIR):
    if f.endswith(".fasta"):
        file_count += 1
        fh = open(os.path.join(DIR, f))
        for line in fh:
            output_fas.write(line)
        fh.close()
output_fas.close()
print(str(file_count) + " FASTA files were merged into one file, which can be found here: " + DIR)
When I input the directory, I enter './Sequences', which successfully changes the directory.
Not quite sure what to do. I adjusted the code before and it successfully created the new file with all the sequences concatenated together; however, it ran continuously, would not end, and each sequence was repeated multiple times.
Appreciate the help!
The error should occur before the output_fas.close(), and should be seen at the os.listdir(DIR) call. The problem is that DIR becomes meaningless as soon as you execute the os.chdir(DIR) command. DIR was provided as a relative path, and os.chdir(DIR) changes to the new directory, making the old relative path no longer correct relative to the new directory.
If you're going to use os.chdir(DIR), then never use DIR again, and just change your loop to:
# Use a with statement for guaranteed deterministic close at the end of the
# block & to avoid the need for an explicit close
with open(FILE_NAME, 'w') as output_fas:
    file_count = 0
    for f in os.listdir():  # Remove DIR to list the current directory
        if f.endswith(".fasta"):
            file_count += 1
            # Use a with here for the same reason as above
            with open(f) as fh:  # Don't join to DIR; f is already relative to the current directory
                output_fas.writelines(fh)  # writelines does the loop of write calls for you
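An alternative is to skip os.chdir entirely and join DIR onto every path you touch, so the relative path stays meaningful for the whole run. A short, untested sketch under that assumption (names carried over from the question):

import os

DIR = input("\nInput folder path containing FASTA files: ")
FILE_NAME = input("\nOutput file name (e.g. combo.fas; keep the .fas extension "
                  "so the output is not re-read as a .fasta input): ")

file_count = 0
with open(os.path.join(DIR, FILE_NAME), 'w') as output_fas:
    for f in os.listdir(DIR):
        if f.endswith(".fasta"):
            file_count += 1
            with open(os.path.join(DIR, f)) as fh:
                output_fas.writelines(fh)
print(str(file_count) + " FASTA files were merged into one file in: " + DIR)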

Python search files in multiple subdirectories for specific string and return file path(s) if present

I would be very grateful indeed for some help for a frustrated and confused Python beginner.
I am trying to create a script that searches a Windows directory containing multiple subdirectories and different file types for a specific single string (a name) in the file contents and, if found, prints the filenames as a list. There are approximately 2000 files in 100 subdirectories, and the files I want to search don't necessarily have the same extension, but are all, in essence, ASCII files.
I've been trying to do this for many, many days, but I just cannot figure it out.
So far I have tried using recursive glob coupled with reading the files, but I'm so very bewildered. I can successfully print a list of all the files in all subdirectories, but don't know where to go from here.
import glob
files = []
files = glob.glob('C:\TEMP' + '/**', recursive=True)
print(files)
Can anyone please help me? I am a 72-year-old scientist trying to improve my skills and "automate the boring stuff", but at the moment I'm just losing the will.
Thank you very much in advance to this community.
Great to have you here!
What you have done so far is find all the file paths. Now the simplest way is to go through each of the files, read it into memory, and see if the name you are looking for is there.
import glob

files = glob.glob('C:\TEMP' + '/**', recursive=True)
target_string = 'John Smit'

# iterate over files
for file in files:
    try:
        # open file for reading
        with open(file, 'r') as f:
            # read the contents
            contents = f.read()
            # check if contents have your target string
            if target_string in contents:
                print(file)
    except:
        pass  # skip directories and files that cannot be read as text
This will print the file path each time it finds the name.
Please also note I have removed the second line from your code (files = []), because it is redundant; the glob call on the next line initializes the list anyway.
Hope it helps!
You could do it like this, though I think there must be a better approach.
Once you have found all the files in your directory, you iterate over them and check whether they contain the specific string:
import os  # needed for the isfile check

for file in files:
    if os.path.isfile(file):
        with open(file, 'r') as f:
            if 'search_string' in f.read():
                print(file)
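For what it's worth, here is an untested pathlib-based sketch of the same idea; Path.rglob("*") replaces the glob('**', recursive=True) call, and errors="ignore" keeps undecodable bytes in binary-ish files from aborting the scan. The target string is a placeholder:

from pathlib import Path

target = "John Smit"  # placeholder: the name you are searching for
matches = []
for path in Path(r"C:\TEMP").rglob("*"):
    if path.is_file():
        try:
            if target in path.read_text(errors="ignore"):
                matches.append(path)
        except OSError:
            pass  # unreadable file (permissions, locks, ...): skip it
print(*matches, sep="\n")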

IOError: [Errno 2] No such file or directory: when the name was made by looping over existing files

I'm trying to have the bottom part of the code iterate over some files. The files come in corresponding pairs and are differentiated by a number, so the counter is there to change the number part of the file name.
The file names are generated by looking through the given files, selecting those containing certain things in the title, and then ordering them using the count.
This code works independently, in its own (lonely) folder, and prints the correct files in the correct order. However, when I use it in my main code, where file_1 and file_2 are referenced (the decoder and encoder parts of the code), I get the error in the title. There is no way there is any typo or that the files don't exist, because Python built these names itself from existing file names.
import os

count = 201
while 205 > count:
    indir = 'absolute_path/models'
    for root, dirs, filenames in os.walk(indir):
        for f in filenames:
            if 'test-decoder' in f:
                if f.endswith(".model"):
                    if str(count) in f:
                        file_1 = f
                        print(file_1)
    indir = 'absolute_path/models'
    for root, dirs, filenames in os.walk(indir):
        for f in filenames:
            if 'test-encoder' in f:
                if f.endswith(".model"):
                    if str(count) in f:
                        file_2 = f
                        print(file_2)
    decoder1.load_state_dict(
        torch.load(open(file_1, 'rb')))
    encoder1.load_state_dict(
        torch.load(open(file_2, 'rb')))
    print(getBlueScore(encoder1, decoder1, pairs, src, tgt))
    print_every = 10
    print(file_1 + file_2)
    count = count + 1
I then need to use these files two by two.
It's very possible that you are running into issues with variable scoping, but without being able to see your entire code it's hard to know for sure.
If you know what the model files should be called, might I suggest this code:
for i in range(201, 205):
    e = 'absolute_path/models/test_encoder_%d.model' % i
    d = 'absolute_path/models/test_decoder_%d.model' % i
    if os.path.exists(e) and os.path.exists(d):
        decoder1.load_state_dict(torch.load(open(d, 'rb')))  # decoder file into the decoder
        encoder1.load_state_dict(torch.load(open(e, 'rb')))  # encoder file into the encoder
Instead of relying on the presence of substrings in a path name, which could lead to errors, this forces only the files you want to open to be opened. It also gets rid of any possible scoping issues.
We could clean it up a bit more, but you get the idea.
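If you do want to keep the os.walk scan instead, note that os.walk yields bare file names, so opening f only works when the current directory happens to be the one being walked. A small, untested sketch of that fix:

import os

count = 201  # as in your loop
indir = 'absolute_path/models'
for root, dirs, filenames in os.walk(indir):
    for f in filenames:
        if 'test-decoder' in f and f.endswith(".model") and str(count) in f:
            file_1 = os.path.join(root, f)  # full path, valid from any working directory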

Python - Duplicate File Finder using defaultdict

I'm experimenting with different ways to identify duplicate files, based on file content, by looping through the top-level directory where folders A-Z exist. Within folders A-Z there is one additional layer of folders, named after the current date. Finally, within the dated folders there are between several thousand and several million (<3 million) files in various formats.
Using the script below I was able to process roughly 800,000 files in about 4 hours. However, running it over a larger data set of roughly 13,000,000 files in total, it consistently breaks on letter "I", which contains roughly 1.5 million files.
Given the size of the data I'm dealing with, I'm considering outputting the content directly to a text file and then importing it into MySQL or something similar for further processing. Please let me know if I'm going down the right track, or if you feel a modified version of the script below should be able to handle 13+ million files.
Question - How can I modify the script below to handle 13+ million files?
Error traceback:
Traceback (most recent call last):
  File "C:/Users/"user"/PycharmProjects/untitled/dups.py", line 28, in <module>
    for subdir, dirs, files in os.walk(path):
  File "C:\Python34\lib\os.py", line 379, in walk
    yield from walk(new_path, topdown, onerror, followlinks)
  File "C:\Python34\lib\os.py", line 372, in walk
    nondirs.append(name)
MemoryError
My code:
import hashlib
import os
import datetime
from collections import defaultdict

def hash(filepath):
    hash = hashlib.md5()
    blockSize = 65536
    with open(filepath, 'rb') as fpath:
        block = fpath.read(blockSize)
        while len(block) > 0:
            hash.update(block)
            block = fpath.read(blockSize)
    return hash.hexdigest()

directory = "\\\\path\\to\\files\\"
directories = [name for name in os.listdir(directory) if os.path.isdir(os.path.join(directory, name))]
outFile = open("\\path\\output.txt", "w", encoding='utf8')

for folder in directories:
    sizeList = defaultdict(list)
    path = directory + folder
    print("Start time: " + str(datetime.datetime.now()))
    print("Working on folder: " + folder)
    # Walk through one level of directories
    for subdir, dirs, files in os.walk(path):
        for file in files:
            filePath = os.path.join(subdir, file)
            sizeList[os.stat(filePath).st_size].append(filePath)
    print("Hashing " + str(len(sizeList)) + " Files")
    ## Hash remaining files
    fileList = defaultdict(list)
    for fileSize in sizeList.values():
        if len(fileSize) > 1:
            for dupSize in fileSize:
                fileList[hash(dupSize)].append(dupSize)
    ## Write remaining hashed files to file
    print("Writing Output")
    for fileHash in fileList.values():
        if len(fileHash) > 1:
            for hashOut in fileHash:
                outFile.write(hashOut + " ~ " + str(os.stat(hashOut).st_size) + '\n')
            outFile.write('\n')

outFile.close()
print("End time: " + str(datetime.datetime.now()))
Disclaimer: I don't know if this is a solution.
I looked at your code, and I realized the error is provoked by .walk. Now, it's true that this might be because of too much info being processed (so maybe an external DB would help matters, though the added operations might hinder your speed). But other than that, .listdir (which is called by .walk) is really terrible when you handle a huge number of files. Hopefully this is resolved in Python 3.5, because it implements the much better scandir, so if you're willing* to try the latest (and I do mean latest; it was released, what, 8 days ago?), that might help.
Other than that, you can try to trace bottlenecks and garbage collection to maybe figure it out.
*you can also just install scandir with pip for your current Python, but where's the fun in that?
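If you do move to Python 3.5, here is an untested sketch of the idea: recurse with a generator so that no directory's worth of names has to be buffered in a list, and reuse the stat result that scandir caches on each entry.

import os

def iter_files(path):
    # Recursively yield (filepath, size) one entry at a time
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from iter_files(entry.path)
        elif entry.is_file(follow_symlinks=False):
            yield entry.path, entry.stat().st_size  # stat is cached by scandir

# Possible drop-in for the os.walk section of your script:
# for filePath, size in iter_files(path):
#     sizeList[size].append(filePath)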

Move files from one directory to another if they appear in a text file in Python (take 2)

OK, I'm going to try this again; apologies for my poor effort in my previous question.
I am writing a program in Java and I wanted to move some files from one directory to another based on whether they appear in a list. I could do it manually, but there are thousands of files in the directory, so it would be an arduous task, and I need to repeat it several times! I tried to do it in Java, but with the version of Java I am using it appears I cannot use java.nio, and I am not allowed to use external libraries.
import os
import shutil

with open('files.txt', 'r') as f:
    myNames = [line.strip() for line in f]
print myNames

dir_src = "trainfile"
dir_dst = "train"

for file in os.listdir(dir_src):
    print file  # testing
    src_file = os.path.join(dir_src, file)
    dst_file = os.path.join(dir_dst, file)
    shutil.move(src_file, dst_file)
"files.txt" is in the format:
a.txt
edfs.txt
fdgdsf.txt
and so on.
So at the moment it is moving everything from trainfile to train, but I need to only move files if they are in the myNames list.
Does anyone have any suggestions?
Check whether the file name exists in the myNames list; put this just before the shutil.move:

if file in myNames:
so at the moment it is moving everything from trainfile to train, but I need to only move files if they are in the myNames list
You can translate that "if they are in the myNames list" directly from English to Python:
if file in myNames:
    shutil.move(src_file, dst_file)
And that's it.
However, you probably want a set of names, rather than a list. It makes more conceptual sense. It's also more efficient, although the speed of looking things up in a list will probably be negligible compared to the cost of copying files around. Anyway, to do that, you need to change one more line:
myNames = {line.strip() for line in f}
And then you're done.
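Putting the pieces together, an untested sketch of the whole script (directory names taken from the question):

import os
import shutil

with open('files.txt', 'r') as f:
    myNames = {line.strip() for line in f}  # a set, for O(1) membership tests

dir_src = "trainfile"
dir_dst = "train"

for name in os.listdir(dir_src):
    if name in myNames:
        shutil.move(os.path.join(dir_src, name), os.path.join(dir_dst, name))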
