Am I doing something wrong or is finding the most recent file at a file path location supposed to be fairly slow?
The code below takes upwards of 3 minutes. Is this expected when parsing through a list of ~850 files?
I am using a glob pattern to find only .txt files, so after searching through my file share location it returns a list of ~850 files. This is the list it parses through to get the max file by key=os.path.getctime.
I tried sorting instead of max and just grabbing the top file, but that wasn't any faster.
import os
import glob

path = r'C:\Desktop\Test'
fileRegex = '*.*txt'

def get_latest_file(path, fileRegex):
    fullpath = os.path.join(path, fileRegex)
    # glob.glob (not iglob) so the empty check below actually works
    list_of_files = glob.glob(fullpath, recursive=True)
    if not list_of_files:
        return ''
    latestFile = max(list_of_files, key=os.path.getctime)
    return latestFile

latestFile = get_latest_file(path, fileRegex)
Try using os.scandir(); this sped up my file searching massively.
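For example, here is a minimal sketch of that approach, assuming a flat directory of .txt files (the helper name get_latest_file_scandir is just illustrative). On Windows, DirEntry.stat() is usually served from data gathered during the directory listing itself, which is what saves a per-file round trip to the share:

import os

def get_latest_file_scandir(path, suffix='.txt'):
    # keep track of the newest matching entry seen so far
    latest, latest_ctime = '', 0.0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(suffix):
                ctime = entry.stat().st_ctime
                if ctime > latest_ctime:
                    latest, latest_ctime = entry.path, ctime
    return latest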
Related
I am trying to run a program which requires pVCF files alone as inputs. Due to the size of the data, I am unable to create a separate directory containing the particular files that I need.
The directory contains multiple files with 'vcf.gz.tbi' and 'vcf.gz' endings. Using the following code:
file_url = "file:///mnt/projects/samples/vcf_format/*.vcf.gz"
I tried to create a file path that only grabs the '.vcf.gz' files while excluding the '.vcf.gz.tbi' files, but I have been unsuccessful.
The code you have, as written, is just assigning your file path to the variable file_url. For something like this, glob is popular but isn't the only option:
import glob, os

file_url = "file:///mnt/projects/samples/vcf_format/"
# os.chdir needs a plain filesystem path, so strip the file:// scheme first
os.chdir(file_url.replace("file://", "", 1))
for file in glob.glob("*.vcf.gz"):
    print(file)
Note that the file path doesn't specify the kind of file you want (in this case, a gzipped VCF); the glob pattern in the for loop takes care of that.
Check out this answer for more options.
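One of those alternative options, sketched with pathlib (using the same directory as the question, minus the file:// scheme):

from pathlib import Path

vcf_dir = Path("/mnt/projects/samples/vcf_format")
# "*.vcf.gz" only matches names ending in .vcf.gz, so the .vcf.gz.tbi index files are skipped
for vcf in vcf_dir.glob("*.vcf.gz"):
    print(vcf.name)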
It took some digging but it looks like you're trying to use the import_vcf function of Hail. To put the files in a list so that it can be passed as input:
import glob, os
import hail as hl

file_url = "file:///mnt/projects/samples/vcf_format/"

def get_vcf_list(path):
    vcf_list = []
    # glob against the local filesystem path, but keep the file:// prefix in the results
    local_path = path.replace("file://", "", 1)
    for file in glob.glob(os.path.join(local_path, "*.vcf.gz")):
        vcf_list.append(path + os.path.basename(file))
    return vcf_list

# Now you pass 'get_vcf_list(file_url)' as your input instead of 'file_url'
mt = hl.import_vcf(get_vcf_list(file_url), force_bgz=True, reference_genome="GRCh38", array_elements_required=False)
I would be very grateful indeed for some help for a frustrated and confused Python beginner.
I am trying to create a script that searches a Windows directory containing multiple subdirectories and different file types for a specific single string (a name) in the file contents, and if found prints the filenames as a list. There are approximately 2000 files in 100 subdirectories, and the files I want to search don't all have the same extension, but they are all, in essence, ASCII files.
I've been trying to do this for many many days but I just cannot figure it out.
So far I have tried using a recursive glob coupled with reading each file, but I'm so very bewildered. I can successfully print a list of all the files in all subdirectories, but I don't know where to go from here.
import glob
files = []
files = glob.glob('C:\TEMP' + '/**', recursive=True)
print(files)
Can anyone please help me? I am a 72-year-old scientist trying to improve my skills and "automate the boring stuff", but at the moment I'm just losing the will.
Thank you very much in advance to this community.
great to have you here!
What you have done so far is find all the file paths; now the simplest way is to go through each of the files, read them into memory one by one, and see if the name you are looking for is there.
import glob

files = glob.glob('C:\TEMP' + '/**', recursive=True)
target_string = 'John Smit'

# iterate over files
for file in files:
    try:
        # open file for reading
        with open(file, 'r') as f:
            # read the contents
            contents = f.read()
            # check if contents have your target string
            if target_string in contents:
                print(file)
    except (OSError, UnicodeDecodeError):
        # the glob also returns directories and unreadable files; just skip them
        pass
This will print the file path each time it finds the name.
Please also note I have removed the second line from your code because it is redundant; you assign the list on line 3 anyway.
Hope it helps!
You could do it like this, though I think there must be a better approach.
Once you have found all the files in your directory, you iterate over them and check whether they contain that specific string.
import os

# 'files' is the list from the recursive glob above
for file in files:
    if os.path.isfile(file):
        with open(file, 'r') as f:
            if 'search_string' in f.read():
                print(file)
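For what it's worth, here is a sketch of one such "better approach" using os.walk, which visits every subdirectory and yields only real file names, so there is no need to filter out folders (the directory and target string are borrowed from the question and the answer above):

import os

target_string = 'John Smit'
for root, dirs, filenames in os.walk(r'C:\TEMP'):
    for name in filenames:
        path = os.path.join(root, name)
        try:
            # errors='ignore' skips over any stray non-ASCII bytes
            with open(path, 'r', errors='ignore') as f:
                if target_string in f.read():
                    print(path)
        except OSError:
            pass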
I am trying to get the file name of the latest file in a directory which has a couple hundred files on a network drive.
Basically the idea is to snip the file name (it's the date/time the file was downloaded, e.g. xyz201912191455.csv) and paste it into a config file every time the script is run.
Now list_of_files usually runs in about a second, but latest_file takes about 100 seconds, which is extremely slow.
Is there a faster way to extract the information about the latest file?
The code sample is below:
import os
import glob
import time
from configparser import ConfigParser
import configparser
list_of_files = glob.glob('filepath\*', recursive=True)
latest_file = max(list_of_files, key=os.path.getctime)
list_of_files2 = glob.glob('filepath\*', recursive=True)
latest_file2 = max(list_of_files2, key=os.path.getctime)
If the filenames already include the datetime, why bother getting their stat information? And if the names are like xyz201912191455.csv, one could use [-16:-4] to extract 201912191455 and as these are zero padded they will sort lexicographically in numerical order. Also recursive=True is not needed here as the pattern does not have a ** in it.
list_of_files = glob.glob('filepath\*')
latest_file = max(list_of_files, key=lambda n: n[-16:-4])
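For instance, the slice pulls the stamp out of a name like xyz201912191455.csv, and since the question already imports configparser, the result could then be written to a config file. This is only a sketch; the settings.ini name and the section/key names are placeholders:

from configparser import ConfigParser

latest_file = 'xyz201912191455.csv'   # e.g. the result of max() above
timestamp = latest_file[-16:-4]       # -> '201912191455'

config = ConfigParser()
config['latest'] = {'file': latest_file, 'timestamp': timestamp}
with open('settings.ini', 'w') as f:
    config.write(f)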
I want to reduce the number of files in a folder, or rather remove a specified number of files from the folder.
I'd appreciate simple solutions that stay close to my code below, which is definitely not working and wrong.
files = os.listdir()
for file in files:
    for file in range(11):
        os.remove(file)
You simply have to iterate over the correct range, by slicing the list:
import os
folder = 'path/to/your/folder'
files = os.listdir(folder)
for file in files[:11]:
    os.remove(os.path.join(folder, file))  # listdir returns bare names, so join them with the folder
In this way you are iterating through a list slice containing the first 11 files.
If you want to remove random files, you can use:
import os
from random import sample

folder = 'path/to/your/folder'
files = os.listdir(folder)
for file in sample(files, 11):
    os.remove(os.path.join(folder, file))
Thanks everyone, I also did a bit of digging and applied solutions from your contributions. Here is what I got. The sorting is by creation time, which was more logical for the task.
import os

filepath = os.getcwd()
files = os.listdir(filepath)
# sort by creation time, oldest first
files = sorted(files, key=os.path.getctime)
for file in files[:11]:
    os.remove(file)
import os

folder = r'C:/testdeletefolder'
files = os.listdir(folder)
fileslist = []
for file in files:
    print(file)
    fileslist.append(os.path.join(folder, file))
for i in range(0, 2):
    os.remove(fileslist[i])
But make sure you add an if condition so you don't delete the Python file which has the code. This will delete 2 files; you can change the counter in the range.
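For example, a minimal sketch of that guard, skipping anything ending in .py so the script itself is never deleted (the folder path is the one from the snippet above):

import os

folder = r'C:/testdeletefolder'
# keep .py files out of the deletion list
files = [f for f in os.listdir(folder) if not f.endswith('.py')]
for file in files[:2]:
    os.remove(os.path.join(folder, file))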
I have a text file with a couple hundred file paths to text files, which I would like to open, cut pieces out of, and save under a new name.
I've been Googling how to do this and found the glob module, but I can't figure out exactly how to use it.
Could you guys point me in the right direction?
If you have specific paths to files, you won't need the glob module. The glob module is useful when you want to use a pattern like /user/home/someone/pictures/*.jpg. From what I understand, you have a file with normal paths.
You can use this code as a start:
with open('file_with_paths', 'r') as paths_list:
    for file_path in paths_list:
        # strip the trailing newline before opening the path
        with open(file_path.strip(), 'r') as file:
            pass  # Do what you want with one of the files here.
You can just traverse the file line by line and take out what you want from each path, then save/create the new file. The sample code below might help:
import os

with open('file_name') as f:
    for file_path in f:
        file_path = file_path.strip()
        file_name = os.path.basename(file_path)
        dir_path = os.path.dirname(file_path)
        # change whatever you want with the two values above and save the file
        # os.makedirs to create a directory
        # open() in write mode to create the new file
Let me know if it helps you