Find efficiently a file with unknown extension


I have a problem that feels easy, but I cannot come up with a satisfying solution.
I have a directory containing a very large number of files. Each file name is just the file's index followed by an unknown extension. For example, the 10th file is "10.pdf" and the 42nd file is "42.png". There can be many different extensions.
I need to access the i-th file from Python, given the index i but not knowing the extension. This will happen a lot, so I need to be able to do it efficiently.
Here are the partial solutions I could think of:
I can glob the pattern f"{i}.*"
However, I think glob will check every file in the directory? That would be very slow for a large number of files.
I can save the full names in a JSON file and preload them into a dict, like {..., 10: "10.pdf", ...}
This works, but I have to load and keep track of another heavy object. This feels wrong somehow...
If I have a list of all allowed extensions, I can just test every possibility. This feels weird and unnecessary, but it's my best guess for now.
What do you think? Is one of these proposals the correct way to do it?

As I understand it, you only need the file name without the extension. One way is to split the extension off each file name, for example:
import os
path = r"Enter your folder's path here"
file_dict = {}
for file in os.listdir(path):
    # os.listdir returns bare names of both files and folders,
    # so join each name with path before testing it
    if os.path.isfile(os.path.join(path, file)):
        file_name, ext = os.path.splitext(file)
        print(file_name, ext)
For example, if your file is '10.pdf' then file_name='10' and ext='.pdf'. Inside the loop you can then add it to a dictionary for later lookups:
file_dict[file_name] = os.path.join(path, file)
Another way is to use regular expressions (the re module), which is handy if your names follow a more complex pattern. You write the pattern you want, for example:
import os
import re
path = r"Enter your folder's path here"
file_dict = {}
for file in os.listdir(path):
    if os.path.isfile(os.path.join(path, file)):
        # group 1: everything before the last dot; group 2: the dot and the extension
        mo = re.search(r'(.*)(\..*)', file)
        if mo:  # names without a dot do not match
            file_name, ext = mo.groups()
            print(file_name, ext)
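For the original question (looking up file i many times), one possible approach is to scan the directory once and keep an in-memory index from integer index to path. The sketch below is only an illustration under assumptions: the helper name build_index is hypothetical, and it assumes every file of interest is named as an integer followed by an extension, such as "10.pdf".

```python
import os


def build_index(directory):
    """Map each integer file index to its full path (hypothetical helper)."""
    index = {}
    # os.scandir yields entries lazily and avoids a separate stat call
    # per file on most platforms.
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_file():
                continue  # skip sub-directories
            stem, _ext = os.path.splitext(entry.name)
            if stem.isdecimal():  # assumption: names look like "10.pdf"
                index[int(stem)] = entry.path
    return index
```

After index = build_index(path), each index[i] is a plain dictionary lookup, so the cost of scanning the directory is paid once rather than on every access.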

Related

python: set file path to only point to files with a specific ending

I am trying to run a program which requires only pVCF files as inputs. Due to the size of the data, I am unable to create a separate directory containing just the files that I need.
The directory contains multiple files with 'vcf.gz.tbi' and 'vcf.gz' endings. Using the following code:
file_url = "file:///mnt/projects/samples/vcf_format/*.vcf.gz"
I tried to create a file path that only grabs the '.vcf.gz' files while excluding the '.vcf.gz.tbi' files, but I have been unsuccessful.

The code you have, as written, just assigns a string to the variable file_url; nothing expands the * wildcard. For something like this, glob is popular but isn't the only option. Note that glob and os work on local filesystem paths, not file:// URLs, so the directory is given as a plain path here:
import glob, os
vcf_dir = "/mnt/projects/samples/vcf_format"
for file in glob.glob(os.path.join(vcf_dir, "*.vcf.gz")):
    print(file)
Note that the directory path doesn't contain the kind of file you want (in this case, a gzipped VCF); the pattern passed to glob does that. Because the pattern ends in .vcf.gz, the .vcf.gz.tbi files are not matched.
Check out this answer for more options.
It took some digging, but it looks like you're trying to use the import_vcf function of Hail. To put the files in a list so that it can be passed as input:
import glob, os
import hail as hl
vcf_dir = "/mnt/projects/samples/vcf_format"
def get_vcf_list(path):
    # glob needs a local path; re-add the file:// scheme used in the question
    local_paths = sorted(glob.glob(os.path.join(path, "*.vcf.gz")))
    return ["file://" + p for p in local_paths]
# Now you pass 'get_vcf_list(vcf_dir)' as your input instead of 'file_url'
mt = hl.import_vcf(get_vcf_list(vcf_dir), force_bgz=True, reference_genome="GRCh38", array_elements_required=False)
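To check that a pattern ending in .vcf.gz really leaves out the index files, here is a small standard-library sketch. The helper name list_vcf_files is hypothetical, and it assumes the directory is reachable as a local path rather than a file:// URL.

```python
from pathlib import Path


def list_vcf_files(directory):
    """Return sorted paths of files ending in .vcf.gz (hypothetical helper)."""
    # "*.vcf.gz" requires the name to END with ".vcf.gz",
    # so "sample.vcf.gz.tbi" does not match.
    return sorted(str(p) for p in Path(directory).glob("*.vcf.gz"))
```

If the consuming tool expects URLs, Path.as_uri() can turn an absolute path into a file:// URL; whether that is needed depends on the tool.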

Reading .csv and .xlsx in a directory using Data frame

How can I read all files in a directory if it contains both .csv and .xlsx files?
I tried:
read_files = Path(path).rglob("*.csv","*.xlsx")
all_files = [pd.read_excel(file) for file in read_files]
But it is not working. How can I achieve it?

The rglob method of pathlib.Path accepts exactly one pattern. You can (somewhat inefficiently) traverse twice and chain the results; rglob returns generators, which cannot be combined with +:
from itertools import chain
from pathlib import Path
pathobj = Path(path)
read_files = chain(pathobj.rglob("*.csv"), pathobj.rglob("*.xlsx"))
or refactor the code to use a different traversal function (os.walk comes to mind), or do an rglob on all files (or all files with a dot in their name) and keep only the ones which match either of your patterns:
read_files = filter(lambda x: x.suffix in ('.xlsx', '.csv'), Path(path).rglob("*.*"))
As the documentation says, the pattern syntax supported by rglob is the same as for the fnmatch module.
... Though as it turns out, using read_excel for CSV files is probably not the way to go anyway; try
import pandas as pd
pathobj = Path(path)
all_files = [pd.read_csv(file) for file in pathobj.rglob("*.csv")]
all_files.extend(pd.read_excel(file) for file in pathobj.rglob("*.xlsx"))
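The single-traversal idea can be sketched with the standard library alone. This is an illustration rather than a definitive implementation: the helper name group_by_suffix is hypothetical, and treating suffixes case-insensitively is an assumption that may or may not suit your data.

```python
from collections import defaultdict
from pathlib import Path


def group_by_suffix(root, suffixes=(".csv", ".xlsx")):
    """Walk root once and group matching files by lower-cased suffix."""
    groups = defaultdict(list)
    for p in Path(root).rglob("*"):
        suffix = p.suffix.lower()
        if p.is_file() and suffix in suffixes:
            groups[suffix].append(p)
    return groups
```

The paths in groups[".csv"] could then be passed to pd.read_csv and those in groups[".xlsx"] to pd.read_excel, so each file is read with the matching reader after only one walk of the tree.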

How can I read files with similar names on python, rename them and then work with them?

I've already posted this question here, but sadly I couldn't arrive at a solution (some of you gave me great answers, but most of them weren't what I was looking for), so I'll try again, this time giving more information about what I'm trying to do.
So, I'm using a program called GMAT to get some outputs (.txt files with numerical values). These outputs have different names, but because I'm using them for more than one thing I'm getting something like this:
GMATd_1.txt
GMATd_2.txt
GMATf_1.txt
GMATf_2.txt
Now, what I need to do is use these outputs as inputs in my code. I need to work with them in other functions of my script, and since I will have a lot of these .txt files, I want to rename them, as I don't want to refer to them as './path/etc'.
So what I wanted was to write a loop that gets these files and renames them inside the script, so I can use the files under their new names in other functions (outside the loop).
So instead of having to do this individually:
GMATds1= './path/GMATd_1.txt'
GMATds2= './path/GMATd_2.txt'
I wanted to write a loop that would do that for me.
I've already tried using a dictionary:
import os
import fnmatch
examples = {}
for filename in os.listdir('.'):
    if fnmatch.fnmatch(filename, 'thing*.txt'):
        examples[filename[:6]] = filename
This does work, but I can't use the dictionary key outside the loop.

If I understand correctly, you are trying to fetch files with similar names (or at least a recurring pattern) and rename them. This can be accomplished with the following code:
import glob
import os
all_files = glob.glob('path/to/directory/with/files/GMAT*.txt')
for file in all_files:
    # create_new_path is a function you write yourself: split the file name, change directory and/or filename
    new_path = create_new_path(file)
    os.rename(file, new_path)
The glob library supports * wildcards and hence makes it possible to search for files matching a specific pattern. It lists all matching files in a directory (or in multiple directories if you include a * wildcard in the directory part). When you iterate over the files, you can either work with their contents directly (as you apparently intend to do) or rename them as shown in this snippet. To rename them, you need to generate a new path, so you have to write the create_new_path function that takes the old path and returns the new one.

Since Python 3.4 you can use the built-in pathlib package instead of os or glob.
from pathlib import Path
import shutil
for file_src in Path("path/to/files").glob("GMAT*.txt"):
    # note: replace() acts on the whole path, including directory names
    file_dest = str(file_src.resolve()).replace("ds", "d_")
    # str() keeps shutil.move working on older Python 3 versions
    shutil.move(str(file_src), file_dest)

You can use:
import os
path = '.....'   # path where these files are located
path1 = '.....'  # path where you want to store the renamed files
i = 1
for file in os.listdir(path):
    if file.endswith('.txt'):  # str.endswith takes no keyword arguments
        os.rename(os.path.join(path, file), os.path.join(path1, str(i) + ".txt"))
        i += 1
This renames all the .txt files in the source folder to 1.txt, 2.txt, ..., n.txt in the destination folder.
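If the goal is only to refer to the files by short names inside the script, renaming them on disk may not be necessary: a dictionary built once stays available after the loop that fills it. The sketch below assumes the GMAT*.txt naming from the question; the helper name collect_outputs is hypothetical.

```python
from pathlib import Path


def collect_outputs(directory, pattern="GMAT*.txt"):
    """Map each file stem (e.g. 'GMATd_1') to its path (hypothetical helper)."""
    return {p.stem: p for p in sorted(Path(directory).glob(pattern))}
```

With outputs = collect_outputs("./path"), a value such as outputs["GMATd_1"] can be passed to open() or to any other function in the script.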

Programmatically moving files in python

I'm trying to simply move files from folder path1 to folder fpath.
import os
import shutil
path1 = '/home/user/Downloads'
file_dir = os.listdir(path1)
fpath = '/home/user/music'
for file in file_dir:
    if file.endswith('.mp3'):
        shutil.move(os.path.join(file_dir,file), os.path.join(fpath, file))
... but I get this error:
TypeError: expected str, bytes or os.PathLike object, not list

First of all, I'd avoid file as a variable name: it was a builtin in Python 2 (it no longer is in Python 3), so consider using f instead.
Also notice that in the shutil.move line, I've changed your os.path.join(file_dir, file) to os.path.join(path1, f). file_dir is a list, not the name of the directory that you're looking for; that value is stored in your path1 variable.
Altogether, it looks like this:
import os
import shutil
path1 = '/home/user/Downloads'
file_dir = os.listdir(path1)
fpath = '/home/user/music'
for f in file_dir:
    if f.endswith('.mp3'):
        shutil.move(os.path.join(path1, f), os.path.join(fpath, f))

You have confused your variable purposes from one line to the next. You've also over-built your file path construction.
You set up file_dir as a list of all the files in path1. That works fine through your for statement, where you iterate through that list. The move function requires two paths, given as simple strings. Look at how you construct your source path:
os.path.join(file_dir,file)
Remember, file_dir is a list of the files in path1, and file is one of the files in that list. What are you trying to do here? Do you perhaps mean to join path1 with file?
NOTE: Using pre-defined names as variables is bad practice. file was a pre-defined type in Python 2. Instead, use f or local_file, perhaps.

Read the error message carefully. file_dir is a list. You cannot join it with os.path.join. You probably want to write:
shutil.move(os.path.join(path1, f), os.path.join(fpath, f))
I suggest giving variables meaningful names, like:
file_list = os.listdir(path1)
This way you will not join a file list with a path :)
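The same move can be written with pathlib, where the directory and the individual file are distinct objects and are harder to mix up. This is a minimal sketch under assumptions: the helper name move_by_suffix is hypothetical, and it assumes the destination directory already exists.

```python
import shutil
from pathlib import Path


def move_by_suffix(src_dir, dst_dir, suffix=".mp3"):
    """Move files ending in suffix from src_dir to dst_dir; return moved names."""
    src, dst = Path(src_dir), Path(dst_dir)
    moved = []
    # sorted() materializes the listing before any file is moved
    for f in sorted(src.iterdir()):
        if f.is_file() and f.suffix == suffix:
            # str() keeps this working on older Python 3 versions
            shutil.move(str(f), str(dst / f.name))
            moved.append(f.name)
    return moved
```

For the question's folders this would be called as move_by_suffix('/home/user/Downloads', '/home/user/music').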

Fast way to read filename from directory?

Given a local directory structure of /foo/bar, and assuming that a given path contains exactly one file (filename and content does not matter), what is a reasonably fast way to get the filename of that single file (NOT the file content)?

Take the first element of os.listdir():
import os
os.listdir('/foo/bar')[0]

Well, I know this code works...
for file in os.listdir('.'):
    pass  # do something with file

You can also use glob:
import glob
print(glob.glob("/path/*")[0])

os.path.basename will return the file name for you,
so if you already know the full path of the file you can use it like this:
os.path.basename("/foo/bar/file.file")
Or you can run through the files in the folder and print all the names (os.listdir already returns bare names, so basename is not needed here):
import os
file_src = "/foo/bar/"
for x in os.listdir(file_src):
    print(x)
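os.listdir builds the complete list before returning, whereas os.scandir yields entries lazily, so the first entry can be taken without listing the rest. The sketch below illustrates that idea; the helper name first_entry_name is hypothetical, and no claim is made that the difference is measurable for a directory holding a single file.

```python
import os


def first_entry_name(directory):
    """Return the name of the first directory entry, or None if it is empty."""
    with os.scandir(directory) as entries:
        first = next(entries, None)  # stop after the first entry
    return first.name if first is not None else None
```

Calling first_entry_name('/foo/bar') would return the name of the single file, under the question's assumption that the directory contains exactly one.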
