Regex in Python to match all the files in a folder - python

I'm very bad at regex.
I'm trying to locate files in a folder based on the file names. Most of the filenames are in the format GSE1234_series_matrix.txt, hence I've been using os.path.join("files", GSE_num + "_series_matrix.txt"). However, a few files have names like GSE1234-GPL22_series_matrix.txt. I'm not sure how to address all the files starting with a GSE number and ending with _series_matrix.txt together, possibly in one statement. I'd really appreciate any help.
EDIT - I have these series matrix text files in a folder, for which I mention the path using path join. I also input a text file, which has all the GSE numbers. This way it runs the script only for selected GSE numbers. So not everything that's in the folder is in GSE num list AND the list just has GSE numbers and not GPL. For instance the file GSE1234-GPL22_series_matrix.txt would be GSE1234 in the list.

Skip using regexes entirely.
good_filenames = [name for name in filenames if name.startswith("GSE") and name.endswith("_series_matrix.txt")]

You could use glob. Depending on how much of the path you include in the pattern, you wouldn't have to worry about using os.path.join at all.
import glob
good_filenames = glob.glob('/your/path/here/GSE*_series_matrix.txt')
returns:
['/your/path/here/GSE1234_series_matrix.txt',
'/your/path/here/GSE1234-GPL22_series_matrix.txt']

Kevin's answer is great! If you'd like to use a regex, you can do something like this:
^GSE\d+.*_series_matrix\.txt$
That would match anything that starts with GSE and a number, and ends with _series_matrix.txt. The dot before txt is escaped so that it only matches a literal dot.
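To cover the EDIT in the question (the list only holds plain GSE numbers, while some files carry a -GPL suffix), here is a minimal sketch. The helper name find_series_files is made up for illustration, and it assumes the optional part always looks like -GPL followed by digits, which the question does not confirm.

```python
import os
import re


def find_series_files(folder, gse_num):
    """Return the series matrix files in `folder` for one GSE number (hypothetical helper)."""
    # Assumption: the optional suffix is always "-GPL" followed by digits.
    pattern = re.compile(
        r"^" + re.escape(gse_num) + r"(?:-GPL\d+)?_series_matrix\.txt$"
    )
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if pattern.match(name)
    )
```

Anchoring the GSE number this way keeps GSE1234 from also picking up GSE12345, which a plain GSE1234* glob would do.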

Related

Deleting files based on day within filename

I have a directory with files like: data_Mon_15-8-22.csv, data_Tue_16-8-22.csv, data_Mon_22-8-22.csv etc and I am trying to delete all but the Monday files. However, my script doesn't seem to differentiate between the filenames and just deletes everything despite me stating it. Where did I go wrong? Any help would be much appreciated!
My Code:
def file_delete():
    directory = pathlib.Path('/Path/To/Data')
    for file in directory.glob('data_*.csv'):
        if file != 'data_Mon_*.csv':
            os.remove(file)
If all Monday files start with "data_Mon_" then you might use str.startswith:
import os
import pathlib
def file_delete():
    directory = pathlib.Path('/Path/To/Data')
    for file in directory.glob('data_*.csv'):
        # file is a Path object; file.name is the bare filename as a string
        if not file.name.startswith('data_Mon_'):
            os.remove(file)
if file != 'data_Mon_*.csv'
There are two problems here:
file is compared against the string 'data_Mon_*.csv'. Since file is a pathlib.Path object and not a string, these two objects will never be equal, so the if condition will always be true. To fix this, you need to compare the file's name (file.name) rather than the file object itself.
Even if you fix this, the string 'data_Mon_*.csv' is a literal. In other words, the * is just a *. Unlike in directory.glob('data_*.csv'), it only matches an actual * character rather than "anything" as in a glob expression. To fix this, you need to do pattern matching on the file name, for example with fnmatch or a regular expression.
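Putting both points together, a minimal sketch could look like this. It swaps in fnmatch (standard library) for the pattern test; the function name delete_non_monday and the returned list of removed names are only there for illustration.

```python
import fnmatch
import pathlib


def delete_non_monday(directory):
    """Delete data_*.csv files whose name does not match data_Mon_*.csv."""
    removed = []
    for file in pathlib.Path(directory).glob("data_*.csv"):
        # Compare the name (a str), not the Path object, against the glob-style pattern.
        if not fnmatch.fnmatch(file.name, "data_Mon_*.csv"):
            file.unlink()
            removed.append(file.name)
    return sorted(removed)
```

It is worth trying this on a copy of the folder first, since the deletion cannot be undone.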

Rename directory with constantly changing name

I created a script that is supposed to download some data, then run a few processes. The data source (being ArcGIS Online) always downloads the data as a zip file and when extracted the folder name will be a series of letters and numbers. I noticed that these occasionally change (not entirely sure why). My thought is to run an os.listdir to get the folder name then rename it. Where I run into issues is that the list returns the folder name with brackets and quotes. It returns as ['f29a52b8908242f5b1f32c58b74c063b.gdb'] as the folder name while folder in the file explorer does not have the brackets and quotes. Below is my code and the error I receive.
import os
from zipfile import ZipFile
file_name = "THDNuclearFacilitiesBaseSandboxData.zip"
with ZipFile(file_name) as zip:
    # unzipping all the files
    print("Unzipping "+ file_name)
    zip.extractall("C:/NAPSG/PROJECTS/DHS/THD_Nuclear_Facilities/SCRIPT/CountyDownload/Data")
    print('Unzip Complete')
#removes old zip file
os.remove(file_name)
x = os.listdir("C:/NAPSG/PROJECTS/DHS/THD_Nuclear_Facilities/SCRIPT/CountyDownload/Data")
os.renames(str(x), "Test.gdb")
Output:
FileNotFoundError: [WinError 2] The system cannot find the file specified: "['f29a52b8908242f5b1f32c58b74c063b.gdb']" -> 'Test.gdb'
I'm relatively new to python scripting, so if there is an easier alternative, that would be great as well. Thanks!
os.listdir() returns a list of the files and folders in a directory.
Lists are represented, when printed to the screen, using a set of brackets.
The name of each file is a string of characters, and strings are represented, when printed to the screen, using quotes.
So we are seeing a list with a single filename:
['f29a52b8908242f5b1f32c58b74c063b.gdb']
To access an item within a list in Python, you can use index notation (which also happens to use brackets, telling Python which item in the list to use by referencing the index, or number, of the item).
Python list indexes start at zero, so to get the first (and in this case only) item in the list, you can use x[0].
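x = os.listdir("C:/NAPSG/PROJECTS/DHS/THD_Nuclear_Facilities/SCRIPT/CountyDownload/Data")
os.renames(x[0], "Test.gdb")
Note that os.listdir() returns bare names, not full paths, so the lines above only work if the script's working directory is the Data folder. Having said that, I would generally not use x as a variable name in this case... I might write the code a bit differently:
data_dir = "C:/NAPSG/PROJECTS/DHS/THD_Nuclear_Facilities/SCRIPT/CountyDownload/Data"
files = os.listdir(data_dir)
# Join the bare name with the directory; keeping Test.gdb inside Data is an assumption
os.renames(os.path.join(data_dir, files[0]), os.path.join(data_dir, "Test.gdb"))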
Square brackets indicate a list. Try x[0]; that should get rid of the brackets and give you just the data.
The return value from listdir may be a list with only one value or a whole bunch.
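Since the list may hold more than one entry, a slightly more defensive sketch is to look for the one .gdb folder explicitly instead of trusting index 0. The helper name rename_extracted_gdb is hypothetical, and it assumes the renamed folder should stay inside the same Data directory.

```python
import pathlib


def rename_extracted_gdb(data_dir, new_name="Test.gdb"):
    """Rename the single *.gdb folder inside data_dir (hypothetical helper)."""
    data_dir = pathlib.Path(data_dir)
    candidates = [p for p in data_dir.iterdir() if p.is_dir() and p.suffix == ".gdb"]
    if len(candidates) != 1:
        # Refuse to guess when there is no .gdb folder or more than one.
        raise RuntimeError("expected exactly one .gdb folder, found %d" % len(candidates))
    target = data_dir / new_name
    candidates[0].rename(target)
    return target
```

Working with full Path objects also avoids depending on the script's current working directory.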

Find a directory in a list of directories containing a portion of a string using python

My first post and I'm rather new to programming so please be patient with me. I've searched quite a bit for somebody else trying to solve this problem but I can't find a specific case like mine.
If I have a directory with a bunch of subdirectories, I want to find a specific directory whose name contains a portion of the text I'm looking for, and then move that directory to another directory.
So for example, I have the following directories
c:/users/bob/folders/folder123
c:/users/bob/folders/folder456
c:/users/bob/folders/folder789
c:/users/bob/folders/folder0
I ask the user "What string are you looking for?". Let's imagine the user tells me that they're looking for the string "123". I want to be able to find a folder that contains that text and then move that folder and all of its contents to a new directory.
Getting the input from the user is obviously quite easy
print('What is the rev number?') #ask the user for the rev number
revNumber = input()
Now I need to pass that variable into some code that searches a specific directory for folders containing that text. Once the folder is found, how do I move it? I know how to move individual files using shutil.move and was wondering if it could also be used to move an entire folder and all of its contents.
Thanks so much in advance.
You can use the os package to create directories and then copy the content with shutil. Note that os also has a function to traverse entire directory hierarchies, os.walk(). Have a look at the documentation.
shutil.move can also move directories, though.
Then, to search for a particular string in another string, you can use the "in" keyword, which returns True if the string was found:
# Example path
path = "path/to/folder/file6854.jpg"
# String to look for
to_find = "854"
if to_find in path:
    pass  # Do something
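Combining the two ideas (the "in" test and shutil.move), a minimal sketch for the question could look like the following. The function name move_matching_folder is made up, and it assumes only the first matching subfolder (in sorted order) should be moved.

```python
import os
import shutil


def move_matching_folder(parent, text, destination):
    """Move the first subfolder of `parent` whose name contains `text`."""
    for name in sorted(os.listdir(parent)):
        source = os.path.join(parent, name)
        if os.path.isdir(source) and text in name:
            # shutil.move moves the folder together with everything inside it.
            return shutil.move(source, os.path.join(destination, name))
    return None  # nothing matched
```

With the example from the question this would be called roughly as move_matching_folder("c:/users/bob/folders", revNumber, some_target_dir), where some_target_dir is whatever existing directory the folder should end up in.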

How to find all files in current directory with filenames that match a certain pattern in python?

I am trying to find all the files in the same directory as my script that has a filename matching a certain pattern. Ideally, I would like to store it in an array once I get them. The pattern I need to match is something like: testing.JUNK.08-05.txt. All the filenames have the testing in the front and end with the date (08-05.txt). The only difference is the JUNK in the middle which can include any valid characters.
What would be the most efficient way to do this? I can be working with anywhere from 1 to thousands of files?
Additional things to note: Using python 2.6 and I need this to work on Unix-based operating systems.
Use the glob module:
import glob
for name in glob.glob('testing*08-05.txt'):
    print(name)

Regex to match the first file in a rar archive file set in Python

I need to uncompress all the files in a directory and for this I need to find the first file in the set. I'm currently doing this using a bunch of if statements and loops. Can I do this using a regex?
Here's a list of files that I need to match:
yes.rar
yes.part1.rar
yes.part01.rar
yes.part001.rar
yes.r01
yes.r001
These should NOT be matched:
no.part2.rar
no.part02.rar
no.part002.rar
no.part011.rar
no.r002
no.r02
I found a similar regex on this thread but it seems that Python doesn't support variable-length lookbehinds. A single-line regex would be complicated, but I'll document it well so that's not a problem. It's just one of those problems you beat your head over.
Thanks in advance guys.
:)
Don't rely on the names of the files to determine which one is first. You're going to end up finding an edge case where you get the wrong file.
RAR's headers will tell you which file is the first one in the volume set, assuming they were created in a somewhat-recent version of RAR.
HEAD_FLAGS Bit flags:
2 bytes
0x0100 - First volume (set only by RAR 3.0 and later)
So open up each file and examine the RAR headers, looking specifically for the flag that indicates which file is the first volume. This will never fail, as long as the archive isn't corrupt.
Update: I've just confirmed this by taking a look at some spanning archives in a hex editor. The file headers are constructed exactly as the format description above indicates. It's just a matter of opening the files and reading the header for that flag. The file with that flag is the first volume.
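A rough sketch of that header check is below. It assumes the older RAR format (1.5 to 4.x, not RAR5), where a 7-byte marker is followed directly by the main archive header laid out as HEAD_CRC (2 bytes), HEAD_TYPE (1 byte, 0x73), HEAD_FLAGS (2 bytes), HEAD_SIZE (2 bytes). The function name is_first_volume is made up, and self-extracting archives, which have a stub before the marker, are not handled.

```python
import struct

RAR_MARKER = b"Rar!\x1a\x07\x00"  # signature of the older RAR format (assumed, not RAR5)
MAIN_HEAD_TYPE = 0x73             # block type of the main archive header
FIRST_VOLUME_FLAG = 0x0100        # "First volume" bit from the HEAD_FLAGS description


def is_first_volume(path):
    """Return True if the main header of `path` has the first-volume flag set."""
    wanted = len(RAR_MARKER) + 7
    with open(path, "rb") as handle:
        data = handle.read(wanted)
    if len(data) < wanted or not data.startswith(RAR_MARKER):
        return False
    # Little-endian: HEAD_CRC (2 bytes), HEAD_TYPE (1), HEAD_FLAGS (2), HEAD_SIZE (2)
    _crc, head_type, flags, _size = struct.unpack_from("<HBHH", data, len(RAR_MARKER))
    return head_type == MAIN_HEAD_TYPE and bool(flags & FIRST_VOLUME_FLAG)
```

A plain single-file archive that is not part of a volume set would presumably not have this flag, so it may need to be treated as a separate case.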
There's no need to use look behind assertions for this. Since you start looking from the beginning of the string, you can do everything with look-aheads that you can with look-behinds. This should work:
^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$
To capture the first part of the filename as you requested, you could do this:
^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$
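As a quick sanity check, the capturing version can be run against the two lists from the question. The variable names here are only for illustration.

```python
import re

# The capturing regex from the answer above, unchanged.
FIRST_PART = re.compile(
    r"^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$"
)

should_match = [
    "yes.rar", "yes.part1.rar", "yes.part01.rar",
    "yes.part001.rar", "yes.r01", "yes.r001",
]
should_not_match = [
    "no.part2.rar", "no.part02.rar", "no.part002.rar",
    "no.part011.rar", "no.r002", "no.r02",
]

# Names the regex accepts, in input order.
matched = [name for name in should_match + should_not_match if FIRST_PART.match(name)]
# Captured base name for each accepted file.
base_names = [FIRST_PART.match(name).group(1) for name in matched]

print(matched)
print(base_names)
```

Only the six "yes" names should come out in matched, each with "yes" as the captured group.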
Are you sure you want to match these cases?
yes.r01
They are not the first archives: .rar always is.
It's bla.rar, bla.r00 and then only bla.r01. You'll probably extract the files twice if you match .r01 and .rar as first archive.
yes.r001
.r001 doesn't exist. Do you mean the .001 files that WinRAR supports?
After .r99, it's .s00. If it does exist, then somebody manually renamed the files.
In theory, matching on filename should be as reliable as matching on the 0x0100 flag to find the first archive.
