Python: filename matching using regex - python

I have written this code:
import os
from datetime import datetime
import re
now = datetime.now()
filename = now.strftime("%Y%m%d%H%M")  # For example 202006191839
for fname in os.listdir(downloadPath):
    if re.match('export_' + filename + '[0-9]{2}.xlsx', fname):
        print(fname)
In downloadPath I have these files
export_20200619183900.xlsx
export_20200619183921.xlsx
export_20200619183930.xlsx
However, re.match does not match as expected.
If I replace
filename = now.strftime("%Y%m%d%H%M")
with a simple assignment
filename = "202006191839"
the code works.
The problem is that I need the value to be dynamic.
Can anyone help me?

I think it is because you are matching 'export_' + filename, but you said the file was excel_20200619183900

OK, I have solved the problem... I was probably blind.
The file I am searching for is downloaded just before the code above runs and, despite it being very small, the search starts before the download has finished.
I have added a simple time.sleep(2) before the search.
Thanks to all.
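A fixed sleep works, but it either waits longer than needed or not long enough on a slow connection. As a minimal sketch of an alternative, the search can be repeated until a matching file shows up or a timeout expires. The names wait_for_match and export_pattern are hypothetical helpers, not part of the original code, and the timeout and interval values are assumptions.

```python
import os
import re
import time


def export_pattern(stamp):
    # re.escape guards the literal stamp; \. matches a real dot only.
    return "export_" + re.escape(stamp) + r"[0-9]{2}\.xlsx"


def wait_for_match(directory, pattern, timeout=10.0, interval=0.5):
    """Poll `directory` until a file name fully matches `pattern`.

    Returns the sorted list of matching names, or an empty list if
    nothing appeared before `timeout` seconds elapsed.
    """
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    while True:
        matches = sorted(f for f in os.listdir(directory) if regex.fullmatch(f))
        if matches or time.monotonic() >= deadline:
            return matches
        time.sleep(interval)
```

With the question's variables this would be called as wait_for_match(downloadPath, export_pattern(filename)). Note that a file can be visible in the directory while it is still being written, so this only shortens the wait; it does not prove the download is complete.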

Related

How to move files with a specified extension to a new folder in Python

I am trying to move files from one folder to another, moving only the files with the .bmp extension.
I am using the shutil.move() method, and it works when I don't specify file types, but once I do, it stops working. I have tried to debug it but can't figure out why my code isn't working. I don't get any tracebacks; nothing happens.
import time
import os
import shutil
from datetime import datetime
today = datetime.now()
src = "."
dst = ('C:\Python\Image Compressor\File Saving\Archive\BDB040803_St14_' +
       today.strftime('%d_%m_%Y'))
files = os.listdir(src)
for file in src:
    if file.endswith(".bmp"):
        shutil.move(os.path.join(src,file), os.path.join(dst,file))
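Replace src with files in the line:
for file in src:
As written, the loop iterates over the characters of the string ".", so nothing ever ends with ".bmp". It should read:
for file in files: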
In case you're interested in an alternative approach (yours isn't bad, and this isn't meant as criticism), you could use pathlib:
from pathlib import Path
from datetime import date
src = Path()
dst = Path(
    r"C:\Python\Image Compressor\FileSaving\Archive\BDB040803_St14_"
    + date.today().strftime("%d_%m_%Y")
)
for file in src.glob("*.bmp"):
    file.replace(dst / file.name)
.glob lets you fetch paths and file names that match a pattern. The patterns aren't as powerful as regex, but can still be pretty helpful. You could also use the glob module directly.
.replace takes over shutil.move's job, although unlike shutil.move it may fail when the source and destination are on different filesystems.
Okay, so I have finally solved my problem. I will post my code here as a reference for anyone else who runs into this. @siloob, thank you for your input; your correction was right, but I still had tracebacks after it, so I had a few more changes to make.
I needed my .bmp files to transfer and not my .py files, so I added an if statement with a continue to skip over any files ending in .py. This solved my problem and only transferred the .bmp files.
import time
import os
import shutil
from datetime import datetime
today = datetime.now()
src = "."
dst = ('C:\Python\Image Compressor\File Saving\Archive\BDB040803_St14_' +
       today.strftime('%d_%m_%Y'))
files = os.listdir(src)
for file in files:
    if file.endswith(".py"):
        continue
    else:
        shutil.move(os.path.join(src,file), os.path.join(dst,file))
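Skipping .py files works only as long as the folder contains nothing but .py and .bmp files. A minimal sketch that selects by the wanted extension instead, and creates the destination folder first, could look like this. The function name move_by_extension and its parameters are hypothetical, and the case-insensitive comparison is an assumption about what is wanted.

```python
import shutil
from pathlib import Path


def move_by_extension(src, dst, extension=".bmp"):
    """Move every regular file in `src` ending with `extension` into `dst`."""
    src, dst = Path(src), Path(dst)
    # Make sure the target folder exists before moving anything into it.
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(src.iterdir()):
        # Compare case-insensitively so "SCAN.BMP" is picked up as well.
        if entry.is_file() and entry.name.lower().endswith(extension.lower()):
            shutil.move(str(entry), str(dst / entry.name))
            moved.append(entry.name)
    return moved
```

Called as move_by_extension(".", dst), it leaves scripts, sub-folders and any other file types where they are.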

How do I iterate through files in my directory so they can be opened/read using PyPDF2?

I am working on an invoice scraper for work, where I have successfully written all the code to scrape the fields that I need using PyPDF2. However, I am having trouble figuring out how to put this code into a for loop so I can iterate through all the invoices stored in my directory. There could be anywhere from 1 to 250+ files depending on which project I am using this for.
I thought I would be able to use "*.pdf" in place of the pdf name, but it does not work for me. I am relatively new to Python and have not used that many loops before, so any guidance would be appreciated!
import re
import PyPDF2
pdfFileObj = open(r'C:\Users\notylerhere\Desktop\Test Invoices\SampleInvoice.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
# Print all text on page
# print(pageObj.extractText())
# Grab account number and meter number
accountNumber = re.compile(r'\d\d\d\d\d-\d\d\d\d\d')
meterNumber = re.compile(r'(\d\d\d\d\d\d\d\d)')
moAccountNumber = accountNumber.search(pageObj.extractText())
moMeterNumber = meterNumber.search(pageObj.extractText())
print('Account Number: '+moAccountNumber.group())
print('Meter Number: '+moMeterNumber.group(1))
Thanks very much!
Another option is glob:
import glob
files = glob.glob("c:/mydirectory/*.pdf")
for file in files:
    pass  # do your processing of file here
You need to ensure everything past the colon is properly indented.
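You want to iterate over your directory and deal with every file independently.
There are many functions to choose from, depending on your use case. os.walk is a good place to start.
Example:
import os
for root, directories, files in os.walk('.'):
    for file in files:
        if file.endswith('.pdf'):
            # openAndDoStuff is a placeholder for your own processing function;
            # join with root because os.walk yields bare file names
            openAndDoStuff(os.path.join(root, file))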
Alternatively, using os.listdir:
import os
import PyPDF2
for el in os.listdir(os.getcwd()):
    if el.endswith(".pdf"):
        # PDF files must be opened in binary mode
        pdf_reader = PyPDF2.PdfFileReader(open(os.path.join(os.getcwd(), el), "rb"))
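Putting the pieces together, the question's two regexes can be wrapped in a function and applied to every PDF in a folder. The following is a minimal sketch under stated assumptions: parse_invoice_text and scrape_folder are hypothetical names, and read_first_page is a callable you supply (for instance one that wraps the PdfFileReader / getPage(0) / extractText() calls from the question), so the sketch itself does not depend on any PDF library.

```python
import re
from pathlib import Path

# Same patterns as in the question, written with repetition counts.
ACCOUNT_RE = re.compile(r"\d{5}-\d{5}")
METER_RE = re.compile(r"(\d{8})")


def parse_invoice_text(text):
    """Pull the account and meter numbers out of one page of text."""
    account = ACCOUNT_RE.search(text)
    meter = METER_RE.search(text)
    return {
        "account": account.group() if account else None,
        "meter": meter.group(1) if meter else None,
    }


def scrape_folder(folder, read_first_page):
    """Apply parse_invoice_text to every *.pdf file in `folder`."""
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        results[pdf_path.name] = parse_invoice_text(read_first_page(pdf_path))
    return results
```

Returning None for a missing field, instead of calling .group() on a failed search, keeps one odd invoice from stopping a run over 250 files.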

access a file with not fully known filename in python

I have a huge database of files whose names are like:
XYZ-ABC-K09235D1-20151220-5H1E2H4A.txt
XYZ-ABC-W8D2S5G5-20151225-HG2EK4GE.txt
XYZ-ABC-ME2C5K32-20160206-DD8BA4R6.txt
etc...
All names have the same structure:
'XYZ-ABC-' + 8 random chars + '-' + '%Y%m%d' + '-' + 8 random chars + '.txt'
Now, I need to open a file, given the date. The point is that I don't know the exact name of the file, as there are some random chars within it. For instance, for the date 12/05/2014 I know the filename will be something like
XYZ-ABC-????????-20140512-????????.txt
but I don't know the exact name to pass to open. What would be the best way to do this? (I thought about first creating a list of all the filenames, but I don't know whether that's a good technique or whether it's better to use something like glob.) Thank you in advance.
You can use the following code, replacing "prefix" and 'otherstring' with the parts of the name that you know:
import os
fileName = [filename for filename in os.listdir('.') if filename.startswith("prefix") and 'otherstring' in filename]
Hope this helps!
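Since the question already writes the unknown characters as "?", the glob module is a natural fit: in a glob pattern each "?" matches exactly one character. A minimal sketch, assuming the naming scheme shown in the question; find_by_date is a hypothetical helper name.

```python
import glob
import os
from datetime import date


def find_by_date(directory, day):
    """Return the files in `directory` whose name embeds `day` as YYYYMMDD."""
    # Each "?" stands for one of the 8 random characters on either side.
    pattern = "XYZ-ABC-????????-{:%Y%m%d}-????????.txt".format(day)
    # glob.escape keeps special characters in the directory name literal.
    return sorted(glob.glob(os.path.join(glob.escape(directory), pattern)))
```

For example, find_by_date(".", date(2014, 5, 12)) returns a list of matching paths, which is empty when no file exists for that date, so the caller should check it before calling open on the first entry.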

Python, Opening files in loop (dicom)

I am currently reading in 200 DICOM images manually using the code:
ds1 = dicom.read_file('1.dcm')
So far this has worked, but I am trying to make my code shorter and easier to use by creating a loop to read in the files:
for filename in os.listdir(dirName):
    dicom_file = os.path.join("/",dirName,filename)
    exists = os.path.isfile(dicom_file)
    print(filename)
    ds = dicom.read_file(dicom_file)
This code is not currently working and I am receiving the error:
raise InvalidDicomError("File is missing 'DICM' marker. "
dicom.errors.InvalidDicomError: File is missing 'DICM' marker. Use
force=True to force reading
Could anyone advise me on where I am going wrong, please?
I think the line:
dicom_file = os.path.join("/",dirName,filename)
might be an issue? It will join all three to form a path rooted at '/'. For example:
os.path.join("/","directory","file")
will give you "/directory/file" (an absolute path), while:
os.path.join("directory","file")
will give you "directory/file" (a relative path)
If you know that all the files you want match "*.dcm", you can try the glob module:
import glob
files_with_dcm = glob.glob("*.dcm")
This will also work with full paths:
import glob
files_with_dcm = glob.glob("/full/path/to/files/*.dcm")
Also note that os.listdir(dirName) will include everything in the directory, including other directories, dot files, and whatnot.
Your exists = os.path.isfile(dicom_file) line will filter out everything that is not a file if you put an "if exists:" before reading.
I would recommend the glob approach if you know the pattern; otherwise:
from dicom.errors import InvalidDicomError
if exists:
    try:
        ds = dicom.read_file(dicom_file)
    except InvalidDicomError as exc:
        print("something wrong with %s" % dicom_file)
If you do a try/except, the if exists: is a bit redundant, but doesn't hurt...
Try adding the check inside the loop, right after dicom_file is built:
dicom_file = os.path.join("/",dirName,filename)
if not dicom_file.endswith('.dcm'):
    continue
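The suggestions in this thread (skip non-files, keep only "*.dcm", and catch read errors) can be combined in one loop. This is a minimal sketch, not the library's own approach: read_dicom_folder is a hypothetical name, and read_file is whatever loader you use (for example dicom.read_file), passed in as an argument so the sketch does not depend on a specific DICOM package.

```python
import os


def read_dicom_folder(dir_name, read_file):
    """Read every regular "*.dcm" file in `dir_name` with `read_file`.

    Returns (datasets, skipped): a dict of loaded files keyed by name,
    and a list of (name, error message) pairs for files that failed.
    """
    datasets, skipped = {}, []
    for filename in sorted(os.listdir(dir_name)):
        path = os.path.join(dir_name, filename)
        if not (os.path.isfile(path) and filename.lower().endswith(".dcm")):
            continue  # ignore sub-directories, dot files and other types
        try:
            datasets[filename] = read_file(path)
        except Exception as exc:  # e.g. InvalidDicomError for a bad file
            skipped.append((filename, str(exc)))
    return datasets, skipped
```

Collecting the failures instead of stopping at the first one makes it easy to see afterwards which of the 200 files need a closer look.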

How to get the substring from a String in python

I have a string path='/home/user/Desktop/My_file.xlsx'.
I want to extract the "My_file" substring. I am using Django framework for python.
I have tried to get it with:
re.search('/(.+?).xlsx', path).group(1)
but it returns the whole path again.
Can someone please help.
If you know that the file extension is always the same (e.g. ".xlsx"), I would suggest doing it this way:
import os
filename_full = os.path.basename(path)
filename = filename_full.split(".xlsx")[0]
Hope it helps.
More generally:
import os
filename = os.path.basename(os.path.splitext(path)[0])
If you need to match the exact extension:
import re
# (?<=/) ensures that the match is preceded by /
# [^/]*\.xlsx matches anything but / followed by a literal .xlsx
mo1 = re.search(r'(?<=/)[^/]*\.xlsx', path).group(0)
print(mo1)
My_file.xlsx
otherwise:
path='/home/user/Desktop/My_file.xlsx'
with regex:
mo = re.search(r'(?<=/)([\w.]+$)',path)
print(mo.group(1))
My_file.xlsx
with rsplit:
my_file = path.rsplit('/')[-1]
print(my_file)
My_file.xlsx
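The regex and rsplit variants above return the file name with its extension. If only "My_file" is wanted, pathlib offers this directly. A minimal sketch: PurePosixPath is used here on the assumption that the path is always written with forward slashes, as in the question.

```python
from pathlib import PurePosixPath

path = '/home/user/Desktop/My_file.xlsx'
p = PurePosixPath(path)

name = p.stem          # file name without its last suffix
extension = p.suffix   # the last suffix, including the dot
print(name, extension)  # prints: My_file .xlsx
```

No file access is involved; the path is only parsed as a string, so this works even if the file does not exist.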
