Accommodating variable filename for python script - python

I have to read multiple files that serve as input for my Python script, but the file names vary depending on when each file was generated.
File1: RM_Sales_Japan_2011201920191124194200.xlsx
File2: RM_Volume_Australia_201120192019154321194200.xlsx
How can I accommodate these changes when reading a file, instead of specifying the exact filename every time we run the script?
Things I tried:
I used the method below in my previous scripts, because they had only one file with a known extension:
xlsxfile = "*.xlsx"
filelocation = "/user/script/" + xlsxfile
But with multiple files sharing the same extension, I am not sure how to define the paths.
EDIT1:
I was trying to get more clarity on using glob with read_excel. Please see my example code below:
import os
import glob
import pandas as pd
os.chdir ('D:\\Users\\RMoharir\\Downloads\\Smart Spend\\Input')
fls=glob.glob("Medical*.*")
df1 = pd.read_excel(fls, parse_cols = 'A:H', skiprows = 10, header = None)
But this gives me an error:
ValueError: Invalid file path or buffer object type: <class 'list'>
Any help is appreciated.

If you simply need to find all the files that match a given pattern in a directory, the os and re modules have you covered.
import os
import re
files = os.listdir()
for file in files:
    if re.match(r".*\.xlsx$", file):
        print(file)
This short program prints every file in the current directory whose name ends with .xlsx. If you need to match a more complicated pattern, you may need to read up on regular expressions.
Note that os.listdir takes an optional string argument giving the path to look in; if it is not given, it looks in the current working directory.
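For the two report files in the question, a glob pattern built from the fixed part of each name may be enough. Here is a minimal sketch of that idea; the helper name find_report and the rule that exactly one file must match are assumptions made for illustration, not something from the question.

```python
import glob
import os


def find_report(directory, prefix):
    """Return the one .xlsx file in `directory` whose name starts with `prefix`.

    `find_report` is a hypothetical helper. Requiring exactly one match is an
    assumption; relax it if several files per prefix are expected.
    """
    pattern = os.path.join(directory, prefix + "*.xlsx")
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise ValueError(f"expected 1 file matching {pattern!r}, found {len(matches)}")
    return matches[0]
```

The returned string can then be handed to pandas, for example pd.read_excel(find_report("/user/script", "RM_Sales_Japan_")). This also explains the ValueError in EDIT1: glob.glob returns a list, while read_excel expects a single path, so either pick one element or loop over the list.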

Related

python: set file path to only point to files with a specific ending

I am trying to run a program with requires pVCF files alone as inputs. Due to the size of the data, I am unable to create a separate directory containing the particular files that I need.
The directory contains multiple files with 'vcf.gz.tbi' and 'vcf.gz' endings. Using the following code:
file_url = "file:///mnt/projects/samples/vcf_format/*.vcf.gz"
I tried to create a file path that only grabs the '.vcf.gz' files while excluding the '.vcf.gz.tbi' ones, but I have been unsuccessful.
The code you have, as written, just assigns your file path to the variable file_url. For something like this, glob is popular but isn't the only option:
import glob, os
# os.chdir needs a plain filesystem path, not a file:// URL
vcf_dir = "/mnt/projects/samples/vcf_format/"
os.chdir(vcf_dir)
for file in glob.glob("*.vcf.gz"):
    print(file)
Note that the file path doesn't contain the kind of file you want (in this case, a gzipped VCF); the glob pattern in the for loop does that.
Check out this answer for more options.
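To see why the "*.vcf.gz" pattern leaves out the index files, note that glob uses the same shell-style matching as the fnmatch module, and the pattern has to match the whole name, so a name ending in ".tbi" does not qualify. A small sketch with made-up file names:

```python
import fnmatch

# Made-up names standing in for the directory listing
names = ["a.vcf.gz", "a.vcf.gz.tbi", "b.vcf.gz", "b.vcf.gz.tbi"]

# The pattern must match the entire name, so the .tbi files are skipped
vcf_only = fnmatch.filter(names, "*.vcf.gz")
print(vcf_only)  # ['a.vcf.gz', 'b.vcf.gz']
```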
It took some digging but it looks like you're trying to use the import_vcf function of Hail. To put the files in a list so that it can be passed as input:
import glob, os
import hail as hl
# glob needs a plain filesystem path; the file:// prefix is added back for Hail
vcf_dir = "/mnt/projects/samples/vcf_format"
def get_vcf_list(path):
    vcf_list = []
    for file in glob.glob(os.path.join(path, "*.vcf.gz")):
        vcf_list.append("file://" + file)
    return vcf_list
# Now you pass 'get_vcf_list(vcf_dir)' as your input instead of 'file_url'
mt = hl.import_vcf(get_vcf_list(vcf_dir), force_bgz=True, reference_genome="GRCh38", array_elements_required=False)

How to handle wildcard in a raw string in python

I have a script I run daily to compile a bunch of spreadsheets into one. After a year of running, one of the filenames changed because the file was produced 14 seconds later. I read the filename in like this:
uproduction = Path(r"\\server\folder\P"+year+month+day+r"235900.xls")
and then df = pd.read_excel(uproduction)
This was working fine until the file name changed to P20210225235914.xls. When I am using a raw string like that, is there a way I can make it pick any file that matches P20210225*.xls? I can't seem to find exactly what I'm looking for in the docs.
You can use glob:
from glob import glob
glob(r"\\server\folder\P"+year+month+day+"*.xls")
You can use the glob method on the Path:
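from pathlib import Path
# A raw string literal cannot end in a backslash, so the trailing one is dropped
for file in Path(r'\\server\folder').glob('P20210225*.xls'):
    print(file.name)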
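Building on the Path.glob approach, the date part of the pattern can be generated instead of hard-coded. The following is only a sketch: find_daily_file is a hypothetical helper, and returning the last match in sorted order when several files share a date is an assumption.

```python
from datetime import date
from pathlib import Path


def find_daily_file(folder, day):
    """Return the .xls file for `day` (named PYYYYMMDD<time>.xls), or None.

    Hypothetical helper: if several files match, the last one in sorted
    order (the latest timestamp in the name) is returned.
    """
    pattern = f"P{day:%Y%m%d}*.xls"
    matches = sorted(Path(folder).glob(pattern))
    return matches[-1] if matches else None
```

With this, something like pd.read_excel(find_daily_file(r"\\server\folder", date.today())) would keep working when the time suffix drifts, provided the None case is handled first.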

How do I dynamically select a csv from the string of its name?

I am looking to pull in a csv file that is downloaded to my downloads folder into a pandas dataframe. Each time it is downloaded it adds a number to the end of the string, as the filename is already in the folder. For example, 'transactions (44).csv' is in the folder, the next time this file is downloaded it is named 'transactions (45).csv'.
I've looked into the glob library and into using the os library to open the most recent file in my downloads folder, but I was unable to produce a solution. I'm thinking I need some way to connect to the downloads path, find all csv files with the string 'transactions' in the name, and grab the one with the max number in the full filename string.
list(csv.reader(open(path + '/transactions (45).csv')))
I'm hoping for something like this: path + '/%transactions%' + 'max()' + '.csv'. I know the final answer will be completely different, but I hope this makes sense.
Assuming format "transactions (number).csv", try below:
import os
import numpy as np
import pandas as pd
files = os.listdir('Downloads/')
tranfiles = [f for f in files if 'transactions' in f]
Now, your target file is as below:
target_file=tranfiles[np.argmax([int(t.split('(')[1].split(')')[0]) for t in tranfiles])]
Read that desired file as below:
df=pd.read_csv('Downloads/'+target_file)
One option is to use regular expressions to extract the numerically largest file ID and then construct a new file name:
import re
import glob
# The glob pattern skips names without a number, such as a plain transactions.csv
last_id = max(int(re.findall(r" \(([0-9]+)\)\.csv$", x)[0])
              for x in glob.glob("transactions (*).csv"))
name = f'transactions ({last_id}).csv'
Alternatively, find the newest file directly by its modification time.
Note that you should not use a CSV reader to read CSV files in Pandas. Use pd.read_csv() instead.
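The modification-time alternative mentioned above could look roughly like this. It is a sketch, assuming the newest download is also the most recently modified file; newest_file is a hypothetical helper name.

```python
import glob
import os


def newest_file(pattern):
    """Return the most recently modified path matching `pattern`, or None.

    Hypothetical helper; assumes modification time reflects download order.
    """
    candidates = glob.glob(pattern)
    return max(candidates, key=os.path.getmtime) if candidates else None
```

For the Downloads folder in the question, the call would be something like newest_file("Downloads/transactions*.csv"), and the result can go straight into pd.read_csv.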

How can I read files with similar names on python, rename them and then work with them?

I've already posted this question here, but sadly I couldn't come up with a solution (some of you gave me awesome answers, but most of them weren't what I was looking for), so I'll try again, this time giving more information about what I'm trying to do.
So, I'm using a program called GMAT to get some outputs (.txt files with numerical values). These outputs have different names, but because I'm using them for more than one thing, I'm getting something like this:
GMATd_1.txt
GMATd_2.txt
GMATf_1.txt
GMATf_2.txt
Now, what I need to do is to use these outputs as inputs in my code. I need to work with them in other functions of my script, and since I will have a lot of these .txt files I want to rename them as I don't want to use them like './path/etc'.
So what I wanted was to write a loop that could get these files and rename them inside the script so I can use these files with the new name in other functions (outside the loop).
So instead of having to do this individually:
GMATds1= './path/GMATd_1.txt'
GMATds2= './path/GMATd_2.txt'
I wanted to write a loop that would do that for me.
I've already tried using a dictionary:
import os
import fnmatch
examples = {}
for filename in os.listdir('.'):
    if fnmatch.fnmatch(filename, 'thing*.txt'):
        examples[filename[:6]] = filename
This does work, but I can't use the dictionary key outside the loop.
If I understand correctly, you try to fetch files with similar names (at least a re-occurring pattern) and rename them. This can be accomplished with the following code:
import glob
import os
all_files = glob.glob('path/to/directory/with/files/GMAT*.txt')
for file in all_files:
    new_path = create_new_path(file)  # possibly split the file name, change directory and/or filename
    os.rename(file, new_path)
The glob library supports * wildcards, which makes it possible to search for files matching a specific pattern. It lists all the matching files in a certain directory (or in multiple directories if you use a * wildcard for a directory). When you iterate over the files, you can either work directly with their contents (as you apparently intend to do) or rename them as shown in this snippet. To rename them, you need to generate a new path, so you would have to write the create_new_path function yourself: it takes the old path and returns the new one.
Since Python 3.4, the built-in pathlib package can be used for this instead of os or glob.
from pathlib import Path
import shutil
for file_src in Path("path/to/files").glob("GMAT*.txt"):
    # Adjust the replacement to the renaming you need
    file_dest = str(file_src.resolve()).replace("ds", "d_")
    shutil.move(str(file_src), file_dest)
You can use:
import os
path = '.....'   # path where these files are located
path1 = '.....'  # path where you want to store these files
i = 1
for file in os.listdir(path):
    if file.endswith('.txt'):  # str.endswith takes no keyword arguments
        os.rename(os.path.join(path, file), os.path.join(path1, str(i) + ".txt"))
        i += 1
It will rename all the .txt files in the source folder to 1.txt, 2.txt, ..., n.txt.
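If the goal is only to refer to the files by short names inside the script, renaming on disk may not be needed at all: a dictionary built once by a function can be used anywhere it is in scope. A minimal sketch, where index_outputs is a hypothetical helper and using the file stem as the key is an assumption:

```python
from pathlib import Path


def index_outputs(directory, pattern="GMAT*.txt"):
    """Map each matching file's stem (e.g. 'GMATd_1') to its Path.

    Hypothetical helper; the stem-as-key convention is an assumption.
    """
    return {p.stem: p for p in sorted(Path(directory).glob(pattern))}
```

After outputs = index_outputs("./path"), any other function can receive outputs (or a single entry such as outputs["GMATd_1"]) as an argument, which avoids one variable per file.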

Make glob directory variable

I'm trying to write a Python script that searches a folder for all files with the .txt extension. In the manuals, I have only seen it hardcoded into glob.glob("hardcoded path").
How do I make the directory that glob searches for patterns a variable? Specifically: A user input.
This is what I tried:
import glob
input_directory = input("Please specify input folder: ")
txt_files = glob.glob(input_directory+"*.txt")
print(txt_files)
Despite giving the right directory with the .txt files, the script prints an empty list [ ].
If you are not sure whether a path contains a separator symbol at the end (usually '/' or '\'), you can concatenate using os.path.join. This is a much more portable method than appending your local OS's path separator manually, and much shorter than writing a conditional to determine if you need to every time:
import glob
import os
input_directory = input('Please specify input folder: ')
txt_files = glob.glob(os.path.join(input_directory, '*.txt'))
print(txt_files)
For Python 3.4+, you can use pathlib.Path.glob() for this:
import pathlib
input_directory = pathlib.Path(input('Please specify input folder: '))
if not input_directory.is_dir():
    # Input is invalid. Bail or ask for a new input.
    raise SystemExit('Not a directory: {}'.format(input_directory))
for file in input_directory.glob('*.txt'):
    # Do something with file.
    print(file)
There is a time-of-check to time-of-use race between the is_dir() call and the glob, which unfortunately cannot be easily avoided because glob() just returns an empty iterator when the directory is missing. On Windows, it may not even be possible to avoid because you cannot open directories to get a file descriptor. This is probably fine in most cases, but could be a problem if your application has a different set of privileges from the end user or from other applications with write access to the parent directory. This problem also applies to any solution using glob.glob(), which has the same behavior.
Finally, Path.glob() returns an iterator, and not a list. So you need to loop over it as shown, or pass it to list() to materialize it.
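Putting the directory check and the list() step together might look like the sketch below. The function name list_txt_files and the choice to raise NotADirectoryError are assumptions for illustration, and the race described above still applies.

```python
from pathlib import Path


def list_txt_files(directory):
    """Return a sorted list of .txt files directly inside `directory`.

    Hypothetical helper; raises NotADirectoryError for invalid input
    instead of silently returning an empty list.
    """
    folder = Path(directory)
    if not folder.is_dir():
        raise NotADirectoryError(f"not a directory: {folder}")
    # sorted() consumes the iterator from glob() and returns a real list
    return sorted(folder.glob("*.txt"))
```

A caller could wrap list_txt_files(input("Please specify input folder: ")) in a try/except to ask for the folder again when the error is raised.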
