Return value of fnmatch query in Python

All,
I have a bunch of files and I want to extract the ones of the form
_10_C.xlsx, _23_C.xlsx, and so on.
I am using the following snippet to store these files:

import fnmatch

Pattern_Meas = '*_*_C*.xlsx'  # no trailing comma; a comma here would make this a tuple
for name in LIV_files:
    if fnmatch.fnmatchcase(name, Pattern_Meas):
        fname_Temp_LI_list.append(name)
    else:
        fname_Temp_VI_list.append(name)
I was wondering if there is a way to extract the value matched by the "*" in *_C, so that the output is 10, 23, etc. One way might be to use rsplit after collecting the files, but I was wondering if there is something more efficient, like capturing the value while matching the filename.
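fnmatch only reports whether a name matches; it cannot capture what the "*" matched. A regular expression can filter and capture in one pass. A minimal sketch, reusing the question's list names, and assuming the number you want always sits between the two underscores:

import re

pattern = re.compile(r'.*_(\d+)_C.*\.xlsx$')

temperatures = []
for name in LIV_files:
    m = pattern.match(name)
    if m:
        fname_Temp_LI_list.append(name)
        temperatures.append(int(m.group(1)))  # e.g. 10, 23
    else:
        fname_Temp_VI_list.append(name)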

Related

Looping through files using lists

I have a folder (pseudo directory /usr/folder/) with files that look like this:
target_07750_20181128.tsv.gz
target_07750_20181129.tsv.gz
target_07751_20181130.tsv.gz
target_07751_20181203.tsv.gz
target_07751_20181204.tsv.gz
target_27103_20181128.tsv.gz
target_27103_20181129.tsv.gz
target_27103_20181130.tsv.gz
I am trying to join the above tsv files into one xlsx file on store code (found in the file names above).
I am reading in file.xlsx as a pandas dataframe.
I have extracted the store codes from file.xlsx, so I have the following:
stores = instore.store_code.astype(str).unique()
output:
07750
07751
27103
So my end goal is to loop through each store in stores and find which filenames in the directory correspond to it. Here is what I have so far, but I can't seem to get the proper filename to print:
import os

for store in stores:
    print(store)
    if store in os.listdir('/usr/folder/'):
        print(os.listdir('/usr/folder/'))
The output I'm expecting to see for, say, store_code '07750' in the loop would be:
07750
target_07750_20181128.tsv.gz
target_07750_20181129.tsv.gz
Instead I'm only seeing the store codes returned:
07750
07751
27103
What am I doing wrong here?
The reason your if statement fails is that it checks if "07750" etc is one of the filenames in the directory, which it is not. What you want is to see if "07750" is contained in one of the filenames.
I'd go about it like this:
import os
from collections import defaultdict

store_files = defaultdict(list)
for filename in os.listdir('/usr/folder/'):
    # e.g. 'target_07750_20181128.tsv.gz'.split('_')[1] -> '07750'
    store_number = filename.split('_')[1]
    store_files[store_number].append(filename)
Now store_files will be a dictionary with a list of filenames for each store number.
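A possible lookup afterwards, assuming stores holds the codes extracted from file.xlsx as in the question:

for store in stores:
    for filename in store_files.get(store, []):  # [] when a code has no files
        print(store, filename)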
The problem is that you're assuming a substring search -- that's not how in works on a list. For instance, on the first iteration, your if looks like this:
if "07750" in ["target_07750_20181128.tsv.gz",
"target_07750_20181129.tsv.gz",
"target_07751_20181130.tsv.gz",
... ]:
The string "07755" is not an element of that list. It does appear as a substring, but in doesn't work that way on a list. Instead, try this:
for filename in os.listdir('/usr/folder/'):
    if '_' + store + '_' in filename:
        print(filename)
Does that help?

Searching a directory for two different file formats

I have the following method:
def file_match(self, fundCodes):
    # Get a list of the files
    files = os.listdir(self.unmappedDir)
    # loop through all the files and search for a matching file
    for check_fund in fundCodes:
        # Format of file to look for
        file_match = 'unmapped_{fund}_{start}_{end}.csv'.format(fund=check_fund, start=self.startDate, end=self.endDate)
        # look in the unmappedDir and see if there's a file with that name
        if file_match in files:
            # if there's a match, load unmapped positions as etl
            return self.read_file(file_match)
The method searches for files that match this type of format:
unmapped_A-AGEI_2018-07-01_2018-07-09.csv or
unmapped_PWMA_2018-07-01_2018-07-09.csv
NOTE: The fundCodes argument would be an array of "fundCodes"
Now, I want it to be able to look for another type of format, which would be the following:
citco_unmapped_trades_2018-07-01_2018-07-09. I'm having a little trouble figuring out how to rewrite the function so it can look for two possible formats and, if it finds one, move on to the self.read_file(file_match) method. (If it finds both, I might have to do some error handling.) Any suggestions?
There are many approaches that can be used to do this; the right one depends, in particular, on your possible further enhancements. The easiest and most straightforward way is to make a list of allowed filenames and check them one by one:
file_matches = [
    'unmapped_{fund}_{start}_{end}.csv'.format(fund=check_fund, start=self.startDate, end=self.endDate),
    'citco_unmapped_{fund}_{start}_{end}.csv'.format(fund=check_fund, start=self.startDate, end=self.endDate)
]
# look in the unmappedDir and see if there's a file with that name
for file_match in file_matches:
    if file_match in files:
        # if there's a match, load unmapped positions as etl
        return self.read_file(file_match)
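A hedged variant for the "if it finds both" case the question raises: collect all matches first and treat more than one as an error. Raising ValueError is an illustrative choice, not part of the original answer:

matches = [m for m in file_matches if m in files]
if len(matches) > 1:
    # both formats are present; decide how to handle the ambiguity
    raise ValueError('Ambiguous match: {0}'.format(matches))
if matches:
    return self.read_file(matches[0])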
I came across this question while looking for answers about something else. Keep in mind I wrote this in a couple of minutes, so I am sure it can be improved. You should be able to copy and paste it and run it; just create the files, or drop the script into the same directory as the files. Feel free to modify it the way you want. It may not be the best solution, but it should work; you will just have to adapt it so that it runs correctly in your program. If you need me to elaborate, please comment below.
import os

def file_search(formats, fund_codes):
    files = os.listdir()
    for fund in fund_codes:
        for fmt in formats:
            file_match = fmt.format(fund=fund[0], start=fund[1], end=fund[2])
            if file_match in files:
                print(file_match)

formats = ['unmapped_{fund}_{start}_{end}.csv', 'citco_unmapped_{fund}_{start}_{end}.csv']
fund_codes = [['PWMA', '2018-07-01', '2018-07-09'], ['A-AGEI', '2018-07-01', '2018-07-09'], ['trades', '2018-07-01', '2018-07-09']]
file_search(formats, fund_codes)

Attempting to read data from multiple files to multiple arrays

I would like to be able to read data from multiple files in one folder to multiple arrays and then perform analysis on these arrays such as plot graphs etc. I am currently having trouble reading the data from these files into multiple arrays.
My solution process so far is as follows:

import numpy as np
import os

# Create an empty list to read filenames into
filenames = []
for file in os.listdir('C:\\folderwherefileslive'):
    filenames.append(file)

This works so far; what I'd like to do next is to iterate over the filenames in the list using numpy.genfromtxt.
I'm trying to use os.path.join to put the individual list entry at the end of the path specified in listdir earlier. This is some example code:
for i in filenames:
    file_name = os.path.join('C:\\entryfromabove', 'i')
    'data_'+[i] = np.genfromtxt('file_name', skiprows=2, delimiter=',')
This piece of code returns "Invalid syntax".
To sum up the solution process I'm trying to use so far:
1. Use os.listdir to get all the filenames in the folder I'm looking at.
2. Use os.path.join to direct np.genfromtxt to open and read data from each file to a numpy array named after that file.
I'm not experienced with python by any means - any tips or questions on what I'm trying to achieve are welcome.
For this kind of task you'd want to use a dictionary.
data = {}
for file in os.listdir('C:\\folderwherefileslive'):
    path = os.path.join('C:\\folderwherefileslive', file)
    # note: np.genfromtxt's header-skipping parameter is skip_header
    data[file] = np.genfromtxt(path, skip_header=2, delimiter=',')

# now you could for example access
data['foo.txt']
Notice that everything you put within single or double quotes ends up being a character string, so 'file_name' is just those characters, whereas file_name (no quotes) uses the value stored in the variable of that name.
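A two-line illustration of that distinction:

file_name = 'data.csv'
print('file_name')  # prints the literal text: file_name
print(file_name)    # prints the variable's value: data.csv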

Creating Unique Names

I'm creating a corpus from a repository. I download the texts from the repository as PDFs, convert them to text files, and save them. However, I'm trying to find a good way to name these files.
To get the filenames I do this (the records generator is an object from the Sickle package that I use to access all the records in the repository):
for record in records:
    record_data = []  # data is stored in record_data
    for name, metadata in record.metadata.items():
        for i, value in enumerate(metadata):
            if value:
                record_data.append(value)

    file_path = ''
    fulltext = ''
    for data in record_data:
        if 'Fulltext' in data:
            fulltext = data.replace('Fulltext ', '')
            file_path = '/' + os.path.basename(data) + '.txt'

    print fulltext
    print file_path
The print statements on the last two lines produce:
https://www.duo.uio.no/bitstream/handle/10852/34910/1/Bertelsen-Master.pdf
/Bertelsen-Master.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/34912/1/thesis-output.pdf
/thesis-output.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9976/1/gartmann.pdf
/gartmann.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/34174/1/thesis-mariusno.pdf
/thesis-mariusno.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9285/1/thesis2.pdf
/thesis2.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9360/1/OMyhre.pdf
As you can see I add a .txt to the end of the original filename and want to use that name to save the file. However, a lot of the files have the same filename, like thesis.pdf. One way I thought about solving this was to add a few random numbers to the name, or have a number that gets incremented on each record and use that, like this: thesis.pdf.124.txt (adding 124 to the name).
But that does not look very good, and the repository is huge, so in the end I would have quite large numbers appended to each filename. Any smart suggestions on how I can solve this?
I have seen suggestions like using the time module. I was thinking maybe I could use a regex or another technique to extract part of the name (so every name is equally long) and then create a method that adds a string to each file based on the URL of the file, which should be unique.
One thing you could do is compute a unique hash of each file, e.g. with MD5 or SHA1 (or any other algorithm). For a large number of files this can become quite slow, though.
But you don't really seem to touch the files in this piece of code. For generating some unique id, you could use uuid and put that somewhere in the name.
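A minimal sketch of both ideas, assuming the PDFs are available locally for hashing; the function names and the 8-character uuid truncation are illustrative choices, not part of the original answer:

import hashlib
import os
import uuid

def content_hash_name(pdf_path):
    # Name derived from file contents; identical files get identical names.
    with open(pdf_path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return digest + '.txt'

def uuid_name(url):
    # Name derived without touching the file: original basename plus a short uuid.
    base = os.path.basename(url)  # e.g. 'thesis2.pdf'
    return '{0}.{1}.txt'.format(base, uuid.uuid4().hex[:8])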

Automatically find files that start with similar strings (and find these strings) using Python

I have a directory with a number of files in a format similar to this:
"ABC_01.dat", "ABC_02.dat", "ABC_03-08.dat", "DEF_13.dat", "DEF_14.dat", "DEF_16.dat", "GHI_09.dat", "GHI_12-14.dat"
etc., you get the idea. Essentially, what I want to do is merge all files whose names start with a similar string. At the moment, I do this by manually setting a variable names = ["ABC", "DEF", "GHI"], iterating over it (for name in names) and getting the respective filenames with glob.glob(name + "*.dat"). The merging step is later done using pandas. I don't just need the names/prefixes for finding the files; they are also used later in my script to set the output files' names.
Is there a way I can automatically generate the variable names if I know that the files are all in the format name_*.dat?
Consider this:

from glob import glob

names = set(name.rpartition('_')[0] for name in glob('*_*.dat'))

This will get all unique prefixes before the last '_'. You will also want to set a correct path in glob() before matching.
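With the example filenames above, names comes out as {'ABC', 'DEF', 'GHI'}.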
You can do this:

import glob

fileNames = glob.glob('*.dat')
result = [[x for x in fileNames if x.startswith(sn)] for sn in set(i.split('_')[0] for i in fileNames)]
print(result)
output:
[['ABC_01.dat', 'ABC_02.dat', 'ABC_03-08.dat'], ['GHI_09.dat', 'GHI_12-14.dat'], ['DEF_13.dat', 'DEF_14.dat', 'DEF_16.dat']]
Now, all files from result[0] are to be merged; similarly for result[1],...
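The question mentions that the merging step is later done with pandas; a hedged sketch of that step follows. pd.read_csv with default arguments is an assumption about the .dat layout, and the output filename is illustrative:

import glob
import pandas as pd

for prefix in names:
    files = sorted(glob.glob(prefix + '_*.dat'))
    # concatenate all files sharing the prefix into one frame
    merged = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    merged.to_csv(prefix + '_merged.dat', index=False)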
