Creating Unique Names - python

I'm creating a corpus from a repository. I download the texts from the repository as PDFs, convert them to text files, and save them. However, I'm trying to find a good way to name these files.
To get the filenames I do this: (the records generator is an object from the Sickle package that I use to get access to all the records in the repository)
import os

for record in records:
    record_data = []  # data is stored in record_data
    for name, metadata in record.metadata.items():
        for i, value in enumerate(metadata):
            if value:
                record_data.append(value)
    file_path = ''
    fulltext = ''
    for data in record_data:
        if 'Fulltext' in data:
            fulltext = data.replace('Fulltext ', '')
            file_path = '/' + os.path.basename(data) + '.txt'
    print(fulltext)
    print(file_path)
The output from the print statements on the last two lines:
https://www.duo.uio.no/bitstream/handle/10852/34910/1/Bertelsen-Master.pdf
/Bertelsen-Master.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/34912/1/thesis-output.pdf
/thesis-output.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9976/1/gartmann.pdf
/gartmann.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/34174/1/thesis-mariusno.pdf
/thesis-mariusno.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9285/1/thesis2.pdf
/thesis2.pdf.txt
https://www.duo.uio.no/bitstream/handle/10852/9360/1/OMyhre.pdf
As you can see, I add .txt to the end of the original filename and want to use that name to save the file. However, a lot of the files have the same filename, like thesis.pdf. One way I thought of solving this was to add a few random numbers to the name, or to have a number that gets incremented for each record and use that, like this: thesis.pdf.124.txt (adding 124 to the name).
But that does not look very good, and the repository is huge, so in the end I would have quite large numbers appended to each filename. Any smart suggestions on how I can solve this?
I have seen suggestions like using the time module. I was thinking maybe I could use a regex or another technique to extract part of the name (so every name is equally long) and then create a method that adds a string to each file based on the URL of the file, which should be unique.

One thing you could do is compute a unique hash of the files, e.g. with MD5 or SHA-1 (or any other), cf. this article. For a large number of files this can become quite slow, though.
But you don't really seem to touch the files in this piece of code. For generating a unique id, you could use uuid and put it somewhere in the name.
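As a rough sketch of both ideas (the fulltext URL comes from the question; the helper names and the 12-character truncation are illustrative assumptions), you could hash the URL itself, which you already expect to be unique, or fall back to a random UUID:

import hashlib
import os
import uuid

def unique_txt_name(url):
    # Derive a stable, fixed-length suffix from the (unique) URL.
    suffix = hashlib.md5(url.encode('utf-8')).hexdigest()[:12]
    return '{}.{}.txt'.format(os.path.basename(url), suffix)

def random_txt_name(url):
    # Alternative: a random UUID, unique but not reproducible across runs.
    return '{}.{}.txt'.format(os.path.basename(url), uuid.uuid4().hex)

print(unique_txt_name('https://www.duo.uio.no/bitstream/handle/10852/9285/1/thesis2.pdf'))
# e.g. thesis2.pdf.5f1d7a3c9b2e.txt

The hash-based variant has the advantage that re-running the script produces the same filename for the same URL.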

Related

Python Dictionary: Multi keywords key with single identical value

I am trying to create a dictionary as follows:
key( file_size checksum ) = filename.
I want the two keywords together to make up a key, taking both values into account. These keys are derived from the actual file in question. If the key is matched, I have a duplicate file, not just a duplicate file name.
It would be easy to determine duplicates if there were a single key:filename, but not all files will have the same filename, either by path or by actual filename. So far, no Python website has been able to supply an answer; one did have this format, but I haven't found it again.
I have tried various combinations of brackets and commas with little effect.
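The syntax being asked about is simply a tuple used as a dictionary key; a minimal sketch (the size and checksum values here are made up):

found = {}
found[(1024, 'd41d8cd98f00b204e9800998ecf8427e')] = '/path/a/report.pdf'

# Looking up another file with the same size and checksum flags a duplicate.
key = (1024, 'd41d8cd98f00b204e9800998ecf8427e')
if key in found:
    print('duplicate of', found[key])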
A simple example that finds duplicate files in subdirs using a dictionary like you suggest:
from pathlib import Path
import hashlib

found = {}
for fn in Path('.').glob('**/*'):
    if fn.is_file():
        with open(fn, 'rb') as f:
            key = (fn.stat().st_size, hashlib.md5(f.read()).hexdigest())
        if key in found:
            print(f'File {fn} is a duplicate of {found[key]}')
        else:
            found[key] = fn
Specifically, key is a tuple whose first element is the file size (an integer) and whose second is the MD5 hex digest of the contents (i.e. a sort of checksum, as you suggest).
Note that this bit of code reads every file it encounters in its entirety, so I suggest pointing it at a small or moderate file collection. Computing a full checksum or hash for each file isn't a very fast way to find duplicates.
You should consider looking at more basic attributes first (size, hash for the first few bytes, etc.) and only doing a full hash if you have candidates for being duplicates.
Also, the odds of two files having the same MD5 hash but somehow different sizes are astronomically small, so adding the file size to the key here is more or less pointless if you hash the whole file.
(Edit) This is a somewhat better way to achieve the same result, and a lot faster:
from pathlib import Path
import hashlib
import filecmp

found = {}
for fn in Path('.').glob('**/*'):
    if fn.is_file():
        with open(fn, 'rb') as f:
            key = (fn.stat().st_size, hashlib.md5(f.read(1024)).hexdigest())
        if key in found:
            if filecmp.cmp(fn, found[key], shallow=False):
                print(f'File {fn} is a duplicate of {found[key]}')
        else:
            found[key] = fn

Compare one directory at time 1 to same directory at time 2

My goal : compare the content of one directory (including sub-directories and files) at time 1 to the content of the same directory at time 2 (e.g. 6 months later). "Content" means : number and names of the subdirectories + number and names and size of files. The main intended outcome is : being sure that no files were destroyed or corrupted in the mean time.
I did not find any existing tool, although I was wondering whether https://github.com/njanakiev/folderstats folderstats could help.
Would you have any suggestion of modules or anything to start well? If you heard about an existing tool for this, I would also be interested.
Thanks.
Here's some code that should help get you started. It defines a function that builds a data structure of nested dictionaries corresponding to the contents of the starting root directory and everything below it in the filesystem. Each item dictionary whose 'type' key has the value 'file' will also have a 'stat' key that can contain whatever file metadata you want or need, such as time of creation, last modification time, length in bytes, etc.
You can use it to obtain "before" and "after" snapshots of the directory you're tracking and use them for comparison purposes. I've left the latter (the comparing) out, since I'm not sure exactly what you're interested in.
Note that when I actually went about implementing this, I found it simpler to write a recursive function than to use os.walk(), as I suggested in a comment.
The following implements a version of the function and prints out the data structure of nested dictionaries it returns.
import os
from pathlib import PurePath

def path_to_dict(path):
    result = {}
    result['full_path'] = PurePath(path).as_posix()
    if os.path.isdir(path):
        result['type'] = 'dir'
        result['items'] = {filename: path_to_dict(os.path.join(path, filename))
                           for filename in os.listdir(path)}
    else:
        result['type'] = 'file'
        result['stat'] = os.stat(path)  # Preserve any needed metadata.
    return result

root = './folder'  # Change as desired.
before = path_to_dict(root)

# Pretty-print the data structure created.
from pprint import pprint
pprint(before, sort_dicts=False)
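The comparison itself is left out above; as a rough sketch of one way it could look (assuming, as in the code above, that 'stat' holds the os.stat() result, and that "missing, added, or changed size" is what you care about):

def compare_snapshots(before, after):
    # Walk two snapshots produced by path_to_dict() and report differences.
    if before['type'] != after['type']:
        print(f"{before['full_path']}: type changed")
    elif before['type'] == 'file':
        if before['stat'].st_size != after['stat'].st_size:
            print(f"{before['full_path']}: size changed")
    else:
        for name in before['items'].keys() - after['items'].keys():
            print(f"{before['items'][name]['full_path']}: missing in later snapshot")
        for name in after['items'].keys() - before['items'].keys():
            print(f"{after['items'][name]['full_path']}: new in later snapshot")
        for name in before['items'].keys() & after['items'].keys():
            compare_snapshots(before['items'][name], after['items'][name])

# e.g. compare_snapshots(before, after), with snapshots taken 6 months apart.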

Looping through files using lists

I have a folder (pseudo directory /usr/folder/) of files that look like this:
target_07750_20181128.tsv.gz
target_07750_20181129.tsv.gz
target_07751_20181130.tsv.gz
target_07751_20181203.tsv.gz
target_07751_20181204.tsv.gz
target_27103_20181128.tsv.gz
target_27103_20181129.tsv.gz
target_27103_20181130.tsv.gz
I am trying to join the above tsv files to one xlsx file on store code (found in the file names above).
I am reading in, say, file.xlsx as a pandas dataframe.
I have extracted store codes from file.xlsx so I have the following:
stores = instore.store_code.astype(str).unique()
output:
07750
07751
27103
So my end goal is to loop through each store in stores and find which filenames it corresponds to in the directory. Here is what I have so far, but I can't seem to get the proper filenames to print:
import os

for store in stores:
    print(store)
    if store in os.listdir('/usr/folder/'):
        print(os.listdir('/usr/folder/'))
The output I'm expecting to see for, say, store_code '07750' in the loop would be:
07750
target_07750_20181128.tsv.gz
target_07750_20181129.tsv.gz
Instead I'm only seeing the store codes returned:
07750
07751
27103
What am I doing wrong here?
The reason your if statement fails is that it checks whether "07750" etc. is one of the filenames in the directory, which it is not. What you want is to see if "07750" is contained in one of the filenames.
I'd go about it like this:
from collections import defaultdict

store_files = defaultdict(list)
for filename in os.listdir('/usr/folder/'):
    store_number = <some string magic to extract the store number; you figure it out>
    store_files[store_number].append(filename)
Now store_files will be a dictionary with a list of filenames for each store number.
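One possible way to fill in that placeholder, assuming the filenames really do follow the target_<store>_<date>.tsv.gz pattern shown in the question, is to split on underscores:

# Assumes names like "target_07750_20181128.tsv.gz" (second field is the store).
store_number = filename.split('_')[1]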
The problem is that you're assuming a substring search -- that's not how in works on a list. For instance, on the first iteration, your if looks like this:
if "07750" in ["target_07750_20181128.tsv.gz",
"target_07750_20181129.tsv.gz",
"target_07751_20181130.tsv.gz",
... ]:
The string "07755" is not an element of that list. It does appear as a substring, but in doesn't work that way on a list. Instead, try this:
for filename in os.listdir('/usr/folder/'):
    if '_' + store + '_' in filename:
        print(filename)
Does that help?

Searching a directory for two different file formats

I have the following method:
def file_match(self, fundCodes):
    # Get a list of the files
    files = os.listdir(self.unmappedDir)
    # Loop through all the files and search for a matching file
    for check_fund in fundCodes:
        # Format of file to look for
        file_match = 'unmapped_{fund}_{start}_{end}.csv'.format(
            fund=check_fund, start=self.startDate, end=self.endDate)
        # Look in the unmapped dir and see if there's a file with that name
        if file_match in files:
            # If there's a match, load unmapped positions as etl
            return self.read_file(file_match)
The method looks for files matching this type of format:
unmapped_A-AGEI_2018-07-01_2018-07-09.csv or
unmapped_PWMA_2018-07-01_2018-07-09.csv
NOTE: The fundCodes argument would be an array of "fundCodes"
Now, I want it to be able to look for another type of format, which would be the following:
citco_unmapped_trades_2018-07-01_2018-07-09
I'm having a little trouble figuring out how to rewrite the function so it can look for two possible formats and, if it finds one, move on to the self.read_file(file_match) method. (If it finds both, I might have to do some error handling.) Any suggestions?
There are various approaches that could be used; which is best depends, in particular, on your possible further enhancements. The easiest and most straightforward way is to make a list of allowed filenames and check them one by one:
file_matches = [
    'unmapped_{fund}_{start}_{end}.csv'.format(fund=check_fund, start=self.startDate, end=self.endDate),
    'citco_unmapped_{fund}_{start}_{end}.csv'.format(fund=check_fund, start=self.startDate, end=self.endDate)
]
# look in the unmappeddir and see if there's a file with that name
for file_match in file_matches:
    if file_match in files:
        # if there's a match, load unmapped positions as etl
        return self.read_file(file_match)
I came across this when I was looking for answers about something else. Keep in mind I wrote it in a couple of minutes, so I'm sure it can be improved, but it should work. You should be able to copy and paste it and run it; just create the files, or drop the script in the same directory as the files. Feel free to modify it the way you want; you will only have to adapt it so it runs correctly in your program. If you need me to elaborate, please comment below.
import os

def file_search(formats, fund_codes):
    files = os.listdir()
    for fund in fund_codes:
        for fmt in formats:
            file_match = fmt.format(fund=fund[0], start=fund[1], end=fund[2])
            if file_match in files:
                print(file_match)

formats = ['unmapped_{fund}_{start}_{end}.csv', 'citco_unmapped_{fund}_{start}_{end}.csv']
fund_codes = [['PWMA', '2018-07-01', '2018-07-09'], ['A-AGEI', '2018-07-01', '2018-07-09'], ['trades', '2018-07-01', '2018-07-09']]
file_search(formats, fund_codes)

return value of fnmatch query python

All,
I have a bunch of files and I want to extract the files of the form
_10_C.xlsx, _23_C.xlsx, and so on.
I am using the following snippet to store these files
import fnmatch

Pattern_Meas = '*_*_C*.xlsx'
for name in LIV_files:
    if fnmatch.fnmatchcase(name, Pattern_Meas):
        fname_Temp_LI_list.append(name)
    else:
        fname_Temp_VI_list.append(name)
I was wondering if there is a way to extract the value of "*" in *_C, so as to get output like 10, 23, etc. One way may be using rsplit after I get the files, but I was wondering if there is something more efficient, like doing it while matching the filename.
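One rough sketch of the "doing it in the same pass" idea is a regular expression that both matches the pattern and captures the number (the variable names mirror the snippet above; the exact pattern is an assumption):

import re

# Matches e.g. "something_10_C.xlsx" and captures the 10.
pattern = re.compile(r'_(\d+)_C.*\.xlsx$')

temperatures = []
for name in LIV_files:
    m = pattern.search(name)
    if m:
        fname_Temp_LI_list.append(name)
        temperatures.append(int(m.group(1)))  # e.g. 10, 23, ...
    else:
        fname_Temp_VI_list.append(name)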
