Add a unique File ID to PDF documents received - python

I need to track PDF documents as they are received. I can keep a list of the documents in a database; however, sometimes the documents get renamed or moved, so the file path to the PDF is not always reliable.
For other document types, I sometimes add a unique ID as metadata, so that I can recognize that a file that was moved and/or renamed is the same as one seen previously.
I am looking for a solution that will work on Windows 10, and would prefer a solution based on Node.js, although Python would also be acceptable.
The documents are received from many different sources, and I do not have the option of requiring the source of the document to add a unique identifier.
I have used IPTCInfo in this way for media files, but (as far as I know) it can't be used with PDFs.
I am looking for something similar that works with PDFs.

Use an MD5 checksum of the file contents:
import hashlib

def check_md5sum(file_path):
    # Hash the file contents; identical content gives the same digest
    # even after the file is renamed or moved
    with open(file_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()
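If you would rather embed an ID in the PDF itself, the way the question describes doing for media files, here is a minimal sketch using the pypdf library (the /UniqueID key and file names are assumptions for illustration, not part of any standard):
import uuid
from pypdf import PdfReader, PdfWriter

# Stamp a custom ID into the PDF's document-information metadata;
# "/UniqueID" is an arbitrary custom key, not part of the PDF spec
reader = PdfReader("received.pdf")
writer = PdfWriter()
writer.append(reader)
writer.add_metadata({"/UniqueID": str(uuid.uuid4())})
with open("received_tagged.pdf", "wb") as f:
    writer.write(f)
Note that rewriting the file changes its bytes, and therefore its MD5 digest, so pick either the hash or the embedded ID as your canonical identifier.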

Related

Searchable database of files

I have thousands of scanned field books as PDFs. Each has a unique filename. In a spreadsheet I have metadata for each, where each row has:
index number, filename, info1, info2, info3, info4, etc.
filename is the exact file name of the PDF. info1 is just an example of a metadata field, such as 'Year' or whatever. There are only about eight fields or so, and not every PDF is relevant to all of them.
I assume there should be a reasonable way to create a database, MySQL or other, by reading the spreadsheet (which I can just save as .csv or .txt or something). This part I am sure I can handle.
I want to be able to look up/search for a PDF file by entering various search terms based on the metadata, and get a list of results in a web interface or a custom window, where I can click on a result and open the file. Basically a typical search window with predefined fields you can fill in to get results - like at an old-school library terminal.
I have decent coding skills in Python, mostly math, but some file skills as well. I am looking for guidance on what tools and approach I should take. My short-term goal is to be able to query, find files, and open the results. Long term, I want to be able to share this with the public so they can search and find stuff.
After trying to figure out what to search for online, I am obviously at a loss. How do you suggest I do this, and what tools or libraries should I use? I cannot find an example of this online; I am not sure how to word it.
The actual data stuff could be done with Pandas:
read the CSV/Excel file into Pandas
perform the search on the Pandas DataFrame, e.g. using df.query() (see the sketch below)
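For example, a minimal sketch (books.csv and the 'Year' column are assumed names based on the question):
import pandas as pd

# Load the spreadsheet exported as CSV
df = pd.read_csv('books.csv')

# Search the metadata; 'Year' is an assumed column name
results = df.query('Year == 1995')

# List the matching PDFs so they can be opened
print(results['filename'].tolist())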
But this does not give you a GUI. For that you could build a web app using the Flask or Django framework. That, however, is not something one masters overnight :)
This is a good course to learn that kind of stuff: https://www.edx.org/course/cs50s-web-programming-with-python-and-javascript?index=product&queryID=01efddd992de28a8b1b27d136111a2a8&position=3

Is there a Python or R package to access the "tags" of Windows 10 files?

I would like to organize images into different categories with tags on Windows 10. This is an example of the info I would like to read.
I also need to preserve the file names, since they contain other useful information. I want to later have a list, or a dataframe, with the file name and the corresponding tag name.
Most Google searches about accessing file metadata suggest reading the EXIF metadata of the image, but this does not contain the "tag" info shown on Windows 10. Apparently the tags are stored on NTFS (https://karl-voit.at/2019/11/26/Tagging-Files-With-Windows-10/). Trying to access the tags with Python's IPTCInfo3 doesn't work either.
Does anyone know how to read these tags from Python or R?
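One possible approach, as an untested sketch: Windows exposes the Tags field through the shell property system as System.Keywords, which the propsys bindings in pywin32 can read (the file path below is a placeholder):
from win32com.propsys import propsys, pscon

def read_tags(path):
    # Read the Windows "Tags" field (System.Keywords) from the
    # shell property store; returns a tuple of tags, or None if unset
    store = propsys.SHGetPropertyStoreFromParsingName(path)
    return store.GetValue(pscon.PKEY_Keywords).GetValue()

print(read_tags(r'C:\photos\example.jpg'))
Looping this over a folder would give you the filename-to-tags list you are after.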

Bulk import of .json files into ArangoDB with Python

I have a huge collection of .json files, each containing hundreds or thousands of documents, that I want to import into ArangoDB collections. Can I do it using Python, and if so, can anyone share an example of how to do it from a list of files? i.e.:
for i in filelist:
    import i to collection
I have read the documentation but I couldn't find anything even resembling that.
So after a lot of trial and error I found out that I had the answer in front of me. I didn't need to import the .json file as a file; I just needed to read it and then do a bulk import of the documents. The code is like this:
import json

# 'db' is a python-arango database handle obtained from ArangoClient
a = db.collection('collection_name')
for x in list_of_json_files:
    with open(x, 'r') as json_file:
        data = json.load(json_file)
        a.import_bulk(data)
So actually it was quite simple. In my implementation I am collecting the .json files from multiple folders and importing them into multiple collections. I am using the python-arango 5.4.0 driver.
I had this same problem. Though your implementation will be slightly different, the answer you need (maybe not the one you're looking for) is to use the "bulk import" functionality.
Since ArangoDB doesn't have an "official" Python driver (that I know of), you will have to peruse other sources for a good idea of how to solve this.
The HTTP bulk import/export docs provide curl commands, which can be neatly translated to Python web requests. Also see the section on headers and values.
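For instance, a curl bulk import translates to something like this rough sketch with the requests library (server URL, credentials, and collection name are placeholders):
import requests

# POST a JSON array of documents to ArangoDB's bulk import endpoint
docs = [{'_key': 'doc1', 'value': 1}, {'_key': 'doc2', 'value': 2}]
r = requests.post(
    'http://localhost:8529/_api/import',
    params={'collection': 'collection_name', 'type': 'list'},
    json=docs,
    auth=('root', 'password'),
)
r.raise_for_status()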
ArangoJS has a bulk import function, which works with an array of objects, so there's no special processing or preparation required.
I have also used the arangoimport tool to great effect. It's command-line, so it could be controlled from Python, or used stand-alone in a script. For me, the key here was making sure my data was in JSONL or "JSON Lines" format (each line of the file is a self-contained JSON object, no bounding array or comma separators).
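Since arangoimport is a command-line tool, it can be driven from Python with subprocess (a sketch; the file, collection, and database names are placeholders):
import subprocess

# Import a JSON Lines file into a collection via the arangoimport CLI
subprocess.run([
    'arangoimport',
    '--file', 'data.jsonl',
    '--type', 'jsonl',
    '--collection', 'collection_name',
    '--server.database', 'db_name',
], check=True)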

Pelican taking longer and longer as I write more and more posts - is it re-making the old posts too?

Hi, I started using the Pelican static site generator, but I noticed that generating the HTML takes more and more time as I write more and more posts.
Is it re-making the old posts as well? Is there any way I can run make html and build only the new post, adding it to the existing ones?
Is it re-making the old posts as well?
Yes, it is!
In fact, writing the files each time is said to be a lot faster and more reliable than comparing, saving, and hashing. From the Pelican FAQ:
In order to reliably determine whether the HTML output is different
before writing it, a large part of the generation environment
including the template contexts, imported plugins, etc. would have to
be saved and compared, at least in the form of a hash (which would
require special handling of unhashable types), because of all the
possible combinations of plugins, pagination, etc. which may change in
many different ways. This would require a lot more processing time and
memory and storage space. Simply writing the files each time is a lot
faster and a lot more reliable.
Read "Why does Pelican always write all HTML files even with content caching enabled?"
Is there any way I can run make html and build only the new post, adding it to the existing ones?
By setting the WRITE_SELECTED list in your settings, you can specify which output files should be written; only those files will be written.
This list can be also specified on the command line using the --write-selected option, which accepts a comma-separated list of output file paths. By default this list is empty, so all output is written.
Read: "Writing only selected content"

How to programmatically insert comments into a Microsoft Word document?

Looking for a way to programmatically insert comments (using the comments feature in Word) at a specific location in an MS Word document. I would prefer an approach that is usable across recent versions of the standard MS Word formats and implementable in a non-Windows environment (ideally using Python and/or Common Lisp). I have been looking at the OpenXML SDK but can't seem to find a solution there.
Here is what I did:
Create a simple document with Word (i.e. a very small one)
Add a comment in Word
Save as docx.
Use Python's zipfile module to access the archive (docx files are ZIP archives).
Dump the content of the entry "word/document.xml" in the archive. This is the XML of the document itself.
This should give you an idea of what you need to do. After that, you can use one of the XML libraries in Python to parse the document, change it, and add it back to a new ZIP archive with the extension ".docx". Simply copy every other entry from the original ZIP and you have a new, valid Word document.
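Steps 4 and 5 might look like this (a minimal sketch; the file name is a placeholder):
import zipfile

# A .docx file is a ZIP archive; dump the main document XML
with zipfile.ZipFile('commented.docx') as docx:
    print(docx.read('word/document.xml').decode('utf-8'))
The comment text itself lives in a separate part, word/comments.xml, with anchor elements in word/document.xml.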
There is also a library which might help: openxmllib
If this is server-side (non-interactive), use of the Word application itself is unsupported (but I see this is not applicable here). So either take that route or use the OpenXML SDK to learn the markup needed to create a comment. With that knowledge, it is all about manipulating data.
The .docx format is a ZIP of XML files with a defined structure, so once you get into the ZIP and find the right XML file, it mostly becomes a matter of modifying an XML DOM.
The best route might be to take a docx, copy it, add a comment (using Word) to one, and compare. A diff will show you the kind of elements/structures you need to look up in the SDK (or the ISO/Ecma standard).
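A quick way to run that comparison (a rough sketch; the file names are placeholders):
import zipfile
import difflib

def docx_xml(path):
    # Pull the main document XML out of a .docx (a ZIP archive)
    with zipfile.ZipFile(path) as z:
        return z.read('word/document.xml').decode('utf-8').splitlines()

# Diff the original against the copy that has a comment added in Word
for line in difflib.unified_diff(docx_xml('before.docx'), docx_xml('after.docx'), lineterm=''):
    print(line)
The diff should surface elements such as w:commentRangeStart and w:commentReference, which you can then look up in the standard.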
