Pymongo - store PDF data as Binary - python

I am working on a project where I want to store PDF data within a Mongo collection. I realize this has been asked around the community, but I am struggling with the result. My code is able to insert a document into the collection, but I'm stuck because it inserts a blank binary object.
import base64, os
import pymongo
import bson
from pathlib import Path

file_used = 'sample.pdf'
my_file = Path(file_used)
if my_file.is_file():
    print("File Exists")  # prints successfully

with open(file_used, 'rb') as fout:
    string = base64.b64encode(fout.read())
    collection.insert_one({"filename": file_used, "pdf_data": string})
Document in MDB Compass:
{
  "_id": {"$oid": "60583bfdffea362641c50944"},
  "filename": "sample.pdf",
  "pdf_data": {"$binary": "", "$type": "0"}
}
My python file is also located in the same directory as my sample PDF file.
Execution: python3 test.py
How can I debug my code further? Thank you in advance for your help.

Looks like something was wrong with the PDF file. It worked once I used another PDF file as my test case.
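For anyone debugging a similar case, a minimal sketch like the following can help confirm whether the file on disk is actually non-empty and whether the bytes survive the round trip. The connection string, database and collection names are placeholders, and storing the raw bytes wrapped in bson.Binary avoids the base64 step entirely:

from pathlib import Path

import pymongo
from bson.binary import Binary

file_used = 'sample.pdf'
raw = Path(file_used).read_bytes()
print(len(raw))  # if this prints 0, the PDF itself is empty or corrupt

# Connection details below are placeholders; adjust to your setup.
client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["pdfs"]

# Store the raw bytes wrapped in bson.Binary (no base64 needed).
collection.insert_one({"filename": file_used, "pdf_data": Binary(raw)})

# Read it back and confirm the stored payload is non-empty.
doc = collection.find_one({"filename": file_used})
print(len(doc["pdf_data"]))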

Related

How to write file to memory filepath and read from memory filepath in Python?

An existing Python package requires a filepath as an input parameter for a method, so it can parse the file from that filepath. I want to use this very specific Python package in a cloud environment where I can't write files to the hard drive. I don't have direct control over the code in the existing Python package, and it's not easy to switch to another environment where I would be able to write files to the hard drive. So I'm looking for a solution that can write a file to a memory filepath and let the parser read directly from this memory filepath. Is this possible in Python? Or are there any other solutions?
Example Python code that works by using the hard drive, which should be changed so that no hard drive is used:
temp_filepath = "./temp.txt"
with open(temp_filepath, "wb") as file:
file.write("some binary data")
model = Model()
model.parse(temp_filepath)
Example Python code that uses memory filesystem to store file, but which does not let parser read file from memory filesystem:
from fs import open_fs

temp_filepath = "./temp.txt"
with open_fs('osfs://~/') as home_fs:
    home_fs.writetext(temp_filepath, "some binary data")

model = Model()
model.parse(temp_filepath)
You're probably looking for StringIO or BytesIO from io.
import io

with io.BytesIO() as tmp:
    tmp.write(content)
    # to continue working with the buffer, rewind the file pointer
    tmp.seek(0)
    # work with tmp
pathlib may also be helpful here.
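A minimal sketch of the BytesIO approach, assuming the package's parse() can also accept a file-like object rather than only a path. Model below is a hypothetical stand-in for the real package's parser, since its actual API is not shown in the question:

import io


class Model:
    """Stand-in for the real package's parser (hypothetical)."""
    def parse(self, source):
        # Assumption: the real parse() can read from a file-like object.
        data = source.read() if hasattr(source, "read") else open(source, "rb").read()
        print(f"parsed {len(data)} bytes")


content = b"some binary data"        # the bytes you would otherwise write to disk
with io.BytesIO(content) as buffer:  # in-memory file, nothing touches the hard drive
    model = Model()
    model.parse(buffer)              # only works if parse() accepts file objects

If the package truly insists on a filesystem path, an in-memory buffer alone will not help, and you would need something like a tmpfs mount or another writable temporary location instead.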

json dump() into specific folder

This seems like it should be simple enough, but I haven't been able to find a working example of how to approach this. Simply put, I am generating a JSON file based on a list that a script generates. What I would like to do is use some variables when running the dump() function, and produce a JSON file in a specific folder. By default it of course dumps into the same place the .py file is located, but I can't seem to find a way to run the .py file and have the JSON file produced in a new folder of my choice:
import json

name = 'Best'
season = '2019-2020'
blah = ['steve', 'martin']

with open(season + '.json', 'w') as json_file:
    json.dump(blah, json_file)
Take for example the above. What I'd want to do is the following:
Take the variable 'name' and use it to generate a folder of the same name inside the folder where the .py file itself lives. The JSON file would then be placed in that folder, where I can manipulate it.
Right now my issue is that I can't find a way to produce the file in a specific folder. Any suggestions? This does seem simple enough, but nothing I've found had a method to do this. Thanks!
Python's pathlib is quite convenient to use for this task:
import json
from pathlib import Path
data = ['steve','martin']
season = '2019-2020'
Paths of the new directory and json file:
base = Path('Best')
jsonpath = base / (season + ".json")
Create the directory if it does not exist and write json file:
base.mkdir(exist_ok=True)
jsonpath.write_text(json.dumps(data))
This will create the directory relative to the directory you started the script in. If you wanted an absolute path, you could use Path('/somewhere/Best').
If you wanted to start the script while being in some other directory and still create the new directory next to the script itself, use: Path(__file__).resolve().parent / 'Best'.
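A short sketch of that variant, reusing the directory and file names from the example above:

import json
from pathlib import Path

data = ['steve', 'martin']
season = '2019-2020'

# Anchor the output directory to the script's own location,
# regardless of the current working directory.
base = Path(__file__).resolve().parent / 'Best'
base.mkdir(exist_ok=True)
(base / (season + ".json")).write_text(json.dumps(data))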
First of all, instead of doing everything in the same place, have a separate function that creates the folder (if it does not already exist) and dumps the JSON data, as below:
import json
import os


def write_json(target_path, target_file, data):
    if not os.path.exists(target_path):
        try:
            os.makedirs(target_path)
        except Exception as e:
            print(e)
            raise
    with open(os.path.join(target_path, target_file), 'w') as f:
        json.dump(data, f)
Then call your function like:
write_json('/usr/home/target', 'my_json.json', my_json_data)
Use string formatting:
import json
import os

name = 'Best'
season = '2019-2020'
blah = ['steve', 'martin']

try:
    os.mkdir(name)
except OSError as error:
    print(error)

with open("{}/{}.json".format(name, season), 'w') as json_file:
    json.dump(blah, json_file)
Use os.path.join():
with open(os.path.join(name, season + '.json'), 'w') as json_file:
    json.dump(blah, json_file)
The advantage over writing a literal slash is that it automatically picks the right separator for the operating system you are on (a slash on Linux, a backslash on Windows).

How do you know where a csv file is stored once you write a DataFrame onto disk?

import pandas as pd

hand_1 = pd.DataFrame({
    'Tables of 5': [5, 10, 15, 20, 25],
    'Tables of 6': [6, 12, 18, 24, 30]})
hand_1.to_csv('Tables.csv')
How do I find out where Tables.csv is stored?
Is this where Python stores CSV files by default, and can this be changed?
It will be saved in your current working directory. If you would like to find out what that is, you can use the following code:
import os
current_directory = os.getcwd()
You can give a full path instead of 'Tables.csv' to store the file in another directory.
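For example, a short sketch of writing to an explicit directory (the target directory here is just an illustration; adjust it to your setup):

import os

import pandas as pd

hand_1 = pd.DataFrame({
    'Tables of 5': [5, 10, 15, 20, 25],
    'Tables of 6': [6, 12, 18, 24, 30]})

print(os.getcwd())  # where relative output paths like 'Tables.csv' end up

target_dir = '/tmp/output'  # example directory; adjust to your setup
os.makedirs(target_dir, exist_ok=True)
hand_1.to_csv(os.path.join(target_dir, 'Tables.csv'))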

Open and read latest json file one time only

SO members... how can I read the latest JSON file in a directory one time only (and print something if there is no new file)? So far I can only read the latest file. The sample script below (run every 45 minutes) opens and reads the latest JSON file in a directory; in this case the latest file is file3.json (a new JSON file is created every 30 minutes). The problem: if file4 is not created for some reason (for example the server fails to create a new JSON file) and the script runs again, it will still re-read the same file3.
files in directory
file1.json
file2.json
file3.json
The script below is able to open and read the latest JSON file created in the directory.
import glob
import json
import os
import os.path
import datetime, time

listFiles = glob.iglob('logFile/*.json')
latestFile = max(listFiles, key=os.path.getctime)
with open(latestFile, 'r') as f:
    mydata = json.load(f)
    print(mydata)
To ensure the script only reads the newest file, and reads it one time only, I expect something like below:
listFiles = glob.iglob('logFile/*.json')
latestFile = max(listFiles, key=os.path.getctime)
if latestFile newer than previous open/read file:  # not sure how to compare with the previously read file
    with open(latestFile, 'r') as f:
        mydata = json.load(f)
        print(mydata)
else:
    print("no new file created")
Thank you for your help. An example solution would be good to share. I can't figure it out myself... it seems simple, but a few days of trial and error got me nowhere. Requirements:
(1) Make sure the latest file in the directory is read.
(2) Make sure any files that were missed are read (e.g. because the script failed to run).
(3) Read each file only once, and warn if there is no new file.
Thank you.
After the SO discussion and suggestions, I found a few methods to resolve, or at least accommodate, some of the requirements. I simply move files that have already been processed. If no new file is created, the script does nothing; and if the script fails, once things are back to normal it will read all the related files that are still available. I think it's good enough for now. Thank you guys...
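A minimal sketch of that move-after-processing approach (the processed/ directory name is just an illustration):

import glob
import json
import os
import shutil

processed_dir = 'logFile/processed'  # illustrative destination folder
os.makedirs(processed_dir, exist_ok=True)

# Oldest first, so missed files are caught up in order.
pending = sorted(glob.glob('logFile/*.json'), key=os.path.getctime)
if not pending:
    print("no new file created")

for path in pending:
    with open(path, 'r') as f:
        mydata = json.load(f)
        print(mydata)
    # Move the file out of the way so it is never read a second time.
    shutil.move(path, os.path.join(processed_dir, os.path.basename(path)))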
Below is an answer, or rather an approach, that I would like to propose:
The idea is as follows:
Every log file written to the directory can carry a key-value pair called "creation_time": timestamp (for each fileX.json stored on the server). Your script runs every 45 minutes to pick up the file that was dumped into the directory. In the normal case you are able to read the file, and when you exit the script you store the filename of the last file you read, together with the creation_time taken from that fileX.json, in a logger.json.
An example for a logger.json is as follows:
{
  "creation_time": "03520201330",
  "file_name": "file3.json"
}
Whenever the server fails or a delay occurs, fileX.json may have been rewritten, or new fileX.json files may have been created in the directory. In those situations, you would first open logger.json and obtain both the timestamp and the last filename, as shown in the example above. Using that last filename, you can compare the old timestamp stored in the logger with the new timestamp in the corresponding fileX.json. If they match, nothing has changed: you only read the files that come after it and rewrite the logger.
If they don't match, you re-read that last fileX.json again and then proceed to read the files that come after it.
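A rough sketch of that bookkeeping, assuming each fileX.json carries a "creation_time" field as described (paths and field names follow the example above):

import glob
import json
import os

LOGGER_PATH = 'logFile/logger.json'


def load_logger():
    # Return the last-read state, or an empty state on the first run.
    if os.path.exists(LOGGER_PATH):
        with open(LOGGER_PATH) as f:
            return json.load(f)
    return {"creation_time": None, "file_name": None}


def save_logger(state):
    with open(LOGGER_PATH, 'w') as f:
        json.dump(state, f)


state = load_logger()
all_files = sorted(glob.glob('logFile/file*.json'), key=os.path.getctime)
names = [os.path.basename(p) for p in all_files]

# Everything after the last file recorded in logger.json is still unread.
if state["file_name"] in names:
    unread = all_files[names.index(state["file_name"]) + 1:]
    # (A rewrite check against state["creation_time"] could be added here,
    #  as described above, to catch a re-written last file.)
else:
    unread = all_files  # nothing recorded yet, read everything

if not unread:
    print("no new file created")

for path in unread:
    with open(path) as f:
        mydata = json.load(f)
    print(mydata)
    state = {"creation_time": mydata.get("creation_time"),
             "file_name": os.path.basename(path)}
    save_logger(state)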

python: open a json file in a different directory?

Here's my code:
import json

with open("json.items") as json_file:
    json_data = json.load(json_file)
It works fine when I move the JSON file into the same directory. However, I'm trying to get the JSON file from a different directory. How would I do that? This is what I have tried, and it's not working:
with open("/lowerfolder/json.items") as json_file:
Any help? Thanks
Depending on your platform, starting a path with / means an absolute path from the root.
Meaning a relative path should be open("lowerfolder/json.items"), without the leading /.
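A short sketch of both variants (the folder name lowerfolder comes from the question; anchoring to the script's directory is an optional extra):

import json
from pathlib import Path

# Relative path: lowerfolder is resolved against the current working directory.
with open("lowerfolder/json.items") as json_file:
    json_data = json.load(json_file)

# Or anchor the path to the script's own directory, so it works
# regardless of where the script is started from.
path = Path(__file__).resolve().parent / "lowerfolder" / "json.items"
json_data = json.loads(path.read_text())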
