Django-dbbackup: how to change backup file extension?

I am trying to use django-dbbackup module to backup my database. After running dbbackup it saves a db dump with a name something like this:
default-user-2018-03-22-220805.psql
Then I delete the last row in one of the tables in my db. Next I run dbrestore and get the following:
> Finding latest backup
> Restoring backup for database 'default' and server 'None'
> Restoring: default-user-2018-03-22-220805.psql
> Restore tempfile created: 407.0 KiB
> Are you sure you want to continue?
> [Y/n] y
> Following files were affected
But after that - nothing happens. The deleted row is not restored in my db.
I have read somewhere (unfortunately I have already lost that page) that to restore my db the dump file must be in .tar format.
I have also tried to use that .psql file with pgAdmin 4 to restore the db via the UI tool, but got an error that the input file is not a valid archive.
And last I tried to use that file with the Windows cmd, running pg_restore, and got:
pg_restore: [archiver] input file does not appear to be a valid archive
So the question is: how to use django-dbbackup dbrestore with its
generated file or how to change the extension format of its output
files if it is not possible to restore from .psql files?
P.S. I have also found a line extension = 'psql' in the source code; I tried changing that, and it worked - I got a .tar file as output, but then dbrestore said:
Finding latest backup
CommandError: There's no backup file available.

You can change the backup file extension with this setting:
DBBACKUP_FILENAME_TEMPLATE = '{databasename}-{servername}-{datetime}.sql'
Define the DBBACKUP_FILENAME_TEMPLATE variable in settings.py.
This setting changes the backup file extension from dump to sql.
You can swap the sql extension for any other extension.
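For context, a minimal settings.py sketch (the storage location below is a placeholder; DBBACKUP_STORAGE and DBBACKUP_STORAGE_OPTIONS are the usual companion settings for a local filesystem backup target):
# settings.py (sketch; adjust paths to your project)
DBBACKUP_STORAGE = 'django.core.files.storage.FileSystemStorage'
DBBACKUP_STORAGE_OPTIONS = {'location': '/var/backups/'}

# Controls the generated file name, including its extension
DBBACKUP_FILENAME_TEMPLATE = '{databasename}-{servername}-{datetime}.sql'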


How to access DVC-controlled files from Oracle?

I have been storing my large files in CLOBs within Oracle, but I am thinking of storing my large files in a shared drive, then having a column in Oracle contain pointers to the files. This would use DVC.
When I do this,
(a) are the paths in Oracle paths that point to the files in my shared drive, as in, the actual files themselves?
(b) or do the paths in Oracle point somehow to the DVC metafile?
Any insight would help me out!
Thanks :)
Justin
EDIT to provide more clarity:
I checked here (https://dvc.org/doc/api-reference/open), and it helped, but I'm not fully there yet ...
I want to pull a file from a remote dvc repository using python (which I have connected to the Oracle database). So, if we can make that work, I think I will be good. But, I am confused. If I specify 'remote' below, then how do I name the file (e.g., 'activity.log') when the remote files are all encoded?
with dvc.api.open(
        'activity.log',
        repo='location/of/dvc/project',
        remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
(NOTE: For testing purposes, my "remote" DVC directory is just another folder on my MacBook.)
I feel like I'm missing a key concept about getting remote files ...
I hope that adds more clarity. Any help figuring out remote file access is appreciated! :)
Justin
EDIT to get insights on 'rev' parameter:
Before my question, some background/my setup:
(a) I have a repo on my MacBook called 'basics'.
(b) I copied into 'basics' a directory of 501 files (called 'surface_files') that I subsequently pushed to a remote storage folder called 'gss'. After the push, 'gss' contains 220 hash directories.
The steps I used to get here are as follows:
> cd ~/Desktop/Work/basics
> git init
> dvc init
> dvc add ~/Desktop/Work/basics/surface_files
> git add .gitignore surface_files.dvc
> git commit -m "Add raw data"
> dvc remote add -d remote_storage ~/Desktop/Work/gss
> git commit .dvc/config -m "Configure remote storage"
> dvc push
> rm -rf ./.dvc/cache
> rm -rf ./surface_files
Next, I ran the following Python code to take one of my surface files, named surface_100141.dat, and used dvc.api.get_url() to get the corresponding remote storage file name. I then copied this remote storage file into my desktop under the file's original name, i.e., surface_100141.dat.
The code that does all this is as follows, but FIRST, MY QUESTION --- when I run the code as it is shown below, no problems; but when I uncomment the 'rev=' line, it fails. I am not sure why this is happening. I used git log and cat .git/refs/heads/master to make sure that I was getting the right hash. WHY IS THIS FAILING? That is my question.
(In full disclosure, my git knowledge is not too strong yet. I'm getting there, but it's still a work in progress! :))
import dvc.api
import os.path
from os import path
import shutil

filename = 'surface_100141.dat'  # This file name would be stored in my Oracle database
home_dir = os.path.expanduser('~') + '/'  # This simply expands '~' into '/Users/ricej/'

resource_url = dvc.api.get_url(
    path=f'surface_files/{filename}',  # Works when 'surface_files.dvc' exists, even when 'surface_files' directory and .dvc/cache do not
    repo=f'{home_dir}Desktop/Work/basics',
    # rev='5c92710e68c045d75865fa24f1b56a0a486a8a45',  # Commit hash, found using 'git log' or 'cat .git/refs/heads/master'
    remote='remote_storage')
resource_url = home_dir + resource_url
print(f'Remote file: {resource_url}')

new_dir = f'{home_dir}Desktop/'  # Will copy fetched file to desktop, for demonstration
new_file = new_dir + filename
print(f'Remote file copy: {new_file}')

if path.exists(new_file):
    os.remove(new_file)
dest = shutil.copy(resource_url, new_file)  # Check your desktop after this to see remote file copy
I'm not 100% sure that I understand the question (it would be great to expand it a bit on the actual use case you are trying to solve with this database), but I can share a few thoughts.
When we talk about DVC, I think you need to specify a few things to identify the file/directory:
Git commit + path (actual path like data/data/xml). Commit (or to be precise any Git revision) is needed to identify the version of the data file.
Or path in the DVC storage (/mnt/shared/storage/00/198493ef2343ao ...) + the actual name of this file. This way you would be saving the info that .dvc files have.
I would say the second way is not recommended since, to some extent, it's an implementation detail of how DVC stores files internally. The public interface to DVC-organized data storage is the repository URL + commit + file name.
Edit (example):
with dvc.api.open(
        'activity.log',
        repo='location/of/dvc/project',
        remote='my-s3-bucket'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
location/of/dvc/project: this path must point to an actual Git repo. This repo should have a .dvc or dvc.lock file that contains the activity.log name and its hash in remote storage:
outs:
- md5: a304afb96060aad90176268345e10355
  path: activity.log
By reading this Git repo and analyzing, let's say, activity.log.dvc, DVC is able to construct the right path: s3://my-bucket/storage/a3/04afb96060aad90176268345e10355
The remote='my-s3-bucket' argument is optional. By default DVC uses the remote that is defined in the repo itself.
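As a rough illustration of that mapping (my own sketch; the bucket URL is the placeholder from above, and this assumes DVC's default layout where the first two characters of the hash become a directory):
# Sketch of how the storage path is derived from the md5 in activity.log.dvc
md5 = 'a304afb96060aad90176268345e10355'  # value taken from the .dvc snippet above
remote_base = 's3://my-bucket/storage'    # placeholder remote URL from .dvc/config
storage_path = f'{remote_base}/{md5[:2]}/{md5[2:]}'
print(storage_path)  # s3://my-bucket/storage/a3/04afb96060aad90176268345e10355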
Let's take another real example:
with dvc.api.open(
        'get-started/data.xml',
        repo='https://github.com/iterative/dataset-registry'
) as fd:
    for line in fd:
        match = re.search(r'user=(\w+)', line)
        # ... Process users activity log
In the https://github.com/iterative/dataset-registry repo you can find the .dvc file, which is enough for DVC to create a path to the file (by also analyzing the repo's config):
https://remote.dvc.org/dataset-registry/a3/04afb96060aad90176268345e10355
You can run wget on this URL to download the file.
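Regarding the rev question in the second edit: both dvc.api.open() and dvc.api.get_url() accept a rev argument, which can be any Git revision (commit hash, branch, or tag); as far as I understand, the surface_files.dvc entry has to be committed in that revision for the lookup to succeed. A minimal sketch:
import os
import dvc.api

repo_path = os.path.expanduser('~/Desktop/Work/basics')

url = dvc.api.get_url(
    path='surface_files/surface_100141.dat',
    repo=repo_path,
    rev='5c92710e68c045d75865fa24f1b56a0a486a8a45',  # example commit hash from the question
    remote='remote_storage',
)
print(url)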

Using paramiko, renaming a file and changing directory fails. Why?

I have to process a file on an SFTP server and, when done, move that file to an archive directory using Paramiko. However, if the file already exists in the archive directory, I want to rename the file at the same time. I have the basics for detecting the existing file in archive and adjusting the name. Basically, the final call looks like:
client.rename('/main-path/file.txt', '/main-path/archive/file_1.txt')
or
client.posix_rename('/main-path/file.txt', '/main-path/archive/file_1.txt')
These commands work on SOME servers with no problem. On other servers, I get an "Errno 2" error from paramiko.
Am I going about this wrong? Maybe I need to rename the file, in place, first?
client.rename('/main-path/file.txt', '/main-path/file_1.txt')
and then
client.rename('/main-path/file_1.txt', '/main-path/archive/file_1.txt')
???
Any help would be appreciated.
You can check whether the file already exists in the archive directory by listing it with the SFTP client, and pick the target name accordingly:
files_in_archive = client.listdir('/main-path/archive')
if 'file.txt' in files_in_archive:
    client.posix_rename('/main-path/file.txt', '/main-path/archive/file_1.txt')
else:
    client.posix_rename('/main-path/file.txt', '/main-path/archive/file.txt')
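An alternative sketch (my own suggestion, assuming Python 3 and an open paramiko SFTPClient named client): probe the target path with stat() instead of listing the whole directory, then rename.
def archive_file(client, src, dst):
    # Move src to dst on the SFTP server, appending a suffix if dst already exists
    try:
        client.stat(dst)                      # raises an IOError (ENOENT) if dst is absent
        base, ext = dst.rsplit('.', 1)
        dst = '{0}_1.{1}'.format(base, ext)   # choose an alternative name
    except IOError:
        pass                                  # dst does not exist yet; keep the original name
    client.posix_rename(src, dst)

archive_file(client, '/main-path/file.txt', '/main-path/archive/file.txt')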

Why Does a Strange File Show Up in the Directory When Using os.walk()?

The project is written in Pycharm on Windows 10.
I wrote a program that grabs .docx files from a directory and searches for information. At the end of the list of file names I get this file: "~$640188.docx"
I get this error when it hits this file:
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file
This error happens when I try to pass the file '~$640188.docx' to the docx2txt.process method
text = docx2txt.process(r'C:\path\to\folder\~$640188.docx')
From what I can see, this file does not exist in the directory I'm searching nor anywhere on my computer. The other strange part is that yesterday I wasn't getting this error.
I know there are sometimes "hidden" files in directories and I ran into those before on my mac (specifically '.DS_Store') but this is a .docx file.
I currently have an ugly solution, which says "don't run the code if you run into '~$640188.docx'". My concern is that this will become more of a problem when I dump 11000 files into the directory.
Where does this file come from?
Below is the code for reference
import docx2txt
import os

check_files = []
for dir, subdir, files in os.walk(r'C:\path\to\folder'):
    for file in files:
        check_files.append(file)

for file in check_files:
    print "file: {0}".format(file)
    text = docx2txt.process(r'C:\path\to\folder\{0}'.format(file))
Hidden .docx files starting with ~$ are simply temporary files created by Word while a file is actively open and being edited; the first two characters of the parent file's name are replaced with ~$. They are usually deleted once you save and close a document, but sometimes they manage to stick around after you quit anyway. Since they are designed to be temporary complements to a proper .docx file, they do not necessarily have the correct zip package structure at all times.
You will do well to skip those. Checking if the file name starts with '~' should be good enough. Just add the following filtering:
check_files2 = [fl for fl in check_files if fl[0] != '~']
for file in check_files2:
    print "file: {0}".format(file)
    text = docx2txt.process(r'C:\path\to\folder\{0}'.format(file))
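A slightly more defensive variant (my own sketch, not the asker's exact code): skip the lock files during the walk itself and build full paths with os.path.join, so files found in subdirectories also resolve correctly.
import os
import docx2txt

check_files = []
for root, subdirs, files in os.walk(r'C:\path\to\folder'):
    for name in files:
        # Word lock files start with '~$'; skip them up front
        if name.startswith('~$'):
            continue
        if name.endswith('.docx'):
            check_files.append(os.path.join(root, name))

for doc_path in check_files:
    print("file: {0}".format(doc_path))
    text = docx2txt.process(doc_path)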

Permission denied when pandas dataframe to tempfile csv

I'm trying to store a pandas dataframe to a tempfile in csv format (in windows), but am being hit by:
[Errno 13] Permission denied: 'C:\Users\Username\AppData\Local\Temp\tmpweymbkye'
import tempfile
import pandas

with tempfile.NamedTemporaryFile() as temp:
    df.to_csv(temp.name)
Where df is the dataframe. I've also tried changing the temp directory to one where I am sure I have write permissions:
tempfile.tempdir='D:/Username/Temp/'
This gives me the same error message
Edit:
The tempfile appears to be locked for editing as when I change the loop to:
with tempfile.NamedTemporaryFile() as temp:
    df.to_csv(temp.name + '.csv')
I can write the file in the temp directory, but then it is not automatically deleted at the end of the loop, as it is no longer a temp file.
However, if I change the code to:
with tempfile.NamedTemporaryFile(suffix='.csv') as temp:
    training_data.to_csv(temp.name)
I get the same error message as before. The file is not open anywhere else.
I encountered the same error message and the issue was resolved after adding "/df.csv" to file_path.
df.to_csv('C:/Users/../df.csv', index = False)
Check your permissions and, according to this post, you can run your program as an administrator by right-clicking it and choosing "Run as administrator".
We can use the to_csv command to export a DataFrame in CSV format. Note that the code below will by default save the data into the current working directory. We can save it to a different folder by adding the folder name and a slash to the file name:
verticalStack.to_csv('foldername/out.csv')
Check out your working directory to make sure the CSV wrote out properly, and that you can open it! If you want, try to bring it back into python to make sure it imports properly.
newOutput = pd.read_csv('out.csv', keep_default_na=False, na_values=[""])
Unlike TemporaryFile(), the user of mkstemp() is responsible for deleting the temporary file when done with it.
Use of this function may introduce a security hole in your program. By the time you get around to doing anything with the file name it returns, someone else may have beaten you to the punch. mktemp() usage can be replaced easily with NamedTemporaryFile(), passing it the delete=False parameter.
Read more.
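For reference, a minimal sketch of the mkstemp() route for this case (my own sketch; df stands in for the asker's DataFrame, and cleanup is the caller's responsibility):
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})  # stand-in for the real DataFrame

fd, tmp_path = tempfile.mkstemp(suffix='.csv')
os.close(fd)             # release the OS-level handle so the path can be reopened on Windows
try:
    df.to_csv(tmp_path, index=False)
    # ... use the CSV file here ...
finally:
    os.remove(tmp_path)  # mkstemp never deletes the file for you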
After export to CSV you can close your file with temp.close().
with tempfile.NamedTemporaryFile(delete=False) as temp:
    df.to_csv(temp.name + '.csv')
temp.close()
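For what it's worth, the underlying issue on Windows is that NamedTemporaryFile keeps the file open, and Windows will not allow a second open() on that name while it is open, which is exactly what to_csv attempts. A sketch of the delete=False pattern with explicit cleanup (df again stands in for the real DataFrame):
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})  # stand-in for the real DataFrame

# delete=False lets us leave the with-block (closing the handle) and then reopen the path
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as temp:
    tmp_path = temp.name

try:
    df.to_csv(tmp_path, index=False)  # the handle is closed now, so Windows allows this
    # ... read or upload the CSV here ...
finally:
    os.remove(tmp_path)               # required, because delete=False skips auto-removal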
Sometimes you need to check whether you have the right permission to read and write the file path, especially when you use a relative path.
xxx.to_csv('%s/file.csv' % file_path, index=False)
Sometimes, it gives that error simply because there is another file with the same name and it has no permission to delete the earlier file and replace it with the new file.
So either name the file differently while saving it,
or
If you are working in Jupyter Notebook or another similar environment, delete the file after executing the cell that reads it into memory, so that when you execute the cell which writes it to disk, no other file exists with that name.
I encountered the same error. I simply had not yet saved my entire Python file. Once I saved my Python file in VS Code as "insertyourfilenamehere".py to Documents (which is in my path), I ran my code again and was able to save my data frame as a CSV file.
As far as I know, this error pops up when you attempt to save a file that has already been saved and is currently open in the background.
You may try closing those files first and then rerunning the code.
Just give a valid path and a file name, e.g.:
final_df.to_csv(r'D:\Study\Data Science\data sets\MNIST\sample.csv')

How do I delete the temp .bed file created by shellfish.py?

I am running shellfish.py to perform principal component analysis. However, I am getting the error below. How do I fix this?
17:18:39 Found .bed format data data_WTCCC_f_650.bed
17:18:39 Found binary mapfile data_WTCCC_f_650.bim
17:18:39 shellfish error: Trying to create link to original data file data_WTCCC_f_650.bed, but link file shellfish-temp-16297/848716990677.bed already exists, presumably from a previous shellfish run. Delete any such files before running again.
You can remove this file from Python with:
import os
os.remove('path/to/bed/file')
Of course you can always delete this file by hand. Just find it in the file explorer (or the tool you use) and delete it.
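If several leftover links have piled up, a small sketch (my assumption: the shellfish-temp-* directories live in the directory you run shellfish from, as the error message suggests) can clear them all at once:
import glob
import os
import shutil

# Remove every leftover shellfish temp directory from previous runs
for temp_dir in glob.glob('shellfish-temp-*'):
    if os.path.isdir(temp_dir):
        shutil.rmtree(temp_dir)
        print('removed', temp_dir)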
