Python 3 shelve hiding data?

I scraped a large amount of data from a database and saved it as "first_database.db" using Python's shelve module (I'm using Python 3.4). I've had problems with shelve before (see my old issues), which IIRC were probably due to something related to my ancient OS (OSX 10.9.4) and gdbm/dbm.gnu.
Now I have a more intractable problem: I made a new file that's ~170 MB, and I can only access a single key/value, no matter what.
I know the superset of possible keys, and trying to access any of them gives me a KeyError (except for one). When I save the value of the single key that doesn't return a KeyError as a new shelve database, its size is only 16 KB, so I know the data is in the 170 MB file, but I can't access it.
Am I just screwed?
Furthermore, I have made a copy of the database and tried to add more keys to it (~95). That database will say that it has three keys, but when I try to access the value of the third one, I get the following error:
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/shelve.py", line 114, in __getitem__
value = Unpickler(f).load()
_pickle.UnpicklingError: invalid load key, ''.

I don't know the cause of the issue, but maybe this alternative will help you:
https://github.com/dagnelies/pysos
It's like shelve but does not rely on an underlying implementation and saves its data in plain text. That way, you could even open the DB file to inspect its content if something unexpected occurs.
Note also that shelve relies on an underlying dbm implementation. That means that if you saved your shelve on Linux, you might not be able to read it on a Mac, for instance, if the dbm implementation differs (there are several of them).
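If it helps with debugging, the standard library can tell you which dbm backend a given file was written with. A minimal check, using the filename from the question, might look like this:

import dbm

# whichdb guesses the backend that wrote the file: it returns a module
# name such as 'dbm.gnu' or 'dbm.ndbm', an empty string if the format
# can't be recognised, or None if the file can't be opened at all.
print(dbm.whichdb('first_database.db'))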

Related

Removing JSON objects that aren't correctly formatted Python

I'm building a chatbot database atm. I use data from pushshift.io. In order to deal with the big data file (I understand that json loads everything into RAM, so if you only have 16 GB of RAM and are working with 30 GB of data, that is a no-no), I wrote a bash script that splits the big file into smaller 3 GB chunks so that I can run them through json.loads (or pd.read_json). The problem is that whenever I run my code it returns
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I took a look at the temp JSON file I had just created and I see this in my JSON file:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
A correctly formatted sample of the data looks like this:
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
I noticed that my bash script split the file without paying attention to the JSON object boundaries. So my question is: is there a way to write a function in Python that can detect JSON objects that are not correctly formatted and delete them?
There isn't a lot of information to go on, but I would challenge the frame a little.
There are several incremental json parsers available in Python. A quick search shows ijson should allow you to traverse your very large data structure without exploding.
You also should consider another data format (or a real database), or you will easily find yourself spending time reimplementing much much slower versions of features that already exist with the right tools.
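For illustration, here is a rough sketch of streaming with ijson, assuming the big file is a single top-level JSON array (the filename and handling are made up; adjust the prefix if your dump is shaped differently):

# pip install ijson -- third-party streaming JSON parser
import ijson

# Stream the objects one at a time instead of loading 30 GB into RAM.
# 'item' is the prefix for elements of a top-level array.
with open("comments.json", "rb") as fh:      # hypothetical filename
    for comment in ijson.items(fh, "item"):
        print(comment.get("id"))             # replace with your own processing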
If you are using the json standard library, then calling json.loads on badly formatted data will raise a JSONDecodeError. You can put your code in a try/except block and check for this exception to make sure you only process correctly formatted data.
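As a rough sketch of that approach, assuming each chunk holds one JSON object per line (the chunk filename is made up):

import json

good_records = []
with open("chunk_aa") as fh:                 # hypothetical chunk produced by the bash split
    for line in fh:
        line = line.strip()
        if not line:
            continue
        try:
            good_records.append(json.loads(line))
        except json.JSONDecodeError:
            # This line was cut in half by the split; skip it.
            pass

print(len(good_records), "valid objects kept")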

Overwrite a python file while using it?

I have a first file (data.py):
database = {
    'school': 2,
    'class': 3
}
and my second Python file (app.py):
import data
del data.database['school']
print(data.database)
>>>{'class': 3}
But nothing changed in data.py? Why?
And how can I change it from my app.py?
del data.database['school'] modifies the data in memory, but does not modify the source code.
Modifying source code to manage the persistence of your data is not good practice IMHO.
You could use a database, a csv file, a json file ...
To elaborate on Gelineau's answer: at runtime, your source code is turned into a machine-usable representation (known as "bytecode") which is loaded into the process memory, then executed. When the del data.database['school'] statement (in its bytecode form) is executed, it only modifies the in-memory data.database object, not (hopefully!) the source code itself. Actually, your source code is not "the program", it's a blueprint for the runtime process.
What you're looking for is known as data persistence (data that "remembers" its last known state between executions of the program). There are many solutions to this problem, ranging from the simple "write it to a text or binary file somewhere and re-read it at startup" to full-blown multi-server database systems. Which solution is appropriate for you depends on your program's needs and constraints, whether you need to handle concurrent access (multiple users / processes editing the data at the same time), etc., so there's really no one-size-fits-all answer. For the simplest use cases (single user, small datasets, etc.), json or csv files written to disk, or a simple binary key:value file format like dbm (anydbm in Python 2) or shelve (both in Python's stdlib), can be enough. As soon as things get a bit more complex, SQL databases are most often your best bet (no wonder they are still the industry standard and will remain so for long years).
In all cases, data persistence is not "automagic"; you will have to write quite some code to make sure your changes are saved in a timely manner.
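For the simple case above, a minimal sketch of the "json file" route might look like this (data.json is a made-up filename that replaces data.py as the store):

import json

DATA_FILE = "data.json"                      # hypothetical file replacing data.py

# Load the saved state, falling back to the defaults on the first run.
try:
    with open(DATA_FILE) as fh:
        database = json.load(fh)
except FileNotFoundError:
    database = {'school': 2, 'class': 3}

database.pop('school', None)
print(database)                              # {'class': 3}

# Write the change back so the next run sees it.
with open(DATA_FILE, "w") as fh:
    json.dump(database, fh)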
What you are trying to achieve is basically a file operation.
When you import data, it just loads an instance of your file into memory and creates a reference to it in your new file, i.e. app.py. So if you modify it in app.py, you are just modifying the instance in RAM, not the actual file stored on the hard drive.
If you want to change the source code of another file (which is not good practice), then you can use file operations.

Attribute system similar to HTTP Headers for local files

I am in the process of writing a program and need some guidance. Essentially, I am trying to determine if a file has some marker or flag attached to it. Sort of like the attributes in an HTTP header.
If such a marker exists, that file will be manipulated in some way (moved to another directory).
My question is:
Where exactly should I be storing this flag/marker? Do files have a system similar to HTTP headers? I don't want to access or manipulate the contents of the file, just some kind of property of the file that can be edited without corrupting the actual file--and it must be rather universal among file types, as my potential domain of file types is unbounded. I have some experience with Web APIs, so I am familiar with HTTP headers and JSON. Does any similar system exist for local files in Windows? I am especially interested in hearing from anyone who has professional/industry knowledge of common techniques that programmers use when trying to store 'metadata' in files in order to access it later. Or if anyone knows where to point me, as I am unsure what I should be researching.
For the record, I am going to write a program for Windows probably using Golang or Python. And the files I am going to manipulate will be potentially all common ones (.docx, .txt, .pdf, etc.)
Metadata you wish to add is best kept in a separate file or database covering all files.
Or in another file with the same name and a different extension or prefix, which you can make hidden.
Relying on the file system is very tricky, and your data will be bound by the restrictions and capabilities of the file system your file is stored on.
Also, you cannot count on your data remaining intact, as any application may wish to change these flags.
And some of them have a very specific, clearly defined use, such as creation time, modification time, access time...
See, if you only need to flag the document, you may wish to use the creation time, which will stay unchanged throughout the life of the document (until it is copied), to store your flags. :D
Very dirty business, unprofessional, unreliable and all that.
But it's a solution. A poor one, but it exists.
I do not know of the FAT32 or NTFS file systems supporting any extra bits for flagging except those already used by the OS.
Unix's EXT family of file systems does support some extra bits. And even then you should be careful in case some other important application makes use of them for something.
Mac OS may support some metadata by itself, but I am not 100% sure.
On Windows, you have one more option for associating extra data with a file, but I wouldn't use that either.
Well, the NTFS file system (FAT doesn't support this) has a feature called streams.
In essence, the same file can have multiple data streams under itself, i.e. more than one set of file contents under the same file node.
To be clearer: the same file contains two different files.
When you open the file normally, only the main stream is visible to the application. Applications must check whether the other streams are present and choose the one they want to follow.
So, you may choose to store metadata in a second stream of the file.
But what if all streams are already taken?
Even more, anti-virus programs may block your access to the metadata out of paranoia, or at least ask for permission.
I don't know why MS included that option, probably for file duplication or something, but bad hackers made use of the fact that you can store some data under an existing regular file that nobody is aware of.
Imagine a virus writing its copy into another stream of one of the programs already there.
All that is needed for it to start instead of your old program the next time you run it is a batch script added to the task scheduler that flips the two streams, making the virus data the main one.
Nasty trick! So when this feature started to be abused, anti-virus software started restricting files with multiple streams, so it's as if this feature doesn't exist.
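Just to illustrate what the stream trick looks like (not a recommendation, for the reasons above): on Windows with NTFS, Python's normal open() can address an alternate data stream with a "name:stream" path. The filename and stream name below are made up, and the stream is silently lost if the file is copied to FAT/exFAT or uploaded somewhere.

# Windows + NTFS only
path = "report.docx"                          # hypothetical file

# Write metadata into an alternate stream without touching the main contents.
with open(path + ":usermeta", "w") as stream:
    stream.write("flag=move-to-archive")

# Read it back later.
with open(path + ":usermeta") as stream:
    print(stream.read())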
If you want to add some metadata using the OS's technology, use the Windows registry,
but even that is unwise.
What can I tell you?
Don't add metadata to files; keep a separate file, or index your data in special files with the same name as the file you are referring to, in the same folder.
If you are dealing with binary files like docx and pdf, you're best off storing the metadata in separate files or in an SQLite file.
Metadata is usually stored separately from files, in data structures called inodes (at least on Unix systems; Windows probably has something similar). But you probably don't want to go that deep into the rabbit hole.
If your goal is to query the system based on metadata, then it would be easier and more efficient to use something like SQLite. Having the metadata in the file would mean that you would need to open the file, read it into memory from disk, and then check the metadata - i.e. slower queries.
If you don't need to query based on metadata, then storing metadata in the file might make sense. It would reduce the dependencies in your application, but in order to access the contents of the file through Word or Adobe Reader, you'd need to strip the metadata before handing it off to the application. Not worth the hassle, usually.
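A minimal sketch of the "separate SQLite index" idea, with made-up table and column names:

import sqlite3

conn = sqlite3.connect("file_metadata.db")    # hypothetical sidecar database
conn.execute(
    "CREATE TABLE IF NOT EXISTS file_meta (path TEXT PRIMARY KEY, flag TEXT)"
)

# Flag a file without touching its contents.
conn.execute(
    "INSERT OR REPLACE INTO file_meta (path, flag) VALUES (?, ?)",
    (r"C:\docs\report.docx", "move"),
)
conn.commit()

# Later: find every file that carries the flag and move it.
for (path,) in conn.execute("SELECT path FROM file_meta WHERE flag = ?", ("move",)):
    print(path)

conn.close()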

How to create a corrupt pkl file python

I'm creating a library that caches function return values to pkl files. However, sometimes when I terminate the program while writing to the pkl files, I wind up with corrupt pkl files (not always). I'm setting up the library to deal with these corrupt files (that lead mostly to an EOFError, but may also lead to an IOError). However, I need to create files that I know are corrupt to test this, and the method of terminating the program is not consistent. Is there some other way to write to a pkl file and be guaranteed an EOFError or IOError when I subsequently read from it?
Short answer: You don't need them.
Long answer: There's a better way to handle this, take a look below.
Ok, let's start by understanding each of these exceptions separately:
The EOFError happens whenever the parser reaches the end of the file without a complete representation of an object and, therefore, is unable to rebuild the object.
An IOError represents a reading error; the file could be deleted or have its permissions revoked during the process.
Now, let's develop a strategy for testing it.
One common idiom is to encapsulate the offending method, pickle.Pickler for example, with a method that may randomly throw these exceptions. Here is an example:
import pickle
from random import random

def chaos_pickle(obj, file, io_error_chance=0, eof_error_chance=0):
    # Randomly raise the exceptions we want the calling code to handle.
    if random() < io_error_chance:
        raise IOError("Chaotic IOError")
    if random() < eof_error_chance:
        raise EOFError("Chaotic EOFError")
    # Otherwise pickle the object normally.
    return pickle.Pickler(file).dump(obj)
Using this instead of the traditional pickle.Pickler ensures that your code randomly throws both of the exceptions (notice that there's a caveat, though: if you set io_error_chance to 1, it will never raise an EOFError).
This trick is quite useful when used along with the mock library (unittest.mock) to create faulty objects for testing purposes.
Enjoy!
Take a bunch of your old, corrupted pickles and use those. If you don't have any, take a bunch of working pickles, truncate them quasi-randomly, and see which ones give errors when you try to load them. Alternatively, if the "corrupt" files don't need to even resemble valid pickles, you could just unpickle random crap you wouldn't expect to work. For example, mash the keyboard and try to unpickle the result.
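A minimal sketch of the truncation approach (filenames are made up); cutting a pickle short usually produces an EOFError or UnpicklingError on load:

import pickle

# Write a healthy pickle first.
with open("good.pkl", "wb") as fh:
    pickle.dump({"answer": 42, "items": list(range(1000))}, fh)

# Keep only the first half of the bytes to simulate an interrupted write.
with open("good.pkl", "rb") as src, open("corrupt.pkl", "wb") as dst:
    data = src.read()
    dst.write(data[: len(data) // 2])

# Loading the truncated file should now fail.
try:
    with open("corrupt.pkl", "rb") as fh:
        pickle.load(fh)
except (EOFError, pickle.UnpicklingError) as exc:
    print("got the expected failure:", exc)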
Note that the docs say
The pickle module is not intended to be secure against erroneous or
maliciously constructed data. Never unpickle data received from an
untrusted or unauthenticated source.

*large* python dictionary with persistence storage for quick look-ups

I have 400 million lines of unique key-value info that I would like to be available for quick look-ups in a script. I am wondering what would be a slick way of doing this. I did consider the following, but I'm not sure if there is a way to disk-map the dictionary without using a lot of memory except during dictionary creation.
pickled dictionary object: not sure if this is an optimum solution for my problem
NoSQL-type databases: ideally I want something which has minimum dependency on third-party stuff, plus the key-values are simply numbers. If you feel this is still the best option, I would like to hear that too. Maybe it will convince me.
Please let me know if anything is not clear.
Thanks!
-Abhi
If you want to persist a large dictionary, you are basically looking at a database.
Python comes with built in support for sqlite3, which gives you an easy database solution backed by a file on disk.
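A minimal sketch of the sqlite3 route, with made-up table and column names (since the keys and values are numbers, an INTEGER primary key gives indexed look-ups straight from disk):

import sqlite3

conn = sqlite3.connect("kv.db")
conn.execute("CREATE TABLE IF NOT EXISTS kv (k INTEGER PRIMARY KEY, v INTEGER)")

# Bulk-load the pairs; executemany keeps memory usage flat during creation.
pairs = [(1, 100), (2, 200), (3, 300)]        # stand-in for the 400 million real pairs
conn.executemany("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", pairs)
conn.commit()

# Look-ups use the primary-key index rather than RAM.
row = conn.execute("SELECT v FROM kv WHERE k = ?", (2,)).fetchone()
print(row[0] if row else None)

conn.close()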
No one has mentioned dbm. It is opened like a file, behaves like a dictionary and is in the standard distribution.
From the docs https://docs.python.org/3/library/dbm.html
import dbm

# Open database, creating it if necessary.
with dbm.open('cache', 'c') as db:

    # Record some values
    db[b'hello'] = b'there'
    db['www.python.org'] = 'Python Website'
    db['www.cnn.com'] = 'Cable News Network'

    # Note that the keys are considered bytes now.
    assert db[b'www.python.org'] == b'Python Website'

    # Notice how the value is now in bytes.
    assert db['www.cnn.com'] == b'Cable News Network'

    # Often-used methods of the dict interface work too.
    print(db.get('python.org', b'not present'))

    # Storing a non-string key or value will raise an exception (most
    # likely a TypeError).
    db['www.yahoo.com'] = 4

# db is automatically closed when leaving the with statement.
I would try this before any of the more exotic options; note that loading a pickled dictionary will pull everything into memory.
Cheers
Tim
In principle the shelve module does exactly what you want. It provides a persistent dictionary backed by a database file. Keys must be strings, but shelve will take care of pickling/unpickling values. The type of db file can vary, but it can be a Berkeley DB hash, which is an excellent lightweight key-value database.
Your data size sounds huge so you must do some testing, but shelve/BDB is probably up to it.
Note: The bsddb module has been deprecated. Possibly shelve will not support BDB hashes in future.
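A minimal sketch of the shelve route (keys must be strings, so numeric keys are stored as str; the filename is made up):

import shelve

# Build the shelf once; with the default writeback=False only the touched
# entries live in memory.
with shelve.open("big_lookup") as db:
    db["12345"] = 67890            # values can be any picklable object

# Later, in the look-up script: only the requested values are unpickled.
with shelve.open("big_lookup", flag="r") as db:
    print(db.get("12345"))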
Without a doubt (in my opinion), if you want this to persist, then Redis is a great option.
Install redis-server
Start redis server
Install the redis Python package (pip install redis)
Profit.
import redis

ds = redis.Redis(host="localhost", port=6379)

with open("your_text_file.txt") as fh:
    for line in fh:
        line = line.strip()
        k, _, v = line.partition("=")
        ds.set(k, v)
The above assumes a file of values like:
key1=value1
key2=value2
etc=etc
Modify insertion script to your needs.
import redis

ds = redis.Redis(host="localhost", port=6379)

# Do your code that needs to do look-ups of keys:
for mykey in special_key_list:
    val = ds.get(mykey)
Why I like Redis:
Configurable persistence options
Blazingly fast
Offers more than just key/value pairs (other data types)
#antirez
I don't think you should try the pickled dict. I'm pretty sure that Python will slurp the whole thing in every time, which means your program will wait for I/O longer than perhaps necessary.
This is the sort of problem for which databases were invented. You are thinking "NoSQL" but an SQL database would work also. You should be able to use SQLite for this; I've never made an SQLite database that large, but according to this discussion of SQLite limits, 400 million entries should be okay.
What are the performance characteristics of sqlite with very large database files?
I personally use LMDB and its Python binding for a DB of a few million records.
It is extremely fast, even for a database larger than the RAM.
It's embedded in the process, so no server is needed.
Dependencies are managed using pip.
The only downside is that you have to specify the maximum size of the DB. LMDB is going to mmap a file of this size. If it's too small, inserting new data will raise an error; if it's too large, you create a sparse file.
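A minimal sketch with the lmdb package (pip install lmdb); map_size is the maximum size mentioned above, and keys/values are bytes:

import lmdb                                   # pip install lmdb

# LMDB mmaps a file of map_size bytes; pick a generous upper bound.
env = lmdb.open("kv.lmdb", map_size=10 * 1024**3)    # 10 GB, adjust to your data

# Insert.
with env.begin(write=True) as txn:
    txn.put(b"12345", b"67890")

# Look up; reads come straight from the memory-mapped file.
with env.begin() as txn:
    print(txn.get(b"12345"))

env.close()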
