Store large text corpus in Python

I am trying to build a large text corpus from the Wikipedia dump.
I represent every article as a document object consisting of:
the original text: a string
the preprocessed text: a list of tuples, where each tuple contains a (stemmed) word and the position of the word in the original text
some additional information like title and author
I am searching for an efficient way to save these objects to disk. The following operations should be possible:
adding a new document
accessing documents via an ID
iterating over all documents
It is not necessary to remove an object once it has been added.
I could imagine the following methods:
Serializing each article to a separate file, for example using pickle: the downside here is probably the large number of operating system calls
Storing all documents in a single XML file, or blocks of documents in several XML files: storing the list that represents the preprocessed document in XML format adds a lot of overhead, and I think it's quite slow to read a list back from XML
Using an existing package for storing the corpus: I found the Corpora package, which seems to be very fast and efficient, but it only supports storing strings plus a header containing metadata. Simply putting the preprocessed text into the header makes it incredibly slow.
What would be a good way to do this? Maybe there is a package for this purpose that I have not found yet?
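As an illustrative sketch only (the Document class and field names below are hypothetical, not from the original post), the standard-library shelve module covers all three operations with a single on-disk file: adding under a string ID, access by ID, and iteration over all keys.

    import shelve

    class Document(object):
        """Hypothetical container matching the description above."""
        def __init__(self, doc_id, text, preprocessed, title=None, author=None):
            self.doc_id = doc_id
            self.text = text                  # original text as a string
            self.preprocessed = preprocessed  # list of (stemmed_word, position) tuples
            self.title = title
            self.author = author

    # Adding a new document: shelve pickles each value under its string key.
    with shelve.open("corpus.db") as corpus:
        doc = Document("42", "Some article text ...",
                       [("some", 0), ("articl", 5), ("text", 13)], title="Example")
        corpus[doc.doc_id] = doc

    # Accessing a document via its ID and iterating over all documents.
    with shelve.open("corpus.db") as corpus:
        print(corpus["42"].title)
        for doc_id in corpus:
            doc = corpus[doc_id]              # iterate over all documents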

Related

How to store data consisting of articles and dates and other attributes for use in python

I am working with large textual data (articles containing words, symbols, escape characters, line breaks, etc.). Each article also has attributes like date, author, etc.
It will be used in Python for NLP purposes. What is the best way to store this data so that it can be read efficiently from disk?
EDIT:
I have loaded the data as a pandas DataFrame in Python.
Storing it as CSV results in corruption due to line breaks (\n) in the text, making the data unusable.
Storing it as JSON is not working due to encoding problems.
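As a hedged sketch (the DataFrame contents here are made up), the line-break problem with CSV can usually be avoided by quoting every field, and the JSON encoding problem by writing UTF-8 directly instead of ASCII-escaping:

    import csv
    import pandas as pd

    df = pd.DataFrame({
        "text": ["First article ...\nwith a line break", "Second article"],
        "date": ["2015-01-01", "2015-01-02"],
        "author": ["alice", "bob"],
    })

    # CSV: quote all fields so embedded newlines stay inside the quoted value.
    df.to_csv("articles.csv", index=False, quoting=csv.QUOTE_ALL, encoding="utf-8")
    df_back = pd.read_csv("articles.csv", encoding="utf-8")

    # JSON: keep non-ASCII characters as UTF-8 instead of escaping them.
    df.to_json("articles.json", orient="records", force_ascii=False)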

Best structure for on-disk retrieval of large data using Python?

I basically have a large (multi-terabyte) dataset of text (it's in JSON but I could change it to dict or dataframe). It has multiple keys, such as "group" and "user".
Right now I'm filtering the data by reading through the entire text for these keys. It would be far more efficient to have a structure where I filter and read only the key.
Doing the above would be trivial if it fit in memory, and I could use standard dict/pandas methods and hash tables. But it doesn't fit in memory.
There must be an off-the-shelf system for this. Can anyone recommend one?
There are discussions about this, but some of the better ones are old. I'm looking for the simplest off-the-shelf solution.
I suggest you split your large file into multiple smaller files using readlines(CHUNK), and then you can process them one by one.
I worked with large JSON files; at the beginning, processing took 45 seconds per file and my program ran for two days, but after I split the files up, it finished in only 4 hours.
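A minimal sketch of that splitting approach (the file name and the 64 MB chunk size are illustrative assumptions):

    # Split one huge line-delimited file into part files of roughly CHUNK bytes
    # each, so every part can be processed independently afterwards.
    CHUNK = 64 * 1024 * 1024  # ~64 MB of complete lines per part

    def split_file(path, chunk=CHUNK):
        part = 0
        with open(path, "r", encoding="utf-8") as src:
            while True:
                lines = src.readlines(chunk)   # read about `chunk` bytes of whole lines
                if not lines:
                    break
                with open("%s.part%04d" % (path, part), "w", encoding="utf-8") as dst:
                    dst.writelines(lines)
                part += 1
        return part

    split_file("huge_dataset.json")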

HDF5 and Python: Automatic export of an object

I'm new to the HDF5 file format and have been experimenting successfully in Python with h5py. Now it's time to store real data.
I will need to store a list of objects, where each object can be one of several types and each type will have a number of arrays and strings. The critical part is that the list of objects will be different for each file, and there will be hundreds of files.
Is there a way to automatically export an arbitrary, nested object into (and back from) an HDF5 file? I'm imagining a routine that would automatically traverse the hierarchy of a nested object and build the same hierarchy in the HDF5 file.
I've read through the h5py documentation and don't see any such traversal routines. Furthermore, Google and SO searches are (strangely) not showing this capability. Am I missing something, or is there another way to look at this?
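As an illustrative sketch (the helper names are my own, and it only handles nested dicts of NumPy arrays, strings, and scalars rather than arbitrary objects), a recursive walk can mirror a nested structure as HDF5 groups and datasets and rebuild it afterwards:

    import h5py
    import numpy as np

    def save_dict(h5group, obj):
        """Recursively mirror a nested dict into an HDF5 group."""
        for key, value in obj.items():
            if isinstance(value, dict):
                save_dict(h5group.create_group(key), value)
            else:
                h5group[key] = value          # arrays, strings, scalars become datasets

    def load_dict(h5group):
        """Rebuild the nested dict structure from an HDF5 group."""
        out = {}
        for key, item in h5group.items():
            if isinstance(item, h5py.Group):
                out[key] = load_dict(item)
            else:
                out[key] = item[()]           # read the dataset back into memory
        return out

    data = {"meta": {"title": "example", "count": 3}, "signal": np.arange(10.0)}
    with h5py.File("objects.h5", "w") as f:
        save_dict(f, data)
    with h5py.File("objects.h5", "r") as f:
        restored = load_dict(f)

For real objects, the same walk could be run over obj.__dict__, with the class name stored as an HDF5 attribute so the right type can be reconstructed on load.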

Implementing python list and Binary search tree

I am working on making an index for a text file. The index will be a list of every word and symbol like (~!##$%^&*()_-{}:"<>?/.,';[]1234567890|), counting the number of times each token occurs in the text file, and printing all of this in ascending ASCII value order.
I am going to read a .txt file, split out the words and special characters, and store them in a list. Can anyone give me an idea of how to use binary search in this case?
If your lookup is small (say, up to 1000 records), then you can use a dict; you can either pickle() it, or write it out to a text file. Overhead (for this size) is fairly small anyway.
If your look-up table is bigger, or there are a small number of lookups per run, I would suggest using a key/value database (e.g. dbm).
If it is too complex to use a dbm, use a SQL (e.g. sqlite3, MySQL, Postgres) or NoSQL database. Depending on your application, you can get a huge benefit from the extra features these provide.
In any case, all the hard work is done for you, much better than you could expect to do it yourself. These formats are all standard, so you will get simple-to-use tools for reading the data.
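A minimal sketch of the dbm suggestion, applied to the token-counting question above (the file name is an illustrative assumption; dbm stores bytes, so counts are converted on the way in and out):

    import dbm

    # Build an on-disk key/value table of token counts.
    tokens = ["apple", "banana", "apple", "!", "banana", "apple"]
    with dbm.open("token_counts", "c") as db:
        for token in tokens:
            key = token.encode("utf-8")            # dbm keys and values are bytes
            count = int(db.get(key, b"0")) + 1
            db[key] = str(count).encode("utf-8")

    # Look up a single token, or walk the keys in ascending byte order.
    with dbm.open("token_counts", "r") as db:
        print(int(db[b"apple"]))                   # -> 3
        for key in sorted(db.keys()):
            print(key.decode("utf-8"), int(db[key]))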

build optimal deflate dictionary for common data

I went through the SO questions in this field, but couldn't find what I was looking for.
I'm sending small binary files (~5 MB) over a narrow-band network; they should be pretty similar to each other, and I want to compress them using zlib (Python).
I would like to build a pre-defined dictionary, but standard common dictionaries are not relevant since it's non-textual information.
Moreover, finding the common sequences manually is also not an easy job, and it would only work for this specific type of file.
I'm looking for a test-n-inspect method where I could just compress a file, and see the dictionary used for that output (the compressed data).
Then, by collecting those dictionaries I can run some tests to find the optimal.
The question is (after searching the zlib specification): how can I extract the dictionary from the compressed binary data?
I see that each piece of compressed data starts with some binary data, then two \x00 bytes, then the data.
So I believe it's there, but how can I extract and use it? (Or am I not even close?)
(Testing zlib with Python 2.7.)
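For reference, a hedged sketch of how a preset dictionary is supplied in Python 3.3+ via the zdict argument (zlib.compressobj in Python 2.7 does not accept one). In the zlib container format (RFC 1950) the dictionary itself is not embedded in the output, only its Adler-32 checksum, so it cannot be extracted from the compressed data; the sample_dictionary below is purely illustrative:

    import zlib

    # Byte sequences expected to recur in the files, most common ones last
    # (illustrative content only; a real dictionary would come from sample files).
    sample_dictionary = b"\x00\x01common-header\x00\x02common-footer"

    payload = b"\x00\x01common-header payload bytes \x00\x02common-footer"

    # Compress with the preset dictionary ...
    co = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                          zlib.DEF_MEM_LEVEL, zlib.Z_DEFAULT_STRATEGY,
                          sample_dictionary)
    compressed = co.compress(payload) + co.flush()

    # ... and decompress with the same dictionary on the receiving side.
    do = zlib.decompressobj(zlib.MAX_WBITS, sample_dictionary)
    restored = do.decompress(compressed) + do.flush()
    assert restored == payload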
