Alternative to a very large dictionary (~40 million keys) - python

I have a rather large dictionary with about 40 million keys which I naively stored just by writing {key: value, key: value, ...} into a text file. I didn't consider the fact that I could never realistically access this data because python has an aversion to loading and evaluating a 1.44GB text file as a dictionary.
I know I could use something like shelve to be able to access the data without reading all of it at once, but I'm not sure how I would even convert this text file to a shelve file without regenerating all the data (which I would prefer not to do). Are there any better alternatives for storing, accessing, and potentially later changing this much data? If not, how should I go about converting this monstrosity over to a format usable by shelve?
If it matters, the dictionary is of the form {(int, int, int, int): [[int, int], bool]}

Redis is an in-memory key-value store that can be used for this kind of problem.
There are several Python clients.
The hmset operation allows you to insert multiple key-value pairs in one call.
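As a minimal sketch of that idea, assuming a local Redis server and a recent redis-py client: the hash name 'mydict' and the helper names below are made up, and the key/value shape is taken from the question. Tuple keys and list values are serialized to strings/JSON because Redis stores strings:
import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def store_batch(items):
    # items: {(int, int, int, int): [[int, int], bool], ...} -- shape from the question
    mapping = {repr(k): json.dumps(v) for k, v in items.items()}
    # one round trip for many pairs; hmset is the older, deprecated name for this
    r.hset('mydict', mapping=mapping)

def lookup(key):
    raw = r.hget('mydict', repr(key))
    return json.loads(raw) if raw is not None else None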

https://github.com/dagnelies/pysos
It works like a normal Python dict, but has the advantage that it's much more efficient than shelve on Windows, and it is also cross-platform, unlike shelve, whose storage format differs based on the OS.
To install:
pip install pysos
Usage:
import pysos
db = pysos.Dict('somefile')
db['hello'] = 'persistence!'
Just to give a ballpark figure, here is a mini benchmark (on my Windows laptop):
import time
import pysos

t = time.time()
N = 100 * 1000
db = pysos.Dict("test.db")
for i in range(N):
    db["key_" + str(i)] = {"some": "object_" + str(i)}
db.close()
print('PYSOS time:', time.time() - t)
# => PYSOS time: 3.424309253692627
The resulting file was about 3.5 MB.
So, in your case, if a million key/value pairs take roughly a minute to insert, it would take almost an hour to insert all 40 million. Of course, the machine's specs can influence that a lot; it's just a very rough estimate.

Related

Exploding memory usage while building large dict of lists

I'm generating lots of nested lists from a text tokenizing process and would like to store them in a dict with keys that relate to the key of the string in the source dict.
Example of source dict:
{
    '1' : 'the horse',
    '2' : 'grass is green',
    ...
}
Example of desired output, where the integers are outputs of tokenizing and hashing process:
{
    '1' : [[1342, 24352, 524354, 356345],
           [35663, 53635, 25245, 457577]],
    '2' : [[43412, 324423, 66546, 86887],
           [398908, 46523, 432432, 9854],
           [87667, 34423, 132132, 35454]],
    ...
}
As I iterate through my source dict, feed the values to my tokenizing function, and assign each key/tokenized-value pair to a new dict, the new dict uses a massive amount of memory, way more than the actual space the data should take up.
Here's some simulation code that illustrates my issue:
import gc
import numpy as np
import matplotlib.pyplot as plt
import sys
import json
import os
import psutil
pid = os.getpid()
py = psutil.Process(pid)
def memory_use():
    memoryUse = py.memory_info()[0] / 2.**30  # memory use in GB
    return memoryUse

def tokenize_sim():
    # returns a list of 30 lists of 4 random ints
    # (simulates 30 tokenized words)
    return [[int(n) for n in np.random.randint(low=0, high=1e6, size=4)] for i in range(31)]
memory = []
tokens = dict()
for i in range(800001):
    tokens[i] = tokenize_sim()
    if i % 50000 == 0:
        memoryUse = memory_use()
        print(i, '- memory use:', memoryUse)
        memory.append(memoryUse)

plt.figure()
plt.plot(np.arange(17)*50, memory)
plt.grid()
plt.xlabel('Thousands of iterations')
plt.ylabel('Memory used in GB')
plt.show()
print('System size in bytes:', sys.getsizeof(tokens))
Here's the plot of memory usage:
sys.getsizeof(tokens) returns 41943144 bytes. I tried writing this dict to a json file and that used 821 MB. None of these are even close to the 6 GB of memory this is gobbling up.
What am I missing here? I'm guessing it's some memory allocation issue, but I haven't managed to find any solution. I need to process a source dictionary of about 12 million entries, and my 64 GB of memory just doesn't seem to be enough to build a dict of lists. Any help would be much appreciated.
I can't really grasp what you're doing aside from plotting your memory usage (you might want to scrap the code that isn't really related to the problem), but I can give you some general pointers on how to handle lots of data and why it eats up all your memory.
The reasons. Python isn't really a high-performance language in terms of efficiency and speed. In Python, everything is an object and every object carries its own properties. When you create a nested list, the parent list, all the nested lists, and every little integer have their own metadata and descriptions attached to them. You can see that by calling dir() on different objects - for example, an integer in Python has 69 different methods and attributes attached to it. All of that has to fit in memory, which is why your memory is gobbled up much faster than the size of the JSON file, which doesn't hold any metadata about the data.
How to combat that? There are languages out there that are much better at handling big amounts of data, simply because they are not as developer-friendly and don't look out for you every step of the way. You could switch to C and accomplish the task with 8 GB of RAM.
But I'm not recommending switching languages; just use some better practices. Right now you're holding all your data in lists and dicts (?), which really isn't efficient. Look up what numpy and pandas can do for you - they are exactly meant for such use cases. They are implemented in C, which provides much better performance, while offering a Python API that makes usage rather convenient.
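A rough sketch of that idea (the names are mine; tokenize_sim is the simulation function from the question): storing each entry's token ids as a small int32 ndarray instead of nested Python lists keeps the integers as raw machine words rather than full Python objects.
import numpy as np

tokens = {}
for i in range(800001):
    # each value becomes a (rows, 4) int32 array instead of a list of lists
    tokens[i] = np.asarray(tokenize_sim(), dtype=np.int32)

# going further, a single preallocated 3-D array (~0.4 GB here) avoids the
# per-entry dict/list overhead entirely:
all_tokens = np.empty((800001, 31, 4), dtype=np.int32)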
If that isn't enough, just do what you're doing in chunks. If all you want is a huge list of lists with some integers in it, you can build it in iterations, saving it out at intervals, so you don't need it all in memory at the same time. If Python's native garbage collection isn't enough, you can try to force Python to release memory by del-ing the huge variables and calling the garbage collector:
import gc
del big_bad_variable
gc.collect()

Reading .dat without delimiters into array in python

I have a .dat file with no delimiters that I am trying to read into an array. Say each new line represents one person, and the variables in each line are defined by a fixed number of characters: e.g. the first variable "year" is the first four characters and the second variable "age" is the next 2 characters (no delimiters within the line), e.g.:
201219\n
201220\n
201256\n
Here is what I am doing right now:
data_file = 'filename.dat'
file = open(data_file, 'r')
year = []
age = []
for line in file:
    year.append(line[0:4])
    age.append(line[4:])
This works fine for a small number of lines and variables, but when I try loading the full data file (500Mb with 10 million lines and 20 variables) I get a MemoryError. Is there a more efficient way to load this type of data into arrays?
First off, you're probably better off with a list of class instances than a bunch of parallel lists, from a software engineering standpoint. If you try this, you probably should look into __slots__ to decrease the memory overhead.
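A minimal sketch of what that could look like (the class name and fields here are made up for the .dat layout described above):
class Person(object):
    __slots__ = ('year', 'age')   # no per-instance __dict__, so much less overhead

    def __init__(self, line):
        self.year = int(line[0:4])
        self.age = int(line[4:6])

people = [Person(line) for line in open('filename.dat')]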
You could also try pypy - it has some memory optimizations for homogeneous lists.
I'd probably use gdbm or bsddb rather than sqlite, if you want an on-disk solution. gdbm and bsddb look like dicts, except you index them by string keys and the values are strings too. So your class (the one I mentioned above) would have a __str__ and/or __repr__ method that converts it to a string (you could use pickle) for storage in the table, and your constructor would deal with reversing the process.
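A small sketch along those lines, assuming Python 3's dbm module (which wraps gdbm where available) and pickle for serialization; the file name and the use of the line number as key are my own choices:
import dbm
import pickle

with dbm.open('people.db', 'c') as db:
    with open('filename.dat') as f:
        for i, line in enumerate(f):
            record = (int(line[0:4]), int(line[4:6]))  # year, age
            db[str(i).encode()] = pickle.dumps(record)

    # read one record back
    print(pickle.loads(db[b'0']))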
If you ever get to such large data that a gdbm or bsddb is too slow, you could try just writing to a flat file - that'll not be as nice for jumping around obviously, but it eliminates a lot of seek()'ing which can be very advantageous sometimes.
HTH
The problem here doesn't appear to be reading the file so much as fitting it into memory. When you're talking about 200 million of anything in memory, you're going to have some issues.
Try storing it as a list of strings (i.e. trade memory for CPU), or, if you can, just don't store it at all.
Another option to try is dumping it into a sqlite database. If you use an in-memory db you might end out with the same issue, but maybe not.
If you go for the string style, do something like this:
def get_age(person):
    return int(person[4:])

people = file.readlines()  # Wait a while....
for person in people:
    print(get_age(person)*2)  # Or something else
Here's an example of getting mean income for a particular age in a particular year:
def get_mean_income_by_age_and_year(people, target_age, target_year):
    count = 0
    total = 0.0
    for person in people:
        income, age, year = get_income(person), get_age(person), get_year(person)
        if age == target_age and year == target_year:
            total += income
            count += 1
    if count:
        return total/count
    else:
        return 0.0
Really, though, this basically does what storing it in a sqlite database would do for you. If there are only a couple of very specific things you want to do, then going this way is probably reasonable. But it sounds like there are probably several things you want to be doing with this info - if so a sqlite database is probably what you want.
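A rough sketch of the sqlite route (the table and column names are mine, and the fixed-width offsets follow the year/age/income layout used elsewhere in this thread):
import sqlite3

conn = sqlite3.connect('people.db')
conn.execute('CREATE TABLE IF NOT EXISTS people (year INTEGER, age INTEGER, income REAL)')
with open('filename.dat') as f:
    conn.executemany(
        'INSERT INTO people VALUES (?, ?, ?)',
        ((int(l[0:4]), int(l[4:6]), float(l[6:12])) for l in f))
conn.commit()

# the mean-income example above becomes a single SQL statement
row = conn.execute('SELECT AVG(income) FROM people WHERE age = ? AND year = ?',
                   (30, 2012)).fetchone()
print(row[0])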
A more efficient data structure for lots of uniform numeric data is the array. Depending on how much memory you have, using an array may work.
import array
year = array.array('i') # int
age = array.array('i') # int
income = array.array('f') # float
with open('data.txt', 'r') as f:
    for line in f:
        year.append(int(line[0:4]))
        age.append(int(line[4:6]))
        income.append(float(line[6:12]))

Managing dictionary memory size in python

I have a program which imports a text file through standard input and aggregates the lines into a dictionary. However, the input file is very large (on the order of 1 TB) and I won't have enough space to store the whole dictionary in memory (running on a 64 GB RAM machine). Currently I've got a very simple clause which outputs the dictionary once it has reached a certain length (in this case 100) and clears the memory. The output can then be aggregated at a later point.
So I want to output the dictionary once memory is full. What is the best way of managing this? Is there a function which gives me the current memory usage? Is it costly to keep checking? Am I using the right tactic?
import sys

X_dic = dict()

# Used to print the dictionary in required format
def print_dic(dic):
    for key, value in dic.iteritems():
        print "{0}\t{1}".format(key, value)

for line in sys.stdin:
    value, key = line.strip().split(",")
    if key not in X_dic:
        X_dic[key] = []
    X_dic[key].append(value)
    # Limit size of dic.
    if len(X_dic) == 100:
        print_dic(X_dic)  # Print and clear dictionary
        X_dic = dict()

# Now output
print_dic(X_dic)
The resource module provides some information on how many resources (memory, etc.) you are using. See here for a nice little usage example.
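For example, a small sketch on a Unix-like system (ru_maxrss is the peak resident set size, reported in kilobytes on Linux and bytes on macOS, so it only ever grows):
import resource

def peak_memory_mb():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print('peak RSS so far: %.1f MB' % peak_memory_mb())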
On a Linux system (I don't know where you are) you can watch the contents of the file /proc/meminfo. As part of the proc file system it is updated automatically.
But I object to the whole strategy of monitoring memory and using up as much of it as possible, actually. I'd rather propose dumping the dictionary regularly (after 1M entries have been added, or so). Keeping the dict small will probably speed up your program, and it presumably has advantages for later processing if all dumps are of similar size. If you dump a huge dict that only fit into memory because nothing else was using memory at the time, you will later have trouble re-reading that dict if something else is using some of your memory. You would then have to create a situation in which nothing else is using memory (e.g. reboot or similar). Not very convenient.

Storing a list of 1 million key value pairs in python

I need to store a list of 1 million key-value pairs in python. The key would be a string/integer while the value would be a list of float values. For example:
{"key":36520193,"value":[[36520193,16.946938],[26384600,14.44005],[27261307,12.467529],[16456022,11.316026],[26045102,8.891106],[148432817,8.043456],[36670593,7.111857],[43959215,7.0957513],[50403486,6.95],[18248919,6.8106747],[27563337,6.629243],[18913178,6.573106],[42229958,5.3193846],[17075840,5.266625],[17466726,5.2223654],[47792759,4.9141016],[83647115,4.6122775],[56806472,4.568034],[16752451,4.39949],[69586805,4.3642135],[23207742,3.9822476],[33517555,3.95],[30016733,3.8994896],[38392637,3.8642135],[16165792,3.6820507],[14895431,3.5713203],[48865906,3.45],[20878230,3.45],[17651847,3.3642135],[24484188,3.1820507],[74869104,3.1820507],[15176334,3.1571069],[50255841,3.1571069],[103712319,3.1571069],[20706319,2.9571068],[33542647,2.95],[17636133,2.95],[66690914,2.95],[19812372,2.95],[21178962,2.95],[37705610,2.8642135],[20812260,2.8642135],[25887809,2.8642135],[18815472,2.8642135],[17405810,2.8642135],[46598192,2.8642135],[20592734,2.6642137],[44971871,2.5],[27610701,2.45],[92788698,2.45],[52164826,2.45],[17425930,2.2],[60194002,2.1642137],[122136476,2.0660255],[205325522,2.0],[117521212,1.9820508],[33953887,1.9820508],[22704346,1.9571068],[26176058,1.9071068],[39512661,1.9071068],[43141485,1.8660254],[16401281,1.7],[31495921,1.7],[14599628,1.7],[74596964,1.5],[55821372,1.5],[109073560,1.4142135],[91897348,1.4142135],[25756071,1.25],[25683960,1.25],[17303288,1.25],[42065448,1.25],[72148532,1.2],[19192100,1.2],[85941613,1.2],[77325396,1.2],[18266218,1.2],[114005403,1.2],[16346823,1.2],[43441850,1.2],[60660643,1.2],[41463847,1.2],[33804454,1.2],[20757729,1.2],[18271440,1.2],[51507708,1.2],[104856807,1.2],[24485743,1.2],[16075381,1.2],[68991517,1.2],[96193545,1.2],[63675003,1.2],[70735999,1.2],[25708416,1.2],[80593161,1.2],[42982108,1.2],[120368215,1.2],[24379982,1.2],[14235673,1.2],[20172395,1.2],[161441314,1.2],[37996201,1.2],[35638883,1.2],[46164502,1.2],[74047763,1.2],[19681494,1.2],[95938476,1.2],[20443787,1.2],[87258609,1.2],[34784832,1.2],[30346151,1.2],[40885516,1.2],[197129344,1.2],[14266331,1.2],[15112466,1.2],[26867986,1.2],[82726479,1.2],[23825810,1.2],[14662121,1.2],[32707312,1.2],[17477917,1.2],[123462351,1.2],[5745462,1.2],[16544178,1.2],[23284384,1.2],[45526985,1.2],[23109303,1.2],[26046257,1.2],[53654203,1.2],[133026438,1.2],[25139051,1.2],[65077694,1.2],[17469289,1.2],[15130494,1.2],[148525895,1.2],[15176360,1.2],[44853617,1.2],[9115332,1.2],[16878570,1.2],[132421452,1.2],[6273762,1.2],[124360757,1.2],[21643452,1.2],[9890492,1.2],[16305494,1.2],[18484474,1.2],[22643607,1.2],[60753586,1.2],[9200012,1.2],[30042254,1.2],[8374622,1.2],[15894834,1.2],[18438022,1.2],[78038442,1.2],[22097386,1.2],[21018755,1.2],[20845703,1.2],[164462136,1.2],[19649167,1.2],[24746288,1.2],[27690898,1.2],[42822760,1.2],[160935289,1.2],[178814456,1.2],[53574205,1.2],[41473578,1.2],[82176632,1.2],[82918057,1.2],[102257360,1.2],[17504315,1.2],[18363508,1.2],[50735431,1.2],[80647070,1.2],[40879040,1.2],[17790497,1.2],[191364080,1.2],[14429823,1.2],[22078893,1.2],[121338184,1.2],[113341318,1.2],[48900101,1.2],[38547066,1.2],[20484157,1.2],[16228699,1.2],[21179292,1.2],[15317594,1.2],[55777010,1.2],[15318882,1.2],[182109160,1.2],[45238537,1.2],[19701986,1.2],[32484918,1.2],[18244358,1.2],[18479513,1.2],[19081775,1.2],[21117305,1.2],[19325724,1.2],[136844568,1.2],[32398651,1.2],[20482993,1.2],[14063937,1.2],[91324381,1.2],[20528275,1.2],[14803917,1.2],[16208245,1.2],[17419051,1.2],[31187903,1.2],[54043787,1.2],[167737676,1.2],[24431712,1.2],[24707301,1.2],[24420092,1.2],[15469536,1.2],[26322385,1.2],[77330
594,1.2],[82925252,1.2],[28185335,1.0],[24510384,1.0],[24407244,1.0],[41229669,1.0],[16305330,1.0],[26246555,1.0],[28183026,1.0],[49880016,1.0],[104621640,1.0],[36880083,1.0],[19705747,1.0],[22830942,1.0],[21440766,1.0],[54639609,1.0],[49077908,1.0],[29588859,1.0],[23523447,1.0],[20803216,1.0],[20221159,1.0],[1416611,1.0],[3744541,1.0],[21271656,1.0],[68956490,1.0],[96851347,1.0],[39479083,1.0],[27778893,1.0],[18785448,1.0],[39010580,1.0],[65796371,1.0],[124631720,1.0],[27039286,1.0],[18208354,1.0],[51080209,1.0],[37388787,1.0],[18462037,1.0],[31335156,1.0],[21346320,1.0],[23911410,1.0],[73134924,1.0],[807095,1.0],[44465330,1.0],[16732482,1.0],[37344334,1.0],[734753,1.0],[23006794,1.0],[33549858,1.0],[102693093,1.0],[51219631,1.0],[20695699,1.0],[4081171,1.0],[27268078,1.0],[80116664,1.0],[32959253,1.0],[85772748,1.0],[27109019,1.0],[28706024,1.0],[59701568,1.0],[23559586,1.0],[15693493,1.0],[56908710,1.0],[6541402,1.0],[15855538,1.0],[126169000,1.0],[24044209,1.0],[80700514,1.0],[21500333,1.0],[18431316,1.0],[44496963,1.0],[68475722,1.0],[15202472,1.0],[19329393,1.0],[39706174,1.0],[22464533,1.0],[81945172,1.0],[22101236,1.0],[19140282,1.0],[31206614,1.0],[15429857,1.0],[27711339,1.0],[14939981,1.0],[62591681,1.0],[52551600,1.0],[40359919,1.0],[27828234,1.0],[21414413,1.0],[156132825,1.0],[21586867,1.0],[23456995,1.0],[25434201,1.0],[30107143,1.0],[34441838,1.0],[37908934,1.0],[47010618,1.0],[139903189,1.0],[17833574,1.0],[758608,1.0],[15823236,1.0],[37006875,1.0],[10302152,1.0],[40416155,1.0],[21813730,1.0],[18785600,1.0],[30715906,1.0],[428333,1.0],[22059385,1.0],[15155074,1.0],[11061902,1.0],[1177521,1.0],[20449160,1.0],[197117628,1.0],[42423692,1.0],[24963961,1.0],[19637934,1.0],[35960001,1.0],[43269420,1.0],[43283406,1.0],[20269113,1.0],[59409413,1.0],[25548759,1.0],[23779324,1.0],[21449197,1.0],[14327149,1.0],[15429316,1.0],[16159485,1.0],[18785846,1.0],[67651295,1.0],[28389815,1.0],[19780922,1.0],[23841181,1.0],[78391198,1.0],[60765383,1.0],[37689397,1.0],[6447142,1.0],[31332871,1.0],[30364057,1.0],[14120151,1.0],[16303064,1.0],[23023236,1.0],[103610974,1.0],[108382988,1.0],[19791811,1.0],[17121755,1.0],[46346811,1.0],[45618045,1.0],[25587721,1.0],[25362775,1.0],[20710218,1.0],[20223138,1.0],[21035409,1.0],[101894425,1.0],[38314814,1.0],[24582667,1.0],[21181713,1.0],[15901190,1.0],[18197299,1.0],[38802447,1.0],[19668592,1.0],[14515734,1.0],[16870853,1.0],[16488614,1.0],[95955871,1.0],[14780915,1.0],[21188490,1.0],[24243022,1.0],[27150723,1.0],[29425265,1.0],[36370563,1.0],[36528126,1.0],[43789332,1.0],[82773533,1.0],[19726043,1.0],[20888549,1.0],[30271564,1.0],[14874125,1.0],[121436823,1.0],[56405314,1.0],[46954727,1.0],[25675498,1.0],[12803352,1.0],[23888081,1.0],[18498684,1.0],[38536306,1.0],[22851295,1.0],[20140595,1.0],[22311506,1.0],[31121729,1.0],[53717630,1.0],[100101137,1.0],[24753205,1.0],[24523660,1.0],[19544133,1.0],[20823773,1.0],[22677790,1.0],[15227791,1.0],[57525419,1.0],[28562317,1.0],[9629222,1.0],[24047612,1.0],[30508215,1.0],[59084417,1.0],[71088774,1.0],[142157505,1.0],[15284851,1.0],[17164788,1.0],[17885166,1.0],[18420140,1.0],[19695929,1.0],[20572844,1.0],[23479429,1.0],[26642006,1.0],[43469093,1.0],[50835878,1.0],[172049453,1.0],[20604508,1.0],[21681591,1.0],[20052907,1.0],[21271938,1.0],[17842661,1.0],[6365162,1.0],[18130749,1.0],[19249062,1.0],[24193336,1.0],[25913173,1.0],[28647246,1.0],[26072121,1.0],[14522546,1.0],[16409683,1.0],[18785475,1.0],[28969818,1.0],[52757166,1.0],[7120172,1.0],[112237392,1.0],[116779546,1.0],[57107167,1.0],[26347170,1.0],[265659
46,1.0],[44409004,1.0],[21105244,1.0],[14230524,1.0],[44711134,1.0],[101753075,1.0],[783214,1.0],[22885110,1.0],[39367703,1.0],[23042739,1.0],[682903,1.0],[38082423,1.0],[16194263,1.0],[2425151,1.0],[52544275,1.0],[21380763,1.0],[18948541,1.0],[34954261,1.0],[34848331,1.0],[29245563,1.0],[19499974,1.0],[16089776,1.0],[77040291,1.0],[18197476,1.0],[1704551,1.0],[15002838,1.0],[17428652,1.0],[20702626,1.0],[29049111,1.0],[34004383,1.0],[34900333,1.0],[48156959,1.0],[50906836,1.0],[15742480,1.0],[41073372,1.0],[37338814,1.0],[1344951,1.0],[8320242,1.0],[14719153,1.0],[20822636,1.0],[168841922,1.0],[19877186,1.0],[14681605,1.0],[15033883,1.0],[23121582,1.0],[23670204,1.0],[41466869,1.0],[18753325,1.0],[21358050,1.0],[78132538,1.0],[132386271,1.0],[86194654,1.0],[17225211,1.0],[107179714,1.0],[18785430,1.0],[19408059,1.0],[19671129,1.0],[24347716,1.0],[24444592,1.0],[25873045,1.0],[7871252,1.0],[14138300,1.0],[16873300,1.0],[14546496,1.0],[165964253,1.0],[15529287,1.0],[95956928,1.0],[19404587,1.0],[21506437,1.0],[22832029,1.0],[19542638,1.0],[30827536,1.0],[5748622,1.0],[22757990,1.0],[41259253,1.0],[23738945,1.0],[19030602,1.0],[21410102,1.0],[28206360,1.0],[136411179,1.0],[17499805,1.0],[26107245,1.0],[127311408,1.0],[77023233,1.0],[20448733,1.0],[20683840,1.0],[22482597,1.0],[15485441,1.0],[28220280,1.0],[55351351,1.0],[70942325,1.0],[9763482,1.0],[15732001,1.0],[27750488,1.0],[18286352,1.0],[122216533,1.0],[19562228,1.0],[5380672,1.0],[22293700,1.0],[59974874,1.0],[44455025,1.0],[90420314,1.0],[22657153,1.0],[16660662,1.0],[14583400,1.0],[16689545,1.0],[94242867,1.0],[44527648,1.0],[40366319,1.0],[33616007,1.0],[23438958,1.0],[15317676,1.0],[14075928,1.0],[1978331,1.0],[33347901,1.0],[16570090,1.0],[32347966,1.0],[26671992,1.0],[101907019,1.0],[24986014,1.0],[23235056,1.0],[40001164,1.0],[21891032,1.0],[18139329,1.0],[9648652,1.0],[16105942,1.0],[3004231,1.0],[20762929,1.0],[28061932,1.0],[39513172,1.0],[15012305,1.0],[18349404,1.0],[22196210,1.0],[110509537,1.0],[20318494,1.0],[21816984,1.0],[22456686,1.0],[62290422,1.0],[93472506,0.8660254],[52305889,0.70710677],[67337055,0.70710677],[122768292,0.5],[35060854,0.5],[43289205,0.5],[87271142,0.5],[28096898,0.5],[79297090,0.5],[24016107,0.5],[48736472,0.5],[109982897,0.5],[98367357,0.5],[21816847,0.5],[73129588,0.5],[23807734,0.5],[76724998,0.5],[63153228,0.5],[21628966,0.5],[14465428,0.5],[42609851,0.5],[30213342,0.5],[17021966,0.5],[96616361,0.5],[97546740,0.5],[67613930,0.5],[21234391,0.5],[87245558,0.5],[36841912,0.5]]}
I would be performing lookups on this data structure. What would be the most appropriate data structure to achieve my purpose? I have heard recommendations about Redis. Would it be worth looking into it rather than the traditional python data structure? If not, please suggest other mechanisms.
Edit
The 'value' field is a list of lists. In most cases, the value may contain up to 1,000 size-2 lists.
Redis would be appropriate if...
You want to share the queue between multiple processes or instances of your app.
You want the data to be persistent, so if your app goes down it can pick up where it left off.
You want a super fast, easy solution.
Memory usage is a concern.
I'm not sure on the last one, but I'm guessing using dict or some other collection type in Python is likely to have a higher memory footprint than storing all your key/values in a single Redis hash.
update
I tested the memory usage by storing the example array 1 million times, both in memory and in Redis. Storing all the values in a Redis hash requires serializing the array. I chose JSON serialization, but this could easily have been a more efficient binary format, which Redis supports.
1 million copies of the array provided, in a Ruby Hash (which should be comparable to Python's dict) indexed using an integer key similar to the one shown: memory usage increased by ~350 MB (similar to the Python results by #gnibbler).
1 million copies of the array, serialized to a JSON string, in a Redis hash indexed using the same numbers: memory usage increased by ~250 MB.
Both were very fast, with Redis being slightly faster when I measured 10,000 random lookups against each of them. I know it's not Python, but this should be at least illustrative.
Also, to answer the OP's other concern, Redis has no trouble handling very large string values. Its max string size is currently 512 MB.
Really shouldn't be a problem
>>> d=dict((str(n), range(20)) for n in range(1000000))
took ~350MB to create. Your keys/values may be much larger of course
I looked at storage in NumPy and also in redis.
First, NumPy:
>>> import numpy as NP
>>> K = NP.random.randint(1000, 9999, int(1e6))
>>> V = 5 * NP.random.rand(int(2e6)).reshape(-1, 2)
>>> kv = K.nbytes + V.nbytes
>>> '{:15,d}'.format(kv)
'      2,400,000'  # 2.4 MB
Now redis:
I represented the values as strings, which should be very efficient in redis.
>>> from redis import Redis # using the python client for redis
>>> # w/ a server already running:
>>> r0 = Redis(db=0)
>>> for i in range(K.shape[0]):
...     v = ' '.join(NP.array(V[i], dtype=str).tolist())
...     r0.set(K[i], v)
>>> # save db to disk asynchronously, then shut down the server
>>> r0.shutdown()
The redis database (.rdb file) is 2.9 MB
Of course, this is not an "apples-to-apples" comparison, because I chose what I believed to be the most natural model to represent the OP's data in each library: strings for redis and 2-element NumPy arrays for NumPy.

Searching for a string in a large text file - profiling various methods in python

This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...
I have a 600 MB file with 6 million lines of strings (Category paths from DMOZ project).
The entry on each line is unique.
I want to load the file once & keep searching for matches in the data
The methods I tried are listed below, along with the time taken to load the file, the search time for a negative match, and the memory usage shown in the task manager.
1) set :
(i) data = set(f.read().splitlines())
(ii) result = search_str in data
Load time ~ 10s, Search time ~ 0.0s, Memory usage ~ 1.2GB
2) list :
(i) data = f.read().splitlines()
(ii) result = search_str in data
Load time ~ 6s, Search time ~ 0.36s, Memory usage ~ 1.2GB
3) mmap :
(i) data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
(ii) result = data.find(search_str)
Load time ~ 0s, Search time ~ 5.4s, Memory usage ~ NA
4) Hash lookup (using code from #alienhard below):
Load time ~ 65s, Search time ~ 0.0s, Memory usage ~ 250MB
5) File search (using code from #EOL below):
with open('input.txt') as f:
print search_str in f  # search_str ends with the line terminator ('\n' or '\r\n') used in the file
Load time ~ 0s, Search time ~ 3.2s, Memory usage ~ NA
6) sqlite (with primary index on url):
Load time ~ 0s, Search time ~ 0.0s, Memory usage ~ NA
For my use case, it seems like going with the set is the best option as long as I have sufficient memory available. I was hoping to get some comments on these questions:
A better alternative e.g. sqlite ?
Ways to improve the search time using mmap. I have a 64-bit setup.
[edit] e.g. bloom filters
As the file size grows to a couple of GB, is there any way I can keep using 'set' e.g. split it in batches ..
[edit 1] P.S. I need to search frequently, add/remove values and cannot use a hash table alone because I need to retrieve the modified values later.
Any comments/suggestions are welcome !
[edit 2] Update with results from methods suggested in answers
[edit 3] Update with sqlite results
Solution: Based on all the profiling & feedback, I think I'll go with sqlite. The second alternative would be method 4. One downside of sqlite is that the database size is more than double that of the original csv file of URLs. This is due to the primary index on url.
Variant 1 is great if you need to launch many sequential searches. Since set is internally a hash table, it's rather good at search. It takes time to build, though, and only works well if your data fit into RAM.
Variant 3 is good for very big files, because you have plenty of address space to map them and the OS caches enough data. You do a full scan, though; it can become rather slow once your data no longer fit into RAM.
SQLite is definitely a nice idea if you need several searches in a row and you can't fit the data into RAM. Load your strings into a table, build an index, and SQLite builds a nice b-tree for you. The tree can fit into RAM even if the data doesn't (it's a bit like what #alienhard proposed), and even if it doesn't, the amount of I/O needed is dramatically lower. Of course, you need to create a disk-based SQLite database. I doubt that memory-based SQLite will beat Variant 1 significantly.
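A minimal sketch of that approach (the table and column names are my own; one path per line, as in the question):
import sqlite3

conn = sqlite3.connect('paths.db')
conn.execute('CREATE TABLE IF NOT EXISTS paths (path TEXT PRIMARY KEY)')
with open('input.txt') as f:
    conn.executemany('INSERT OR IGNORE INTO paths VALUES (?)',
                     ((line.rstrip('\n'),) for line in f))
conn.commit()

def found(search_str):
    # the b-tree behind the PRIMARY KEY turns this into a few disk seeks
    return conn.execute('SELECT 1 FROM paths WHERE path = ?',
                        (search_str,)).fetchone() is not None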
Custom hash table search with externalized strings
To get fast access time and a lower memory consumption you could do the following:
for each line compute a string hash and add it to a hash table, e.g., index[hash] = position (do not store the string). If there is a collision, store all file positions for that key in a list.
to look up a string, compute its hash and look it up in the table. If the key is found, read the string at position from the file to verify you really have a match. If there are multiple positions check each one until you find a match or none.
Edit 1: replaced line_number by position (as pointed out by a commenter, one obviously needs the actual position and not line numbers)
Edit 2: provide code for an implementation with a custom hash table, which shows that this approach is more memory efficient than the other approaches mentioned:
from collections import namedtuple
Node = namedtuple('Node', ['pos', 'next'])

def build_table(f, size):
    table = [None] * size
    while True:
        pos = f.tell()
        line = f.readline()
        if not line: break
        i = hash(line) % size
        if table[i] is None:
            table[i] = pos
        else:
            table[i] = Node(pos, table[i])
    return table

def search(string, table, f):
    i = hash(string) % len(table)
    entry = table[i]
    while entry is not None:
        pos = entry.pos if isinstance(entry, Node) else entry
        f.seek(pos)
        if f.readline() == string:
            return True
        entry = entry.next if isinstance(entry, Node) else None
    return False

SIZE = 2**24
with open('data.txt', 'r') as f:
    table = build_table(f, SIZE)
    print search('Some test string\n', table, f)
The hash of a line is only used to index into the table (if we used a normal dict, the hashes would also be stored as keys). The file position of the line is stored at the given index. Collisions are resolved with chaining, i.e., we create a linked list. However, the first entry is never wrapped in a node (this optimization makes the code a bit more complicated but it saves quite some space).
For a file with 6 million lines I chose a hash table size of 2^24. With my test data I got 933132 collisions. (A hash table of half the size was comparable in memory consumption, but resulted in more collisions. Since more collisions means more file access for searches, I would rather use a large table.)
Hash table: 128MB (sys.getsizeof([None]*(2**24)))
Nodes: 64MB (sys.getsizeof(Node(None, None)) * 933132)
Pos ints: 138MB (6000000 * 24)
-----------------
TOTAL: 330MB (real memory usage of python process was ~350MB)
You could also try
with open('input.txt') as f:
# search_str is matched against each line in turn; returns on the first match:
print search_str in f
with search_str ending with the proper newline sequence ('\n' or '\r\n'). This should use little memory, as the file is read progressively. It should also be quite fast, since only part of the file is read.
I would guess many of the paths start out the same on DMOZ.
You should use a trie data structure and store the individual characters on nodes.
Tries have O(m) lookup time (where m is the key length) and also save a lot of space when storing large dictionaries or tree-like data.
You could also store path parts on nodes to reduce the node count - this is called a Patricia trie. But that makes the lookup slower, roughly by the cost of comparing average-length strings. See the SO question Trie (Prefix Tree) in Python for more info about implementations.
There are a couple of trie implementations on Python Package Index, but they are not very good. I have written one in Ruby and in Common Lisp, which is especially well suited for this task – if you ask nicely, I could maybe publish it as open source... :-)
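To illustrate the structure only, here is a naive dict-of-dicts trie sketch (this simple form will not realize the memory savings described above; that takes a compact, ideally C-backed, implementation):
_END = object()  # sentinel marking the end of a stored path

def trie_insert(root, s):
    node = root
    for ch in s:
        node = node.setdefault(ch, {})
    node[_END] = True

def trie_contains(root, s):
    node = root
    for ch in s:
        node = node.get(ch)
        if node is None:
            return False
    return _END in node

root = {}
trie_insert(root, 'Top/Arts/Music')
print(trie_contains(root, 'Top/Arts/Music'))  # True
print(trie_contains(root, 'Top/Arts'))        # False (only a prefix)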
What about a text indexing solution?
I would use Lucene in the Java world, but there is a Python engine called Whoosh:
https://bitbucket.org/mchaput/whoosh/wiki/Home
Without building an index file, your searching will be too slow, and building one is not such a simple task. So it's better to use already developed software. The best way would be to use the Sphinx search engine.
