Reparsing a file or unpickling - python

I have a 100 MB file with roughly 10 million lines that I need to parse into a dictionary every time I run my code. This process is incredibly slow, and I am hunting for ways to speed it up. One thought that came to mind is to parse the file once and then use pickle to save the result to disk. I'm not sure this would result in a speedup.
Any suggestions appreciated.
EDIT:
After doing some testing, I am worried that the slowdown happens when I create the dictionary. Pickling does seem significantly faster, though I wouldn't mind doing better.
Lalit

MessagePack has, in my experience, been much faster for dumping/loading data in Python than cPickle, even when using the highest protocol.
However, if you have a dictionary with 10 million entries in it, you might want to check that you're not hitting the upper limit of your computer's memory. The process will run much more slowly if you run out of memory and have to use swap.
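For example, here is a minimal sketch of dumping and loading the same dictionary with both libraries, assuming the third-party msgpack package is installed; the sample dictionary is just a stand-in for your real parsed data:
import pickle
import msgpack  # third-party package; install with "pip install msgpack"

data = {str(i): [i, i * 2] for i in range(1000000)}  # placeholder data

# pickle, highest protocol
with open("data.pickle", "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
with open("data.pickle", "rb") as f:
    restored = pickle.load(f)

# MessagePack equivalent
with open("data.msgpack", "wb") as f:
    f.write(msgpack.packb(data, use_bin_type=True))   # use_bin_type keeps str/bytes distinct
with open("data.msgpack", "rb") as f:
    restored = msgpack.unpackb(f.read(), raw=False)   # raw=False decodes keys back to str
Timing both round trips on your own data (e.g. with time.perf_counter) is the only way to know which wins for your particular dictionary.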

Depending on how you use the data, you could
divide it into many smaller files and load only what's needed
create an index into the file and lazy-load (see the sketch below)
store it to a database and then query the database
Can you give us a better idea of what your data looks like (its structure)?
How are you using the data? Do you actually use every row on every execution? If you only use a subset on each run, could the data be pre-sorted?
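Regarding the second option (an index plus lazy loading), here is a minimal sketch under the assumption that your data is a whitespace-separated text file whose first field is the key you look things up by (the file name data.txt is made up). Only byte offsets are kept in memory; records are read from disk on demand:
index = {}
with open("data.txt", "rb") as f:
    while True:
        offset = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.split(None, 1)[0].decode("utf-8")
        index[key] = offset        # remember where the record starts, not the record itself

def lookup(key, path="data.txt"):
    # Read and parse a single record only when it is actually needed.
    with open(path, "rb") as f:
        f.seek(index[key])
        return f.readline().decode("utf-8").split()
Whether this beats pickling the whole dictionary depends entirely on how many of the 10 million records each run actually touches.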

Related

discord.py: too big variable?

I'm very new to Python and programming in general, and I'm looking to make a Discord bot that has a lot of hand-written chat lines to randomly pick from and send back to the user. Making a really huge variable full of a list of sentences seems like a bad idea. Is there a way that I can store the chat lines in a separate file and have the bot pick from the lines in that file? Or is there anything else that would be better, and how would I do it?
I'll interpret this question as "how large a variable is too large", to which the answer is pretty simple: a variable is too large when it becomes a problem. So, how can a variable become a problem? The big one is that the machine could run out of memory, and an OOM (out-of-memory) killer or similar will stop your program. How would you know if your variable is causing these issues? Pretty simple: your program crashes.
If the variable is static (with a size fully known at compile time or prior to interpretation), you can calculate how much RAM it will take. (This is a bit finicky with Python, so it might be easier to load it up at runtime and figure it out with a profiler.) If it's more than ~500 megabytes, you should be concerned. Over a gigabyte, and you'll probably want to reconsider your approach[^0]. So, what do you do then?
As suggested by #FishballNooodles, you can store your data line by line in a file and read the lines into an array. Unfortunately, the code they've provided still reads the entire thing into memory. If you want to avoid that, you've got a few options, non-exhaustively listed below.
Consume a random number of newlines from the file when you need a line of text. You would look at one character at a time, compare it to \n, and read the line if you've encountered the requested number of newlines. This is O(n) worst case with respect to the number of lines in the file.
Rather than storing the text you need at a given index, store its location in the file. Then you can seek to that location (which is probably O(1)) and read the text. This requires an O(n) construction cost at the start of the program, but works much better at runtime. A sketch of this approach follows the footnote below.
Use an actual database. It's usually better not to reinvent the wheel. If you're just storing plain text, this is probably overkill, but don't discount it.
[^0]: These numbers are actually just random. If you control the server environment on which you run the code, then you can probably come up with some more precise signposts.
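Here is a minimal sketch of the second option above (store byte offsets, then seek), assuming the responses live one per line in a plain-text file; the name responses.txt is made up:
import random

# Build the offset table once at startup: O(n), but only integers are kept in memory.
offsets = []
with open("responses.txt", "rb") as f:
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)

def random_line():
    # Seek straight to one random line instead of holding every line in memory.
    with open("responses.txt", "rb") as f:
        f.seek(random.choice(offsets))
        return f.readline().decode("utf-8").rstrip("\n")
The bot's command handler would then just call random_line() whenever it needs a response.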
You can store your data in a file, say one named response.txt,
and retrieve it in the Discord bot file with open("response.txt").readlines().

Python script doesn't terminate for a long time after it's finished

I have a weird problem.
I'm loading a huge file (3.5 GB), making a dictionary out of it, and doing some processing.
After everything is finished, my script doesn't terminate immediately; it only exits after some time.
I think it might be due to memory being freed. What other reasons could there be? I'd appreciate any opinions. Also, how can I make my script run faster?
Here's the corresponding code:
import codecs
from collections import defaultdict

class file_processor:
    def __init__(self):
        self.huge_file_dict = self.upload_huge_file()

    def upload_huge_file(self):
        d = defaultdict(list)
        f = codecs.open('huge_file', 'r', encoding='utf-8').readlines()
        for line in f:
            l = line.strip()
            x, y, z, rb, t = l.split()
            d[rb].append((x, y, z, t))
        return d

    def do_some_processing(self, word):
        if word in self.huge_file_dict:
            pass  # do something with self.huge_file_dict[word]
My guess is that your horrible slowdown, which doesn't recover until after your program is finished, is caused by using more memory than you actually have, which causes your OS to start swapping VM pages in and out to disk. Once you get enough swapping happening, you end up in "swap hell", where a large percentage of your memory accesses involve a disk read and even a disk write, which takes orders of magnitude more time, and your system won't recover until a few seconds after you finally free up all that memory.
The obvious solution is to not use so much memory.
tzaman's answer, avoiding readlines(), will eliminate some of that memory. A giant list of all the lines in a 3.5GB file has to take at least 3.5GB on Python 3.4 or 2.7 (but realistically at least 20% more than that) and maybe 2x or 4x on 3.0-3.3.
But the dict is going to be even bigger than the list, and you need that, right?
Well, no, you probably don't. Keeping the dict on-disk and fetching the values as-needed may sound slow, but it may still be a lot faster than keeping it in virtual memory, if that virtual memory has to keep swapping back and forth to disk.
You may want to consider using a simple dbm, or a more powerful key-value database (google "NoSQL key value" for some options), or a sqlite3 database, or even a server-based SQL database like MySQL.
Alternatively, if you can keep everything in memory, but in a more compact form, that's the best of both worlds.
I notice that in your example code, the only thing you're doing with the dict is checking word in self.huge_file_dict. If that's true, then you can use a set instead of a dict and not keep all those values around in memory. That should cut your memory use by about 80%.
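As a rough sketch (assuming the same file layout as your upload_huge_file), the change is just this; it also iterates over the file instead of calling readlines(), per tzaman's answer below:
def upload_huge_file(self):
    # Keep only the keys: membership tests stay O(1), but the values never occupy memory.
    s = set()
    with codecs.open('huge_file', 'r', encoding='utf-8') as f:
        for line in f:
            x, y, z, rb, t = line.strip().split()
            s.add(rb)
    return s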
If you frequently need the keys, but occasionally need the values, you might want to consider a dict that just maps the keys to indices into something you can read off disk as needed (e.g., a file with fixed-length strings, which you can then mmap and slice).
Or you could stick the values in a Pandas DataFrame, which will be a little more compact than native Python storage (maybe enough to make the difference), and use a dict mapping keys to indices.
Finally, you may be able to reduce the amount of swapping without actually reducing the amount of memory. Bisecting a giant sorted list, instead of accessing a giant dict, may—depending on the pattern of your words—give much better memory locality.
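As a hedged illustration of that last idea, membership tests against a sorted list with the standard-library bisect module look like this (huge_file_keys stands in for whatever currently holds your keys):
import bisect

sorted_words = sorted(huge_file_keys)   # built once, up front

def contains(word):
    # Binary search: O(log n) lookups, with much better locality than a huge hash table.
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word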
Don't call .readlines() -- that loads the entire file into memory beforehand. You can just iterate over f directly and it'll work fine.
with codecs.open('huge_file', 'r', encoding='utf-8') as f:
    for line in f:
        ...

Optimization tips for reading/parsing large number of JSON.gz files

I have an interesting problem at hand. As someone who's a beginner when it comes to working with data at even a moderate scale, I'd love some tips from the veterans here.
I have around 6,000 JSON.gz files totalling around 5 GB compressed and 20 GB uncompressed.
I'm opening each file and reading it line by line using the gzip module; then I load each line with json.loads() and parse the complicated JSON structure. I then insert the lines from each file into a PyTables table all at once before iterating to the next file.
All this is taking me around 3 hours. Bulk inserting into the PyTables table didn't really help the speed at all. Much of the time is spent getting values from the parsed JSON lines, since they have a truly horrible structure. Some are straightforward, like 'attrname':attrvalue, but some are complicated and time-consuming structures like:
'attrarray':[{'name':abc, 'value':12},{'value':12},{'name':xyz, 'value':12}...]
...where I need to pick up the value of all those objects in the attr array which have a particular name, and ignore those that don't. So I need to iterate through the list and inspect each JSON object inside it. (I'd be glad if you could point out any quicker, cleverer way, if one exists.)
So I suppose the actual parsing part doesn't have much scope for speedup. Where I think there might be scope for a speedup is the actual reading of the files.
So I ran a few tests (I don't have the numbers with me right now), and even after removing the parsing part of my program, simply going through the files line by line was taking a considerable amount of time.
So I ask: Is there any part of this problem that you think I might be doing suboptimally?
import gzip
import json

for filename in filenamelist:
    f = gzip.open(filename)
    toInsert = []
    for line in f:
        parsedline = json.loads(line)
        attr1 = parsedline['attr1']
        attr2 = parsedline['attr2']
        # ...
        attr10 = parsedline['attr10']
        arr = parsedline['attrarray']
        for el in arr:
            try:
                if el['name'] == 'abc':
                    attrABC = el['value']
                elif el['name'] == 'xyz':
                    attrXYZ = el['value']
                # ...
            except KeyError:
                pass
        toInsert.append([attr1, attr2, ..., attr10, attrABC, attrXYZ, ...])
    table.append(toInsert)
One clear piece of "low-hanging fruit"
If you're going to be accessing the same compressed files over and over (it's not especially clear from your description whether this is a one-time operation), then you should decompress them once rather than decompressing them on-the-fly each time you read them.
Decompression is a CPU-intensive operation, and Python's gzip module is not that fast compared to C utilities like zcat/gunzip.
Likely the fastest approach is to gunzip all these files, save the results somewhere, and then read from the uncompressed files in your script.
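A minimal sketch of that one-time decompression pass, assuming a Unix-like system where zcat comes from GNU gzip and filenamelist is the same list used in your code:
import subprocess

for filename in filenamelist:
    out_path = filename[:-3]                       # "foo.json.gz" -> "foo.json"
    with open(out_path, "wb") as out:
        # Let the C implementation do the decompression once, up front.
        subprocess.check_call(["zcat", filename], stdout=out)
After that, the main script can simply open(out_path) and iterate over plain text.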
Other issues
The rest of this is not really an answer, but it's too long for a comment. In order to make this faster, you need to think about a few other questions:
What are you trying to do with all this data?
Do you really need to load all of it at once?
If you can segment the data into smaller pieces, then you can reduce the latency of the program if not the overall time required. For example, you might know that you only need a few specific lines from specific files for whatever analysis you're trying to do... great! Only load those specific lines.
If you do need to access the data in arbitrary and unpredictable ways, then you should load it into another system (RDBMS?) which stores it in a format that is more amenable to the kinds of analyses you're doing with it.
If that last point applies, one option is to load each JSON "document" into a PostgreSQL 9.3 database (its JSON support is awesome and fast) and then do your further analyses from there. Hopefully you can extract meaningful keys from the JSON documents as you load them.
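A hedged sketch of that load step, assuming the psycopg2 driver and a table created as CREATE TABLE events (doc json); the database, table, and column names here are all made up:
import gzip
import psycopg2

conn = psycopg2.connect("dbname=mydb")    # connection string is a placeholder
cur = conn.cursor()
for filename in filenamelist:
    with gzip.open(filename) as f:
        for line in f:
            # Each line is already a JSON document; store it as-is and let
            # PostgreSQL's JSON operators pull fields out at query time.
            cur.execute("INSERT INTO events (doc) VALUES (%s)", (line.decode("utf-8"),))
conn.commit()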

Most efficient way to store data on drive

Baseline - I have CSV data with 10,000 entries. I save this as one CSV file and load it all at once.
Alternative - I have CSV data with 10,000 entries. I save this as 10,000 CSV files and load them individually.
Approximately how much more inefficient is the alternative, computationally? I'm not hugely interested in memory concerns. The purpose of the alternative method is that I frequently need to access subsets of the data and don't want to have to read the entire array.
I'm using python.
Edit: I can use other file formats if needed.
Edit1: SQLite wins. Amazingly easy and efficient compared to what I was doing before.
SQLite is an ideal solution for your application.
Simply import your CSV file into an SQLite database table (it will be a single file), then add indexes as necessary.
To access your data, use Python's sqlite3 library; its documentation includes a tutorial on how to use it.
Compared to many other solutions, SQLite will be the fastest way to select partial data sets locally - certainly much, much faster than accessing 10,000 files. There are also good write-ups of why SQLite performs so well for this kind of workload.
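A minimal sketch of the import, assuming a three-column CSV; the file, table, and column names here are all made up:
import csv
import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS entries (key TEXT, value1 TEXT, value2 TEXT)")
with open("data.csv", newline="") as f:
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", csv.reader(f))
conn.execute("CREATE INDEX IF NOT EXISTS idx_key ON entries (key)")
conn.commit()

# Later runs read only the subset they need:
rows = conn.execute("SELECT * FROM entries WHERE key = ?", ("some_key",)).fetchall()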
I would write all the lines to one file. For 10,000 lines it's probably not worthwhile, but you can pad all the lines to the same length - say 1000 bytes.
Then it's easy to seek to the nth line: just multiply n by the line length.
10,000 files are going to be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing it will require a much larger number of seeks than would accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (Slow seek times aren't an issue for SSDs, but even then there will still be the overhead of 10,000 files' worth of metadata for the operating system to deal with.) Also, with a single file, the OS can speed things up for you by doing read-ahead buffering (as it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon). With multiple files, the OS can't do that.
My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file, and (if possible) pad all of the lines of your CSV file with spaces so that they all have the same length. For example, say you make sure, when writing out the CSV file, that every line is exactly 80 bytes long. Then reading the nth line of the file becomes relatively fast and easy:
myFileObject.seek(n*80)
theLine = myFileObject.read(80)
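For completeness, a hedged sketch of the writing side, which pads every record to the same 80-byte length (newline included); rows stands in for whatever produces your CSV records:
RECORD_LEN = 80   # bytes per line, including the trailing newline

with open("data.csv", "wb") as f:
    for row in rows:
        line = ",".join(row)[: RECORD_LEN - 1]                 # truncate if a record is too long
        f.write(line.ljust(RECORD_LEN - 1).encode("ascii") + b"\n")
When reading back by offset, open the file in binary mode so seek() counts bytes rather than text-mode positions.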

Data persistence for python when a lot of lookups but few writes?

I am working on a project that basically monitors a set of remote directories (FTP, networked paths, and others); if a file is considered new and meets our criteria, we download it and process it. However, I am stuck on the best way to keep track of the files we have already downloaded. I don't want to download any duplicate files, so I need to keep track of what is already downloaded.
Originally I was storing it as a tree:
server->directory->file_name
When the service shuts down it writes the tree to a file, and rereads it back when it starts up. However, once there are around 20,000 or so files in the tree, things start to slow down a lot.
Is there a better way to do this?
EDIT
The lookup times start to slow down a lot; my basic implementation is a dict of dicts. Storing things on disk is fine, it's more or less just the lookup time. I know I can optimize the tree and partition it, but that seems excessive for such a small project; I was hoping Python would have something like that built in.
I would create a set of tuples, then pickle it to a file. The tuples would be (server, directory, file_name), or even just (server, full_file_name_including_directory). There's no need for a multiple-level data structure. The tuples will hash into the set and give you O(1) lookups.
You mention "stuff starts to slow down alot," but you don't say if it's reading and writing time, or lookup times that are slowing down. If your lookup times are slowing down, you may be paging. Is your data structure approaching a significant fraction of your physical memory?
One way to get back some memory is to intern() the server names (sys.intern() in Python 3). This way, each server name will be stored only once in memory.
An interesting alternative is to use a Bloom filter. This will let you use far less memory, but will occasionally download a file that you didn't have to. This might be a reasonable trade-off, depending on why you didn't want to download the file twice.
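A minimal sketch of the set-of-tuples approach; the state file name is arbitrary:
import os
import pickle

STATE_FILE = "downloaded.pickle"

# Load the previous state, or start empty on the first run.
if os.path.exists(STATE_FILE):
    with open(STATE_FILE, "rb") as f:
        downloaded = pickle.load(f)
else:
    downloaded = set()

def already_have(server, path):
    return (server, path) in downloaded          # O(1) membership test

def mark_downloaded(server, path):
    downloaded.add((server, path))

def save_state():
    # Called on shutdown (or periodically, to survive crashes).
    with open(STATE_FILE, "wb") as f:
        pickle.dump(downloaded, f, protocol=pickle.HIGHEST_PROTOCOL)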
