I'm trying to learn Python by converting a crude bash script of mine.
It runs every 5 minutes and basically does the following:
Loops line-by-line through a .csv file, grabs first element ($URL),
Fetches $URL with wget and extracts $price from the page,
Adds $price to the end of the same line in the .csv file,
Continues looping for the remaining lines.
I use it to keep track of product prices on eBay and similar websites. It notifies me when a lower price is found and plots graphs of each product's price history.
It's simple enough that I could just replicate the algorithm in Python. However, since I'm trying to learn the language, it seems there are several types of objects (lists, dicts, etc.) that could do the storing much more efficiently. My plan is to use pickle or even a simple DB solution (like dataset) from the beginning, instead of messing around with .csv files and extracting the data via sketchy string manipulations.
One improvement I would also like to make is to store the absolute time of each fetch alongside its price, so I can plot a "true timed" graph instead of assuming each cycle is 5 minutes apart from the previous one (which it never is).
Therefore, my question boils down to this:
Assuming I need to work with the following data structure:
List of Products, each with its respective
URL
And its list of time <-> price pairs
What would be the best strategy in Python for doing it so?
Should I use dictionaries, lists, sets, or maybe even create a custom class for products?
Firstly, if you plan on accessing the data structure that holds the URL -> (time, price) data for specific URLs, use a dictionary, since URLs are unique (the URL will be the key in the dictionary).
Otherwise you can keep a list of (URL, (time, price)) tuples.
Secondly, use a list of (time, price) tuples for each URL; you don't need to sort them, since they will already be sorted by the order in which you insert them.
{} - Dictionary
[] - List
() - tuple
Option 1:
[(URL, [(time, Price)])]
Option 2:
{URL: [(time, Price)]}
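For illustration, a minimal sketch of option 2 (the dictionary form), using time.time() for the absolute fetch time; the URL and prices below are made up:

import time

# {URL: [(timestamp, price), ...]} -- option 2 from above
prices = {}

def record_price(url, price):
    """Append a (timestamp, price) pair for the given URL."""
    prices.setdefault(url, []).append((time.time(), price))

# Hypothetical usage:
record_price("https://www.ebay.com/itm/12345", 19.99)
record_price("https://www.ebay.com/itm/12345", 18.49)

for url, history in prices.items():
    timestamps, price_values = zip(*history)  # two sequences, ready for plotting

A custom Product class holding the URL and its history list would work just as well; this is simply the most direct translation of option 2, and it pickles with no extra code.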
I'm keeping a record of what I'm calling profiles.
Each profile is a tuple of dictionaries, (Dict, Dict), and each one is associated with a unique id (which may include characters like _ or -).
I need to keep them in RAM for a certain time, because I'll need to search and update some of them; only at the end, when I no longer need the whole set of profiles, will I flush them to persistent storage.
Currently, I'm using a dictionary/hash table to keep all of them (around 100K elements, but it could be more), since I'll do many searches, using id:profile as the key:value pair.
The data looks similar to this:
{
    "Aw23_1adGF": ({
        "data":
            {"contacts": [11, 22], "address": "usa"},
        "meta_data":
            {"created_by": "John"}
    }, {"key_3": "yes"}),
    "AK23_1adGF": ({
        "data":
            {"contacts": [33, 44], "address": "mexico"},
        "meta_data":
            {"created_by": "Juan"}
    }, {"key_3": "no"}),
    # ...
}
Once this data structure is built, I don't need to add/delete any more elements. I only build it once, then search it many times. On certain occasions I'll need to update some element in the dictionaries that compose a profile. However, building this data object contributes to the peak RAM usage that I'm trying to diminish.
The problem is that the dictionary seems to use too much RAM.
What are my other options regarding data structures that could keep some of the search efficiency while having a smaller RAM footprint?
I thought of a sorted list, since the ids seem to be orderable (maybe except for characters like _ or -).
What data structures are there that could help me in this predicament?
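For what it's worth, a minimal sketch of the sorted-list idea mentioned above: two parallel lists kept sorted by id and searched with bisect (the ids and profiles below are made up):

import bisect

# Parallel sorted lists: ids[i] corresponds to profiles[i].
ids = ["AK23_1adGF", "Aw23_1adGF"]          # kept sorted
profiles = [({"data": {}}, {"key_3": "no"}),
            ({"data": {}}, {"key_3": "yes"})]

def lookup(profile_id):
    """Binary-search the sorted id list; O(log n) per search."""
    i = bisect.bisect_left(ids, profile_id)
    if i < len(ids) and ids[i] == profile_id:
        return profiles[i]
    raise KeyError(profile_id)

Two flat lists avoid the hash table's per-slot overhead, at the cost of O(log n) lookups instead of O(1); whether that saves enough depends on how much of the memory is in the outer dict versus in the profile dictionaries themselves.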
I have two documents that are mostly the same, but with some small differences I want to ignore. Specifically, I know that one has hex values written as "0xFFFFFFFF" while the other has them just as "FFFFFFFF".
Basically, these two documents are lists of variables, their values, their location in memory, size, etc.
But another problem is that they are not in the same order either.
I tried a few things, one being to just pack them all up into two lists of lists and check whether each entry has a counterpart in the other list, but with the number of variables being almost 100,000, the time it takes to do this is ridiculous (on the order of nearly an hour), so that isn't going to work. I'm not very seasoned in Python, or even the pythonic way of doing things, so I'm sorry if there is a quick and easy way to do this.
I've read a few other similar questions, but they all assume the files are 100% identical, along with other things that aren't true in my case.
Basically, I have two .txts that have series of lines that look like:
***************************************
Variable: Var_name1
Size: 4
Address: 0x00FF00F0 .. 0x00FF00F3
Description: An awesome variable
..
***************************************
I don't care if the Descriptions are different; I just want to make sure that every variable has the same length and is in the same place, address-wise, and if there are any differences, I want to see them. I also want to be sure that every variable in one is present in the other.
And again, the addresses in the first one are written with the hex radix prefix and in the second one without it. And they are in a different order.
--- Output ---
I don't really care about the output's format as long as it is human readable. Ideally, it'd be a .txt document that said something like:
"Var_name1 does not exist in list two"
"Var_name2 has a different size. (Size1, Size2)"
"Var_name4 is located in a different place. (Loc1, Loc2)"
COMPLETE RE-EDIT
[My initial suggestion was to use sets, but further discussion via the comments made me realize that that was nonsense, and that a dictionary was the real solution.]
You want a dictionary, keyed on variable name, where the value is a list, a tuple, a nested dictionary, or even an object containing the size and address. You can add each variable name to the dictionary and update the values as needed.
For comparing the addresses, a regex would do it, but you can probably get by with less overhead by just stripping the leading "0x" (or doing a substring check with the in operator) before comparing.
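For illustration, a rough sketch of that approach, parsing each file into a dict keyed on variable name and then diffing the two dicts; the filenames are placeholders and the field parsing is based only on the sample block above:

def parse(path):
    """Return {var_name: (size, normalized_address_range)} for one file."""
    entries = {}
    name = size = address = None
    with open(path) as f:
        for line in f:
            if line.startswith("*"):        # block separator: flush the current entry
                if name is not None:
                    entries[name] = (size, address)
                name = size = address = None
                continue
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if key == "Variable":
                name = value
            elif key == "Size":
                size = int(value)
            elif key == "Address":
                # int(x, 16) accepts both "0x00FF00F0" and "00FF00F0"
                address = tuple(int(part, 16) for part in value.split(".."))
    if name is not None:
        entries[name] = (size, address)
    return entries

a, b = parse("list_one.txt"), parse("list_two.txt")
for var in sorted(set(a) | set(b)):
    if var not in b:
        print("%s does not exist in list two" % var)
    elif var not in a:
        print("%s does not exist in list one" % var)
    elif a[var] != b[var]:
        print("%s differs: %s vs %s" % (var, a[var], b[var]))

Because the lookups are by key, the order of the two files no longer matters, and the whole comparison is a single pass over each file plus dictionary lookups.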
I have this somewhat big data structure that stores pairs of data. The individual data is tiny and readily hashable, and there are something like a few hundred thousand data points in there.
At first, this was a simple dict that was accessed only by key. Later on, however, I discovered that I also needed to access it by value, that is, get the key for a certain value. Since this is done somewhat less often (~1/10 as often) than access by key, I naïvely implemented it by simply iterating over all the dict's items(), which proved a bit sluggish at a few hundred thousand calls per second; it is about 500 times slower than access by key.
So my next idea was to just save the reverse dict, too. This seems to be a rather inelegant solution, however, so I turn to you guys for help.
Do you know any data structure in Python that stores data pairs that can be accessed by either data point of the pair?
You could try bidict.
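If I remember its API correctly, usage looks roughly like this (the keys and values below are made up):

from bidict import bidict  # pip install bidict

pairs = bidict({"spam": 1, "eggs": 2})

print(pairs["spam"])      # 1      -- forward lookup by key
print(pairs.inverse[2])   # "eggs" -- reverse lookup by value

Internally it keeps a forward and an inverse dict in sync, so it is essentially the "reverse dict" idea with the bookkeeping handled for you.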
I have two groups of files that contain data in CSV format with a common key (Timestamp) - I need to walk through all the records chronologically.
Group A: 'Environmental Data'
Filenames are in format A_0001.csv, A_0002.csv, etc.
Pre-sorted ascending
Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
Contains environmental data in CSV/column format
Very large, several GBs worth of data
Group B: 'Event Data'
Filenames are in format B_0001.csv, B_0002.csv
Pre-sorted ascending
Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
Contains event based data in CSV/column format
Relatively small compared to Group A files, < 100 MB
What is the best approach?
Pre-merge: Use one of the various recipes out there to merge the files into a single sorted output and then read it for processing
Real-time merge: Implement code to 'merge' the files in real-time
I will be running lots of iterations of the post-processing side of things. Any thoughts or suggestions? I am using Python.
I'm thinking importing it into a DB (MySQL, SQLite, etc.) will give better performance than merging it in a script. The DB typically has optimized routines for loading CSV, and the join will probably be as fast as or much faster than merging two dicts (one being very large) in Python.
"YYYY-MM-DD HH:MM:SS" can be sorted with a simple ascii compare.
How about reusing external merge logic? If the first field is the key then:
import os  # shell out to the external sort; -m merges files that are already sorted

for entry in os.popen("sort -m -t, -k1,1 file1 file2"):
    process(entry)
This is similar to a relational join. Since your timestamps don't have to match, it's called a non-equijoin.
Sort-Merge is one of several popular algorithms. For non-equijoins, it works well. I think this would be what you're calling "pre-merge". I don't know what you mean by "merge in real time", but I suspect it's still a simple sort-merge, which is a fine technique, heavily used by real databases.
Nested Loops can also work. In this case, you read the smaller table in the outer loop. In the inner loop you find all of the "matching" rows from the larger table. This is effectively a sort-merge, but with an assumption that there will be multiple rows from the big table that will match the small table.
This, BTW, will allow you to more properly assign meaning to the relationship between Event Data and Environmental Data. Rather than reading the result of a massive sort merge and trying to determine which kind of record you've got, the nested loops handle that well.
Also, you can do "lookups" into the smaller table while reading the larger table.
This is hard when you're doing non-equal comparisons because you don't have a proper key to do a simple retrieval from a simple dict. However, you can easily extend dict (override __contains__ and __getitem__) to do range comparisons on a key instead of simple equality tests.
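A minimal sketch of that kind of dict subclass, assuming each key is a (start, end) timestamp pair and a probe matches when it falls inside the half-open range; the window and event values below are made up:

class RangeDict(dict):
    """Keys are (start, end) tuples; a lookup matches any key whose range contains the probe."""

    def _find(self, probe):
        # Linear scan; fine for the small table in the nested-loops case.
        for (start, end), value in self.items():
            if start <= probe < end:
                return value
        raise KeyError(probe)

    def __contains__(self, probe):
        try:
            self._find(probe)
            return True
        except KeyError:
            return False

    def __getitem__(self, probe):
        return self._find(probe)

# Hypothetical usage: map event time windows to event records.
windows = RangeDict({("2010-01-01 00:00:00", "2010-01-01 00:05:00"): "event A"})
print(windows["2010-01-01 00:02:30"])   # "event A"

The string comparison works because "YYYY-MM-DD HH:MM:SS" timestamps sort lexicographically, as noted above; for a large table you would switch the linear scan to a bisect over sorted range starts.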
I would suggest pre-merge.
Reading a file takes a lot of time; reading two files takes twice as much. Since your program will be dealing with a large input (lots of files, especially in Group A), I think it would be better to get it over with in one file read, and have all your relevant data in that one file. It would also reduce the number of variables and read statements you will need.
This will improve the runtime of your algorithm, and I think that's a good enough reason in this scenario to decide to use this approach
Hope this helps
You could read from the files in chunks of, say, 10000 records (or whatever number further profiling tells you to be optimal) and merge on the fly. Possibly using a custom class to encapsulate the IO; the actual records could then be accessed through the generator protocol (__iter__ + next).
This would be memory friendly, probably very good in terms of total time to complete the operation and would enable you to produce output incrementally.
A sketch:
class Foo(object):
    def __init__(self, env_filenames=(), event_filenames=()):
        self._cache = []
        # open the files etc.

    def __iter__(self):
        return self

    def next(self):  # __next__ in Python 3
        if not self._cache:
            # take care of reading more records into self._cache
            pass
        # return the first record and pop it from the cache
        return self._cache.pop(0)

    # ... other stuff you need ...
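For comparison, a rough sketch of the same streaming merge built on the standard library's heapq.merge (Python 3.5+ for the key argument), assuming each group's files are already globally sorted by their first, timestamp column; the filenames and process() are placeholders:

import csv
import heapq

def rows(filenames):
    """Yield CSV rows from a group of pre-sorted files, in file order."""
    for name in filenames:
        with open(name) as f:
            for row in csv.reader(f):
                yield row

env = rows(["A_0001.csv", "A_0002.csv"])      # Group A (placeholder names)
events = rows(["B_0001.csv", "B_0002.csv"])   # Group B (placeholder names)

# heapq.merge consumes both generators lazily, so memory use stays small.
for record in heapq.merge(env, events, key=lambda r: r[0]):
    process(record)   # hypothetical per-record handler, as in the sort -m example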
Sorry for the very general title but I'll try to be as specific as possible.
I am working on a text mining application. I have a large number of key-value pairs of the form ((word, corpus) -> occurrence_count) (everything is an integer) which I am storing in multiple Python dictionaries (tuple -> int). These values are spread across multiple files on disk (I pickled them). To make any sense of the data, I need to aggregate these dictionaries. Basically, I need to figure out a way to find all the occurrences of a particular key in all the dictionaries and add them up to get a total count.
If I load more than one dictionary at a time, I run out of memory, which is the reason I had to split them in the first place. When I tried, I ran into performance issues. I am currently trying to store the values in a DB (MySQL), processing multiple dictionaries at a time, since MySQL provides row-level locking, which is both good (since it means I can parallelize this operation) and bad (since it slows down the insert queries).
What are my options here? Is it a good idea to write a partially disk based dictionary so I can process the dicts one at a time? With an LRU replacement strategy? Is there something that I am completely oblivious to?
Thanks!
A disk-based dictionary-like object exists -- see the shelve module. Keys into a shelf must be strings, but you could simply call str on your tuples to obtain equivalent string keys; plus, I read your question as meaning that you want only word as the key, so that's even easier (either str -- or, for vocabularies < 4 GB, struct.pack -- will be fine).
A good relational engine (especially PostgreSQL) would serve you well, but processing one dictionary at a time to aggregate each word's occurrences over all corpora into a shelf object should also be OK (not quite as fast, but simpler to code, since a shelf is so similar to a dict except for the type constraint on keys, and a caveat for mutable values, though as your values are ints that need not concern you).
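A rough sketch of that one-dictionary-at-a-time aggregation into a shelf, assuming the pickled files each hold a {(word, corpus): count} dict; the shelf path and pickle filenames are placeholders:

import pickle
import shelve

totals = shelve.open("word_totals.shelf")         # disk-backed dict; keys must be str

for fn in ["counts_000.pkl", "counts_001.pkl"]:   # placeholder pickle files
    with open(fn, "rb") as f:
        data_dict = pickle.load(f)                # one {(word, corpus): count} dict
    for (word, corpus), count in data_dict.items():
        key = str(word)                           # aggregate each word over all corpora
        totals[key] = totals.get(key, 0) + count

totals.close()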
Something like this, if I understand your question correctly:
from collections import defaultdict
import pickle

result = defaultdict(int)
for fn in filenames:                  # filenames: your pickled dict files
    with open(fn, "rb") as f:         # pickles should be opened in binary mode
        data_dict = pickle.load(f)
    for k, count in data_dict.items():
        word, corpus = k              # k is a (word, corpus) tuple
        result[k] += count
If I understood your question correctly and you have integer ids for the words and corpora, then you can gain some performance by switching from a dict to a list, or even better, a numpy array. This may be annoying!
Basically, you need to replace the tuple with a single integer, which we can call the newid. You want all the newids to correspond to a word,corpus pair, so I would count the words in each corpus, and then have, for each corpus, a starting newid. The newid of (word,corpus) will then be word + start_newid[corpus].
If I misunderstood you and you don't have such ids, then I think this advice might still be useful, but you will have to manipulate your data to get it into the tuple of ints format.
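Assuming you do have such integer ids, a minimal sketch of the newid mapping and the flat count array (the per-corpus vocabulary sizes below are made up):

import numpy as np

words_per_corpus = [50000, 42000, 61000]   # hypothetical vocabulary size of each corpus

# start_newid[c] is the first newid assigned to corpus c.
start_newid = np.concatenate(([0], np.cumsum(words_per_corpus)[:-1]))

counts = np.zeros(sum(words_per_corpus), dtype=np.int64)

def newid(word, corpus):
    """Flatten a (word, corpus) pair of integer ids into a single array index."""
    return word + start_newid[corpus]

# Aggregating one loaded dictionary into the flat array:
# for (word, corpus), count in data_dict.items():
#     counts[newid(word, corpus)] += count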
Another thing you could try is rechunking the data.
Let's say that you can only hold 1.1 of these monsters in memory. Then, you can load one, and create a smaller dict or array that only corresponds to the first 10% of (word,corpus) pairs. You can scan through the loaded dict, and deal with any of the ones that are in the first 10%. When you are done, you can write the result back to disk, and do another pass for the second 10%. This will require 10 passes, but that might be OK for you.
If you chose your previous chunking based on what would fit in memory, then you will have to arbitrarily break your old dicts in half so that you can hold one in memory while also holding the result dict/array.
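A rough sketch of that multi-pass idea; note it partitions keys by hash rather than by "the first 10%" (a swap on my part), and the pass count and filenames are placeholders:

import pickle
from collections import defaultdict

NUM_PASSES = 10                                    # tune so one full dict plus one partial result fits in RAM
filenames = ["counts_000.pkl", "counts_001.pkl"]   # placeholder pickled dicts

for chunk in range(NUM_PASSES):
    partial = defaultdict(int)
    for fn in filenames:
        with open(fn, "rb") as f:
            data_dict = pickle.load(f)             # load one monster at a time
        for key, count in data_dict.items():
            if hash(key) % NUM_PASSES == chunk:    # only this pass's share of keys
                partial[key] += count
    with open("totals_%02d.pkl" % chunk, "wb") as out:
        pickle.dump(dict(partial), out)            # flush this slice of the totals to disk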