Q: Which is quicker for this scenario?
My scenario: my application will store a list of links in either an array or a PostgreSQL db, so it might look like:
1) mysite.com
a) /users/login
b) /users/registration/
c) /contact/
d) /locate/search
e) /priv/admin-login
I will be doing string searches on the URLs listed under 1) to find, for example, any path that contains:
'login'
The list a) through e) could have anywhere from 5 to 100 more entries for a given domain.
The usage: This data structure can change as often as every day, but only once per day. Some key/values will be removed, others will be modified. An individual set looks like:
dict2 = { 'thesite.com': 123, 98.6: 37 };
Each key will represent 1 and only 1 domain.
I've tried searching a bit on this, but cannot seem to find a really good answer to: when should an array be used and when should a db like PostgreSQL be used?
I've always used a db to handle data (MySQL, not PostgreSQL), but I'm trying to do it better from now on, so I wondered whether an array or other data structure would work better when looping over the entries and trying to match a given string.
As always, thank you!
A full SQL database would probably be overkill. If you can fit everything in memory, put it all in a dict and then use the pickle module to serialize it and write it to the disk.
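A minimal sketch of the pickle approach, assuming the data is a dict with one key per domain and a list of paths as the value. The helper names and the temporary file location are made up for illustration:

```python
import os
import pickle
import tempfile

# Hypothetical data: one key per domain, each holding that domain's paths.
links = {
    "mysite.com": [
        "/users/login",
        "/users/registration/",
        "/contact/",
        "/locate/search",
        "/priv/admin-login",
    ],
}

def save_links(data, path):
    """Serialize the whole dict to disk in one write."""
    with open(path, "wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)

def load_links(path):
    """Read the dict back. Only unpickle files you wrote yourself."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

def find_paths(data, domain, needle):
    """Plain substring scan over one domain's paths."""
    return [p for p in data.get(domain, []) if needle in p]

store = os.path.join(tempfile.mkdtemp(), "links.pickle")
save_links(links, store)
restored = load_links(store)
matches = find_paths(restored, "mysite.com", "login")
```

With at most a hundred or so paths per domain, the linear scan in find_paths is unlikely to be the bottleneck; the once-a-day update becomes a single save_links call.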
Another good option would be to use one of the dbm modules (dbm/dbm.ndbm, gdbm or anydbm) to store the data in a disk-bound hash table. It will have O(1) lookup times without the need to connect and form a query like in a bigger database.
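A rough sketch of the dbm route, written against the Python 3 dbm package (the module names above are the older spellings). Storing each domain's paths as a JSON string is my assumption, since dbm values have to be bytes or strings:

```python
import dbm
import json
import os
import tempfile

# Hypothetical location for the on-disk hash table.
db_path = os.path.join(tempfile.mkdtemp(), "links")

# "c" opens the database for read/write and creates it if needed.
with dbm.open(db_path, "c") as db:
    paths = ["/users/login", "/contact/", "/locate/search"]
    db[b"mysite.com"] = json.dumps(paths).encode("utf-8")

# Later (for example in another process): look up one domain by key.
with dbm.open(db_path, "r") as db:
    stored = json.loads(db[b"mysite.com"].decode("utf-8"))

login_paths = [p for p in stored if "login" in p]
```

The key lookup is the cheap part; the substring match still happens in Python after the value is decoded.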
edit: If you have multiple values per key and you don't want a full-blown database, SQLite would be a good choice. There is already a built-in module for it, sqlite3 (as mentioned in the comments)
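For the multiple-values-per-key case, something like this sqlite3 sketch would do. The table and column names are invented, and an in-memory database stands in for a file on disk:

```python
import sqlite3

# ":memory:" keeps the example self-contained; pass a file path to persist.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE links ("
    " domain TEXT NOT NULL,"
    " path TEXT NOT NULL,"
    " PRIMARY KEY (domain, path))"
)
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [
        ("mysite.com", "/users/login"),
        ("mysite.com", "/users/registration/"),
        ("mysite.com", "/contact/"),
        ("mysite.com", "/priv/admin-login"),
    ],
)

def paths_containing(conn, domain, needle):
    """Return the paths of one domain that contain `needle` as a substring."""
    cur = conn.execute(
        "SELECT path FROM links"
        " WHERE domain = ? AND instr(path, ?) > 0"
        " ORDER BY path",
        (domain, needle),
    )
    return [row[0] for row in cur]

login_paths = paths_containing(conn, "mysite.com", "login")
```

The daily update then becomes a handful of INSERT, UPDATE and DELETE statements inside one transaction.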
Test it. It's your dataset, your hardware, your available disk and network IO, your usage pattern. There's no one true answer here. We don't even know how many queries you are planning: are we talking about one per minute or thousands per second?
If your data fits nicely in memory and doesn't take a massive amount of time to load the first time, sticking it into a dictionary in memory will probably be faster.
If you're always looking for full words (like in the login case), you will gain some speed too from splitting the url into parts and indexing those separately.
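To illustrate the splitting idea, here is a small sketch of a word index over the paths. The tokenizing rule (split on anything that is not a letter or digit) is an assumption; adjust it to whatever counts as a "word" in your URLs:

```python
import re
from collections import defaultdict

def tokenize(path):
    """Split a path into lowercase words on any non-alphanumeric character."""
    return [t for t in re.split(r"[^a-z0-9]+", path.lower()) if t]

def build_index(paths):
    """Map each word to the set of paths that contain it."""
    index = defaultdict(set)
    for path in paths:
        for token in tokenize(path):
            index[token].add(path)
    return index

paths = [
    "/users/login",
    "/users/registration/",
    "/contact/",
    "/locate/search",
    "/priv/admin-login",
]
index = build_index(paths)

# One dict lookup instead of scanning every path.
login_hits = sorted(index.get("login", ()))
```

This only finds whole words: searching for "log" returns nothing, so it is a trade of flexibility for speed.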
I am building an image mosaic that detects whether the user's selected area is taken or not.
My idea is to store the available_spots in a list, and I would just have to look through the list to check whether a spot is available or not.
The problem is that when I reload the website, available_spots also gets reset to a blank list,
so I want to store this array somewhere that is fast to read and write to.
I am currently thinking about a text file to store this, but that might take forever to read since the array length is over 1.4 million. Are there any other solutions that might be better?
You can't store the data in a file for a few reasons: (1) GAE standard won't let you, (2) the data is lost when your server is restarted, and (3) different instances will have different data.
Of course you can and should store the data in a database of your choice. Firestore is likely a better and cheaper option than SQL. It should be fast enough for you and you can implement caching if needed.
You might be able to store the data in a single Firestore entity and consider using compression if you are getting close to the max entity size.
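As a rough sketch of the single-entity idea: if each spot can be numbered from 0 to about 1.4 million (my assumption, the question doesn't say how spots are identified), one bit per spot is only about 175 KB before compression. The helper names are invented, and writing the resulting bytes to Firestore is left out:

```python
import zlib

TOTAL_SPOTS = 1_400_000  # size taken from the question

def new_bitmap(total=TOTAL_SPOTS):
    """One bit per spot, all free (0) to start with."""
    return bytearray((total + 7) // 8)

def mark_taken(bitmap, spot):
    bitmap[spot // 8] |= 1 << (spot % 8)

def is_taken(bitmap, spot):
    return bool(bitmap[spot // 8] & (1 << (spot % 8)))

def pack(bitmap):
    """Compress the bitmap into a bytes blob suitable for one stored field."""
    return zlib.compress(bytes(bitmap))

def unpack(blob):
    return bytearray(zlib.decompress(blob))

bitmap = new_bitmap()
for spot in (0, 42, 1_399_999):
    mark_taken(bitmap, spot)

blob = pack(bitmap)
restored = unpack(blob)
```

A mostly empty bitmap compresses to very little; how well it compresses once many spots are taken depends on the pattern, so measure before relying on it.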
If you want to store it in a database, you can use the "sqlite3" module.
It is a simple database that is stored in a file, so you don't have to install a database program. It is great for small projects.
If you want to do more complex stuff with databases, you can use "sqlalchemy".
I have a City model and fixture data with a list of cities, and I am currently cleaning up the names for URLs in the view and template before loading them. So I do the following in a template to get a URL like this: http://newcity.domain.com.
<a href="http://{{ city.name|lower|cut:" " }}.{{ SITE_URL}}">
The actual city.name is "New City"
Would it be better if I stored already cleaned data (newcity) in a new column (short_name) on MySQL db and just use city.short_name on templates and views?
This seems very opinion-oriented. Is it faster? The only way to know for sure is to measure. Is it faster to a degree that you care about? Probably not. In any event, it's better not to make schema design decisions based on performance unless you've observed measurably bad performance.
All other things being equal, it is generally best to store the data in different columns. It's easier to join it in controller or template code than it is to separate it out into its pieces.
Storing the short name in a MySQL database requires I/O. I/O is slow compared to work done in memory; for such an easy transformation of data, it should be faster to keep it as it is and avoid I/O to a database.
If you really want to know the difference, use timeit (https://docs.python.org/2/library/timeit.html); accessing a database is probably much slower.
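For instance, timing just the string cleanup could look like the sketch below. The short_name function is my stand-in for the lower and cut:" " template filters; timing the database read is not shown, since that depends on your setup:

```python
import timeit

def short_name(name):
    """Same result as the template filters: lowercase, spaces removed."""
    return name.lower().replace(" ", "")

RUNS = 100_000
seconds = timeit.timeit(lambda: short_name("New City"), number=RUNS)

# Average cost of one call, in microseconds.
per_call_us = seconds / RUNS * 1e6
print("short_name: %.3f microseconds per call" % per_call_us)
```

Whatever number this prints on your machine, compare it against a timed query for the stored column before deciding.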
It really depends. If you have a fixed number of cities in a list, just hardcode them (unless you have so many cities that they actually put some stress on your server's resources, but I don't think that's the case here). Otherwise, you must use some type of persistent store for the cities, and a database will come in handy.
Apologies for the longish description.
I want to run a transform on every doc in a large-ish Mongodb collection with 10 million records approx 10G. Specifically I want to apply a geoip transform to the ip field in every doc and either append the result record to that doc or just create a whole other record linked to this one by say id (the linking is not critical, I can just create a whole separate record). Then I want to count and group by say city - (I do know how to do the last part).
The major reason I believe I can't use map-reduce is that I can't call out to the geoip library in my map function (or at least that's the constraint I believe exists).
So the central question is: how do I run through each record in the collection and apply the transform in the most efficient way?
Batching via Limit/skip is out of question as it does a "table scan" and it is going to get progressively slower.
Any suggestions?
Python or JS preferred, just because I have these geoip libs, but code examples in other languages are welcome.
Since you have to go over "each record", you'll do one full table scan anyway, so a simple cursor (find()), maybe fetching only a few fields (_id, ip), should do it. The Python driver does the batching under the hood, so you can hint at a better batch size (batch_size) if the default is not good enough.
If you add a new field and it doesn't fit the previously allocated space, mongo will have to move it to another place, so you might be better off creating a new document.
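Roughly, the cursor approach could look like this sketch. It uses PyMongo's find() with a projection, Cursor.batch_size() and insert_many(); the function name, the shape of the new documents and the lookup callable (your geoip call) are assumptions for illustration:

```python
def geoip_transform(source, target, lookup, batch_size=1000):
    """Stream (_id, ip) pairs from `source` and write results to `target`.

    `source` and `target` are PyMongo collections; `lookup` is whatever
    callable wraps your geoip library (hypothetical here).
    """
    # Projection: fetch only the ip field (_id comes along by default).
    cursor = source.find({}, {"ip": 1}).batch_size(batch_size)
    pending = []
    written = 0
    for doc in cursor:
        pending.append({
            "source_id": doc["_id"],  # link back to the original document
            "ip": doc["ip"],
            "geo": lookup(doc["ip"]),
        })
        # Flush in bulk so each insert is one round trip, not one per doc.
        if len(pending) >= batch_size:
            target.insert_many(pending, ordered=False)
            written += len(pending)
            pending = []
    if pending:
        target.insert_many(pending, ordered=False)
        written += len(pending)
    return written

# Usage (collection names are made up):
#   geoip_transform(db.events, db.events_geo, my_geoip_lookup)
```

Writing to a separate collection sidesteps the document-move cost mentioned above, and the count/group-by step can then run on the new collection.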
Actually I am also attempting another approach in parallel (as plan B) which is to use mongoexport. I use it with --csv to dump a large csv file with just the (id, ip) fields. Then the plan is to use a python script to do a geoip lookup and then post back to mongo as a new doc on which map-reduce can now be run for count etc. Not sure if this is faster or the cursor is. We'll see.
I have a Django app that uses django-piston to send out XML feeds to internal clients. Generally, these work pretty well but we have some XML feeds that currently run over 15 minutes long. This causes timeouts, and the feeds become unreliable.
I'm trying to ponder ways that I can improve this setup. If it requires some re-structuring of the data, that could be possible too.
Here is how the data collection currently looks:
from django.db import models

class Data(models.Model):
    pass  # fields

class MetadataItem(models.Model):
    data = models.ForeignKey(Data)

# handlers.py
data = Data.objects.filter(**kwargs)
for d in data:
    # One query per Data object is issued here
    for metaitem in d.metadataitem_set.all():
        # There are usually between 55 and 95 entries in this loop
        label = metaitem.get_label()  # does some formatting here
        data_metadata[label] = metaitem.body
Obviously, the core of the program is doing much more, but I'm just pointing out where the problem lies. When we have a data list of 300 it just becomes unreliable and times out.
What I've tried:
Getting a collection of all the data ids, then doing a single large query to get all the MetadataItems, and finally filtering those in my loop. This was meant to save some queries, and it did reduce them.
Using .values() to reduce model instance overhead, which did speed it up but not by much.
One simpler solution I'm considering is to write to a cache in steps. To avoid the timeout, I would write the first 50 data sets, save to cache, adjust some counter, write the next 50, etc. I still need to ponder this.
Hoping someone can help point me in the right direction with this.
The problem in the piece of code you posted is that Django doesn't include objects that are connected through a reverse relationship automatically, so you have to make a query for each object. There's a nice way around this, as Daniel Roseman points out in his blog!
If this doesn't solve your problem well, you could also have a look at trying to get everything in one raw sql query...
You could maybe further reduce the query count by first getting all Data ids and then using select_related to get the data and its metadata in a single big query. This would greatly reduce the number of queries, but the size of the queries might be impractical/too big. Something like:
data_ids = Data.objects.filter(**kwargs).values_list('id', flat=True)
for i in data_ids:
    # select_related() is a QuerySet method, so call it before get()
    data = Data.objects.select_related().get(pk=i)
    # Caveat: select_related() only follows forward foreign keys, so the
    # reverse metadataitem_set below still hits the database; on Django 1.4+
    # prefetch_related('metadataitem_set') on the filtered queryset avoids that
    for metaitem in data.metadataitem_set.all():
        # ...
However, I would suggest, if possible, to precompute the feeds from somewhere outside the webserver. Maybe you could store the result in memcache if it's smaller than 1 MB. Or you could be one of the cool new kids on the block and store the result in a "NoSQL" database like redis. Or you could just write it to a file on disk.
If you can change the structure of the data, maybe you can also change the datastore?
The "NoSQL" databases which allow some structure, like CouchDB or MongoDB could actually be useful here.
Let's say for every Data item you have a document. The document would have your normal fields. You would also add a 'metadata' field which is a list of metadata. What about the following datastructure:
{
    'id': 'someid',
    'field': 'value',
    'metadata': [
        { 'key': 'value' },
        { 'key': 'value' }
    ]
}
You would then be able to easily get to a data record and get all its metadata. For searching, add indexes to the fields in the 'data' document.
I've worked on a system in Erlang/OTP that used Mnesia which is basically a key-value database with some indexing and helpers. We used nested records heavily to great success.
I added this as a separate answer as it's totally different than the other.
Another idea is to use Celery (www.celeryproject.com), which is a task management system for Python and Django. You can use it to perform any long-running tasks asynchronously without holding up your main app server.
I am building an application where I am trying to allow users to submit a list of company and date pairs and find out whether or not there was a news event on that date. The news events are stored in a dictionary with a company identifier and a date as a key.
newsDict[('identifier','MM/DD/YYYY')]=[list of news events for that date]
The dictionary turned out to be much larger than I thought, too big even to build in memory, so I broke it down into three pieces, each limited to a particular range of company identifiers.
My plan was to take the user-submitted list and use a dictionary to group its company identifiers by the newsDict piece in which each company's events would be found, and then load the newsDicts one after another to get the values.
Well, now I am wondering whether it would be better to keep the news events in a list, with each item being a sublist of a tuple and another list:
[('identifier','MM/DD/YYYY'),[list of news events for that date]]
My thought then is that I would have a dictionary holding the range of the list for each company identifier:
companyDict['identifier']=(begofRangeinListforComp,endofRangeinListforComp)
I would use the user input to look up the ranges I needed and construct a list of the identifiers and ranges sorted by the ranges. Then I would just read the appropriate section of the list to get the data and construct the output.
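To make the idea concrete, here is a small sketch of the list-plus-range-dictionary layout, assuming the list is sorted so that all items for one company are adjacent. The sample data and function names are invented:

```python
# Flat list sorted by identifier; each item is ((identifier, date), events).
news = [
    (("AAA", "05/20/2009"), ["event 1"]),
    (("AAA", "05/21/2009"), ["event 2", "event 3"]),
    (("BBB", "05/21/2009"), ["event 4"]),
]

def build_ranges(items):
    """Map identifier -> (first index, last index) in the sorted list."""
    ranges = {}
    for pos, ((identifier, _date), _events) in enumerate(items):
        first, _ = ranges.get(identifier, (pos, pos))
        ranges[identifier] = (first, pos)
    return ranges

def events_for(items, ranges, identifier, date):
    """Scan only that company's slice of the list for the given date."""
    if identifier not in ranges:
        return []
    first, last = ranges[identifier]
    for key, events in items[first:last + 1]:
        if key == (identifier, date):
            return events
    return []

company_ranges = build_ranges(news)
hit = events_for(news, company_ranges, "AAA", "05/21/2009")
miss = events_for(news, company_ranges, "CCC", "05/21/2009")
```

The range dictionary has one entry per company rather than one per (company, date) pair, which is where the memory saving over the full dictionary would come from.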
The biggest reason I see for this is that even with the dictionary broken into thirds each section takes about two minutes to load on my machine and the dictionary ends up taking about 600 to 750 mb of ram.
I was surprised to note that a list of eight million lines took only about 15 seconds to load and used about 1/3 of the memory of the dictionary that had 1/3 the entries.
Further, since I can discard the lines in the list as I work through the list I will be freeing memory as I work down the user list.
I am surprised, as I thought a dictionary would be the most efficient way to do this, but my poking at it suggests that the dictionary requires significantly more memory than a list. My reading of other posts on SO and elsewhere suggests that any other structure is going to require pointer allocations that are more expensive than list pointers. Am I missing something here, and is there a better way to do this?
After reading Alberto's answer and response to my comment, I spent some time trying to figure out how to write the function if I were to use a db. Now I might be hobbled here because I don't know much about db programming, but I think the code to implement using a db would be much more complicated than:
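outList = []
# Read the file into a list of lines so it can be sliced by position
massiveFile = open('theFile', 'r').readlines()
for identifier in sortedUserList:
    # I get the list and sort it by the key of the dictionary
    begin = theDict[identifier]['beginPosit']
    end = theDict[identifier]['endPosit']
    identifierList = massiveFile[begin:end + 1]
    for item in identifierList:
        # placeholder from the original sketch, not valid Python as written
        if item.startswith(manipulation of the identifier):
            outList.append(item)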
I have to wrap this in a function, but I didn't see anything comparably simple if I converted the list to a db.
Of course, simplicity was not the reason that brought me to this forum. I still don't see that using another structure will cost less memory. I have 30000 company identifiers and approximately 3600 dates. Each item in my list is an object in the parlance of OOD. That is where I am struggling: I spent six hours this morning organizing the data for a dictionary before I gave up. Spending that amount of time to implement a database and then finding that I am using half a gig or more of someone else's memory to load it seems problematic.
With such a large amount of data, you should be using a database. This would be far better than looking at a list, and would be the most appropriate way of storing your data anyway. If you're using Python, it has SQLite built in, I believe.
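For what it's worth, the database version does not have to be complicated. A minimal sketch with the built-in sqlite3 module follows; the table layout, column names and sample rows are assumptions, and an in-memory database stands in for a file on disk:

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE news ("
    " identifier TEXT NOT NULL,"
    " event_date TEXT NOT NULL,"
    " headline TEXT NOT NULL)"
)
# The index is what makes a (company, date) lookup fast without loading
# everything into memory.
conn.execute("CREATE INDEX news_by_company_date ON news (identifier, event_date)")
conn.executemany(
    "INSERT INTO news VALUES (?, ?, ?)",
    [
        ("IBM", "05/21/2009", "headline a"),
        ("IBM", "05/21/2009", "headline b"),
        ("XYZ", "05/21/2009", "headline c"),
    ],
)

def news_for(conn, pairs):
    """Return {(identifier, date): [headlines]} for user-submitted pairs."""
    found = {}
    for identifier, date in pairs:
        rows = conn.execute(
            "SELECT headline FROM news"
            " WHERE identifier = ? AND event_date = ?"
            " ORDER BY rowid",
            (identifier, date),
        ).fetchall()
        found[(identifier, date)] = [row[0] for row in rows]
    return found

result = news_for(conn, [("IBM", "05/21/2009"), ("IBM", "05/22/2009")])
```

Only the rows that match are read, so memory use is tied to the size of the user's request rather than the size of the whole data set.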
The dictionary will take more memory because it is effectively a hash.
You don't have to go so far as using a database, since your lookup requirements are so simple. Just use the file system.
Create a directory structure based on the company name (or ticker), with subdirectories for each date. To find whether data exists and load it up, just form the name of the subdirectory where the data would be, and see if it exists.
E.g., IBM news for May 21 would be in C:\db\IBM\20090521\news.txt, if in fact there were news for that day. You just check if the file exists; no searches.
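A small sketch of that layout. The root directory here is a temporary folder rather than C:\db, and the function names are made up:

```python
import os
import tempfile

def news_path(root, ticker, date_yyyymmdd):
    """Build <root>/<ticker>/<date>/news.txt without touching the disk."""
    return os.path.join(root, ticker, date_yyyymmdd, "news.txt")

def load_news(root, ticker, date_yyyymmdd):
    """Return the news lines for that day, or None when no file exists."""
    path = news_path(root, ticker, date_yyyymmdd)
    if not os.path.isfile(path):
        return None
    with open(path, encoding="utf-8") as fh:
        return fh.read().splitlines()

# Demo data: one day of news for one ticker.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "IBM", "20090521"))
with open(news_path(root, "IBM", "20090521"), "w", encoding="utf-8") as fh:
    fh.write("headline one\nheadline two\n")

found = load_news(root, "IBM", "20090521")
missing = load_news(root, "IBM", "20090522")
```

Nothing is loaded until a specific company and date are asked for, so the memory cost is one small file at a time.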
If you want to try and boost speed from there, come up with a scheme to cache a limited amount of results that are likely to be frequently requested (assuming you're operating a server). For that, you'd use a hash.
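One possible way to get such a bounded cache in Python is functools.lru_cache, which keeps recent results in a hash table and evicts the least recently used ones. The slow_lookup function below is a hypothetical stand-in for the real disk read:

```python
from functools import lru_cache

calls = []  # records every real (uncached) lookup, for demonstration only

def slow_lookup(ticker, date):
    """Hypothetical stand-in for reading the news from disk."""
    calls.append((ticker, date))
    return ("news for %s on %s" % (ticker, date),)

@lru_cache(maxsize=1024)  # keep at most 1024 recent results in memory
def cached_lookup(ticker, date):
    return slow_lookup(ticker, date)

first = cached_lookup("IBM", "20090521")   # goes to slow_lookup
second = cached_lookup("IBM", "20090521")  # served from the cache
```

The maxsize value is arbitrary here; pick it from how many distinct requests you expect to repeat.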