StringListProperty limited to 500 char strings (Google App Engine / Python)

It seems that StringListProperty can only contain strings up to 500 chars each, just like StringProperty...
Is there a way to store longer strings than that? I don't need them to be indexed or anything. What I would need is something like a "TextListProperty", where each string in the list can be any length and is not limited to 500 chars.
Can I create a property like that? Or can you experts suggest a different approach? Perhaps I should use a plain list and pickle/unpickle it in a Blob field, or something like that? I'm a bit new to Python and GAE, and I would greatly appreciate some pointers instead of spending days on trial and error... thanks!

Alex already answered long ago, but in case someone else comes along with the same issue:
You'd just make item_type equal to db.Text (as OP mentions in a comment).
Here's a simple example:
from google.appengine.ext import db

class LargeTextList(db.Model):
    large_text_list = db.ListProperty(item_type=db.Text)

# handler methods (e.g. on a webapp.RequestHandler subclass):
def post(self):
    # get the value from a POST request,
    # split it into a list using some delimiter,
    # and add it to the datastore
    L = self.request.get('large_text_list').split()  # your delimiter here
    LTL = [db.Text(i) for i in L]
    new = LargeTextList()
    new.large_text_list = LTL
    new.put()

def get(self):
    # fetch one entity to make sure it's working
    query = LargeTextList.all()
    results = query.fetch(limit=1)
    self.render('index.html', {  # render is the handler's own template helper
        'results': results,
        'title': 'LargeTextList Example',
    })

You can use a generic ListProperty with an item_type as you require (str, or unicode, or whatever).
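A minimal sketch of both approaches, assuming the old Python 2 db API (the model and property names here are made up for illustration):

import pickle
from google.appengine.ext import db

class Snippets(db.Model):
    # db.Text items are not indexed, so they escape the 500-char limit
    texts = db.ListProperty(item_type=db.Text)

class PickledSnippets(db.Model):
    # the pickle/Blob alternative mentioned in the question
    payload = db.BlobProperty()

entity = PickledSnippets(payload=db.Blob(pickle.dumps(['x' * 2000])))
strings = pickle.loads(entity.payload)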


What kind of data structure is this? Python

While studying Python, I have been following Corey Schafer's excellent tutorial on Flask, and he does this (I have extracted and summarized it for obvious reasons):
# missing from the excerpt: the tutorial imports the serializer from itsdangerous
from itsdangerous import TimedJSONWebSignatureSerializer as Serializer
from folder_app import app  # imported this way so the code matches the original

s = Serializer(app.config['SECRET_KEY'], 1800)  # key, seconds
token = s.dumps({'user_id': 1}).decode('utf-8')
s = Serializer(app.config['SECRET_KEY'])
user_id = s.loads(token)['user_id']  # this is where I have the doubt
print(user_id)
print(type(s.loads(token)))
The code works. My doubt is that although, as you can see, s.loads(token) is a dict, I expected to see something like s.loads({token['user_id']}), or s.loads(token['user_id']), or something like that. That is, it is a dict but it does not look like one at the call site. Does this come from some larger concept of what people call "pythonic" (which I have not seen so far), or is it something that only happens in particular cases like this one? Incidentally, at https://itsdangerous.palletsprojects.com/en/1.1.x/jws/ this appears: loads(self, s, salt=None, return_header=False), with the arguments in parentheses. I hope it is clear what my doubt is :)
I know this is not an answer per se, but just to add to my comment: this is an example of how the loads function works on dictionaries with the json module: https://docs.python.org/3/library/json.html#json.loads. It takes a JSON string and returns the dictionary-type object in Python. Your Serializer is doing something similar: it takes the token string and represents it as a dict-like object.
The s.dumps is, I am assuming, similar to json.dumps, which gives you the JSON string representation of a Python dictionary.
import json
my_dict = json.loads('{"user_id": "Mane", "name": "Joe"}')
my_dict['user_id']
So you could just do json.loads('{"user_id": "Mane", "name": "Joe"}')['user_id'] which is just chaining the operations.
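A minimal sketch of the same chaining with the itsdangerous serializer (assuming itsdangerous 1.x, where TimedJSONWebSignatureSerializer still exists):

from itsdangerous import TimedJSONWebSignatureSerializer as Serializer

s = Serializer('secret-key', 1800)
token = s.dumps({'user_id': 1}).decode('utf-8')
# s.loads(token) returns a plain dict, so the lookup chains directly
user_id = s.loads(token)['user_id']
print(user_id)  # 1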

Scraping data from an HTTP & JavaScript site

I currently want to scrape some data from an Amazon page and I'm kind of stuck.
For example, let's take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found by opening the source code and searching for 'variationValues'.
There we can see a sort of dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating the variant from the 'variationValues' dictionary.
For example, in asinToDimensionIndexMap we can see
"B01KWIUH5M":[0,0]
which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the variationValues size_name section) and the color 'Teal' (same idea as before).
I want to scrape both variationValues and asinToDimensionIndexMap, so I can associate the index map numbers with the variationValues ones.
Another person on the site (thanks for the help btw) suggested doing it this way:
import re
import json

# grab the first script block as one string
script = response.xpath('//script/text()').extract_first()
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part: we get everything that's in a 'script' tag as a string and then capture everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading some material about it didn't help much.
Is there a way to get, from that data, two dictionaries or lists with variationValues and asinToDimensionIndexMap (maybe using some regular expressions in the middle to pull data out of a big string)? Or could you explain a little what happens in the JSON part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming where you may be finding things a challenge is when errors come up while accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
E.g. if your 'asinToDimensionIndexMap' object is nested within a smaller box in the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimensionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another", which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
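The regexes below assume script holds all of the page's script text joined into one string (e.g. the list returned by response.xpath('//script/text()').extract()):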
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them to JSON and combine them as you wish.
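A minimal sketch of that pairing step, assuming the captured fragments parse as JSON and that the two dimensions are named size_name and color_name (the real key names come from the page itself):

import json

variation_values = json.loads(variationValues)
index_map = json.loads(asinToDimensionIndexMap)

for asin, indexes in index_map.items():
    size = variation_values['size_name'][indexes[0]]
    color = variation_values['color_name'][indexes[1]]
    print(asin, size, color)  # e.g. B01KWIUH5M 8M US Teal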

Best practices method of implementing a django OR query from an iterable?

I am implementing a one-off data importer where I need to search for existing slugs. The slugs are in an array. What is the accepted best-practice way of converting an array into an OR query?
I came up with the following, which works, but feels like way too much code to accomplish something this simple.
# slug might be an array or just a string, e.g.:
slug = ["snakes", "snake-s"]  # in the real world this is generated from directory structure on disk

# build the query
query = MyModel.objects
if hasattr(slug, "__iter__"):
    q = Q()
    for s in slug:
        q = q | Q(slug=s)
    query = query.filter(q)
else:
    query = query.filter(slug=slug)
slug = ["snakes", "snake-s" ] # in the real world this is generated from directory structure on disk
# build the query
query = MyModel.objects
if hasattr(slug, "__iter__"):
q_list = []
for s in slug:
q_list.append(Q(slug=s))
query = query.filter(reduce(operator.or_, q_list))
else:
query = query.filter(slug=slug)
q_list = [] creates a list of Q clauses, and reduce(operator.or_, q_list) implodes the list with OR operators.
Read this: http://www.michelepasin.org/techblog/2010/07/20/the-power-of-djangos-q-objects/
@MostafaR - sure, we could crush my entire code block down to one line if we wanted (see below). It's not very readable anymore at that level, though. Saying code isn't "Pythonic" just because it hasn't been reduced and obfuscated is silly; readable code is king, IMHO. It's also important to keep in mind that the purpose of my answer was to show the reduce-by-an-operator technique. The rest of my answer was fluff to put that technique in the context of the original question.
result = MyModel.objects.filter(reduce(operator.or_, [Q(slug=s) for s in slug])) if hasattr(slug, "__iter__") else MyModel.objects.filter(slug=slug)
result = MyModel.objects.filter(slug__in=slug).all() if isinstance(slug, list) else MyModel.objects.filter(slug=slug).all()
I believe in this case you should use django's __in field lookup like this:
slugs = [ "snakes", "snake-s" ]
objects = MyModel.objects.filter(slug__in=slugs)
The code that you posted will not work as written in several ways (but I am not sure if it was meant as pseudocode?), but from what I understand, this might help:
MyModel.objects.filter(slug__in=slug)
should do the job.

How to implement full text search in Django?

I would like to implement a search function in a django blogging application. The status quo is that I have a list of strings supplied by the user and the queryset is narrowed down by each string to include only those objects that match the string.
See:
if request.method == "POST":
    form = SearchForm(request.POST)
    if form.is_valid():
        posts = Post.objects.all()
        for string in form.cleaned_data['query'].split():
            posts = posts.filter(
                Q(title__icontains=string) |
                Q(text__icontains=string) |
                Q(tags__name__exact=string)
            )
        return archive_index(request, queryset=posts, date_field='date')
Now, what if I didn't want to concatenate each word that is searched for with a logical AND but with a logical OR? How would I do that? Is there a way to do that with Django's own QuerySet methods, or does one have to fall back to raw SQL queries?
In general, is it a proper solution to do full text search like this or would you recommend using a search engine like Solr, Whoosh or Xapian. What are their benefits?
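A minimal sketch of the OR variant using only Q objects, assuming the same form and Post model as above:

import operator
from functools import reduce
from django.db.models import Q

# assumes at least one search word, otherwise reduce() raises TypeError
words = form.cleaned_data['query'].split()
q = reduce(operator.or_, (
    Q(title__icontains=w) | Q(text__icontains=w) | Q(tags__name__exact=w)
    for w in words
))
posts = Post.objects.filter(q)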
I suggest you adopt a search engine.
We've used Haystack search, a modular search application for Django supporting many search engines (Solr, Xapian, Whoosh, etc.).
Advantages:
Faster
Perform search queries even without querying the database
Highlight searched terms
"More like this" functionality
Spelling suggestions
Better ranking
etc...
Disadvantages:
Search indexes can grow in size pretty fast
One of the best search engines (Solr) runs as a Java servlet (Xapian does not)
We're pretty happy with this solution and it's pretty easy to implement.
Actually, the query you have posted does use OR rather than AND - you're using | to separate the Q objects. AND would be &.
In general, I would highly recommend using a proper search engine. We have had good success with Haystack on top of Solr - Haystack manages all the Solr configuration, and exposes a nice API very similar to Django's own ORM.
Answer to your general question: Definitely use a proper application for this.
With your query, you always examine the whole content of the fields (title, text, tags). You gain no benefit from indexes, etc.
With a proper full text search engine (or whatever you call it), text (words) is (are) indexed every time you insert new records. So queries will be a lot faster, especially when your database grows.
SOLR is very easy to set up and integrate with Django. Haystack makes it even simpler.
For full text search in Python, look at PyLucene. It allows for very complex queries. The main problem here is that you must find a way to tell your search engine which pages changed and to update the index accordingly.
Alternatively, you can use Google Sitemaps to tell Google to index your site faster and then embed a custom query field in your site. The advantage here is that you just need to tell Google about the changed pages and Google will do all the hard work (indexing, parsing the queries, etc.). On top of that, most people are used to using Google to search; plus, it will keep your site current in global Google searches, too.
I think full text search on an application level is more a matter of what you have and how you expect it to scale. If you run a small site with low usage, it might be more affordable to put some time into making a custom full text search rather than installing an application to perform the search for you. An application would create more dependencies, maintenance, and extra effort when storing data. By making the search yourself, you can build in nice custom features: for example, if the text exactly matches one title, you can direct the user to that page instead of showing the results. Another would be to allow title: or author: prefixes to keywords.
Here is a method I've used for generating relevant search results from a web query.
import shlex

class WeightedGroup:
    def __init__(self):
        # using a dictionary will make the results not paginate
        # but it will be a lot faster when storing data
        self.data = {}

    def list(self, max_len=0):
        # returns a sorted list of the items with the heaviest weight first
        res = []
        while len(self.data) != 0:
            nominated_weight = 0
            for item, weight in self.data.items():
                if weight > nominated_weight:
                    nominated = item
                    nominated_weight = weight
            self.data.pop(nominated)
            res.append(nominated)
            if len(res) == max_len:
                return res
        return res

    def append(self, weight, item):
        if item in self.data:
            self.data[item] += weight
        else:
            self.data[item] = weight

def search(searchtext):
    candidates = WeightedGroup()

    for arg in shlex.split(searchtext):  # shlex understands quotes

        # search TITLE, ordered by date so we get the most recent posts
        query = Post.objects.filter(title__icontains=arg).order_by('-date')

        arg_hits = query.count()  # count is cheap
        if arg_hits > 1000:
            continue  # skip keywords which have too many hits

        # each of these is expensive as it transfers data
        # from the db and builds a python object,
        for post in query[:50]:  # so we limit it to 50 for example
            # the more hits a keyword has, the less relevant it is
            candidates.append(100.0 / arg_hits, post.post_id)

    # TODO add searches for other areas
    # weight might also be adjusted with the number of hits within the text,
    # or perhaps other metrics that value a post higher, like number of views

    # candidates can contain a lot of items now; show only the most relevant
    return Post.objects.filter(post_id__in=candidates.list(20))
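Hypothetical usage of the sketch above (the search string is made up; shlex keeps the quoted phrase together as a single keyword):

results = search('nike "running shoes"')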

Does Django support multi-value cookies?

I'd like to set a cookie via Django that has several different values, similar to .NET's HttpCookie.Values property. Looking at the documentation, I can't tell if this is possible. It looks like it just takes a string, so is there another way?
I've tried passing it an array ([10, 20, 30]) and a dictionary ({'name': 'Scott', 'id': 1}), but they just get converted to their string representations. My current solution is to use an arbitrary separator and then parse it when reading it back in, which feels icky. If multi-values aren't possible, is there a better way? I'd rather not use lots of cookies, because that would get annoying.
.NET's multi-value cookies work exactly the same way as what you're doing in Django with a separator; they've just abstracted that away for you. What you're doing is fine and proper, and I don't think Django has anything specific to "solve" this problem.
I will say that you're doing the right thing in not using multiple cookies. Keep the over-the-wire overhead down by doing what you're doing.
If you're looking for something a little more abstracted, try using sessions. I believe the way they work is by storing an id in the cookie that matches a database record, and you can store whatever you want in the session. It's not exactly the same as what you're looking for, but it could work if you don't mind a small amount of db overhead.
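A minimal sketch, assuming the django.contrib.sessions middleware is enabled (the view names here are made up):

from django.http import HttpResponse

def set_values(request):
    # the cookie only carries a session id; the values live server-side
    request.session['name'] = 'Scott'
    request.session['id'] = 1
    return HttpResponse('saved')

def get_values(request):
    name = request.session.get('name')
    user_id = request.session.get('id')
    return HttpResponse('%s %s' % (name, user_id))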
(Late answer!)
This will be bulkier, but you can always use Python's built-in serializing.
You could do something like:
import pickle

class MultiCookie:
    def __init__(self, cookie=None, values=None):
        if cookie is not None:
            try:
                self.values = pickle.loads(cookie)
            except Exception:
                # assume that it used to just hold a string value
                self.values = cookie
        elif values is not None:
            self.values = values
        else:
            self.values = None

    def __str__(self):
        return pickle.dumps(self.values)
Then, you can get the cookie:
newcookie = MultiCookie(cookie=request.COOKIES.get('multi'))
values_for_cookie = newcookie.values
Or set the values:
mylist = [ 1, 2, 3 ]
newcookie = MultiCookie(values=mylist)
response.set_cookie('multi', value=newcookie)  # cookies are set on the HttpResponse, not the request
Django does not support it. The best way would be to separate the values with an arbitrary separator and then just split the string, as you already said.
