Currenty exist several full text search helpers for Google App Engine Python:
- google.appengine.ext.search.SearchableModel - created by Google, but only does OR search (with ordering if needed), only has English stop words removal. Works, but there are more effective and robust solutions.
- Searchable by Bill Katz - open source with English stemming, multi word exact search (but not OR) and no ordering as far as i know. In addition it uses a separe entity for index as described in Brett Slatkin's Building Scalable, Complex Apps talk at Google I/O
- gae full text search by AEP creators - only partly open source, has the same functions as previous + autocomplete, very simple sorting and additional filtering options.
As you can see none of the solutions allow both exact and OR search, and ordering by some properties. Below I'm going to explain the solution that from user perspective should work relatively nice, but still has the same characteristic as above solutions:
- uses additional entity for SearchIndex + properies used when sorting
- does not end up with exploding indexes, but may need more index storage because of sorting (it is unavoidable)
- use a simple trick for filtering (you know the Google site:x.com or Gmail in:inbox?)
- have a possibility to order in both exact and OR searches
- have multi word exact match in some cases even in OR searches
Creating searchable entry
First, we must prepare search index when saving the main entity. Creating the index is best to do in as deffered task using Task Queue. After saving the main entity, we can call:
deferred_lib.defer(deferred.update_index, recipe.key(), 'add')
where update_index is the function to be run by Task Queue, recipe.key() is the Datastore key, and add is the suboperation. In the update_index we will just load the recipe by key and call its update_index() method. The method is really simple:
RecipeIdx(parent=self, key_name='idx', idx=self.make_index(), created=self.created, title=self.title, views=self.views).put()
It creates a child index entity using always the same key name. In this way we can update or add in the same way. In addition to index created by separate method, we are also copying properties on which we want to sort. Below is the make_index() method, one of two most interesting ones:
def make_index(self):
idx = polish_word_segmenter.segment(self.title)
idx.add(self.title.lower().strip())
for x in [x.product for x in RecipeIngr.all().ancestor(self).fetch(20)]:
idx |= polish_word_segmenter.segment(x)
idx.add(x.lower().strip())
for x in self.tags:
idx |= polish_word_segmenter.segment(x)
idx.add(x.lower().strip())
idx |= set([
u'autor:"%s"' % self.parent_key().name(),
u'faza:"%s"' % self.phase,
u'kategoria:"%s"' % CATEGORIES_SV[self.category],
u'typ:"%s"' % self.type,
u'czas:"%s"' % self.time
])
idx -= set([u''])
return list(idx)
Method uses simple word segmenter for Polish language (no stemming, because for Polish; it is not so simple (see my master thesis) and not that important in my case). Segmenter returns important words as a set. But watch out for next line! In addition to segmented title, it adds full title, so we will get exact subsearch even in OR queries! The same trick is done for ingredients and tags.
The next trick is below: to make the possibility for user to limit the search, we use the same syntax as many other search systems. Yes, it will not work for OR queries, but has lower cost than additional filter properties in index (much more custom indexes, so it means more storage and more $). For example entering kategoria:"Zupy" will limit only to recipes that are soups.
We in addition remove empty string if it was returned. All repetitions were removed by using the set object. Because idx is stored in StringListProperty, code converts set to list (ordering does not matter that much).
Searching
Searching static method is a little more complicated, but only because it handles separatelly the OR and exact searches.
@classmethod
def search(cls, q, limit=201, order=None, exact=True):
if not q:
return []
# First handle the filters, eg. kategoria:"Zupy"
filters = [f.group(0) for f in FILTER_REGEXP.finditer(q)]
for f in filters:
q = q.replace(f, '')
# Handle exact muliword (eg. "Zupa borowikowa"), it must be in double quotes
multi = [multi.group(0)[1:-1].lower() for multi in MULTIWORD_REGEXP.finditer(q)]
for m in multi:
q = q.replace(m, '')
# Do the by word dividing.
keywords = set(DELIMETER_REGEXP.sub(' ', q).lower().split())
# Add previously removed filters and mulitoword
keywords |= set(multi)
keywords |= set(filters)
# Remove one letter words, we do not hame them in index.
keywords = filter(lambda x: len(x) >= 2, keywords)
# Limit the number of keywords to 10 (CPU saving for misbehaving users)
keywords = list(keywords)[:10] # Only do first 10 to limit filtering/queries
if not keywords:
return []
if not exact:
# For exact use the IN opearator, that will issue len(keywords) queries, but then join them.
if len(keywords) > 1:
query = RecipeIdx.all().filter('idx IN', keywords)
else:
query = RecipeIdx.all().filter('idx =', keywords[0])
if order:
query = query.order(order)
index_keys = [entry.parent_key() for entry in query.fetch(limit)]
else:
# If we do not do ordering, we can do it more efficiently by taking only keys!
if not order:
query = RecipeIdx.all(keys_only=True)
for keyword in keywords:
query = query.filter('idx =', keyword)
index_keys = [key.parent() for key in query.fetch(limit)]
else:
query = RecipeIdx.all()
for keyword in keywords:
query = query.filter('idx =', keyword)
results = query.fetch(limit)
# Sort in memory...
_key = order[1:] if order[0] == '-' else order
_reverse = True if order[0] == '-' else False
results = sorted(results, key=lambda x: getattr(x, _key), reverse=_reverse)
index_keys = [entry.parent_key() for entry in results]
return index_keys
The method will return keys for recipe entities. It is more efficient than taking entities itself here, because of pagination. We must take all results here (by default 201), because of Google App Engine limitations. Then we can limit them to exact page and take only 20 or 30 recipes we really need in one call using for example Recipe.get(index_keys[:20]).
Ordering for exact search must be done in memory, because otherwise we have to use composite indexes which will end up as exploding indexes pretty easily!
Final words
Provided solution is still far from ideal but the limitations won't be very visible to the user. He can do sorting, flitering and choose between exact and OR search, he can even look for exact title or ingredients.
Limitations:
- not usable filtering when using OR search
- max 1000 results (in practice 200-300), but how may times you look at 15 page of Google results?
- exact search with ordering may miss important results when results > limit,
- it will not be very usable for long documents needing a search functionality
3 komentarze: