At Google I/O 2010 there will be a session Next Gen Queries which will be partly on work done to get the full text on App Engine to be more usable. They are not telling what exactly will be proposed. One additional thing mentioned is filtering and sorting for zigzag queries.
I have to inform people not aware of it - exact search on PrzepisyMM.info is using this kind of queries and currently Datastore does not really allow to do sorting in this case because of exploding indexes. If they add sorting and filtering for zigzag, it will not solve all full text search problems but will be much, much more usable (no more zigzag queries sorting in memory which of course doesn't scale for results above 200, not to mention the 1000 limit).
16 January 2010
12 January 2010
PrzepisyMM.info wystartowały
Miło mi poinformować, że w niedzielę została uruchomiona witryna z Przepisami diety Montignac. Witryna pozwala zbierać i dzielić się przepisami ze znajomymi lub ze wszystkimi osobami i korzysta z technologii Google App Engine oraz Google Friend Connect.
Na tę chwilę witryna znajduje się w minimalistycznej fazie rozwojowej. Na razie nie znajdziesz tam wyrafinowanej grafiki ani wielu przepisów, ale mam nadzieję, że się to zmieni.
Witryna stanowi praktyczny eksperyment z serwerowej integracji z GFC (nie jest jeszcze zbyt głęboka), by utworzyć bardzo specjalistyczny portalik z modnym obecnie aspektem społecznościowym. Dodatkowo pozwoliła mi lepiej poznać kilka narzędzi wymienionych na stronie O aplikacji, między innymi framework Kay (a z nim Jinja2, Werkzeug i Zine), CleverCSS (framework CSS) i Smart Sprites (sprite'y CSS).
Moje boje z wyszukiwaniem pełnotekstowym w Google App Engine opisałem w jednym z wcześniejszych wpisów.
Na tę chwilę witryna znajduje się w minimalistycznej fazie rozwojowej. Na razie nie znajdziesz tam wyrafinowanej grafiki ani wielu przepisów, ale mam nadzieję, że się to zmieni.
Witryna stanowi praktyczny eksperyment z serwerowej integracji z GFC (nie jest jeszcze zbyt głęboka), by utworzyć bardzo specjalistyczny portalik z modnym obecnie aspektem społecznościowym. Dodatkowo pozwoliła mi lepiej poznać kilka narzędzi wymienionych na stronie O aplikacji, między innymi framework Kay (a z nim Jinja2, Werkzeug i Zine), CleverCSS (framework CSS) i Smart Sprites (sprite'y CSS).
Moje boje z wyszukiwaniem pełnotekstowym w Google App Engine opisałem w jednym z wcześniejszych wpisów.
6 January 2010
Simple nofollow filter for Markdown (Python)
I just want to share the simple Markdown extension that will add rel=nofollow to all <a> links generated by Markdown. The solution is simplistic, but it is meant to be used in user comments to not drain URL juice out of site or to lower position by bad urls mentioned by users (SEO). The code is based on the idea from Django Snippet 312.
import markdown
import re
R_NOFOLLOW = re.compile('<a (?![^>]*rel=["\']nofollow[\'"])')
S_NOFOLLOW = '<a rel="nofollow" '
class NofollowPostprocessor(markdown.postprocessors.Postprocessor):
def run(self, text):
return R_NOFOLLOW.sub(S_NOFOLLOW, text)
class NofollowExtension(markdown.Extension):
""" Add nofollow for links to Markdown. """
def extendMarkdown(self, md, md_globals):
md.postprocessors.add('nofollow', NofollowPostprocessor(md), '_end')
def makeExtension(configs={}):
return NofollowExtension(configs=configs)
22 December 2009
Google App Engine i wydajne korzystanie z API
Dwa tygodnie temu Guido van Rossum opublikował wersję testową aplikacji umożliwiającej przeglądanie statystyk użycia wywołań RPC na platformie Google App Engine (link). Ponieważ wywołania RPC to zawsze dodatkowe opóźnienie, a nie rzadko i dodatkowe pieniądze (szczególnie w systemach chmurowych), tego rodzaju narzędzia okazują się nieodzowne w poprawianiu aplikacji i znajdowaniu jej słabych punktów.
By jednak po uruchomieniu wspomnianej aplikacji nie zacząć płakać, bo trzeba przerobić część aplikacji, warto od początku pamiętać o kilku prostych zasadach optymalizacyjnych, szczególnie w odniesieniu do Datastore i memcache.
Oto one:
By jednak po uruchomieniu wspomnianej aplikacji nie zacząć płakać, bo trzeba przerobić część aplikacji, warto od początku pamiętać o kilku prostych zasadach optymalizacyjnych, szczególnie w odniesieniu do Datastore i memcache.
Oto one:
- stosować wersje multi poleceń get/put/delete dla Datastore i memcache
- wykorzystać cache procesu lub memcache dla gorących elementów, szczególnie jeśli możemy zamienić kilka wywołań RPC w jedno
- nie pobierać tego samego elementu kilkukrotnie
- w miarę możliwości zamieniać zapytania na wywołania db.get(), np. jeśli potrafimy wyliczyć klucz elementu i nie potrzebujemy sortowania
- pobierać tylko klucze w przypadku, gdy pozostała część jest zbędna
- uważać na transakcje i używać ich tylko wtedy, gdy są naprawdę niezbędne (każda transakcja to minimum 4 wywołania RPC - rozpoczęcie transakcji, pobranie, zapis, zatwierdzenie)
19 December 2009
Almost full text search on App Engine? Nothing hard!
It is a pity that Google App Engine does not come with build-in support for full text search (if you take into account that Google is a search web guru). But the case is not lost, especially if you are not interested in full text search on very log documents, but in more common product/element search. In this case each element has only relatively short number of words (title, description, etc.).
Currenty exist several full text search helpers for Google App Engine Python:
As you can see none of the solutions allow both exact and OR search, and ordering by some properties. Below I'm going to explain the solution that from user perspective should work relatively nice, but still has the same characteristic as above solutions:
Creating searchable entry
First, we must prepare search index when saving the main entity. Creating the index is best to do in as deffered task using Task Queue. After saving the main entity, we can call:
deferred_lib.defer(deferred.update_index, recipe.key(), 'add')
where update_index is the function to be run by Task Queue, recipe.key() is the Datastore key, and add is the suboperation. In the update_index we will just load the recipe by key and call its update_index() method. The method is really simple:
RecipeIdx(parent=self, key_name='idx', idx=self.make_index(), created=self.created, title=self.title, views=self.views).put()
It creates a child index entity using always the same key name. In this way we can update or add in the same way. In addition to index created by separate method, we are also copying properties on which we want to sort. Below is the make_index() method, one of two most interesting ones:
Method uses simple word segmenter for Polish language (no stemming, because for Polish; it is not so simple (see my master thesis) and not that important in my case). Segmenter returns important words as a set. But watch out for next line! In addition to segmented title, it adds full title, so we will get exact subsearch even in OR queries! The same trick is done for ingredients and tags.
The next trick is below: to make the possibility for user to limit the search, we use the same syntax as many other search systems. Yes, it will not work for OR queries, but has lower cost than additional filter properties in index (much more custom indexes, so it means more storage and more $). For example entering kategoria:"Zupy" will limit only to recipes that are soups.
We in addition remove empty string if it was returned. All repetitions were removed by using the set object. Because idx is stored in StringListProperty, code converts set to list (ordering does not matter that much).
Searching
Searching static method is a little more complicated, but only because it handles separatelly the OR and exact searches.
The method will return keys for recipe entities. It is more efficient than taking entities itself here, because of pagination. We must take all results here (by default 201), because of Google App Engine limitations. Then we can limit them to exact page and take only 20 or 30 recipes we really need in one call using for example Recipe.get(index_keys[:20]).
Ordering for exact search must be done in memory, because otherwise we have to use composite indexes which will end up as exploding indexes pretty easily!
Final words
Provided solution is still far from ideal but the limitations won't be very visible to the user. He can do sorting, flitering and choose between exact and OR search, he can even look for exact title or ingredients.
Limitations:
Currenty exist several full text search helpers for Google App Engine Python:
- google.appengine.ext.search.SearchableModel - created by Google, but only does OR search (with ordering if needed), only has English stop words removal. Works, but there are more effective and robust solutions.
- Searchable by Bill Katz - open source with English stemming, multi word exact search (but not OR) and no ordering as far as i know. In addition it uses a separe entity for index as described in Brett Slatkin's Building Scalable, Complex Apps talk at Google I/O
- gae full text search by AEP creators - only partly open source, has the same functions as previous + autocomplete, very simple sorting and additional filtering options.
As you can see none of the solutions allow both exact and OR search, and ordering by some properties. Below I'm going to explain the solution that from user perspective should work relatively nice, but still has the same characteristic as above solutions:
- uses additional entity for SearchIndex + properies used when sorting
- does not end up with exploding indexes, but may need more index storage because of sorting (it is unavoidable)
- use a simple trick for filtering (you know the Google site:x.com or Gmail in:inbox?)
- have a possibility to order in both exact and OR searches
- have multi word exact match in some cases even in OR searches
Creating searchable entry
First, we must prepare search index when saving the main entity. Creating the index is best to do in as deffered task using Task Queue. After saving the main entity, we can call:
deferred_lib.defer(deferred.update_index, recipe.key(), 'add')
where update_index is the function to be run by Task Queue, recipe.key() is the Datastore key, and add is the suboperation. In the update_index we will just load the recipe by key and call its update_index() method. The method is really simple:
RecipeIdx(parent=self, key_name='idx', idx=self.make_index(), created=self.created, title=self.title, views=self.views).put()
It creates a child index entity using always the same key name. In this way we can update or add in the same way. In addition to index created by separate method, we are also copying properties on which we want to sort. Below is the make_index() method, one of two most interesting ones:
def make_index(self):
idx = polish_word_segmenter.segment(self.title)
idx.add(self.title.lower().strip())
for x in [x.product for x in RecipeIngr.all().ancestor(self).fetch(20)]:
idx |= polish_word_segmenter.segment(x)
idx.add(x.lower().strip())
for x in self.tags:
idx |= polish_word_segmenter.segment(x)
idx.add(x.lower().strip())
idx |= set([
u'autor:"%s"' % self.parent_key().name(),
u'faza:"%s"' % self.phase,
u'kategoria:"%s"' % CATEGORIES_SV[self.category],
u'typ:"%s"' % self.type,
u'czas:"%s"' % self.time
])
idx -= set([u''])
return list(idx)
Method uses simple word segmenter for Polish language (no stemming, because for Polish; it is not so simple (see my master thesis) and not that important in my case). Segmenter returns important words as a set. But watch out for next line! In addition to segmented title, it adds full title, so we will get exact subsearch even in OR queries! The same trick is done for ingredients and tags.
The next trick is below: to make the possibility for user to limit the search, we use the same syntax as many other search systems. Yes, it will not work for OR queries, but has lower cost than additional filter properties in index (much more custom indexes, so it means more storage and more $). For example entering kategoria:"Zupy" will limit only to recipes that are soups.
We in addition remove empty string if it was returned. All repetitions were removed by using the set object. Because idx is stored in StringListProperty, code converts set to list (ordering does not matter that much).
Searching
Searching static method is a little more complicated, but only because it handles separatelly the OR and exact searches.
@classmethod
def search(cls, q, limit=201, order=None, exact=True):
if not q:
return []
# First handle the filters, eg. kategoria:"Zupy"
filters = [f.group(0) for f in FILTER_REGEXP.finditer(q)]
for f in filters:
q = q.replace(f, '')
# Handle exact muliword (eg. "Zupa borowikowa"), it must be in double quotes
multi = [multi.group(0)[1:-1].lower() for multi in MULTIWORD_REGEXP.finditer(q)]
for m in multi:
q = q.replace(m, '')
# Do the by word dividing.
keywords = set(DELIMETER_REGEXP.sub(' ', q).lower().split())
# Add previously removed filters and mulitoword
keywords |= set(multi)
keywords |= set(filters)
# Remove one letter words, we do not hame them in index.
keywords = filter(lambda x: len(x) >= 2, keywords)
# Limit the number of keywords to 10 (CPU saving for misbehaving users)
keywords = list(keywords)[:10] # Only do first 10 to limit filtering/queries
if not keywords:
return []
if not exact:
# For exact use the IN opearator, that will issue len(keywords) queries, but then join them.
if len(keywords) > 1:
query = RecipeIdx.all().filter('idx IN', keywords)
else:
query = RecipeIdx.all().filter('idx =', keywords[0])
if order:
query = query.order(order)
index_keys = [entry.parent_key() for entry in query.fetch(limit)]
else:
# If we do not do ordering, we can do it more efficiently by taking only keys!
if not order:
query = RecipeIdx.all(keys_only=True)
for keyword in keywords:
query = query.filter('idx =', keyword)
index_keys = [key.parent() for key in query.fetch(limit)]
else:
query = RecipeIdx.all()
for keyword in keywords:
query = query.filter('idx =', keyword)
results = query.fetch(limit)
# Sort in memory...
_key = order[1:] if order[0] == '-' else order
_reverse = True if order[0] == '-' else False
results = sorted(results, key=lambda x: getattr(x, _key), reverse=_reverse)
index_keys = [entry.parent_key() for entry in results]
return index_keys
The method will return keys for recipe entities. It is more efficient than taking entities itself here, because of pagination. We must take all results here (by default 201), because of Google App Engine limitations. Then we can limit them to exact page and take only 20 or 30 recipes we really need in one call using for example Recipe.get(index_keys[:20]).
Ordering for exact search must be done in memory, because otherwise we have to use composite indexes which will end up as exploding indexes pretty easily!
Final words
Provided solution is still far from ideal but the limitations won't be very visible to the user. He can do sorting, flitering and choose between exact and OR search, he can even look for exact title or ingredients.
Limitations:
- not usable filtering when using OR search
- max 1000 results (in practice 200-300), but how may times you look at 15 page of Google results?
- exact search with ordering may miss important results when results > limit,
- it will not be very usable for long documents needing a search functionality
Subscribe to:
Posts (Atom)