Miscellany

Word Search Plus Backend

When playing Word Search Plus, users are sometimes confronted with the main drawback of the game, the reliance on Wikipedia. This problem manifests itself by suggesting topics for which no puzzle can be generated. If an article has few words, it becomes technically impossible to properly fill the puzzle (currently at least 90% of letter-positions need to be filled by actual words).

Analytics show us that currently this (GenFailed) happens around 20% of all times a player selects a topic (TopicChosen) :

This of course is a bad user experience, frustrating players and inciting them to uninstall the app and leave bad ratings.

In order to fix this problem, we need to know from which topics we will be able to distill a valid puzzle.

Solution

Brute force it.

Luckily, Wikipedia offers its complete database for download. We are interested in the archive containing the latest version of all articles, currently, for English, this is an xml file, 44 GB in size. We can download a zipped version weighing in at around 10 GB.

We process this file using a python script, extracting for every article its title, and content. This content is then analyzed to extract the relevant words, the same way the mobile apps would do. Using this list of words, a search starts for a possible puzzle. Using the same method as the mobile apps, trying random combinations until at least one puzzle is found, or until a certain number of tries has been reached without result. This is done for each of the (currently) three possible puzzletypes (8x6, 10x8, 12x10). Ultimately these puzzles are stored along with the list of words in a MySql database.

The script makes use of multiple threads, taking advantage of as many logical cores as exist in the computer it's run on.

Also instead of the default python runtime, we use PyPy, an alternative python implementation that allows for a sixfold speed increase.

Processing the entire english Wikipedia, on a p4 3.2Ghz quadcore (8 logical cores), using a maximum count of 10.000 tries per puzzlesize per topic, takes around 2 days.

Interface

The MySql database is stored in Google Cloud SQL, and can be accessed by an App Engine app, which converts urls to database queries.

Urls of the form "/s/<language>/<size>/<query>" are split up, resulting in three parameters for the database: language, gridsize and partial query.

An sql query is constructed (using prepared statements) and the database is queried. This will return a number of results, which get packed into a json structure along with some statistics (for example how long the query took).

This json then is sent as the reply to the app's http request, where it will be processed, much like when Wikipedia was queried.

Puzzle generation

When the player has selected a topic, the appropriate wordlist is fetched from the database, along with 1 guaranteed puzzle. If the app is unable to find a puzzle, for instance because the random generation process is unlucky, it can default to the 1 puzzle that was found by the python script. This ensures that there will always be at least 1 puzzle, and we do not have to let down the player.

Optimization

App Engine allows the use of memcache, which will limit the amount of database queries. If a certain url has been requested before, it might be found in memcache, and the content of this value is returned.

Puzzle Suggestions

When typing a query in Wikipedia's search field, the website autocompletes possible topics. These are topics that are generally important and viewed a lot.

Before finalizing the MySql database, a seperate python script reads the wordlists from the dumpfile, and counts how many incoming links a certain topic has. This is an approximation of how 'important' an article is. Looking at the results, it's probably how Wikipedia generates these lists, since they appear to match. So every topic also has an 'importance' which determines how high in the suggestionlist it appears.

This has one drawback, though. It's possible for topics with very few incoming links to not appear, when their exact text is searched. For example "A", has very few incoming links, and will never appear above any article starting with an A, where it competes with topics like "Animal" and "Australia". In order to still be able to play these puzzles, the query is modified to also do an exact stringmatch in the database. These hits are then appended at the bottom of the list.

Random suggestions

What's more fun than discovering things? When the player has entered no searchquery, the database is queried for 15 random topics. This ensures the player can always find something new, and does not get stuck trying to find interesting topics.

The apps themselves allow for cleaning of the searchquery (and thus finding the random topics) by shaking the device.