How I cracked Chinese Wordle using knowledge graph
Wordle is going viral these days on social media. The game made by Josh Wardle allows players to try six times to guess a five-letter word, with feedback given for each guess in the form of colored tiles indicating when letters match or occupy the correct position.
We have seen many Wordle variants for languages that use the Latin script, such as the Spanish Wordle, French Wordle, and German Wordle. However, for non-alphabetic languages like Chinese, a simple adaptation of the English Wordle’s rules just won’t work.
In China, where most people are more familiar with Chinese characters, or hanzi, Wordle fans have invented a localized version of Wordle with a very clever name: Handle. (And you guessed it right, it is a combination of hanzi and Wordle.)
Handle gives players 10 attempts to guess a four-character Chinese idiom, or chengyu. While English Wordle uses letters and their positions to tell players how close a guess is, Handle also uses pinyin, the standard romanization system for Mandarin Chinese used in mainland China, to give players feedback.
After every guess in Handle, each hanzi and its pinyin are marked as cyan, orange, or gray: cyan indicates that the hanzi or the pinyin part (the initial or the final) is correct and in the correct position; orange means that it is in the answer but in a different position; gray indicates that it is not in the answer at all. Each game also offers a hint that reveals the pinyin or the hanzi of one character (depending on how hard you want the game to be).
Here is an example of how Handle is successfully played.
The Handle helper: Your second brain
Of all the clever approaches to solving Wordle, I was most impressed by that of Grant Sanderson (known on YouTube as 3Blue1Brown), who presented an elegant and delightful solution based on information theory.
Today, I’d like to write about how I solved Handle (the Chinese Wordle) using a different approach: a knowledge graph. I believe a knowledge graph is the best way to mimic how we search for the final answer in our minds in games like Wordle and Handle.
Imagine a knowledge graph of five-letter words and letters, connected by edges that represent their relations (a word contains a letter; a letter is part of a word). Each try in the game gives you some clues (hopefully!) that narrow down where the hidden answer sits in this network of words and letters.
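To make the idea concrete, here is a minimal Python sketch of such a word-and-letter network. The word list and the clue are made up for illustration, and plain dictionaries stand in for a real graph database.

```python
from collections import defaultdict

# A tiny, made-up word list; a real helper would load a full dictionary.
WORDS = ["crane", "slate", "sauce", "saute", "pound"]

def build_letter_graph(words):
    """Index words by the letters they contain and by letter position."""
    contains = defaultdict(set)      # letter -> words containing it
    at_position = defaultdict(set)   # (letter, index) -> words with letter at index
    for word in words:
        for index, letter in enumerate(word):
            contains[letter].add(word)
            at_position[(letter, index)].add(word)
    return contains, at_position

contains, at_position = build_letter_graph(WORDS)

# Clue: "s" is at index 0, and "a" is in the word but not at index 1.
candidates = (at_position[("s", 0)] & contains["a"]) - at_position[("a", 1)]
print(sorted(candidates))  # ['slate']
```

Each clue becomes a set operation on the neighbors of a letter, which is the same kind of narrowing a graph query performs.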
The same theory applies to Handle. We can also create a knowledge graph of chengyu (four-character Chinese idiom) and Chinese characters that are connected using edges to represent their relations.
Before we dive into how to solve Handle using a knowledge graph, I’d like to go through how to play Handle without the help of computers.
I have mentioned that Handle uses pinyin and hanzi to give players feedback. But pinyin is complicated: each syllable consists of an initial (声母; shēngmǔ), a final (韵母; yùnmǔ), and one of four tones. For example, the pinyin of the hanzi "声" (sheng1) is made up of the initial "sh" and the final "eng," and its tone is the first tone (tone 1).
For each guess, a character may have the correct initial but the wrong final. Even when the initial and the final are both right, the tone might be wrong. For example, the pinyin "sheng1" could be “声” (sound), while the pinyin "sheng4" could be “圣” (sacred).
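As a rough illustration of that structure, the sketch below splits a numbered syllable such as "sheng1" into its three parts using only the Python standard library. It assumes every syllable ends in a tone digit and treats "y" and "w" as initials (so "yi1" has the final "i"); it is not how PyPinyin or Handle itself does the split.

```python
# Initials used for the split. Two-letter initials come first so that
# "sh" is tried before "s". Treating "y" and "w" as initials is an assumption.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_pinyin(syllable):
    """Split a numbered syllable such as "sheng1" into (initial, final, tone)."""
    tone = int(syllable[-1])
    body = syllable[:-1]
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "", body, tone  # syllables such as "ai4" have no initial

print(split_pinyin("sheng1"))  # ('sh', 'eng', 1)
print(split_pinyin("sheng4"))  # ('sh', 'eng', 4)
print(split_pinyin("ai4"))     # ('', 'ai', 4)
```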
Let’s see what it is like to play Handle:
- Players get 10 attempts to guess the correct four-character chengyu.
- Characters are the most basic element to be considered:
- For example, in the first line in the screenshot below, the character "门" colored in green in position 2 is the correct hanzi and in the correct position.
- In the second line, the character "仓" colored in orange is the correct hanzi but not in the correct position.
- The pinyin of each character provides further information. Keep in mind, however, that several characters may share exactly the same pinyin.
- In the third line of the picture, the pinyin "qiao" colored in green means that the first character of the idiom is pronounced as "qiao" but the tone should not be the third tone as the guess implies.
- In the third line, the final "uo" colored in orange means that at least one character in the idiom has the final "uo" in its pinyin, but that character is not in position 2.
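The feedback rules above can be sketched as a small Python function. This is an assumption about how the coloring works (the usual two-pass Wordle scoring, applied separately to characters, initials, finals, or tones), not code taken from Handle. The labels "exact", "misplaced", and "absent" stand for the three tile colors, and the finals and tones below are written out by hand.

```python
from collections import Counter

def score(guess, answer):
    """Return "exact", "misplaced" or "absent" for each position of the guess."""
    marks = ["absent"] * len(guess)
    # Answer items not matched in place can still earn a "misplaced" mark.
    leftover = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            marks[i] = "exact"
        elif leftover[g] > 0:
            marks[i] = "misplaced"
            leftover[g] -= 1
    return marks

# Guess 爱憎分明 (ai4 zeng1 fen1 ming2) against answer 惊世骇俗 (jing1 shi4 hai4 su2).
print(score(["ai", "eng", "en", "ing"], ["ing", "i", "ai", "u"]))
# ['misplaced', 'absent', 'absent', 'misplaced']
print(score([4, 1, 1, 2], [1, 4, 4, 2]))
# ['misplaced', 'misplaced', 'absent', 'exact']
```

Running the same function once per layer (hanzi, initial, final, tone) gives the full set of hints for a guess.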
The Handle knowledge graph
I’m not going to create a fully automated Handle solver. That will just kill the fun of the game. Instead, I’m going to make a Handle helper, which I call a second brain, that will help people reach the hidden four-character idiom.
When playing the English Wordle, people can search for five-letter words based on clues they already have. For example, they can search for five-letter words with the most vowels, five-letter words starting with "sau," or five-letter words ending with "e".
In Handle, it’s almost impossible to search by hints such as tones or pinyin initials in a search engine, because most Chinese webpages are written in hanzi only, without pinyin or tone marks.
As I mentioned, the key idea of this helper is that it should work as a second brain to help people locate the answer in the sea of Chinese idioms, of which there are estimated to be up to 20,000. Then the question is: How does our brain work while handling the knowledge of Handle? (And yes, pun intended 😏.)
Thus, why not do it in a graph/neural-network way? And here we go, let’s create a knowledge graph of Chinese idioms and see how it goes with the Handle game.
What is a knowledge graph?
Simply put, a knowledge graph is a network of entities connected by relationships. The term was popularized by Google, which used its Knowledge Graph to answer search queries that require knowledge-based reasoning rather than the inverted indexing of web pages. For example: “How many championships have the Houston Rockets won?” and “Who was married to Elvis Presley?”
How to build a knowledge graph for Handle?
A knowledge graph is composed of entities (vertices) and relationships (edges), and a graph database management system can be easily used to index, query, and explore the knowledge graph.
In this article, I will use the open-source graph database NebulaGraph to build the knowledge graph for solving Handle. Let’s start with the modeling of the Handle knowledge graph using NebulaGraph.
Setting up the Handle knowledge graph
Modeling the knowledge graph
The modeling of a Handle knowledge graph is actually quite straightforward: I only need to index entities in the game as vertices and connect them using their relationships.
Oftentimes, you will have to come back and optimize the schema after playing with the knowledge graph for a while. But the main principle is not to over-design it: just model it in an intuitive way.
For my practice, the Handle knowledge graph has the following types of vertices:
- idiom (four-character Chinese words)
- character
- pinyin
- tone (1, 2, 3, 4)
- pinyin_part
- type (initial and final)
There are three types of edges:
- with_character
- with_pinyin
- with_pinyin_part
Of course, each type of vertex and edge has its own properties. For example, an "idiom" vertex has a VID, which is its unique identifier in NebulaGraph, and a pinyin property that records the initials, finals, and tone numbers of its characters (like the pinyin "sheng4" mentioned above).
The following sketch is a rough representation of the schema of the Handle knowledge graph.
After modeling, what we need to do is to collect, clean, and index the data.
I extracted the full set of idioms used in Handle from the game’s GitHub repo. I used PyPinyin, an open-source Python library, to convert idioms into their pinyin. PyPinyin can also be used to split pinyin entities into their initials and finals.
Here is the GitHub repo for the project: https://github.com/wey-gu/chinese-graph
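To give a feel for what the generated data looks like, here is a hedged sketch that turns one idiom into vertex and edge rows. The tag and edge names follow the queries used later in this article; the function build_rows, the row layout, and the hand-written pinyin input are illustrative assumptions and are not taken from graph_data_generator.py.

```python
def build_rows(idiom, syllables):
    """Return (vertices, edges) for one idiom.

    vertices: (tag, vid, properties) tuples
    edges:    (edge_type, source_vid, target_vid, properties) tuples
    """
    vertices = [("idiom", idiom, {})]
    edges = []
    for position, (char, (initial, final, tone)) in enumerate(zip(idiom, syllables)):
        pinyin = f"{initial}{final}{tone}"
        vertices.append(("character", char, {}))
        vertices.append(("character_pinyin", pinyin, {"tone": tone}))
        edges.append(("with_character", idiom, char, {"position": position}))
        edges.append(("with_pinyin", idiom, pinyin, {"position": position}))
        for part, part_type in ((initial, "initial"), (final, "final")):
            if part:  # syllables without an initial get no "initial" part
                vertices.append(("pinyin_part", part, {"part_type": part_type}))
                edges.append(("with_pinyin_part", pinyin, part, {}))
    return vertices, edges

# Hand-written (initial, final, tone) triples for 惊世骇俗.
vertices, edges = build_rows(
    "惊世骇俗",
    [("j", "ing", 1), ("sh", "i", 4), ("h", "ai", 4), ("s", "u", 2)],
)
print(len(vertices), len(edges))  # 17 16
```

Storing the position on the edges rather than on the vertices is what lets one "hai4" vertex be shared by every idiom that contains it, whatever the position.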
Deploy NebulaGraph
You can use Nebula-UP to deploy NebulaGraph with a single command; the install one-liner is listed in the Nebula-UP README.
Load the data
# Get the code from the GitHub repo and generate the data
git clone https://github.com/wey-gu/chinese-graph.git && cd chinese-graph
python3 graph_data_generator.py # generate the Handle knowledge graph data
# load data into NebulaGraph with Nebula-Importer
docker run --rm -ti \
--network=nebula-docker-compose_nebula-net \
-v ${PWD}/importer_conf.yaml:/root/importer_conf.yaml \
-v ${PWD}/output:/root \
vesoft/nebula-importer:v3.0.0 \
--config /root/importer_conf.yaml
Play Handle with knowledge graph
With all the setup ready, let’s start playing the game with our second brain.
Now let’s visit the Handle game. Say we use the idiom "爱憎分明" as our first guess. After typing the idiom, we get our first batch of hints:
Not bad, now we have a few informative hints:
- There is one character with the final of "ai" in tone 4, but it is not in position 1 and it is not "爱".
- There is one character in tone 1 but it is not in position 2.
- There is one character with the final "ing", but the character is not in position 4.
- The 4th character is in tone 2.
Let’s translate these hints into NebulaGraph’s nGQL graph query language:
# There is one character with the final "ai" in tone 4, but it is not in position 1 (position 0 in the query, which counts from 0) and it is not "爱".
MATCH (char0:character)<-[with_char_0:with_character]-(x:idiom)-[with_pinyin_0:with_pinyin]->(pinyin_0:character_pinyin)-[:with_pinyin_part]->(final_part_0:pinyin_part{part_type: "final"})
WHERE id(final_part_0) == "ai" AND pinyin_0.character_pinyin.tone == 4 AND with_pinyin_0.position != 0 AND with_char_0.position != 0 AND id(char0) != "爱"
# There is one character in tone 1 but it is not in position 2.
MATCH (x:idiom) -[with_pinyin_1:with_pinyin]->(pinyin_1:character_pinyin)
WHERE pinyin_1.character_pinyin.tone == 1 AND with_pinyin_1.position != 1
# There is one character with the final "ing", but the character is not in position 4.
MATCH (x:idiom) -[with_pinyin_2:with_pinyin]->(:character_pinyin)-[:with_pinyin_part]->(final_part_2:pinyin_part{part_type: "final"})
WHERE id(final_part_2) == "ing" AND with_pinyin_2.position != 3
# The 4th character is in tone 2.
MATCH (x:idiom) -[with_pinyin_3:with_pinyin]->(pinyin_3:character_pinyin)
WHERE pinyin_3.character_pinyin.tone == 2 AND with_pinyin_3.position == 3
RETURN x, count(x) as c ORDER BY c DESC
After running those queries against the NebulaGraph instance that hosts our Handle knowledge graph, we get seven candidate idioms for the second guess!
("惊愚骇俗" :idiom{pinyin: "['jing1', 'yu2', 'hai4', 'su2']"})
("惊世骇俗" :idiom{pinyin: "['jing1', 'shi4', 'hai4', 'su2']"})
("惊见骇闻" :idiom{pinyin: "['jing1', 'jian4', 'hai4', 'wen2']"})
("沽名卖直" :idiom{pinyin: "['gu1', 'ming2', 'mai4', 'zhi2']"})
("惊心骇神" :idiom{pinyin: "['jing1', 'xin1', 'hai4', 'shen2']"})
("荆棘载途" :idiom{pinyin: "['jing1', 'ji2', 'zai4', 'tu2']"})
("出卖灵魂" :idiom{pinyin: "['chu1', 'mai4', 'ling2', 'hun2']"})
Let’s give the idiom "惊世骇俗" a try.
And here we go, we got the final answer. It is "惊世骇俗".
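For readers without a NebulaGraph instance at hand, the four hints of this round can also be written as plain Python predicates over a small in-memory pool. The pool below is hand-written and tiny, and the first predicate ties the character and pinyin conditions to the same position, which is a slightly stricter reading than the nGQL above.

```python
# Each idiom maps to hand-written (initial, final, tone) triples, one per character.
IDIOMS = {
    "惊世骇俗": [("j", "ing", 1), ("sh", "i", 4), ("h", "ai", 4), ("s", "u", 2)],
    "沽名卖直": [("g", "u", 1), ("m", "ing", 2), ("m", "ai", 4), ("zh", "i", 2)],
    "爱憎分明": [("", "ai", 4), ("z", "eng", 1), ("f", "en", 1), ("m", "ing", 2)],
    "首当其冲": [("sh", "ou", 3), ("d", "ang", 1), ("q", "i", 2), ("ch", "ong", 1)],
}

def matches(idiom, syllables):
    """Check the four hints from the first guess; positions count from 0."""
    # Hint 1: a character other than "爱", not in position 0, has final "ai" and tone 4.
    hint1 = any(final == "ai" and tone == 4 and i != 0 and idiom[i] != "爱"
                for i, (_, final, tone) in enumerate(syllables))
    # Hint 2: a character that is not in position 1 is in tone 1.
    hint2 = any(tone == 1 and i != 1 for i, (_, _, tone) in enumerate(syllables))
    # Hint 3: a character that is not in position 3 has the final "ing".
    hint3 = any(final == "ing" and i != 3 for i, (_, final, _) in enumerate(syllables))
    # Hint 4: the character in position 3 is in tone 2.
    hint4 = syllables[3][2] == 2
    return hint1 and hint2 and hint3 and hint4

candidates = [idiom for idiom, syllables in IDIOMS.items() if matches(idiom, syllables)]
print(candidates)  # ['惊世骇俗', '沽名卖直']
```

The graph query does the same filtering, but over the whole idiom set and without loading everything into memory first.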
Let’s try again with another day’s Handle (Mar 1).
My first guess was "一言为定". And we got the following feedback:
This can be translated into the following nGQL statements:
# There is one character that is not in the first position whose pinyin final is "i" in the first tone, but its pinyin is not "yi"
MATCH (x:idiom) -[with_pinyin_0:with_pinyin]->(char_pinyin_0:character_pinyin)-[:with_pinyin_part]->(final_part_0:pinyin_part{part_type: "final"})
WHERE id(final_part_0) == "i" AND char_pinyin_0.character_pinyin.tone == 1 AND with_pinyin_0.position != 0 AND id(char_pinyin_0) != "yi1"
# There is one character whose pinyin initial is "d", but the character is not in the 4th position
MATCH (x:idiom) -[with_pinyin_1:with_pinyin]->(:character_pinyin)-[:with_pinyin_part]->(initial_part_1:pinyin_part{part_type: "initial"})
WHERE id(initial_part_1) == "d" AND with_pinyin_1.position != 3
# The third character is in tone 2, but its pinyin is not "wei"
MATCH (x:idiom) -[with_pinyin_2:with_pinyin]->(char_pinyin_2:character_pinyin)
WHERE char_pinyin_2.character_pinyin.tone == 2 AND id(char_pinyin_2) != "wei2" AND with_pinyin_2.position == 2
RETURN x
Here is what we get:
("堆积如山" :idiom{pinyin: "['dui1', 'ji1', 'ru2', 'shan1']"})
("丹漆随梦" :idiom{pinyin: "['dan1', 'qi1', 'sui2', 'meng4']"})
("植党营私" :idiom{pinyin: "['zhi2', 'dang3', 'ying2', 'si1']"})
("结党营私" :idiom{pinyin: "['jie2', 'dang3', 'ying2', 'si1']"})
("堆案盈几" :idiom{pinyin: "['dui1', 'an4', 'ying2', 'ji1']"})
("涓滴成河" :idiom{pinyin: "['juan1', 'di1', 'cheng2', 'he2']"})
("当之无愧" :idiom{pinyin: "['dang1', 'zhi1', 'wu2', 'kui4']"})
("荡析离居" :idiom{pinyin: "['dang4', 'xi1', 'li2', 'ju1']"})
("路断人稀" :idiom{pinyin: "['lu4', 'duan4', 'ren2', 'xi1']"})
("地广人稀" :idiom{pinyin: "['di4', 'guang3', 'ren2', 'xi1']"})
("地广人希" :idiom{pinyin: "['di4', 'guang3', 'ren2', 'xi1']"})
("地旷人稀" :idiom{pinyin: "['di4', 'kuang4', 'ren2', 'xi1']"})
("大失人望" :idiom{pinyin: "['da4', 'shi1', 'ren2', 'wang4']"})
("得不酬失" :idiom{pinyin: "['de2', 'bu4', 'chou2', 'shi1']"})
("得失荣枯" :idiom{pinyin: "['de2', 'shi1', 'rong2', 'ku1']"})
("独木难支" :idiom{pinyin: "['du2', 'mu4', 'nan2', 'zhi1']"})
("不得而知" :idiom{pinyin: "['bu4', 'de2', 'er2', 'zhi1']"})
("班师得胜" :idiom{pinyin: "['ban1', 'shi1', 'de2', 'sheng4']"})
("是非得失" :idiom{pinyin: "['shi4', 'fei1', 'de2', 'shi1']"})
("鸡虫得失" :idiom{pinyin: "['ji1', 'chong2', 'de2', 'shi1']"})
("锋镝余生" :idiom{pinyin: "['feng1', 'di1', 'yu2', 'sheng1']"})
("心到神知" :idiom{pinyin: "['xin1', 'dao4', 'shen2', 'zhi1']"})
("小大由之" :idiom{pinyin: "['xiao3', 'da4', 'you2', 'zhi1']"})
("水滴石穿" :idiom{pinyin: "['shui3', 'di1', 'shi2', 'chuan1']"})
("天打雷劈" :idiom{pinyin: "['tian1', 'da3', 'lei2', 'pi1']"})
Let’s try the idiom "得不偿失" this time.
Again, let’s try to translate the clues into nGQL:
# There is one character whose pinyin initial is "ch"
MATCH (x:idiom) -[:with_pinyin]->(:character_pinyin)-[:with_pinyin_part]->(initial_part_0:pinyin_part{part_type: "initial"})
WHERE id(initial_part_0) == "ch"
# There is one character whose pinyin initial is "d"
MATCH (x:idiom) -[:with_pinyin]->(:character_pinyin)-[:with_pinyin_part]->(initial_part_1:pinyin_part{part_type: "initial"})
WHERE id(initial_part_1) == "d"
# There is one character whose pinyin initial is "sh"
MATCH (x:idiom) -[:with_pinyin]->(:character_pinyin)-[:with_pinyin_part]->(initial_part_2:pinyin_part{part_type: "initial"})
WHERE id(initial_part_2) == "sh"
# The third character is in tone 2
MATCH (x:idiom) -[with_pinyin3:with_pinyin]->(char_pinyin3:character_pinyin)
WHERE char_pinyin3.character_pinyin.tone == 2 AND with_pinyin3.position == 2
# The fourth character is in tone 1
MATCH (x:idiom) -[with_pinyin4:with_pinyin]->(char_pinyin4:character_pinyin)
WHERE char_pinyin4.character_pinyin.tone == 1 AND with_pinyin4.position == 3
# There is one character whose pinyin final is "ang"
MATCH (x:idiom) -[:with_pinyin]->(:character_pinyin)-[:with_pinyin_part]->(final_part_5:pinyin_part{part_type: "final"})
WHERE id(final_part_5) == "ang"
RETURN x
We got three possible results:
("适当其冲" :idiom{pinyin: "['shi4', 'dang1', 'qi2', 'chong1']"})
("得不偿失" :idiom{pinyin: "['de2', 'bu4', 'chang2', 'shi1']"})
("首当其冲" :idiom{pinyin: "['shou3', 'dang1', 'qi2', 'chong1']"})
Let’s try the idiom "首当其冲".
And bingo! We’ve got it.
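Since every hint follows one of a few patterns, the nGQL text itself can be assembled from structured hints. The Python sketch below only builds the query string in the style used in this article; the helper names part_hint, tone_hint, and build_query are hypothetical, and the sketch does not connect to NebulaGraph or validate the query.

```python
def part_hint(index, part_type, value, not_at=None):
    """Clause for 'some character has this initial or final', optionally not at a position."""
    edge, part = f"e{index}", f"part{index}"
    match = (f'MATCH (x:idiom)-[{edge}:with_pinyin]->(:character_pinyin)'
             f'-[:with_pinyin_part]->({part}:pinyin_part{{part_type: "{part_type}"}})')
    where = f'WHERE id({part}) == "{value}"'
    if not_at is not None:
        where += f" AND {edge}.position != {not_at}"
    return match + "\n" + where

def tone_hint(index, tone, at):
    """Clause for 'the character at this position (counting from 0) has this tone'."""
    edge, node = f"e{index}", f"p{index}"
    return (f"MATCH (x:idiom)-[{edge}:with_pinyin]->({node}:character_pinyin)\n"
            f"WHERE {node}.character_pinyin.tone == {tone} AND {edge}.position == {at}")

def build_query(clauses):
    """Join the clauses and return the matching idioms."""
    return "\n".join(clauses + ["RETURN x"])

# The six hints of the last round, in the same order as above.
query = build_query([
    part_hint(0, "initial", "ch"),
    part_hint(1, "initial", "d"),
    part_hint(2, "initial", "sh"),
    tone_hint(3, tone=2, at=2),
    tone_hint(4, tone=1, at=3),
    part_hint(5, "final", "ang"),
])
print(query)
```

Giving each clause its own numbered aliases keeps the MATCH patterns independent, so they only meet at the shared idiom variable x.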
What’s Next
If you happen to be interested in graph databases, you can check out the NebulaGraph project on GitHub.
NebulaGraph will soon roll out a Visual Builder to enable users to generate nGQL queries in a drag and drop interface. With the no-code tool, you can explore the Handle knowledge graph more easily if you aren’t already familiar with graph query languages like nGQL. If you are interested in the new feature, please join our Slack channel to get alerted when it’s ready.
I will also share more visualized ways to play Handle in a follow-up article. Please stay tuned.
Also, an easier way to try NebulaGraph out is its fully managed cloud service: NebulaGraph is now available on Microsoft’s Azure Marketplace. The service is currently in beta and offers 70% off for beta users. Sign up here to get the offer if you are interested!
Happy graphing!
About the Author
Wey Gu is the Developer Advocate at NebulaGraph. He is passionate about bringing graph technology to the developer community and tries his best to make distributed graph databases more accessible. Follow him on Twitter or visit his blog for more fun stuff.