+++
title = "Getting started with Dgraph - Advanced text search on social graphs"
+++

**Welcome to the sixth tutorial of getting started with Dgraph.**

In the [previous tutorial]({{< relref "tutorial-5/index.md" >}}), we learned about building social graphs in Dgraph, by modeling tweets as an example.
We queried the tweets using the 'hash' and 'exact' indices, and implemented a keyword-based search to find your favorite tweets using the 'term' index and its functions.

In this tutorial, we'll continue from where we left off and learn about advanced text search features in Dgraph.

Specifically, we'll focus on two advanced features:

- Searching for tweets using Full-text search.
- Searching for hashtags using the regular expression search.

The accompanying video of the tutorial will be out shortly, so stay tuned to [our YouTube channel](https://www.youtube.com/channel/UCghE41LR8nkKFlR3IFTRO4w).

Before we dive in, let's do a quick recap of how to model the tweets in Dgraph.

{{% load-img "/images/tutorials/5/a-graph-model.jpg" "tweet model" %}}

In the previous tutorial, we took three real tweets as a sample dataset and stored them in Dgraph using the above graph as a model.

In case you haven't stored the tweets from the [previous tutorial]({{< relref "tutorial-5/index.md" >}}) into Dgraph, here's the sample dataset again.

Copy the mutation below, go to the mutation tab and click Run.

```json
{
  "set": [
    {
      "user_handle": "hackintoshrao",
      "user_name": "Karthic Rao",
      "uid": "_:hackintoshrao",
      "authored": [
        {
          "tweet": "Test tweet for the fifth episode of getting started series with @dgraphlabs. Wait for the video of the fourth one by @francesc the coming Wednesday!\n#GraphDB #GraphQL",
          "tagged_with": [
            {
              "uid": "_:graphql",
              "hashtag": "GraphQL"
            },
            {
              "uid": "_:graphdb",
              "hashtag": "GraphDB"
            }
          ],
          "mentioned": [
            {
              "uid": "_:francesc"
            },
            {
              "uid": "_:dgraphlabs"
            }
          ]
        }
      ]
    },
    {
      "user_handle": "francesc",
      "user_name": "Francesc Campoy",
      "uid": "_:francesc",
      "authored": [
        {
          "tweet": "So many good talks at #graphqlconf, next year I'll make sure to be *at least* in the audience!\nAlso huge thanks to the live tweeting by @dgraphlabs for alleviating the FOMO😊\n#GraphDB ♥️ #GraphQL",
          "tagged_with": [
            {
              "uid": "_:graphql"
            },
            {
              "uid": "_:graphdb"
            },
            {
              "hashtag": "graphqlconf"
            }
          ],
          "mentioned": [
            {
              "uid": "_:dgraphlabs"
            }
          ]
        }
      ]
    },
    {
      "user_handle": "dgraphlabs",
      "user_name": "Dgraph Labs",
      "uid": "_:dgraphlabs",
      "authored": [
        {
          "tweet": "Let's Go and catch @francesc at @Gopherpalooza today, as he scans into Go source code by building its Graph in Dgraph!\nBe there, as he Goes through analyzing Go source code, using a Go program, that stores data in the GraphDB built in Go!\n#golang #GraphDB #Databases #Dgraph ",
          "tagged_with": [
            {
              "hashtag": "golang"
            },
            {
              "uid": "_:graphdb"
            },
            {
              "hashtag": "Databases"
            },
            {
              "hashtag": "Dgraph"
            }
          ],
          "mentioned": [
            {
              "uid": "_:francesc"
            },
            {
              "uid": "_:dgraphlabs"
            }
          ]
        },
        {
          "uid": "_:gopherpalooza",
          "user_handle": "gopherpalooza",
          "user_name": "Gopherpalooza"
        }
      ]
    }
  ]
}
```

_Note: If you're new to Dgraph, and this is the first time you're running a mutation, we highly recommend reading the [first tutorial of the series before proceeding.]({{< relref "tutorial-1/index.md" >}})_

Voilà! Now you have a graph with 'tweets', 'users', and 'hashtags'.
It is ready for us to explore.

{{% load-img "/images/tutorials/5/x-all-tweets.png" "tweet graph" %}}

_Note: If you're curious to know how we modeled the tweets in Dgraph, refer to [the previous tutorial.]({{< relref "tutorial-5/index.md" >}})_

Let's start by finding your favorite tweets using the full-text search feature first.

## Full text search

Before we learn how to use the Full-text search feature, it's important to understand when to use it.

The length and the number of words in a string predicate value vary based on what the predicates represent.

Some string predicate values have only a few terms (words) in them.
Predicates representing `names`, `hashtags`, `twitter handles`, and `city names` are a few good examples. These predicates are easy to query using their exact values.

For instance, here is an example query.

_Give me all the tweets where the user name is equal to `John Campbell`_.

You can easily compose queries like these after adding either the `hash` or the `exact` index to the string predicates.

But some string predicates store sentences, and sometimes even one or more paragraphs of text.
Predicates representing a tweet, a bio, a blog post, a product description, or a movie review are just some examples.
It's relatively hard to query these predicates.

It's not practical to query such predicates using the `hash` or `exact` string indices.
A keyword-based search using the `term` index is a good starting point to query such predicates.
We used it in our [previous tutorial]({{< relref "tutorial-5/index.md" >}}) to find the tweets with an exact match for keywords like `GraphQL`, `Graphs`, and `Go`.

But for some use cases, a keyword-based search alone may not be sufficient.
You might need a more powerful search capability, and that's when you should consider using Full-text search.

Let's write some queries and understand Dgraph's Full-text search capability in detail.

To be able to do a Full-text search, you need to first set a `fulltext` index on the `tweet` predicate.

Creating a `fulltext` index on any string predicate is similar to creating any other string index.

{{% load-img "/images/tutorials/6/a-set-index.png" "full text" %}}

_Note: Refer to the [previous tutorial]({{< relref "tutorial-5/index.md" >}}) if you're not sure about creating an index on a string predicate._

Now, let's do a Full-text search query to find tweets related to the following topic: `graph data and analyzing it in graphdb`.

You can do so by using either the `alloftext` or the `anyoftext` built-in function.
Both functions take two arguments.
The first argument is the predicate to search.
The second argument is a string of space-separated values to search for, and we call these the `search strings`.

```sh
- alloftext(predicate, "space-separated search strings")
- anyoftext(predicate, "space-separated search strings")
```

We'll look at the difference between these two functions later. For now, let's use the `alloftext` function.

Go to the query tab, paste the query below, and click Run.
Here is our search string: `graph data and analyze it in graphdb`.

```graphql
{
  search_tweet(func: alloftext(tweet, "graph data and analyze it in graphdb")) {
    tweet
  }
}
```

{{% load-img "/images/tutorials/6/b-full-text-query-1.png" "tweet graph" %}}

Here's the matched tweet, which made it to the result.

{{< tweet 1192822660679577602>}}

If you look closely, you can see that some of the words from the search string are not present in the matched tweet, but the tweet still made it into the result.

To be able to use the Full-text search capability effectively, we must understand how it works.

Let's understand it in detail.

Once you set a `fulltext` index on the tweets, internally, the tweets are processed, and `fulltext` tokens are generated.
These `fulltext` tokens are then indexed.

The search string also goes through the same processing pipeline, and `fulltext` tokens are generated for it too.

Here are the steps to generate the `fulltext` tokens:

- Split the tweets into chunks of words called tokens (tokenizing).
- Convert these tokens to lowercase.
- [Unicode-normalize](http://unicode.org/reports/tr15/#Norm_Forms) the tokens.
- Reduce the tokens to their root form. This is called [stemming](https://en.wikipedia.org/wiki/Stemming) (running to run, faster to fast, and so on).
- Remove the [stop words](https://en.wikipedia.org/wiki/Stop_words).

You would have seen in [the fourth tutorial]({{< relref "tutorial-4/index.md" >}}) that Dgraph allows you to build multi-lingual apps.

Stemming and stop word removal are not supported for all languages.
Here is [the link to the docs](https://docs.dgraph.io/query-language/#full-text-search) that contains the list of languages and their support for stemming and stop word removal.

Here is the table with the matched tweet and its search string in the first column.
The second column contains their corresponding `fulltext` tokens generated by Dgraph.

| Actual text data | fulltext tokens generated by Dgraph |
|------------------|-------------------------------------|
| Let's Go and catch @francesc at @Gopherpalooza today, as he scans into Go source code by building its Graph in Dgraph!\nBe there, as he Goes through analyzing Go source code, using a Go program, that stores data in the GraphDB built in Go!\n#golang #GraphDB #Databases #Dgraph | [analyz build built catch code data databas dgraph francesc go goe golang gopherpalooza graph graphdb program scan sourc store todai us] |
| graph data and analyze it in graphdb | [analyz data graph graphdb] |

From the table above, you can see that the tweet and the search string are each reduced to an array of strings, or tokens.

Dgraph internally uses the [Bleve package](https://github.com/blevesearch/bleve) to do the stemming.

Here are the `fulltext` tokens generated for our search string: [`analyz`, `data`, `graph`, `graphdb`].

As you can see from the table above, all of the `fulltext` tokens generated for the search string exist in the matched tweet.
Hence, the `alloftext` function returns a positive match for the tweet.
It would not have returned a positive match if even one of the tokens from the search string were missing from the tweet. The `anyoftext` function, on the other hand, returns a positive match as long as the tweet and the search string have at least one token in common.
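
The difference between the two functions can be sketched as plain set operations over the tokens from the table above. This is only an illustration in Python, not Dgraph's implementation: the token sets are hard-coded from the table, and the helper names `matches_all` and `matches_any` are hypothetical stand-ins for `alloftext` and `anyoftext`.

```python
# Tokens copied from the table above. In Dgraph they come from the fulltext
# tokenizer; here they are hard-coded for illustration.
TWEET_TOKENS = {
    "analyz", "build", "built", "catch", "code", "data", "databas", "dgraph",
    "francesc", "go", "goe", "golang", "gopherpalooza", "graph", "graphdb",
    "program", "scan", "sourc", "store", "todai", "us",
}
SEARCH_TOKENS = {"analyz", "data", "graph", "graphdb"}


def matches_all(doc_tokens, search_tokens):
    """Hypothetical stand-in for alloftext: every search token must be present."""
    return set(search_tokens) <= set(doc_tokens)


def matches_any(doc_tokens, search_tokens):
    """Hypothetical stand-in for anyoftext: one shared token is enough."""
    return not set(search_tokens).isdisjoint(doc_tokens)


if __name__ == "__main__":
    # The original search string: all four tokens exist in the tweet.
    print(matches_all(TWEET_TOKENS, SEARCH_TOKENS))  # True
    print(matches_any(TWEET_TOKENS, SEARCH_TOKENS))  # True

    # Add one token the tweet does not contain ("sql" is an assumed example).
    extended = SEARCH_TOKENS | {"sql"}
    print(matches_all(TWEET_TOKENS, extended))  # False
    print(matches_any(TWEET_TOKENS, extended))  # True
```

A single missing token is enough to fail the all-of check, while the any-of check still succeeds because the other tokens overlap.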

If you're interested in seeing Dgraph's `fulltext` tokenizer in action, [here is the gist](https://gist.github.com/hackintoshrao/0e8d715d8739b12c67a804c7249146a3) containing the instructions to use it.

Dgraph generates the same `fulltext` tokens even if the words in a search string are ordered differently.
Hence, using the same search string with a different word order does not change the query result.

As you can see, all three queries below are the same for Dgraph.

```graphql
{
  search_tweet(func: alloftext(tweet, "graph analyze and it in graphdb data")) {
    tweet
  }
}
```

```graphql
{
  search_tweet(func: alloftext(tweet, "data and graph analyze it graphdb in")) {
    tweet
  }
}
```

```graphql
{
  search_tweet(func: alloftext(tweet, "analyze data and it in graph graphdb")) {
    tweet
  }
}
```

Now, let's move on to the next advanced text search feature of Dgraph: regular expression based queries.

Let's use them to find all the hashtags containing the following substring: `graph`.

## Regular expression search

[Regular expressions](https://www.geeksforgeeks.org/write-regular-expressions/) are powerful ways of expressing search patterns.
Dgraph allows you to search for string predicates based on regular expressions.
You need to set the `trigram` index on the string predicate to be able to perform regex-based queries.

Using regular expression based search, let's match all the hashtags that have this particular pattern: `Starts and ends with any characters of indefinite length, but with the substring graph in it`.

Here is the regular expression we can use: `^.*graph.*$`

Check out [this tutorial](https://www.geeksforgeeks.org/write-regular-expressions/) if you're not familiar with writing a regular expression.

Let's first find all the hashtags in the database using the `has()` function.

```graphql
{
  hash_tags(func: has(hashtag)) {
    hashtag
  }
}
```

{{% load-img "/images/tutorials/6/has-hashtag.png" "The hashtags" %}}

_If you're not familiar with using the `has()` function, refer to [the first tutorial]({{< relref "tutorial-1/index.md" >}}) of the series._

You can see that we have six hashtags in total, and four of them have the substring `graph` in them, ignoring case: `Dgraph`, `GraphQL`, `graphqlconf`, `GraphDB`.

We should use the built-in function `regexp` to be able to use regular expressions to search for predicates.
This function takes two arguments: the first is the name of the predicate, and the second is the regular expression.

Here is the syntax of the `regexp` function: `regexp(predicate, /regular-expression/)`

Let's execute the following query to find the hashtags that have the substring `graph`.

Go to the query tab, type in the query, and click Run.

```graphql
{
  reg_search(func: regexp(hashtag, /^.*graph.*$/)) {
    hashtag
  }
}
```

Oops! We have an error!
It looks like we forgot to set the `trigram` index on the `hashtag` predicate.

{{% load-img "/images/tutorials/6/trigram-error.png" "The hashtags" %}}

Again, setting a `trigram` index is similar to setting any other string index. Let's do that for the `hashtag` predicate.

{{% load-img "/images/tutorials/6/set-trigram.png" "The hashtags" %}}

_Note: Refer to the [previous tutorial]({{< relref "tutorial-5/index.md" >}}) if you're not sure about creating an index on a string predicate._

Now, let's re-run the `regexp` query.

{{% load-img "/images/tutorials/6/regex-query-1.png" "regex-1" %}}

_Note: Refer to [the first tutorial]({{< relref "tutorial-1/index.md" >}}) if you're not familiar with the query structure in general._

Success!

But we only have the following hashtags in the result: `Dgraph` and `graphqlconf`.

That's because the `regexp` function is case-sensitive by default.

Add the character `i` at the end of the second argument of the `regexp` function to make it case-insensitive: `regexp(predicate, /regular-expression/i)`

{{% load-img "/images/tutorials/6/regex-query-2.png" "regex-2" %}}

Now we have the four hashtags with the substring `graph` in them.

Let's modify the regular expression to match only the hashtags that start with the prefix `graph`.

```graphql
{
  reg_search(func: regexp(hashtag, /^graph.*$/i)) {
    hashtag
  }
}
```

{{% load-img "/images/tutorials/6/regex-query-3.png" "regex-3" %}}

## Summary

In this tutorial, we learned about Full-text search and regular expression based search capabilities in Dgraph.

Did you know that Dgraph also offers fuzzy search capabilities, which can be used to power features like `product` search in an e-commerce store?

Let's learn about fuzzy search in our next tutorial.

Sounds interesting?
Then see you all soon in the next tutorial. Till then, happy Graphing!

## What's Next?

- Go to [Clients]({{< relref "clients/index.md" >}}) to see how to communicate
  with Dgraph from your application.
- Take the [Tour](https://tour.dgraph.io) for a guided tour of how to write queries in Dgraph.
- A wider range of queries can also be found in the [Query Language]({{< relref "query-language/index.md" >}}) reference.
- See [Deploy]({{< relref "deploy/index.md" >}}) if you wish to run Dgraph
  in a cluster.

## Need Help

* Please use [discuss.dgraph.io](https://discuss.dgraph.io) for questions, feature requests and discussions.
* Please use [GitHub Issues](https://github.com/dgraph-io/dgraph/issues) if you encounter bugs or have feature requests.
* You can also join our [Slack channel](http://slack.dgraph.io).