+++
title = "Getting started with Dgraph - Advanced text search on social graphs"
+++

**Welcome to the sixth tutorial of getting started with Dgraph.**

In the [previous tutorial]({{< relref "tutorial-5/index.md" >}}), we learned about building social graphs in Dgraph, by modeling tweets as an example.
We queried the tweets using the `hash` and `exact` indices, and implemented a keyword-based search to find your favorite tweets using the `term` index and its functions.

In this tutorial, we'll continue from where we left off and learn about advanced text search features in Dgraph.

Specifically, we'll focus on two advanced features:

- Searching for tweets using full-text search.
- Searching for hashtags using regular expression search.

The accompanying video of the tutorial will be out shortly, so stay tuned to [our YouTube channel](https://www.youtube.com/channel/UCghE41LR8nkKFlR3IFTRO4w).

Before we dive in, let's do a quick recap of how to model the tweets in Dgraph.

{{% load-img "/images/tutorials/5/a-graph-model.jpg" "tweet model" %}}

In the previous tutorial, we took three real tweets as a sample dataset and stored them in Dgraph using the above graph as a model.

In case you haven't stored the tweets from the [previous tutorial]({{< relref "tutorial-5/index.md" >}}) into Dgraph, here's the sample dataset again.

Copy the mutation below, go to the mutation tab and click Run.

```json
{
  "set": [
    {
      "user_handle": "hackintoshrao",
      "user_name": "Karthic Rao",
      "uid": "_:hackintoshrao",
      "authored": [
        {
          "tweet": "Test tweet for the fifth episode of getting started series with @dgraphlabs. Wait for the video of the fourth one by @francesc the coming Wednesday!\n#GraphDB #GraphQL",
          "tagged_with": [
            {
              "uid": "_:graphql",
              "hashtag": "GraphQL"
            },
            {
              "uid": "_:graphdb",
              "hashtag": "GraphDB"
            }
          ],
          "mentioned": [
            {
              "uid": "_:francesc"
            },
            {
              "uid": "_:dgraphlabs"
            }
          ]
        }
      ]
    },
    {
      "user_handle": "francesc",
      "user_name": "Francesc Campoy",
      "uid": "_:francesc",
      "authored": [
        {
          "tweet": "So many good talks at #graphqlconf, next year I'll make sure to be *at least* in the audience!\nAlso huge thanks to the live tweeting by @dgraphlabs for alleviating the FOMO😊\n#GraphDB ♥️ #GraphQL",
          "tagged_with": [
            {
              "uid": "_:graphql"
            },
            {
              "uid": "_:graphdb"
            },
            {
              "hashtag": "graphqlconf"
            }
          ],
          "mentioned": [
            {
              "uid": "_:dgraphlabs"
            }
          ]
        }
      ]
    },
    {
      "user_handle": "dgraphlabs",
      "user_name": "Dgraph Labs",
      "uid": "_:dgraphlabs",
      "authored": [
        {
          "tweet": "Let's Go and catch @francesc at @Gopherpalooza today, as he scans into Go source code by building its Graph in Dgraph!\nBe there, as he Goes through analyzing Go source code, using a Go program, that stores data in the GraphDB built in Go!\n#golang #GraphDB #Databases #Dgraph ",
          "tagged_with": [
            {
              "hashtag": "golang"
            },
            {
              "uid": "_:graphdb"
            },
            {
              "hashtag": "Databases"
            },
            {
              "hashtag": "Dgraph"
            }
          ],
          "mentioned": [
            {
              "uid": "_:francesc"
            },
            {
              "uid": "_:dgraphlabs"
            }
          ]
        },
        {
          "uid": "_:gopherpalooza",
          "user_handle": "gopherpalooza",
          "user_name": "Gopherpalooza"
        }
      ]
    }
  ]
}
```

_Note: If you're new to Dgraph, and this is the first time you're running a mutation, we highly recommend reading the [first tutorial of the series]({{< relref "tutorial-1/index.md" >}}) before proceeding._

Voilà! Now you have a graph with 'tweets', 'users', and 'hashtags'. It is ready for us to explore.

{{% load-img "/images/tutorials/5/x-all-tweets.png" "tweet graph" %}}

_Note: If you're curious to know how we modeled the tweets in Dgraph, refer to [the previous tutorial]({{< relref "tutorial-5/index.md" >}})._

Let's start by finding your favorite tweets using the full-text search feature first.

## Full text search

Before we learn how to use the full-text search feature, it's important to understand when to use it.

The length and the number of words in a string predicate value vary based on what the predicates represent.

Some string predicate values have only a few terms (words) in them.
Predicates representing `names`, `hashtags`, `twitter handle`, `city names` are a few good examples. These predicates are easy to query using their exact values.

For instance, here is an example query.

_Give me all the tweets where the user name is equal to `John Campbell`_.

You can easily compose queries like these after adding either the `hash` or the `exact` index to the string predicates.

But some string predicates store sentences, or even one or more paragraphs of text.
Predicates representing a tweet, a bio, a blog post, a product description, or a movie review are just some examples.
It's relatively hard to query these predicates.

It's not practical to query such predicates using the `hash` or `exact` string indices.
A keyword-based search using the `term` index is a good starting point to query such predicates.
We used it in our [previous tutorial]({{< relref "tutorial-5/index.md" >}}) to find the tweets with an exact match for keywords like `GraphQL`, `Graphs`, and `Go`.

But for some use cases, a keyword-based search alone may not be sufficient.
You might need a more powerful search capability, and that's when you should consider using full-text search.

Let's write some queries and understand Dgraph's full-text search capability in detail.

To be able to do a full-text search, you first need to set a `fulltext` index on the `tweet` predicate.

Creating a `fulltext` index on a string predicate is similar to creating any other string index.

{{% load-img "/images/tutorials/6/a-set-index.png" "full text" %}}

_Note: Refer to the [previous tutorial]({{< relref "tutorial-5/index.md" >}}) if you're not sure about creating an index on a string predicate._

Now, let's do a full-text search query to find tweets related to the following topic: `graph data and analyzing it in graphdb`.

You can do so by using either the `alloftext` or the `anyoftext` built-in function.
Both functions take two arguments.
The first argument is the predicate to search.
The second argument is a string of space-separated values to search for; we call these the search strings.

```
alloftext(predicate, "space-separated search strings")
anyoftext(predicate, "space-separated search strings")
```

We'll look at the difference between these two functions later. For now, let's use the `alloftext` function.

Go to the query tab, paste the query below, and click Run.
Here is our search string: `graph data and analyze it in graphdb`.

```graphql
{
  search_tweet(func: alloftext(tweet, "graph data and analyze it in graphdb")) {
    tweet
  }
}
```

{{% load-img "/images/tutorials/6/b-full-text-query-1.png" "tweet graph" %}}

Here's the matched tweet, which made it to the result.

{{< tweet 1192822660679577602>}}

If you look closely, some of the words from the search string are not present in the matched tweet, but the tweet still made it to the result.

To use the full-text search capability effectively, we must understand how it works.

Let's understand it in detail.

Once you set a `fulltext` index on the tweets, internally, the tweets are processed, and `fulltext` tokens are generated.
These `fulltext` tokens are then indexed.

The search string also goes through the same processing pipeline, and `fulltext` tokens are generated for it too.

Here are the steps to generate the `fulltext` tokens:

- Split the tweets into chunks of words called tokens (tokenizing).
- Convert these tokens to lowercase.
- [Unicode-normalize](http://unicode.org/reports/tr15/#Norm_Forms) the tokens.
- Reduce the tokens to their root form; this is called [stemming](https://en.wikipedia.org/wiki/Stemming) (running to run, faster to fast, and so on).
- Remove the [stop words](https://en.wikipedia.org/wiki/Stop_words).
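
The steps above can be sketched in a few lines of Python. This is only an illustration of the pipeline's shape, not Dgraph's implementation: the `toy_stem` helper, the tiny `STOP_WORDS` set, and the choice of NFKC normalization are all assumptions made for this example, and a real stemmer handles far more cases.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "as", "at", "be", "by", "in", "it", "its", "the", "to"}


def toy_stem(token):
    """Very rough suffix stripping; NOT the stemmer Dgraph uses."""
    for suffix in ("ing", "es", "e", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def fulltext_tokens(text):
    """Return the sorted, de-duplicated toy tokens for a piece of text."""
    words = re.findall(r"\w+", text)                            # 1. tokenize
    words = [w.lower() for w in words]                          # 2. lowercase
    words = [unicodedata.normalize("NFKC", w) for w in words]   # 3. normalize
    words = [toy_stem(w) for w in words]                        # 4. stem
    words = [w for w in words if w not in STOP_WORDS]           # 5. drop stop words
    return sorted(set(words))


print(fulltext_tokens("graph data and analyze it in graphdb"))
```

With these toy rules, the search string `graph data and analyze it in graphdb` reduces to `analyz`, `data`, `graph`, and `graphdb`.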

You would have seen in [the fourth tutorial]({{< relref "tutorial-4/index.md" >}}) that Dgraph allows you to build multi-lingual apps.

Stemming and stop-word removal are not supported for all languages.
Here is [the link to the docs](https://docs.dgraph.io/query-language/#full-text-search) that contains the list of languages and their support for stemming and stop-word removal.

Here is a table with the matched tweet and our search string in the first column.
The second column contains their corresponding `fulltext` tokens generated by Dgraph.

| Actual text data | fulltext tokens generated by Dgraph |
|------------------|-------------------------------------|
| Let's Go and catch @francesc at @Gopherpalooza today, as he scans into Go source code by building its Graph in Dgraph!\nBe there, as he Goes through analyzing Go source code, using a Go program, that stores data in the GraphDB built in Go!\n#golang #GraphDB #Databases #Dgraph | [analyz build built catch code data databas dgraph francesc go goe golang gopherpalooza graph graphdb program scan sourc store todai us] |
| graph data and analyze it in graphdb | [analyz data graph graphdb] |

From the table above, you can see that the tweet and the search string are each reduced to an array of strings, or tokens.

Dgraph internally uses the [Bleve package](https://github.com/blevesearch/bleve) to do the stemming.

Here are the `fulltext` tokens generated for our search string: [`analyz`, `data`, `graph`, `graphdb`].

As you can see from the table above, all of the `fulltext` tokens generated for the search string exist in the matched tweet.
Hence, the `alloftext` function returns a positive match for the tweet.
It would not have returned a positive match if even one of the search string's tokens were missing from the tweet. The `anyoftext` function, on the other hand, would have returned a positive match as long as the tweet and the search string had at least one token in common.
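
The difference between the two functions boils down to a subset check versus an overlap check on the token sets. Here is a minimal Python sketch using the tokens from the table above; the helper names `alloftext_matches` and `anyoftext_matches` are made up for this example and only model the matching rule, not how Dgraph evaluates it against the index.

```python
# Token lists copied from the table above.
TWEET_TOKENS = {
    "analyz", "build", "built", "catch", "code", "data", "databas", "dgraph",
    "francesc", "go", "goe", "golang", "gopherpalooza", "graph", "graphdb",
    "program", "scan", "sourc", "store", "todai", "us",
}
SEARCH_TOKENS = {"analyz", "data", "graph", "graphdb"}


def alloftext_matches(doc_tokens, query_tokens):
    """True only if every query token appears in the document."""
    return set(query_tokens) <= set(doc_tokens)


def anyoftext_matches(doc_tokens, query_tokens):
    """True if at least one query token appears in the document."""
    return bool(set(query_tokens) & set(doc_tokens))


# Every search token is present in the tweet, so both rules match.
print(alloftext_matches(TWEET_TOKENS, SEARCH_TOKENS))
print(anyoftext_matches(TWEET_TOKENS, SEARCH_TOKENS))

# Add a token the tweet lacks: the "all" rule fails, the "any" rule still matches.
extended = SEARCH_TOKENS | {"graphql"}
print(alloftext_matches(TWEET_TOKENS, extended))
print(anyoftext_matches(TWEET_TOKENS, extended))
```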

If you're interested in seeing Dgraph's `fulltext` tokenizer in action, [here is the gist](https://gist.github.com/hackintoshrao/0e8d715d8739b12c67a804c7249146a3) containing the instructions to use it.

Dgraph generates the same `fulltext` tokens even if the words in a search string are ordered differently.
Hence, reordering the words in a search string does not change the query result.

As you can see, all three queries below are the same for Dgraph.

```graphql
{
  search_tweet(func: alloftext(tweet, "graph analyze and it in graphdb data")) {
    tweet
  }
}
```

```graphql
{
  search_tweet(func: alloftext(tweet, "graph and data analyze it graphdb in")) {
    tweet
  }
}
```

```graphql
{
  search_tweet(func: alloftext(tweet, "analyze data and it in graph graphdb")) {
    tweet
  }
}
```
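
To see why word order doesn't matter, it helps to think of the tokens as a set. The Python sketch below uses a deliberately simplified tokenizer (lowercasing, whitespace splitting, and a three-word stop list, with stemming left out), so it is an assumption-laden stand-in for the real pipeline rather than a copy of it.

```python
STOP_WORDS = {"and", "it", "in"}  # just the stop words in our search string


def search_tokens(search_string):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return {w for w in search_string.lower().split() if w not in STOP_WORDS}


queries = [
    "graph data and analyze it in graphdb",
    "graph analyze and it in graphdb data",
    "analyze data and it in graph graphdb",
]
token_sets = [search_tokens(q) for q in queries]

# Every reordering collapses to the same set of tokens.
print(all(tokens == token_sets[0] for tokens in token_sets))
```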

Now, let's move on to the next advanced text search feature of Dgraph: regular expression based queries.

Let's use them to find all the hashtags containing the following substring: `graph`.

## Regular expression search

[Regular expressions](https://www.geeksforgeeks.org/write-regular-expressions/) are a powerful way of expressing search patterns.
Dgraph allows you to search string predicates using regular expressions.
You need to set the `trigram` index on the string predicate to be able to perform regex-based queries.
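
To give a feel for what a `trigram` index works with, the sketch below splits a string into its trigrams, that is, every run of three consecutive characters. It's an illustration only: the `trigrams` helper is a name made up for this example, and how Dgraph uses the trigrams to narrow down candidates for a regular expression is not shown.

```python
def trigrams(value):
    """Return every substring of three consecutive characters, in order."""
    return [value[i:i + 3] for i in range(len(value) - 2)]


print(trigrams("Dgraph"))  # ['Dgr', 'gra', 'rap', 'aph']
```

Strings shorter than three characters produce an empty list in this sketch.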

Using regular expression based search, let's match all the hashtags that have this particular pattern: `Starts and ends with any characters of indefinite length, but with the substring graph in it`.

Here is the regular expression we can use: `^.*graph.*$`

Check out [this tutorial](https://www.geeksforgeeks.org/write-regular-expressions/) if you're not familiar with writing a regular expression.

Let's first find all the hashtags in the database using the `has()` function.

```graphql
{
  hash_tags(func: has(hashtag)) {
    hashtag
  }
}
```

{{% load-img "/images/tutorials/6/has-hashtag.png" "The hashtags" %}}

_If you're not familiar with using the `has()` function, refer to [the first tutorial]({{< relref "tutorial-1/index.md" >}}) of the series._

You can see that we have six hashtags in total, and four of them have the substring `graph` in them (ignoring case): `Dgraph`, `GraphQL`, `graphqlconf`, `GraphDB`.

To search predicates using regular expressions, we use the built-in function `regexp`.
This function takes two arguments: the first is the name of the predicate, and the second is the regular expression.

Here is the syntax of the `regexp` function: `regexp(predicate, /regular-expression/)`

Let's execute the following query to find the hashtags that have the substring `graph`.

Go to the query tab, type in the query, and click Run.

```graphql
{
  reg_search(func: regexp(hashtag, /^.*graph.*$/)) {
    hashtag
  }
}
```

Oops! We have an error!
It looks like we forgot to set the `trigram` index on the `hashtag` predicate.

{{% load-img "/images/tutorials/6/trigram-error.png" "The hashtags" %}}

Again, setting a `trigram` index is similar to setting any other string index. Let's do that for the `hashtag` predicate.

{{% load-img "/images/tutorials/6/set-trigram.png" "The hashtags" %}}

_Note: Refer to the [previous tutorial]({{< relref "tutorial-5/index.md" >}}) if you're not sure about creating an index on a string predicate._

Now, let's re-run the `regexp` query.

{{% load-img "/images/tutorials/6/regex-query-1.png" "regex-1" %}}

_Note: Refer to [the first tutorial]({{< relref "tutorial-1/index.md" >}}) if you're not familiar with the query structure in general._

Success!

But we only have the following hashtags in the result: `Dgraph` and `graphqlconf`.

That's because the `regexp` function is case-sensitive by default.

Add the character `i` at the end of the second argument of the `regexp` function to make it case-insensitive: `regexp(predicate, /regular-expression/i)`

{{% load-img "/images/tutorials/6/regex-query-2.png" "regex-2" %}}

Now we have the four hashtags with the substring `graph` in them.

Let's modify the regular expression to match only the hashtags that have the prefix `graph`.

```graphql
{
  reg_search(func: regexp(hashtag, /^graph.*$/i)) {
    hashtag
  }
}
```

{{% load-img "/images/tutorials/6/regex-query-3.png" "regex-3" %}}
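
If you'd like to experiment with these patterns outside Dgraph, the three regular expressions used above can be tried against the six hashtags from our dataset in plain Python. Python's `re` module stands in for Dgraph's regex engine here, which is an assumption that holds for simple patterns like these but not necessarily for every construct; the `matching` helper is a name made up for this example.

```python
import re

# The six hashtags stored by the sample mutation.
HASHTAGS = ["GraphQL", "GraphDB", "graphqlconf", "golang", "Databases", "Dgraph"]


def matching(pattern, flags=0):
    """Return the hashtags that match the pattern, in dataset order."""
    regex = re.compile(pattern, flags)
    return [tag for tag in HASHTAGS if regex.search(tag)]


# Case-sensitive substring match, like /^.*graph.*$/
print(matching(r"^.*graph.*$"))
# → ['graphqlconf', 'Dgraph']

# Case-insensitive substring match, like /^.*graph.*$/i
print(matching(r"^.*graph.*$", re.IGNORECASE))
# → ['GraphQL', 'GraphDB', 'graphqlconf', 'Dgraph']

# Case-insensitive prefix match, like /^graph.*$/i
print(matching(r"^graph.*$", re.IGNORECASE))
# → ['GraphQL', 'GraphDB', 'graphqlconf']
```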

## Summary

In this tutorial, we learned about the full-text search and regular expression based search capabilities in Dgraph.

Did you know that Dgraph also offers fuzzy search capabilities, which can be used to power features like `product` search in an e-commerce store?

Let's learn about fuzzy search in our next tutorial.

Sounds interesting?
Then see you all soon in the next tutorial. Till then, happy Graphing!

## What's Next?

- Go to [Clients]({{< relref "clients/index.md" >}}) to see how to communicate
  with Dgraph from your application.
- Take the [Tour](https://tour.dgraph.io) for a guided tour of how to write queries in Dgraph.
- A wider range of queries can also be found in the [Query Language]({{< relref "query-language/index.md" >}}) reference.
- See [Deploy]({{< relref "deploy/index.md" >}}) if you wish to run Dgraph
  in a cluster.

## Need Help

* Please use [discuss.dgraph.io](https://discuss.dgraph.io) for questions, feature requests and discussions.
* Please use [GitHub Issues](https://github.com/dgraph-io/dgraph/issues) if you encounter bugs or have feature requests.
* You can also join our [Slack channel](http://slack.dgraph.io).