Sunday, 30 June 2019

Leveraging complex data to build advanced search applications with Azure Search

Data is rarely simple. Not every piece of data we have fits nicely into a single Excel worksheet of rows and columns. Data has many diverse relationships, such as the multiple locations and phone numbers for a single customer, or the multiple authors and genres of a single book. Relationships are typically even more complex than this, and as we start to leverage AI to understand our data, the additional learnings we get only add to that complexity.

For that reason, expecting customers to flatten their data so it can be searched and explored is often unrealistic. We heard this often, and it quickly became our most requested Azure Search feature. Because of this, we were excited to announce the general availability of complex types support in Azure Search. In this post, I want to take some time to explain what complex types add to Azure Search and the kinds of things you can build using this capability.


Azure Search is a platform as a service that helps developers create their own cloud search solutions.

What is complex data?


Complex data is data that includes hierarchical or nested substructures that do not break down neatly into a tabular rowset. For example, a book with multiple authors, where each author can have multiple attributes, can't be represented as a single row of data unless there is a way to model the authors as a collection of objects. Complex types provide this capability, and they can be used when the data cannot be modeled in simple field structures such as strings or integers.
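To make the book example concrete, here is a minimal Python sketch. The record, its field names (`Authors`, `Name`, `Country`) and the `flatten` helper are all hypothetical and only illustrate the point: flattening a nested record produces one row per author, with the book-level values repeated on each row.

```python
# Hypothetical book record; all names and values are illustrative only.
book = {
    "id": "1",
    "Title": "An Example Book",
    "Genres": ["travel", "memoir"],
    "Authors": [
        {"Name": "Author One", "Country": "CA"},
        {"Name": "Author Two", "Country": "US"},
    ],
}


def flatten(record):
    """Expand a nested record into one flat row per author."""
    rows = []
    for author in record["Authors"]:
        rows.append({
            "id": record["id"],
            "Title": record["Title"],
            # The genre list has to be squeezed into a single string column.
            "Genres": ",".join(record["Genres"]),
            "AuthorName": author["Name"],
            "AuthorCountry": author["Country"],
        })
    return rows


rows = flatten(book)
for row in rows:
    print(row)
```

The flat form needs two rows for one book and duplicates the title and genres on both, while the nested form keeps each author's attributes together in a single document.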

Complex types applicability


At Microsoft Build 2019, we demonstrated how complex types could be leveraged to build out an effective search application. In the session we looked at the Travel Stack Exchange site, one of the many online communities supported by Stack Exchange.

The Stack Exchange data was modeled in a JSON structure to allow easy ingestion into Azure Search. If we look at the first post made to this site and focus on the first few fields, we see that all of them can be modeled using simple data types, including tags, which can be modeled as a collection (an array) of strings.

{
    "id": "1",
    "CreationDate": "2011-06-21T20:19:34.73",
    "Score": 8,
    "ViewCount": 462,
    "BodyHTML": "<p>My fiancée and I are looking for a good Caribbean cruise in October and were wondering which
    "Body": "my fiancée and i are looking for a good caribbean cruise in october and were wondering which islands
    "OwnerUserId": 9,
    "LastEditorUserId": 101,
    "LastEditDate": "2011-12-28T21:36:43.91",
    "LastActivityDate": "2012-05-24T14:52:14.76",
    "Title": "What are some Caribbean cruises for October?",
    "Tags": [
        "caribbean",
        "cruising",
        "vacations"
    ],
    "AnswerCount": 4,
    "CommentCount": 4,
    "CloseDate": "0001-01-01T00:00:00",

However, as we look further down this dataset, we see that the data quickly gets more complex and cannot be mapped into a flat structure. For example, there can be numerous comments and answers associated with a single document. Even Votes is modeled here as a complex type, although it could technically have been flattened into simple fields at the cost of extra work to transform the data.

"CloseDate": "0001-01-01T00:00:00",
    "Comments": [
        {
            "Score": 0,
            "Text": "To help with the cruise line question: Where are you located? My wife and I live in New Orlea
            "CreationDate": "2011-06-21T20:25:14.257",
            "UserId": 12
        },
        {
            "Score": 0,
            "Text": "Toronto, Ontario. We can fly out of anywhere though.",
            "CreationDate": "2011-06-21T20:27:35.3",
            "UserId": 9
        },
        {
            "Score": 3,
            "Text": "\"Best\" for what?  Please read [this page](http://travel.stackexchange.com/questions/how-to
            "UserId": 20
        },
        {
            "Score": 2,
            "Text": "What do you want out of a cruise? To relax on a boat? To visit islands? Culture? Adventure?
            "CreationDate": "2011-06-24T05:07:16.643",
            "UserId": 65
        }
    ],
    "Votes": {
        "UpVotes": 10,
        "DownVotes": 2
    },
    "Answers": [
        {
            "IsAcceptedAnswer": "True",
            "Body": "This is less than an answer, but more than a comment…\n\nA large percentage of your travel b
            "Score": 7,
            "CreationDate": "2011-06-24T05:12:01.133",
            "OwnerUserId": 74
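As a rough sketch of how such a document could be described to Azure Search, the Python dictionary below mirrors the JSON shape of an index definition, using `Edm.ComplexType` for the single Votes object and `Collection(Edm.ComplexType)` for Comments and Answers. The index name `travel-posts`, the choice of attributes, and the subset of fields are assumptions made for illustration, not the schema used in the Build session. The `field_paths` helper is also hypothetical; it simply lists the slash-separated paths by which sub-fields are addressed.

```python
# Assumed index definition covering a subset of the sample document's fields.
index = {
    "name": "travel-posts",  # hypothetical index name
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "Title", "type": "Edm.String", "searchable": True},
        {"name": "Tags", "type": "Collection(Edm.String)", "filterable": True},
        {
            "name": "Comments",
            "type": "Collection(Edm.ComplexType)",
            "fields": [
                {"name": "Score", "type": "Edm.Int32", "filterable": True},
                {"name": "Text", "type": "Edm.String", "searchable": True},
                {"name": "UserId", "type": "Edm.Int32", "filterable": True},
            ],
        },
        {
            "name": "Votes",
            "type": "Edm.ComplexType",
            "fields": [
                {"name": "UpVotes", "type": "Edm.Int32", "filterable": True},
                {"name": "DownVotes", "type": "Edm.Int32", "filterable": True},
            ],
        },
        {
            "name": "Answers",
            "type": "Collection(Edm.ComplexType)",
            "fields": [
                # The sample stores this flag as the string "True".
                {"name": "IsAcceptedAnswer", "type": "Edm.String", "filterable": True},
                {"name": "Body", "type": "Edm.String", "searchable": True},
                {"name": "Score", "type": "Edm.Int32", "filterable": True},
                {"name": "OwnerUserId", "type": "Edm.Int32", "filterable": True},
            ],
        },
    ],
}


def field_paths(fields, prefix=""):
    """Yield a slash-separated path for every leaf field."""
    for field in fields:
        path = prefix + field["name"]
        if "fields" in field:
            yield from field_paths(field["fields"], path + "/")
        else:
            yield path


paths = list(field_paths(index["fields"]))
print(paths)
```

Paths such as `Comments/Text` and `Answers/OwnerUserId` are the names that queries and filters would use to reach into the nested structure.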

All of this data is important to the search experience. For example, you might want to:

◈ Search for and highlight phrases not only in the original question, but also in any of the comments.

◈ Limit documents to those where an answer was provided by a specific user.

◈ Boost certain documents higher in the search results when they have a higher number of up votes.
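The first two scenarios can be sketched as a search request body. This is a minimal illustration that assumes the field names from the sample document above and assumes those fields were marked searchable or filterable in the index; `build_search_body` is a hypothetical helper, and highlighting and boosting are left out. The `searchFields` value uses slash-separated paths to reach sub-fields, and the filter uses an OData `any` expression over the Answers collection.

```python
import json


def build_search_body(phrase, answer_owner_id=None):
    """Build a request body for a search over questions and their comments."""
    body = {
        "search": phrase,
        # Look in the question itself and in the text of every comment.
        "searchFields": "Title, Body, Comments/Text",
    }
    if answer_owner_id is not None:
        # Keep only documents where at least one answer is by this user.
        body["filter"] = (
            f"Answers/any(a: a/OwnerUserId eq {int(answer_owner_id)})"
        )
    return body


body = build_search_body("caribbean cruise", answer_owner_id=74)
print(json.dumps(body, indent=2))
```

A body like this would be sent in a POST to the index's search endpoint; the endpoint URL, API version, and API key are omitted here because they depend on the service.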

In fact, we could even improve on the existing Stack Exchange search interface by leveraging Cognitive Search to extract key phrases from the answers and supply them as autocomplete suggestions as the user types in the search box.

All of this is now possible because not only can you map this data to a complex structure, but the search queries can support this enhanced structure to help build out a better search experience.
