Evaluating and updating messaging triggers

This page is designed to answer (or help you answer) the following questions:

  1. How well are the triggers configured on my study working?

    1. Are they catching things they shouldn’t? (false positives)

    2. Are they missing things they should have caught? (false negatives)

  2. What configuration changes can help the triggers perform better?

    1. Should I add triggers to the study?

    2. Should I make configuration changes to the study’s existing triggers?

    3. Should we switch any keyword triggers into NLU triggers, or vice versa?

    4. For NLU triggers, should we change anything about the model’s training or confidence thresholds?

  3. If I make changes to the study’s messaging triggers or NLU models, what will the impact be?

    1. Will it solve the issues?

    2. Will it create different issues?

Message Classification App

We’ve created an app at http://message-classification.waytohealth.org/ which can help with much of this process. Someone from the W2H team can give you access, and then you can log in with your Penn Medicine credentials.

Within the app, the W2H team can import “message sets”, where you can view messages received by w2h and annotate whether w2h classified them correctly and, if not, how they should have been classified.

How to use the app

  • Vote Approve / Reject / It’s complicated by selecting rows and then optionally entering what the message actually should have been classified as.

  • You can also vote by filling in the Evaluation field with Approve / Reject / It’s complicated and the classification column with your opinion of what the message actually meant. If you want, you can also fill in notes and ideal response, or flag a message as “priority to solve”.

  • By default the app shows one column for how w2h classified the message. Expand that column group with the > button to see more details, including how w2h responded, which NLU model it used, and what the confidence threshold was.

  • By default the app groups rows by w2h’s classification, but you can group/sort/filter in whatever ways are most helpful to you.

    • To sort: click on a column header

    • To group: either click the 3 lines/menu button on the column header and click “Group by <Column Name>”, or drag the column header into the top of the table where grouped columns are shown.

    • To filter: Click the 3 lines/menu button on the column header, and then the filter button. You can either select/unselect the checkboxes, or type in what you want to filter for and hit Enter.

    • To show/hide columns: Click the 3 lines/menu button on the column header and then the vertical ||| menu. Select or unselect the columns you want displayed.

  • You can change the width of any column by dragging the border just like in Excel. You can also right-click anywhere in the table and click “Autosize all columns” to fit the columns to what’s currently visible in the table.

Updating an existing program

Here, broadly, is the process for evaluating and updating message triggers on a live Way to Health program.


1. Import data

The first step is to import a set of text messages from Way to Health into the Message Classification tool. Currently, this step is limited to app admins, i.e. the Way to Health team. Typically we import recent messages (e.g. the last 90 days) from a study that weren’t part of a conversation (i.e. that went through the normal messaging triggers), but this process could also be used to evaluate the performance of a multiple choice question or other scenarios.
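In spirit, the selection is just a date window plus a conversation flag. Here’s a minimal Python sketch of that filter; the field names (received_at, in_conversation) are hypothetical stand-ins, not the actual W2H export schema.

    from datetime import datetime, timedelta, timezone

    def select_messages_for_import(messages, days=90):
        """Keep recent inbound messages that went through the normal
        messaging triggers, i.e. were not part of a conversation.
        Field names are hypothetical, not the real W2H schema."""
        cutoff = datetime.now(timezone.utc) - timedelta(days=days)
        return [
            m for m in messages
            if m["received_at"] >= cutoff and not m["in_conversation"]
        ]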

2. Evaluate system classifications

The next step is to paint the spreadsheet green/yellow/red by evaluating whether w2h classified and handled each incoming message correctly.

The point here isn’t to evaluate whether the patient message made sense but whether Way to Health handled it correctly. So if a patient sent in gibberish and Way to Health said “I don’t know what you mean”, that should be flagged as “Approve”.

There can be a lot of subjectivity here. If a patient said something that makes sense from their standpoint (e.g. “A, B, and C” to a multiple choice question) but which understandably didn’t work with the algorithm, you could label it as “It’s complicated” if you want to go back and review it, as “Approve” if you’re set on keeping the algorithm changes to a minimum, or as “Reject” if you’re focused on getting the patient experience as smooth as possible. It really comes down to what your goals are.

You don’t necessarily have to do this comprehensively across the whole dataset. If you’re trying to fix a certain keyword trigger, I recommend scanning through the other classifications (especially “Any other message from a participant”) to find messages that should have matched that intent, but you don’t need to evaluate the entire dataset just to fix one piece.

Sometimes it’s useful to review messages in the context of the patient’s inbox to see what their intent actually was.
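Once you’ve evaluated a batch of rows, the annotations can be tallied back into the false positive / false negative framing from the top of this page. A minimal sketch, assuming each exported row carries hypothetical system_intent and correct_intent fields:

    from collections import Counter

    def tally_errors(rows):
        """Count false positives (trigger fired but shouldn't have) and
        false negatives (trigger should have fired but didn't) per intent.
        Field names are hypothetical, not the actual export schema."""
        false_positives, false_negatives = Counter(), Counter()
        for row in rows:
            if row["system_intent"] != row["correct_intent"]:
                false_positives[row["system_intent"]] += 1
                false_negatives[row["correct_intent"]] += 1
        for intent in sorted(set(false_positives) | set(false_negatives)):
            print(f"{intent}: {false_positives[intent]} FP, {false_negatives[intent]} FN")
        return false_positives, false_negatives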

3. Manually classify messages

This step will usually be done in conjunction with the previous one.

Once we’ve identified that there are messages that weren’t handled ideally, the question is, how should the system have classified and handled them? This can manifest in a few ways:

  • Messages were classified as one intent when they should have been another (e.g. a patient sent in “thanks but that doesn’t answer my question” which we misclassified as “thanks” and responded with a friendly “You’re welcome!”)

  • There’s a single category which could be broken into multiple (e.g. “cuff” could be separated into missing cuff, broken cuff, and questions about when they’ll be given a cuff)

  • Messages weren’t handled by the system at all and need to be categorized.

Like the previous step, this doesn’t have to be done comprehensively across the whole dataset. If you look through the unclassified messages and find that half of them can fit into neat buckets, we can leave the other half unclassified to start while we work on what’s easier to classify.
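One way to see whether a category should be split, or where the unhandled messages could land, is to group the annotated messages by the reviewer-assigned intent. A small sketch, using the same hypothetical fields as above:

    from collections import defaultdict

    def bucket_by_correct_intent(rows):
        """Group message texts by the human-assigned intent, e.g. to see
        whether "cuff" splits cleanly into missing / broken / when-will-I-get-one."""
        buckets = defaultdict(list)
        for row in rows:
            buckets[row.get("correct_intent") or "unclassified"].append(row["message"])
        return buckets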

4. Prioritize what to fix

Not everything identified as “Reject” in the previous steps necessarily needs to be fixed. We recognize that algorithms can only get so far - as long as the patient conversation worked out OK, we don’t need to fix everything. But if the same mistake happens many times, it’s probably worth prioritizing a fix.

Prioritizing can be done in the app by selecting the “Priority to fix” checkbox. Sometimes notes are useful to explain why a message is (or isn’t) a priority.
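To surface the mistakes that recur, and are therefore better candidates for the “Priority to fix” flag than one-off oddities, you can count how often each confusion appears. A sketch under the same hypothetical schema as above:

    from collections import Counter

    def repeated_mistakes(rows, min_count=3):
        """Return (system_intent, correct_intent) confusions occurring at
        least min_count times, most frequent first."""
        pairs = Counter(
            (r["system_intent"], r["correct_intent"])
            for r in rows
            if r["system_intent"] != r["correct_intent"]
        )
        return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]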

The analysis to this point (steps 2-4) can be done by a clinical or research team member without much technical w2h knowledge. The steps below (5-7) require a Way to Health analyst or developer with enough knowledge to understand why the system is currently behaving the way it is, and what changes might induce it to behave differently.

5. Determine config changes

By examining the messages, you should be able to identify patterns in what was captured and what was missed. Depending on the types of messages that are being categorized inaccurately and the underlying issues, the problem can be resolved in various ways (the sketch after this list shows how these levers fit together):

  • Creating new triggers (whether keyword, NLU, regex, or others) to capture messages that weren’t classified originally.

  • Changing the order of triggers to prioritize some triggers over others.

  • Adding keywords to an existing trigger.

  • Tweaking the confidence threshold of an NLU trigger.

  • Adding utterances (either positive or negative) to an NLU model in CLU or LUIS.
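To make these levers concrete, here is an illustrative Python sketch of how ordered triggers might classify a message. None of the names or structures below come from the actual W2H implementation; the point is only to show how priority order, keywords, regexes, and confidence thresholds interact.

    import re

    # Illustrative trigger list, checked in priority order (made up, not W2H's).
    TRIGGERS = [
        {"type": "keyword", "intent": "stop",        "keywords": {"stop", "unsubscribe"}},
        {"type": "regex",   "intent": "bp_reading",  "pattern": re.compile(r"\b\d{2,3}/\d{2,3}\b")},
        {"type": "nlu",     "intent": "cuff_broken", "threshold": 0.75},
    ]

    def classify(message, nlu_scores):
        """Return the intent of the first matching trigger. nlu_scores maps
        intent -> model confidence for this message."""
        words = set(message.lower().split())
        for t in TRIGGERS:
            if t["type"] == "keyword" and words & t["keywords"]:
                return t["intent"]
            if t["type"] == "regex" and t["pattern"].search(message):
                return t["intent"]
            if t["type"] == "nlu" and nlu_scores.get(t["intent"], 0.0) >= t["threshold"]:
                return t["intent"]
        return "unclassified"  # falls through to "Any other message from a participant"

Each change in the list above maps onto this sketch: reordering TRIGGERS, adding to a keyword set, adjusting a threshold. (Adding utterances to a CLU/LUIS model changes the nlu_scores the model produces rather than the trigger config itself.)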

6. Test config changes

For now, this testing is done mentally: look through the messages and check which ones did or didn’t contain a keyword, or what NLU confidence score each one received.

In the future, we intend to add this functionality to the app, so you can see across the entire dataset which messages would have been classified differently based on config changes.

With either method, what you want to review is the following (a scriptable version is sketched after this list):

  • Are there messages that we flagged as wrong that would now be classified correctly? → Yay!

  • Are there messages that we flagged as wrong that aren’t helped by our config change? → This might be fine - an 80% fix might be good enough. But maybe a different configuration change would solve more of them.

  • Are there other messages which the system classified correctly but which would be misclassified as a result of this configuration change? → Weigh these regressions against the fixes.
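Until that’s built into the app, the same comparison can be scripted by hand. A sketch, reusing the style of classify function from step 5 and the hypothetical annotated rows from earlier:

    def compare_configs(rows, old_classify, new_classify):
        """Summarize how a proposed config change shifts classifications
        across an annotated message set."""
        fixed, still_wrong, newly_broken = [], [], []
        for row in rows:
            old = old_classify(row["message"], row["nlu_scores"])
            new = new_classify(row["message"], row["nlu_scores"])
            correct = row["correct_intent"]
            if old != correct and new == correct:
                fixed.append(row)        # the change helped
            elif old != correct:
                still_wrong.append(row)  # might be fine; 80% can be good enough
            elif new != correct:
                newly_broken.append(row) # regression: weigh against the fixes
        print(f"fixed: {len(fixed)}, still wrong: {len(still_wrong)}, "
              f"newly broken: {len(newly_broken)}")
        return fixed, still_wrong, newly_broken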

7. Deploy changes

Once we’ve determined the changes that we want to make and confirmed that they won’t have unintended side effects, we can make those changes. This could mean making the planned tweaks to Way to Health production, or changing something in the CLU configuration and publishing that. (More details on the CLU process are at https://waytohealth.atlassian.net/wiki/spaces/BG/pages/2427387919 .)