Thursday, November 24, 2016

Basic word prediction in c#


In this little post I will go through a small and very basic prediction engine written in C# for one of my projects.
Basically what it does is the following:
  • It will collect data in the form of lists of strings
  • Given an input, it will give back a list of predictions of the next item. In falling probability order.
So no rocket science, but a good and basic prediction engine that can be build upon depending on what scenarios you want to solve.


First, some unit tests:


[TestMethod]
public void PredictorTest_NoMatch()
{
    var target = new Predictor();
    var actual = target.Predict("The rabbit-hole went straight on like a tunnel for some way");
    Assert.AreEqual(0, actual.Count);
}
[TestMethod]
public void PredictorTest_SingleMatch()
{
    var target = new Predictor();
    target.Predict("Alice opened the door and found that it led into a small passage");
    var actual = target.Predict("Alice opened the door");
    Assert.AreEqual(1, actual.Count);
    Assert.AreEqual("and", actual.First());
}
[TestMethod]
public void PredictorTest_SingleMatch_OtherNoise()
{
    var target = new Predictor();
    target.Predict("Alice opened the door and found that it led into a small passage");
    target.Predict("Alice took up the fan and gloves");
    var actual = target.Predict("Alice opened the door");
    Assert.AreEqual(1, actual.Count);
    Assert.AreEqual("and", actual.First());
}
[TestMethod]
public void PredictorTest_SingleMatch_TwoResultsInOrder()
{
    var target = new Predictor();
    target.Predict("Alice thought she might as well wait");
    target.Predict("Alice thought this a very curious thing");
    target.Predict("Alice thought she had never seen such a curious");
    var actual = target.Predict("Alice thought");
    Assert.AreEqual(2, actual.Count);
    Assert.AreEqual("she", actual.First());
    Assert.AreEqual("this", actual.Last());
}

So nothing too fancy. You provide it with input and as expected it returns a list of possible next values. Examples are from Alice's Adventures in Wonderland by Lewis Carroll, the Project Gutenberg edition.


Predictor class

public class Predictor
{
    private readonly Dictionary<string, Dictionary<string, int>> _items = new Dictionary<string, Dictionary<string, int>>();
    private readonly char[] _tokenDelimeter = {' '};
    public List<string> Predict(string input)
    {
        var tokens = input.Split(_tokenDelimeter, StringSplitOptions.RemoveEmptyEntries);
        var previousBuilder = new StringBuilder();
        Dictionary<string, int> nextFullList;
        foreach (var token in tokens)
        {
            nextFullList = GetOrCreate(_items, previousBuilder.ToString());
            if (nextFullList.ContainsKey(token))
                nextFullList[token] += 1;
            else
                nextFullList.Add(token, 1);

            if (previousBuilder.Length > 0)
                previousBuilder.Append(" ");
            previousBuilder.Append(token);
        }
        nextFullList = GetOrCreate(_items, previousBuilder.ToString());
        var prediction = (from x in nextFullList
            orderby x.Value descending
            select x.Key).ToList();

        return prediction;
    }
    private static T GetOrCreate<T>(Dictionary<string, T> d, string key)
    {
        if (d.ContainsKey(key))
        {
            return d[key];
        }
        var t = Activator.CreateInstance<T>();
        d.Add(key, t);
        return t;
    }
}

It's not smart, and it requires having seen quite a lot of samples to work well.
An additional improvement could be to add storage of previous/next token combinations so that if the full search doesn't yield any result, a guess could be made based on only the previous word. But I guess that is up to you.

All code provided as-is. This is copied from my own code-base, May need some additional programming to work.



1 comment: