Hi,
First, thanks for the site and this forum, it's an invaluable source of chatbot information (this is my first post here, and I thought it's the best place for this "thanks").
I have recently created a chatbot program and implemented various algorithms that can be used to generate chatbot output based on user input. Some algorithms generate fewer possible responses which are more accurate, while others generate more responses which are less accurate (but they still sometimes surprise me very positively).
I'd like to compare these algorithms and then build an optimal chain of them, configured with the most suitable options. I have dug through some articles on chatbot evaluation methods (found at CiteSeer as well as on some university sites), and what struck me is that all these methods are subjective. All of them are more or less based on the Turing test (with a specific scenario or not), and they all rely on user experience, which is unstable (even the same person can have a different experience depending on mood, etc.).
I thought about an algorithm that compares bot answers to model answers using some kind of "distance" (cosine, Levenshtein, whatever): I would have a test scenario with questions and model answers, and run every algorithm against the same scenario. Unfortunately, I didn't find such a method described anywhere, and I don't think it's that visionary. So I started to think that such methods don't exist because they wouldn't be feasible for this kind of comparison: the nature of the problem is so indefinite that it's hard to tell programmatically whether a response is good or not (maybe that's why my less accurate algorithms sometimes amuse me).
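To make it concrete, here is a rough sketch of what I have in mind, in Python. The `bot` callable, the scenario format, and the bag-of-words cosine are just my own assumptions for illustration, not anything I found described in the literature:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two answers, treated as bags of lowercase words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def score_algorithm(bot, scenario):
    """Average similarity of the bot's answers to the model answers over one scenario.

    `bot` is assumed to be a callable taking a question and returning a response;
    `scenario` is a list of (question, model_answer) pairs.
    """
    scores = [cosine_similarity(bot(question), model) for question, model in scenario]
    return sum(scores) / len(scores)

# Hypothetical usage: run each response-generation algorithm on the same scenario
# and compare the resulting scores.
scenario = [
    ("What is your name?", "My name is Bot."),
    ("How are you today?", "I am fine, thank you."),
]
# for name, algo in [("strict", strict_algo), ("loose", loose_algo)]:
#     print(name, score_algorithm(algo, scenario))
```

Levenshtein (or any other string distance) could be dropped in instead of the cosine function; the point is just that the score is computed automatically against fixed model answers, so the comparison between algorithms is repeatable.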
So, are there any standard numerical/statistical chatbot evaluation methods similar to the one described above? If not, maybe I'll try to formulate one :).
Thanks,
Seweryn