Nishit Asnani - Jun 14, 2015:
My friends and I are making a chatbot as part of a 40-day summer project at our institute. Ten of those days are left now. So far, we’ve done an online course on Machine Learning and have also read major parts of the book on the Natural Language Toolkit (NLTK) in Python. The current bot works this way:
When it receives input, it checks an SQL database of sentences to see whether it has encountered an exact match before. If yes, it responds with the recorded response. If no, it tags the words of the sentence by part of speech and groups them in another SQL table, “Words.” We’ve assigned a weight to each English word by taking the reciprocal of its number of occurrences across many NLTK corpora and multiplying that by a value chosen according to the word’s part of speech. Using these weights, the bot finds the best match among the inputs it has encountered before and prints the corresponding output.
This is not giving us a good result, though.
Sounds like your flow is:
IF: exact match -> respond
ELSE: use SUM((corpora/response TF-IDF) * (word POS-weight)) to select response
There are two problems with this. First, the weights will be off if the corpus does not match your inputs (and yours doesn’t). Second, responses should not be chosen by the words in them. For example, the input “Who are you?” would typically not generate a response that includes the words “who,” “are,” or “you.”
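To make the flow above concrete, here is a minimal Python sketch of it: exact match first, then a weighted word-overlap score against previously seen inputs. The sentences, corpus counts, POS tags and POS weights below are all made-up placeholders (the real bot would read them from its SQL tables and the NLTK corpora), so treat this as an illustration of the idea rather than the actual implementation.

```python
import re
from collections import Counter

# Hypothetical stored inputs and responses (stand-in for the SQL table).
KNOWN = {
    "who are you": "I am a chatbot.",
    "what is your favourite food": "I like electricity.",
}
# Assumed corpus counts (stand-in for counts taken from NLTK corpora).
CORPUS_COUNTS = Counter({"who": 500, "are": 900, "you": 1200, "what": 800,
                         "is": 1500, "your": 600, "favourite": 20, "food": 50})
# Assumed part-of-speech weights and a tiny hand-made tag lookup.
POS_WEIGHT = {"NOUN": 2.0, "ADJ": 1.5, "OTHER": 1.0}
POS_TAG = {"food": "NOUN", "favourite": "ADJ"}


def tokenize(text):
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def word_weight(word):
    """Reciprocal of corpus frequency times the POS weight."""
    pos = POS_TAG.get(word, "OTHER")
    # Words missing from the counts are treated as if seen once.
    return POS_WEIGHT[pos] / max(CORPUS_COUNTS[word], 1)


def respond(text):
    words = tokenize(text)
    key = " ".join(words)
    if key in KNOWN:  # IF: exact match -> respond
        return KNOWN[key]
    # ELSE: score each stored input by the summed weight of shared words.
    best_key, best_score = None, 0.0
    for known in KNOWN:
        shared = set(words) & set(known.split())
        score = sum(word_weight(w) for w in shared)
        if score > best_score:
            best_key, best_score = known, score
    return KNOWN[best_key] if best_key else None


print(respond("Who are you?"))
print(respond("Which food is your favourite?"))
```

Note that the score compares the new input with stored inputs, not with the words of the stored responses.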
Nishit Asnani - Jun 14, 2015:
One of our seniors suggested that we should focus on making the bot learn, from a huge database, the kinds of sentences that can come as inputs and how to respond to them. We have no clue how to make that happen. How will the bot automatically learn this? Any suggestions on this?
10 days is not much time in the chatbot world, and if it were easy to build a bot from a big database, we would have many more bots. It can take a lot of compute power to process the words into a usable form. One way to explore this is to learn word vectors with word2vec’s skip-gram and CBOW models, trained with either hierarchical softmax or negative sampling, but that will not give you a chatbot: it produces word vectors, not responses. https://radimrehurek.com/gensim/models/word2vec.html
Suggestions:
In the time you have left, you need lots of inputs/responses. Convert the AIML set into your format for direct matches.
Expand your direct matches to a fuzzy set by using NLTK/WordNet to look up synonyms of the words in your input phrases, so that inputs with similar wording map to the same outputs.
When finished, you should also consider publishing/open-sourcing your code so others in the community can learn from your approach.
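The synonym suggestion above could be sketched roughly as follows. To keep the example self-contained, a small hand-written synonym table (`TOY_SYNONYMS`, purely hypothetical) stands in for WordNet; with NLTK and its WordNet data installed, the synonyms could instead be collected from `nltk.corpus.wordnet.synsets(word)` and each synset's `lemma_names()`. The response table is also invented for illustration.

```python
import re

# Hypothetical stand-in for a WordNet lookup: head word -> synonyms.
TOY_SYNONYMS = {
    "hello": {"hi", "howdy"},
    "movie": {"film", "picture"},
}

# Hypothetical direct-match table (for example, converted from an AIML set).
RESPONSES = {
    "hello": "Hello there!",
    "what is your favorite movie": "I enjoy anything with robots in it.",
}


def build_canonical(synonyms):
    """Map every synonym (and each head word) to one canonical head word."""
    canonical = {}
    for head, alternatives in synonyms.items():
        canonical[head] = head
        for alt in alternatives:
            # Keep the first mapping if a word appears under two heads.
            canonical.setdefault(alt, head)
    return canonical


def normalize(text, canonical):
    """Lowercase, drop punctuation, and replace words by their canonical form."""
    words = re.findall(r"[a-z']+", text.lower())
    return " ".join(canonical.get(w, w) for w in words)


def fuzzy_lookup(text, responses, canonical):
    """Direct match after both sides have been normalized the same way."""
    table = {normalize(k, canonical): v for k, v in responses.items()}
    return table.get(normalize(text, canonical))


CANONICAL = build_canonical(TOY_SYNONYMS)
print(fuzzy_lookup("Hi!", RESPONSES, CANONICAL))
print(fuzzy_lookup("What is your favorite film?", RESPONSES, CANONICAL))
```

Normalizing both the stored inputs and the new input keeps the lookup a plain dictionary match, so it fits the existing exact-match step without a separate scoring pass.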