

Loebner Prize Report 2017

The 2017 Loebner Prize has come to a close and so I thought I’d write about my experience on the day.

I arrived at Bletchley Park just before 10am, reported to the visitor’s reception and was met by the event manager who showed me to the building which had been set aside for this year’s contest. It was the first time an event manager had been allocated and the personal touch made the experience a pleasant one.

The contest was held in the same area as last year. The human confederates and chatbots were together in one room, the judge computers and public area were up a small flight of stairs nearby. I had a quick look round both areas, said hello to Bertie from the AISB and had a quick chat with Dr Wallace who was already there.

The public area consisted of the 4 judge computers, a large projection screen which displayed slides about the Loebner Prize and also 4 laptops set aside for visitors to use and interact with chatbots. There was also a cool NAO robot walking around interacting with people. On one laptop, Will Rayner had arrived and was setting up a copy of Uberbot. I had set up a link to the Turing Test part of Mitsuku’s website on a second laptop. On a third, I installed the old Loebner Prize judge program and a standalone copy of Mitsuku. The 4th one was left unused.

I then went into the private area and met Andrew Martin, this year’s organiser, and Nir Oran, the tech guy. Andrew showed me the laptop allocated to Mitsuku and I installed her successfully in a few minutes. Andrew was following the instructions in a text file to configure Rose and had already installed Midge with no problem. Will arrived in the room and began setting up Uberbot. The bots were connecting to a test server and the plan was to see if they worked before moving them to the live server. There was no more I could do for now, so I explored Bletchley Park until we were ready to test.

The organisers were soon ready to test. All bots/confederates showed green on the console, which indicated all was working ok and a few test messages were sent from the judge computers which seemed to arrive and be processed ok.

Once happy, we amended our config files to point to the live server, sent a few test messages again and waited for the contest to start.

After lunch had finished, the judges sat at their computers, the human confederates at theirs, and the contest started pretty much on time at 1:30pm.
I am unaware of who the judges were or their background, as there was no mention of them in the handout.
The large projector screen displayed the webcast for the public to watch. The screen of each judge was displayed in rotation for a few minutes at a time on it.

Here is a round by round account of how things went. You’ve probably seen the webcast, but I’ll include which judge was talking to which bot.


  [ # 1 ]

Round 1
Judge 1 - Midge
Midge performed flawlessly with the new protocol throughout the competition. I was watching some of its responses through the contest and was impressed. A definite one to watch for in future contests. Nice work Merlin.

Judge 2 - Mitsuku
Mitsuku was receiving and replying to the judge’s messages but these were not being displayed on the judge console. I was unaware of this until I saw the webcast. I restarted her and she missed the first 5 minutes or so of this round but then communicated with the judge ok. The failure appeared to be due to each round having both a “new round” message and a “start round” message. I had started Mitsuku before the current round had been started, so she thought it was round -1.

Judge 3 - Rose
Rose worked ok for the first few minutes and then began replying “huh?” to every input. Andrew did some work on Rose’s laptop and got her working again towards the last few minutes of the round.

Judge 4 - Uberbot
Uberbot failed to respond to any of the judge’s messages, Will tried to rectify the problem but was unsuccessful.

Round 2
Judge 1 - Mitsuku
Mitsuku performed well in this round with no technical hitches.

Judge 2 - Rose
A similar performance to the last round. Rose started ok then replied “huh?” to everything. Andrew managed to get it working again and it replied sensibly for most of the round but then started with “huh?” towards the end of the round.

Judge 3 - Uberbot
Will had made some changes and was looking at Uberbot along with Nir Oran. For this round, Uberbot replied “undefined” to everything. Towards the end of the round, Will and Nir found the problem. Due to last-minute changes to the new protocol, Uberbot was looking for a variable named “content”, but this had been changed to “contents” a week or so before the contest. Uberbot was working from an old copy of the GitHub repository.
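A bot could have tolerated the rename by accepting either field name when parsing a judge message. A minimal sketch in JavaScript, purely illustrative (everything about the message shape beyond the content/contents field is assumed, not taken from the actual protocol):

```javascript
// Sketch: tolerate either field name when parsing a judge message.
// "content"/"contents" are the names discussed above; any other
// structure in the message is assumed for illustration only.
function extractText(rawJson) {
  const msg = JSON.parse(rawJson);
  // Accept both the pre-rename and post-rename field names.
  const text = msg.contents !== undefined ? msg.contents : msg.content;
  if (text === undefined) {
    throw new Error("no content/contents field in message");
  }
  return text;
}

console.log(extractText('{"contents": "Hello bot"}')); // "Hello bot"
console.log(extractText('{"content": "Hi again"}'));   // "Hi again"
```

A one-line guard like this would have cost nothing and absorbed the spec change silently.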

Judge 4 - Midge
Another solid performance from Midge.

Round 3
Judge 1 - Uberbot
Now the code had been amended, Uberbot performed well and responded to the judge’s messages.

Judge 2 - Midge
No problems as usual.

Judge 3 - Mitsuku
Similar to round 1, Mitsuku received and replied to the judge’s messages but they were not displayed on the judge’s screen. I was unaware of this until Nir came in from the public room to say that a bot wasn’t responding. This appeared to be due to the organisers saying the round had started without having clicked “start round” on the controller. I restarted Mitsuku and had to wait until the judge started talking again. One of the rules says the bots must wait for the judges to start speaking before they can respond, but the judge had given up on Mitsuku and was talking to the human instead. After 10-15 minutes, the judge spoke again and Mitsuku completed the round with no further issues.

Judge 4 - Rose
Same as other rounds. It worked ok for a while before replying “huh?” to everything. Andrew reset it and Rose worked for a few minutes before replying “huh?” again.

Round 4

Judge 1 - Rose
Rose started ok but then replied “huh?”. Andrew got it working again and then the “huh?” responses happened again midway through the round. On investigating, Andrew found that the bridge program had stopped running. He restarted it and Rose completed most of the rest of the round but started with “huh?” again as the round came to a close. Strangely enough, it started working in the last minute or so, but Andrew may have restarted it.

Judge 2 - Uberbot
Judge 3 - Midge
Judge 4 - Mitsuku
All these bots worked ok with no hitches in round 4.

End of Contest
After the final round, there was a 20 minute break while the judges worked out their scores and the results were collated. The scoring system has changed: each judge ranked each bot from 1 (worst) to 4 (best), so the maximum possible score was 16.

The results were announced by the NAO robot:

1 - Mitsuku (13 points)
2 - Midge (12 points)
3 - Uberbot (8 points)
4 - Rose (7 points)

Two of the judges ranked Mitsuku in top spot. The other two gave top spot to Midge and Uberbot.
I received my medal, said a few words and the contest ended.

There was a TV crew filming for a documentary about artificial intelligence and a reporter from


  [ # 2 ]

Personal thoughts

The message-by-message new protocol worked well and is definitely a lot better than character by character, where the entrants had to guess when the judge’s input had ended. The implementation could be made a lot simpler though. According to the AISB, the majority of this year’s entries used the old protocol. I would like to see a simple text file containing the judge’s messages and the chatbot’s replies rather than messing about with sockets; otherwise it’s just going to be the same people entering each year. Let’s make it accessible to all.

This should be an AI contest rather than a programming one, and the simpler the protocol can be, the better. The lengths Rose had to go to just to receive a message and reply to it were always tempting fate.

While I am naturally overjoyed to have won my 3rd Loebner Prize, it was pretty much a Mitsuku v Midge contest, at least for the first 2 rounds until Uberbot started working. Any new protocol is bound to have teething troubles, but there are a few things that could be done in future to reduce the risk.

The protocol was being amended right up until 2 days before the contest. This crippled Uberbot, as Will was working to an old spec. The bridge program wasn’t available until a few days before. I think we need some kind of cut-off date for amending the interface, apart from resolving a show-stopping issue.

One of the points Will raised on the day was to have a dummy run-through of the contest. Rather than a quick “hello”, let’s run a quick 4-round contest to check things work between rounds.

Also, when the bots stopped functioning, we had to wait for the judges to talk before the bots could respond. It would be useful if we were allowed to make the bots send a “hi” message to the judge to prompt them into talking again rather than having to wait 5 or 10 minutes for the judge.

Bertie was saying in his closing remarks that the money from Hugh’s estate has stopped now and the AISB are looking for a sponsor for the contest for future years but he hopes it will continue in one form or another.


  [ # 3 ]

Thanks for taking the time to write all this down. It sounds like things were worse than they looked, and only looked better because so many people behind the screens were fixing problems on the fly. The last-minute change from “content” to “contents” is particularly worth a facepalm.

I agree with a lot of your points. There are some unnecessary steps in the protocol, like “new round” and “start round” being separate buttons and requiring manual confirmation on the client computers. I think most of these things can be improved without tampering with the core functionality. I am also quite in favour of ditching the new-to-old bridge, as it is just asking for twice as much trouble.

I noticed code in the bridge that hooks up node.js’ file system to call mkdir() for the old LPP. That same file system module is also capable of reading, creating, updating and renaming text files, so that’s what I would advise integrating instead. Merlin’s entry (or anyone else’s) could still forego this and communicate directly with the server, so that needn’t be an objection.

I also agree with Will’s suggestion. Only testing the setup with “hello” played a part in the miscommunication with my entry going unnoticed last year (as per my instructions, I should note). It isn’t until communication traffic gets heavy that problems arise, like when the webcast is hooked up, messages start crossing, or the operating system starts to “optimise” data traffic.

To end on a positive note, the setup for the public really sounds like a class act.


  [ # 4 ]
Steve Worswick - Sep 22, 2017:

Bertie was saying in his closing remarks that the money from Hugh’s estate has stopped now and the AISB are looking for a sponsor for the contest for future years but he hopes it will continue in one form or another.

I’m going to add something for the sake of frankness: as Turing Tests are frowned upon in the field of A.I., I’m not optimistic about the chances of finding a new sponsor unless the focus of the contest is directed away from “pretending to be human” (and failing) towards “being proficient at natural conversation”. The latter could at least interest a number of chatbot companies, both as sponsors and participants, and is also the focus of another contest sponsored by Facebook. It would still serve Alan Turing’s “what if computers could converse as well as humans” argument, but without a call for trickery.


  [ # 5 ]

The story of MIDGE (My Intelligent Digital Grandmother Emulator)

Thanks Steve and Don for your kind words.
I believe MIDGE is the first JavaScript bot to enter the competition and make it to the finals. 
It is based on technology similar to what I use for Skynet-AI. I included new modules (some of which never got triggered), and re-did the personality for the competition. This was my second time in the finals (each time with a different bot).

We all suffered a bit because of the evolving interface. I passed the following along to the AISB:

Software used in each round should be frozen at least 2 weeks before the round so developers have a chance to test.
In the finals this year, there was a change in the protocol between preliminary and finals.
I did not realize it until testing (the weekend before the final).
It cost me a day to debug what happened (a change from ‘contents’ to ‘content’ in the JSON message meant the bot did not get messages).
I had hoped to be adding in additional functions (not debugging) at that time.

As Steve knows, during the second round, I thought Midge had stopped responding because of a bug I thought I had fixed the weekend before. It turns out that Will had run into the same bug during the contest.

If the software used in the contest is frozen soon enough before the beginning of the round, most developers will have a chance to test and debug. Although a full test contest before the actual contest would uncover issues, it may not be needed if the software is available early enough. Everyone was under extra stress because the LPP github was changing even after some of us had submitted our bots.

The message protocol is better, and the interaction between server and bot could be even simpler. The server always knows who is talking to whom and the round number; there is no need to have the client keep or send that info.
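As a hypothetical before/after, with invented field names (not the actual LPP protocol), the payload could shrink to just the text:

```javascript
// Hypothetical illustration of the simplification suggested above.
// Field names are invented for the example, not the real LPP protocol.

// Verbose style: client must track and echo session bookkeeping.
const verboseMessage = {
  round: 3,
  judge: "judge-2",
  bot: "MIDGE",
  contents: "What is your favourite colour?"
};

// Simplified style: the server already knows round/judge/bot,
// so the client only ever handles the text itself.
const minimalMessage = { contents: "What is your favourite colour?" };

console.log(JSON.stringify(minimalMessage));
// {"contents":"What is your favourite colour?"}
```

Dropping the bookkeeping fields removes a whole class of client-side errors like the round -1 problem described earlier in the thread.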

The Loebner Prize competition is unique (the requirement to act like a human), and although it has caused me to create new bots for the competition instead of using Skynet-AI (which never tries to act human), it fills a niche and is still the best open competition Turing test.

If you look at other contests like “The Conversational Intelligence Challenge”, besides requiring that you open-source a winning entry, early results have been poor. Only about 60% of the bot interactions have more than 2 volleys. Although they provide a starter paragraph, few conversations use it as context. It will be interesting to see the quality of the finalists.

Although we could use more diverse competitions, my fear is that we will end up with fewer, and that would be bad for everyone.




  [ # 6 ]

Please excuse me for raising an old topic, but this seemed the most appropriate place to post this:
It’s a podcast about the 2017 Loebner Prize with an interview with one of last year’s judges, sketching the proceedings from his point of view, and it also interviews Bruce Wilcox who shares his views on the contest and machine learning in general. I found it quite interesting.


  [ # 7 ]

Thanks for that, Don. I found the podcast informative and entertaining.


  [ # 8 ]

Thanks Don,

Bruce Wilcox was brilliant in the podcast.

