In an existing project, a system called YouTute streams video of tutorials to students, who usually want to view short segments devoted to specific topics. In the long run, meeting this need requires an automated system for segmenting the videos by topic. At present, we are developing and evaluating computational methods for identifying topic boundaries in tutorial dialogue, and initially we need transcripts to do this. In the future, we expect to be able to use automated speech recognition, but a corpus of human-transcribed dialogues is an essential resource in developing that capability.
Who am I?:
John Lee and Johanna Moore
How is it novel? What is exciting about it?:
No such corpus exists for tutorial dialogue (or rather tutorial meetings, since there are more than two participants in these settings). There are corpora of similar data for other kinds of meetings, which are useful but ultimately not adequate to address the issues specific to tutorials. One value of this project will be to allow comparison between tutorial and other meetings at a fine level of detail.
What will I do next? What opportunities will it open up?:
The outcome will be immediately important for developing algorithms for topic segmentation. We will seek further funding to develop this work in the context of extending the use of the YouTute system, as well as more generally for addressing issues in automated analysis of a range of types of meetings.
What constitutes success? How risky is it?:
Initial success is simply the creation of the corpus. This is not very risky. Development of effective segmentation algorithms is the real goal. There has been good progress on this in the context of other kinds of meetings, but the technology is still limited. We hope (but can't guarantee) that we will be able to take the techniques further. The main risk is perhaps that tutorial dialogue may turn out to present different and more difficult challenges than other types of meeting.
What resources do I bring to the project?:
We currently have a PhD student working on the segmentation project, and hence this proposal seeks support only for the transcription of a reasonable corpus of dialogues.
What resources and expertise do I need?:
We hope to transcribe 30-50 hours of recorded tutorials, using casual but highly skilled transcribers, for which we request £5,000.
What shared resources, if any, will the project create?:
A substantial corpus of transcribed tutorial videos will be the initial result, which in itself could be valuable to other researchers. Beyond this, the project will annotate the corpus with topic boundaries, enhancing the quality of the material both for research purposes and for use in learning by future students.