Generative AI (GenAI) chatbots have quickly become popular and powerful tools. While state-of-the-art chatbots excel at general-purpose tasks, they sometimes struggle to provide useful responses to domain-specific prompts. To provide students with course-specific help, educators have started to build custom chatbots with limited-scope datasets containing resources relevant to their courses. How does the scope of the training data affect the helpfulness of the responses?
The goal of this study is to analyze differences in chatbot performance across training data scopes. In the context of student learning augmented by Generative AI, is a specialist or a generalist custom GenAI chatbot:
- Less likely to hallucinate?
- More helpful to students as a course resource? Additionally, is either more helpful than a general-purpose chatbot (such as ChatGPT)?
We created a suite of specialist and generalist custom GenAI chatbots: one with a wide scope encompassing all programming projects in a course and five with limited scope, each specialized to one project. Each chatbot was trained on a dataset containing the project specification, tutorials, lecture content, lab materials, and past course forum posts. The course is a high-enrollment Web Systems course at a large public research university. The project topics cover full-stack web development, distributed systems, and search engines. The chatbots are intended to serve as an additional course resource that assists students with projects, comparable to interacting with course staff on the course forum or in office hours.
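The difference between the specialist and generalist bots comes down to which course documents each one can draw on. As an illustration only, the sketch below shows one way the two dataset scopes could be assembled from a directory of course materials; the directory names, file format, and naive keyword-overlap retrieval are our own assumptions, not the platform on which the actual chatbots are built.

```python
from pathlib import Path

# Hypothetical layout: one folder of materials (specs, tutorials, lecture and
# lab content, forum posts) per project. Folder names are illustrative.
COURSE_ROOT = Path("course_materials")
PROJECTS = ["project1", "project2", "project3", "project4", "project5"]


def load_corpus(scope: str) -> dict[str, str]:
    """Load the documents visible to one bot.

    scope is a single project name (specialist bot) or "all"
    (generalist bot covering every project).
    """
    dirs = PROJECTS if scope == "all" else [scope]
    corpus: dict[str, str] = {}
    for d in dirs:
        for doc in sorted((COURSE_ROOT / d).glob("*.md")):
            corpus[str(doc)] = doc.read_text(encoding="utf-8")
    return corpus


def retrieve(corpus: dict[str, str], question: str, k: int = 3) -> list[str]:
    """Rank documents by keyword overlap with a student question."""
    words = set(question.lower().split())

    def overlap(path: str) -> int:
        return len(words & set(corpus[path].lower().split()))

    return sorted(corpus, key=overlap, reverse=True)[:k]
```

A specialist bot would call load_corpus("project1") while the generalist would call load_corpus("all"); the study asks whether the narrower corpus changes the helpfulness and hallucination behavior of the responses.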
We plan to measure the performance of our bot suite (our six custom chatbots and ChatGPT Pro using GPT-4) with an evaluation by a team of expert instructors. We will gather a subset of sample student questions from the course forum to use as prompts for the bots. The instructors will compare the quality of responses between the project-specific bot, the all-project bot, and the general-purpose bot, and look for relationships to the quality of the prompts. The instructors will be trained to evaluate the quality of prompts and responses using their knowledge of the course. From these evaluations, we will collect accuracy and helpfulness ratings for each bot and use them to measure hallucination rates.
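To make the planned metrics concrete, the sketch below shows one way the instructor evaluations could be aggregated per bot. The field names and scales (a 1 to 5 helpfulness score and boolean accuracy and hallucination judgments) are hypothetical placeholders for whatever rubric the instructor team adopts.

```python
from dataclasses import dataclass
from statistics import mean


# Hypothetical rating record; the fields and scales are our assumptions,
# not the study's final rubric.
@dataclass
class Evaluation:
    bot: str            # "project-specific", "all-project", or "general-purpose"
    helpfulness: int    # e.g., 1 (not helpful) to 5 (very helpful)
    accurate: bool      # instructor judged the response factually correct
    hallucinated: bool  # response contained fabricated or unsupported content


def summarize(evals: list[Evaluation]) -> dict[str, dict[str, float]]:
    """Aggregate instructor ratings into per-bot metrics."""
    summary: dict[str, dict[str, float]] = {}
    for bot in {e.bot for e in evals}:
        rows = [e for e in evals if e.bot == bot]
        summary[bot] = {
            "mean_helpfulness": mean(e.helpfulness for e in rows),
            "accuracy_rate": sum(e.accurate for e in rows) / len(rows),
            "hallucination_rate": sum(e.hallucinated for e in rows) / len(rows),
        }
    return summary
```

Given the list of Evaluation records from the instructor review, summarize() would yield, for example, the hallucination rate of the project-specific bot alongside that of the all-project and general-purpose bots.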
Our experience can guide educators who use custom Generative AI in a specific course context to enhance the student experience. It is also an exploration of how the quality of LLM responses, which we quantify through their usefulness and hallucination rate, is affected by the scope of the training data provided for a given context. This evaluation lets us observe the bots' performance before releasing them to students and before committing to the extra work required to design a custom dataset.