DecodingTrust aims to provide a thorough assessment of trustworthiness in GPT models.
This research endeavor is designed to help researchers and practitioners better understand the capabilities, limitations, and potential risks involved in deploying these state-of-the-art Large Language Models (LLMs).
This project is organized around the following eight primary perspectives of trustworthiness:

- Toxicity
- Stereotype and bias
- Adversarial robustness
- Out-of-distribution robustness
- Robustness on adversarial demonstrations
- Privacy
- Machine ethics
- Fairness
⚠️ WARNING: our data contains model outputs that may be considered offensive.
DecodingTrust aims to provide a comprehensive trustworthiness evaluation of the recent large language model GPT-4, in comparison to GPT-3.5, from different perspectives, including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness under different settings.
We have built a few resources to help you get started with the dataset.
Explore our dataset (distributed under the CC BY-SA 4.0 license).
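As a minimal starting point, the sketch below loads one perspective of the dataset with the Hugging Face `datasets` library. The repository ID `AI-Secure/DecodingTrust` and the `toxicity` configuration name are assumptions for illustration; adjust them to match where and how the data is actually hosted.

```python
# Minimal sketch: load one trustworthiness perspective of DecodingTrust.
# Assumption: the dataset is hosted on the Hugging Face Hub as
# "AI-Secure/DecodingTrust" with one configuration per perspective
# (e.g., "toxicity"); the repo ID and config name may differ from the
# actual release.
from datasets import load_dataset

dataset = load_dataset("AI-Secure/DecodingTrust", "toxicity")

# Inspect the available splits and one sample record.
print(dataset)
first_split = list(dataset.keys())[0]
print(dataset[first_split][0])
```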
To evaluate your models, we have also made available the evaluation scripts for the different facets of trustworthiness. To replicate our evaluation, go to the corresponding subdirectories for the detailed evaluation scripts.
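The per-perspective scripts in the subdirectories are the authoritative way to run the evaluation. Purely as an illustration of the general pattern, the sketch below queries a chat model on dataset prompts and saves the generations for later scoring; the `prompt` placeholder, model name, and output path are not taken from the evaluation scripts.

```python
# Illustrative sketch only: query a chat model on dataset prompts and
# collect the generations for later scoring. The placeholder prompt,
# model name, and output path are assumptions -- see the per-perspective
# subdirectories for the actual evaluation scripts.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate(prompts, model="gpt-3.5-turbo", max_tokens=150):
    """Return one generation per prompt from the chosen chat model."""
    outputs = []
    for prompt in prompts:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        outputs.append(response.choices[0].message.content)
    return outputs


if __name__ == "__main__":
    prompts = ["<dataset prompt goes here>"]  # placeholder input
    generations = generate(prompts)
    with open("generations.json", "w") as f:
        json.dump(list(zip(prompts, generations)), f, indent=2)
```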
While our evaluation mainly focuses on GPT-3.5 and GPT-4, once you have built a model that works to your expectations, you can also submit the model generations and their corresponding scores by opening a pull request on GitHub. Because DecodingTrust is an ongoing effort, we expect the dataset to evolve.
Ask us questions at boxinw2@illinois.edu.
We thank the SQuAD team for allowing us to use their website template and submission tutorials.