Apple Defends Its Ethical Approach To AI Model Training Using Public And Licensed Data
The training dataset consisted of publicly available, open-source, and licensed data.
Apple has published a technical paper outlining the development of its Apple Intelligence generative AI features, which are scheduled for integration into iOS, macOS, and iPadOS. The document addresses ethical concerns about the data used to train its models, stressing that Apple did not draw on private user data. Instead, the training set was built from publicly available, open-source, and licensed material.
In response to recent reports that Apple used data from The Pile, including YouTube subtitles, to train its AI, the company has clarified that the models in question were never intended to power AI features in its products. The technical paper emphasizes that the Apple Foundation Models (AFM) were trained on responsibly sourced data, encompassing licensed data from publishers and data publicly available on the web.
Apple's training dataset, roughly 6.3 trillion tokens in size, includes open-source code from platforms such as GitHub, filtered to retain only repositories with minimal usage restrictions. The company also used publicly available math questions and other high-quality datasets, verifying that each carried a license appropriate for training models.
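Conceptually, the repository filter Apple describes amounts to checking each project's declared license against an allowlist of permissive licenses. The sketch below illustrates that idea in Python; the license list, metadata fields, and input format are illustrative assumptions, not details taken from Apple's paper.

```python
# Minimal sketch of license-based repository filtering: keep only repositories
# whose declared licenses place minimal restrictions on downstream use.
# The allowlist and metadata format below are assumptions for illustration,
# not Apple's actual pipeline.

PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}

def keep_repository(repo: dict) -> bool:
    """Return True if the repository's declared license is on the allowlist."""
    license_id = (repo.get("license") or "").lower()
    return license_id in PERMISSIVE_LICENSES

# Hypothetical repository metadata records.
repos = [
    {"name": "example/tokenizer", "license": "MIT"},
    {"name": "example/kernel-fork", "license": "GPL-3.0"},  # copyleft, excluded
    {"name": "example/math-lib", "license": "Apache-2.0"},
]

training_candidates = [r["name"] for r in repos if keep_repository(r)]
print(training_candidates)  # ['example/tokenizer', 'example/math-lib']
```

In a real curation pipeline a check like this would sit alongside deduplication, quality filtering, and tokenization; the point here is only how a license allowlist narrows the pool of candidate code.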
Apple maintains that it has integrated responsible AI principles throughout its model development process, using human feedback and synthetic data to refine the models and curb unwanted behaviors. Publishing the paper is a way for Apple to position itself as an ethical AI pioneer amid ongoing debate and legal disputes over the use of public web data to train AI models.