Sanitisation of Ingested Data

CCC.GenAI.C04: Sanitisation of Ingested Data

Control ID:CCC.GenAI.C04

Title:Sanitisation of Ingested Data

Objective:Validate and sanitise all data ingested by GenAI systems from extenal sources or internal knowledge bases, whether for training, conversion to vector embeddings, or real-time retireval, in order to remove or redact poisoned or sensitive data before further processing.

Control Family:

Data

Related Threats

ID	Title	Description	External Mappings	Capability Mappings	Control Mappings
CCC.GenAI.TH02	Data Poisoning	Data poisoning occurs when training, fine-tuning or embedding data is tampered with in order to modify the model's behaviour, for example steering it towards specific outputs, degrading performance or introducing backdoors.	4	1	0
CCC.GenAI.TH03	Sensitive Information Disclosure	Sensitive data can be memorised by the model from user interaction or training and may then be leaked to unintended and unauthorised parties by querying the model, for example through crafted prompts.	4	1	0

Related Capabilities

ID	Title	Description
CCC.Core.F02	Encryption at Rest Enabled by Default	The service automatically encrypts all data using industry-standard cryptographic protocols prior to being written to a storage medium.
CCC.Core.F06	Access Control	The service automatically enforces user configurations to restrict or allow access to a specific component or a child resource based on factors such as user identities, roles, groups, or attributes.
CCC.GenAI.F03	Embedding Model Selection	Ability to select a foundation model used for tasks like semantic search, clustering, and document similarity by converting text into vector embeddings.
CCC.GenAI.F06	Customizable Model Selection	Provide users the ability to fine-tune models with their own data.
CCC.GenAI.F21	Generate Content	Ability to generate a response given a foundation model, parameter values, and a prompt.
CCC.GenAI.F22	Data Control	Ensures prompts, model outputs, embeddings, and training data fed by customers are not used to train foundation models.
CCC.GenAI.F24	Content Moderation	Ensure the service detects and filters abusive, harmful, and sensitive information to ensure responsible and safe use of the service.

Guideline Mappings

Reference ID	Entry ID	Remarks
FINOS-AIGF	AIR-PREV-002	Data Filtering From External Knowledge Bases
SAIF	Training Data Sanitization	-
MITRE-ATLAS	AML.M0007	Sanitize Training Data

Assessment Requirements

ID	Description	Applicability
CCC.GenAI.C04.TR01	When data is ingested for training, fine-tuning or conversion to vector embeddings, it MUST be validated for sensitive information or malicious content.	tlp-clear tlp-green tlp-amber tlp-red
CCC.GenAI.C04.TR02	If sensitive data or malicious content is detected, it must be rejected, redacted or flagged for manual review.	tlp-clear tlp-green tlp-amber tlp-red