Sanitisation of Ingested Data

CCC.GenAI.CN04 · MachineLearning

Validate and sanitise all data ingested by GenAI systems from external sources or internal knowledge bases, whether for training, conversion to vector embeddings, or real-time retrieval, in order to remove or redact poisoned or sensitive data before further processing.

Related Capabilities

ID | Title | Description
CCC.Core.CP02 | Encryption at Rest Enabled by Default | The service automatically encrypts all data using industry-standard cryptographic protocols prior to being written to a storage medium.
CCC.Core.CP06 | Access Control | The service automatically enforces user configurations to restrict or allow access to a specific component or a child resource based on factors such as user identities, roles, groups, or attributes.
CCC.GenAI.CP03 | Embedding Model Selection | Ability to select a foundation model used for tasks like semantic search, clustering, and document similarity by converting text into vector embeddings.
CCC.GenAI.CP06 | Customizable Model Selection | Provide users the ability to fine-tune models with their own data.
CCC.GenAI.CP21 | Generate Content | Ability to generate a response given a foundation model, parameter values, and a prompt.
CCC.GenAI.CP22 | Data Control | Ensures prompts, model outputs, embeddings, and training data fed by customers are not used to train foundation models.
CCC.GenAI.CP24 | Content Moderation | Ensure the service detects and filters abusive, harmful, and sensitive information to ensure responsible and safe use of the service.

Related Threats

ID | Title | Description
CCC.GenAI.TH02 | Data Poisoning | Data poisoning occurs when training, fine-tuning or embedding data is tampered with in order to modify the model's behaviour, for example steering it towards specific outputs, degrading performance or introducing backdoors.
CCC.GenAI.TH03 | Sensitive Information Disclosure | Sensitive data can be memorised by the model from user interaction or training and may then be leaked to unintended and unauthorised parties by querying the model, for example through crafted prompts.

Assessment Requirements

ID | Text | Applicability
CCC.GenAI.CN04.AR01 | When data is ingested for training, fine-tuning or conversion to vector embeddings, it MUST be validated for sensitive information or malicious content. | tlp-clear, tlp-green, tlp-amber, tlp-red
CCC.GenAI.CN04.AR02 | If sensitive data or malicious content is detected, it MUST be rejected, redacted or flagged for manual review. | tlp-clear, tlp-green, tlp-amber, tlp-red
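
The two requirements above can be illustrated with a minimal Python sketch of an ingestion check. It is not part of the control and not a complete detector: the `sanitise` function, the `Action` and `Result` types, and the three regular expressions are all hypothetical names and patterns chosen for illustration. A real pipeline would likely rely on a maintained PII and malicious-content detection service rather than hand-written patterns.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    ACCEPT = "accept"   # nothing detected, pass through unchanged
    REDACT = "redact"   # sensitive data masked, then passed on
    REJECT = "reject"   # malicious content, dropped entirely
    REVIEW = "review"   # ambiguous content, held for manual review


# Illustrative patterns only (assumptions, not an exhaustive rule set).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
INJECTION = re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE)
SUSPICIOUS = re.compile(r"<script\b", re.IGNORECASE)


@dataclass
class Result:
    action: Action
    text: Optional[str]  # None when the document is rejected


def sanitise(text: str) -> Result:
    """Validate one document before training, embedding, or retrieval (AR01)
    and decide whether to reject, flag, or redact it (AR02)."""
    if INJECTION.search(text):
        return Result(Action.REJECT, None)
    if SUSPICIOUS.search(text):
        # Held back unchanged; the caller should not process it further
        # until a reviewer has cleared it.
        return Result(Action.REVIEW, text)
    redacted, hits = EMAIL.subn("[REDACTED_EMAIL]", text)
    if hits:
        return Result(Action.REDACT, redacted)
    return Result(Action.ACCEPT, text)
```

In this sketch only documents returned with `Action.ACCEPT` or `Action.REDACT` would continue to training or embedding; rejected and flagged documents stay out of the pipeline. The order of checks (reject first, then flag, then redact) is a design choice of the example, so that malicious content is never passed on merely because it was also redacted.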

Guideline Mappings

Framework | ID | Remarks
FINOS-AIGF | AIR-PREV-002 | Data Filtering From External Knowledge Bases
SAIF | | Training Data Sanitization
MITRE-ATLAS | AML.M0007 | Sanitize Training Data