Approach
Prompt Injection Detection:
-
Use rule-based heuristics to identify suspicious patterns (e.g., keywords like “ignore instructions,” “bypass,” or encoded inputs).
-
Leverage a pre-trained language model to score input prompts for malicious intent.
-
Optionally, use embeddings to detect semantic anomalies in prompts.
Anomalous Output Detection:
- Monitor model outputs for unusual patterns, such as high perplexity, toxic content, or deviation from expected topics.
- Use a secondary classifier to flag outputs as anomalous based on embeddings or sentiment analysis.
- Incorporate statistical anomaly detection (e.g., Isolation Forest) for numerical features like output length or token distribution.
Implementation:
- Use transformers for language model interactions and embeddings.
- Use textblob or vaderSentiment for sentiment/toxicity analysis.
- Use scikit-learn for anomaly detection.
- Log flagged inputs/outputs for further analysis.