Anonymizers are available in both cloud and self-hosted installations of Opik.

Anonymizers help you protect sensitive information in your LLM applications by automatically detecting and replacing personally identifiable information (PII) and other sensitive data before it’s logged to Opik. This supports compliance with privacy regulations and reduces the risk of accidentally exposing sensitive information in your trace data.

How it works

Anonymizers work by processing all data that flows through Opik’s tracing system - including inputs, outputs, and metadata - before it’s stored or displayed. They apply a set of rules to detect and replace sensitive information with anonymized placeholders.

The anonymization happens automatically and transparently, in four steps:

Data Ingestion: You log traces and spans to Opik
Rule Application: Registered anonymizers scan the data using their configured rules
Replacement: Sensitive information is replaced with anonymized placeholders
Storage: Only the anonymized data is stored in Opik
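
The four steps above can be pictured with a small standard-library sketch. It is not Opik's implementation; `RULES`, `STORE`, `apply_rules`, and `log_trace` are hypothetical names used only to show rule application and replacement happening before anything is stored.

```python
import re

# Hypothetical rule set: (compiled pattern, placeholder) pairs, applied in order.
RULES = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
]

STORE = []  # stands in for the backend; only anonymized data lands here


def apply_rules(text: str) -> str:
    """Rule application and replacement: run every rule over the text."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text


def log_trace(trace: dict) -> None:
    """Ingestion and storage: anonymize string fields, then store the result."""
    STORE.append(
        {key: apply_rules(value) if isinstance(value, str) else value
         for key, value in trace.items()}
    )


log_trace({"input": "Mail jane@example.com", "output": "Call 555-123-4567"})
print(STORE[0])
```

The point of the sketch is the ordering: the raw values never reach `STORE`, which mirrors the guarantee described in the last step.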

Types of Anonymizers

Rules-based Anonymizer

The most common type of anonymizer uses pattern-matching rules to identify and replace sensitive information. Rules can be defined in several formats:

Regex Rules

Use regular expressions to match specific patterns:

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

# Dictionary format
email_rule = {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "replace": "[EMAIL]"}

# Tuple format
phone_rule = (r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]")

# Create anonymizer with multiple rules
anonymizer = create_anonymizer([email_rule, phone_rule])

# Register globally
opik.hooks.add_anonymizer(anonymizer)

Function Rules

Use custom Python functions for more complex anonymization logic:

import hashlib
import re

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

def mask_api_keys(text: str) -> str:
    """Custom function to anonymize API keys"""
    # Match common API key patterns (sk-... and pk_...)
    api_key_pattern = r'\b(sk-[a-zA-Z0-9]{32,}|pk_[a-zA-Z0-9]{24,})\b'
    return re.sub(api_key_pattern, '[API_KEY]', text)

def anonymize_with_hash(text: str) -> str:
    """Replace emails with consistent hashes for tracking without exposing PII"""

    def hash_replace(match):
        # The same email always produces the same 8-character token
        email = match.group(0)
        hash_val = hashlib.md5(email.encode()).hexdigest()[:8]
        return f"[EMAIL_{hash_val}]"

    email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
    return re.sub(email_pattern, hash_replace, text)

# Create anonymizer with function rules
anonymizer = create_anonymizer([mask_api_keys, anonymize_with_hash])
opik.hooks.add_anonymizer(anonymizer)
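
The benefit of hashing over a fixed placeholder is that the same email always maps to the same token, so occurrences can be correlated across traces without storing the address. The sketch below checks that property using only the standard library. `pseudonymize_emails` is an illustrative name, and SHA-256 is used here in place of MD5 as one possible choice. An unsalted hash of a guessable value can be recovered by brute force, so treat this as pseudonymization rather than strong anonymization.

```python
import hashlib
import re

EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def pseudonymize_emails(text: str) -> str:
    """Replace each email with a short, deterministic token."""
    def _token(match: re.Match) -> str:
        # Lower-case first so "Ann@x.com" and "ann@x.com" share a token.
        digest = hashlib.sha256(match.group(0).lower().encode("utf-8")).hexdigest()
        return f"[EMAIL_{digest[:8]}]"
    return EMAIL_PATTERN.sub(_token, text)


first = pseudonymize_emails("From ann@example.com")
second = pseudonymize_emails("Reply to ann@example.com and bob@example.com")
token = first.split()[-1]  # the token that replaced ann@example.com
print(token in second)     # the same address yields the same token
```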

Mixed Rules

Combine different rule types for comprehensive anonymization:

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

# Mix of dictionary, tuple, and function rules
mixed_rules = [
    {"regex": r"\b\d{3}-\d{2}-\d{4}\b", "replace": "[SSN]"},  # Social Security Numbers
    (r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "[CARD]"),  # Credit Cards
    lambda text: text.replace("CONFIDENTIAL", "[REDACTED]"),  # Custom replacements
]

anonymizer = create_anonymizer(mixed_rules)
opik.hooks.add_anonymizer(anonymizer)

Custom Anonymizers

For advanced use cases, create custom anonymizers by extending the Anonymizer base class.

Understanding Anonymizer Arguments

When implementing custom anonymizers, you need to implement the anonymize() method with the following signature:

def anonymize(self, data, **kwargs):
    # Your anonymization logic here
    return anonymized_data

The kwargs parameters:

The anonymize() method also receives additional context through **kwargs:

field_name: Indicates which field is being anonymized ("input", "output", "metadata", or a nested field name in dot notation, such as "metadata.email")
object_type: The type of the object being processed ("span", "trace")
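
Because nested fields arrive in dot notation, a custom anonymizer often needs to decide whether a given `field_name` falls under a sensitive path. A minimal helper for that decision might look like the following. `SENSITIVE_PATHS` and `is_sensitive_field` are hypothetical names, not part of Opik.

```python
SENSITIVE_PATHS = ("metadata.email", "metadata.user")  # hypothetical configuration


def is_sensitive_field(field_name: str) -> bool:
    """True if field_name equals a sensitive path or is nested beneath one."""
    return any(
        field_name == path or field_name.startswith(path + ".")
        for path in SENSITIVE_PATHS
    )


print(is_sensitive_field("metadata.user.phone"))
print(is_sensitive_field("metadata.username"))
```

Matching on `path + "."` rather than a bare prefix keeps `metadata.username` from being treated as a child of `metadata.user`.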

When are kwargs available?

These kwargs are automatically passed by Opik’s internal data processors when anonymizing trace and span data before sending it to the backend. This allows you to apply different anonymization strategies based on the field being processed.

Example: Field-specific anonymization

import re

import opik.hooks
from opik.anonymizer import Anonymizer

class FieldAwareAnonymizer(Anonymizer):
    def anonymize(self, data, **kwargs):
        field_name = kwargs.get("field_name", "")

        # Anonymize string outputs; inputs are left as-is for debugging
        if field_name == "output" and isinstance(data, str):
            # More aggressive anonymization for outputs
            data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', data)
            data = re.sub(r'\b\d{3}-\d{3}-\d{4}\b', '[PHONE]', data)
        elif field_name == "metadata" and isinstance(data, dict):
            # Redact specific metadata fields (keys are kept, values are replaced)
            sensitive_keys = ["user_id", "session_token", "api_key"]
            for key in sensitive_keys:
                if key in data:
                    data[key] = "[REDACTED]"

        return data

# Register the field-aware anonymizer
opik.hooks.add_anonymizer(FieldAwareAnonymizer())

The field_name and object_type kwargs are primarily useful for implementing context-aware anonymization logic. If you don’t need field-specific behavior, you can safely ignore these kwargs.

Example: Anonymization of nested data structures

You can also extend the RecursiveAnonymizer base class to work with nested data structures. It lets you apply the same anonymization logic to every nested field. In this case, implement the anonymize_text() method instead of anonymize().

from typing import Any, Optional

from opik.anonymizer import RecursiveAnonymizer
import opik.hooks

class SSNAnonymizer(RecursiveAnonymizer):
    def anonymize_text(self, data: str, field_name: Optional[str] = None, **kwargs: Any) -> str:
        if field_name == "metadata.ssn":
            return "[SSN_REMOVED]"

        return data
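
To see why `field_name` can take values such as `"metadata.ssn"`, it helps to picture a recursive walk that extends the dotted path at each dictionary level and calls a text hook on every string. The sketch below is a standard-library illustration of that idea, not Opik's `RecursiveAnonymizer`. `walk` and `redact_ssn` are hypothetical names, and the way list elements are named here (they reuse the parent's path) is an assumption of the sketch.

```python
from typing import Any, Callable


def walk(data: Any, hook: Callable[[str, str], str], field_name: str = "") -> Any:
    """Return a copy of data with hook(text, field_name) applied to every string."""
    if isinstance(data, str):
        return hook(data, field_name)
    if isinstance(data, dict):
        return {
            key: walk(value, hook, f"{field_name}.{key}" if field_name else str(key))
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [walk(item, hook, field_name) for item in data]
    return data  # numbers, None, etc. pass through unchanged


def redact_ssn(text: str, field_name: str) -> str:
    return "[SSN_REMOVED]" if field_name == "metadata.ssn" else text


trace = {"input": "hello", "metadata": {"ssn": "123-45-6789", "tags": ["a", "b"]}}
print(walk(trace, redact_ssn))
```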

Advanced Custom Anonymizer Example

import re

import opik
import opik.hooks
from opik.anonymizer import Anonymizer

class AdvancedPIIAnonymizer(Anonymizer):
    def anonymize(self, data, **kwargs):
        """Custom anonymizer with advanced PII detection and removal."""
        field_name = kwargs.get("field_name")
        object_type = kwargs.get("object_type")

        # Handle different data types
        if isinstance(data, dict):
            # Remove sensitive keys entirely
            if "api_key" in data:
                del data["api_key"]
            if "password" in data:
                del data["password"]

            # Anonymize specific fields
            for key, value in data.items():
                if key.lower() in ["email", "user_email"]:
                    data[key] = "[EMAIL_REDACTED]"
                elif key.lower() in ["phone", "telephone", "mobile"]:
                    data[key] = "[PHONE_REDACTED]"

        elif isinstance(data, str):
            # Apply string-based anonymization
            # Names (simple heuristic)
            data = re.sub(r'\b[A-Z][a-z]+ [A-Z][a-z]+\b', '[NAME]', data)
            # Addresses
            data = re.sub(r'\d+\s+\w+\s+(Street|St|Avenue|Ave|Road|Rd|Drive|Dr)\b', '[ADDRESS]', data)

        return data

# Register the custom anonymizer
opik.hooks.add_anonymizer(AdvancedPIIAnonymizer())

Usage Examples

Basic Setup

Here’s a complete example showing how to set up anonymization for a simple LLM application:

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

# Define PII anonymization rules
pii_rules = [
    # Email addresses
    {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "replace": "[EMAIL]"},
    # Phone numbers (US format)
    {"regex": r"\b\d{3}-\d{3}-\d{4}\b", "replace": "[PHONE]"},
    # Social Security Numbers
    {"regex": r"\b\d{3}-\d{2}-\d{4}\b", "replace": "[SSN]"},
    # Credit card numbers
    {"regex": r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "replace": "[CARD]"},
]

# Create and register anonymizer
anonymizer = create_anonymizer(pii_rules)
opik.hooks.add_anonymizer(anonymizer)

# Now all traced functions will automatically anonymize PII
@opik.track
def process_customer_data(customer_info):
    """This function processes customer data with automatic PII anonymization"""
    # The input and output will be automatically anonymized
    return f"Processed customer: {customer_info}"

# Example usage - PII will be automatically anonymized in traces
result = process_customer_data("John Doe, email: john@example.com, phone: 555-123-4567")

Advanced Configuration

For more sophisticated anonymization scenarios:

import re

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer, Anonymizer

class ComplianceAnonymizer(Anonymizer):
    """Enterprise-grade anonymizer for compliance requirements"""

    def __init__(self, compliance_level="standard"):
        self.compliance_level = compliance_level
        self.sensitive_fields = {
            "standard": ["email", "phone", "ssn"],
            "strict": ["email", "phone", "ssn", "name", "address", "dob"],
            "minimal": ["ssn", "password"]
        }

    def anonymize(self, data, **kwargs):
        field_name = kwargs.get("field_name", "")

        if isinstance(data, dict):
            # Process dictionary fields
            for key, value in list(data.items()):
                if key.lower() in self.sensitive_fields[self.compliance_level]:
                    data[key] = f"[{key.upper()}_REDACTED]"

        elif isinstance(data, str):
            # Apply string-level anonymization based on the compliance level
            if self.compliance_level == "strict":
                # More aggressive anonymization
                data = re.sub(r'\b[A-Z][a-z]+ [A-Z][a-z]+\b', '[NAME]', data)
                data = re.sub(r'\b\d{1,4}\s+\w+\s+\w+\b', '[ADDRESS]', data)

        return data

# Set up multi-layer anonymization
opik.hooks.clear_anonymizers()  # Clear any existing anonymizers

# Layer 1: Basic PII patterns
basic_rules = [
    (r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]"),
    (r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]"),
]
opik.hooks.add_anonymizer(create_anonymizer(basic_rules))

# Layer 2: Compliance-specific anonymization
opik.hooks.add_anonymizer(ComplianceAnonymizer(compliance_level="standard"))

# Layer 3: Custom business logic
def remove_internal_identifiers(text):
    """Remove company-specific internal identifiers"""
    return re.sub(r'\bEMP-\d{6}\b', '[EMPLOYEE_ID]', text)

opik.hooks.add_anonymizer(create_anonymizer(remove_internal_identifiers))

Using third-party PII libraries

In addition to regex and custom Python functions, you can reuse existing PII detection / redaction tools such as Microsoft Presidio or cloud APIs (AWS Comprehend, Google Cloud DLP, Azure AI Language). These tools can be wrapped inside an Opik anonymizer so that all trace data is pre-redacted before it’s logged. You typically integrate third-party tools in one of two ways:

Local open-source libraries running inside your app or self-hosted Opik deployment (e.g. Microsoft Presidio, scrubadub).
Managed cloud services called via their SDKs from your anonymizer (e.g. AWS Comprehend PII, Google Cloud DLP, Azure AI Language PII).

Third-party anonymizers are just custom anonymizers under the hood. You call the external engine inside anonymize() or a function rule, then return the redacted data to Opik.
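
When the external engine is slow or billed per call, one option is to wrap it in a function rule that caches results for repeated strings. The following is a sketch under stated assumptions: `make_cached_rule` and `fake_engine` are hypothetical names, and `fake_engine` merely stands in for a real SDK call. The cache keeps raw input text in process memory, which may or may not be acceptable in your environment.

```python
from functools import lru_cache
from typing import Callable


def make_cached_rule(redact: Callable[[str], str], maxsize: int = 1024) -> Callable[[str], str]:
    """Wrap an external redaction call so identical strings are only sent once."""
    @lru_cache(maxsize=maxsize)
    def rule(text: str) -> str:
        return redact(text)
    return rule


calls = []  # records every string that actually reaches the engine


def fake_engine(text: str) -> str:
    """Stand-in for a third-party SDK call (hypothetical)."""
    calls.append(text)
    return text.replace("Alice", "[PERSON]")


rule = make_cached_rule(fake_engine)
print(rule("Hi Alice"), rule("Hi Alice"), len(calls))
```

The returned `rule` is a plain `str -> str` callable, so it has the same shape as the function rules shown earlier.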

Example: Microsoft Presidio (open source, runs locally)

First, install Presidio in your environment:

$ pip install presidio-analyzer presidio-anonymizer

Then create an Anonymizer that delegates to Presidio:

from typing import Any

import opik.hooks
from opik.anonymizer import RecursiveAnonymizer

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

class PresidioPIIAnonymizer(RecursiveAnonymizer):
    """Use Microsoft Presidio to detect and anonymize PII in text.
    This anonymizer is a simple wrapper around Presidio's built-in anonymizer engine.
    It extends the RecursiveAnonymizer base class to support nested data structures.
    """
    def __init__(self, language: str = "en", max_depth: int = 10):
        super().__init__(max_depth=max_depth)
        self.language = language
        self.analyzer = AnalyzerEngine()
        self.anonymizer = AnonymizerEngine()

    def anonymize_text(self, data: str, **kwargs: Any) -> str:
        # 1) Detect PII entities in the text
        results = self.analyzer.analyze(
            text=data,
            language=self.language,
            entities=None,  # detect all supported entities
        )
        if not results:
            return data

        # 2) Apply Presidio anonymization
        operators = {
            "DEFAULT": OperatorConfig("replace", {"new_value": "[PII]"}),
            # You can customize per entity type if needed, for example:
            # "PHONE_NUMBER": OperatorConfig("mask", {"masking_char": "*", "chars_to_mask": 8}),
        }
        anon_result = self.anonymizer.anonymize(
            text=data,
            analyzer_results=results,
            operators=operators,
        )
        return anon_result.text

# Register the Presidio-based anonymizer globally
opik.hooks.add_anonymizer(PresidioPIIAnonymizer())

You can combine a Presidio anonymizer with existing regex/function rules by registering multiple anonymizers; they will be applied in sequence.

Integration with Frameworks

Registered anonymizers also apply to traces logged through Opik integrations:

OpenAI Integration

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer
from opik.integrations.openai import track_openai
import openai

# Set up anonymization
pii_rules = [
    {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "replace": "[EMAIL]"},
    {"regex": r"\b\d{3}-\d{3}-\d{4}\b", "replace": "[PHONE]"},
]
opik.hooks.add_anonymizer(create_anonymizer(pii_rules))

# Enable OpenAI tracking with automatic anonymization
client = track_openai(openai.OpenAI())

# PII in prompts will be automatically anonymized in traces
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Help me draft an email to john.doe@company.com about his phone number 555-123-4567"
    }]
)

LangChain Integration

import re

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer
from opik.integrations.langchain import OpikTracer
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

# Configure anonymization - mix regex and callable function
def mask_credit_cards(text: str) -> str:
    """Partial masking: show first 4 and last 4 digits, mask the middle"""
    def partial_mask(match):
        # Drop separators so only the digits are counted
        card = match.group(0).replace('-', '').replace(' ', '')
        if len(card) >= 8:
            return card[:4] + '*' * (len(card) - 8) + card[-4:]
        return '[CARD]'
    return re.sub(r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b', partial_mask, text)

anonymizer_rules = [
    # Email pattern (regex tuple)
    (r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]"),
    # Callable function for smart masking
    mask_credit_cards,
]
opik.hooks.add_anonymizer(create_anonymizer(anonymizer_rules))

# Set up LangChain with Opik tracing
llm = ChatOpenAI(callbacks=[OpikTracer()])

# All inputs and outputs will be automatically anonymized
messages = [HumanMessage(content="Contact sarah@example.com about card 4532-1234-5678-9010")]
result = llm.invoke(messages)

Configuration Options

Max Depth

Control how deeply nested data structures are processed:

from opik.anonymizer import create_anonymizer

rules = [{"regex": r"\b\d{3}-\d{3}-\d{4}\b", "replace": "[PHONE]"}]

# The default max_depth is 10; here it is lowered to 5
anonymizer = create_anonymizer(rules, max_depth=5)
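
The effect of a depth limit can be illustrated with a small recursive function. This sketch assumes that values nested deeper than `max_depth` are returned unchanged. That is an assumption made for illustration, so check the behavior of your Opik version before relying on it. `anonymize_nested` is a hypothetical name.

```python
import re
from typing import Any

PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def anonymize_nested(data: Any, max_depth: int = 10, depth: int = 0) -> Any:
    """Mask phone numbers in nested dicts/lists; values deeper than max_depth are untouched."""
    if depth > max_depth:
        return data  # assumption: too-deep values are left as they are
    if isinstance(data, str):
        return PHONE.sub("[PHONE]", data)
    if isinstance(data, dict):
        return {k: anonymize_nested(v, max_depth, depth + 1) for k, v in data.items()}
    if isinstance(data, list):
        return [anonymize_nested(v, max_depth, depth + 1) for v in data]
    return data


nested = {"a": "555-123-4567", "b": {"c": {"d": "555-987-6543"}}}
print(anonymize_nested(nested, max_depth=2))
```

Under that assumption, a limit that is too low for your payloads would let deeply nested values through unmasked, so it is worth choosing `max_depth` with your deepest realistic structure in mind.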

Multiple Anonymizers

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

# Clear existing anonymizers
opik.hooks.clear_anonymizers()

# Add multiple anonymizers in order
opik.hooks.add_anonymizer(create_anonymizer([
    {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "replace": "[EMAIL]"}
]))

opik.hooks.add_anonymizer(create_anonymizer([
    {"regex": r"\b\d{3}-\d{3}-\d{4}\b", "replace": "[PHONE]"}
]))

# Check if any anonymizers are registered
if opik.hooks.has_anonymizers():
    print(f"Active anonymizers: {len(opik.hooks.get_anonymizers())}")

Best Practices

Rule Ordering

Rules are applied in the order they’re defined. More specific patterns should come before general ones:

rules = [
    # Specific: Credit cards (more specific pattern first)
    {"regex": r"\b4\d{3}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "replace": "[VISA_CARD]"},
    # General: Any credit card
    {"regex": r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "replace": "[CARD]"},
    # General: Any number sequence
    {"regex": r"\b\d{4,}\b", "replace": "[NUMBER]"},
]
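
The consequence of ordering can be checked with the standard `re` module alone. In this sketch, `apply_in_order` is a hypothetical helper that applies (pattern, placeholder) pairs one after another, which is the behavior described above; it is not Opik code.

```python
import re

VISA = (r"\b4\d{3}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "[VISA_CARD]")
ANY_CARD = (r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "[CARD]")


def apply_in_order(text, rules):
    """Apply each (pattern, placeholder) rule to the output of the previous one."""
    for pattern, placeholder in rules:
        text = re.sub(pattern, placeholder, text)
    return text


sample = "Paid with 4532-1234-5678-9010"
print(apply_in_order(sample, [VISA, ANY_CARD]))  # Paid with [VISA_CARD]
print(apply_in_order(sample, [ANY_CARD, VISA]))  # Paid with [CARD]
```

Once the general rule has consumed the digits, the specific rule has nothing left to match, which is why the specific pattern belongs first.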

Performance Considerations

Use precompiled regex patterns for improved performance on large datasets when implementing custom anonymization functions. Note: Opik’s RegexRule automatically compiles patterns when the rule is created.
Keep the number of rules reasonable to avoid performance impacts.
Consider using more specific patterns to reduce false positives.

import re
from opik.anonymizer import create_anonymizer

# Pre-compile regex for better performance
EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

def efficient_email_anonymizer(text):
    return EMAIL_PATTERN.sub("[EMAIL]", text)

anonymizer = create_anonymizer(efficient_email_anonymizer)
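
One way to keep the number of passes over the text low inside a custom function is to merge several patterns into a single compiled alternation with named groups, then derive the placeholder from the group that matched. This is a sketch of that technique with the standard `re` module; `COMBINED` and `single_pass_anonymizer` are illustrative names, and whether it is actually faster depends on your patterns and data, so measure before adopting it.

```python
import re

# One compiled pattern with a named group per PII type.
COMBINED = re.compile(
    r"(?P<EMAIL>\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b)"
    r"|(?P<SSN>\b\d{3}-\d{2}-\d{4}\b)"
    r"|(?P<PHONE>\b\d{3}-\d{3}-\d{4}\b)"
)


def single_pass_anonymizer(text: str) -> str:
    """Replace every match with a placeholder named after the group that matched."""
    return COMBINED.sub(lambda m: f"[{m.lastgroup}]", text)


print(single_pass_anonymizer("a@b.io, 123-45-6789, 555-123-4567"))
```

Alternatives are tried left to right at each position, so the same ordering concern applies: put the more specific alternative first.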

Testing Anonymizers

Always test your anonymization rules to ensure they work correctly:

from opik.anonymizer import create_anonymizer

# Define your rules
rules = [
    {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "replace": "[EMAIL]"},
    {"regex": r"\b\d{3}-\d{3}-\d{4}\b", "replace": "[PHONE]"},
]

anonymizer = create_anonymizer(rules)

# Test with sample data
test_data = "Contact John at john.doe@company.com or call 555-123-4567"
anonymized = anonymizer.anonymize(test_data)
print(anonymized)  # Should output: "Contact John at [EMAIL] or call [PHONE]"

# Test with nested data
test_nested = {
    "user": {
        "email": "user@example.com",
        "phone": "555-987-6543",
        "notes": "Called regarding john@company.com"
    }
}
anonymized_nested = anonymizer.anonymize(test_nested)
print(anonymized_nested)
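
Printing the result works for a quick check, but an automated test is easier to keep honest. A possible approach is a "leak check" that scans anonymized output for anything that still looks like PII. The sketch below uses only the standard library; `LEAK_PATTERNS`, `iter_strings`, and `find_leaks` are hypothetical names, and the sample dictionaries stand in for the output of your anonymizer.

```python
import re
from typing import Any, Iterator

# Deliberately loose patterns: a leak check should err on the side of flagging.
LEAK_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def iter_strings(data: Any) -> Iterator[str]:
    """Yield every string found in nested dicts and lists (keys included)."""
    if isinstance(data, str):
        yield data
    elif isinstance(data, dict):
        for key, value in data.items():
            yield from iter_strings(key)
            yield from iter_strings(value)
    elif isinstance(data, (list, tuple)):
        for item in data:
            yield from iter_strings(item)


def find_leaks(data: Any) -> list:
    """Return (kind, text) pairs for every string that still looks like PII."""
    return [
        (kind, text)
        for text in iter_strings(data)
        for kind, pattern in LEAK_PATTERNS.items()
        if pattern.search(text)
    ]


clean = {"user": {"email": "[EMAIL]", "notes": "Called regarding [EMAIL]"}}
leaky = {"user": {"email": "[EMAIL]", "notes": ["call 555-987-6543"]}}
print(find_leaks(clean), find_leaks(leaky))
```

In a test suite, asserting that `find_leaks(...)` returns an empty list for each anonymized sample gives a regression check that fails loudly when a rule stops matching.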

Troubleshooting

Common Issues

Anonymizer not working:

Ensure the anonymizer is registered with opik.hooks.add_anonymizer()
Check that your patterns are correct using a regex tester
If traces are missing rather than unredacted, call opik.flush_tracker() before the process exits so that buffered traces are sent

Performance issues:

Reduce the complexity of regex patterns
Limit the number of registered anonymizers
Consider using more specific patterns to reduce processing overhead

False positives:

Make your regex patterns more specific
Test thoroughly with representative data
Consider using negative lookbehind/lookahead assertions
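
As an illustration of the lookaround suggestion, compare a word-boundary phone pattern with one that rejects matches glued to neighboring word characters or hyphens. This uses only the standard `re` module; `LOOSE` and `STRICT` are illustrative names, and the stricter pattern is one possible choice rather than a recommended default.

```python
import re

LOOSE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# Reject matches that directly follow or precede a word character or hyphen.
STRICT = re.compile(r"(?<![\w-])\d{3}-\d{3}-\d{4}(?![\w-])")

text = "Call 555-123-4567 about order ORD-555-123-4567-X"
print(LOOSE.sub("[PHONE]", text))   # Call [PHONE] about order ORD-[PHONE]-X
print(STRICT.sub("[PHONE]", text))  # Call [PHONE] about order ORD-555-123-4567-X
```

The trade-off runs both ways: the stricter pattern avoids mangling the order ID, but it would also skip a genuine number written as `tel-555-123-4567`, so test it against representative data.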

Security Considerations

Test thoroughly: Always test anonymization rules with representative data
Regular updates: Review and update patterns as your application evolves
Compliance: Ensure your anonymization approach meets regulatory requirements
Backup strategy: Consider how to handle cases where anonymization fails
Access control: Limit access to original data and anonymization rules
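
For the backup-strategy point, one possible design is to make function rules "fail closed": if a rule raises, return a placeholder instead of the raw text. The sketch below shows the idea with the standard library only. `fail_closed` and `fragile_rule` are hypothetical names, and how Opik itself behaves when an anonymizer raises is not covered here.

```python
import re
from typing import Callable


def fail_closed(rule: Callable[[str], str],
                fallback: str = "[ANONYMIZATION_FAILED]") -> Callable[[str], str]:
    """Wrap a function rule so that any exception yields a placeholder, not raw text."""
    def safe_rule(text: str) -> str:
        try:
            return rule(text)
        except Exception:
            return fallback
    return safe_rule


def fragile_rule(text: str) -> str:
    """Hypothetical rule that fails on some inputs."""
    if "\x00" in text:
        raise ValueError("unexpected control character")
    return re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)


safe = fail_closed(fragile_rule)
print(safe("SSN 123-45-6789"))  # SSN [SSN]
print(safe("bad\x00input"))     # [ANONYMIZATION_FAILED]
```

Failing closed trades debuggability for safety: the trace loses the whole field when a rule breaks, but nothing unredacted is logged as a side effect of the failure.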

Remember that anonymization is a one-way process: once data is anonymized in Opik, the original values cannot be recovered. Plan your anonymization strategy accordingly.