Anonymizers

Anonymizers are available in both cloud and self-hosted installations of Opik.

Anonymizers help you protect sensitive information in your LLM applications by automatically detecting and replacing personally identifiable information (PII) and other sensitive data before it’s logged to Opik. This supports compliance with privacy regulations and reduces the risk of accidentally exposing sensitive information in your trace data.

How it works

Anonymizers work by processing all data that flows through Opik’s tracing system - including inputs, outputs, and metadata - before it’s stored or displayed. They apply a set of rules to detect and replace sensitive information with anonymized placeholders.

The anonymization happens automatically and transparently:

  1. Data Ingestion: You log traces and spans to Opik as usual
  2. Rule Application: Registered anonymizers scan the data using their configured rules
  3. Replacement: Sensitive information is replaced with anonymized placeholders
  4. Storage: Only the anonymized data is stored in Opik
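
The four steps above can be pictured with a small standalone sketch. It does not use Opik at all: RULES, scrub, and store are hypothetical names, and the code only illustrates the idea that replacement happens before anything is persisted. It is not a description of Opik's internals.

```python
import re

# Hypothetical rules: (compiled pattern, placeholder). Not Opik's internals.
RULES = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),
]


def scrub(value):
    """Recursively apply every rule to strings inside dicts and lists."""
    if isinstance(value, str):
        for pattern, placeholder in RULES:
            value = pattern.sub(placeholder, value)
        return value
    if isinstance(value, dict):
        return {key: scrub(item) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value  # numbers, None, etc. pass through unchanged


def store(trace, storage):
    """Steps 2-4: scan, replace, then persist only the scrubbed copy."""
    storage.append(scrub(trace))


storage = []
store({"input": "Mail john@example.com", "metadata": {"phone": "555-123-4567"}}, storage)
print(storage[0])  # {'input': 'Mail [EMAIL]', 'metadata': {'phone': '[PHONE]'}}
```

The original values never reach the storage list, which mirrors step 4: only the anonymized copy is kept.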

Types of Anonymizers

Rules-based Anonymizer

The most common type of anonymizer uses pattern-matching rules to identify and replace sensitive information. Rules can be defined in several formats:

Regex Rules

Use regular expressions to match specific patterns:

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

# Dictionary format
email_rule = {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "replace": "[EMAIL]"}

# Tuple format
phone_rule = (r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]")

# Create anonymizer with multiple rules
anonymizer = create_anonymizer([email_rule, phone_rule])

# Register globally
opik.hooks.add_anonymizer(anonymizer)

Function Rules

Use custom Python functions for more complex anonymization logic:

import re

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

def mask_api_keys(text: str) -> str:
    """Custom function to anonymize API keys"""
    # Match common API key patterns
    api_key_pattern = r'\b(sk-[a-zA-Z0-9]{32,}|pk_[a-zA-Z0-9]{24,})\b'
    return re.sub(api_key_pattern, '[API_KEY]', text)

def uppercase_names(text: str) -> str:
    """Uppercase the whole text (a simple non-regex transformation)"""
    return text.upper()

# Create anonymizer with function rules
anonymizer = create_anonymizer([mask_api_keys, uppercase_names])
opik.hooks.add_anonymizer(anonymizer)

Mixed Rules

Combine different rule types for comprehensive anonymization:

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

# Mix of dictionary, tuple, and function rules
mixed_rules = [
    {"regex": r"\b\d{3}-\d{2}-\d{4}\b", "replace": "[SSN]"},  # Social Security Numbers
    (r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "[CARD]"),  # Credit Cards
    lambda text: text.replace("CONFIDENTIAL", "[REDACTED]"),  # Custom replacements
]

anonymizer = create_anonymizer(mixed_rules)
opik.hooks.add_anonymizer(anonymizer)
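
To make the three rule formats more concrete, here is a minimal standalone sketch of how dictionary, tuple, and function rules could all be reduced to plain string-to-string functions and applied in order. The helpers to_callable and apply_rules are hypothetical and use only the standard library; Opik's actual handling of rules may differ.

```python
import re
from typing import Callable, Dict, Tuple, Union

Rule = Union[Dict[str, str], Tuple[str, str], Callable[[str], str]]


def to_callable(rule: Rule) -> Callable[[str], str]:
    """Turn a dict, tuple, or function rule into a str -> str function."""
    if callable(rule):
        return rule
    if isinstance(rule, dict):
        pattern, replacement = rule["regex"], rule["replace"]
    else:
        pattern, replacement = rule
    compiled = re.compile(pattern)
    return lambda text: compiled.sub(replacement, text)


def apply_rules(text: str, rules) -> str:
    """Apply each rule in list order; later rules see earlier output."""
    for rule in rules:
        text = to_callable(rule)(text)
    return text


mixed_rules = [
    {"regex": r"\b\d{3}-\d{2}-\d{4}\b", "replace": "[SSN]"},
    (r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "[CARD]"),
    lambda text: text.replace("CONFIDENTIAL", "[REDACTED]"),
]
result = apply_rules("CONFIDENTIAL: SSN 123-45-6789, card 4111 1111 1111 1111", mixed_rules)
print(result)  # [REDACTED]: SSN [SSN], card [CARD]
```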

Custom Anonymizers

For advanced use cases, create custom anonymizers by extending the Anonymizer base class:

import re

import opik
import opik.hooks
from opik.anonymizer import Anonymizer

class AdvancedPIIAnonymizer(Anonymizer):
    def anonymize(self, data, **kwargs):
        """
        Custom anonymizer with access to metadata about the field being processed
        """
        # Context about the value being processed, available for field-specific logic
        field_name = kwargs.get("field_name")
        object_type = kwargs.get("object_type")

        # Handle different data types
        if isinstance(data, dict):
            # Remove sensitive keys entirely
            if "api_key" in data:
                del data["api_key"]
            if "password" in data:
                del data["password"]

            # Anonymize specific fields (iterate over a copy of the keys)
            for key in list(data):
                if key.lower() in ["email", "user_email"]:
                    data[key] = "[EMAIL_REDACTED]"
                elif key.lower() in ["phone", "telephone", "mobile"]:
                    data[key] = "[PHONE_REDACTED]"

        elif isinstance(data, str):
            # Apply string-based anonymization
            # Names (simple heuristic)
            data = re.sub(r'\b[A-Z][a-z]+ [A-Z][a-z]+\b', '[NAME]', data)
            # Addresses
            data = re.sub(r'\d+\s+\w+\s+(Street|St|Avenue|Ave|Road|Rd|Drive|Dr)\b', '[ADDRESS]', data)

        return data

# Register the custom anonymizer
opik.hooks.add_anonymizer(AdvancedPIIAnonymizer())

Usage Examples

Basic Setup

Here’s a complete example showing how to set up anonymization for a simple LLM application:

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

# Define PII anonymization rules
pii_rules = [
    # Email addresses
    {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "replace": "[EMAIL]"},
    # Phone numbers (US format)
    {"regex": r"\b\d{3}-\d{3}-\d{4}\b", "replace": "[PHONE]"},
    # Social Security Numbers
    {"regex": r"\b\d{3}-\d{2}-\d{4}\b", "replace": "[SSN]"},
    # Credit card numbers
    {"regex": r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "replace": "[CARD]"},
]

# Create and register anonymizer
anonymizer = create_anonymizer(pii_rules)
opik.hooks.add_anonymizer(anonymizer)

# Now all traced functions will automatically anonymize PII
@opik.track
def process_customer_data(customer_info):
    """This function processes customer data with automatic PII anonymization"""
    # The logged input and output will be automatically anonymized
    return f"Processed customer: {customer_info}"

# Example usage - PII will be automatically anonymized in traces
result = process_customer_data("John Doe, email: john@example.com, phone: 555-123-4567")

Advanced Configuration

For more sophisticated anonymization scenarios:

import re

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer, Anonymizer

class ComplianceAnonymizer(Anonymizer):
    """Enterprise-grade anonymizer for compliance requirements"""

    def __init__(self, compliance_level="standard"):
        self.compliance_level = compliance_level
        self.sensitive_fields = {
            "standard": ["email", "phone", "ssn"],
            "strict": ["email", "phone", "ssn", "name", "address", "dob"],
            "minimal": ["ssn", "password"]
        }

    def anonymize(self, data, **kwargs):
        field_name = kwargs.get("field_name", "")

        if isinstance(data, dict):
            # Process dictionary fields (iterate over a copy of the keys)
            for key in list(data):
                if key.lower() in self.sensitive_fields[self.compliance_level]:
                    data[key] = f"[{key.upper()}_REDACTED]"

        elif isinstance(data, str):
            # Apply string-level anonymization based on the compliance level
            if self.compliance_level == "strict":
                # More aggressive anonymization
                data = re.sub(r'\b[A-Z][a-z]+ [A-Z][a-z]+\b', '[NAME]', data)
                data = re.sub(r'\b\d{1,4}\s+\w+\s+\w+\b', '[ADDRESS]', data)

        return data

# Set up multi-layer anonymization
opik.hooks.clear_anonymizers()  # Clear any existing anonymizers

# Layer 1: Basic PII patterns
basic_rules = [
    (r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]"),
    (r"\b\d{3}-\d{3}-\d{4}\b", "[PHONE]"),
]
opik.hooks.add_anonymizer(create_anonymizer(basic_rules))

# Layer 2: Compliance-specific anonymization
opik.hooks.add_anonymizer(ComplianceAnonymizer(compliance_level="standard"))

# Layer 3: Custom business logic
def remove_internal_identifiers(text):
    """Remove company-specific internal identifiers"""
    return re.sub(r'\bEMP-\d{6}\b', '[EMPLOYEE_ID]', text)

opik.hooks.add_anonymizer(create_anonymizer(remove_internal_identifiers))

Integration with Frameworks

Registered anonymizers also apply to traces logged through Opik integrations:

OpenAI Integration

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer
from opik.integrations.openai import track_openai
import openai

# Set up anonymization
pii_rules = [
    {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "replace": "[EMAIL]"},
    {"regex": r"\b\d{3}-\d{3}-\d{4}\b", "replace": "[PHONE]"},
]
opik.hooks.add_anonymizer(create_anonymizer(pii_rules))

# Enable OpenAI tracking with automatic anonymization
client = track_openai(openai.OpenAI())

# PII in prompts will be automatically anonymized in traces
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Help me draft an email to john.doe@company.com about his phone number 555-123-4567"
    }]
)

LangChain Integration

import re

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer
from opik.integrations.langchain import OpikTracer
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

# Configure anonymization
anonymizer_rules = [
    # Email pattern
    (r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]"),
    # Custom function to mask API keys (the whole key, not just the "sk-" prefix)
    lambda text: re.sub(r"\bsk-[A-Za-z0-9]+", "[API_KEY]", text),
]
opik.hooks.add_anonymizer(create_anonymizer(anonymizer_rules))

# Set up LangChain with Opik tracing
llm = ChatOpenAI(callbacks=[OpikTracer()])

# All inputs and outputs will be automatically anonymized
messages = [HumanMessage(content="Contact sarah@example.com about the API key sk-1234567890")]
result = llm.invoke(messages)

Configuration Options

Max Depth

Control how deeply nested data structures are processed:

from opik.anonymizer import create_anonymizer

rules = [{"regex": r"\b\d{3}-\d{3}-\d{4}\b", "replace": "[PHONE]"}]

# Default max_depth is 10
anonymizer = create_anonymizer(rules, max_depth=5)
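
A depth limit of this kind can be sketched with a standalone recursive traversal. The scrub function below is hypothetical, and it assumes that containers nested deeper than the limit are returned untouched. How Opik itself treats data beyond max_depth is not stated here, so check the SDK before relying on that behavior.

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def scrub(value, max_depth, depth=0):
    """Replace phone numbers, descending at most max_depth container levels."""
    if isinstance(value, str):
        return PHONE.sub("[PHONE]", value)
    if depth >= max_depth:
        return value  # assumption: deeper containers are returned untouched
    if isinstance(value, dict):
        return {k: scrub(v, max_depth, depth + 1) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v, max_depth, depth + 1) for v in value]
    return value


nested = {"a": {"b": {"c": "call 555-123-4567"}}}
print(scrub(nested, max_depth=3))  # {'a': {'b': {'c': 'call [PHONE]'}}}
print(scrub(nested, max_depth=2))  # {'a': {'b': {'c': 'call 555-123-4567'}}}
```

Under this assumption, a limit that is too low can leave sensitive values in deeply nested payloads, so the limit should be chosen with the shape of your trace data in mind.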

Multiple Anonymizers

Register multiple anonymizers that will be applied in sequence:

import opik
import opik.hooks
from opik.anonymizer import create_anonymizer

# Clear existing anonymizers
opik.hooks.clear_anonymizers()

# Add multiple anonymizers in order
opik.hooks.add_anonymizer(create_anonymizer([
    {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "replace": "[EMAIL]"}
]))

opik.hooks.add_anonymizer(create_anonymizer([
    {"regex": r"\b\d{3}-\d{3}-\d{4}\b", "replace": "[PHONE]"}
]))

# Check if any anonymizers are registered
if opik.hooks.has_anonymizers():
    print(f"Active anonymizers: {len(opik.hooks.get_anonymizers())}")

Best Practices

Rule Ordering

Rules are applied in the order they’re defined. More specific patterns should come before general ones:

rules = [
    # Specific: Visa cards (more specific pattern first)
    {"regex": r"\b4\d{3}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "replace": "[VISA_CARD]"},
    # General: Any credit card
    {"regex": r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "replace": "[CARD]"},
    # General: Any number sequence
    {"regex": r"\b\d{4,}\b", "replace": "[NUMBER]"},
]
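
The effect of ordering can be checked with plain re.sub calls, independent of Opik. In this sketch, apply_in_order is a hypothetical helper that applies tuple-style rules sequentially. Running the same rules in reverse shows how a general pattern can consume the text before a specific one gets a chance to match.

```python
import re

SPECIFIC_FIRST = [
    (r"\b4\d{3}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "[VISA_CARD]"),
    (r"\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b", "[CARD]"),
    (r"\b\d{4,}\b", "[NUMBER]"),
]


def apply_in_order(text, rules):
    """Apply (pattern, replacement) pairs one after another."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


sample = "Paid with 4111-1111-1111-1111"
print(apply_in_order(sample, SPECIFIC_FIRST))
# Paid with [VISA_CARD]
print(apply_in_order(sample, list(reversed(SPECIFIC_FIRST))))
# Paid with [NUMBER]-[NUMBER]-[NUMBER]-[NUMBER]
```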

Performance Considerations

  • Use precompiled regex patterns in custom anonymization functions to improve performance on large datasets. (Opik automatically compiles the patterns of regex rules when the anonymizer is created, so this only matters for function rules.)
  • Keep the number of rules reasonable to avoid performance impacts
  • Consider using more specific patterns to reduce false positives

import re
from opik.anonymizer import create_anonymizer

# Pre-compile regex for better performance
EMAIL_PATTERN = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")

def efficient_email_anonymizer(text):
    return EMAIL_PATTERN.sub("[EMAIL]", text)

anonymizer = create_anonymizer(efficient_email_anonymizer)

Testing Anonymizers

Always test your anonymization rules to ensure they work correctly:

from opik.anonymizer import create_anonymizer

# Define your rules
rules = [
    {"regex": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "replace": "[EMAIL]"},
    {"regex": r"\b\d{3}-\d{3}-\d{4}\b", "replace": "[PHONE]"},
]

anonymizer = create_anonymizer(rules)

# Test with sample data
test_data = "Contact John at john.doe@company.com or call 555-123-4567"
anonymized = anonymizer.anonymize(test_data)
print(anonymized)  # Should output: Contact John at [EMAIL] or call [PHONE]

# Test with nested data
test_nested = {
    "user": {
        "email": "user@example.com",
        "phone": "555-987-6543",
        "notes": "Called regarding john@company.com"
    }
}
anonymized_nested = anonymizer.anonymize(test_nested)
print(anonymized_nested)

Troubleshooting

Common Issues

Anonymizer not working:

  • Ensure the anonymizer is registered with opik.hooks.add_anonymizer()
  • Check that your patterns are correct using a regex tester
  • In short-lived scripts, call opik.flush_tracker() before the process exits so that pending traces are actually sent

Performance issues:

  • Reduce the complexity of regex patterns
  • Limit the number of registered anonymizers
  • Consider using more specific patterns to reduce processing overhead

False positives:

  • Make your regex patterns more specific
  • Test thoroughly with representative data
  • Consider using negative lookbehind/lookahead assertions
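
As an illustration of the lookaround suggestion, the standalone sketch below compares a phone pattern bounded only by \b with a stricter variant that refuses to match when the candidate is glued to further digits or hyphens. The names LOOSE and STRICT and the sample strings are made up for this example.

```python
import re

# \b alone still matches inside a longer hyphenated number
LOOSE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
# Negative lookbehind/lookahead: no digit or hyphen directly before or after
STRICT = re.compile(r"(?<![\d-])\d{3}-\d{3}-\d{4}(?![\d-])")

order_id = "Order 123-456-7890-1234 shipped"
phone = "Call 555-123-4567."

print(LOOSE.sub("[PHONE]", order_id))   # Order [PHONE]-1234 shipped  (false positive)
print(STRICT.sub("[PHONE]", order_id))  # Order 123-456-7890-1234 shipped
print(STRICT.sub("[PHONE]", phone))     # Call [PHONE].
```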

Security Considerations

  • Test thoroughly: Always test anonymization rules with representative data
  • Regular updates: Review and update patterns as your application evolves
  • Compliance: Ensure your anonymization approach meets regulatory requirements
  • Backup strategy: Consider how to handle cases where anonymization fails
  • Access control: Limit access to original data and anonymization rules
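
One way to approach the failure-handling point above is to make custom rules fail closed: if a rule raises, emit a placeholder instead of letting the raw text through. The wrapper below is a hypothetical standalone sketch, not part of Opik. A wrapped function is still a plain string-to-string function, so it should be usable as a function rule like the ones shown earlier.

```python
import re
from typing import Callable


def fail_closed(
    rule: Callable[[str], str], placeholder: str = "[ANONYMIZATION_FAILED]"
) -> Callable[[str], str]:
    """Wrap a rule so that an exception yields a placeholder, never the raw text."""
    def wrapped(text: str) -> str:
        try:
            return rule(text)
        except Exception:
            return placeholder
    return wrapped


def mask_emails(text: str) -> str:
    return re.sub(r"\b[\w.%+-]+@[\w.-]+\.[A-Za-z]{2,}\b", "[EMAIL]", text)


def buggy_rule(text: str) -> str:
    raise ValueError("simulated failure")


print(fail_closed(mask_emails)("Write to ana@example.com"))  # Write to [EMAIL]
print(fail_closed(buggy_rule)("Write to ana@example.com"))   # [ANONYMIZATION_FAILED]
```

The trade-off is that a failing rule hides the whole value, which loses debugging detail but avoids leaking unredacted data.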

Remember that anonymization is a one-way process: once data is anonymized in Opik, the original values cannot be recovered. Plan your anonymization strategy accordingly.