Lesson 2: Data Sources and Collection
Master the identification and use of data sources for effective threat hunting
Learning Objectives
By the end of this lesson, you will be able to:
- Identify key data sources for threat hunting
- Understand endpoint telemetry data collection
- Analyze network traffic for hunting purposes
- Implement log aggregation and correlation
- Integrate threat intelligence feeds
- Ensure data quality and normalization
Data Sources Overview
Importance of Data Sources
Effective threat hunting relies heavily on the quality, breadth, and depth of available data sources. The more comprehensive your data collection, the better your chances of detecting sophisticated threats that may have evaded traditional security controls.
Key Principles for Data Sources:
- Completeness: Collect data from all relevant sources
- Quality: Ensure data accuracy and integrity
- Timeliness: Real-time or near real-time collection
- Retention: Maintain historical data for analysis
- Normalization: Standardize data formats for correlation
Data Source Categories
Endpoint Data
- Process execution logs
- File system changes
- Registry modifications
- Network connections
- User activity logs
Network Data
- Network flow data (NetFlow)
- Packet capture (PCAP)
- DNS query logs
- Proxy logs
- Firewall logs
System Logs
- Windows Event Logs
- Syslog messages
- Application logs
- Security logs
- Authentication logs
Threat Intelligence
- IOC feeds
- Threat actor profiles
- Malware signatures
- TTP information
- Vulnerability data
Endpoint Telemetry Data
Process Monitoring
Purpose: Track process creation, execution, and termination events
Key Data Points:
- Process name and path
- Command line arguments
- Parent process information
- User and session context
- Timing information
Hunting Use Cases:
- Detection of suspicious process chains
- Identification of living-off-the-land techniques
- Analysis of lateral movement patterns
- Monitoring of privilege escalation attempts
Example Queries:
# PowerShell execution detection
ProcessName = "powershell.exe" AND CommandLine CONTAINS "-enc"

# Suspicious parent-child relationships
ParentProcessName = "explorer.exe" AND ProcessName = "cmd.exe"

# Process execution from temp directories
ProcessPath CONTAINS "temp" OR ProcessPath CONTAINS "appdata"
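The query patterns above can be expressed as a small triage function. Below is a minimal Python sketch; the event schema (plain dicts with `process`, `command_line`, `parent`, and `path` keys) and the suspicious parent-child list are illustrative assumptions, not any specific EDR's format.

```python
# Hypothetical parent-child pairs worth flagging; real lists would be
# curated from threat intelligence and baseline data.
SUSPICIOUS_PARENTS = {("explorer.exe", "cmd.exe"), ("winword.exe", "powershell.exe")}

def flag_process_event(event: dict) -> list[str]:
    """Return a list of reasons a process event looks suspicious (empty if clean)."""
    reasons = []
    name = event.get("process", "").lower()
    cmdline = event.get("command_line", "").lower()
    parent = event.get("parent", "").lower()
    path = event.get("path", "").lower()
    # Encoded PowerShell commands are a common obfuscation technique.
    if name == "powershell.exe" and "-enc" in cmdline:
        reasons.append("encoded powershell")
    # Unusual parent-child pairs can indicate macro abuse or injection.
    if (parent, name) in SUSPICIOUS_PARENTS:
        reasons.append(f"suspicious parent-child: {parent} -> {name}")
    # Execution from temp/appdata paths is a common malware staging pattern.
    if "\\temp\\" in path or "\\appdata\\" in path:
        reasons.append("execution from temp/appdata")
    return reasons
```

A single event can trip several rules at once, which is useful for scoring rather than binary alerting.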
File System Monitoring
Purpose: Track file creation, modification, and deletion events
Key Data Points:
- File path and name
- File size and timestamps
- File hash values
- File permissions and attributes
- Process performing the operation
Hunting Use Cases:
- Detection of malware drops
- Identification of data exfiltration
- Monitoring of configuration changes
- Analysis of persistence mechanisms
Example Queries:
# Suspicious file extensions
FileName ENDS WITH ".scr" OR FileName ENDS WITH ".pif"

# Files created in system directories
FilePath CONTAINS "system32" AND EventType = "FileCreated"

# Large file transfers (> ~100 MB)
FileSize > 100000000 AND EventType = "FileCreated"
Registry Monitoring
Purpose: Track registry key and value modifications
Key Data Points:
- Registry key path
- Value name and data
- Operation type (create, modify, delete)
- Process performing the operation
- Timestamp of the change
Hunting Use Cases:
- Detection of persistence mechanisms
- Identification of system configuration changes
- Monitoring of security software tampering
- Analysis of privilege escalation attempts
Example Queries:
# Run key modifications
RegistryPath CONTAINS "Run" AND EventType = "RegistryModified"

# Security software tampering
RegistryPath CONTAINS "antivirus" AND EventType = "RegistryDeleted"

# Suspicious value data
RegistryValue CONTAINS "powershell" OR RegistryValue CONTAINS "cmd"
Network Traffic Analysis
Network Flow Data (NetFlow)
Purpose: Analyze network communication patterns and connections
Key Data Points:
- Source and destination IP addresses
- Source and destination ports
- Protocol information
- Byte and packet counts
- Connection duration and timing
Hunting Use Cases:
- Detection of C2 communications
- Identification of data exfiltration
- Analysis of lateral movement
- Monitoring of suspicious connections
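Because flow records carry precise timing even without payload, one classic NetFlow hunting technique is beacon detection: C2 implants often call home at near-fixed intervals. Below is a hedged Python sketch; the `(timestamp, src, dst)` flow tuples and the 10% jitter threshold are illustrative assumptions, not tuned values.

```python
from collections import defaultdict
from statistics import mean, pstdev

def find_beacons(flows, min_flows=5, max_jitter=0.1):
    """flows: iterable of (timestamp_seconds, src_ip, dst_ip) tuples.
    Returns (src, dst, avg_interval) for pairs whose inter-flow timing
    is suspiciously regular."""
    by_pair = defaultdict(list)
    for ts, src, dst in flows:
        by_pair[(src, dst)].append(ts)
    beacons = []
    for (src, dst), times in by_pair.items():
        if len(times) < min_flows:
            continue
        times.sort()
        intervals = [b - a for a, b in zip(times, times[1:])]
        avg = mean(intervals)
        # Coefficient of variation: a low value means machine-like regularity.
        if avg > 0 and pstdev(intervals) / avg < max_jitter:
            beacons.append((src, dst, avg))
    return beacons
```

In practice you would also whitelist known-chatty destinations (update servers, monitoring agents) before alerting.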
DNS Query Analysis
Purpose: Monitor domain name resolution requests
Key Data Points:
- Query domain names
- Query types (A, AAAA, MX, etc.)
- Response information
- Client IP addresses
- Query frequency and patterns
Hunting Use Cases:
- Detection of DNS tunneling
- Identification of domain generation algorithms
- Analysis of suspicious domains
- Monitoring of data exfiltration via DNS
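A common way to hunt for domain generation algorithms in DNS logs is character entropy: machine-generated labels tend toward uniformly random characters, while human-chosen names do not. A minimal Python sketch follows; the 3.5-bit threshold and 8-character minimum are illustrative assumptions, not tuned values.

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def flag_dga_candidates(domains, threshold=3.5):
    """Return domains whose second-level label has high character entropy.
    Assumes simple 'label.tld' names; real logs need proper eTLD parsing."""
    flagged = []
    for domain in domains:
        label = domain.lower().split(".")[0]
        if len(label) >= 8 and shannon_entropy(label) >= threshold:
            flagged.append(domain)
    return flagged
```

Entropy alone produces false positives (CDN hostnames, tracking domains), so it works best combined with query frequency and NXDOMAIN rates.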
Packet Capture (PCAP)
Purpose: Deep packet inspection for detailed network analysis
Key Data Points:
- Full packet payload
- Protocol headers
- Application layer data
- Encrypted traffic metadata
- Timing and sequence information
Hunting Use Cases:
- Deep analysis of suspicious traffic
- Identification of custom protocols
- Analysis of encrypted communications
- Reconstruction of attack sequences
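Before handing a capture to a full parser such as tshark or scapy, it helps to know what a capture file's metadata looks like. The sketch below reads the 24-byte global header of a classic libpcap file using only the standard library (classic pcap only, not pcapng).

```python
import struct

def read_pcap_header(data: bytes) -> dict:
    """Parse the 24-byte libpcap global header from raw bytes."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # little-endian file, microsecond timestamps
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # big-endian file
    else:
        raise ValueError("not a classic pcap file")
    # Fields: version major/minor, thiszone, sigfigs, snaplen, linktype
    major, minor, _tz, _sig, snaplen, linktype = struct.unpack(
        endian + "HHiIII", data[4:24]
    )
    return {"version": f"{major}.{minor}", "snaplen": snaplen, "linktype": linktype}
```

The snaplen tells you whether packets were truncated at capture time, and the linktype (1 = Ethernet) tells a parser how to decode each frame.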
Log Aggregation and Correlation
Centralized Log Management
Effective threat hunting requires centralized collection and normalization of logs from diverse sources.
Windows Event Logs
- Security Log: Authentication and authorization events
- System Log: System-level events and errors
- Application Log: Application-specific events
- PowerShell Log: PowerShell execution events
Linux/Unix Logs
- Syslog: System and application messages
- Auth Log: Authentication events
- Kernel Log: Kernel-level events
- Application Logs: Service-specific logs
Network Device Logs
- Firewall Logs: Traffic filtering and blocking events
- Router Logs: Routing and network events
- Switch Logs: Port and VLAN events
- Proxy Logs: Web traffic and filtering events
Log Correlation Techniques
Time-based Correlation
Correlate events that occur within specific time windows to identify attack sequences.
# Example: failed logins within 5 minutes before a successful login
EventType = "FailedLogin"
AND UserName = SuccessfulLogin.UserName
AND EventTime >= SuccessfulLogin.EventTime - 5 minutes
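The same time-window logic can be sketched in Python: pair each successful login with failed logins for the same user in the preceding five minutes, a common brute-force-then-success pattern. The event schema (dicts with `time`, `type`, and `user` keys) and the three-failure threshold are illustrative assumptions.

```python
WINDOW = 300  # correlation window, in seconds

def correlate_logins(events, min_failures=3):
    """events: list of dicts with "time" (epoch seconds), "type", and "user".
    Returns (user, success_time, failure_count) tuples where a success was
    preceded by >= min_failures failed logins within the window."""
    events = sorted(events, key=lambda e: e["time"])
    suspicious = []
    for i, evt in enumerate(events):
        if evt["type"] != "SuccessfulLogin":
            continue
        failures = [
            e for e in events[:i]
            if e["type"] == "FailedLogin"
            and e["user"] == evt["user"]
            and evt["time"] - e["time"] <= WINDOW
        ]
        if len(failures) >= min_failures:
            suspicious.append((evt["user"], evt["time"], len(failures)))
    return suspicious
```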
IP-based Correlation
Track activities from specific IP addresses across multiple data sources.
# Example: Track all activities from suspicious IP
SourceIP = "192.168.1.100" OR DestinationIP = "192.168.1.100"
User-based Correlation
Monitor all activities associated with specific user accounts.
# Example: Track user activities across systems
Username = "admin" OR UserSID = "S-1-5-21-..."
Threat Intelligence Integration
Types of Threat Intelligence
Indicators of Compromise (IOCs)
- IP Addresses: Malicious or suspicious IPs
- Domain Names: Malicious domains and URLs
- File Hashes: MD5, SHA1, SHA256 of malware
- Email Addresses: Phishing and spam sources
Tactics, Techniques, and Procedures (TTPs)
- Attack Patterns: Common attack methodologies
- Tools and Techniques: Malware and attack tools
- Infrastructure: C2 servers and domains
- Behavioral Patterns: Attacker behaviors and habits
Threat Actor Intelligence
- Attribution: Known threat groups and actors
- Motivations: Financial, political, espionage
- Capabilities: Technical skills and resources
- Targeting: Industries and organizations
Integration Strategies
Automated IOC Matching
Automatically match collected data against known IOCs from threat intelligence feeds.
# Example: Match network connections against malicious IPs
NetworkConnection.SourceIP IN ThreatIntelligence.MaliciousIPs
OR NetworkConnection.DestinationIP IN ThreatIntelligence.MaliciousIPs
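In code, automated IOC matching reduces to set membership tests, which stay O(1) per lookup no matter how large the feed grows. A minimal Python sketch; the connection dicts are a hypothetical schema, and in practice the indicator set would be populated from real feeds (e.g. STIX/TAXII or MISP exports).

```python
def match_iocs(connections, malicious_ips):
    """connections: iterable of dicts with "src" and "dst" IP strings.
    malicious_ips: a set of known-bad IPs from threat intelligence.
    Returns the connections touching a known-bad IP, tagged with the match."""
    hits = []
    for conn in connections:
        # Intersect both endpoints with the indicator set in one step.
        matched = {conn["src"], conn["dst"]} & malicious_ips
        if matched:
            hits.append({**conn, "matched_iocs": sorted(matched)})
    return hits
```

Tagging the hit with the matched indicator (rather than a bare boolean) preserves the context an analyst needs to pivot back into the feed.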
Behavioral Pattern Matching
Search for activities that match known attack patterns and techniques.
# Example: Detect living-off-the-land techniques
ProcessName IN ["powershell.exe", "cmd.exe", "wmic.exe"]
AND CommandLine CONTAINS ThreatIntelligence.SuspiciousCommands
Contextual Enrichment
Enrich hunting queries with contextual information from threat intelligence.
# Example: Search for activities associated with specific threat groups
ThreatIntelligence.AttributedGroup = "APT29"
AND (ProcessName CONTAINS "sophos" OR RegistryPath CONTAINS "security")
Data Quality and Normalization
Data Quality Challenges
Ensuring high-quality data is crucial for effective threat hunting. Poor data quality can lead to missed threats or false positives.
Data Inconsistency
Different systems may log the same event in different formats or with different field names.
Solutions:
- Implement data normalization rules
- Use standard field naming conventions
- Create data mapping tables
Missing Data
Some events may not be logged due to configuration issues or system limitations.
Solutions:
- Implement comprehensive logging policies
- Use multiple data sources for redundancy
- Monitor data collection health
Data Volume
Large volumes of data can overwhelm analysis capabilities and slow down hunting activities.
Solutions:
- Implement data filtering and aggregation
- Use tiered storage strategies
- Optimize query performance
Data Normalization Techniques
Field Standardization
Standardize field names and formats across all data sources.
# Standard field names
SourceIP, DestinationIP, SourcePort, DestinationPort
EventTime, EventType, UserName, ProcessName
Time Standardization
Convert all timestamps to a common timezone and format.
# UTC timestamp format
EventTime: "2024-01-15T14:30:25.123Z"
Value Normalization
Normalize values to consistent formats (e.g., lowercase, trimmed strings).
# Normalized values
ProcessName: "powershell.exe" (not "PowerShell.EXE")
UserName: "john.doe" (not "JOHN.DOE")
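Putting the three techniques together, normalization can be sketched as a single function: map source-specific field names to a standard schema, lowercase string values, and render timestamps as UTC ISO 8601. The field map below covering two hypothetical source formats is an illustrative assumption.

```python
from datetime import datetime, timezone

# Hypothetical mapping from source-specific field names to the standard schema.
FIELD_MAP = {
    "src_ip": "SourceIP", "source_address": "SourceIP",
    "dst_ip": "DestinationIP", "dest_address": "DestinationIP",
    "image": "ProcessName", "process_name": "ProcessName",
    "user": "UserName", "account_name": "UserName",
}

def normalize_event(raw: dict, epoch_seconds: float) -> dict:
    """Return an event with standard field names, lowercased string values,
    and a UTC EventTime in ISO 8601 format with a trailing Z."""
    event = {}
    for key, value in raw.items():
        std_key = FIELD_MAP.get(key.lower())
        if std_key is None:
            continue  # drop fields we have no mapping for
        event[std_key] = value.strip().lower() if isinstance(value, str) else value
    event["EventTime"] = datetime.fromtimestamp(
        epoch_seconds, tz=timezone.utc
    ).isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return event
```

Silently dropping unmapped fields keeps the schema clean; a production pipeline would usually preserve them in a raw/extra field instead so no evidence is lost.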
Hands-On Exercise
Exercise: Data Source Assessment and Planning
Objective: Assess available data sources and develop a data collection strategy for threat hunting.
Scenarios:
Scenario 1: Small Enterprise Environment
Situation: You're setting up threat hunting for a 100-employee company with basic security infrastructure.
Requirements:
- Identify available data sources
- Assess data quality and completeness
- Recommend data collection improvements
- Develop data normalization strategy
Scenario 2: Large Enterprise Environment
Situation: You're optimizing threat hunting for a 10,000-employee enterprise with comprehensive security tools.
Requirements:
- Map all available data sources
- Identify data gaps and redundancies
- Optimize data collection and storage
- Develop correlation strategies
Scenario 3: Cloud-First Environment
Situation: You're implementing threat hunting for a cloud-native organization using AWS, Azure, and SaaS applications.
Requirements:
- Identify cloud-specific data sources
- Assess cloud logging capabilities
- Plan data integration from multiple clouds
- Address cloud security and compliance
Deliverables:
- Data source inventory and assessment
- Data collection strategy document
- Data normalization and correlation plan
- Implementation roadmap with priorities