The migration to the cloud, particularly AWS, has introduced massive scale and complexity, often overwhelming traditional security approaches. Security teams are constantly reacting to misconfigurations, fighting “fire after fire” through manual Jira tickets and Slack alerts. To break this reactive cycle and keep pace with developers, security needs to be proactive, automated, and deeply integrated.
We recently spoke with Lily Chau, a “friendly neighborhood security janitor at Roku” and self-proclaimed “sub-short hacker” who shifted from reactive security to building tools. In our conversation, Lily detailed her dual-track strategy for cloud security, focusing on secure defaults and auto-remediation, the subject of her BSidesSF talk, “WIZBANG Lambda Fix.”
This article summarizes the essential steps and philosophy required to transition your organization from being reactive to having a self-healing cloud environment.
You can read the complete transcript of the episode here.
What is the Dual-Track Approach to AWS Cloud Security?
Lily emphasizes that tackling security in AWS effectively requires a dual-track approach.
Track 1: Secure Defaults (Prevention)
This track focuses on prevention by setting the strongest possible foundation and guardrails before code is deployed.
- Foundation: Set up AWS Organizations with foundational tools like CloudTrail and GuardDuty.
- Access Control: Enforce least-privilege IAM roles.
- Guardrails: Implement Service Control Policies (SCPs).
- Infrastructure as Code (IaC): Use secure-by-default templates (e.g., Terraform) and predefined CI/CD templates that scan for critical misconfigurations and block them from being pushed to production.
- Drift Detection: Any issues that do arise should live in the code, where you can sample, block, and correct them via drift detection.
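A CI/CD guardrail of the kind described above can be sketched as a small check over Terraform's JSON plan output. This is a minimal illustration, not Lily's implementation: the function names are hypothetical, the single rule (planned EC2 instances must require IMDSv2) is chosen only as an example, and the sketch assumes the plan was exported with `terraform show -json` and that the `metadata_options` block appears as a list of objects.

```python
import json


def find_imdsv1_instances(plan: dict) -> list[str]:
    """Return addresses of planned aws_instance resources that do not require IMDSv2."""
    violations = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_instance":
            continue
        after = (rc.get("change") or {}).get("after")
        if after is None:
            continue  # resource is being deleted, nothing to check
        # Assumption: nested blocks are rendered as a list of objects.
        blocks = after.get("metadata_options") or []
        tokens = blocks[0].get("http_tokens") if blocks else None
        if tokens != "required":
            violations.append(rc.get("address", "<unknown>"))
    return violations


def gate(plan_json: str) -> int:
    """Return a process exit code: 1 blocks the pipeline, 0 lets it continue."""
    violations = find_imdsv1_instances(json.loads(plan_json))
    for address in violations:
        print(f"BLOCKED: {address} does not enforce IMDSv2")
    return 1 if violations else 0
```

In a pipeline, the return value of `gate` would typically be passed to `sys.exit` so that a non-zero code fails the job before anything reaches production.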
Track 2: Auto-Remediation (Reaction to Manual Drift)
Security teams are often challenged: if we use IaC and have strong Track 1 controls, why do we need to invest in Track 2 (auto-remediation)? Lily explains that while you should invest heavily in IaC (Track 1), auto-remediation (Track 2) is necessary because edge cases and manual deviations always happen.
- The “Spin Up Quickly” Problem: Developers manually spin up EC2s for “test purposes,” or they are unfamiliar with the golden standard service mesh and are simply more comfortable with older methods like Docker and Kubernetes.
- Break Glass Scenarios: Even in production, break glass accounts are needed for manual intervention if something breaks down.
- Containment: Automation is crucial for containment of common compromise types (e.g., quarantining an EC2, applying public access blocks, or disabling leaked credentials).
- Correcting Drift: Auto-remediation is also needed within Track 1 to automatically correct deviations from the known good state defined in your IaC, such as IAM drift in production.
The goal of Track 2 is to ensure manual click operations don’t introduce vulnerabilities that the organization “can’t live with”.
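The drift-correction idea above (bringing production IAM back to the state defined in IaC) reduces to a set comparison. The sketch below is illustrative only: `plan_drift_fix` is a hypothetical helper, and it assumes the desired and actual managed-policy ARNs for a role have already been collected from your IaC and from the account.

```python
def plan_drift_fix(desired: set[str], actual: set[str]) -> dict[str, list[str]]:
    """Compare IaC-declared policy ARNs with what is attached and plan the fix."""
    return {
        # Attached by hand but not declared in code: candidates for removal.
        "detach": sorted(actual - desired),
        # Declared in code but missing in the account: candidates for re-attachment.
        "attach": sorted(desired - actual),
    }


desired = {"arn:aws:iam::aws:policy/ReadOnlyAccess"}
actual = {
    "arn:aws:iam::aws:policy/ReadOnlyAccess",
    "arn:aws:iam::aws:policy/AdministratorAccess",  # added through the console
}
fix = plan_drift_fix(desired, actual)
```

Keeping the planning step separate from the step that actually changes IAM makes it easier to review or dry-run a remediation before it runs unattended.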
How Should Security Prioritize What to Auto-Remediate?
Prioritization is difficult because security is always time- and resource-constrained. Lily emphasizes that if you buy a cloud tool and get millions of findings, you must focus on high-impact areas rather than low-hanging fruit.
Focus Areas Beyond Public S3 Buckets
While most companies only fix public S3 buckets (which Lily notes is often the only thing people jump up and down about), high-impact remediation should focus on categories that significantly reduce systemic risk:
Security Misconfigurations
- Focus on preventing credential exfiltration by mitigating IMDSv1 exposure (EC2 instances, Auto Scaling groups, AMIs, EKS).
- This is the single best thing to reduce the risk of Server-Side Request Forgery (SSRF) attacks.
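As a rough sketch of what an IMDSv1 remediation might look like for EC2 instances, the function below scans data shaped like the `Reservations` list returned by the EC2 `DescribeInstances` API and builds the arguments for a `ModifyInstanceMetadataOptions` call that requires IMDSv2 tokens. The function name is hypothetical, and the actual API call is deliberately left out so the logic can be reviewed without touching an account.

```python
def imdsv1_fixes(reservations: list[dict]) -> list[dict]:
    """Build ModifyInstanceMetadataOptions arguments for instances that still allow IMDSv1."""
    fixes = []
    for reservation in reservations:
        for instance in reservation.get("Instances", []):
            options = instance.get("MetadataOptions", {})
            if options.get("HttpEndpoint") == "disabled":
                continue  # metadata service is switched off entirely
            if options.get("HttpTokens") != "required":
                fixes.append({
                    "InstanceId": instance["InstanceId"],
                    "HttpTokens": "required",  # session tokens required, i.e. IMDSv2 only
                    "HttpEndpoint": "enabled",
                })
    return fixes
```

With boto3, each returned dictionary could be passed as keyword arguments to the EC2 client's `modify_instance_metadata_options` method. Workloads that still depend on IMDSv1 would break, so a real rollout would likely audit usage first.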
Attack Surface Reduction
- Focus on Route 53 to mitigate subdomain takeover and hosted zone takeover.
- If the only records for a subdomain are NS records (no TXT, CNAME, or A records), delete the whole thing to mitigate future DNS zone takeover risk.
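One possible reading of the NS-only rule above can be expressed as a check over a hosted zone's record sets, shaped like the `ResourceRecordSets` list returned by the Route 53 `ListResourceRecordSets` API. The function name is hypothetical. The sketch also ignores SOA records, which is an assumption on my part: Route 53 creates NS and SOA records automatically for every hosted zone, so a zone holding only those two types carries no real content.

```python
DEFAULT_TYPES = {"NS", "SOA"}  # record types Route 53 creates for every hosted zone


def is_ns_only_zone(record_sets: list[dict]) -> bool:
    """Return True if the zone holds NS records and nothing beyond the defaults."""
    types = {record["Type"] for record in record_sets}
    return "NS" in types and types <= DEFAULT_TYPES


empty_zone = [
    {"Name": "dev.example.com.", "Type": "NS"},
    {"Name": "dev.example.com.", "Type": "SOA"},
]
live_zone = empty_zone + [{"Name": "api.dev.example.com.", "Type": "A"}]
```

A zone flagged this way is only a candidate for deletion. The matching delegation in the parent zone would need to be removed too, otherwise the dangling NS records remain a takeover risk.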
Threat Detection
- Focus on high-signal, low-noise threats, such as detecting someone downloading secrets via CloudShell or triggering AWS honeypot keys placed across the environment.
- High signal means it’s likely an attacker; immediate action is required (e.g., quarantining the EC2). Focus on attacker tactics that cause the most damage, like credential exfiltration.
Handling False Positives in Threat Detection
For threat detection, where false positives are possible, remediation workflows should require a Slack user response before moving on to the automated fix.
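The human-in-the-loop gate described above can be sketched as a small decision function. Everything here is an illustrative assumption: the `Decision` names, the `"approve"` reply value, and the idea that the Slack response arrives as a plain string (or `None` while nobody has answered yet).

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    REMEDIATE = "remediate"
    WAIT_FOR_HUMAN = "wait_for_human"
    DISMISS = "dismiss"


def decide(false_positive_possible: bool, slack_response: Optional[str]) -> Decision:
    """Decide whether the automated fix may run for a given finding."""
    if not false_positive_possible:
        return Decision.REMEDIATE  # no ambiguity, fix immediately
    if slack_response is None:
        return Decision.WAIT_FOR_HUMAN  # nobody has answered in Slack yet
    if slack_response.strip().lower() == "approve":
        return Decision.REMEDIATE
    return Decision.DISMISS  # any other reply is treated as a false positive
```

Treating every non-approving reply as a dismissal is a conservative default for this sketch. A real workflow might also add a timeout so unanswered prompts escalate instead of waiting forever.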
How Do Organizations Measure the ROI and Impact of Auto-Remediation?
Measuring the success of the auto-remediation program involves comparing the results of the two security tracks simultaneously.
- Track 1 (Secure Defaults) Metrics:
  - Elimination of Bug Classes: Track how many bug classes (e.g., cross-site scripting) have been eliminated and the percentage of coverage across the code base.
  - Adoption Rate: Track the adoption rate of secure IaC templates and CI/CD templates with predefined guardrails.
  - Target: Measure the increase in secure default setups.
- Track 2 (Auto-Remediation) Metrics:
  - Violation Reduction: Observe a decrease in policy violations resulting from auto-remediation of manual click operations.
  - Validation: Use bug bounty programs or penetration testing to verify that even in a compromised state (e.g., a virtual machine on which an attacker has gained a user role), the compromised entity cannot perform harmful actions due to the security controls.
The goal is to see your Track 1 metrics (secure defaults) go up, and simultaneously, your Track 2 metrics (manual policy violations needing remediation) should go down.
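The two trends can be tracked with very little code. The sketch below uses hypothetical function names and made-up quarterly numbers purely for illustration: it computes a template adoption rate and checks that adoption rose while manual policy violations fell over the same period.

```python
def adoption_rate(services_on_template: int, total_services: int) -> float:
    """Share of services built from the secure-by-default templates."""
    if total_services == 0:
        return 0.0
    return services_on_template / total_services


def is_healthy_trend(adoption: list[float], violations: list[int]) -> bool:
    """True if adoption went up and violations went down between first and last sample."""
    if len(adoption) < 2 or len(violations) < 2:
        return False  # a trend needs at least two data points
    return adoption[-1] > adoption[0] and violations[-1] < violations[0]


# Illustrative numbers only, one sample per quarter.
quarterly_adoption = [adoption_rate(30, 120), adoption_rate(60, 120), adoption_rate(90, 120)]
quarterly_violations = [90, 70, 40]
healthy = is_healthy_trend(quarterly_adoption, quarterly_violations)
```

Comparing only the first and last samples is the simplest possible trend test. A noisier data set would call for something sturdier, such as a moving average.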
What are the Biggest Challenges in Building an Auto-Remediation Program?
Lily highlights that the biggest hurdles are not always technical; they are organizational.
- Company Buy-In: The primary challenge is getting company buy-in. Security teams get “fed up” with the same issues coming up year over year, despite writing detailed Jira tickets.
  - Solution: Collect metrics to show the horrible mean time to remediate (MTTR). This data makes it easier to convince management to let the security team take the reins and apply remediations themselves.
- Balancing Prevention vs. Remediation: It’s tempting to deploy as many auto-remediations as possible, but prevention (Track 1) is the only thing that significantly narrows the gap between your large attack surface and your small remediation capacity.
  - Focus: Build preventative measures and remediations that cover the attack surface so well that a developer is unlikely to make a beginner mistake and an attacker is unlikely to find a path through the network without triggering an alarm. Ideally, remediation should only fire when “something goes really, really wrong”.
- Stakeholder Management: Auto-remediation requires buy-in and collaboration from three key stakeholders:
  - Security Team: Drives success and outlines the streamlined remediation process.
  - Cloud Infrastructure Team: Ensures the necessary permissions (IAM, StackSets, Least Privilege) are configured across all AWS accounts for automation.
  - Developers: Their workflows are impacted. The goal is to get their buy-in by using Track 1 to enable them to ship code faster without needing to worry about security or the infrastructure layer.
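The MTTR metric suggested under Company Buy-In is straightforward to compute once ticket open and close times are exported. This is a minimal sketch with a hypothetical function name and invented dates. It assumes each ticket is represented as an (opened, closed) pair of `datetime` values.

```python
from datetime import datetime
from statistics import mean


def mttr_days(tickets: list[tuple[datetime, datetime]]) -> float:
    """Mean time to remediate, in days, across (opened, closed) ticket pairs."""
    if not tickets:
        raise ValueError("at least one closed ticket is required")
    seconds = [(closed - opened).total_seconds() for opened, closed in tickets]
    return mean(seconds) / 86400  # 86400 seconds per day


# Illustrative tickets: one fixed in 10 days, one in 30 days.
tickets = [
    (datetime(2024, 1, 1), datetime(2024, 1, 11)),
    (datetime(2024, 2, 1), datetime(2024, 3, 2)),
]
average = mttr_days(tickets)
```

Reporting the same figure before and after an auto-remediation is introduced gives management a direct comparison for that class of finding.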
How Will AI Transform the Future of Auto-Remediation?
Lily is “very excited about AI as it applies to security” and sees it making remediation far more sophisticated.
- Predictive and Preemptive Action: Future solutions will not only detect anomalies but predict potential threats based on patterns and behaviors, allowing for preemptive auto-remediation.
- Self-Healing Systems: AI will drive more self-healing systems that automatically correct vulnerabilities and configuration issues without human intervention.
- Reduced False Positives: AI can significantly reduce false positives by better distinguishing between benign anomalies and actual threats, allowing security teams to focus on genuine critical issues.
- Policy-Driven Enforcement: Organizations will define security policies, and AI will simply enforce them automatically.
How Should Organizations Start Building an Auto-Remediation Program?
Lily recommends a three-pronged strategy focused on metrics and achieving a golden standard:
- Define the Golden Standard: Define where you want to get to (e.g., Istio microservices architecture, a generic secure-by-default standard).
- Collect Metrics: Collect metrics for MTTR and compare Track 1 (secure defaults) against Track 2 (auto-remediation).
- Show Progress in Three Areas:
  - New Services: Show that new services are automatically adopting the golden standard.
  - Existing Services: Show that existing services are migrating to the golden standard while ensuring business continuity.
  - Bypassed Standards: Show progress on everything (new and existing) that bypasses your standards (e.g., manual EC2 spin-ups, containerizing with Docker, or using Lambda/PaaS).
This structure demonstrates progress, value, and where remediation efforts are most needed. Lily’s approach essentially boils down to making the secure way easy for developers and the insecure way harder.
Conclusion
The journey from a reactive DevOps environment to a proactive DevSecOps culture is fundamentally a shift in human process, not just technology. The security tools and automation are essential, but they are only effective when used with pragmatism and empathy.
As Lily Chau powerfully demonstrated, true velocity isn’t achieved by pushing more findings faster; it’s achieved by acting as the “single source of truth” that normalizes data, provides context, and prioritizes the handful of issues that pose an existential threat to the business. By setting pragmatic SLOs, adopting incremental improvements, and treating developers as partners rather than adversaries, security can transition from being the “no cop” to the lubrication that enables high-speed, secure delivery. The ultimate win is when a critical fix takes 20 minutes instead of a year.