The 46-Minute Problem

The IAMTrail detection engine fetches ~1,500 AWS managed IAM policies every run. The original approach was pure bash:

aws iam list-policies --output json |
    jq -cr '...' |
    xargs -P 16 -n3 sh -c \
      'aws iam get-policy-version --policy-arn "$1" --version-id "$2" | jq --indent 4 . > "policies/$3"' sh

Looks fine, right? Except each iteration spawns a full AWS CLI process: a fresh Python runtime, a boto3 import, credential resolution, a single HTTP call, then exit. Times 1,500.

Even with -P 16 parallelism on a 0.5 vCPU / 1 GiB Fargate Spot task, this took 46 minutes.

The Fix: One Process to Rule Them All

The replacement is embarrassingly simple - a ~60-line Python script using boto3 with a ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.config import Config

iam = boto3.client("iam", config=Config(
    retries={"max_attempts": 5, "mode": "adaptive"},
    max_pool_connections=32,
))

with ThreadPoolExecutor(max_workers=32) as pool:
    pool.map(fetch_and_write, policies)

One process, one boto3 session, one connection pool, 32 lightweight threads. No more spawning 1,500 Python runtimes.
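The post doesn't show fetch_and_write itself, so here is a minimal sketch of what such a worker might look like. The function signature, the (arn, version_id, filename) tuple shape, and the policies/ output directory are assumptions mirroring the three xargs arguments above, not the actual IAMTrail code:

```python
import json
from pathlib import Path

def fetch_and_write(policy, client, out_dir=Path("policies")):
    # `policy` is assumed to be an (arn, version_id, filename) tuple,
    # mirroring the three arguments xargs passed in the bash version.
    arn, version_id, filename = policy
    resp = client.get_policy_version(PolicyArn=arn, VersionId=version_id)
    # Same serialization contract as the rest of the pipeline:
    # 4-space indent, trailing newline, LF line endings.
    serialized = json.dumps(resp["PolicyVersion"], indent=4, default=str) + "\n"
    with open(out_dir / filename, "w", newline="\n") as f:
        f.write(serialized)
```

In the real script the shared boto3 client would be bound once - e.g. with functools.partial(fetch_and_write, client=iam) - so pool.map only has to pass the policy tuples.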

The Tricky Part: Format Compatibility

The dangerous part of this migration is not the code - it’s the output. IAMTrail stores policies as JSON files in a git repository. If the Python serialization differs by even one byte from the bash version, git sees it as a “change” and we get a massive false-positive commit affecting all 1,500+ policies.

I actually hit this once already during the debugging session. The original bash pipeline used jq -S (sort keys alphabetically), but the existing files in the repo had the natural API key order. That single -S flag turned every policy file into a “change”.
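The same trap exists on the Python side: json.dumps(sort_keys=True) reorders keys exactly the way jq -S does, which would re-trigger the mass diff. A tiny illustration (the document below is a made-up minimal example, not a real policy):

```python
import json

# IAM's API returns "Version" before "Statement"; sorting flips that order.
doc = {"Version": "2012-10-17", "Statement": [{"Effect": "Allow"}]}

natural = json.dumps(doc, indent=4)                    # preserves API key order
alphabetical = json.dumps(doc, indent=4, sort_keys=True)  # behaves like jq -S

print(natural == alphabetical)  # False: same data, different bytes
```

Semantically identical JSON, byte-different files - and git only sees bytes.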

For the Python migration, I verified byte-level compatibility:

import json

# The format contract
serialized = json.dumps(payload, indent=4, default=str) + "\n"

with open(path, "w", newline="\n") as f:  # Force LF, no CRLF
    f.write(serialized)

Tested against all 1,547 files in the repo: 1,527 byte-identical, 20 outliers that were pre-existing legacy artifacts from 2018.
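The verification itself can be a short loop: parse each stored file, re-serialize it under the new format contract, and compare raw bytes against what's on disk. A sketch of that check (the function name and directory layout here are illustrative, not lifted from the repo):

```python
import json
from pathlib import Path

def check_byte_compat(policy_dir):
    """Return the filenames whose re-serialization is not byte-identical."""
    mismatches = []
    for path in sorted(Path(policy_dir).glob("*.json")):
        original = path.read_bytes()
        payload = json.loads(original)
        # Re-apply the format contract and compare at the byte level.
        reserialized = (json.dumps(payload, indent=4, default=str) + "\n").encode()
        if reserialized != original:
            mismatches.append(path.name)
    return mismatches
```

Anything this returns is either a legacy artifact to grandfather in or a serialization bug to fix before the first automated commit.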

Results

Metric             Before (bash)       After (Python)
Duration           46 min              2 min 30 s
Processes spawned  1,500+              1
Fargate config     0.5 vCPU / 1 GiB    0.25 vCPU / 0.5 GiB
Format diff        -                   0 (byte-identical)

With a 2.5 minute run, I could now justify switching from every-4-hours to hourly scans. Policy changes detected faster, same infrastructure, smaller task.

The Cost “Savings”

Now let’s talk about the part that makes me smile.

The annual Fargate Spot cost went from $10.18/year to $1.11/year (at hourly schedule).

Savings: $9.07/year.
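For the curious, the shape of those numbers falls out of Fargate's pricing model: billed vCPU-hours plus billed GB-hours per task. The rates below are assumptions (roughly us-east-1 Fargate Spot; Spot prices fluctuate), so this back-of-envelope model lands in the same ballpark as the post's figures rather than reproducing them exactly:

```python
# Assumed Fargate Spot rates in USD - illustrative, not the post's actual bill.
VCPU_PER_HOUR = 0.012144
GB_PER_HOUR = 0.0013335

def annual_cost(vcpu, gib, minutes_per_run, runs_per_day):
    hours_per_year = minutes_per_run / 60 * runs_per_day * 365
    return hours_per_year * (vcpu * VCPU_PER_HOUR + gib * GB_PER_HOUR)

before = annual_cost(0.5, 1.0, 46, 6)    # 46 min per run, every 4 hours
after = annual_cost(0.25, 0.5, 2.5, 24)  # 2.5 min per run, hourly
print(f"~${before:.2f}/yr before vs ~${after:.2f}/yr after")
```

Note the fun part: the "after" config runs six times as often on half the task size and still costs an order of magnitude less, because duration dominates the formula.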

To achieve this, I spent an evening pair-programming with Claude in Cursor - burning through what I can only assume is a very respectable amount of LLM tokens. Between the debugging session, the format analysis, the benchmarking, the false-positive investigation, the Terraform changes, the Docker rebuild, the failed deploys - I am fairly confident the API bill for this session alone exceeds the lifetime Fargate savings of this optimization.

Peak cloud economics.

Takeaways

  • Spawning CLI processes in a loop is expensive - not in dollars, but in time. A single boto3 session with threads beats 1,500 CLI invocations by 20x.
  • Byte-level format testing matters - when your output is tracked by git, a single extra space creates thousands of false-positive diffs.
  • FARGATE_SPOT is absurdly cheap - the real optimization currency is time, not money.

The full change is on GitHub. IAMTrail now scans every hour at iamtrail.com.