The Problem at Scale
Large enterprise applications run automated tests across multiple environments (dev/staging/prod), browsers (Chrome/Firefox/Safari), and mobile platforms (iOS/Android). A naive "run everything on every commit" strategy produces:
- 6-hour CI runs that block releases
- Flaky failures drowning real bugs
- Engineers ignoring test results because they take too long
The solution: intelligent test selection powered by change analysis + ML-based risk scoring, orchestrated through a GitLab multi-project pipeline.
Architecture Overview
```
┌─────────────────────────────────────────────┐
│           Release Repository                │
│    (orchestrates all test pipelines)        │
└──────────────────┬──────────────────────────┘
                   │ trigger via API
        ┌──────────┼──────────┐
        ▼          ▼          ▼
   Web Tests  Mobile Tests  API Tests
  (Playwright)  (Appium)     (REST)
        │          │          │
        └──────────┴──────────┘
                   │
          Results aggregated
          → Slack + Dashboard
```
The release repo is the single source of truth for what runs, when, and against which environment.
The Release Repository Pattern
```yaml
# release-repo/.gitlab-ci.yml
stages:
  - analyze
  - select
  - trigger
  - aggregate

analyze-change-impact:
  stage: analyze
  script:
    - >
      python3 scripts/analyze_changes.py
      --commit-range $CI_COMMIT_BEFORE_SHA..$CI_COMMIT_SHA
      --output change-map.json
  artifacts:
    paths: [change-map.json]

select-test-suites:
  stage: select
  script:
    - >
      python3 scripts/ai_test_selector.py
      --change-map change-map.json
      --risk-model models/risk_v3.pkl
      --output selected-suites.json
  artifacts:
    paths: [selected-suites.json]
    reports:
      # The selector also writes a dotenv file so downstream jobs can
      # read per-platform selections (e.g. $SELECTED_WEB_SUITES) as variables.
      dotenv: selected-suites.env

trigger-web:
  stage: trigger
  trigger:
    project: your-org/web-automation
    strategy: depend
  variables:
    SUITES: $SELECTED_WEB_SUITES
    ENV: $TARGET_ENV
    REGION: $TARGET_REGION
```
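The analyze stage is the simplest piece but the one everything else depends on. A minimal sketch of what `scripts/analyze_changes.py` might do, assuming a prefix-based mapping from repo paths to the feature flags the selector consumes (the `AREA_FLAGS` paths here are illustrative, not our real layout):

```python
import json
import subprocess

# Hypothetical mapping from path prefixes to change-map flags.
AREA_FLAGS = {
    "src/checkout/": "checkout_flow_touched",
    "src/auth/": "auth_layer_touched",
}

def changed_files(commit_range: str) -> list[str]:
    """List files touched in the commit range via `git diff --name-only`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", commit_range],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def build_change_map(files: list[str]) -> dict:
    """Convert a changed-file list into the selector's change-map features."""
    change_map = {"files_changed": files}
    for prefix, flag in AREA_FLAGS.items():
        change_map[flag] = int(any(f.startswith(prefix) for f in files))
    return change_map

# Demo on a literal file list rather than a live git range.
demo = build_change_map(["src/auth/login.py", "docs/setup.md"])
print(json.dumps(demo, indent=2))
```

The key design choice is that the change map is a flat dict of cheap, deterministic signals; anything expensive or probabilistic belongs in the model, not here.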
AI Test Selection
The selector reads the change map and scores each test suite by predicted failure probability:
```python
import joblib
import numpy as np

class AITestSelector:
    def __init__(self, risk_model_path: str, suite_catalog: dict):
        self.model = joblib.load(risk_model_path)
        # Per-suite history: days_since_last_failure, historical_flake_rate, etc.
        self.suite_catalog = suite_catalog
        self.all_suites = list(suite_catalog)

    def select(self, change_map: dict, threshold: float = 0.3) -> list[str]:
        features = self._extract_features(change_map)      # one row per suite
        scores = self.model.predict_proba(features)[:, 1]  # P(failure) per suite
        selected = [
            suite for suite, score in zip(self.all_suites, scores)
            if score >= threshold
        ]
        print(f"Selected {len(selected)}/{len(self.all_suites)} suites "
              f"(risk threshold: {threshold})")
        return selected

    def _extract_features(self, change_map: dict) -> np.ndarray:
        # Change-level features are shared across suites; history features
        # vary per suite, giving one feature row per candidate suite.
        return np.array([
            [
                len(change_map['files_changed']),
                change_map['checkout_flow_touched'],
                change_map['auth_layer_touched'],
                history['days_since_last_failure'],
                history['historical_flake_rate'],
            ]
            for history in self.suite_catalog.values()
        ])
```
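The `risk_v3.pkl` model itself is trained offline on historical pipeline outcomes. A sketch of that training step, assuming scikit-learn and a feature matrix with one row per (suite, run) pair using the same five features the selector extracts; the synthetic data here stands in for your real pipeline history:

```python
import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_risk_model(X: np.ndarray, y: np.ndarray, out_path: str):
    """Fit a failure-probability classifier and persist it for CI to load."""
    model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    model.fit(X, y)
    joblib.dump(model, out_path)
    return model

# Synthetic stand-in for pipeline history: 200 rows, 5 features,
# binary "suite failed" label loosely driven by the flake-rate column.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 4] + 0.2 * rng.random(200) > 0.6).astype(int)

model = train_risk_model(X, y, "risk_v3.pkl")
probs = model.predict_proba(X[:3])[:, 1]  # per-row failure probabilities
```

Retraining on a schedule (we found monthly sufficient) keeps the model aligned with the current flakiness profile of the suites.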
On average this reduces test execution time by 62% while maintaining 94% defect-detection coverage.
Parameterized Region/Environment Matrix
Applications serving multiple regions or tenants often have configuration-level differences. Parameterizing the matrix gives full visibility into region-specific failures without cross-contamination:
```yaml
.test-matrix:
  parallel:
    matrix:
      - REGION: [us-east, us-west, eu-central]
        BROWSER: [chrome, firefox]
        ENV: [staging]
```
Each combination runs independently. A failure in eu-central on Firefox doesn't mask a passing result in us-east on Chrome.
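The matrix expands combinatorially, which is easy to lose track of as dimensions grow. A quick sketch of the job list GitLab generates from the config above (values copied from the matrix; the job-name format is illustrative):

```python
from itertools import product

# Dimensions copied from the .test-matrix config above.
regions = ["us-east", "us-west", "eu-central"]
browsers = ["chrome", "firefox"]
envs = ["staging"]

# One independent job per combination, as parallel:matrix would expand it.
jobs = [
    f"test [{region}, {browser}, {env}]"
    for region, browser, env in product(regions, browsers, envs)
]
print(len(jobs))  # 3 regions x 2 browsers x 1 env = 6 independent jobs
```

Adding a second environment doubles the job count, so the AI selector's suite-level pruning matters even more once the matrix is in play.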
Results After 6 Months
| Metric | Before | After |
|---|---|---|
| Average CI duration | 6h 20m | 2h 15m |
| False positive rate | 18% | 4% |
| Defect escape rate | 3.2% | 0.8% |
| Engineer satisfaction | 😐 | 🙂 |
The last metric is real — we track it quarterly. Test suites engineers trust get used; test suites they don't trust get ignored.
Key Takeaways
This pattern scales to any enterprise application with:
- Multiple deployment targets — regions, tenants, environments
- Heterogeneous test types — web, mobile, API, performance
- High commit velocity — where running everything on every PR is unsustainable
The intelligence isn't in running fewer tests blindly — it's in knowing which tests to skip based on what actually changed, and backing that decision with a model trained on your own historical failure data.