
Building a GitLab CI/CD Pipeline for AI-Driven QE

A step-by-step walkthrough of a multi-project pipeline architecture for large-scale test automation — including AI-based test selection that cuts CI run time by 62%.


The Problem at Scale

Large enterprise applications run automated tests across multiple environments (dev/staging/prod), browsers (Chrome/Firefox/Safari), and mobile platforms (iOS/Android). A naive "run everything on every commit" strategy produces:

  • 6-hour CI runs that block releases
  • Flaky failures drowning real bugs
  • Engineers ignoring test results because they take too long

The solution: intelligent test selection powered by change analysis + ML-based risk scoring, orchestrated through a GitLab multi-project pipeline.

Architecture Overview

┌─────────────────────────────────────────────┐
│           Release Repository                │
│  (orchestrates all test pipelines)          │
└──────────────┬──────────────────────────────┘
               │  trigger via API
    ┌──────────┼──────────┐
    ▼          ▼          ▼
 Web Tests  Mobile Tests  API Tests
 (Playwright) (Appium)   (REST)
    │          │          │
    └──────────┴──────────┘
               │
          Results aggregated
          → Slack + Dashboard

The release repo is the single source of truth for what runs, when, and against which environment.

The Release Repository Pattern

# release-repo/.gitlab-ci.yml
stages:
  - analyze
  - select
  - trigger
  - aggregate

analyze-change-impact:
  stage: analyze
  script:
    - python3 scripts/analyze_changes.py
      --commit-range $CI_COMMIT_BEFORE_SHA..$CI_COMMIT_SHA
      --output change-map.json
  artifacts:
    paths: [change-map.json]
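The analyze_changes.py script itself isn't shown here; a minimal sketch of the classification step it might perform, assuming a hand-maintained map of component path prefixes (COMPONENT_PREFIXES and build_change_map are illustrative names, not the real script's API):

```python
# Hypothetical mapping from change-map feature names to the path prefixes
# that flip them on. The real mapping would live in the release repo.
COMPONENT_PREFIXES = {
    "checkout_flow_touched": ("src/checkout/",),
    "auth_layer_touched": ("src/auth/", "src/session/"),
}

def build_change_map(changed_files: list[str]) -> dict:
    """Classify changed file paths into the change-map features
    consumed downstream by the test selector."""
    change_map = {"files_changed": changed_files}
    for feature, prefixes in COMPONENT_PREFIXES.items():
        # 1 if any changed file lives under a prefix for this component.
        change_map[feature] = int(
            any(f.startswith(p) for f in changed_files for p in prefixes)
        )
    return change_map

print(build_change_map(["src/checkout/cart.py", "README.md"]))
```

In CI this would run over `git diff --name-only $CI_COMMIT_BEFORE_SHA..$CI_COMMIT_SHA` and serialize the result to change-map.json.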

select-test-suites:
  stage: select
  script:
    - python3 scripts/ai_test_selector.py
      --change-map change-map.json
      --risk-model models/risk_v3.pkl
      --output selected-suites.json
    # The selector also writes selected.env (SELECTED_WEB_SUITES=...)
    # so the selection is available as variables in later stages.
  artifacts:
    paths: [selected-suites.json]
    reports:
      dotenv: selected.env

trigger-web:
  stage: trigger
  trigger:
    project: your-org/web-automation
    strategy: depend  # downstream status propagates back to this pipeline
  variables:
    SUITES: $SELECTED_WEB_SUITES  # from the dotenv report above
    ENV: $TARGET_ENV
    REGION: $TARGET_REGION
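The aggregate stage isn't shown above; a minimal sketch of what its script could do, assuming each downstream pipeline publishes a per-suite result dict (the schema here is hypothetical):

```python
import json

def aggregate_results(suite_results: list[dict]) -> dict:
    """Merge per-suite results into one summary suitable for
    posting to Slack or a dashboard."""
    total = sum(r["total"] for r in suite_results)
    failed = sum(r["failed"] for r in suite_results)
    return {
        "total": total,
        "failed": failed,
        # Guard against an empty selection (no suites triggered).
        "pass_rate": round(100 * (total - failed) / total, 1) if total else 100.0,
        "failing_suites": [r["suite"] for r in suite_results if r["failed"] > 0],
    }

results = [
    {"suite": "web-smoke", "total": 120, "failed": 0},
    {"suite": "mobile-regression", "total": 80, "failed": 3},
]
print(json.dumps(aggregate_results(results)))  # pass_rate 98.5, one failing suite
```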

AI Test Selection

The selector reads the change map and scores each test suite by predicted failure probability:

import joblib
import numpy as np

class AITestSelector:
    def __init__(self, risk_model_path: str, all_suites: list[str]):
        self.model = joblib.load(risk_model_path)
        self.all_suites = all_suites

    def select(self, change_map: dict, threshold: float = 0.3) -> list[str]:
        features = self._extract_features(change_map)  # shape: (n_suites, n_features)
        scores = self.model.predict_proba(features)[:, 1]

        selected = [
            suite for suite, score in zip(self.all_suites, scores)
            if score >= threshold
        ]

        print(f"Selected {len(selected)}/{len(self.all_suites)} suites "
              f"(risk threshold: {threshold})")
        return selected

    def _extract_features(self, change_map: dict) -> np.ndarray:
        # One feature row per suite: the change-level features are shared
        # across suites, the failure history is per suite.
        return np.array([
            [
                len(change_map['files_changed']),
                change_map['checkout_flow_touched'],
                change_map['auth_layer_touched'],
                change_map['history'][suite]['days_since_last_failure'],
                change_map['history'][suite]['historical_flake_rate'],
            ]
            for suite in self.all_suites
        ])

On average this cuts test execution time by 62% while maintaining 94% defect-detection coverage.
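The post doesn't say what kind of model risk_v3.pkl is; as one plausible sketch, a logistic-regression classifier trained on toy stand-in data with the same five features the selector extracts (real training would use your own CI failure history):

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-in for historical CI data: each row is one (change, suite) pair
# with the selector's five features; the label is 1 if the suite failed.
rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = (X[:, 1] + X[:, 4] > 1.0).astype(int)  # failures correlate with two features

model = LogisticRegression().fit(X, y)
joblib.dump(model, "risk_demo.pkl")  # loaded by the selector via joblib.load

# predict_proba returns (n, 2); column 1 is the failure probability.
probs = model.predict_proba(X[:5])[:, 1]
print(probs.shape)
```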

Parameterized Region/Environment Matrix

Applications serving multiple regions or tenants often have configuration-level differences. Parameterizing the matrix gives full visibility into region-specific failures without cross-contamination:

.test-matrix:
  parallel:
    matrix:
      - REGION: [us-east, us-west, eu-central]
        BROWSER: [chrome, firefox]
        ENV: [staging]

Each combination runs independently. A failure in eu-central on Firefox doesn't mask a passing result in us-east on Chrome.
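To sanity-check the fan-out, GitLab's parallel:matrix expansion rule (the cross product of the variable value lists) can be mirrored in a few lines of Python:

```python
from itertools import product

matrix = {
    "REGION": ["us-east", "us-west", "eu-central"],
    "BROWSER": ["chrome", "firefox"],
    "ENV": ["staging"],
}

# One job per combination of variable values, as parallel:matrix expands it.
jobs = [dict(zip(matrix, combo)) for combo in product(*matrix.values())]
print(len(jobs))  # 3 regions x 2 browsers x 1 env = 6 jobs
```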

Results After 6 Months

Metric                  Before    After
----------------------  --------  --------
Average CI duration     6h 20m    2h 15m
False positive rate     18%       4%
Defect escape rate      3.2%      0.8%
Engineer satisfaction   😐        🙂

The last metric is real; we track it quarterly. Test suites engineers trust get used. Test suites they don't trust get ignored.

Key Takeaways

This pattern scales to any enterprise application with:

  • Multiple deployment targets — regions, tenants, environments
  • Heterogeneous test types — web, mobile, API, performance
  • High commit velocity — where running everything on every PR is unsustainable

The intelligence isn't in running fewer tests blindly — it's in knowing which tests to skip based on what actually changed, and backing that decision with a model trained on your own historical failure data.
