@mehrsary/better-uc3-xgboost-bagging
flwr new @mehrsary/better-uc3-xgboost-bagging
EU BETTER Health Project
The EU BETTER Health Project is funded by the European Union's Horizon Programme and the UK Research and Innovation Programme. The project aims to improve the health of European citizens by building a decentralized infrastructure that enables healthcare professionals to leverage multi-source health data through tailored AI tools. These tools support secure, cost-effective comparison, integration, and analysis of data across national borders while complying with GDPR privacy requirements.
Project Use Cases
Use Case 1
This use case focuses on the integration of genomic and phenotypic data from paediatric rare diseases to identify pathways associated with intellectual disability.
Use Case 2
This use case investigates innovative AI-based data analysis methods for the diagnosis of inherited retinal dystrophies.
Use Case 3
This use case aims to improve the understanding and prediction of self-harm and suicidal behaviour risks in patients with Autism Spectrum Disorders.
Project website: https://www.better-health-project.eu/
PADME Platform
The EU BETTER Health Project uses PADME as its primary distributed analytics platform.
Flower-Compatible UC3 Federated XGBoost Bagging
This app ports PADME's better-uc3-xgboost-bagging logic to Flower while preserving the same high-level behavior:
- Multi-target training (9 targets)
- Round-0 statistics collection (class/value distributions)
- Round-1..N local warm-start training from latest global model
- Server-side tree-level bagging by appending only newly-added trees from each client
Algorithm Overview
The training procedure is a federated multi-target XGBoost bagging workflow:
- Each client loads its local data. In the Simulation Engine, Flower provides partition-id and num-partitions, and the client builds its partition from the shared data-csv. In the Deployment Engine, each client loads its own file from node-config.data-path.
- Each client preprocesses the predictors and prepares all nine targets independently. Target columns are never used as predictors, so every target is trained without leakage from the other questionnaire outcomes.
- Statistics phase: clients do not train models yet; they only send target-wise summary information to the server. For classification targets this includes class counts; for regression/ordinal targets it includes value statistics such as mean, standard deviation, minimum, and maximum.
- The server aggregates these statistics into global target-specific weights. For the binary classification target, the global class distribution is later used to derive scale_pos_weight. For regression-style targets, the aggregated statistics are tracked as global summaries.
- In each training round, every client trains one XGBoost model per target. If a global model for a target already exists, the client warm-starts from that model and adds num-boost-round new trees locally. If no global model exists yet, training starts from scratch.
- Clients return the updated per-target models and local metrics. The local metrics include train/validation scores and tree counts for each target.
- The server performs bagging by target, appending the newly added trees from each client onto the current global model for that target.
- After every round, the server writes round-level evaluation artifacts. These files include client metrics and weighted aggregate metrics per target.
- At the end of training, the server saves the final global model for each target and the aggregated global weights.
This means the app trains nine separate federated XGBoost models in parallel, one for each UC3 outcome, while sharing the same predictor preprocessing and round structure.
Project layout
better-uc3-xgboost-bagging/
├── pyproject.toml
├── README.md
├── LICENSE
├── data/
│   └── synthetic_output.tsv
└── better_uc3_xgboost_bagging/
    ├── __init__.py
    ├── client_app.py
    ├── server_app.py
    └── task.py
Install
Install Flower:
pip install flwr
Fetch the app:
flwr new @mehrsary/better-uc3-xgboost-bagging
Change into the app directory and install its dependencies:
cd better-uc3-xgboost-bagging
pip install -e .
Run in simulation mode
The default config runs with 10 virtual SuperNodes and 4 rounds. In simulation mode, pass an absolute data-csv path.
Run:
flwr run . --stream --run-config 'data-csv="/absolute/path/to/better-uc3-xgboost-bagging/data/synthetic_output.tsv" output-dir="/absolute/path/to/better-uc3-xgboost-bagging/final_uc3_models"'
Change the number of virtual SuperNodes:
flwr run . --stream --federation-config 'num-supernodes=3' --run-config 'data-csv="/absolute/path/to/better-uc3-xgboost-bagging/data/synthetic_output.tsv" output-dir="/absolute/path/to/better-uc3-xgboost-bagging/final_uc3_models"'
Override config at runtime:
flwr run . --stream --run-config 'data-csv="/absolute/path/to/better-uc3-xgboost-bagging/data/synthetic_output.tsv" output-dir="/absolute/path/to/better-uc3-xgboost-bagging/final_uc3_models" num-server-rounds=5 num-boost-round=20'
Input Data
Synthetic test data can be downloaded from here: https://daten.ukk-cloud.de/public/download-shares/B2M7lv0dIqYh0wgGJimLy7DciKwoaBdG
Dataset summary (UC3 synthetic):
- Approximately 1,000 participants
- 85 raw columns
- Mixed data types: numeric, categorical, bimodal, and dates
- Domain: mental health, suicidality, and self-harm risk modeling
- Missingness may appear in predictors; targets are complete
Feature groups
The predictor space combines:
- Background and demographic information
- Perinatal and developmental variables
- Clinical and psychological scale scores
To prevent leakage, the training features exclude:
- ID_STUDY, RECORD_ID
- All SBQR_* target columns
- All SBQASC_* target columns
- All SHQ_* target columns
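The leakage rule above amounts to a simple column filter: drop the two identifiers and every column carrying a target prefix. A sketch under those assumptions; the helper name is hypothetical and it operates on a plain list of column names rather than a DataFrame.

```python
ID_COLUMNS = {"ID_STUDY", "RECORD_ID"}
TARGET_PREFIXES = ("SBQR_", "SBQASC_", "SHQ_")

def predictor_columns(all_columns):
    """Keep only columns that are neither identifiers nor target columns."""
    return [c for c in all_columns
            if c not in ID_COLUMNS and not c.startswith(TARGET_PREFIXES)]

cols = ["ID_STUDY", "RECORD_ID", "AGE", "SBQR_1", "SBQR_T", "SHQ_2", "IQ_SCORE"]
print(predictor_columns(cols))  # ['AGE', 'IQ_SCORE']
```

Filtering by prefix rather than by an explicit list means newly added questionnaire items with the same prefixes stay excluded automatically.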
Targets (trained independently)
- SBQR_1: lifetime ideation/attempt (ordinal, encoded)
- SBQR_2: past-year ideation frequency (ordinal)
- SBQR_3: disclosure of suicidal intent (ordinal, encoded)
- SBQR_T: SBQR total score (regression)
- SBQASC_6: suicide attempt history (binary classification)
- SBQASC_T: SBQASC total score (regression)
- SHQ_1: non-suicidal self-harm ideation (ordinal)
- SHQ_2: suicidal ideation (ordinal)
- SHQ_3: self-harm behavior (ordinal)
Output
Final aggregated per-target models are written to:
- final_uc3_models/<TARGET>/model.json
- final_uc3_models/global_weights.json
Round-by-round metrics are also written per target (mainly for observability and debugging):
- final_uc3_models/<TARGET>/evaluation/round_<N>_metrics.json
Each round_<N>_metrics.json contains:
- target and round metadata
- client-level metrics (train and validation) and tree counts
- weighted aggregate metrics across clients
Notes
- In simulation mode, clients are created by partitioning data-csv horizontally.
- In deployment mode, set node-config="data-path=/absolute/path/to/client_data.tsv" for each SuperNode. The TSV must contain the same columns as the bundled data/synthetic_output.tsv.
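The two loading modes in the notes can be sketched as a single branch on the client's configuration. The function below is a hypothetical helper, not the app's code; it only reuses the key names that appear in this README (data-path, data-csv, partition-id, num-partitions) and works on plain dicts, so it does not call any Flower API.

```python
def resolve_data_source(node_config, run_config):
    """Decide where a client's data comes from (illustrative sketch).

    Deployment Engine: each SuperNode sets data-path in its node-config.
    Simulation Engine: Flower injects partition-id / num-partitions, and
    the shared file comes from the run-config key data-csv.
    """
    if "data-path" in node_config:
        return {"mode": "deployment", "path": node_config["data-path"]}
    return {
        "mode": "simulation",
        "path": run_config["data-csv"],
        "partition_id": int(node_config["partition-id"]),
        "num_partitions": int(node_config["num-partitions"]),
    }
```

Keeping this decision in one place means the rest of the client code can stay identical across simulation and deployment runs.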