@mehrsary/better-uc3-xgboost-bagging
flwr new @mehrsary/better-uc3-xgboost-bagging
EU BETTER Health Project
The EU BETTER Health Project is funded by the European Union's Horizon Programme and the UK Research and Innovation Programme. The project aims to improve the health of European citizens by building a decentralized infrastructure that enables healthcare professionals to leverage multi-source health data through tailored AI tools. These tools support secure, cost-effective comparison, integration, and analysis of data across national borders while complying with GDPR privacy requirements.
Project Use Cases
Use Case 1
This use case focuses on the integration of genomic and phenotypic data from paediatric rare diseases to identify pathways associated with intellectual disability.
Use Case 2
This use case investigates innovative AI-based data analysis methods for the diagnosis of inherited retinal dystrophies.
Use Case 3
This use case aims to improve the understanding and prediction of self-harm and suicidal behaviour risks in patients with Autism Spectrum Disorders.
Project website: https://www.better-health-project.eu/
PADME Platform
The EU BETTER Health Project uses PADME as its primary distributed analytics platform.
Flower-Compatible UC3 Federated XGBoost Bagging
This app ports PADME's better-uc3-xgboost-bagging logic to Flower while preserving the same high-level behavior:
- Multi-target training (9 targets)
- Round-0 statistics collection (class/value distributions)
- Round-1..N local warm-start training from latest global model
- Server-side tree-level bagging by appending only newly-added trees from each client
Algorithm Overview
The training procedure is a federated multi-target XGBoost bagging workflow:
- Each client loads its local data. In the Simulation Engine, Flower provides partition-id and num-partitions, and the client builds its partition from the shared data-csv. In the Deployment Engine, each client loads its own file from node-config.data-path.
- Each client preprocesses the predictors and prepares all nine targets independently. Target columns are never used as predictors, so every target is trained without leakage from the other questionnaire outcomes.
- Statistics phase: clients do not train models yet; they only send target-wise summary information to the server. For classification targets this includes class counts; for regression/ordinal targets it includes value statistics such as mean, standard deviation, minimum, and maximum.
- The server aggregates these statistics into global target-specific weights. For the binary classification target, the global class distribution is later used to derive scale_pos_weight. For regression-style targets, the aggregated statistics are tracked as global summaries.
- In each training round, every client trains one XGBoost model per target. If a global model for a target already exists, the client warm-starts from that model and adds num-boost-round new trees locally. If no global model exists yet, training starts from scratch.
- Clients return the updated per-target models and local metrics. The local metrics include train/validation scores and tree counts for each target.
- The server performs bagging by target, appending the newly added trees from each client onto the current global model for that target.
- After every round, the server writes round-level evaluation artifacts. These files include client metrics and weighted aggregate metrics per target.
- At the end of training, the server saves the final global model for each target and the aggregated global weights.
This means the app trains nine separate federated XGBoost models in parallel, one for each UC3 outcome, while sharing the same predictor preprocessing and round structure.
Project layout
better-uc3-xgboost-bagging/
├── pyproject.toml
├── README.md
├── LICENSE
├── data/
│   └── synthetic_output.tsv
└── better_uc3_xgboost_bagging/
    ├── __init__.py
    ├── client_app.py
    ├── server_app.py
    └── task.py
Install
Install Flower:
pip install flwr
Fetch the app:
flwr new @mehrsary/better-uc3-xgboost-bagging
Change into the app directory and install its dependencies:
cd better-uc3-xgboost-bagging
pip install -e .
Run in simulation mode
The default config runs with 10 virtual SuperNodes and 4 rounds. In simulation mode, pass an absolute data-csv path.
Run:
flwr run . --stream --run-config 'data-csv="/absolute/path/to/better-uc3-xgboost-bagging/data/synthetic_output.tsv" output-dir="/absolute/path/to/better-uc3-xgboost-bagging/final_uc3_models"'
Change the number of virtual SuperNodes:
flwr run . --stream --federation-config 'num-supernodes=3' --run-config 'data-csv="/absolute/path/to/better-uc3-xgboost-bagging/data/synthetic_output.tsv" output-dir="/absolute/path/to/better-uc3-xgboost-bagging/final_uc3_models"'
Override config at runtime:
flwr run . --stream --run-config 'data-csv="/absolute/path/to/better-uc3-xgboost-bagging/data/synthetic_output.tsv" output-dir="/absolute/path/to/better-uc3-xgboost-bagging/final_uc3_models" num-server-rounds=5 num-boost-round=20'
Input Data
Synthetic test data can be downloaded from here: https://daten.ukk-cloud.de/public/download-shares/B2M7lv0dIqYh0wgGJimLy7DciKwoaBdG
Dataset summary (UC3 synthetic):
- Approximately 1,000 participants
- 85 raw columns
- Mixed data types: numeric, categorical, bimodal, and dates
- Domain: mental health, suicidality, and self-harm risk modeling
- Missingness may appear in predictors; targets are complete
Feature groups
The predictor space combines:
- Background and demographic information
- Perinatal and developmental variables
- Clinical and psychological scale scores
To prevent leakage, the training features exclude:
- ID_STUDY, RECORD_ID
- All SBQR_* target columns
- All SBQASC_* target columns
- All SHQ_* target columns
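The leakage rule above amounts to a simple column filter: drop the two identifiers and every column carrying a target prefix. A sketch under those assumptions; the helper name is hypothetical and it operates on a plain list of column names rather than a DataFrame.

```python
ID_COLUMNS = {"ID_STUDY", "RECORD_ID"}
TARGET_PREFIXES = ("SBQR_", "SBQASC_", "SHQ_")

def predictor_columns(all_columns):
    """Keep only columns that are neither identifiers nor target columns."""
    return [c for c in all_columns
            if c not in ID_COLUMNS and not c.startswith(TARGET_PREFIXES)]

cols = ["ID_STUDY", "RECORD_ID", "AGE", "SBQR_1", "SBQR_T", "SHQ_2", "IQ_SCORE"]
print(predictor_columns(cols))  # ['AGE', 'IQ_SCORE']
```

Filtering by prefix rather than by an explicit list means newly added questionnaire items with the same prefixes stay excluded automatically.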
Targets (trained independently)
- SBQR_1: lifetime ideation/attempt (ordinal, encoded)
- SBQR_2: past-year ideation frequency (ordinal)
- SBQR_3: disclosure of suicidal intent (ordinal, encoded)
- SBQR_T: SBQR total score (regression)
- SBQASC_6: suicide attempt history (binary classification)
- SBQASC_T: SBQASC total score (regression)
- SHQ_1: non-suicidal self-harm ideation (ordinal)
- SHQ_2: suicidal ideation (ordinal)
- SHQ_3: self-harm behavior (ordinal)
Output
Final aggregated per-target models are written to:
- final_uc3_models/<TARGET>/model.json
- final_uc3_models/global_weights.json
Round-by-round metrics are also written per target (mainly for observability and debugging):
- final_uc3_models/<TARGET>/evaluation/round_<N>_metrics.json
Each round_<N>_metrics.json contains:
- target and round metadata
- client-level metrics (train and validation) and tree counts
- weighted aggregate metrics across clients
Notes
- In simulation mode, clients are created by partitioning data-csv horizontally.
- In deployment mode, set node-config="data-path=/absolute/path/to/client_data.tsv" for each SuperNode. The TSV must contain the same columns as the bundled data/synthetic_output.tsv.
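The two loading modes in the notes can be sketched as a single branch on the client's configuration. The function below is a hypothetical helper, not the app's code; it only reuses the key names that appear in this README (data-path, data-csv, partition-id, num-partitions) and works on plain dicts, so it does not call any Flower API.

```python
def resolve_data_source(node_config, run_config):
    """Decide where a client's data comes from (illustrative sketch).

    Deployment Engine: each SuperNode sets data-path in its node-config.
    Simulation Engine: Flower injects partition-id / num-partitions, and
    the shared file comes from the run-config key data-csv.
    """
    if "data-path" in node_config:
        return {"mode": "deployment", "path": node_config["data-path"]}
    return {
        "mode": "simulation",
        "path": run_config["data-csv"],
        "partition_id": int(node_config["partition-id"]),
        "num_partitions": int(node_config["num-partitions"]),
    }
```

Keeping this decision in one place means the rest of the client code can stay identical across simulation and deployment runs.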