MedQA: Fine-Tuning a Medical AI on AMD ROCm

Q: How does MedQA work without CUDA?

MedQA uses AMD's ROCm stack to fine-tune the Qwen3-1.7B model with LoRA. It needs no CUDA dependency; a few specific environment variables are enough.
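As a rough sketch of what setting those environment variables could look like in Python: the three variable names and values are the ones quoted in the article body, while the helper names `configure_rocm_env` and `describe_accelerator` are hypothetical and not part of the project. The variables typically need to be in place before PyTorch initializes the GPU runtime, so the sketch sets them before importing `torch`.

```python
import os

# Values quoted in the walkthrough; "9.4.2" matches the MI300X (gfx942) target.
ROCM_ENV = {
    "ROCR_VISIBLE_DEVICES": "0",
    "HIP_VISIBLE_DEVICES": "0",
    "HSA_OVERRIDE_GFX_VERSION": "9.4.2",
}


def configure_rocm_env(env=None):
    """Hypothetical helper: write the three variables into an environment mapping.

    Defaults to the current process environment (os.environ).
    """
    target = os.environ if env is None else env
    target.update(ROCM_ENV)
    return target


def describe_accelerator():
    """Report which device PyTorch sees, or a note if PyTorch is missing.

    ROCm builds of PyTorch expose AMD GPUs through the torch.cuda namespace,
    so the same call works on both CUDA and ROCm installations.
    """
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if torch.cuda.is_available():
        return torch.cuda.get_device_name(0)
    return "no GPU visible to PyTorch"


# Set the variables first, then let the training script import torch.
configure_rocm_env()

if __name__ == "__main__":
    print(describe_accelerator())
```

Passing a plain dictionary to `configure_rocm_env` is a convenient way to inspect the values without touching the real environment.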

A complete walkthrough of LoRA fine-tuning Qwen3-1.7B on MedMCQA using an AMD MI300X, built for the AMD Developer Hackathon on lablab.ai.

The Idea

Medical question answering is one of those tasks where the stakes are genuinely high. A model that confidently picks the wrong answer on a clinical MCQ isn't just wrong; it's dangerous. At the same time, most open-source medical AI work assumes you have an NVIDIA GPU. CUDA is the default. Everything else is an afterthought.

This project challenges that assumption. MedQA is a LoRA fine-tuned clinical question-answering model built entirely on AMD hardware using ROCm. It takes a multiple-choice medical question and returns both its chosen answer letter and a clinical explanation of the reasoning. The entire training pipeline, from data loading to adapter export, runs on an AMD Instinct MI300X without a single CUDA dependency.

- 🤗 Model on HuggingFace Hub: HK2184/medqa-qwen3-lora
- 🚀 Live Demo: HuggingFace Spaces
- 💻 GitHub: MedQA-Medical-AI-on-AMD-ROCm

Why AMD ROCm?

The AMD Instinct MI300X is a remarkable piece of hardware: 192 GB of HBM3 memory in a single device. For LLM fine-tuning, VRAM is often the binding constraint: it dictates batch size, sequence length, and whether you need to quantize at all. With 192 GB available, we trained Qwen3-1.7B with LoRA in full fp16 without any 4-bit or 8-bit quantization hacks.

More importantly, the goal was to prove that the HuggingFace ecosystem (Transformers, PEFT, TRL, Accelerate) works seamlessly on ROCm. It does. The same training code that runs on CUDA runs on ROCm with three environment variables set:

    import os

    os.environ["ROCR_VISIBLE_DEVICES"] = "0"
    os.environ["HIP_VISIBLE_DEVICES"] = "0"
    os.environ["HSA_OVERRIDE_GFX_VERSION"] = "9.4.2"

That's it. No code changes. No custom kernels. No CUDA compatibility shims.

The Dataset: MedMCQA

MedMCQA is a large-scale multiple-choice question dataset derived from Indian medical entrance exams (AIIMS and NEET PG, comparable in style to USMLE questions).
Each example contains:

- A clinical question
- Four answer options (A–D)
- The correct answer index
- An optional free-text explanation (exp field)

For this project we used 2,000 training samples, a deliberately small slice chosen to demonstrate that meaningful fine-tuning is achievable quickly. Training took approximately 5 minutes on the MI300X.

Model: Qwen3-1.7B

The base model is Qwen/Qwen3-1.7B, a small-scale language model from Alibaba's Qwen3 family. At 1.7 billion parameters it's compact enough to fine-tune cheaply but capable enough to produce coherent clinical reasoning. It supports trust_remote_code=True and loads cleanly with HuggingFace Transformers.

The Prompt Format

Consistency in prompt formatting is critical for instruction fine-tuning. Every training example and every inference call uses the same template:

    ### Question:
    {question}
    ### Options:
    A) {opa}
    B) {opb}
    C) {opc}
    D) {opd}
    ### Answer:
    {answer_letter}) {answer_text}
    ### Explanation:
    {explanation}

During training the model sees the full sequence, including the answer and explanation. During inference we provide everything up to ### Answer:\n and let the model complete from there.

Training with LoRA

Rather than fine-tuning all ~1.5 billion parameters, we use LoRA (Low-Rank Adaptation) via the PEFT library. LoRA injects small trainable rank-decomposition matrices into the attention layers (here the query and value projections), leaving the base weights frozen.

LoRA Configuration

    from peft import LoraConfig, get_peft_model, TaskType

    lora_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    # trainable params: 2,228,224 || all params: 1,543,901,184 || trainable%: 0.1443

Only ~2.2 million of the model's 1.5 billion parameters are trained. This keeps memory usage low and training fast.
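To make the prompt format described earlier concrete, here is a minimal sketch of a builder that produces both the training and the inference variant. The function name `build_prompt`, the 0-based `answer_index`, and the exact placement of line breaks are assumptions for illustration; the source only fixes the section headers and the fact that the inference prompt ends with `### Answer:\n`.

```python
LETTERS = "ABCD"


def build_prompt(question, options, answer_index=None, explanation=None):
    """Hypothetical helper that fills the article's prompt template.

    With answer_index=None it returns the inference prompt, which stops right
    after the "### Answer:" header so the model completes the rest.
    With an answer_index (assumed 0-based) it returns the full training text.
    """
    if len(options) != len(LETTERS):
        raise ValueError("expected exactly four answer options")

    lines = ["### Question:", question, "### Options:"]
    lines += [f"{letter}) {text}" for letter, text in zip(LETTERS, options)]
    prompt = "\n".join(lines) + "\n### Answer:\n"

    if answer_index is None:
        return prompt

    # Training variant: append the gold answer and the explanation.
    prompt += f"{LETTERS[answer_index]}) {options[answer_index]}\n"
    prompt += f"### Explanation:\n{explanation or ''}"
    return prompt


if __name__ == "__main__":
    # Illustrative placeholder question, not taken from MedMCQA.
    opts = ["Option one", "Option two", "Option three", "Option four"]
    print(build_prompt("Which option is correct?", opts))
    print(build_prompt("Which option is correct?", opts, 2, "Because ..."))
```

Because the training text is built by extending the inference prompt, the two variants cannot drift apart, which is the consistency property the section asks for.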
Training Arguments

    from transformers import TrainingArguments

    args = TrainingArguments(
        output_dir="./outputs",
        num_train_epochs=2,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size = 16
        learning_rate=2e-4,
        fp16=True,
        bf16=False,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        gradient_checkpointing=True,
        optim="adamw_torch",
        warmup_ratio=0.05,
        lr_scheduler_type="cosine",
        report_to="none",
    )

A few things worth noting:

- fp16=True, bf16=False: We use standard fp16. In early experiments with bfloat16 we encountered NaN loss; switching to fp16 resolved it entirely.
- gradient_checkpointing=True: Trades compute for memory. Not strictly necessary on MI300X given the 192 GB VRAM, but good practice for reproducibility on smaller GPUs.
- gradient_accumulation_steps=4: Effective batch size of 16 with a physical batch of 4.
- Cosine LR sch...
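The arithmetic behind these settings can be sketched as follows. This is an approximation under stated assumptions: all 2,000 samples are used for training, a single device is used, and the rounding and the linear-warmup-plus-cosine curve follow the commonly used convention; the Trainer's exact step accounting may differ. The helper names `training_schedule` and `cosine_lr` are hypothetical.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, epochs,
                      warmup_ratio, num_devices=1):
    """Hypothetical helper: derive step counts from the training arguments."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


def cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    plan = training_schedule(2000, 4, 4, 2, 0.05)
    print(plan)  # effective batch 16, 125 steps per epoch, 250 total, 13 warmup
    for step in (0, plan["warmup_steps"], 125, plan["total_steps"]):
        lr = cosine_lr(step, 2e-4, plan["warmup_steps"], plan["total_steps"])
        print(step, lr)
```

Under these assumptions the run has only about 250 optimizer steps in total, which is in line with the short training time reported above.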
