In uncertain environments in which resources fluctuate continuously, animals must continually decide whether to exploit what they currently believe to be their best option, or instead explore potential alternatives in case better opportunities are available. While this trade-off has been extensively studied in pretrained animals facing non-stationary decision-making tasks, it remains unknown how animals progressively tune it while learning the task structure during pretraining. Here, we compared the ability of different computational models to account for long-term changes in the behaviour of 24 rats as they learned to choose a rewarded lever in a three-armed bandit task across 24 days of pretraining. We found that the day-by-day evolution of the rats' performance and win-shift tendency revealed a progressive stabilization of the way they regulated the exploration-exploitation trade-off. We successfully captured these behavioural adaptations using a meta-learning model in which the exploration-exploitation trade-off is controlled by the animal's average reward rate.
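The meta-learning mechanism described above can be illustrated with a minimal sketch. This is not the authors' actual model: it is a generic softmax Q-learning agent on a three-armed bandit whose inverse temperature is tied to a running estimate of the average reward rate, so that a higher reward rate shifts the agent from exploration toward exploitation. All names and parameter values (`run_meta_learning_agent`, `alpha`, `tau`, `beta_scale`) are hypothetical choices for illustration.

```python
import math
import random


def run_meta_learning_agent(n_arms=3, n_trials=500, rewarded_arm=0,
                            alpha=0.1, tau=0.05, beta_scale=10.0, seed=0):
    """Softmax bandit agent whose exploration is meta-learned: the inverse
    temperature grows with the running average reward rate, so the agent
    exploits more as its performance improves (illustrative sketch only)."""
    rng = random.Random(seed)
    q = [0.0] * n_arms      # action values, one per lever
    avg_reward = 0.0        # running average reward rate (meta-variable)
    choices, rewards = [], []
    for _ in range(n_trials):
        # Higher average reward rate -> higher inverse temperature -> exploit.
        beta = beta_scale * avg_reward
        exps = [math.exp(beta * v) for v in q]
        z = sum(exps)
        # Sample an arm from the softmax distribution over action values.
        draw, acc, choice = rng.random(), 0.0, n_arms - 1
        for i, e in enumerate(exps):
            acc += e / z
            if draw < acc:
                choice = i
                break
        reward = 1.0 if choice == rewarded_arm else 0.0
        q[choice] += alpha * (reward - q[choice])    # Q-learning update
        avg_reward += tau * (reward - avg_reward)    # reward-rate tracker
        choices.append(choice)
        rewards.append(reward)
    return choices, rewards, avg_reward
```

Early in training the reward rate is low, the inverse temperature is near zero, and choices are close to uniform (high exploration); as the reward rate stabilizes, choices concentrate on the rewarded arm, mirroring the progressive stabilization of the trade-off reported in the abstract.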