The method
The method is particularly well suited to problems where the system and the goal can be described by an environment (e.g. a labyrinth), an observer (a sensor), an agent (a control system), and a reward system modelling an objective (e.g. getting out of the labyrinth in the shortest possible time). Through randomized trial and error, the agent explores the different options, receives feedback in the form of rewards, and then alters its behavior to maximize the expected future reward. Reinforcement learning has become increasingly popular, as exemplified by the game-playing AlphaZero and by the training of conversational abilities in e.g. ChatGPT.
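The labyrinth example can be made concrete with a small sketch. The code below uses tabular Q-learning, one common reinforcement learning algorithm, chosen here purely for illustration. The maze layout, the reward of -1 per step, and all names and parameter values (`MAZE`, `step`, `train`, `greedy_path`, `alpha`, `gamma`, `epsilon`) are assumptions made for this example and do not describe a specific SINTEF implementation.

```python
import random

# Hypothetical 4x4 labyrinth: 'S' start, 'G' exit, '#' wall, '.' free cell.
MAZE = [
    "S.#.",
    ".##.",
    "....",
    "#.#G",
]
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
START, GOAL = (0, 0), (3, 3)


def step(state, action):
    """Environment: move if the target cell is free, otherwise stay put."""
    r, c = state[0] + action[0], state[1] + action[1]
    if 0 <= r < 4 and 0 <= c < 4 and MAZE[r][c] != "#":
        state = (r, c)
    done = state == GOAL
    reward = 0.0 if done else -1.0  # every extra step costs 1, so short paths pay off
    return state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.95, epsilon=0.2, seed=0):
    """Agent: learn action values by randomized trial and error."""
    rng = random.Random(seed)
    q = {(r, c): [0.0] * 4 for r in range(4) for c in range(4)}
    for _ in range(episodes):
        state = START
        for _ in range(200):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.randrange(4)  # explore: random action
            else:
                a = max(range(4), key=lambda i: q[state][i])  # exploit
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning update: move the estimate towards
            # reward + discounted best value of the next state.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
            if done:
                break
    return q


def greedy_path(q, max_steps=50):
    """Follow the learned values without exploration."""
    state, path = START, [START]
    for _ in range(max_steps):
        a = max(range(4), key=lambda i: q[state][i])
        state, _, done = step(state, ACTIONS[a])
        path.append(state)
        if done:
            break
    return path


if __name__ == "__main__":
    print(greedy_path(train()))
```

In this sketch the observer is trivial: the agent sees its exact grid position. `alpha` (learning rate), `gamma` (discount factor) and `epsilon` (exploration rate) are examples of the hyperparameters that typically need tuning; with the values assumed here the learned path should reach the exit in six moves, which is the shortest route in this maze.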
What we do at SINTEF
Reinforcement learning has many hyperparameters to tune and can be tricky to set up. However, we have applied it successfully in e.g. hydropower planning, path optimization, drone stabilization and wind turbine control.