This is because there is no sensible way to build a labeled dataset for SL: the good allocation for each situation is not known in advance. Instead, the system has to try allocation combinations under different environmental conditions and, as it does so, record whether each combination turned out to be good or bad. This is the basic idea behind the RL problem.
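The trial-and-error loop described above can be sketched in a few lines. This is only a minimal illustration, assuming a small discrete set of conditions and allocations and a simple epsilon-greedy rule. The names `learn_allocations` and `toy_feedback`, and the "low_load"/"high_load" conditions, are hypothetical and not taken from the system being discussed.

```python
import random

def learn_allocations(conditions, allocations, feedback, steps=2000, epsilon=0.1, seed=0):
    """Learn a good allocation per condition purely from trial-and-error feedback."""
    rng = random.Random(seed)
    # Running average feedback and trial count for every (condition, allocation) pair.
    value = {(c, a): 0.0 for c in conditions for a in allocations}
    count = {(c, a): 0 for c in conditions for a in allocations}

    for _ in range(steps):
        c = rng.choice(conditions)  # the environment presents some condition
        if rng.random() < epsilon:
            a = rng.choice(allocations)  # explore: try an arbitrary allocation
        else:
            a = max(allocations, key=lambda x: value[(c, x)])  # exploit the best so far
        r = feedback(c, a)  # good (1.0) or bad (0.0), observed only after trying
        count[(c, a)] += 1
        value[(c, a)] += (r - value[(c, a)]) / count[(c, a)]  # incremental mean

    # For each condition, report the allocation with the highest average feedback.
    return {c: max(allocations, key=lambda x: value[(c, x)]) for c in conditions}

# Hypothetical environment: which allocation is "good" depends on the condition.
def toy_feedback(condition, allocation):
    good = {"low_load": "small", "high_load": "large"}
    return 1.0 if good[condition] == allocation else 0.0

best = learn_allocations(["low_load", "high_load"], ["small", "large"], toy_feedback)
print(best)
```

Note that no labeled examples are supplied anywhere: the learner only sees the good/bad signal for the combinations it actually tried, which is what separates this setting from SL. A real system would likely need a richer state representation and a proper RL algorithm rather than this lookup table.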