It basically learns shifting the output of each Transformer layer
That would increase inference time, which is something they explicitly avoid.
Check section 4.1 of the paper. W is the weight matrix of a single layer, and training focuses on finding a ∆W such that the result is fine-tuned. LoRA's optimization is to compute ∆W as a product BA of two lower-rank matrices, but W is still the layer's weight matrix, not its output:
W0 + ∆W = W0 + BA
A bit later:
When deployed in production, we can explicitly compute and store W = W0 + BA and perform inference as usual
Here W0 is the layer's original weight matrix, and W is the modified weight matrix that actually gets "executed".
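To make that concrete, here's a rough NumPy sketch (not the paper's code). The sizes d, k, r and the random values are made up for illustration, and B and A are just stand-ins for trained factors. It shows that keeping W0 frozen with BA as a side path gives the same layer output as folding BA into a single weight matrix, so the change lives in the weights and no extra work is left at inference time:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2  # hypothetical layer sizes; r is the low rank

W0 = rng.normal(size=(d, k))  # frozen pretrained weight matrix
B = rng.normal(size=(d, r))   # stand-in for the trained factor B
A = rng.normal(size=(r, k))   # stand-in for the trained factor A
x = rng.normal(size=(k,))     # one input vector to the layer

# Training-time view: W0 is untouched, the update is a separate low-rank path.
h_side = W0 @ x + B @ (A @ x)

# Deployment view: fold the update into a single matrix of the same shape.
W = W0 + B @ A
h_merged = W @ x

same_output = np.allclose(h_side, h_merged)
delta_rank = np.linalg.matrix_rank(B @ A)  # can never exceed r
print(same_output, delta_rank)
```

After merging, W has the same shape as W0, so the layer costs exactly what it did before fine-tuning.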
the original Transformer stays intact
At training time, yes. At inference time, no.
before you know it you have hundreds and your output quality has degraded irrecoverably.
This is correct. It's just not because you've messed with the output of each layer, but because you've messed with the weights of each layer. I'd guess messing with the outputs would cause even quicker degradation.