• jarfil@beehaw.org
    10 months ago

    It basically learns shifting the output of each Transformer layer

    That would increase inference time, which is something they explicitly avoid.

    See section 4.1 of the paper. W is the weight matrix of a single layer, and training looks for a ∆W that fine-tunes the result. LoRA’s optimization is to express ∆W as a product BA of two low-rank matrices, but what gets modified is still the layer’s weight matrix, not its output:

    W0 + ∆W = W0 + BA

    A bit later:

    When deployed in production, we can explicitly compute and store W = W0 + BA and perform inference as usual

    Here W0 is the layer’s original weight matrix, and W is the modified weight matrix that actually gets “executed”.
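
    To make the merge concrete, here is a minimal sketch in plain Python. The sizes d, k, r, the random values and the helper functions are made up for illustration and are not taken from the paper's code. It only checks that keeping BA as a separate branch and folding it into the weights give the same layer output, which is why the merged form adds no inference cost:

```python
import random

def matmul(X, Y):
    # Plain nested-list matrix product: (n x m) @ (m x p) -> (n x p).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    # Element-wise sum of two matrices with the same shape.
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, r = 8, 6, 2              # hypothetical sizes; r is the low rank
W0 = rand_matrix(d, k, rng)    # stand-in for the frozen pretrained weights
B = rand_matrix(d, r, rng)     # stand-in for the trained factor B (d x r)
A = rand_matrix(r, k, rng)     # stand-in for the trained factor A (r x k)
x = rand_matrix(k, 1, rng)     # one input column vector

# Unmerged view: W0 is untouched, the update runs as an extra branch.
h_unmerged = matadd(matmul(W0, x), matmul(B, matmul(A, x)))

# Merged view: compute W = W0 + BA once, then a single matmul per input.
W = matadd(W0, matmul(B, A))
h_merged = matmul(W, x)

max_diff = max(abs(u[0] - m[0]) for u, m in zip(h_unmerged, h_merged))
lora_params = d * r + r * k    # values trained for BA
full_params = d * k            # values in a full-rank ∆W
print(max_diff < 1e-9, lora_params, full_params)
```

    The two outputs agree up to floating-point rounding, and the merged W has the same shape as W0, so the deployed layer is an ordinary layer with different weights.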

    the original Transformer stays intact

    At training time, yes. At inference time, no.

    before you know it you have hundreds and your output quality has degraded irrecoverably.

    This is correct, though not because you’ve messed with the output of each layer, but because you’ve messed with the weights of each layer. I’d guess messing with the outputs would cause even quicker degradation.