#Initialization of Mixtral-8x7B

frosty topaz

Hi, and congratulations on the Mixtral release!

My question is a simple one. In the recent a16z podcast, you briefly commented on how Mixtral was made:

"You take all of the dense layers of your transformer, and you duplicate them."

I very humbly request a confirmation on whether this refers to:

vague forum

I'd assume so! To paraphrase the paper: "all parameters, and optionally their optimizer state, are copied from the original checkpoint, except those corresponding to the MoE router, which does not exist in the original architecture."
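
For anyone who wants to see what that quoted recipe might look like in practice, here is a minimal sketch in plain Python. It is not Mixtral's actual initialization code, and nothing in this thread confirms Mixtral was built this way. The function name `upcycle_dense_checkpoint`, the parameter-name layout (`layerN.mlp.*`, `layerN.experts.E.*`, `layerN.router.weight`), and the small Gaussian router init are all assumptions made for illustration.

```python
import copy
import random


def upcycle_dense_checkpoint(dense_params, num_experts, hidden_dim,
                             router_std=0.02, seed=0):
    """Build MoE parameters from a dense checkpoint (hypothetical layout).

    dense_params maps names such as "layer0.mlp.w_in" to nested lists.
    Any name containing ".mlp." is treated as a dense feed-forward weight.
    """
    rng = random.Random(seed)
    moe_params = {}
    layers_with_mlp = set()

    for name, value in dense_params.items():
        if ".mlp." in name:
            # Duplicate the dense feed-forward weights into every expert.
            layer, tail = name.split(".mlp.", 1)
            layers_with_mlp.add(layer)
            for e in range(num_experts):
                moe_params[f"{layer}.experts.{e}.{tail}"] = copy.deepcopy(value)
        else:
            # Everything else (attention, norms, embeddings) is copied as-is.
            moe_params[name] = copy.deepcopy(value)

    # The router has no counterpart in the dense model, so it is created
    # from scratch. The Gaussian init here is an assumption, not a known choice.
    for layer in sorted(layers_with_mlp):
        moe_params[f"{layer}.router.weight"] = [
            [rng.gauss(0.0, router_std) for _ in range(hidden_dim)]
            for _ in range(num_experts)
        ]
    return moe_params
```

Right after this step every expert in a layer holds identical weights, so the experts only start to differ once training continues and the router sends them different tokens.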

frosty topaz