Hi, and congratulations on the Mixtral release!
My question is a simple one. In the recent a16z podcast, you briefly commented on how Mixtral was made:
"You take all of the dense layers of your transformer, and you duplicate them."
I very humbly request confirmation of whether this refers to:
- Sparse Upcycling? (https://arxiv.org/abs/2212.05055)
- Another method of initializing Mixtral-8x7B from Mistral-7B?
- Neither?
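
To make the first option concrete, here is a minimal sketch of what I understand sparse upcycling to mean, based on my reading of the linked paper: every expert starts as an identical copy of the dense feed-forward weights, and only the router is newly initialized. This is purely illustrative and not a claim about how Mixtral was actually built. The function name `upcycle_ffn`, the parameter names `w_in` and `w_out`, the toy sizes, and the 0.02 router init scale are all my own assumptions.

```python
import copy
import random


def upcycle_ffn(dense_ffn, num_experts, d_model, seed=0):
    """Build a hypothetical MoE layer from one dense FFN's weights.

    dense_ffn: dict mapping parameter name -> nested list of floats.
    Each expert is an independent copy of the dense FFN; the router is
    the only new parameter and gets a small random initialization.
    """
    rng = random.Random(seed)
    # Duplicate the dense FFN once per expert (deep copies, so the
    # experts can diverge during further training).
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    # Router: one row of d_model weights per expert (assumed init scale).
    router = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
              for _ in range(num_experts)]
    return {"experts": experts, "router": router}


# Toy dense FFN with d_model=2 and d_ff=3 (values are made up).
dense = {
    "w_in": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "w_out": [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
}
moe = upcycle_ffn(dense, num_experts=8, d_model=2)
```

In this picture the attention and other non-FFN weights would simply be carried over unchanged from the dense checkpoint, which is the part I am hoping you can confirm or correct.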