Mixtral of Experts (Paper Explained) - Yannic Kilcher

Yannic Kilcher Summary

image_1712367638430

Table of Contents

Introduction to the Mixtral 8x7b Mixture of Experts Model | 0:00:00 - 0:01:40

Today we're looking at the Mixtral 8x7b Mixture of Experts model, an extension of the Mistral 7b architecture by Mistral AI. The paper, nicknamed 'Don't Say Data,' stands out for its lack of information on the training data source. The model has garnered attention through blog posts and word-art visuals, harking back to a nostalgic era. Despite its unconventional approach, the Mixtral of Experts model has generated considerable interest in the AI community.

I believe that saying nothing about the data is a smart choice, given the current trend among professional complainers of criticizing the sources of training data and filing lawsuits over copyright issues. Previously, complaints focused on model biases and crowd workers.

Importance of Transparency in Research Papers | 0:01:40 - 0:03:00

Mistral doesn't say much, which might be wise, but it's a bit strange in a research paper that's supposed to inform the research community about the process and methods used. If you're unaware, Mistral.ai is a new startup, presumably out of France. So far, its approach is the most open-source of all the AI startups, surpassing even Stability AI, which is known for being very open-source. Mistral models are released under the Apache license, which gives users a free hand. However, as I pointed out earlier, they don't disclose the source of the data they use. In contrast, Stability AI does give some information about the data source and how they acquired it, despite tying their models to more restrictive licenses.

This paper doesn't hold many surprises. It states that the model is a transformer with a mixture of experts architecture, providing an opportunity to understand its meaning. I've made videos about the mixture of experts and expert routing in the past, but it's worth delving into the topic again here.

image_1712367689297

Introduction to Mixtral 8x7b | 0:03:00 - 0:07:20

Mixtral 8x7b is a sparse mixture of experts model with open weights, released under the Apache 2.0 license. It has shown superior performance compared to Llama 2 70b and GPT-3.5 on various benchmarks.

image_1712369078466

The total parameter count of the model is smaller than that of models like GPT-3.5 and GPT-3, although exact numbers are not given. The model uses a mixture of experts with expert routing, which means only a subset of the parameters is used for each token. This results in a much lower active parameter count per token, which makes its performance on the various benchmarks all the more notable.

The model allows for optimizations to achieve faster inference speed at low batch sizes and higher throughput at large batch sizes. It is a decoder-only model with feedforward blocks that select from eight distinct groups of parameters. This selection process will be further explored to understand its significance.

There's a context size, or token window, of 32,000 tokens, which is comparable to other large transformer-based language models and provides substantial context. Our first indication of where the training data originates is that the model is pre-trained with multilingual data. Of course, that could technically mean almost anything, for example a pile of Shakespeare with a few German phrases thrown in; the specifics aren't divulged.

So, what is a mixture of experts model? When we talk about transformers, we usually discuss the attention mechanism, which has traditionally been the central component of these models. Essentially, a transformer turns input tokens into vectors using an embedding layer. At the top, there is an output layer that acts like a reversed embedding layer (an embedding to the power of negative one), which converts the output vectors back into tokens; when predicting the next token, the very last vector is mapped onto one of the roughly 32,000 vocabulary tokens.

In between, there are transformer blocks, repeated n times. These blocks have evolved somewhat over the years, but they typically consist of two primary layers: an attention layer followed by a feedforward network. When discussing transformers, the attention layer is usually the centre of the conversation.
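
To make that structure concrete, here is a minimal sketch of a decoder-only transformer as just described: an embedding layer, n repeated blocks of attention plus feedforward, and an output head acting as the reversed embedding. This is an illustrative toy (names, sizes, and the omitted causal mask are my simplifications), not Mistral's actual code.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: attention layer followed by a feedforward layer."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # tokens exchange information
        x = x + self.ff(self.norm2(x))                     # applied to each token on its own
        return x

class TinyDecoder(nn.Module):
    def __init__(self, vocab=32000, d_model=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)                # tokens -> vectors
        self.blocks = nn.ModuleList([Block(d_model) for _ in range(n_layers)])
        self.lm_head = nn.Linear(d_model, vocab, bias=False)     # the "reversed" embedding

    def forward(self, tokens):                  # tokens: (batch, seq) of token ids
        x = self.embed(tokens)
        for block in self.blocks:               # causal masking omitted for brevity
            x = block(x)
        return self.lm_head(x)                  # logits over the ~32,000-token vocabulary

model = TinyDecoder()
logits = model(torch.randint(0, 32000, (1, 16)))    # one sequence of 16 tokens
print(logits.shape)                                 # torch.Size([1, 16, 32000])
```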

Attention Layer in Neural Network | 0:07:20 - 0:12:20

In the attention layer, information is passed between the tokens of a sequence to transform the input signal into the next signal. The attention mechanism lets the token vectors share information, which makes the computation context-dependent. The next layer is the feedforward network, which is applied to each token individually: each token's representation goes through the feedforward network independently and comes out as a transformed vector. These feedforward networks contain a large number of parameters, because each token vector is multiplied by large weight matrices.
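
To see why these feedforward layers are so parameter-heavy and why they treat each token independently, here is a small sketch with hypothetical sizes chosen purely for illustration:

```python
import torch
import torch.nn as nn

d_model, d_ff = 4096, 14336   # hypothetical sizes for illustration
ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))

# Two big weight matrices dominate the parameter count of the block.
params = sum(p.numel() for p in ff.parameters())
print(f"{params / 1e6:.0f}M parameters in a single feedforward block")  # ~117M here

x = torch.randn(1, 10, d_model)   # one sequence of ten token vectors
y = ff(x)                         # the same weights transform every position

# Pushing token 3 through on its own gives the same result as inside the sequence,
# because the feedforward layer never mixes information across tokens.
print(torch.allclose(y[0, 3], ff(x[0, 3]), atol=1e-4))  # True
```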

Sparse Mixture of Experts Model and Routing Neural Network | 0:12:20 - 0:21:20

In a sparse mixture of experts model, tokens are not sent to all experts but only to a subset, possibly just one expert. A routing neural network, denoted as G, determines which expert each token should be routed to. The routing network selects two experts for stability reasons. The goal is to classify each token and route it to the appropriate expert based on its intermediate vector representation. This process resembles a classic classification problem where a feature vector is passed through a function to determine the class label or output. The routing network helps in deciding the destination expert for each token based on the logits produced by the classification function.

image_1712368994877

After applying a softmax to the logits, you obtain a distribution, or weighting, over the experts. This weighting is used to route the input to the different experts and to gather their outputs. Every token goes through the same procedure, but with its own weighting, so the routing and the computation path differ from signal to signal. The whole process consists of the routing network, the expert networks, and a weighted sum of the expert outputs based on the routing weights. Because the routing function produces a sparse output, computation is focused on the relevant experts only, which lowers the parameter count used per token and assigns different experts to different tokens. There is no entropy regularization to ensure that tokens spread out over different experts, although such regularizers may have been used in previous papers. Overall, the process optimizes computation and expert usage without the need for additional constraints.
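
As a tiny numeric sketch of that routing step (the logit values below are made up), consider a single token and eight experts:

```python
import torch

# made-up router logits for one token, one entry per expert
logits = torch.tensor([1.2, -0.3, 0.7, 2.5, 0.0, -1.1, 0.4, 0.9])

top_vals, top_idx = logits.topk(2)          # keep only the two highest-scoring experts
weights = torch.softmax(top_vals, dim=-1)   # normalize over just those two entries

gate = torch.zeros_like(logits)             # sparse gating vector: zeros for unused experts
gate[top_idx] = weights
print(top_idx.tolist())                     # [3, 0] -> experts 3 and 0 handle this token
print(gate)                                 # non-zero weights only at positions 3 and 0
```

The six experts that end up with a weight of zero are simply never evaluated for this token, which is where the compute saving comes from.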

The other thing here: they say that E_i denotes the output of the i-th expert, where i runs over the experts, so there are n experts. And then, they write, G(x)_i denotes 'the n-dimensional output of the gating network for the i-th expert.'

image_1712369385479

This, I believe, is a mistake in the paper (maybe not, but think about it): if each expert outputs a vector, and you ultimately sum these vectors and want another vector back, then the gating factors must be scalars (or at most matrices); they cannot themselves be n-dimensional.

The output of the gating network for the i-th expert cannot be n-dimensional. What I believe they meant to say is that G(x) has an n-dimensional output and G(x)_i is the i-th entry of that output. This n-dimensional output is exactly the classification layer described before: you keep only the top-k entries (the top two in their case), set everything else to zero, and then normalize with a softmax.
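
Written out, the formulation they presumably intended (my reconstruction, consistent with the top-k-then-softmax description above) is:

```latex
y = \sum_{i=1}^{n} G(x)_i \, E_i(x),
\qquad
G(x) = \operatorname{Softmax}\big(\operatorname{TopK}(x \cdot W_g)\big),
\qquad
\operatorname{TopK}(\ell)_i =
\begin{cases}
\ell_i & \text{if } \ell_i \text{ is among the top-}k \text{ entries of } \ell,\\
-\infty & \text{otherwise,}
\end{cases}
```

so that G(x) is a single n-dimensional vector whose i-th entry G(x)_i is a scalar weight, and setting the non-top-k logits to negative infinity makes those weights exactly zero after the softmax.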

So the routing network itself is just a linear feedforward layer, and from the outside the whole thing is still one function, call it ff, that we apply to each token individually. Every single token goes through the same function, and that function is always the same function; that hasn't changed. What has changed is that internally, inside of this function, an input x might take a different path than an input y and activate different parameters. But it's still the case that each token is pushed separately through that feedforward stage; it's just that inside of it there are some sparse elements, and depending on the token we put in, it gets routed differently.
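
Putting the pieces together, here is a minimal sketch of such a sparse mixture-of-experts feedforward layer (my own simplified code with hypothetical sizes, not Mistral's implementation): every token passes through the same module, but inside it the router decides which two of the eight expert MLPs actually run for that token.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)   # the gating network G
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                          # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.shape[-1])        # every token is handled on its own
        logits = self.router(tokens)               # (num_tokens, n_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        weights = torch.softmax(top_vals, dim=-1)  # renormalize over the chosen experts

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):             # each token visits only top_k experts
            idx, w = top_idx[:, slot], weights[:, slot:slot + 1]
            for e, expert in enumerate(self.experts):
                mask = idx == e                    # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += w[mask] * expert(tokens[mask])
        return out.reshape(x.shape)

moe = MoEFeedForward()
y = moe(torch.randn(2, 5, 512))                    # two sequences of five tokens each
print(y.shape)                                     # torch.Size([2, 5, 512])
```

From the outside, this behaves like any other feedforward layer mapping (batch, seq, d_model) to (batch, seq, d_model); the sparsity lives entirely inside.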

Routing and Expert Parallelism for High-Throughput Processing | 0:21:20 - 0:27:20

They say the other important quantity is the active parameter count, which is the number of parameters used for processing an individual token and which grows with k. The more experts you consider per token, the more work you do. The trick here is to only use two out of the eight experts for each token, which immediately divides the number of parameters used per token inside these feedforward layers by four. They also say that this can be used to do expert parallelism. If you're doing really high-throughput inference, you put each of the experts, say W1, W2, W3 (each might be two matrices with some nonlinearities), onto a different GPU: GPU 1, GPU 2, and so on. To each GPU it then looks like a dense operation, and the router simply decides to which GPU a token is sent. This obviously only works if you have high throughput, i.e. if you pipeline your signals, accept a bit of queuing, and run very large batch sizes. But if you have that throughput, you can shard the model like this.
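
A quick back-of-the-envelope sketch of that divide-by-four claim, using hypothetical, roughly Mixtral-sized dimensions (the exact numbers are only for illustration):

```python
d_model, d_ff, n_experts, top_k = 4096, 14336, 8, 2   # hypothetical, Mixtral-like sizes

ff_params_per_expert = 2 * d_model * d_ff             # two weight matrices per expert, biases ignored
total_ff_params  = n_experts * ff_params_per_expert   # what has to sit in memory
active_ff_params = top_k * ff_params_per_expert       # what a single token actually touches

print(f"total feedforward params per layer : {total_ff_params / 1e9:.2f}B")
print(f"active feedforward params per token: {active_ff_params / 1e9:.2f}B "
      f"({total_ff_params // active_ff_params}x fewer)")
```

The attention parameters are shared rather than routed, so the full model's active parameter count shrinks by less than a factor of four overall, but inside the feedforward layers the saving is exactly eight over two.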

image_1712371007115

Routing tokens to different experts on different GPUs is what enables the higher throughput: with a sparse mixture of experts, tokens can be distributed across devices efficiently. The magic of machine learning lies in this new feedforward layer with its routing and softmax aggregation. Experimental results show that Mixtral outperforms models like Llama 2 70b and GPT-3.5. However, be cautious when interpreting plots that compare models by active parameter count, because in a mixture of experts the active parameters are selected dynamically. The model performs well on tasks like reasoning and retrieval, demonstrating that it can pull information out of its context window effectively; smartly selecting what to include in the context is more effective than adding everything.

Impact of Selective Information Inclusion | 0:27:20 - 0:31:20

Fine-tuning on multilingual data and releasing the models under the Apache license has a positive impact on the community. I appreciate the openness, but I question the lack of information on the dataset used for training, which may well be intentional given how critics have reacted to training data disclosures.

Routing Analysis: Experts Assigned Consecutive Tokens | 0:31:40 -

Looking at the routing analysis, it seems that consecutive tokens are frequently assigned to the same experts, but no clear patterns based on topic are observed in the assignments. The only notable observation is this tendency to assign consecutive tokens to the same experts, along with some other regularities.

image_1712371538762

image_1712371560392

The patterns may be complex or simply beyond human understanding, but regularities might still be found in the future, and it is important to research this in order to build new applications. I praise the release under the Apache license for wider accessibility. I can only speculate on the reasons for keeping the data source undisclosed, but I'm excited about the potential of open-source AI models like Mixtral.