eventually, we provide an example of an entire language design: a deep sequence design backbone (with repeating Mamba blocks) + language model head.
functioning on byte-sized tokens, transformers scale inadequately as https://isaiahevyp520182.vidublog.com/29307541/mamba-paper-no-further-a-mystery