It’s not theoretical. They’ve already released a 300B LLM dubbed Pangu Pro, trained on Huawei NPUs:
https://huggingface.co/papers/2505.21411
And it’s open weights!
https://huggingface.co/IntervitensInc/pangu-pro-moe-model
It’s actually a really neat model: the experts are split into 8 ‘groups’ and routed so that the same number of experts is active in each group at any given time. In other words, it’s specifically architected for 8-NPU Huawei servers, so there’s no excessive cross-communication or idle time between devices.
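Here’s a minimal sketch of what that group-balanced routing could look like, assuming a plain top-k-per-group scheme; the function name, tensor shapes, and parameters are my own illustrative choices, not the actual Pangu implementation:

```python
# Hypothetical sketch of group-balanced expert routing (not the Pangu code).
import torch

def grouped_topk_routing(router_logits, num_groups=8, k_per_group=1):
    """Pick the same number of experts from every group for each token.

    router_logits: (num_tokens, num_experts); num_experts must divide
    evenly by num_groups. Returns global expert ids and routing weights.
    """
    num_tokens, num_experts = router_logits.shape
    experts_per_group = num_experts // num_groups

    # View logits as (tokens, groups, experts-in-group) and take a
    # top-k independently inside each group.
    grouped = router_logits.view(num_tokens, num_groups, experts_per_group)
    topk_vals, topk_idx = grouped.topk(k_per_group, dim=-1)

    # Convert within-group indices back to global expert ids.
    group_offsets = torch.arange(num_groups) * experts_per_group
    expert_ids = topk_idx + group_offsets.view(1, num_groups, 1)

    # Normalize the selected logits into weights across all picks.
    weights = torch.softmax(topk_vals.flatten(1), dim=-1)
    return expert_ids.flatten(1), weights

# With the 8 groups mapped one-to-one onto the 8 NPUs of a server,
# each device hosts one group and serves exactly k_per_group experts
# per token, so the per-device load is balanced by construction.
logits = torch.randn(4, 64)  # 4 tokens, 64 experts -> 8 per group
ids, w = grouped_topk_routing(logits)
print(ids.shape, w.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```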
So yeah, even if it’s not a B200, the proof’s in the pudding: huge models are being trained and run on these things.