Running distilled DeepSeek R1 models locally on Copilot+ PCs, powered by Windows Copilot Runtime.

Artificial intelligence continues to drive innovation in personal computing, especially with the advent of Copilot+ PCs. These devices now have access to DeepSeek R1, available on Azure AI Foundry, in versions optimized specifically for neural processing units (NPUs), starting with the Qualcomm Snapdragon X and on track to include the Intel Core Ultra 200V, among others.

The first available model is DeepSeek-R1-Distill-Qwen-1.5B, accessible through the AI Toolkit; 7B and 14B variants will follow shortly, expanding the options for developers looking to integrate AI into their applications. These models are designed to deliver strong on-device performance, leveraging the NPU to run inference efficiently. This represents a step towards a new paradigm in which generative AI can run semi-continuously, providing services without needing to be invoked exclusively on demand.

The work on Phi Silica has been key in this context: it enables efficient on-device inference with competitive response times while keeping resource consumption low and preserving battery life. To optimize DeepSeek for NPUs, techniques such as separating model components and low-bit quantization have been applied, balancing performance and efficiency.
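The low-bit quantization mentioned above can be illustrated with a minimal sketch. The snippet below shows symmetric 4-bit block-wise quantization in NumPy: each block of weights shares a single float scale and stores values as small integers. The function names, block size, and layout are illustrative assumptions, not the actual scheme used in the shipped models.

```python
import numpy as np

def quantize_4bit(weights: np.ndarray, block_size: int = 32):
    """Symmetric 4-bit block-wise quantization (illustrative sketch):
    each block shares one float scale; codes are ints in [-8, 7]."""
    flat = weights.reshape(-1, block_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                       # avoid division by zero
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_4bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from 4-bit codes and scales."""
    return q.astype(np.float32) * scales

# Quantize a random weight matrix and inspect the reconstruction error.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 32)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s).reshape(w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

The trade-off is exactly the one the article describes: 4-bit codes cut memory and bandwidth roughly 8x versus float32, at the cost of a small, bounded rounding error per block.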

Interested developers can start experimenting with DeepSeek on their Copilot+ PCs by downloading the AI Toolkit extension for Visual Studio Code. They also get access to a catalog of models optimized in the ONNX QDQ format, simplifying integration into AI projects. Additionally, the source model can be tried in the cloud on Azure AI Foundry.
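The ONNX QDQ format mentioned above represents quantization explicitly with paired QuantizeLinear/DequantizeLinear operators. As a rough guide to what those operators compute, the NumPy sketch below mirrors their uint8 semantics (round to integer codes with a scale and zero point, then reconstruct); the example tensor and range are made up for illustration.

```python
import numpy as np

def quantize_linear(x, scale, zero_point):
    """uint8 QuantizeLinear semantics: q = saturate(round(x / scale) + zp)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize_linear(q, scale, zero_point):
    """DequantizeLinear semantics: x ~= (q - zp) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

# Pick a scale/zero point covering the tensor's observed range [-1, 3].
x = np.array([-1.0, 0.0, 1.5, 3.0], dtype=np.float32)
scale = float(x.max() - x.min()) / 255.0
zero_point = round(-float(x.min()) / scale)
q = quantize_linear(x, scale, zero_point)
x_hat = dequantize_linear(q, scale, zero_point)
print(q, x_hat)
```

Because the Q/DQ pairs sit in the graph as ordinary nodes, a runtime that understands them (such as an NPU execution provider) can fuse them into true low-precision kernels, while any other runtime can still execute the model in float.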

The Qwen 1.5B release comprises several components, including a tokenizer, context-processing models, and an advanced quantization scheme. This allows the model to respond quickly, with a time to first token of just 130 ms and a throughput of 16 tokens per second on short prompts. These results come from a design that uses a sliding window for context processing, combined with quantization techniques that surpass previous methods in precision.
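One way to picture the sliding-window context processing is that a long prompt is fed to the accelerator in fixed-size chunks, so every call sees the same static tensor shape. The sketch below shows only that chunking-and-padding idea; the window size, padding token, and function are hypothetical, and the real runtime also carries KV-cache state between chunks.

```python
def chunk_prompt(tokens, window=64):
    """Split a prompt into fixed-size windows so each accelerator call
    sees the same static tensor shape (illustrative sketch only).
    Short final chunks are padded with token id 0."""
    chunks = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        chunks.append(chunk + [0] * (window - len(chunk)))  # pad to window
    return chunks

prompt = list(range(150))            # a hypothetical 150-token prompt
chunks = chunk_prompt(prompt, window=64)
print(len(chunks), [len(c) for c in chunks])
```

Fixed shapes matter because NPU kernels are typically compiled for static tensor sizes; chunking lets arbitrarily long contexts reuse one compiled graph instead of recompiling per prompt length.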

With these innovations, users will be able to interact with state-of-the-art artificial intelligence models on their personal devices, transforming the way AI applications are developed and utilized. This evolution promises to redefine the user experience in the realm of personal technology.

Via: MiMub (in Spanish)
