Understanding DeepSeek-R1

DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, and in several benchmarks even surpass, OpenAI's o1 model, it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.

What makes DeepSeek-R1 particularly exciting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cost-efficient, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until ~GPT-4, the conventional wisdom was that better models required more data and compute. While that is still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

The Essentials

The DeepSeek-R1 paper presented multiple models, but the main ones are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning approach that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag before answering with a final summary.
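As a small illustration, here is a minimal sketch (my own, not from the paper) of how such an output can be split into the reasoning trace and the final answer, assuming the chain-of-thought is wrapped in `<think>...</think>` tags:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split a model response into (reasoning, answer), assuming the
    chain-of-thought is wrapped in <think>...</think> tags."""
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if match is None:
        return "", response.strip()  # no reasoning block found
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer

example = "<think>2 + 2 is 4, and 4 - 1 is 3.</think>The answer is 3."
reasoning, answer = split_reasoning(example)
print(reasoning)  # 2 + 2 is 4, and 4 - 1 is 3.
print(answer)     # The answer is 3.
```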
R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base without any supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero attains excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.

It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.

Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline differs from the usual one:

The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages

Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This provides a good model to start RL from.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples (a rough sketch of this idea follows after this list).
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain the distilled-R1 models.
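To make the rejection-sampling step more concrete, here is a minimal sketch of the general idea, with hypothetical `generate_candidates` and `is_correct` helpers standing in for the actual sampling and grading; it is not DeepSeek's implementation:

```python
def build_sft_dataset(prompts, generate_candidates, is_correct, n_samples=16):
    """Rejection sampling: keep only candidate completions that pass a
    correctness check and reuse them as supervised fine-tuning data."""
    sft_examples = []
    for prompt in prompts:
        # Sample several completions from the RL checkpoint for each prompt.
        candidates = generate_candidates(prompt, n=n_samples)
        accepted = [c for c in candidates if is_correct(prompt, c)]
        if accepted:
            # Keep one accepted completion (here simply the shortest one).
            sft_examples.append({"prompt": prompt, "completion": min(accepted, key=len)})
    return sft_examples
```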
Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.
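Reduced to its essence, that amounts to something like the following sketch, where the hypothetical `teacher_generate` stands in for the larger model producing reasoning traces:

```python
def build_distillation_data(prompts, teacher_generate):
    """Distillation data: the teacher writes full responses (including its
    reasoning traces), and the student is later fine-tuned on them as SFT data."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]
```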
Group Relative Policy Optimization (GRPO)

The fundamental idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to produce chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected format, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
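A minimal sketch of what such rule-based checks could look like; the specific rules and weights here are my own illustration, not the ones DeepSeek used:

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that wrap their reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", response, re.DOTALL) else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Reward responses whose final answer matches the reference exactly."""
    final_answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    return 1.0 if final_answer == reference_answer else 0.0

def language_reward(response: str, prompt: str) -> float:
    """Crude language-consistency check: a pure-ASCII prompt should not get
    a response that mixes in non-ASCII scripts."""
    return 0.0 if prompt.isascii() and not response.isascii() else 1.0

def total_reward(response: str, prompt: str, reference_answer: str) -> float:
    # Weighted sum of the individual rule-based rewards (weights are arbitrary).
    return (accuracy_reward(response, reference_answer)
            + 0.5 * format_reward(response)
            + 0.5 * language_reward(response, prompt))
```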
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates several different responses.
2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative advantages. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.
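Step 3, the group-relative part, can be expressed as a short sketch (ignoring the clipping and KL terms of the full objective):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each response's reward by the mean and
    standard deviation of its own group (all responses to the same prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Rule-based rewards for four sampled responses to one prompt:
rewards = [2.0, 0.5, 1.0, 0.5]
print(group_relative_advantages(rewards))
# Responses above the group mean get positive advantages and are reinforced.
```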
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the thinking-tag syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.

Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the approaches they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

These findings suggest that RL improves the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

To put it simply, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the variety of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce considerable performance gains, there seems to be an inherent ceiling determined by the underlying model's pretrained knowledge.
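One way to make this concrete is the difference between pass@1 and pass@k: RL mainly raises the chance that the single top-ranked answer is correct, while the chance that a correct answer exists somewhere among k samples moves much less. A toy illustration with made-up correctness data (not real model outputs):

```python
def pass_at_1(samples_per_question: list[list[bool]]) -> float:
    """Fraction of questions where the top-ranked sample is correct."""
    return sum(s[0] for s in samples_per_question) / len(samples_per_question)

def pass_at_k(samples_per_question: list[list[bool]]) -> float:
    """Fraction of questions where at least one of the k samples is correct."""
    return sum(any(s) for s in samples_per_question) / len(samples_per_question)

# Hypothetical correctness of 4 samples for 3 questions, before and after RL:
before = [[False, True, False, False], [False, False, True, False], [False, False, False, False]]
after  = [[True, False, False, False], [True, False, True, False], [False, False, False, False]]
print(pass_at_1(before), pass_at_k(before))  # 0.0  ~0.67: the capability exists but is rarely ranked first
print(pass_at_1(after), pass_at_k(after))    # ~0.67 ~0.67: RL reshapes the ranking, not the ceiling
```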
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

Running DeepSeek-R1

I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.

671B via Llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this setup.
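For reference, a sketch of how a comparable setup could be driven from Python through the llama-cpp-python bindings; the model path is a placeholder for wherever the Unsloth GGUF is stored, and the context size is arbitrary:

```python
from llama_cpp import Llama

# Load the 1.58-bit quantized GGUF with 29 layers offloaded to the GPU.
llm = Llama(
    model_path="path/to/DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder path
    n_gpu_layers=29,  # partial offloading; the sweet spot found above
    n_ctx=8192,       # context window; larger values cost more memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many days are between 2024-02-01 and 2024-03-01?"}],
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```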
Performance:

A r/localllama user explained that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these big models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.

70B via Ollama

70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:
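A sketch of how such a run can be scripted against a local Ollama server with the official Python client, assuming the 70B distill has been pulled under the `deepseek-r1:70b` tag:

```python
import ollama

# Ask the locally served 70B model a question; the reasoning shows up
# inside <think> tags before the final answer.
response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "What is 7 * 13 - 4?"}],
)
print(response["message"]["content"])
```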
GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of the 671B model that I showcased above.

Resources

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
The Illustrated DeepSeek-R1 - by Jay Alammar
Explainer: What's R1 & Everything Else? - Tim Kellogg
DeepSeek R1 Explained to your grandma - YouTube

DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper explores scaling laws and presents findings that facilitate the scaling of large-scale models in open-source settings. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1, completely open source (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1.

Liked this post? Join the newsletter.