1
54
submitted 13 hours ago by yogthos@lemmy.ml to c/technology@lemmy.ml
2
39
submitted 2 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml

The hardware efficiency gains are honestly the most interesting part of the paper. The main reason DeepSeek-V4 is so cheap to run is that they sidestepped the quadratic cost of standard attention at massive context windows.

They built a hybrid attention architecture that interleaves Compressed Sparse Attention and Heavily Compressed Attention. Standard models keep every single token in the KV cache, which absolutely kills memory. CSA fixes this by compressing the KV entries of multiple tokens into a single entry and then using a sparse routing mechanism to compute attention only over the top-k most relevant compressed blocks. HCA takes it a step further, compressing an even larger number of tokens into one entry but computing dense attention over all of them. The result is that the 1.6T-parameter Pro model uses only a third of the compute FLOPs and 10% of the KV cache memory of DeepSeek-V3.2 at a one-million-token context.
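
To make the CSA idea concrete, here is a minimal toy sketch (my own illustration, not DeepSeek's code; mean pooling stands in for whatever learned compression they actually use): runs of tokens are pooled into single KV entries, a routing score picks the top-k compressed blocks, and attention is computed only over those.

import numpy as np

def csa_attention(q, K, V, block=8, top_k=4):
    """q: (d,) query; K, V: (n, d) cached keys/values for the context."""
    n, d = K.shape
    n_blocks = n // block
    # Compress each run of `block` tokens into a single KV entry.
    Kc = K[:n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    Vc = V[:n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)
    # Sparse routing: score every compressed block, keep only the top-k.
    scores = Kc @ q / np.sqrt(d)
    keep = np.argsort(scores)[-top_k:]
    # Softmax attention over the surviving compressed entries only.
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()
    return w @ Vc[keep]

rng = np.random.default_rng(0)
K, V = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
out = csa_attention(rng.normal(size=64), K, V)  # cache shrinks 1024 -> 128 entries; only 4 are attended

HCA is the same move with a larger block size and no top-k cut, i.e. dense attention over the compressed entries.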

They also aggressively pushed low-precision formats, applying FP4 quantization-aware training to the Mixture-of-Experts weights and the attention Query-Key paths. MoE models are notoriously memory-bound because you have to constantly shuttle massive expert weights into the GPU cores. Dropping these to FP4 slashes the memory-bandwidth bottleneck and lets the model run much faster during inference without ruining accuracy, since the quantization is handled dynamically during training.
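
And a rough sketch of what quantization-aware training buys you (my own toy, using symmetric per-tensor scaling as a crude stand-in for their FP4 scheme, which is a tiny floating-point format rather than an integer grid): the forward pass sees weights rounded onto a 4-bit-ish grid while the optimizer keeps full-precision master weights, so the model learns parameters that survive the precision drop.

import numpy as np

def fake_quant_4bit(w):
    # Dynamic symmetric scaling onto a 15-level signed grid (4-bit-ish stand-in for FP4).
    scale = np.abs(w).max() / 7 + 1e-12
    return np.clip(np.round(w / scale), -7, 7) * scale

W = np.random.default_rng(1).normal(size=(256, 256)).astype(np.float32)
W_q = fake_quant_4bit(W)  # what the forward pass sees during training
# Backward pass uses a straight-through estimator: gradients flow as if
# W_q were W, and the full-precision master copy of W keeps being updated.
print("mean abs rounding error:", float(np.abs(W - W_q).mean()))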

On the infrastructure side, they wrote a custom fused kernel in TileLang that overlaps communication and computation. When running expert parallelism across multiple GPUs, you usually hit a wall waiting on the network. DeepSeek slices the experts into micro-waves, so the GPU is crunching matrix math on the first wave while the network is pulling the data for the second. They essentially hide the network latency behind the compute time, which means you do not need exotic interconnects to get peak hardware utilization out of the cluster.
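
A toy of the overlap trick (plain Python threads standing in for CUDA streams and the expert-parallel all-to-all; this illustrates the scheduling idea, not their TileLang kernel): while the current wave is being computed, the next wave's data is already in flight, so the network time disappears behind the compute time.

import threading, time

def fetch(wave):    # stands in for the network pulling expert inputs for one wave
    time.sleep(0.05)
    return f"inputs for wave {wave}"

def compute(data):  # stands in for the expert matrix math on the GPU
    time.sleep(0.08)

def run(n_waves=4):
    data = fetch(0)                                  # prefetch the first wave
    for i in range(n_waves):
        nxt, t = {}, None
        if i + 1 < n_waves:                          # start pulling the next wave now...
            t = threading.Thread(target=lambda j=i + 1: nxt.update(d=fetch(j)))
            t.start()
        compute(data)                                # ...while crunching the current one
        if t:
            t.join()
            data = nxt["d"]

start = time.time()
run()
print(f"{time.time() - start:.2f}s")  # ~0.05 + 4*0.08 = 0.37s, not 4*(0.05 + 0.08) = 0.52s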

3
32
submitted 2 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
4
2
Flipbook (www.sketchapedia.com)
submitted 1 day ago by Zerush@lemmy.ml to c/technology@lemmy.ml

Flipbook (hosted at sketchapedia.com) is an AI-powered visual browser that generates illustrated, interactive infographics on demand in real time. You type any topic, and it renders a clickable, sometimes animated image explaining it — similar to prompting ChatGPT or Claude, but the output is visual rather than text.

According to LinkedIn, the tool was built by Zain Shah and team. It describes itself as "an infinite visual browser generated entirely on demand in real time."

Japanese bookmarking site Hatena categorises it under AI, LLM, and web tools, with users tagging it as worth reading later.

Sources: LinkedIn, Hatena

5
25

A directory created by the Centers for Medicare and Medicaid Services (CMS) has exposed the Social Security numbers of numerous US healthcare providers.

The Trump administration introduced a new Medicare portal as part of plans to modernize US healthcare technology.

However, a database that was part of the directory was left publicly accessible, exposing providers’ names and Social Security numbers.

6
30
submitted 2 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
7
31
8
22
submitted 3 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
9
17
Apocalypse Early Warning System (ews.kylemcdonald.net)
submitted 3 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
10
66
submitted 4 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
11
20
submitted 3 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
12
1
submitted 2 days ago* (last edited 2 days ago) by zdhzm2pgp@lemmy.ml to c/technology@lemmy.ml
13
4
submitted 2 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
14
17
submitted 4 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
15
44
submitted 5 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
16
76
submitted 5 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
17
192
submitted 6 days ago* (last edited 6 days ago) by yogthos@lemmy.ml to c/technology@lemmy.ml
18
12
submitted 4 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
19
21
submitted 6 days ago* (last edited 5 days ago) by exu@feditown.com to c/technology@lemmy.ml

cross-posted from: https://feditown.com/post/2911581

Edit: Adding a warning here: the post was probably heavily AI-written and contains mistakes to that effect, which is unfortunate. The data in general is still interesting, though.

20
5
submitted 4 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
21
11
submitted 5 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
22
33
submitted 6 days ago by chobeat@lemmy.ml to c/technology@lemmy.ml
23
-1
submitted 2 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
24
-1
submitted 3 days ago by yogthos@lemmy.ml to c/technology@lemmy.ml
25
13
submitted 6 days ago* (last edited 6 days ago) by yogthos@lemmy.ml to c/technology@lemmy.ml

A GGUF port of DFlash speculative decoding: a standalone C++/CUDA stack on top of ggml that runs on a single 24 GB RTX 3090 and hosts the new Qwen3.6-27B.

~1.98x mean speedup over autoregressive decoding on Qwen3.6 across HumanEval / GSM8K / Math500, with zero retraining.

If you have CUDA 12+ and an NVIDIA GPU like an RTX 3090 / 4090 / 5090, all you need to do is:

clone the repo, then build:

cd lucebox-hub/dflash
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --target test_dflash -j

fetch the target (~16 GB):

hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/

the matched 3.6 draft is gated: accept the terms and set HF_TOKEN first, then:

hf download z-lab/Qwen3.6-27B-DFlash --local-dir models/draft/

run

DFLASH_TARGET=models/Qwen3.6-27B-Q4_K_M.gguf python3 scripts/run.py --prompt "def fibonacci(n):"

That's it. No Python runtime in the engine, no llama.cpp install, no vLLM, no SGLang.

Luce DFlash will:

  1. Load Qwen3.6-27B Q4_K_M target weights (~16 GB) plus the matched DFlash bf16 draft (~3.46 GB) and run DDTree tree-verify speculative decoding (block size 16, default budget 22, greedy verify); a toy sketch of the verify loop follows this list.
  2. Compress the KV cache to TQ3_0 (3.5 bpv, ~9.7x vs F16) and roll a 4096-slot target_feat ring so 256K context fits in 24 GB. Q4_0 is the legacy path and tops out near 128K.
  3. Auto-bump the prefill ubatch from 16 to 192 for prompts past 2048 tokens (~913 tok/s prefill on 13K prompts).
  4. Apply sliding-window flash attention at decode (default 2048-token window, 100% speculative acceptance retained) so 60K context still decodes at 89.7 tok/s instead of 25.8 tok/s.
  5. Serve over an OpenAI-compatible HTTP endpoint or a local chat REPL (a sketch of a request against it comes after the benchmark numbers below).
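
As promised above, here is a toy of the verify loop (plain greedy block verification, my own simplification; DFlash's actual DDTree verifies a tree of drafts, not a single block). The point is that the target model checks a whole draft block in one forward pass, and everything up to the first mismatch is accepted for free, which is where the AL (mean accepted length) column below comes from.

def verify_block(target_next_tokens, context, draft_block):
    # target_next_tokens(seq) returns the target's greedy next-token choice at
    # every position of seq, computed in a single forward pass.
    preds = target_next_tokens(context + draft_block)
    out = []
    for i, tok in enumerate(draft_block):
        expected = preds[len(context) + i - 1]   # target's pick given everything before this slot
        if tok != expected:
            out.append(expected)                 # first mismatch: fall back to the target's token
            return out
        out.append(tok)                          # match: accepted without an extra target step
    out.append(preds[-1])                        # whole block accepted: one bonus target token
    return out

# trivial demo: a "target" that always continues 1, 2, 3, ... and a draft that gets two tokens right
target = lambda seq: [t + 1 for t in seq]
print(verify_block(target, [1, 2], [3, 4, 9, 9]))  # -> [3, 4, 5]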

Running on RTX 3090, Qwen3.6-27B UD-Q4_K_XL (unsloth Dynamic 2.0) target, 10 prompts/dataset, n_gen=256:

Bench       AR tok/s   DFlash tok/s   AL     Speedup
HumanEval   34.90      78.16          5.94   2.24x
Math500     35.13      69.77          5.15   1.99x
GSM8K       34.89      59.65          4.43   1.71x
Mean        34.97      69.19          5.17   1.98x
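
For item 5's OpenAI-compatible endpoint, a request would typically look like the sketch below. To be clear, the host, port, and model name here are my assumptions for illustration, not values documented by the project.

import json, urllib.request

# Assumed address and model id; check the server's startup output for the real ones.
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps({
        "model": "qwen3.6-27b-dflash",
        "messages": [{"role": "user", "content": "def fibonacci(n):"}],
        "max_tokens": 256,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
print(json.load(urllib.request.urlopen(req))["choices"][0]["message"]["content"])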

Technology

42494 readers

This is the official technology community of Lemmy.ml for all news related to creation and use of technology, and to facilitate civil, meaningful discussion around it.


Ask in a DM before posting product reviews or ads; otherwise, such posts are subject to removal.


Rules:

1: All Lemmy rules apply

2: Do not post low-effort posts

3: NEVER post naziped*gore stuff

4: Always post article URLs or their archived version URLs as sources, NOT screenshots. Help the blind users.

5: Personal rants about Big Tech CEOs like Elon Musk are unwelcome (this does not include posts about their companies affecting a wide range of people)

6: No advertisement posts unless verified as legitimate and non-exploitative/non-consumerist

7: Crypto-related posts, unless essential, are disallowed

founded 7 years ago