Fixing Linux Kernel Bugs by Vibe Coding
My Kaby Lake-R laptop had an annoying problem. On Linux, in performance mode, the audio crackled. A lot. Here's the story of how I debugged it and vibe-coded a kernel patch to fix it.
I had recently switched to Fedora from Arch, so it wasn't a distro misconfiguration. I tried changing PipeWire's (and pipewire-pulse's) sample rate and quantum values according to the Arch Wiki; that didn't work. Nothing out of the ordinary in pw-top or the PipeWire logs. I tried intel_pstate=disable too, and all the usual suspects.
I found many forum posts about this — none of them had a fix, except one. snappy91 found that the issue vanished when he disabled the SOF drivers and set intel_idle.max_cstate=1. So it wasn't PipeWire; it was either the audio driver stack or a CPU power-management quirk.
Ruling out the audio driver stack
I didn't want to nuke C-states; that's insane! So I investigated the driver stack first, hoping that was the problem. lsmod did report that all the modules snappy91 said to disable were loaded. But were they actually being used?
I ran lspci -k | grep -A3 -i audio,
00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
Subsystem: ASUSTeK Computer Inc. Device 1a00
Kernel driver in use: snd_hda_intel
Kernel modules: snd_soc_avs, snd_sof_pci_intel_skl, snd_hda_intel
which told me only snd_hda_intel was being used. The SOF docs go into more detail if you're interested. The docs also tell us how to force the legacy driver: options snd-intel-dspcfg dsp_driver=1. Just to be sure, I set CONFIG_SND_SOC=n in my custom kernel to compile it out — as expected, it didn't fix it.
That leaves us with C-states. Reluctantly, I disabled them — and it worked! But that's not a practical solution. Now what?
Interrupts galore
I removed the kernel parameter and got back to debugging. Among other things, I checked /proc/interrupts. It shows us what interrupts (IRQs) we're getting, from where, how many, and which core is handling them. I turned off performance mode, fired up watch -d -n 1 'rg hda /proc/interrupts' and started playback. A few IRQs at the start, then nothing. No crackling either.
I switched to performance mode, and watched the IRQs come in. Sometimes reaching 100 IRQs a second. The crackling happened exactly when the number was going up.
GPT helped me with a /sys/kernel/tracing (ftrace) run, and we found that in performance mode, the CPU was often asleep when the time came to handle audio.
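A minimal version of that kind of session looks like this (the paths and event names are real tracefs interfaces; I'm not reproducing the exact commands GPT gave me):
# as root
cd /sys/kernel/tracing
echo 1 > events/power/cpu_idle/enable        # C-state entry/exit per CPU
echo 1 > events/irq/irq_handler_entry/enable # IRQ handlers, including snd_hda_intel's
echo 1 > tracing_on
cat trace_pipe | rg 'cpu_idle|hda'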
Pinpointing the culprit
/proc/interrupts doesn't tell us what, exactly, the device is interrupting about. So I asked GPT to add some debug logging to the driver so the details would show up in dmesg. While writing the patch, it also noticed a no_period_wakeup flag; I asked it to log whether a stream was using it or not. I compiled the kernel, booted into it, and started playback.
hda-intel: open PCM c0p3p -> stream idx=7
hda-intel: trigger START c0p3p idx=7 buf=245760 period=8192 npw_runtime=1 npw_hw=1 SD_CTL=0x1c SD_STS=0x00
hda-intel: IRQ INTSTS=0x80000080
hda-intel: IRQ stream idx=7 dir=0 SD_STS=0x28 pos=222944
npw_runtime=1 and npw_hw=1 both confirm that no_period_wakeup was on (I later tried forcing it off, but it didn't fix the problem). The last two lines were spammed exactly when the crackling occurred. GPT decoded the SD_STS=0x28 as 0x20 (SD_STS_FIFO_READY) + 0x08 (SD_INT_FIFO_ERR).
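For the curious, the logging boils down to something like this. This is a reconstructed sketch, not the actual patch GPT wrote: the SD_STS bit names are real and come from include/sound/hda_register.h, while the helper name, the message format, and the assumption that it's called from the stream-IRQ path in sound/hda/hdac_controller.c are approximations.
/* Sketch of the kind of per-stream IRQ logging used (not the actual patch). */
#include <linux/device.h>
#include <sound/hdaudio.h>
#include <sound/hda_register.h>

static void hda_dbg_log_stream_irq(struct hdac_bus *bus, u32 int_status)
{
	struct hdac_stream *s;

	dev_info(bus->dev, "hda-intel: IRQ INTSTS=0x%08x\n", int_status);

	list_for_each_entry(s, &bus->stream_list, list) {
		u8 sd_status;

		if (!(int_status & s->sd_int_sta_mask))
			continue;

		sd_status = snd_hdac_stream_readb(s, SD_STS);
		dev_info(bus->dev,
			 "hda-intel: IRQ stream idx=%d dir=%d SD_STS=0x%02x%s%s\n",
			 s->index, s->direction, sd_status,
			 (sd_status & SD_STS_FIFO_READY) ? " (FIFO_READY)" : "",
			 (sd_status & SD_INT_FIFO_ERR) ? " (FIFO_ERR)" : "");
	}
}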
I ran sudo bash -c 'exec 3>/dev/cpu_dma_latency; echo -n 0 >&3; sleep infinity' to effectively kill C-states, and, as expected, the interrupts stopped. /dev/cpu_dma_latency is part of the kernel's PM QoS (Power Management Quality of Service) system. By writing a value to it and keeping the file descriptor open, you tell the kernel that the CPU must be able to respond within that many microseconds; here, 0.
I kept bumping the number up. 45 was the limit; the crackling returned at 46. It turns out the kernel parses the written value as hex, not decimal, so 0x45 = 69 µs and 0x46 = 70 µs: the crackling starts once 70 µs of wakeup latency is allowed. Nice!
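If you'd rather not fight the hex parsing, a few lines of C can hold the same request: writing exactly four raw bytes makes the kernel take the value as a plain 32-bit integer. This little tool is just an illustration of the interface, not part of the fix.
/* qos_hold.c: hold a CPU latency PM QoS request of N microseconds until killed.
 * Writing exactly sizeof(int32_t) bytes is treated as a raw integer, which
 * sidesteps the hex parsing of ASCII writes. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int32_t usec = (argc > 1) ? atoi(argv[1]) : 0;
	int fd = open("/dev/cpu_dma_latency", O_WRONLY);

	if (fd < 0 || write(fd, &usec, sizeof(usec)) != sizeof(usec)) {
		perror("cpu_dma_latency");
		return 1;
	}
	printf("holding cpu_dma_latency at %d us; ctrl-c to release\n", usec);
	pause();	/* the request is dropped as soon as the fd is closed */
	return 0;
}
Build and run it with gcc qos_hold.c -o qos_hold && sudo ./qos_hold 69 to reproduce the crackle-free 69 µs case without the hex surprise.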
drivers/idle/intel_idle.c's skl_cstates[] tells us that for my CPU (Kaby Lake-R/Skylake), C3's exit latency is 70 µs. Lines up perfectly.
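The entry in question looks roughly like this (trimmed and quoted from memory of a recent tree; double-check against your kernel version):
/* drivers/idle/intel_idle.c, skl_cstates[] (trimmed) */
{
	.name = "C3",
	.desc = "MWAIT 0x10",
	.exit_latency = 70,		/* µs; exactly our crackling threshold */
	.target_residency = 100,
	/* ... */
},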
After I added some more debug logs to rule out ALSA underruns and other causes, only one possible explanation remained: something was wrong with the PCU.
What was happening
In modern CPUs, the kernel doesn't directly control P-states and C-states. It sends hints to the PCU (Power Control Unit), and the PCU makes the final decision. The HDA (audio) controller lives in the PCH (Platform Controller Hub), on the "uncore" side of things — as opposed to the CPU "cores". On these mobile chips, the CPU die and the PCH share a package, and the PCU is responsible for the whole package.
My theory is that when maximum performance is requested, the PCU does deliver, but decides to be more aggressive with C-states to save power. Trading latency for efficiency. Not a bad idea, except when latency is actually critical, like during audio playback.
Debugging confirmed that the ALSA buffer wasn't underrunning. The HDA's internal FIFO buffer was starving every now and then, due to the PCU allowing deeper package C-states when it shouldn't have. That's the only plausible explanation for why we were getting SD_INT_FIFO_ERR during crackling. If anyone reading this has more info about this, please let me know!
The Fix
The rest was rather straightforward. It's common for audio drivers to use PM QoS to request a certain maximum latency. That's what we're gonna do, but our fix will be more efficient. Way more.
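The usual pattern looks something like this. The cpu_latency_qos_* calls are the real kernel API from linux/pm_qos.h; the surrounding function names and the 20 µs figure are made up for illustration:
/* The classic, coarse approach: while the stream runs, no CPU anywhere may
 * take longer than ~20 µs to wake up.  The constraint is system-wide, so
 * every core is held out of deep C-states, whatever the governor. */
#include <linux/pm_qos.h>

static struct pm_qos_request audio_qos;		/* illustrative name */

static void my_stream_start(void)
{
	cpu_latency_qos_add_request(&audio_qos, 20);
}

static void my_stream_stop(void)
{
	cpu_latency_qos_remove_request(&audio_qos);
}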
First, we'll scope the PM QoS request to only the core that's handling the HDA IRQ. As obvious as this sounds, I didn't see any other audio driver do this — and audio drivers typically do use PM QoS requests along those lines. This alone saves a lot of power: every other core stays free to drop into deep C-states.
Second, we'll add a CPUFreq policy change notifier. We'll only add the request if the governor is performance when the stream starts, and dynamically add or remove it if the governor changes mid-stream. Since the crackling doesn't happen under any other governor, this lets the PCU do whatever it wants with C-states the rest of the time, saving even more power.
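To give a feel for the shape of it, here's a rough sketch, not the actual patches linked below. The dev_pm_qos_* calls, get_cpu_device() and the cpufreq policy accessors are real kernel APIs; the function names, the 30 µs number, and the question of how the driver learns which CPU services its IRQ are all hand-waved, and the dynamic re-check on governor changes (the part the notifier patch exists for) is left out:
/* Sketch: cap C-state exit latency on just the CPU that services the HDA
 * interrupt, and only while the cpufreq governor is "performance". */
#include <linux/cpu.h>
#include <linux/cpufreq.h>
#include <linux/pm_qos.h>
#include <linux/string.h>

static struct dev_pm_qos_request hda_irq_qos;	/* illustrative */
static bool hda_qos_active;

static bool governor_is_performance(unsigned int cpu)
{
	struct cpufreq_policy *policy = cpufreq_cpu_get(cpu);
	bool perf = false;

	if (policy) {
		if (policy->governor)	/* e.g. acpi-cpufreq, intel_pstate passive */
			perf = !strcmp(policy->governor->name, "performance");
		else			/* setpolicy drivers, e.g. intel_pstate active */
			perf = policy->policy == CPUFREQ_POLICY_PERFORMANCE;
		cpufreq_cpu_put(policy);
	}
	return perf;
}

static void hda_stream_started(unsigned int irq_cpu)
{
	if (hda_qos_active || !governor_is_performance(irq_cpu))
		return;

	/* Per-CPU resume-latency constraint: only this core is affected. */
	dev_pm_qos_add_request(get_cpu_device(irq_cpu), &hda_irq_qos,
			       DEV_PM_QOS_RESUME_LATENCY, 30);
	hda_qos_active = true;
}

static void hda_stream_stopped(void)
{
	if (hda_qos_active) {
		dev_pm_qos_remove_request(&hda_irq_qos);
		hda_qos_active = false;
	}
}
The point of DEV_PM_QOS_RESUME_LATENCY on a single CPU device is that only that core is barred from deep C-states; the other cores keep idling normally, which is where the power savings over a global request come from.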
GPT wrote both for me. And it worked perfectly first try. I made it iterate on the patches once for further polish. Here they are:
- cpufreq: Add policy change notifier (Pastebin)
- ALSA: hda: Keep IRQ core out of deep C-states (Pastebin)
I'll probably play with the first patch to implement perf/power hacks in other subsystems. Very fun!
The future is now, old man
On one side, we have devs vehemently opposing the use of AI in coding, claiming it's terrible. On the other, we have LinkedIn slop-masters "vibe coding", claiming their workflow with a million different moving parts (which are all bullshit) is marginally better than the next guy's. Both are idiots. Outside these camps, a minority exists happily, learning and getting stuff done at unprecedented speeds.
LLMs let us do things that simply weren't worth the time and effort before. I could've done all this on my own, sure. I've written kernel patches to fix bugs before LLMs, even though I didn't really know C or kernel internals (I still don't!). The difference is, I can afford to get nerd-sniped into a rabbit hole like this every day. It takes hours of my time, not weeks. I get to choose the abstraction level I'm operating at.
I left out many things in this blog post, but I learnt so much more after the patches were in place. I spent a few more hours figuring out exactly what was happening, tried different approaches, and learnt how audio playback is handled at different layers. This would've taken me a few more weeks. (And yes, the title might be bait, since the bug isn't in the kernel.)
Ironically, the advent of AI is making me work more than ever before, because I can't bear the thought of letting all these opportunities go to waste.
Note: I used Codex CLI with GPT-5.1 — regular, not the Codex model — in only low and medium reasoning modes.