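Handling regressions

"We don't cause regressions": this document describes what this "first rule of Linux kernel development" means in practice for developers. It complements "Reporting regressions", which covers the topic from a user's point of view; if you have never read that text, at least skim it before continuing here.

The important bits (aka "TL;DR")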

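Ensure subscribers of the regressions mailing list (regressions@lists.linux.dev) quickly become aware of any new regression report:
- When receiving a mailed report that did not CC the list, bring it into the loop by immediately sending at least a brief "Reply-all" with the list CCed.
- Forward or bounce any reports submitted in bug trackers to the list.
Make the Linux kernel regression tracking bot "regzbot" track the issue (this is optional, but recommended):
- For mailed reports, check if the reporter included a line like #regzbot introduced: v5.13..v5.14-rc1. If not, send a reply (with the regressions list in CC) containing a paragraph like the following, which tells regzbot when the issue started to happen:
  #regzbot ^introduced: 1f2e3d4c5b6a
- When forwarding reports from a bug tracker to the regressions list (see above), include a paragraph like the following:
  #regzbot introduced: v5.13..v5.14-rc1
  #regzbot from: Some N. Ice Human <some.human@example.com>
  #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
When submitting fixes for regressions, add "Closes:" tags to the patch description pointing to all places where the issue was reported, as mandated by "Submitting patches: the essential guide to getting your code into the kernel" and Documentation/process/5.Posting.rst. If you are only fixing part of the issue that caused the regression, you may use "Link:" tags instead. regzbot currently makes no distinction between the two.
Try to fix regressions quickly once the culprit has been identified; fixes for most regressions should be merged within two weeks, but some need to be resolved within two or three days.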

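All the details on Linux kernel regressions relevant for developers

The important basics in more detail

What to do when receiving regression reports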

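Ensure the Linux kernel's regression tracker and the other subscribers of the regressions mailing list (regressions@lists.linux.dev) become aware of any newly reported regression:

When you receive a report by mail that did not CC the list, immediately bring it into the loop by sending at least a brief "Reply-all" with the list CCed; if a later reply to a reply omits the list again, try to ensure it gets CCed once more.

If a report submitted in a bug tracker hits your inbox, forward or bounce it to the list. Consider checking the list archives beforehand, in case the reporter already forwarded the report as instructed by "Reporting issues".

When doing either, consider making the Linux kernel regression tracking bot "regzbot" immediately start tracking the issue: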

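For mailed reports, check if the reporter included a "regzbot command" like #regzbot introduced: 1f2e3d4c5b6a. If not, send a reply (with the regressions list in CC) containing a paragraph like the following:
#regzbot ^introduced: v5.13..v5.14-rc1
This tells regzbot the version range in which the issue started to happen; you can also specify a range using commit-ids, or state a single commit-id if the reporter bisected the culprit.

Note the caret (^) before "introduced": it tells regzbot to treat the parent mail (the one you reply to) as the initial report for the regression you want to see tracked. That is important, as regzbot will later look out for patches with "Closes:" tags pointing to the report in the archives on lore.kernel.org.

When forwarding a regression reported to a bug tracker, include a paragraph with these regzbot commands:
#regzbot introduced: 1f2e3d4c5b6a
#regzbot from: Some N. Ice Human <some.human@example.com>
#regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
Regzbot will then automatically associate patches with the report if they contain "Closes:" tags pointing to your mail or the mentioned ticket.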
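
The two cases above (replying to a mailed report, forwarding a bug-tracker report) can be sketched in Python. This is only an illustration of how the command paragraph is put together; the function names, the reply wording, and the example subject and message-id are assumptions made for this sketch, not part of regzbot or of any kernel tooling.

```python
from email.message import EmailMessage

REGRESSIONS_LIST = "regressions@lists.linux.dev"


def regzbot_paragraph(introduced, reply_to_report=False, reporter=None, monitor=None):
    """Build a paragraph of regzbot commands, one command per line.

    `introduced` is a version range, a commit-id range, or a single commit-id.
    With `reply_to_report=True` the caret form is used, which marks the parent
    mail as the initial report.
    """
    caret = "^" if reply_to_report else ""
    lines = [f"#regzbot {caret}introduced: {introduced}"]
    if reporter:
        lines.append(f"#regzbot from: {reporter}")
    if monitor:
        lines.append(f"#regzbot monitor: {monitor}")
    return "\n".join(lines)


def reply_with_list_in_cc(report_from, report_subject, report_message_id, introduced):
    """Hypothetical helper: a brief reply that CCs the regressions list."""
    msg = EmailMessage()
    msg["To"] = report_from
    msg["Cc"] = REGRESSIONS_LIST
    msg["Subject"] = f"Re: {report_subject}"
    msg["In-Reply-To"] = report_message_id
    body = "Thanks for the report. CCing the regressions list.\n\n"
    body += regzbot_paragraph(introduced, reply_to_report=True) + "\n"
    msg.set_content(body)
    return msg


# Replying to a mailed report that lacked a regzbot command (made-up values):
reply = reply_with_list_in_cc(
    "some.human@example.com",
    "Regression since v5.14-rc1",
    "<hypothetical-report-id@example.com>",
    "v5.13..v5.14-rc1",
)
print(reply)

# Paragraph to include when forwarding a report from a bug tracker:
print(regzbot_paragraph(
    "1f2e3d4c5b6a",
    reporter="Some N. Ice Human <some.human@example.com>",
    monitor="http://some.bugtracker.example.com/ticket?id=123456789",
))
```

Keeping each command on its own line mirrors the examples above; a real reply would normally also keep the original recipients, which this sketch leaves out for brevity.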

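What's important when fixing regressions

You don't need to do anything special when submitting fixes for regressions; just remember to do what "Submitting patches: the essential guide to getting your code into the kernel", Documentation/process/5.Posting.rst, and "Everything you ever wanted to know about Linux -stable releases" already explain in more detail:

Point to all places where the issue was reported using "Closes:" tags:
Closes: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
If you are only fixing part of the issue, you may use "Link:" instead, as described in the first document mentioned above. regzbot currently treats both tags as equivalent and considers the linked reports resolved.

Add a "Fixes:" tag to specify the commit causing the regression.

If the culprit was merged in an earlier development cycle, explicitly mark the fix for backporting using the Cc: stable@vger.kernel.org tag.

All this is expected from you and important when it comes to regressions, as these tags are of great value for everyone (you included) who might be looking into the issue weeks, months, or years later. These tags are also crucial for tools and scripts used by other kernel developers or Linux distributions; one of these tools is regzbot, which heavily relies on the "Closes:" tags to associate reports of regressions with the changes resolving them.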
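
As a rough illustration of why these tags matter to tooling, the following Python sketch first assembles the tag lines for a regression fix and then extracts the linked reports from a commit message again. It is not how regzbot is implemented; the function names, the culprit subject line, and the full commit-id are made up for the example. The "Fixes:" layout (abbreviated commit-id followed by the quoted subject) follows the convention described in the patch submission guide.

```python
import re

STABLE_LIST = "stable@vger.kernel.org"


def regression_fix_tags(report_urls, culprit_sha, culprit_subject, backport=False):
    """Return the tag lines for a regression fix, one tag per line."""
    lines = [f'Fixes: {culprit_sha[:12]} ("{culprit_subject}")']
    lines += [f"Closes: {url}" for url in report_urls]
    if backport:
        # Only needed if the culprit was merged in an earlier development cycle.
        lines.append(f"Cc: {STABLE_LIST}")
    return "\n".join(lines)


# A "Closes:" or "Link:" tag at the start of a line, followed by a URL.
_REPORT_TAG = re.compile(r"^(?:Closes|Link):[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def linked_reports(commit_message):
    """Return all URLs a commit message points to via Closes:/Link: tags."""
    return _REPORT_TAG.findall(commit_message)


tags = regression_fix_tags(
    [
        "https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/",
        "https://bugzilla.kernel.org/show_bug.cgi?id=1234567890",
    ],
    culprit_sha="1f2e3d4c5b6a7988",  # hypothetical commit-id
    culprit_subject="subsys: hypothetical culprit change",
    backport=True,
)
message = "subsys: fix hypothetical regression\n\nExplain the fix here.\n\n" + tags
print(message)
print(linked_reports(message))
```

Because both "Closes:" and "Link:" are matched, the sketch treats them the same way, just as the text above says regzbot currently does.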

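Expectations and best practices for fixing regressions

As a Linux kernel developer, you are expected to do your best to prevent situations where a regression caused by a recent change of yours leaves users only these options:

Run a kernel with a regression that impacts usage.

Switch to an older or newer kernel series.

Continue running an outdated and thus potentially insecure kernel for more than three weeks after the regression's culprit was identified. Ideally it should be less than two weeks. And it ought to be just a few days if the issue is severe or affects many users, either in general or in prevalent environments.

How to achieve that in practice depends on various factors. Use the following rules of thumb as a guide.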

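In general:

Prioritize work on regressions over all other Linux kernel work, unless the latter concerns a severe issue (e.g. an acute security vulnerability, data loss, bricked hardware, ...).

Expedite fixing regressions that recently made it into a proper mainline, stable, or longterm release (either directly or via backport).

Do not consider regressions from the current cycle as something that can wait until the end of the cycle, as the issue might discourage or prevent users and CI systems from testing mainline now or generally.

Work with the required care to avoid additional or bigger damage, even if resolving an issue then takes longer than outlined below.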

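On timing once the culprit of a regression is known:

Aim to mainline a fix within two or three days if the issue is severe or bothers many users, either in general or in prevalent conditions like a particular hardware environment, distribution, or stable/longterm series.

Aim to mainline a fix by the Sunday after the next if the culprit made it into a recent mainline, stable, or longterm release (either directly or via backport); if the culprit became known early in a week and is simple to resolve, try to mainline the fix within the same week.

For other regressions, aim to mainline fixes before the hindmost Sunday within the next three weeks. One or two Sundays later are acceptable if the regression is something people can easily live with for a while, like a mild performance regression.

Delaying the mainlining of regression fixes until the next merge window is strongly discouraged, except when the fix is extraordinarily risky or when the culprit was mainlined more than a year ago.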
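
The Sunday-based targets above are plain calendar arithmetic, which the following Python sketch works through. It rests on two interpretations that are assumptions of this example: "the Sunday after the next" is taken as the second Sunday strictly after the day the culprit became known, and "the hindmost Sunday within the next three weeks" as the last Sunday on or before that day plus 21 days. The function names are made up for illustration.

```python
from datetime import date, timedelta

SUNDAY = 6  # date.weekday() counts Monday as 0 and Sunday as 6


def next_sunday(day):
    """Return the first Sunday strictly after `day`."""
    days_ahead = (SUNDAY - day.weekday()) % 7
    return day + timedelta(days=days_ahead or 7)


def sunday_after_next(day):
    """Target if the culprit made it into a recent release (assumed reading)."""
    return next_sunday(day) + timedelta(days=7)


def last_sunday_within_three_weeks(day):
    """Target for other regressions: last Sunday on or before day + 21 days."""
    limit = day + timedelta(days=21)
    days_back = (limit.weekday() - SUNDAY) % 7
    return limit - timedelta(days=days_back)


culprit_known = date(2024, 1, 10)  # a Wednesday, chosen only as an example
print(sunday_after_next(culprit_known))               # 2024-01-21
print(last_sunday_within_three_weeks(culprit_known))  # 2024-01-28
```

The dates are targets for the fix reaching mainline, so the time needed for testing, review, and merging has to fit in before them.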

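On procedure:

Always consider reverting the culprit, as it is often the quickest and least dangerous way to fix a regression. Don't worry about mainlining a fixed variant later: that should be straightforward, as most of the code went through review once already.

Try to resolve any regressions introduced in mainline during the past twelve months before the current development cycle ends: Linus wants such regressions to be handled like those from the current cycle, unless fixing them bears unusual risks.

Consider CCing Linus on discussions or patch review if a regression seems tangly. Do the same in precarious or urgent cases, especially if the subsystem maintainer might be unavailable. Also CC the stable team when you know such a regression made it into a mainline, stable, or longterm release.

For urgent regressions, consider asking Linus to pick up the fix straight from the mailing list: he is totally fine with that for uncontroversial fixes. Ideally, though, such requests should happen in accordance with the subsystem maintainers or come directly from them.

If you are unsure whether a fix is worth the risk of applying it just days before a new mainline release, send Linus a mail with the usual lists and people in CC; in it, summarize the situation and ask him to consider picking up the fix straight from the list. He can then make the call himself and, when needed, even postpone the release. Such requests, again, should ideally happen in accordance with the subsystem maintainers or come directly from them.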

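Regarding stable and longterm kernels:

You are free to leave regressions to the stable team if they at no point occurred in mainline or were already fixed there.

If a regression made it into a proper mainline release during the past twelve months, make sure to tag the fix with "Cc: stable@vger.kernel.org", as a "Fixes:" tag alone does not guarantee a backport. Please add the same tag if you know the culprit was backported to stable or longterm kernels.

When receiving reports about regressions in recent stable or longterm kernel series, please evaluate at least briefly whether the issue might happen in current mainline as well, and if that seems likely, take hold of the report. If in doubt, ask the reporter to check mainline.

Whenever you want to swiftly resolve a regression that recently also made it into a proper mainline, stable, or longterm release, fix it quickly in mainline; when appropriate, involve Linus to fast-track the fix (see above). That is because the stable team normally neither reverts nor fixes changes that cause the same problems in mainline.

For urgent regression fixes, you might want to ensure prompt backporting by dropping the stable team a note once the fix was mainlined; this is especially advisable during merge windows and shortly thereafter, as the fix might otherwise land at the end of a huge patch queue.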

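On patch flow:

Developers, when trying to reach the time periods mentioned above, remember to account for the time it takes to get fixes tested, reviewed, and merged by Linus, ideally with them being in linux-next at least briefly. Hence, if a fix is urgent, make that obvious to ensure others handle it appropriately.

Reviewers, you are kindly asked to help developers reach the time periods mentioned above by reviewing regression fixes in a timely manner.

Subsystem maintainers, you are likewise encouraged to expedite the handling of regression fixes. Thus evaluate whether skipping linux-next is an option for the particular fix. Also consider sending git pull requests more often than usual when needed. And try to avoid holding onto regression fixes over weekends, especially when the fix is marked for backporting.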

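More aspects regarding regressions developers should be aware of

How to deal with changes where a risk of regression is known

Evaluate how big the risk of regressions is, for example by performing a code search in Linux distributions and Git forges. Also consider asking other developers or projects likely to be affected to evaluate or even test the proposed change; if problems surface, maybe a solution acceptable to all can be found.

If the risk of regressions in the end seems to be relatively small, go ahead with the change, but let all involved parties know about the risk. Hence, make sure your patch description makes this aspect obvious. Once the change is merged, tell the Linux kernel's regression tracker and the regressions mailing list about the risk, so everyone has the change on the radar in case reports trickle in. Depending on the risk, you might also want to ask the subsystem maintainer to mention the issue in the mainline pull request.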

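What else is there to know about regressions?

Check out "Reporting regressions"; it covers many other aspects you might want to be aware of:

the purpose of the "no regressions" rule

what issues actually qualify as regressions

who is in charge of finding the root cause of a regression

how to handle tricky situations, e.g. when a regression is caused by a security fix or when fixing a regression might cause another one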

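Whom to ask for advice when it comes to regressions

Send a mail to the regressions mailing list (regressions@lists.linux.dev) while CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the issue is better dealt with in private, feel free to omit the list.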

Quotes from Linus about regressions

Find below a few real-life examples of how Linus Torvalds expects regressions to be handled:

From 2017-10-26 (1/2)

If you break existing user space setups THAT IS A REGRESSION.

It's not ok to say "but we'll fix the user space setup".

Really. NOT OK.

[...]

The first rule is:

 - we don't cause regressions

and the corollary is that when regressions *do* occur, we admit to
them and fix them, instead of blaming user space.

The fact that you have apparently been denying the regression now for
three weeks means that I will revert, and I will stop pulling apparmor
requests until the people involved understand how kernel development
is done.

From 2017-10-26 (2/2)

People should basically always feel like they can update their kernel
and simply not have to worry about it.

I refuse to introduce "you can only update the kernel if you also
update that other program" kind of limitations. If the kernel used to
work for you, the rule is that it continues to work for you.

There have been exceptions, but they are few and far between, and they
generally have some major and fundamental reasons for having happened,
that were basically entirely unavoidable, and people _tried_hard_ to
avoid them. Maybe we can't practically support the hardware any more
after it is decades old and nobody uses it with modern kernels any
more. Maybe there's a serious security issue with how we did things,
and people actually depended on that fundamentally broken model. Maybe
there was some fundamental other breakage that just _had_ to have a
flag day for very core and fundamental reasons.

And notice that this is very much about *breaking* peoples environments.

Behavioral changes happen, and maybe we don't even support some
feature any more. There's a number of fields in /proc/<pid>/stat that
are printed out as zeroes, simply because they don't even *exist* in
the kernel any more, or because showing them was a mistake (typically
an information leak). But the numbers got replaced by zeroes, so that
the code that used to parse the fields still works. The user might not
see everything they used to see, and so behavior is clearly different,
but things still _work_, even if they might no longer show sensitive
(or no longer relevant) information.

But if something actually breaks, then the change must get fixed or
reverted. And it gets fixed in the *kernel*. Not by saying "well, fix
your user space then". It was a kernel change that exposed the
problem, it needs to be the kernel that corrects for it, because we
have a "upgrade in place" model. We don't have a "upgrade with new
user space".

And I seriously will refuse to take code from people who do not
understand and honor this very simple rule.

This rule is also not going to change.

And yes, I realize that the kernel is "special" in this respect. I'm
proud of it.

I have seen, and can point to, lots of projects that go "We need to
break that use case in order to make progress" or "you relied on
undocumented behavior, it sucks to be you" or "there's a better way to
do what you want to do, and you have to change to that new better
way", and I simply don't think that's acceptable outside of very early
alpha releases that have experimental users that know what they signed
up for. The kernel hasn't been in that situation for the last two
decades.

We do API breakage _inside_ the kernel all the time. We will fix
internal problems by saying "you now need to do XYZ", but then it's
about internal kernel API's, and the people who do that then also
obviously have to fix up all the in-kernel users of that API. Nobody
can say "I now broke the API you used, and now _you_ need to fix it
up". Whoever broke something gets to fix it too.

And we simply do not break user space.

From 2020-05-21

The rules about regressions have never been about any kind of
documented behavior, or where the code lives.

The rules about regressions are always about "breaks user workflow".

Users are literally the _only_ thing that matters.

No amount of "you shouldn't have used this" or "that behavior was
undefined, it's your own fault your app broke" or "that used to work
simply because of a kernel bug" is at all relevant.

Now, reality is never entirely black-and-white. So we've had things
like "serious security issue" etc that just forces us to make changes
that may break user space. But even then the rule is that we don't
really have other options that would allow things to continue.

And obviously, if users take years to even notice that something
broke, or if we have sane ways to work around the breakage that
doesn't make for too much trouble for users (ie "ok, there are a
handful of users, and they can use a kernel command line to work
around it" kind of things) we've also been a bit less strict.

But no, "that was documented to be broken" (whether it's because the
code was in staging or because the man-page said something else) is
irrelevant. If staging code is so useful that people end up using it,
that means that it's basically regular kernel code with a flag saying
"please clean this up".

The other side of the coin is that people who talk about "API
stability" are entirely wrong. API's don't matter either. You can make
any changes to an API you like - as long as nobody notices.

Again, the regression rule is not about documentation, not about
API's, and not about the phase of the moon.

It's entirely about "we caused problems for user space that used to work".

From 2017-11-05

And our regression rule has never been "behavior doesn't change".
That would mean that we could never make any changes at all.

For example, we do things like add new error handling etc all the
time, which we then sometimes even add tests for in our kselftest
directory.

So clearly behavior changes all the time and we don't consider that a
regression per se.

The rule for a regression for the kernel is that some real user
workflow breaks. Not some test. Not a "look, I used to be able to do
X, now I can't".

From 2018-08-03

YOU ARE MISSING THE #1 KERNEL RULE.

We do not regress, and we do not regress exactly because your are 100% wrong.

And the reason you state for your opinion is in fact exactly *WHY* you
are wrong.

Your "good reasons" are pure and utter garbage.

The whole point of "we do not regress" is so that people can upgrade
the kernel and never have to worry about it.

> Kernel had a bug which has been fixed

That is *ENTIRELY* immaterial.

Guys, whether something was buggy or not DOES NOT MATTER.

Why?

Bugs happen. That's a fact of life. Arguing that "we had to break
something because we were fixing a bug" is completely insane. We fix
tens of bugs every single day, thinking that "fixing a bug" means that
we can break something is simply NOT TRUE.

So bugs simply aren't even relevant to the discussion. They happen,
they get found, they get fixed, and it has nothing to do with "we
break users".

Because the only thing that matters IS THE USER.

How hard is that to understand?

Anybody who uses "but it was buggy" as an argument is entirely missing
the point. As far as the USER was concerned, it wasn't buggy - it
worked for him/her.

Maybe it worked *because* the user had taken the bug into account,
maybe it worked because the user didn't notice - again, it doesn't
matter. It worked for the user.

Breaking a user workflow for a "bug" is absolutely the WORST reason
for breakage you can imagine.

It's basically saying "I took something that worked, and I broke it,
but now it's better". Do you not see how f*cking insane that statement
is?

And without users, your program is not a program, it's a pointless
piece of code that you might as well throw away.

Seriously. This is *why* the #1 rule for kernel development is "we
don't break users". Because "I fixed a bug" is absolutely NOT AN
ARGUMENT if that bug fix broke a user setup. You actually introduced a
MUCH BIGGER bug by "fixing" something that the user clearly didn't
even care about.

And dammit, we upgrade the kernel ALL THE TIME without upgrading any
other programs at all. It is absolutely required, because flag-days
and dependencies are horribly bad.

And it is also required simply because I as a kernel developer do not
upgrade random other tools that I don't even care about as I develop
the kernel, and I want any of my users to feel safe doing the same
time.

So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel
without upgrading some other random binary, then we have a problem.

From 2021-06-05

THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.

Honestly, security people need to understand that "not working" is not
a success case of security. It's a failure case.

Yes, "not working" may be secure. But security in that case is *pointless*.

From 2011-05-06 (1/3)

Binary compatibility is more important.

And if binaries don't use the interface to parse the format (or just
parse it wrongly - see the fairly recent example of adding uuid's to
/proc/self/mountinfo), then it's a regression.

And regressions get reverted, unless there are security issues or
similar that makes us go "Oh Gods, we really have to break things".

I don't understand why this simple logic is so hard for some kernel
developers to understand. Reality matters. Your personal wishes matter
NOT AT ALL.

If you made an interface that can be used without parsing the
interface description, then we're stuck with the interface. Theory
simply doesn't matter.

You could help fix the tools, and try to avoid the compatibility
issues that way. There aren't that many of them.

From 2011-05-06 (2/3)

it's clearly NOT an internal tracepoint. By definition. It's being
used by powertop.

From 2011-05-06 (3/3)

We have programs that use that ABI and thus it's a regression if they break.

From 2012-07-06

> Now this got me wondering if Debian _unstable_ actually qualifies as a
> standard distro userspace.

Oh, if the kernel breaks some standard user space, that counts. Tons
of people run Debian unstable

From 2019-09-15

One _particularly_ last-minute revert is the top-most commit (ignoring
the version change itself) done just before the release, and while
it's very annoying, it's perhaps also instructive.

What's instructive about it is that I reverted a commit that wasn't
actually buggy. In fact, it was doing exactly what it set out to do,
and did it very well. In fact it did it _so_ well that the much
improved IO patterns it caused then ended up revealing a user-visible
regression due to a real bug in a completely unrelated area.

The actual details of that regression are not the reason I point that
revert out as instructive, though. It's more that it's an instructive
example of what counts as a regression, and what the whole "no
regressions" kernel rule means. The reverted commit didn't change any
API's, and it didn't introduce any new bugs. But it ended up exposing
another problem, and as such caused a kernel upgrade to fail for a
user. So it got reverted.

The point here being that we revert based on user-reported _behavior_,
not based on some "it changes the ABI" or "it caused a bug" concept.
The problem was really pre-existing, and it just didn't happen to
trigger before. The better IO patterns introduced by the change just
happened to expose an old bug, and people had grown to depend on the
previously benign behavior of that old issue.

And never fear, we'll re-introduce the fix that improved on the IO
patterns once we've decided just how to handle the fact that we had a
bad interaction with an interface that people had then just happened
to rely on incidental behavior for before. It's just that we'll have
to hash through how to do that (there are no less than three different
patches by three different developers being discussed, and there might
be more coming...). In the meantime, I reverted the thing that exposed
the problem to users for this release, even if I hope it will be
re-introduced (perhaps even backported as a stable patch) once we have
consensus about the issue it exposed.

Take-away from the whole thing: it's not about whether you change the
kernel-userspace ABI, or fix a bug, or about whether the old code
"should never have worked in the first place". It's about whether
something breaks existing users' workflow.

Anyway, that was my little aside on the whole regression thing.  Since
it's that "first rule of kernel programming", I felt it is perhaps
worth just bringing it up every once in a while
