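Handling regressions
We don't cause regressions. This document describes what this "first rule of Linux kernel development" means in practice for developers. It complements "Reporting regressions", which covers the topic from a user's point of view; if you have never read that text, at least skim through it before continuing here.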
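The important bits (aka "TL;DR")
Ensure subscribers of the regressions mailing list (regressions@lists.linux.dev) quickly become aware of any new regression report:
  When receiving a mailed report that did not CC the list, bring it into the loop by immediately sending at least a brief "Reply-all" with the list CCed.
  Forward or bounce any reports submitted in bug trackers to the list.
Make the Linux kernel regression tracking bot "regzbot" track the issue (this is optional, but recommended):
  For mailed reports, check if the reporter included a line like this:
      #regzbot introduced: v5.13..v5.14-rc1
  If not, send a reply (with the regressions list in CC) containing a paragraph like the following, which tells regzbot when the issue started to happen:
      #regzbot ^introduced: 1f2e3d4c5b6a
  When forwarding reports from a bug tracker to the regressions list (see above), include a paragraph like the following:
      #regzbot introduced: v5.13..v5.14-rc1
      #regzbot from: Some N. Ice Human <some.human@example.com>
      #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
When submitting fixes for regressions, add "Closes:" tags to the patch description pointing to all places where the issue was reported, as mandated by "Submitting patches: the essential guide to getting your code into the kernel" and Documentation/process/5.Posting.rst. If you are only fixing part of the issue that caused the regression, you may use "Link:" tags instead. regzbot currently makes no distinction between the two.
Fix regressions promptly once the culprit has been identified; fixes for most regressions should be merged within two weeks, but some need to be resolved within two or three days.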
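All the details on Linux kernel regressions relevant for developers
The important basics in more detail
What to do when receiving regression reports
Ensure the Linux kernel's regression tracker and the other subscribers of the regressions mailing list (regressions@lists.linux.dev) become aware of any newly reported regression:
  When you receive a mailed report that did not CC the list, bring it into the loop by immediately sending at least a brief "Reply-all" with the list CCed; if a later reply in the thread drops the list again, try to make sure it gets CCed once more.
  If a report submitted in a bug tracker reaches your inbox, forward or bounce it to the list. Consider checking the list archives beforehand in case the reporter already forwarded the report as instructed by "Reporting issues".
When doing either, consider making the Linux kernel regression tracking bot "regzbot" start tracking the issue right away:
  For mailed reports, check if the reporter included a "regzbot command" like this:
      #regzbot introduced: 1f2e3d4c5b6a
  If not, send a reply (with the regressions list in CC) containing a paragraph like the following:
      #regzbot ^introduced: v5.13..v5.14-rc1
  This tells regzbot the version range in which the issue started to happen; you can also specify the range using commit-ids, or state a single commit-id if the reporter bisected the culprit.
  Note the caret (^) before "introduced": it tells regzbot to treat the parent mail (the one you reply to) as the initial report for the regression you want to see tracked. That matters because regzbot will later look for patches with "Closes:" tags pointing to the report in the archives on lore.kernel.org.
  When forwarding a regression reported to a bug tracker, include a paragraph with these regzbot commands:
      #regzbot introduced: 1f2e3d4c5b6a
      #regzbot from: Some N. Ice Human <some.human@example.com>
      #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
  Regzbot will then automatically associate patches with the report if they contain "Closes:" tags pointing to your mail or the mentioned ticket.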
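The paragraph used when forwarding a bug tracker report can be put together mechanically. The following Python sketch is only an illustration: the function name `regzbot_forward_paragraph` and its parameters are hypothetical and not part of regzbot. It formats the three commands shown above and surrounds them with blank lines so that they end up in a paragraph of their own.

```python
def regzbot_forward_paragraph(introduced: str, reporter: str, ticket_url: str) -> str:
    """Return a stand-alone paragraph of regzbot commands for a forwarded report.

    Hypothetical helper for illustration; regzbot itself only reads the
    resulting '#regzbot ...' lines from the mail body.
    """
    commands = [
        f"#regzbot introduced: {introduced}",
        f"#regzbot from: {reporter}",
        f"#regzbot monitor: {ticket_url}",
    ]
    # A leading and trailing newline keep the commands separated from the
    # surrounding mail text once the paragraph is pasted between other lines.
    return "\n" + "\n".join(commands) + "\n"


if __name__ == "__main__":
    print(
        regzbot_forward_paragraph(
            "1f2e3d4c5b6a",
            "Some N. Ice Human <some.human@example.com>",
            "http://some.bugtracker.example.com/ticket?id=123456789",
        )
    )
```

The values passed in the example are the placeholder ones from the text above; a real mail would use the actual commit or version range, reporter, and ticket URL.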
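What's important when fixing regressions
You don't need to do anything special when submitting fixes for regressions; just remember to do what "Submitting patches: the essential guide to getting your code into the kernel", Documentation/process/5.Posting.rst, and "Everything you ever wanted to know about Linux -stable releases" already explain in more detail:
  Point to all places where the issue was reported using "Closes:" tags:
      Closes: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
  If you are only fixing part of the issue, you may use "Link:" instead, as described in the first document mentioned above. regzbot currently treats both as equivalent and considers the linked reports resolved.
  Add a "Fixes:" tag to specify the commit causing the regression.
  If the culprit was merged in an earlier development cycle, explicitly mark the fix for backporting using this tag:
      Cc: stable@vger.kernel.org
All this is expected from you and important when it comes to regressions, as these tags are of great value for everyone (you included) who might look into the issue weeks, months, or years later. These tags are also crucial for tools and scripts used by other kernel developers or Linux distributions; one of these tools is regzbot, which relies heavily on the "Closes:" tags to associate regression reports with the changes resolving them.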
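As a rough self-check before sending a fix, the tags listed above can be looked for in the patch description. The following Python sketch is an assumption-laden illustration, not an existing kernel tool: the function name `missing_regression_tags` is hypothetical, and the patterns only test that a tag line is present, not that its target is valid.

```python
import re


def missing_regression_tags(message: str, needs_backport: bool) -> list[str]:
    """Return the tags from the list above that the patch description lacks.

    Hypothetical helper: it only checks for the presence of tag lines.
    """
    missing = []
    # "Link:" is accepted as well, since it may stand in for "Closes:"
    # when only part of the issue is fixed.
    if not re.search(r"^(Closes|Link): \S+", message, re.MULTILINE):
        missing.append("Closes:")
    if not re.search(r"^Fixes: \S+", message, re.MULTILINE):
        missing.append("Fixes:")
    # Only required when the culprit was merged in an earlier cycle.
    if needs_backport and not re.search(
        r"^Cc: <?stable@vger\.kernel\.org>?", message, re.MULTILINE
    ):
        missing.append("Cc: stable@vger.kernel.org")
    return missing


if __name__ == "__main__":
    description = (
        "foo: fix regression in bar\n\n"
        "Closes: https://lore.kernel.org/r/30th.anniversary.repost@klaava.Helsinki.FI/\n"
        "Fixes: 1f2e3d4c5b6a\n"
    )
    print(missing_regression_tags(description, needs_backport=True))
```

Whether a backport tag is needed remains a judgment call for the developer; the `needs_backport` flag merely passes that decision in.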
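Expectations and best practices for fixing regressions
As a Linux kernel developer, you are expected to do your best to prevent situations where a regression caused by a recent change of yours leaves users only these options:
  Run a kernel with a regression that impacts usage.
  Switch to an older or newer kernel series.
  Continue running an outdated and thus potentially insecure kernel for more than three weeks after the regression's culprit was identified. Ideally it should be less than two weeks. And it ought to be just a few days if the issue is severe or affects many users, either in general or in prevalent environments.
How to achieve that in practice depends on various factors. Use the following rules of thumb as a guide.
In general:
  Prioritize work on regressions over all other Linux kernel work, unless the latter concerns a severe issue (e.g. a serious security vulnerability, data loss, hardware damage, ...).
  Expedite fixing regressions that recently made it into a proper mainline, stable, or longterm release (either directly or via backport).
  Do not regard regressions from the current cycle as something that can wait until the end of the cycle, as the issue might discourage or prevent users and CI systems from testing mainline now or in general.
  Work with the required care to avoid additional or bigger damage, even if resolving an issue then takes longer than outlined below.
On timing once the culprit of a regression is known:
  Aim to mainline a fix within two or three days if the issue is severe or bothers many users, either in general or in prevalent conditions like a particular hardware environment, distribution, or stable/longterm series.
  Aim to mainline a fix by the Sunday after the next if the culprit made it into a recent mainline, stable, or longterm release (either directly or via backport); if the culprit became known early in a week and is simple to resolve, try to mainline the fix within the same week.
  For other regressions, aim to mainline fixes before the hindmost Sunday within the next three weeks. One or two Sundays later are acceptable if the regression is something people can easily live with for a while, like a mild performance regression.
  Delaying the mainlining of regression fixes until the next merge window is strongly discouraged, except when the fix is extraordinarily risky or when the culprit was mainlined more than a year ago.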
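The timing rules of thumb above boil down to simple date arithmetic. The following Python sketch shows one possible reading of them and is not an official calculator; the function names are hypothetical. It assumes that "two or three days" means a three-day upper bound, that "the Sunday after the next" is the second Sunday strictly after the day the culprit was identified, and that "the hindmost Sunday within the next three weeks" is the last Sunday on or before day 21.

```python
from datetime import date, timedelta


def first_sunday_after(day: date) -> date:
    """First Sunday strictly after `day` (date.weekday(): Monday=0 ... Sunday=6)."""
    days_ahead = (6 - day.weekday()) % 7 or 7
    return day + timedelta(days=days_ahead)


def last_sunday_on_or_before(day: date) -> date:
    """Latest Sunday that is not later than `day`."""
    return day - timedelta(days=(day.weekday() + 1) % 7)


def mainline_target(identified: date, severe: bool, in_recent_release: bool) -> date:
    """Target date for mainlining a fix, under the assumptions stated above."""
    if severe:
        # Severe or widespread issue: within two or three days.
        return identified + timedelta(days=3)
    if in_recent_release:
        # Culprit already in a recent release: the Sunday after the next.
        return first_sunday_after(identified) + timedelta(days=7)
    # Everything else: hindmost Sunday within the next three weeks.
    return last_sunday_on_or_before(identified + timedelta(days=21))


if __name__ == "__main__":
    identified = date(2024, 1, 10)  # a Wednesday
    print(mainline_target(identified, severe=True, in_recent_release=False))
    print(mainline_target(identified, severe=False, in_recent_release=True))
    print(mainline_target(identified, severe=False, in_recent_release=False))
```

For a culprit identified on Wednesday 2024-01-10, this yields 2024-01-13, 2024-01-21, and 2024-01-28 for the three cases. The dates are targets for planning only; the text above still asks for the care needed to avoid bigger damage, even if that takes longer.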
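On procedure:
  Always consider reverting the culprit, as it is often the quickest and least dangerous way to fix a regression. Don't worry about mainlining a fixed variant later: that should be straightforward, as most of the code went through review once already.
  Try to resolve any regressions introduced in mainline during the past twelve months before the current development cycle ends: Linus wants such regressions to be handled like those from the current cycle, unless fixing bears unusual risks.
  Consider CCing Linus on discussions or patch reviews if a regression seems tangly. Do the same in urgent or precarious cases, especially if the subsystem maintainer might be unavailable. Also CC the stable team when you know such a regression made it into a mainline, stable, or longterm release.
  For urgent regressions, consider asking Linus to pick up the fix straight from the mailing list: he is totally fine with that for uncontroversial fixes. Ideally, though, such requests should be made in agreement with the subsystem maintainers or come directly from them.
  If you are unsure whether a fix is worth the risk of applying it just days before a new mainline release, send Linus a mail with the usual lists and people in CC; in it, summarize the situation and ask him to consider picking up the fix straight from the list. He can then make the call himself and, when needed, even postpone the release. Such requests should again ideally be made in agreement with the subsystem maintainers or come directly from them.
Regarding stable and longterm kernels:
  You are free to leave regressions to the stable team if they never occurred in mainline or were already fixed there.
  If a regression made it into a proper mainline release during the past twelve months, make sure to tag the fix with "Cc: stable@vger.kernel.org", as a "Fixes:" tag alone does not guarantee a backport. Add the same tag if you know the culprit was backported to stable or longterm kernels.
  When receiving reports about regressions in recent stable or longterm kernel series, evaluate at least briefly whether the issue might happen in current mainline as well; if that seems likely, take charge of the report. If in doubt, ask the reporter to check mainline.
  Whenever you want to swiftly resolve a regression that recently also made it into a proper mainline, stable, or longterm release, fix it quickly in mainline; when appropriate, involve Linus to fast-track the fix (see above). That's because the stable team normally neither reverts nor fixes any changes that cause the same problems in mainline.
  For urgent regression fixes, you might want to ensure prompt backporting by dropping the stable team a note once the fix was mainlined; this is especially advisable during merge windows and shortly thereafter, as the fix might otherwise land at the end of a huge patch queue.
On patch flow:
  Developers, when trying to meet the time periods mentioned above, remember to account for the time it takes to get fixes tested, reviewed, and merged by Linus, ideally with them spending at least a short time in linux-next. Hence, if a fix is urgent, make that obvious so that others handle it appropriately.
  Reviewers, you are kindly asked to help developers meet the time periods mentioned above by reviewing regression fixes in a timely manner.
  Subsystem maintainers, you are likewise encouraged to expedite the handling of regression fixes. Thus evaluate whether skipping linux-next is an option for the particular fix. Also consider sending git pull requests more often than usual when needed. And try to avoid holding onto regression fixes over weekends, especially when the fix is marked for backporting.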
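More aspects regarding regressions developers should be aware of
How to deal with changes where a risk of regression is known
Evaluate how big the risk of regressions is, for example by performing a code search in Linux distributions and Git repositories. Also consider asking other developers or projects likely to be affected to evaluate or even test the proposed change; if problems surface, maybe a solution acceptable to everyone can be found.
If the risk of regressions in the end seems relatively small, go ahead with the change, but let all involved parties know about the risk. Hence, make sure your patch description makes this aspect obvious. Once the change is merged, tell the Linux kernel's regression tracker and the regressions mailing list about the risk, so everyone has the change on the radar in case reports trickle in. Depending on the risk, you might also want to ask the subsystem maintainer to mention the issue in their mainline pull request.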
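What else is there to know about regressions?
Check out "Reporting regressions"; it covers many other aspects you might want to be aware of:
  the purpose of the "no regressions" rule
  what issues actually qualify as a regression
  who is in charge of finding the root cause of a regression
  how to handle tricky situations, e.g. when a regression is caused by a security fix or when fixing a regression might cause another one
Whom to ask for advice when it comes to regressions?
Send a mail to the regressions mailing list (regressions@lists.linux.dev) while CCing the Linux kernel's regression tracker (regressions@leemhuis.info); if the issue is better dealt with in private, feel free to omit the list.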
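More about regression tracking and regzbot
Why does the Linux kernel have a regression tracker, and why is regzbot used?
Rules like "no regressions" need someone to ensure they are followed, otherwise they get broken either accidentally or on purpose. History has shown this to be true for the Linux kernel as well. That's why Thorsten Leemhuis volunteered to keep an eye on things as the Linux kernel's regression tracker, occasionally helped by other people. None of them are paid to do this, which is why regression tracking is done on a best-effort basis.
Earlier attempts to track regressions manually have shown it to be exhausting and frustrating work, which is why they were abandoned after a while. To prevent this from happening again, Thorsten developed regzbot to facilitate the work, with the long-term goal of automating regression tracking as much as possible for everyone involved.
How does regression tracking work with regzbot?
The bot watches for replies to reports of tracked regressions. Additionally, it looks out for posted or committed patches that reference such reports with "Closes:" tags; replies to such patch postings are tracked as well. Combined, this data provides good insight into the current state of the fixing process.
Regzbot tries to do its job with as little overhead as possible for both reporters and developers. In fact, only reporters are burdened with an extra duty: they need to tell regzbot about the regression report using the #regzbot introduced command outlined above; if they don't, someone else can take care of that using #regzbot ^introduced.
For developers there is normally no extra work involved; they just need to do something that was expected long before regzbot came to light: add links to the patch description pointing to all reports about the issue fixed.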
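Do I have to use regzbot?
It's in the interest of everyone if you do, as kernel maintainers like Linus Torvalds partly rely on regzbot's tracking in their work, for example when deciding whether to release a new version or extend the development phase. For this they need to be aware of all unfixed regressions; to that end, Linus looks into the weekly reports sent by regzbot.
Do I have to tell regzbot about every regression I stumble upon?
Ideally yes: we are all human and easily forget problems when something more important unexpectedly comes up, for example a bigger problem in the Linux kernel or something in real life that keeps us away from keyboards for a while. Hence, it's best to tell regzbot about every regression, except when you immediately write a fix and commit it to a tree regularly merged to the affected kernel series.
How to see which regressions regzbot tracks currently?
Check regzbot's web interface for the latest info; alternatively, search for the latest regression report, which regzbot normally sends out once a week on Sunday evening (UTC), usually a few hours before Linus publishes a new (pre-)release.
What places is regzbot monitoring?
Regzbot watches the most important Linux mailing lists as well as the git repositories of linux-next, mainline, and stable/longterm.
What kind of issues are supposed to be tracked by regzbot?
The bot is meant to track regressions, hence please don't involve regzbot for regular issues. But it's okay for the Linux kernel's regression tracker if you use regzbot to track severe issues, like reports about hangs, corrupted data, or internal errors (Panic, Oops, BUG(), warning, ...).
Can I add regressions found by CI systems to regzbot's tracking?
Feel free to do so if the particular regression is likely to affect practical use cases and thus might be noticed by users; accordingly, please don't involve regzbot for theoretical regressions unlikely to show up in real-world usage.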
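How to interact with regzbot?
By using a "regzbot command" in a direct or indirect reply to the mail with the regression report. These commands need to be in their own paragraph (i.e. they need to be separated from the rest of the mail by blank lines).
One such command is #regzbot introduced: <version or commit>, which makes regzbot treat your mail as a regression report added to the tracking, as described above; #regzbot ^introduced: <version or commit> is a similar command, which makes regzbot treat the parent mail as the report for a regression it starts to track.
Once one of those two commands has been used, other regzbot commands can be used in direct or indirect replies to the report. You can write them below one of the introduced commands, in replies to the mail that used one of them, or in mails that are themselves replies to that mail:
  Set or update the title:
      #regzbot title: foo
  Monitor a discussion or bugzilla.kernel.org ticket where additional aspects of the issue or a fix are discussed, for example the posting of a patch fixing the regression:
      #regzbot monitor: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
  Monitoring only works for lore.kernel.org and bugzilla.kernel.org; regzbot will consider all messages in that thread or ticket as related to the fixing process.
  Point to a place with further details of interest, like a mailing list post or a ticket in a bug tracker, that is slightly related but about a different topic:
      #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
  Mark a regression as fixed by a commit that is heading upstream or already landed:
      #regzbot fix: 1f2e3d4c5d
  Mark a regression as a duplicate of another one already tracked by regzbot:
      #regzbot dup-of: https://lore.kernel.org/all/30th.anniversary.repost@klaava.Helsinki.FI/
  Mark a regression as invalid:
      #regzbot invalid: wasn't a regression, problem has always existed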
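The "own paragraph" requirement described above can be illustrated with a small parser. This Python sketch is not regzbot's actual implementation, and the real bot may well be more lenient or stricter; the name `extract_regzbot_commands` and the regular expression are assumptions made for illustration. It splits a mail body on blank lines and only accepts paragraphs that consist entirely of `#regzbot` lines.

```python
import re

# Assumed command shape: "#regzbot <name>: <argument>", with an optional
# caret in front of the name (as in "^introduced").
COMMAND = re.compile(r"^#regzbot\s+(\^?[\w-]+):\s*(.*)$")


def extract_regzbot_commands(body: str) -> list[tuple[str, str]]:
    """Return (command, argument) pairs found in stand-alone paragraphs."""
    commands = []
    # Paragraphs are separated by one or more blank (or whitespace-only) lines.
    for paragraph in re.split(r"\n\s*\n", body.strip()):
        lines = [line.strip() for line in paragraph.splitlines()]
        matches = [COMMAND.match(line) for line in lines]
        # Skip paragraphs that mix commands with ordinary text.
        if lines and all(matches):
            commands.extend((m.group(1), m.group(2)) for m in matches)
    return commands


if __name__ == "__main__":
    mail = (
        "Thanks for the report, adding it to the tracking.\n"
        "\n"
        "#regzbot ^introduced: v5.13..v5.14-rc1\n"
        "#regzbot title: foo\n"
        "\n"
        "Ciao\n"
    )
    for name, argument in extract_regzbot_commands(mail):
        print(name, "->", argument)
```

A command written directly below ordinary text, without a separating blank line, is ignored by this sketch, which mirrors the rule stated above.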
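Is there more to tell about regzbot and its commands?
More detailed and up-to-date information about the Linux kernel's regression tracking bot can be found on its project page, which among other things contains a getting started guide and reference documentation; both cover more details than the section above.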
Quotes from Linus about regressions
Find below a few real-life examples of how Linus Torvalds expects regressions to be handled:
If you break existing user space setups THAT IS A REGRESSION. It's not ok to say "but we'll fix the user space setup". Really. NOT OK. [...] The first rule is: - we don't cause regressions and the corollary is that when regressions *do* occur, we admit to them and fix them, instead of blaming user space. The fact that you have apparently been denying the regression now for three weeks means that I will revert, and I will stop pulling apparmor requests until the people involved understand how kernel development is done.

People should basically always feel like they can update their kernel and simply not have to worry about it. I refuse to introduce "you can only update the kernel if you also update that other program" kind of limitations. If the kernel used to work for you, the rule is that it continues to work for you.

There have been exceptions, but they are few and far between, and they generally have some major and fundamental reasons for having happened, that were basically entirely unavoidable, and people _tried_hard_ to avoid them. Maybe we can't practically support the hardware any more after it is decades old and nobody uses it with modern kernels any more. Maybe there's a serious security issue with how we did things, and people actually depended on that fundamentally broken model. Maybe there was some fundamental other breakage that just _had_ to have a flag day for very core and fundamental reasons.

And notice that this is very much about *breaking* peoples environments. Behavioral changes happen, and maybe we don't even support some feature any more. There's a number of fields in /proc/<pid>/stat that are printed out as zeroes, simply because they don't even *exist* in the kernel any more, or because showing them was a mistake (typically an information leak). But the numbers got replaced by zeroes, so that the code that used to parse the fields still works. The user might not see everything they used to see, and so behavior is clearly different, but things still _work_, even if they might no longer show sensitive (or no longer relevant) information.

But if something actually breaks, then the change must get fixed or reverted. And it gets fixed in the *kernel*. Not by saying "well, fix your user space then". It was a kernel change that exposed the problem, it needs to be the kernel that corrects for it, because we have a "upgrade in place" model. We don't have a "upgrade with new user space". And I seriously will refuse to take code from people who do not understand and honor this very simple rule. This rule is also not going to change.

And yes, I realize that the kernel is "special" in this respect. I'm proud of it. I have seen, and can point to, lots of projects that go "We need to break that use case in order to make progress" or "you relied on undocumented behavior, it sucks to be you" or "there's a better way to do what you want to do, and you have to change to that new better way", and I simply don't think that's acceptable outside of very early alpha releases that have experimental users that know what they signed up for. The kernel hasn't been in that situation for the last two decades.

We do API breakage _inside_ the kernel all the time. We will fix internal problems by saying "you now need to do XYZ", but then it's about internal kernel API's, and the people who do that then also obviously have to fix up all the in-kernel users of that API. Nobody can say "I now broke the API you used, and now _you_ need to fix it up". Whoever broke something gets to fix it too. And we simply do not break user space.

From 2020-05-21:

The rules about regressions have never been about any kind of documented behavior, or where the code lives. The rules about regressions are always about "breaks user workflow". Users are literally the _only_ thing that matters. No amount of "you shouldn't have used this" or "that behavior was undefined, it's your own fault your app broke" or "that used to work simply because of a kernel bug" is at all relevant.

Now, reality is never entirely black-and-white. So we've had things like "serious security issue" etc that just forces us to make changes that may break user space. But even then the rule is that we don't really have other options that would allow things to continue. And obviously, if users take years to even notice that something broke, or if we have sane ways to work around the breakage that doesn't make for too much trouble for users (ie "ok, there are a handful of users, and they can use a kernel command line to work around it" kind of things) we've also been a bit less strict.

But no, "that was documented to be broken" (whether it's because the code was in staging or because the man-page said something else) is irrelevant. If staging code is so useful that people end up using it, that means that it's basically regular kernel code with a flag saying "please clean this up".

The other side of the coin is that people who talk about "API stability" are entirely wrong. API's don't matter either. You can make any changes to an API you like - as long as nobody notices. Again, the regression rule is not about documentation, not about API's, and not about the phase of the moon. It's entirely about "we caused problems for user space that used to work".

From 2017-11-05:

And our regression rule has never been "behavior doesn't change". That would mean that we could never make any changes at all. For example, we do things like add new error handling etc all the time, which we then sometimes even add tests for in our kselftest directory. So clearly behavior changes all the time and we don't consider that a regression per se. The rule for a regression for the kernel is that some real user workflow breaks. Not some test. Not a "look, I used to be able to do X, now I can't".

From 2018-08-03:

YOU ARE MISSING THE #1 KERNEL RULE. We do not regress, and we do not regress exactly because your are 100% wrong. And the reason you state for your opinion is in fact exactly *WHY* you are wrong. Your "good reasons" are pure and utter garbage. The whole point of "we do not regress" is so that people can upgrade the kernel and never have to worry about it.

> Kernel had a bug which has been fixed

That is *ENTIRELY* immaterial. Guys, whether something was buggy or not DOES NOT MATTER. Why? Bugs happen. That's a fact of life. Arguing that "we had to break something because we were fixing a bug" is completely insane. We fix tens of bugs every single day, thinking that "fixing a bug" means that we can break something is simply NOT TRUE. So bugs simply aren't even relevant to the discussion. They happen, they get found, they get fixed, and it has nothing to do with "we break users".

Because the only thing that matters IS THE USER. How hard is that to understand? Anybody who uses "but it was buggy" as an argument is entirely missing the point. As far as the USER was concerned, it wasn't buggy - it worked for him/her. Maybe it worked *because* the user had taken the bug into account, maybe it worked because the user didn't notice - again, it doesn't matter. It worked for the user.

Breaking a user workflow for a "bug" is absolutely the WORST reason for breakage you can imagine. It's basically saying "I took something that worked, and I broke it, but now it's better". Do you not see how f*cking insane that statement is? And without users, your program is not a program, it's a pointless piece of code that you might as well throw away.

Seriously. This is *why* the #1 rule for kernel development is "we don't break users". Because "I fixed a bug" is absolutely NOT AN ARGUMENT if that bug fix broke a user setup. You actually introduced a MUCH BIGGER bug by "fixing" something that the user clearly didn't even care about.

And dammit, we upgrade the kernel ALL THE TIME without upgrading any other programs at all. It is absolutely required, because flag-days and dependencies are horribly bad. And it is also required simply because I as a kernel developer do not upgrade random other tools that I don't even care about as I develop the kernel, and I want any of my users to feel safe doing the same time. So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel without upgrading some other random binary, then we have a problem.

From 2021-06-05:

THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS. Honestly, security people need to understand that "not working" is not a success case of security. It's a failure case. Yes, "not working" may be secure. But security in that case is *pointless*.

Binary compatibility is more important. And if binaries don't use the interface to parse the format (or just parse it wrongly - see the fairly recent example of adding uuid's to /proc/self/mountinfo), then it's a regression. And regressions get reverted, unless there are security issues or similar that makes us go "Oh Gods, we really have to break things". I don't understand why this simple logic is so hard for some kernel developers to understand. Reality matters. Your personal wishes matter NOT AT ALL. If you made an interface that can be used without parsing the interface description, then we're stuck with the interface. Theory simply doesn't matter. You could help fix the tools, and try to avoid the compatibility issues that way. There aren't that many of them.

it's clearly NOT an internal tracepoint. By definition. It's being used by powertop.

We have programs that use that ABI and thus it's a regression if they break.

From 2012-07-06:

> Now this got me wondering if Debian _unstable_ actually qualifies as a
> standard distro userspace.

Oh, if the kernel breaks some standard user space, that counts. Tons of people run Debian unstable

From 2019-09-15:

One _particularly_ last-minute revert is the top-most commit (ignoring the version change itself) done just before the release, and while it's very annoying, it's perhaps also instructive. What's instructive about it is that I reverted a commit that wasn't actually buggy. In fact, it was doing exactly what it set out to do, and did it very well. In fact it did it _so_ well that the much improved IO patterns it caused then ended up revealing a user-visible regression due to a real bug in a completely unrelated area.

The actual details of that regression are not the reason I point that revert out as instructive, though. It's more that it's an instructive example of what counts as a regression, and what the whole "no regressions" kernel rule means. The reverted commit didn't change any API's, and it didn't introduce any new bugs. But it ended up exposing another problem, and as such caused a kernel upgrade to fail for a user. So it got reverted.

The point here being that we revert based on user-reported _behavior_, not based on some "it changes the ABI" or "it caused a bug" concept. The problem was really pre-existing, and it just didn't happen to trigger before. The better IO patterns introduced by the change just happened to expose an old bug, and people had grown to depend on the previously benign behavior of that old issue.

And never fear, we'll re-introduce the fix that improved on the IO patterns once we've decided just how to handle the fact that we had a bad interaction with an interface that people had then just happened to rely on incidental behavior for before. It's just that we'll have to hash through how to do that (there are no less than three different patches by three different developers being discussed, and there might be more coming...). In the meantime, I reverted the thing that exposed the problem to users for this release, even if I hope it will be re-introduced (perhaps even backported as a stable patch) once we have consensus about the issue it exposed.

Take-away from the whole thing: it's not about whether you change the kernel-userspace ABI, or fix a bug, or about whether the old code "should never have worked in the first place". It's about whether something breaks existing users' workflow. Anyway, that was my little aside on the whole regression thing. Since it's that "first rule of kernel programming", I felt it is perhaps worth just bringing it up every once in a while