Dealing with a problem in the Linux kernel

When we received reports of users with unresponsive systems, digging into the issue presented a few challenges. We could not reproduce the issue ourselves at first, while the users who encountered it still had the problem even after a full reflash of their system.

This meant that we depended on users providing us with information from their system to perform a meaningful diagnosis. At first glance, it wasn't at all clear what introduced the problem, or which part was breaking first. Fortunately, the boot logs are thorough in reporting on the initialization procedure.

Diagnosing the issue

After digging through some of their logs we found a common denominator:

phy phy-ff800000.phy.6: phy poweron failed --> -110
dwc3 fe900000.usb: error -ETIMEDOUT: failed to initialize core
dwc3: probe of fe900000.usb failed with error -110
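As a rough illustration, the signature in these lines can be searched for automatically when going through a user's boot log. The sketch below is a minimal example: the function name and the regular expressions are our own illustrative choices, and generalising the device addresses is an assumption, since other boards may report different ones.

```python
import re

# Patterns derived from the log excerpt above. Replacing the concrete
# addresses with \S+ is an assumption made for illustration.
PHY_FAIL = re.compile(r"phy \S+: phy poweron failed --> -110")
DWC3_FAIL = re.compile(r"dwc3: probe of \S+ failed with error -110")


def has_dwc3_timeout(log_text: str) -> bool:
    """Return True if the log contains both the PHY power-on failure
    and the dwc3 probe failure (error -110, i.e. -ETIMEDOUT)."""
    return bool(PHY_FAIL.search(log_text)) and bool(DWC3_FAIL.search(log_text))


sample = """\
phy phy-ff800000.phy.6: phy poweron failed --> -110
dwc3 fe900000.usb: error -ETIMEDOUT: failed to initialize core
dwc3: probe of fe900000.usb failed with error -110
"""
```

Calling `has_dwc3_timeout(sample)` on the excerpt above flags it as affected, while a log without these lines is not flagged.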

We then found a public exchange on the Linux kernel development mailing list that mentions this exact error:

The exchange explains, in rather esoteric technical language, that normally the USB, acting as a network interface, is to be turned off if the USB system is suspended and set to host mode (think of the host as the server in a client-server connection, suspended because there are no clients using it). It goes into how a previous change to the handling of this scenario may cause issues because of how components are reset. The kernel developers therefore introduced changes to fix how the system executes these resets, but in doing so unknowingly introduced problems for how the rockchip64 boards handle their components.

In short: because of how a fix for a problem affecting other systems was implemented, some rockchip64 based boards may no longer initialize their drivers properly. This explains why affected users' systems have become unresponsive over the network. It also explains why a USB keyboard directly attached to the board may sometimes not respond after booting. This scenario leaves the user entirely unable to interact with the system.

Digging into git commits

After tracking down the offending commit, we figured out that it came from a branch off Linux kernel v6.5-rc3 that was then merged into v6.5-rc4: usb: dwc3: don't reset device side if dwc3 was configured as host-only.

Our RoninOS builds have Linux kernel 6.1.50 or 6.1.55, depending on which of our images was flashed, though users might have had package updates that put them on an intermediate version.

This is obviously a different release series than the aforementioned 6.5 that introduced the bug. After some more digging (we've dug a whole trench at this point) we found that the commit was also introduced into the longterm support series 6.1, specifically 6.1.43:

The offending commit in question is thus this one:

With that, we finally have a full picture of why our users are affected.
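To make the version reasoning concrete, here is a minimal sketch of a check for whether a kernel release falls in the affected part of the 6.1 series. The helper names and the example release suffix are hypothetical. The only fact taken from the text is that the commit landed in 6.1.43. No fixed 6.1 release is known at the time of writing, so no upper bound is modelled, and other series such as 6.5 are deliberately not covered.

```python
import re
from typing import Tuple

# From the article: the offending commit entered longterm 6.1 at 6.1.43.
FIRST_AFFECTED_6_1_PATCH = 43


def parse_release(release: str) -> Tuple[int, int, int]:
    """Parse the leading 'X.Y.Z' of a kernel release string.

    A suffix such as '-current-rockchip64' (illustrative) is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def in_affected_6_1_range(release: str) -> bool:
    """Return True for 6.1.y releases at or after 6.1.43."""
    major, minor, patch = parse_release(release)
    return (major, minor) == (6, 1) and patch >= FIRST_AFFECTED_6_1_PATCH
```

Both image versions mentioned above, 6.1.50 and 6.1.55, fall inside this range, as does any intermediate version a package update might have installed.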

Finding a fix commit

Following Lu jicong's comment on the mail exchange, we saw 3 changes to that file after the date of said comment.

The last one looked the most interesting, given its commit message:

On Rockchip RK3588 one of the DWC3 cores is integrated weirdly and
requires two extra clocks to be enabled. Without these extra clocks
hot-plugging USB devices is broken.

While the RockPro64 uses the RK3399 rather than the RK3588, it isn't too far-fetched to think that the architectures of these two chips aren't too different. Lo and behold, similar issues had been encountered before on the RK3399:

We found the commit with this fix on a branch off v6.6-rc4 (with v6.6-rc6 merged into that branch), which was merged into v6.7-rc1. That is where things stand at the time of writing this article.

We have yet to establish whether the fix commits are intended to be merged into Armbian's longterm stable 6.1 based source code. Though the first of the 3 is already present in 6.1.59 (commit), this gives no indication of when we can expect the rest of the fixes.

We have to accommodate our users now, so we're forced to come up with alternative solutions until then.

Finding solutions

Armbian has its own versioning scheme, and the offending commit has been present since Armbian 23.08 (trunk currently being 23.11). Since no Armbian upgrade with a fix is available, we're forced to perform a downgrade on affected systems. We've found that when affected systems are downgraded to the previous public build (Armbian 23.02) using a simple CLI command calling apt, effectively putting users back on Linux 5.15.93, the system is fully functional again. This gives us a good start on helping affected users with a workaround so they can transact again using their RoninDojo.

In stark contrast, we've run into difficulties offering these users the option of a simple reflash, as we were hoping to build a 23.02 based image that includes the aforementioned kernel downgrade. The Armbian build procedure is a beast of its own, with its own dedicated repository separate from the Armbian source code. We've figured out how to make builds with instructions to use a specific kernel version. However, this does not seem to produce a stable OS, as the build procedure has been updated since and does not accommodate everything needed to make images using an older kernel. On top of that, when we attempted to revert to an older codebase for the build procedure, to match how the older image version was built by the Armbian community at the time, this also proved problematic: it could not deal with the newer packages it automatically pulls in, and the build procedure did not even complete.

In short: with Armbian's tools as they currently are, it is nigh on impossible to build an Armbian 23.02 based RoninOS image, since we cannot reproduce the externalities as they were back when the base image of Armbian 23.02 was built and released.

Helping users

Up until now, we've had the system upgrade all packages in full every time you ssh into the system and perform a RoninDojo upgrade from the RoninCLI terminal. Risky as that turned out to be in this specific case, it did give our users a simple option to update the system in case there were critical security updates for packages present on the system.

Now that we've arrived at this point, for any users who have downgraded manually using our workaround, we're forced to lock the version of the Linux kernel until we can resume receiving updates that won't break their system with this bug. We're already researching how best to keep these users up to date on the 5.15 longterm stable releases so that they can still receive security updates, as 5.15.138 is the latest patch at the time of writing this article.
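As an illustration of what locking a kernel version can look like on an apt based system, the sketch below builds an `apt-mark hold` command, which tells apt to leave the listed packages at their installed version. This is not necessarily the mechanism RoninDojo uses, and the package names are assumptions for illustration only: check the names actually installed on the system before holding anything.

```python
import shlex
from typing import List, Sequence

# Hypothetical package names, used purely for illustration. The kernel
# packages actually installed on a given system may be named differently.
KERNEL_PACKAGES = [
    "linux-image-current-rockchip64",
    "linux-dtb-current-rockchip64",
]


def hold_command(packages: Sequence[str]) -> List[str]:
    """Build (but do not run) an `apt-mark hold` invocation for the packages."""
    if not packages:
        raise ValueError("at least one package name is required")
    return ["apt-mark", "hold", *packages]


# Print the command for review instead of executing it.
print(shlex.join(hold_command(KERNEL_PACKAGES)))
```

Building the command as a list and only printing it keeps the sketch free of side effects. The same list could be passed to `subprocess.run` with root privileges once the package names have been verified, and `apt-mark unhold` reverses the hold when fixed kernels become available.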

It is likely that Linux kernel packages for a rockchip64 based board are not built by the Armbian community, aside from the exact versions that were released along with the public Armbian releases. Building the kernel ourselves will in turn put pressure on us to perform the necessary testing on these systems and guarantee a stable release for the affected users with these self-built kernel packages. We will post an update on the situation when we can give conclusive results on this portion of the matter.

To conclude

We're actively keeping track of the problem and the Linux community's progress on it, to ensure these fixes will be present in the Linux 6.1 longterm stable release. We're looking to communicate this to its maintainers, having already reached out in the Armbian community.

Granted, we are dealing with a limited set of hands, and with complex low-level issues like these throwing curveballs at us, we regrettably lose a lot of development time on this. But our main priority is that users can transact freely, both in the face of those who wish we did not transact freely and in the face of the technical environment we have to work in. And that includes making sure users have a secure machine with maximum up-time.

If you run into connectivity issues with your RoninDojo node, please contact support. They will help you find out whether the kernel bug is affecting you and, if so, help you apply the workaround so that your node is up and running again. Contact support through one of the following communication channels:
