Desktop freeze?!


Hedon James

Finally getting around to installing my customized Debian-Spiral remix on the small fleet of machines in my home network.  Things were going wonderfully until the current machine, which has given me FITS.

 

This particular machine is a refurbed enterprise grade Lenovo ThinkCentre m715q with Ryzen5 2400 (Radeon RavenRidge integrated graphics).  The DE is LXQT and the window manager is Fluxbox. 

Inxi output as follows:

Quote

$ inxi -F
System:
  Host: LenovoTC-m715q Kernel: 6.1.0-13-amd64 arch: x86_64 bits: 64
    Desktop: LXQt v: 1.2.1 Distro: Debian GNU/Linux 12 (bookworm)
Machine:
  Type: Mini-pc System: LENOVO product: 10VHS17600 v: ThinkCentre M715q
    serial: <superuser required>
  Mobo: LENOVO model: 3130 v: SDK0J40697 WIN 3305183174933
    serial: <superuser required> UEFI: LENOVO v: M1XKT39A date: 04/08/2019
CPU:
  Info: quad core model: AMD Ryzen 5 PRO 2400GE w/ Radeon Vega Graphics
    bits: 64 type: MT MCP cache: L2: 2 MiB
  Speed (MHz): avg: 1600 min/max: 1600/3200 cores: 1: 1600 2: 1600 3: 1600
    4: 1600 5: 1600 6: 1600 7: 1600 8: 1600
Graphics:
  Device-1: AMD Raven Ridge [Radeon Vega Series / Radeon Mobile Series]
    driver: amdgpu v: kernel
  Display: x11 server: X.Org v: 1.21.1.7 driver: X: loaded: amdgpu
    unloaded: fbdev,modesetting,vesa dri: radeonsi gpu: amdgpu
    resolution: 1920x1080~60Hz
  API: OpenGL v: 4.6 Mesa 22.3.6 renderer: AMD Radeon Vega 11 Graphics
    (raven LLVM 15.0.6 DRM 3.49 6.1.0-13-amd64)
Audio:
  Device-1: AMD Raven/Raven2/Fenghuang HDMI/DP Audio driver: snd_hda_intel
  Device-2: AMD ACP/ACP3X/ACP6x Audio Coprocessor driver: snd_pci_acp3x
  Device-3: AMD Family 17h/19h HD Audio driver: snd_hda_intel
  API: ALSA v: k6.1.0-13-amd64 status: kernel-api
  Server-1: PulseAudio v: 16.1 status: active
Network:
  Device-1: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet
    driver: r8169
  IF: eth0 state: up speed: 1000 Mbps duplex: full mac: 6c:4b:90:a6:f7:68
Drives:
  Local Storage: total: 238.47 GiB used: 16.18 GiB (6.8%)
  ID-1: /dev/nvme0n1 vendor: Samsung model: MZVLB256HAHQ-000L7
    size: 238.47 GiB
Partition:
  ID-1: / size: 233.38 GiB used: 16.18 GiB (6.9%) fs: ext4 dev: /dev/nvme0n1p2
  ID-2: /boot/efi size: 299.4 MiB used: 4 MiB (1.3%) fs: vfat
    dev: /dev/nvme0n1p1
Swap:
  ID-1: swap-1 type: file size: 512 MiB used: 0 KiB (0.0%) file: /swapfile
  ID-2: swap-2 type: zram size: 13.84 GiB used: 0 KiB (0.0%) dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 47.9 C mobo: N/A gpu: amdgpu temp: 47.0 C
  Fan Speeds (RPM): N/A
Info:
  Processes: 256 Uptime: 1h 2m Memory: 14.56 GiB used: 2.12 GiB (14.5%)
  Shell: Bash inxi: 3.3.26

 

I'm experiencing 2 issues.  The main problem is that when this machine is awoken from sleep, I can't open applications.  Not from the LXQt menu, not from the LXQt panel quicklaunchers, and not from the Fluxbox root menu.  I can mouse around and click, and keyboard function seems to work, but I can't even restart or exit cleanly, because the exit/shutdown button doesn't register (from any of my 3 potential inputs....LXQt menu, LXQt launcher panel, Fluxbox menu); and since I can't open a terminal, I can't shut down from there either.  I'm stuck with a hard power-off and restart, which fixes everything....until the next sleep/awake sequence.  I was inclined to think this a Fluxbox issue, except that I can still use Fluxbox functions, such as change styles, change workspaces, reconfigure flux, restart flux, etc...

 

I thought this pointed to an LXQt issue (possibly), or a LightDM sleep/hibernate issue.  So I disabled sleep/hibernate (I think)...I can wake up the display with typical mouse or keyboard movement and I'm in my session without having to log back in (no screen lock....the way I want it).  But the issue persists...so either I haven't disabled it like I think, or sleep/hibernate/wake is not the issue.

 

I'm STRONGLY inclined to consider a graphics driver issue, and I find all kinds of issues with RavenRidge graphics in the Linux ecosystem.  And this FEELS like a graphics issue, as I've experienced them before.  But the issues I locate seem to have all been resolved in the 5.xx kernels.  I'm running 6.1.0-13 on Debian.  So either that's not it, or I'm missing something in my graphics stack.  I've looked at the Xorg log files and I don't really understand everything going on there, but nothing jumps out as an error or warning either.  I'm seeing references to dmesg log files but can't find one ANYWHERE on my system.

 

My brain is jello at this point and I probably need a fresh set of eyes on this problem.  Anyone care to chime in and suggest next steps to troubleshoot this PITA?!


using opensource amdgpu driver.

 

Also, I rebooted LXQt with my backup Openbox WM, rather than Fluxbox.  Issues persist with Openbox, so I'm confident in ruling out Fluxbox.  Looking squarely at LXQT now (with no issues I can find on the LXQt forum), or Ryzen5 RavenRidge GPU (with many reported issues, but supposedly fixed in early-mid 5.xx kernels).

 

I'll try your suggestion SB, but I'm in another rabbit hole right now.  trying to change a suggested setting in Lenovo BIOS, which asked for a password.  I've never set a password, so just tried to enter through.  Apparently, I'm the third failed attempt and now the hard drive is locked until either a supervisor password is entered (not gonna happen....it's a refurb, and there's no way to know), or I can reset CMOS (battery removal failed....trying to jump bridge connectors to reset to default), or I have a $200 paperweight.

 

I don't have issues very often, but when I do, it's a MOTHERLODE of issues.  Be back when I find out if CMOS/BIOS reset succeeds.  Here we go....


5 hours ago, securitybreach said:

Also, Install amdgpu-dkms & amdgpu-core packages

 

https://wiki.debian.org/AMDGPUDriverOnStretchAndBuster2

okay....what a CLUSTER you-know-what THAT was!  got the BIOS reset, password eliminated, and back to default BIOS settings.  Wouldn't boot, due to secure boot, but disabled that and booted back into system.  switched back to fluxbox, cuz that's what I KNOW, and that's my "baseline" for behavior.

 

I've seen that link before, and looks like that was the proof of concept to get things streamlined into the kernel.  but I tried to install anyways, and neither package is in my Debian Bookworm repos.  FWIW....I also have non-free firmware repo enabled.  Here's everything in Debian repos for "amdgpu" packages:

https://packages.debian.org/search?keywords=amdgpu

 

Looking at my inxi output above, I note the AMD Ryzen 5 PRO 2400....that "PRO" suffix might be a clue.  Seems I've read about the PRO version of Ryzen 5, but can't remember exactly what I read.  Starting to wonder if I shouldn't be looking at AMDs proprietary driver for the PRO gpu?  I really don't want to use proprietary if I don't have to; and everyone seems to say that the opensource driver works better anyhow...but I've got to get to a stable system first.

 

First....I'd like to get it working in a stable manner.  Later, I'll worry about what is better...

 

What else ya got SB?


securitybreach

Oh wait.. I know what might be going on. It sounds like you need to have a composite manager running like xcompmgr, picom, etc. This acts like a service that renders drop shadows and primitive window transparency. Even if you do not need these things, a lot of gtk apps depend on a composite manager to function without glitches. Usually you just install one and add it to the startup config of your environment. These are not always required but with systems that have graphic artifacts, screen tearing, etc., these are required. Might be worth a try anyway.


44 minutes ago, securitybreach said:

Oh wait.. I know what might be going on. It sounds like you need to have a composite manager running like xcompmgr, picom, etc. This acts like a service that renders drop shadows and primitive window transparency. Even if you do not need these things, a lot of gtk apps depend on a composite manager to function without glitches. Usually you just install one and add it to the startup config of your environment. These are not always required but with systems that have graphic artifacts, screen tearing, etc., these are required. Might be worth a try anyway.

no artifacts, screen tearing, etc...just "freezing" of the LXQT desktop.  I can still pull up my fluxbox root menu and do "fluxbox things", just not launch applications.  weird right?

 

I do have compton installed, and autostarting at boot, but that was the first thing i disabled in troubleshooting process.  On the chance that one of my other prior actions may have "fixed" something, I'll re-enable compton and see how it goes.  I found a Manjaro posting that sounds like it MIGHT apply to me:

https://forum.manjaro.org/t/manjaro-installation-ryzen-pro-not-still-full-supported/63362/21

I don't understand this, and I'm reluctant to make changes I don't understand.  Any of this make sense to anyone reading/lurking this thread?  But I've dealt with nomodeset before when it comes to gpus, so at least sounds promising.

 

Also, I keep seeing references to "check your logs".  WHAT logs should I be checking?  I've looked at Xorg.logs and don't see anything, but don't fully understand either (I've got about 4 logs for xorg, xorg.old, xorg1 and xorg1.old, etc...)  I also see references to journal logs and dmesg, but don't know what I'm LOOKING for, let alone how to address.
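
On the "which logs" question: as far as I know, dmesg is a command that prints the kernel's in-memory log rather than a file on disk, and on Debian 12 it normally needs root; the systemd journal keeps the same kernel messages.  A minimal sketch for narrowing those down to lines about sleep/wake or the GPU driver is below.  The helper name suspend_lines and the keyword list are assumptions made for illustration, not anything from this thread:

```shell
# Keep only kernel log lines that mention suspend/resume or the GPU driver.
# The keyword list is a guess at what is relevant here; adjust as needed.
suspend_lines() {
    grep -Ei 'suspend|resume|amdgpu|drm' || true
}

# Usage (root is normally needed to read the kernel log on Debian 12):
#   sudo dmesg --level=err,warn                 # errors/warnings, current boot
#   sudo journalctl -k -b | suspend_lines       # kernel messages, current boot
#   sudo journalctl -k -b -1 | suspend_lines    # previous boot, if the journal is persistent
```

The previous-boot form is the useful one after a hard power-off, since the current boot's log no longer contains the freeze.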


securitybreach

Yeah, if you haven't tried nomodeset, I would try that first. I usually do that with my nvidia systems. Of course, I use the dkms version of nvidia.


No dice on the compositing suggestion.  Re-enabled compton, and re-booted, but eventually froze.

 

While I was waiting for that to happen, or NOT, I started reading up on dmesg logs, and overlaying together with IOMMU (never heard of it before) issues, I get the following output from dmesg (can't copy & paste because I'm "watching" in 2 second intervals)

[screenshot: dmesg output]

I note the lines about VPD access failed, for serial 0000:01:00.1, & *0.2, and pci 0000:01:00.3.  lspci indicates the following:

[screenshot: lspci output]

If I read the lspci correctly(?), serial device errors relate to serial controller Realtek Semiconductor....RTL8111xP, while device error relates to IPMI Interface.  Not sure what all that means, but stumbled across IOMMU issues with RavenRidge processors and seems like IOMMU might have something to do with that, so followed this thread, and added recommended GRUB parameters to boot.

https://bbs.minisforum.com/threads/the-iommu-issue-boot-and-usb-problems.2180/

re-booted, and now waiting to see what happens, or NOT happen?!

 

I don't even know if I'm on the right trail here....but hoping MAYBE something I say or do will spark an idea in someone else, or something in the logs catch their eyes.  Until resolved, I'm open to advice!


38 minutes ago, securitybreach said:

Well I would at least try the closed driver just to make sure its not a hardware issue.

according to the AMD site, the closed source amdgpu-pro drivers do not support the Raven Ridge/Radeon Vega series.

 

when I lshw, i note that the IOMMU and IPMI entries indicate UNCLAIMED.  Earlier readings indicate this is a problem.  No clue how to fix that.  May not even be related to my issue.  Not even sure if it's considered a harmless "warning" or a critical "error."  Maybe I should look into BIOS update?  Probably another rabbit hole to descend.  (Looking for Alice, but all I see are Mad Hatters!)

 

In the meantime, I've probably spent 10-14 man hours trying to hunt down and remedy this issue, and still no closer (it seems).  Not really cost effective for a $200 machine, so I pulled the trigger on a Lenovo ThinkCentre m910q Tiny....a nearly identical machine but with Intel Core i7-6700T cpu and integrated Intel HD 530 graphics.  Looks like folks have been installing linux on this machine since about 2019 with no reported issues.  (FINGERS CROSSED)

 

I'll keep running down the issues with this m715q and Ryzen5, but at least it won't be time-critical anymore.  I need to get my primary system up & running in a stable manner before I switch over and migrate to the new machine.  Once THAT is done, I'll be able to take my time and try to figure this out as the knowledge reveals itself, rather than looking for needles in a haystack.  But this is how linux issues usually are for me....


still chasing this and based on this thread:

disabled C6 in the BIOS....no change.

then added GRUB boot flags for idle=nomwait and rcu_nocbs=0-7 and rebooted....no change.
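
Since "no change" could also mean the flags never reached the kernel, it may be worth confirming they are active before ruling them out.  On Debian they would normally go into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by sudo update-grub and a reboot.  A small sketch of a check against the running kernel's command line (the function name has_param is made up for this example):

```shell
# has_param PARAM [FILE]: succeed if PARAM appears as a whole word on the
# kernel command line. FILE defaults to /proc/cmdline (the running kernel).
has_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Usage, with the two flags mentioned above:
#   for p in idle=nomwait rcu_nocbs=0-7; do
#       if has_param "$p"; then echo "$p: active"; else echo "$p: NOT active"; fi
#   done
```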

 

but then I kept a terminal open for the next freeze and learned something interesting.  couldn't click an application to open from the launcher or fluxbox menu, so tried to start in terminal to see error messages(?)....and saw THIS:

cannot open display: :0.0
Maximum number of clients reached

 

started googling THAT issue and found this page:

https://askubuntu.com/questions/4499/how-can-i-diagnose-debug-maximum-number-of-clients-reached-x-errors

 

I don't understand all this again, but I see more processes than the terminal can list.  so i started killing some with command

kill PID

after killing 3 or 4 processes, I clicked on a launcher and TADA...application launched, as expected.  So I've finally got SOME kind of progress, which I think are clues to a solution.  It seems I've got "zombie" processes using up all my X client connections.  I've never seen this before....WTH is this?  What might cause this?  With that answer, what might be the solution?
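
To see which program is holding the connections before killing anything, the linked askubuntu page suggests tallying open UNIX-domain sockets per command with lsof.  A rough sketch along those lines follows; the helper name count_by_command is hypothetical, and the tally covers all UNIX sockets rather than X connections only, so it is a proxy at best:

```shell
# Tally lines of lsof-style output per command name (first column),
# skipping the header line, busiest command first.
count_by_command() {
    awk 'NR > 1 { n[$1]++ } END { for (c in n) print n[c], c }' | sort -rn
}

# Usage (needs lsof installed; run as root to see every user's processes):
#   lsof -U +c 15 | count_by_command | head -n 5
```

A command that sits at the top with a count in the hundreds would be the first suspect; xrestop gives a more direct per-client view of the X server itself.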

 

Have I been chasing the wrong problem (Ryzen5 driver issues?) this whole time?  Or might the graphics driver issues be the cause of the unrelinquished X client connections?  I finally feel like I have a confirmed action that resolves the issue (killing PID to free X client connections), rather than poking in the dark in hopes that the poke will do something.  But now that I have that, not sure what to do next?

 

Any thoughts?


yep...confirmed....I let the desktop "sleep" for awhile and when I wake it up, xrestop indicates there are 256 clients being monitored with numerous "xerrors".  I don't know how to read the xrestop display, but I know that 256 number is important, as I read the default number of connections to Xclients is 256...with a maximum of 2048.  Once I kill all instances of one(?) continuously respawned process, the number drops to 40-something clients and my ability to launch applications is restored.  This is nearly identical to my current desktop, also running Debian (Buster instead of Bookworm) LXQT (0.14 instead of 1.2), which is a known good configuration for about 3-4 years now.  I've repeated this sleep/wake process several times today, and I can duplicate the situation and the remedy every time.  So problem identified(!)...but solution still elusive.

 

A seemingly obvious solution (that I can find) is to increase the number of Xclients from 256 to 512 or 1024, or 2048.  But "raising the ceiling" doesn't seem like a solution to me....it only allows for greater intervals in between killing processes to restore availability of Xclient connections.  It looks like ONE process is causing the issue, because when I "pkill" that process, it reduces the number of Xclients being monitored in xrestop from 256 to 42, just by killing that one process.  So not only have I identified the problem, I THINK I've identified the culprit; which appears to be lxqt-conf.  Although it's still possible that one process isn't the culprit, but merely responding to what the actual culprit is telling it to do.

 

I'm off to see if there are bugs reported in lxqt-conf that might cause this issue.  But I also can't help wondering if a BIOS issue isn't the underlying cause here.  I don't like to upgrade BIOS, unless absolutely necessary.  And given the issues I had with "disk lock" described earlier in this thread, I'm reluctant to mess with the BIOS.  But I've got a nagging feeling that maybe a bug in BIOS is causing an issue with the RavenRidge/Vega memory banks, which is causing an issue with Xorg, which is causing an issue with lxqt-conf (and most likely, the "monitor" section of lxqt-conf I'm guessing).  Too soon to say, but that would certainly tie all symptoms together.  But I want to pursue less invasive/more obvious solutions first....if there are any.

 

HOPING something I've shared in this thread causes someone's brain to remember a similar issue they had, and a solution.  HOPING...


I have only one Ryzen 5 machine that uses amdgpu driver (the rest are creaky old Bulldozer and Jaguar based machines that have ancient AMD video and use the Radeon driver.) My Ryzen machine runs Windows.

So I haven't seen much in the way of the problems you have been having. What I have had with EFI and Linux are glitches if the fastboot and secure boot options in the Setup are NOT disabled. I assume you have looked at those items already. Update: I see you got rid of secure boot.

I hate upgrading the firmware too but in this case I would give it a try.


securitybreach

Yeah, I also have no issues with my ryzen cpu but it doesn't have an included APU. I think it is graphics related but could be wrong.


I googled the AMD Ryzen5 CPU/GPU and confirmed it's linux-friendly before I bought the machine.  In fact, the rave reviews for Ryzen cpus are what drew me to this machine.  I prefer AMD cpus and have an A10 in one machine, an FX-120 in another, and a Ryzen 5 in a 3rd.  I seek AMD motherboards and cpus in my builds.

 

But I think my oversight in this one is specifically the RavenRidge/Vega series....I've got a Ryzen 5 PRO 2400GE in this little ThinkCentre Tiny, which seems to be a mobile/laptop CPU.  Not sure what the Raven Ridge/Vega classes mean (I'm not up on the Zen, Bulldozer, Jaguar, etc... terminology), but I see LOTS of posts about issues with that chipset.  But in all fairness to me....the issues seemed to get resolved around the 5.15 kernel.  With my Debian build using a 6.1 kernel, I assumed it wouldn't be an issue.  Also in fairness to me, my LiveMedia has indicated ZERO issues in testing on machines.  The one thing I didn't do while testing in LiveMedia was to let the session sleep, and wake it back up.  This is one of those bugs that only gets discovered upon installation to bare metal and subsequent USE.

 

Other than these zombie Xclient connections using up all my available connections, the cpu and graphics are actually performing quite impressively.  I can't even say for certain it's a graphics issue.  For all I know, this RavenRidge/Vega cpu/gpu is performing in a stellar manner.  All I know for CERTAIN is that SOMETHING is causing lxqt-config-mon to get respawned over and over during sleep, and ONLY during sleep.  I've been monitoring in terminal while using the machine, and the number of xclient connections being monitored barely goes over 120 with NUMEROUS applications open, and several tabs open in the browser.  Closing those applications reduces the number of xclients back to the 40-something being monitored in default state.
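
To pin down WHEN the respawning happens, one option is to log the number of matching processes at intervals and read the log back after waking the machine.  A minimal sketch, assuming a Linux /proc filesystem; the function name count_procs and the log path are made up for illustration.  The kernel truncates process names to 15 characters, which presumably explains why lxqt-config-monitor shows up as "lxqt-config-mon":

```shell
# count_procs NAME: print how many running processes have the kernel
# "comm" name NAME (comm is limited to 15 characters).
count_procs() {
    n=0
    for f in /proc/[0-9]*/comm; do
        # A process may exit between the glob and the read; skip it.
        name=$(cat "$f" 2>/dev/null) || continue
        if [ "$name" = "$1" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Usage: append a timestamped count once a minute, then inspect after wake.
#   while sleep 60; do
#       printf '%s %s\n' "$(date '+%F %T')" "$(count_procs lxqt-config-mon)"
#   done >> "$HOME/lxqt-config-mon.log"
```

A count that jumps right around the display blanking or wake time would point at a monitor/display event rather than at the graphics driver as such.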

 

I think Ray is right....I think I'm going to have to flash the BIOS.  If for no other reason, to be thorough and investigate every possibility.  Makes sense that the one thing I haven't tried is the very thing to solve the problem.  That's gonna be another fun adventure, because Lenovo flash instructions involve installing from Windows, or Lenovo Vantage (Windows application), or boot from CD.  But this machine doesn't have a CD drive.  So I think I'm going to have to put the ISO on a flash drive with "dd" and hope it acts like a CD.  More fun...


15 hours ago, raymac46 said:

 

Thanks for that link Ray!  Some information and suggestions in there I haven't tried yet, but willing to give them a go.  I note that "update the BIOS" is one of them....I'll save that for LAST.  I also note that I don't really have a "freeze" of my desktop.  That's what I called it initially, because I couldn't open applications from menu, launchers, fluxbox root menu, or even terminal (although terminal gave me the clue about xclient connections).  I could still maneuver my mouse, use keyboard to type in terminal (if i left it open BEFORE the freeze), and mouse over things to highlight selection, mouse over launchers and tint2 to "reveal" them, etc...  I've had frozen graphics before where I couldn't do ANYTHING except push power button to restart.  That's not what I have here.  I can use the desktop just fine.....I just can't start ANY new programs that require a GUI, as all my xclient connections have been used up.

 

Now i just need to figure out WHY, and how to prevent that from happening.  Good find ^.  Thanks!


well dayummm....another head scratcher of a curveball.  following SBs suggestion, I downloaded an ISO and booted from LiveUSB to test hardware.  In order to isolate the Debian/apt family from the issue, I downloaded OpenSuse KDE Leap 15.5 and booted from that.  I chose OpenSUSE to isolate the Debian family as a variable, and chose KDE to isolate LXQT as a variable, while retaining the QT underpinnings.  After installing lsof and xrestop from OpenSUSE repos, I opened both in terminal and let the live system sleep and returned later to wake it up.  There appear to be NO zombie processes being spawned.  After waking from sleep, there are only 34 xclient connections, with only 2 xerrors.  Inxi confirms the LiveSession of OpenSUSE is using the amdgpu driver, and kernel 5.14.21

 

this seems to suggest that hardware is fine (even though there's a July 2023 firmware update recommended) and driver is fine.  presumably a newer kernel (6.1.0 in my machine) is fine, although kernel regressions are possible; I'm guessing unlikely for a more recent and still very popular CPU.  I think this means I'm looking at a Debian issue, or an LXQt issue.  I guess next step is to boot from a non-LXQt Debian LiveMedia and see if the problem re-appears.  If it does, I think I've got a Debian problem.  If it does not, I think I have an LXQt problem.

 

Of course, none of this precludes a 2-layered problem that manifests only with a combination of 2 variables.  But if that's the case, I think I'm in trouble, cuz those are practically impossible to troubleshoot! 


securitybreach

Ok, well at least you ruled out a hardware issue and have something to compare as far as services and such perhaps related to your issue.

 


11 minutes ago, raymac46 said:

One other suggestion is to try Endeavour OS LXQt on your machine. That rules out Debian while keeping LXQt in play.

Once again...we think alike.  But you say "6" and I say "half a dozen".

 

With OpenSUSE KDE acting fine, I loaded up Debian 12 LXQT (to test whether MY customizations were an issue).  Debian 12 LXQT shows similar issues, so I think anything that I did (or the mother Spiral distro did) isn't the culprit.  Now loading up Debian LXDE to see those results.  If issues go away, it's LXQT.  Or a combination of LXQT & BIOS.  If issues remain, it's Debian.  Or a combination of Debian and BIOS.


With regard to a BIOS update one thing you can try is to use an old usb version of Hiren's boot disk which has a Mini XP resource. Then you can install the BIOS update from "within" Windows. I did this on an old junker with no CD. Don't know how it would work on a newer machine.

 

https://www.tomshardware.com/news/amd-raven-ridge-cpus-ryzen-vega,36547.html

 

Raven Ridge looks like an entirely mainstream AMD graphics solution so it should be OK with the amdgpu driver.

