this post was submitted on 23 Aug 2024
29 points (85.4% liked)

Linux


Should I struggle through constant crashes to get my 7900 GRE with 16 GB of VRAM working, possibly through the headache of ONNX? Can anyone report their own success or offer advice? AMD on Linux is generally lovely; SD with AMD on Linux, not so much. It was much better with my RTX 2080 on Linux, but gaming was horrible with the NVIDIA drivers. I feel I could do more with the 16 GB AMD card if stability weren't so bad. I currently have both cards running, to the horror of my PSU. A1111 does NOT want to see the NVIDIA card, only the AMD. Something about the version of PyTorch? More work to be done there.

  • Having a much better time back on Cinnamon default instead of Wayland. Oops!

** It heard me. Crashed again on an x/y plot but due to being away from Wayland I was able to see the terminal dump: amdgpu thermal overload! shutdown initiated! That'll do it! Finally something easy to fix. Wonder why thermal throttling isn't kicking in to control runaway? Will stress it once more and clock the temps this time.

Temps were exceeding 115C, phew! No idea why the default amdgpu driver has no fan control but they're ripping like they should now. Monitoring temps has restored system stability. Using multiple amd/nvidia dedicated venv folders and careful driver choice/installation were the keys to multigpu success.
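
For anyone who wants to watch temps during a long x/y plot, here is a minimal Python sketch that reads the GPU temperature sensors the kernel exposes through hwmon. It assumes the usual sysfs layout (`/sys/class/drm/card*/device/hwmon/hwmon*/temp*_input`, values in millidegrees Celsius); the function names and the 100 C limit are my own placeholders, not anything from the driver.

```python
from pathlib import Path


def read_gpu_temps(drm_root="/sys/class/drm"):
    """Return {"cardN/label": degrees_celsius} for each hwmon temp sensor found."""
    temps = {}
    pattern = "card*/device/hwmon/hwmon*/temp*_input"
    for temp_file in sorted(Path(drm_root).glob(pattern)):
        card = temp_file.parts[-5]  # e.g. "card0"
        label_file = temp_file.with_name(temp_file.name.replace("_input", "_label"))
        try:
            # Sensors usually carry a label file (e.g. "edge" or "junction");
            # fall back to the file name if it is missing.
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = temp_file.name
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable, skip it
        temps[f"{card}/{label}"] = millidegrees / 1000.0
    return temps


def over_limit(temps, limit_c=100.0):
    """Return the sensor names at or above limit_c (hypothetical threshold)."""
    return sorted(name for name, value in temps.items() if value >= limit_c)


if __name__ == "__main__":
    current = read_gpu_temps()
    for name, value in current.items():
        print(f"{name}: {value:.1f} C")
    print("over limit:", over_limit(current))
```

Running it in a loop in a second terminal (or over ssh) while generating should show whether the card is heading toward a thermal shutdown before it happens.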

top 24 comments
[–] [email protected] 1 points 2 months ago

Anything with Nvidia will actually be a better experience outside of Wayland on Linux. Most people don't want to acknowledge it, but it really is the case.

[–] [email protected] 4 points 2 months ago

My experience is that AMD's virtual memory system for VRAM is buggy and those bugs cause kernel crashes. A few tips:

  1. If running both cards is overstressing your PSU you might be suffering from voltage drops when your GPU draws maximum power. I was able to run games absolutely fine on my previous PSU, but running diffusion models caused it to collapse. Try just a single card to see if it helps stability.

  2. Make sure your kernel is as recent as possible. There have been a number of fixes in the 6.x series, and I have seen stability go up. Remember: docker images still use your host OS kernel.

  3. If you can, disable the desktop (e.g. systemctl isolate multi-user.target) and run the web GUI over the network from another machine. If you're running ComfyUI, that means adding --listen to the command line options. It's normally the desktop environment that triggers the crashes when it tries to access something in VRAM that has been swapped to normal RAM to make room for your models. Giving the whole GPU to the one task boosts stability massively. It's not the desktop environment's fault. The GPU driver should handle the situation.

  4. When you get a crash, often it's just that the GPU has crashed and not the machine (Won't be true of a power supply issue). sshing in and shutting down cleanly can save your filesystems the trauma of a hard reboot. If you don't have another machine, grab a ssh client for your phone like Juice SSH on android. (Not affiliated. It just works for me)

  5. Using rocm-smi to reset the card after a crash might bring things back, but not always. Obviously you have to do this over the network as your display has gone.

  6. Be aware of your VRAM usage (amdgpu_top) and try to avoid overcommitting it. It sucks, but if you can avoid swapping VRAM everything goes better. Low memory modes on the tools can help. ComfyUI has --lowvram, for example, which more aggressively removes things from VRAM when it's finished using them. It slows down generations a bit, but that's better than crashing.
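
As a rough illustration of the last tip, here is a small Python sketch that checks VRAM headroom before launching a job. It assumes the amdgpu driver's sysfs files `mem_info_vram_used` and `mem_info_vram_total` (byte counts) under `/sys/class/drm/card0/device`; the `card0` index is an assumption (with two GPUs installed yours may differ), and the helper names and the 4096 MiB figure are made up for the example.

```python
from pathlib import Path

MIB = 1024 * 1024


def vram_usage(device_dir="/sys/class/drm/card0/device"):
    """Return (used_bytes, total_bytes) as reported by amdgpu in sysfs."""
    base = Path(device_dir)
    used = int((base / "mem_info_vram_used").read_text().strip())
    total = int((base / "mem_info_vram_total").read_text().strip())
    return used, total


def headroom_mib(used, total):
    """Free VRAM in MiB."""
    return (total - used) / MIB


def should_use_low_vram(used, total, needed_mib):
    """True when the job's estimated need exceeds what is currently free."""
    return headroom_mib(used, total) < needed_mib


if __name__ == "__main__":
    try:
        used, total = vram_usage()
    except (OSError, ValueError):
        print("could not read amdgpu VRAM counters (wrong card index?)")
    else:
        print(f"free VRAM: {headroom_mib(used, total):.0f} MiB")
        # 4096 MiB is a placeholder estimate, not a measured model size.
        print("use a low-VRAM mode:", should_use_low_vram(used, total, 4096))
```

The point is only to decide up front whether to start the tool in a low-memory mode instead of finding out through a crash.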

With this I've been running SDXL on an 8 GB RX 7600 pretty successfully (~1s per iteration). I've been thinking about upgrading, but I think I'll wait for the RX 8000 series now. It's possible the underlying problem is something with the GPU hardware, as AMD are definitely improving things with software changes but not solving it once and for all. I'm also hopeful that they will upgrade the VRAM across the range. The 16 GB 7600 XT says to me that they know <16 GB isn't practical anymore, so the high end also has to go up, right?

[–] [email protected] 4 points 2 months ago (1 children)
[–] [email protected] 2 points 2 months ago (1 children)

I might take the docker route for the ease of troubleshooting if nothing else. So very sick of hard system freezes/crashes while kludging through the troubleshooting process. Any words of wisdom?

[–] [email protected] 2 points 2 months ago (1 children)

Assume I'm an amateur and bad at this ;P

In any case you might try a docker-compose.yml

version: "3.8"
# Compose file build variables set in .env
services:
  supervisor:
    platform: linux/amd64
    build:
      context: ./build
      args:
        PYTHON_VERSION: ${PYTHON_VERSION:-3.10}
        PYTORCH_VERSION: ${PYTORCH_VERSION:-2.2.2}
        WEBUI_TAG: ${WEBUI_TAG:-}
        IMAGE_BASE: ${IMAGE_BASE:-ghcr.io/ai-dock/python:${PYTHON_VERSION:-3.10}-cuda-11.8.0-base-22.04}
      tags:
        - "ghcr.io/ai-dock/stable-diffusion-webui:${IMAGE_TAG:-cuda-11.8.0-base-22.04}"
        
    image: ghcr.io/ai-dock/stable-diffusion-webui:${IMAGE_TAG:-cuda-11.8.0-base-22.04}
    
    devices:
      - "/dev/dri:/dev/dri"
      # For AMD GPU
      #- "/dev/kfd:/dev/kfd"
    
    volumes:
      # Workspace
      - ./workspace:${WORKSPACE:-/workspace/}:rshared
      # You can share /workspace/storage with other non-WEBUI containers. See README
      #- /path/to/common_storage:${WORKSPACE:-/workspace/}storage/:rshared
      # Will echo to root-owned authorized_keys file;
      # Avoids changing local file owner
      - ./config/authorized_keys:/root/.ssh/authorized_keys_mount
      - ./config/provisioning/default.sh:/opt/ai-dock/bin/provisioning.sh
    
    ports:
        # SSH available on host machine port 2222 to avoid conflict. Change to suit
        - ${SSH_PORT_HOST:-2222}:${SSH_PORT_LOCAL:-22}
        # Caddy port for service portal
        - ${SERVICEPORTAL_PORT_HOST:-1111}:${SERVICEPORTAL_PORT_HOST:-1111}
        # WEBUI web interface
        - ${WEBUI_PORT_HOST:-7860}:${WEBUI_PORT_HOST:-7860}
        # Jupyter server
        - ${JUPYTER_PORT_HOST:-8888}:${JUPYTER_PORT_HOST:-8888}
        # Syncthing
        - ${SYNCTHING_UI_PORT_HOST:-8384}:${SYNCTHING_UI_PORT_HOST:-8384}
        - ${SYNCTHING_TRANSPORT_PORT_HOST:-22999}:${SYNCTHING_TRANSPORT_PORT_HOST:-22999}
   
    environment:
        # Don't enclose values in quotes
        - DIRECT_ADDRESS=${DIRECT_ADDRESS:-127.0.0.1}
        - DIRECT_ADDRESS_GET_WAN=${DIRECT_ADDRESS_GET_WAN:-false}
        - WORKSPACE=${WORKSPACE:-/workspace}
        - WORKSPACE_SYNC=${WORKSPACE_SYNC:-false}
        - CF_TUNNEL_TOKEN=${CF_TUNNEL_TOKEN:-}
        - CF_QUICK_TUNNELS=${CF_QUICK_TUNNELS:-true}
        - WEB_ENABLE_AUTH=${WEB_ENABLE_AUTH:-true}
        - WEB_USER=${WEB_USER:-user}
        - WEB_PASSWORD=${WEB_PASSWORD:-password}
        - SSH_PORT_HOST=${SSH_PORT_HOST:-2222}
        - SSH_PORT_LOCAL=${SSH_PORT_LOCAL:-22}
        - SERVICEPORTAL_PORT_HOST=${SERVICEPORTAL_PORT_HOST:-1111}
        - SERVICEPORTAL_METRICS_PORT=${SERVICEPORTAL_METRICS_PORT:-21111}
        - SERVICEPORTAL_URL=${SERVICEPORTAL_URL:-}
        - WEBUI_BRANCH=${WEBUI_BRANCH:-}
        - WEBUI_FLAGS=${WEBUI_FLAGS:-}
        - WEBUI_PORT_HOST=${WEBUI_PORT_HOST:-7860}
        - WEBUI_PORT_LOCAL=${WEBUI_PORT_LOCAL:-17860}
        - WEBUI_METRICS_PORT=${WEBUI_METRICS_PORT:-27860}
        - WEBUI_URL=${WEBUI_URL:-}
        - JUPYTER_PORT_HOST=${JUPYTER_PORT_HOST:-8888}
        - JUPYTER_METRICS_PORT=${JUPYTER_METRICS_PORT:-28888}
        - JUPYTER_URL=${JUPYTER_URL:-}
        - SERVERLESS=${SERVERLESS:-false}
        - SYNCTHING_UI_PORT_HOST=${SYNCTHING_UI_PORT_HOST:-8384}
        - SYNCTHING_TRANSPORT_PORT_HOST=${SYNCTHING_TRANSPORT_PORT_HOST:-22999}
        - SYNCTHING_URL=${SYNCTHING_URL:-}
        #- PROVISIONING_SCRIPT=${PROVISIONING_SCRIPT:-}

install.sh

sudo pacman -S docker
sudo pacman -S docker-compose

update.sh

#!/bin/bash
# https://stackoverflow.com/questions/49316462/how-to-update-existing-images-with-docker-compose

sudo docker-compose pull
sudo docker-compose up --force-recreate --build -d
sudo docker image prune -f

start.sh

#!/bin/bash
sudo docker-compose down --remove-orphans && sudo docker-compose up
[–] [email protected] 2 points 2 months ago (1 children)

What a treat! I just got done setting up a second venv within the sd folder. one called amd-venv the other nvidia-venv. Copied the webui.sh and webui-user.sh scripts and made separate flavors of those as well to point to the respective venv. Now If I just had my nvidia drivers working I could probably set my power supply on fire running them in parallel.
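
A quick way to confirm each venv really has the build you expect is to ask PyTorch itself. This is a sketch, not part of A1111: as far as I know a given PyTorch wheel is built for either CUDA or ROCm, so a ROCm build will only list the AMD card and vice versa, which would fit the "A1111 only sees the AMD card" symptom. `describe_torch_build` is a hypothetical helper name.

```python
def describe_torch_build(cuda_version, hip_version):
    """Classify a PyTorch install from torch.version.cuda / torch.version.hip."""
    if hip_version:
        return f"ROCm/HIP build ({hip_version})"
    if cuda_version:
        return f"CUDA build ({cuda_version})"
    return "CPU-only build"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        # torch.version.hip is None on CUDA builds; getattr guards older versions.
        hip = getattr(torch.version, "hip", None)
        print(describe_torch_build(torch.version.cuda, hip))
        # ROCm builds also report their devices through the torch.cuda API.
        if torch.cuda.is_available():
            for index in range(torch.cuda.device_count()):
                print(index, torch.cuda.get_device_name(index))
        else:
            print("no GPU visible to this build")
```

Run it once with each venv activated (for example the amd-venv and nvidia-venv mentioned above); the output should differ between the two.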

[–] [email protected] 1 points 2 months ago

Excellent. I did my test config last month for a friend. I was having trouble on bare metal, even though I typically prefer it, and in this sense it was nice to have an image I could easily turn on and off as needed.

[–] [email protected] 2 points 2 months ago (2 children)

Well I finally got the nvidia card working to some extent. On the recommended driver it only works in lowvram. medvram maxes vram too easily on this driver/cuda version for whatever reason. Does anyone know the current best nvidia driver for sd on linux? Perhaps 470, the other provided by the LM driver manager..?

[–] [email protected] 2 points 2 months ago

I'm on 550, never had any problems with using ComfyUI or sd-webui. Docker makes it easier to get out of dependency hell imo

[–] [email protected] 3 points 2 months ago (1 children)

SD works fine for me with: Driver Version: 525.147.05 CUDA Version: 12.0

I use this docker container: https://github.com/AbdBarho/stable-diffusion-webui-docker

You will also need to install the nvidia container toolkit if you use docker containers: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

[–] [email protected] 1 points 2 months ago* (last edited 2 months ago)

Thank you!! I may rely on this heavily. Too many different drivers to try willy-nilly. I am in the process of attempting with this guide/driver for now. Will report back with my luck or misfortunes https://hub.tcno.co/ai/stable-diffusion/automatic1111-fast/


[–] [email protected] 2 points 2 months ago

There are likely automatic checks in the startup script. I don't use A1111 any more in favor of Comfy, and I only have a 3080Ti with 16 GB (mobile version). I can run without issues. The only time I have issues with anything AI related is when I need Nvidia's proprietary compiler nvcc. I need nvcc to hack around with things like llama.cpp. With nvcc, it can have issues with the open source driver.

[–] [email protected] 1 points 2 months ago (1 children)

If you don't understand how models use host caching, start there.

Outside of that, you're asking people to simplify everything into a quick answer, and there is none.

ONNX is the "universal" standard. Make sure you didn't accidentally convert the input model into something else, but more importantly, make sure that when you run it and it converts automatically, the work is actually done on the GPU. ONNX Runtime defaults to CPU.
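
One way to check the "is it actually on the GPU" part, sketched in Python under some assumptions: the provider names below are the ones ONNX Runtime uses for ROCm, MIGraphX, CUDA and CPU, but which of them exist depends on the onnxruntime package you installed. `pick_providers` and `runs_on_gpu` are hypothetical helpers, and `model.onnx` is a placeholder path.

```python
GPU_PROVIDERS = [
    "ROCMExecutionProvider",
    "MIGraphXExecutionProvider",
    "CUDAExecutionProvider",
]


def pick_providers(available, preferred=GPU_PROVIDERS):
    """Keep the preferred GPU providers that are installed, then CPU as fallback."""
    chosen = [name for name in preferred if name in available]
    return chosen + ["CPUExecutionProvider"]


def runs_on_gpu(active_providers):
    """True if the highest-priority active provider is not the CPU one."""
    return bool(active_providers) and active_providers[0] != "CPUExecutionProvider"


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed in this environment")
    else:
        available = ort.get_available_providers()
        wanted = pick_providers(available)
        print("available:", available)
        print("requesting:", wanted)
        # With a real model file (placeholder path), the session reports what it got:
        # session = ort.InferenceSession("model.onnx", providers=wanted)
        # print("on GPU:", runs_on_gpu(session.get_providers()))
```

If only `CPUExecutionProvider` shows up as available, the installed package has no GPU backend and everything will run on the CPU regardless of what you request.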

[–] [email protected] -1 points 2 months ago (1 children)

I started reading into the ONNX business here https://rocm.blogs.amd.com/artificial-intelligence/stable-diffusion-onnx-runtime/README.html Didn't take long to see that was beyond me. Has anyone distilled an easy to use model converter/conversion process? One I saw required a HF token for the process, yeesh

[–] [email protected] 1 points 2 months ago (1 children)

No.

As I said, you're trying to distill people's profession into an easy-to-digest guide about "make it work". Nothing like that exists.

Same way you can't just get a job doing "doctor stuff", or "build junk".

[–] [email protected] 2 points 2 months ago

Since only one of us is feeling helpful, here is a 6 minute video for the rest of us to enjoy https://www.youtube.com/watch?v=lRBsmnBE9ZA

[–] [email protected] 7 points 2 months ago* (last edited 2 months ago) (1 children)

I have run Stable Diffusion models successfully with my ancient Vega 64 with 8 GB of VRAM. However, it does occasionally run out of memory and crash when I run models that want all 8 gigs. I have to run it without a proper DE (Openbox, Falkon browser with one tab only) if I don't want it to crash frequently.

[–] [email protected] 2 points 2 months ago (3 children)

How bad are your crashes? Mine will either freeze the system entirely or crash the current lightdm session, sometimes recovering, sometimes freezing anyway. Needs power cycle to rescue. What is the DE you speak of? openbox?

[–] [email protected] 2 points 2 months ago

Yes, mine are similar. I used to run KDE Plasma while generating, but Plasma took too much VRAM, so now I'm using IceWM. I noticed that the crashes happen when something needs VRAM when it's already all used, so that's why IceWM reduces crashes: it's very light on resources.

[–] [email protected] 1 points 2 months ago (1 children)

That seems strange. Perhaps you should stress-test your GPU/system to see if it's a hardware problem.

[–] [email protected] 1 points 2 months ago (1 children)

I had that concern as well, with it being a new card. It performs fine in gaming as well as in every glmark benchmark so far. I have it chalked up to AMD support being in experimental status on Linux/SD. Any other stress tests you recommend while I'm in the return window!? lol

[–] [email protected] 2 points 2 months ago (1 children)

I've used this before: https://github.com/wilicc/gpu-burn?tab=readme-ov-file

Yeah, it may be a driver issue, Nvidia/pytorch handles OOM gracefully on my system.

[–] [email protected] 1 points 2 months ago

Ah, thanks. It is my AMD card causing crashes with SD in my experience. NVIDIA is native to CUDA hence the stability.

[–] [email protected] 2 points 2 months ago

the reset situation may improve in the not too distant future: https://www.phoronix.com/news/AMDGPU-Per-Ring-Resets