Small guide to run Llama.cpp on Windows with discrete AMD GPU

fatboy93@lemm.ee · edit-2 10 months ago

fatboy93@lemm.ee · 10 months ago

I posted this on Reddit first, since this community never shows up in my feed and I wasn't sure whether it's inactive. But here it is as well!

adhdplantdev@lemm.ee · 10 months ago

How fast is model inference on your local machine using this method?

fatboy93@lemm.ee · edit-2 10 months ago

I’m just going to cheat here a bit and use ChatGPT to summarize this, since I don’t want to get the calculation wrong. Hope it makes sense. I’m just excited to share this!

########## Integrated GPU #########

Total inference time = Load time + Sample time + Prompt eval time + Eval time

Total inference time = 26205.90 ms + (6.34 ms/sample * 103 samples) + 29234.08 ms + 118847.32 ms

Total inference time = 26205.90 ms + 653.02 ms + 29234.08 ms + 118847.32 ms

Total inference time = 174940.32 ms

So, the total inference time is approximately 174940.32 ms.

########## Discrete GPU 6800M #########

Total inference time = Load time + Sample time + Prompt eval time + Eval time

Total inference time = 60188.90 ms + (3.58 ms/sample * 103 samples) + 7133.18 ms + 13003.63 ms

Total inference time = 60188.90 ms + 368.74 ms + 7133.18 ms + 13003.63 ms

Total inference time = 80694.45 ms

So, the total inference time is approximately 80694.45 ms.

#####################################

Taking the difference Integrated - Discrete: 94245.87 ms.

So the discrete GPU needs about 54% less time, which is roughly 1.5 minutes less, or a bit over 2x as fast. The integrated GPU takes close to 175 seconds and the discrete GPU finishes in about 81 seconds.
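For anyone who wants to double-check the arithmetic, here is a minimal Python sketch that re-adds the timings quoted above. The function name `total_ms` and the variable names are just illustrative choices of mine. The numbers themselves, including the 103-sample count, are copied from the summary. It assumes the total is simply the sum of the four components, as in the formula above.

```python
# Timings in milliseconds, copied from the figures quoted in this comment.
SAMPLES = 103  # number of samples used in both runs


def total_ms(load, per_sample, prompt_eval, eval_):
    """Sum load, sample, prompt eval and eval time into one total (ms)."""
    return load + per_sample * SAMPLES + prompt_eval + eval_


# Integrated GPU run
integrated = total_ms(26205.90, 6.34, 29234.08, 118847.32)
# Discrete GPU (6800M) run
discrete = total_ms(60188.90, 3.58, 7133.18, 13003.63)

# Time saved by the discrete GPU, and the same thing as a ratio
saved = integrated - discrete

print(f"integrated: {integrated:.2f} ms")
print(f"discrete:   {discrete:.2f} ms")
print(f"saved:      {saved:.2f} ms ({saved / integrated:.1%} less time)")
print(f"speedup:    {integrated / discrete:.2f}x")
```

Running it is a cheap way to catch slips in the hand-added totals, and swapping in your own timings gives the same comparison for other hardware.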

I do think that adding more RAM at some point could help improve the loading times, since the laptop currently has about 16 GB of RAM.