Running a Dell-branded Nvidia GTX 1060 in a non-Dell system

Hey everyone, I'm looking to buy a used Nvidia GeForce GTX 1060 to run some ML tutorials. I got a good deal on a Dell OEM one. Are there any pitfalls in running one in a non-Dell system? More specifically, a system nothing even close to that, i.e. an AMD FX CPU and an AMD 970 chipset.

Alex.

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

| From: Alex Volkov via talk <talk@gtalug.org>
| I'm looking to buy a used Nvidia GeForce GTX 1060 to run some ML
| tutorials.

The advantage of nvidia over AMD is wider support. CUDA is nvidia-only (but AMD's ROCm is intended to be easy to port to from CUDA).

The disadvantage is that nvidia's stuff is closed source. Yuck.

nvidia also has terrible licensing terms that can force you to buy more expensive cards. You probably won't be hit by this: <https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/>

In general, nvidia does more "price discrimination". But AMD is not immune: AMD sells "workstation" cards for extra money.

For raw computing power per dollar, my impression is that AMD can be a better deal.

| I got a good deal on a Dell OEM one. Are there any pitfalls in running one
| in a non-Dell system? More specifically, a system nothing even close to that,
| i.e. an AMD FX CPU and an AMD 970 chipset.

There are no problems that I know of. I used a Dell OEM nvidia card many years ago without issue.

Some OEM cards are a little crippled. My 5-year-old desktop came with an OEM AMD card. The specs said "1920x1200 max resolution" but also said "Dual Link DVI" (which is only needed for higher resolutions). So I assumed that it could do 2560x1600 like the non-OEM versions. It could not. (My best guess is that they cheaped out on a TMDS chip and did not, in fact, support dual-link, but I had no way to test.) So check out the specs.

Bonus hint: before buying the card, make sure it will fit in your system:

- I had a problem with an RX 570 being too long for my computer's motherboard
- many cards now require extra power connectors that your power supply might not support. And the number of pins on those connectors has changed in recent years.
- you may need a power supply with more capacity.
- with more power comes more heat -- will your case handle that? (Probably.)

Hey Hugh,

Thank you for your reply, see my comments below.

On 2019-07-22 9:30 a.m., D. Hugh Redelmeier via talk wrote:
| From: Alex Volkov via talk <talk@gtalug.org>
| I'm looking to buy a used Nvidia GeForce GTX 1060 to run some ML
| tutorials.
The advantage of nvidia over AMD is wider support. CUDA is nvidia-only (but AMD's ROCm is intended to be easy to port to from CUDA).
I have another system with a Ryzen 5 2400G and was hoping to run ROCm on it, but as it turns out, ROCm doesn't fully support AMD chips with built-in graphics. I can still install a discrete card into that system, but the solution is not as cheap as getting a used GTX off Craigslist.
The disadvantage is that nvidia's stuff is closed source. Yuck.
nvidia also has terrible licensing terms that can force you to buy more expensive cards. You probably won't be hit by this: <https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/>
Yes. I haven't yet figured out how to fix screen tearing with the proprietary Nvidia drivers. As for ML, I had to register on their website to download the cuDNN packages required for tensorflow. The packages are for Ubuntu, but they seem to work on Debian. Nvidia asks some pretty invasive questions about how the card is going to be used (which I don't yet know), so randomly checking off boxes and giving them one of the email addresses where I dump all of my subscriptions helps.
In general, nvidia does more "price discrimination". But AMD is not immune: AMD sells "workstation" cards for extra money.
I like AMD's approach more: they just publish a tensorflow-rocm binary on PyPI -- https://pypi.org/project/tensorflow-rocm/ Nvidia, however, makes you jump through some hoops -- https://developer.nvidia.com/cudnn
For raw computing power per dollar, my impression is that AMD can be a better deal.
That seems to start to hold around the $500 price point, but I want something to toy with, so I can't justify that expense right now. A used GTX 1060 gives enough performance to try things out; if I grow out of it, I'll switch to a card with good ROCm support.
| I got a good deal on a Dell OEM one. Are there any pitfalls in running one
| in a non-Dell system? More specifically, a system nothing even close to that,
| i.e. an AMD FX CPU and an AMD 970 chipset.
There are no problems that I know of. I used a Dell OEM nvidia card many years ago without issue.
Some OEM cards are a little crippled. My 5-year-old desktop came with an OEM AMD card. The specs said "1920x1200 max resolution" but also said "Dual Link DVI" (which is only needed for higher resolutions). So I assumed that it could do 2560x1600 like the non-OEM versions. It could not. (My best guess is that they cheaped out on a TMDS chip and did not, in fact, support dual-link, but I had no way to test.) So check out the specs.
Turns out there are a lot of $200-$250 GTX 1060 cards being sold on Kijiji, but the price they actually sell for is much lower. I offered $160 to three sellers and got replies from two: one was the Dell card I was unsure about, the other a non-OEM MSI with dual fans. I went with the MSI. More fans more better.
Bonus hint: before buying the card, make sure it will fit in your system:
- I had a problem with an RX 570 being too long for my computer's motherboard
- many cards now require extra power connectors that your power supply might not support. And the number of pins on those connectors changed in recent years.
- you may need a power supply with more capacity.
- with more power comes more heat -- will your case handle that? (Probably)
I'm really glad that back in 2013 I bought a decent mid-ATX Antec (Sonata II?) with a 500W power supply, enough room inside, and 2x 120mm fans. It has 2x 6-pin 12V connectors. After a minor AMDectomy, where I removed the old Radeon that doesn't really do anything besides displaying things, I was able to install and run the card without any hardware changes to the rest of the system.

I got to the part where I'm able to run some tutorials on tensorflow, which I believe run on the CPU, because there seems to be a software bug somewhere related to LD_LIBRARY_PATH that I still haven't figured out.

2019-07-21 22:29:58.204355: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-07-21 22:29:58.367642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7715
pciBusID: 0000:01:00.0
2019-07-21 22:29:58.367858: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2019-07-21 22:29:58.367982: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2019-07-21 22:29:58.368112: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2019-07-21 22:29:58.368234: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2019-07-21 22:29:58.368369: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2019-07-21 22:29:58.368498: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2019-07-21 22:29:58.374333: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2019-07-21 22:29:58.374376: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-07-21 22:29:58.374999: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: FMA
2019-07-21 22:29:58.405862: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3214800000 Hz
2019-07-21 22:29:58.407424: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a967b9ab00 executing computations on platform Host. Devices:
2019-07-21 22:29:58.407486: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-07-21 22:29:58.563751: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a967b97fa0 executing computations on platform CUDA. Devices:
2019-07-21 22:29:58.563835: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce GTX 1060 6GB, Compute Capability 6.1
2019-07-21 22:29:58.564029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-21 22:29:58.564055: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]
2019-07-21 22:30:00.791745: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
W0721 22:30:01.203895 140325207855424 deprecation_wrapper.py:119] From classify_image.py:85: The name tf.gfile.GFile is deprecated. Please use tf.io.gfile.GFile instead.

giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89107)
indri, indris, Indri indri, Indri brevicaudatus (score = 0.00779)
lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00296)
custard apple (score = 0.00147)
earthstar (score = 0.00117)

Alex.
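A quick way to confirm whether TensorFlow actually registered the GPU (rather than silently falling back to the CPU) is to ask it directly. The snippet below is only a minimal sketch for a TF 1.x install like the one in the log above; the CUDA library path is an assumption -- point LD_LIBRARY_PATH at wherever your distribution put libcudart.so.10.0 and friends before starting Python.

    # check_gpu.py -- minimal sketch: does TensorFlow see the GTX 1060?
    # Run with something like (the path is an assumed install location):
    #   export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH
    #   python check_gpu.py
    import tensorflow as tf
    from tensorflow.python.client import device_lib

    # List every device TensorFlow can use; a working CUDA setup shows a /device:GPU:0 entry.
    for dev in device_lib.list_local_devices():
        print(dev.name, dev.device_type)

    # Returns False when the CUDA libraries failed to dlopen, which matches the
    # "Skipping registering GPU devices" line in the log above.
    print("GPU available:", tf.test.is_gpu_available())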

| From: Alex Volkov via talk <talk@gtalug.org>
| I have another system with a Ryzen 5 2400G and was hoping to run ROCm on it, but
| as it turns out, ROCm doesn't fully support AMD chips with built-in
| graphics. I can still install a discrete card into that system but the solution
| is not as cheap as getting a used GTX off Craigslist.

In January I saw cheap Radeon RX 580s on Kijiji too. I haven't looked recently.

One advantage of AMD over nvidia is that larger memories are more common.

It's a shame about ROCm's lack of APU support. Parts of it are there.

<https://rocm.github.io/hardware.html>

The iGPU in AMD APUs

The following APUs are not fully supported by the ROCm stack.

“Carrizo” and “Bristol Ridge” APUs
“Raven Ridge” APUs

These APUs are enabled in the upstream Linux kernel drivers and the ROCm Thunk. Support for these APUs is enabled in the ROCm OpenCL runtime. However, support for them is not enabled in our HCC compiler, HIP, or the ROCm libraries. In addition, because ROCm is currently focused on discrete GPUs, AMD does not make any claims of continued support in the ROCm stack for these integrated GPUs.

In addition, these APUs may not work due to OEM and ODM choices when it comes to key configuration parameters such as inclusion of the required CRAT tables and IOMMU configuration parameters in the system BIOS. As such, APU-based laptops, all-in-one systems, and desktop motherboards may not be properly detected by the ROCm drivers. You should check with your system vendor to see if these options are available before attempting to use an APU-based system with ROCm.

Yes.

Unfortunately I went through the debugging process before I got to that paragraph. On the upside, I think I got a lightning talk out of it, which I'll try to present at the next meeting.

Alex.

On 2019-08-05 11:12 a.m., D. Hugh Redelmeier via talk wrote:
| From: Alex Volkov via talk <talk@gtalug.org>
| I have another system with a Ryzen 5 2400G and was hoping to run ROCm on it, but
| as it turns out, ROCm doesn't fully support AMD chips with built-in
| graphics. I can still install a discrete card into that system but the solution
| is not as cheap as getting a used GTX off Craigslist.
In January I saw cheap Radeon RX 580s on Kijiji too. I haven't looked recently.
One advantage of AMD over nvidia is that larger memories are more common.
It's a shame about ROCm's lack of APU support. Parts of it are there.
<https://rocm.github.io/hardware.html>
The iGPU in AMD APUs
The following APUs are not fully supported by the ROCm stack.
“Carrizo” and “Bristol Ridge” APUs
“Raven Ridge” APUs
These APUs are enabled in the upstream Linux kernel drivers and the ROCm Thunk. Support for these APUs is enabled in the ROCm OpenCL runtime. However, support for them is not enabled in our HCC compiler, HIP, or the ROCm libraries. In addition, because ROCm is currently focused on discrete GPUs, AMD does not make any claims of continued support in the ROCm stack for these integrated GPUs.
In addition, these APUs may not work due to OEM and ODM choices when it comes to key configuration parameters such as inclusion of the required CRAT tables and IOMMU configuration parameters in the system BIOS. As such, APU-based laptops, all-in-one systems, and desktop motherboards may not be properly detected by the ROCm drivers. You should check with your system vendor to see if these options are available before attempting to use an APU-based system with ROCm.

On Tue, Aug 6, 2019 at 4:23 PM Alex Volkov via talk <talk@gtalug.org> wrote:
Yes.
Unfortunately I went through the debugging process before I got to that paragraph. On the upside, I think I got a lightning talk out of it, which I'll try to present at the next meeting.
Alex.
Alex,

I don't know how much you're intending to do with that GPU. If you're just using Nvidia I can't help you, as mentioned, but if you're interested in GPU workloads I was looking at the AMDGPU backend for LLVM. I'm not sure if there is one that targets Nvidia cards, but it may be of interest to you, as you would be able to compile directly for the GPU rather than using an API to access it. I'm not sure about Nvidia, so double-check that. Here is the official documentation for AMD though: https://llvm.org/docs/AMDGPUUsage.html

If you're using it for machine learning it may be helpful to be aware of it, as you could compile the libraries, if possible, for the GPU target rather than accessing the GPU indirectly through the CPU. Again, I'm not sure which libraries support this, but most of the popular ones should, and that may increase throughput a lot since it's direct assembly for the card, not abstracted.

As for GPU memory, that may be an issue, as Hugh mentioned, depending on the size of the workload. I don't think it would matter for your tutorials, but going across the PCI bus is about as bad as cache misses are for CPUs, so it's best to avoid it if possible. If you were able to find a 6GB version, that would be more than enough for most workloads excluding professional ones. 1060s were shipped with either 3 or 6GB, so that may be something to check for the card you ordered. Retail, I recall it being about a $30-50 Canadian difference, and for double the RAM it was a good deal at the time if you bought one.

Hopefully that helps a little,

Nick

P.S. I'm not aware of one, but I'm assuming there is a gcc backend as well if you would prefer that for your development or learning.
On 2019-08-05 11:12 a.m., D. Hugh Redelmeier via talk wrote:
| From: Alex Volkov via talk <talk@gtalug.org>
| I have another system with a Ryzen 5 2400G and was hoping to run ROCm on it, but
| as it turns out, ROCm doesn't fully support AMD chips with built-in
| graphics. I can still install a discrete card into that system but the solution
| is not as cheap as getting a used GTX off Craigslist.
In January I saw cheap Radeon RX 580s on Kijiji too. I haven't looked recently.
One advantage of AMD over nvidia is that larger memories are more common.
It's a shame about ROCm's lack of APU support. Parts of it are there.
<https://rocm.github.io/hardware.html>
The iGPU in AMD APUs
The following APUs are not fully supported by the ROCm stack.
“Carrizo” and “Bristol Ridge” APUs
“Raven Ridge” APUs
These APUs are enabled in the upstream Linux kernel drivers and the ROCm Thunk. Support for these APUs is enabled in the ROCm OpenCL runtime. However, support for them is not enabled in our HCC compiler, HIP, or the ROCm libraries. In addition, because ROCm is currently focused on discrete GPUs, AMD does not make any claims of continued support in the ROCm stack for these integrated GPUs.
In addition, these APUs may not work due to OEM and ODM choices when it comes to key configuration parameters such as inclusion of the required CRAT tables and IOMMU configuration parameters in the system BIOS. As such, APU-based laptops, all-in-one systems, and desktop motherboards may not be properly detected by the ROCm drivers. You should check with your system vendor to see if these options are available before attempting to use an APU-based system with ROCm.

| From: xerofoify via talk <talk@gtalug.org>
| On Tue, Aug 6, 2019 at 4:23 PM Alex Volkov via talk <talk@gtalug.org> wrote:
| I don't know how much you're intending to do with that GPU or otherwise.

He said he wants to do human (himself) learning about machine learning. Less cute way of saying this: he wants to experiment and play with ML.

| If you're just using Nvidia I can't help
| you as mentioned but if you're interested in GPU workloads I was looking
| at the AMDGPU backend for LLVM.

Most people learning (or doing) ML pick a tall stack of software and then learn almost nothing about the underlying hardware. I admit that sometimes performance issues poke their way through those levels.

If I remember correctly, the base of the stack that Alex was playing with was Google's TensorFlow. Of course there is stuff below that, but the less he has to know about it the better.

See Alex's discussion about getting TensorFlow to work on AMD. If I understood him correctly (maybe not), the normal route to TensorFlow on AMD is through ROCm, and that won't work on an APU. Too bad.

My guess: even if he could run ROCm, he might hit some hangups with TensorFlow since the most used path to TensorFlow is Nvidia cards and (I think) CUDA. It's always easier to follow a well-worn path.

I, on the other hand, think I'm interested in the raw hardware. I have not put any time into this but I intend to (one of many things I want to do).

| Not sure if there is one that targets Nvidia cards but it may be of
| interest to you as you would be able to
| compile directly for the GPU rather than using an API to access it.
| Not sure about Nvidia so double check
| that.

As I understand it:

- LLVM targets raw AMD GPU hardware
- that's probably not very useful because runtime support is needed for what you could consider OS-like functionality and that isn't provided.
  + scheduling
  + communicating with the host computer
  + failure handling and diagnostics
- Separately, a custom version of LLVM is used as a code generator (partly(?) at runtime!) for OpenCL. I think that AMD tries to "upstream" their LLVM changes but this is never soon enough.
- I think that nvidia also has a custom LLVM but does not try to upstream all of their goodies (LLVM is not copylefted).

I may be wrong about much or all of this. I would like to know an accurate, comprehensive, comprehensible source for this kind of information.

| Here is the official documentation for AMD though:
| https://llvm.org/docs/AMDGPUUsage.html

Thanks. I'll have a look.

| If you're using it for machine learning it may be helpful to be aware of
| it

You'd think so but few seem to bother. There's enough to get one's head around at the higher levels of abstraction.

Much ML seems to be done via cook-books.

| Hopefully that helps a little,

I'd love to hear a GTALUG talk about the lower levels. Perhaps a lightning talk next week would be a good place to start.

On Wed, Aug 7, 2019 at 9:54 AM D. Hugh Redelmeier via talk <talk@gtalug.org> wrote:
| From: xerofoify via talk <talk@gtalug.org>
| On Tue, Aug 6, 2019 at 4:23 PM Alex Volkov via talk <talk@gtalug.org> wrote:
| I don't know how much you're intending to do with that GPU or otherwise.
He said he wants to do human (himself) learning about machine learning. Less cute way of saying this: he wants to experiment and play with ML.
| If you're just using Nvidia I can't help
| you as mentioned but if you're interested in GPU workloads I was looking
| at the AMDGPU backend for LLVM.
Most people learning (or doing) ML pick a tall stack of software and then learn almost nothing about the underlying hardware. I admit that sometimes performance issues poke their way through those levels.
If I remember correctly, the base of the stack that Alex was playing with was Google's TensorFlow. Of course there is stuff below that but the less he has to know about it the better.
See Alex's discussion about getting TensorFlow to work on AMD. If I understood him correctly (maybe not), the normal route to TensorFlow on AMD is through ROCm, and that won't work on an APU. Too bad.
My guess: even if he could run ROCm, he might hit some hangups with TensorFlow since the most used path to TensorFlow is Nvidia cards and (I think) CUDA. It's always easier to follow a well-worn path.
I, on the other hand, think I'm interested in the raw hardware. I have not put any time into this but I intend to (one of many things I want to do).
| Not sure if there is one that targets Nvidia cards but it may be of
| interest to you as you would be able to
| compile directly for the GPU rather than using an API to access it.
| Not sure about Nvidia so double check
| that.
As I understand it:
- LLVM targets raw AMD GPU hardware
- that's probably not very useful because runtime support is needed for what you could consider OS-like functionality and that isn't provided.
+ scheduling
+ communicating with the host computer
+ failure handling and diagnostics
- Separately, a custom version of LLVM is used as a code generator (partly(?) at runtime!) for OpenCL. I think that AMD tries to "upstream" their LLVM changes but this is never soon enough.
- I think that nvidia also has a custom LLVM but does not try to upstream all of their goodies (LLVM is not copylefted).
I may be wrong about much or all of this. I would like to know an accurate, comprehensive, comprehensible source for this kind of information.
| Here is the official documentation for AMD though:
| https://llvm.org/docs/AMDGPUUsage.html
Thanks. I'll have a look.
| If you're using it for machine learning it may be helpful to be aware of
| it
You'd think so but few seem to bother. There's enough to get one's head around at the higher levels of abstraction.
Much ML seems to be done via cook-books.
| Hopefully that helps a little,
I'd love to hear a GTALUG talk about the lower levels. Perhaps a lightning talk next week would be a good place to start.
I won't be giving a talk about this, but Alex and I were planning for me to do some sort of Q and A session in the fall, mostly related to this or other compiler questions. I've found a lot of people have questions, especially about optimizations and backends. I'm not sure what month, but this will give you a heads-up.

AMD's hardware documentation is open and OpenCL is in both compilers these days. The biggest difference is that AMD cards have a backend for targeting the actual hardware; Nvidia, on the other hand, doesn't have an open source one, at least. So you can compile to native assembly for AMD GPUs after a certain release -- it's mentioned in the guide I posted.

If you're really interested, at the bottom of this page are the reference manuals for the GPUs themselves, which may help you with debugging if you're going to play around with one: https://developer.amd.com/resources/developer-guides-manuals/

That should be enough for now,

Nick

Hey Hugh,

I want to add a quick note on how the tensorflow library is structured:

* tensorflow -- main logic package, uses the CPU for all computation
* tensorflow-gpu -- NVIDIA CUDA-optimized package, depends on tensorflow, and has the following dependencies on CUDA -- https://www.tensorflow.org/install/gpu
* tensorflow-rocm -- AMD fork of tensorflow that depends on system ROCm packages being installed.

In other words, if you want to run tensorflow models, install tensorflow:

$ pip install tensorflow

If you want to run tensorflow faster, with CUDA:

<lots of system-level package installation magic>
$ pip install tensorflow tensorflow-gpu

If you want to run tensorflow on AMD hardware:

<some system-level package installation magic>
$ pip install tensorflow-rocm

(A short sketch of how to verify which device actually runs the computation is at the end of this message.)

Going back to the comment you made previously, that you've seen inexpensive Radeon RX 580s on Kijiji: this all comes back to the hardware constraints that I currently have. I have a Ryzen 5 system in a small package with a 160W power supply; it could take this kind of hardware, but I would need to upgrade the case and power supply to do that, which is not the purpose the system currently fulfils.

The other system I have is from 4 years ago, based on the AM3+ platform, so it only has PCIe v2 (there's a mention of PCIe v3 requirements in the ROCm documentation), probably no exposed CRAT tables, and sketchy IOMMU support. Just installing the card into it would likely not work, and upgrading the system to something that will support the ROCm stack properly would cost more than $500, whereas just getting a GTX off Kijiji would be less than $200.

Alex.

On 2019-08-07 9:54 a.m., D. Hugh Redelmeier via talk wrote:
| From: xerofoify via talk <talk@gtalug.org>
| On Tue, Aug 6, 2019 at 4:23 PM Alex Volkov via talk <talk@gtalug.org> wrote:
| I don't know how much you're intending to do with that GPU or otherwise.
He said he wants to do human (himself) learning about machine learning. Less cute way of saying this: he wants to experiment and play with ML.
| If you're just using Nvidia I can't help
| you as mentioned but if you're interested in GPU workloads I was looking
| at the AMDGPU backend for LLVM.
Most people learning (or doing) ML pick a tall stack of software and then learn almost nothing about the underlying hardware. I admit that sometimes performance issues poke their way through those levels.
If I remember correctly, the base of the stack that Alex was playing with was Google's TensorFlow. Of course there is stuff below that but the less he has to know about it the better.
See Alex's discussion about getting TensorFlow to work on AMD. If I understood him correctly (maybe not), the normal route to TensorFlow on AMD is through ROCm, and that won't work on an APU. Too bad.
My guess: even if he could run ROCm, he might hit some hangups with TensorFlow since the most used path to TensorFlow is Nvidia cards and (I think) CUDA. It's always easier to follow a well-worn path.
I, on the other hand, think I'm interested in the raw hardware. I have not put any time into this but I intend to (one of many things I want to do).
| Not sure if there is one that targets Nvidia cards but it may be of
| interest to you as you would be able to
| compile directly for the GPU rather than using an API to access it.
| Not sure about Nvidia so double check
| that.
As I understand it:
- LLVM targets raw AMD GPU hardware
- that's probably not very useful because runtime support is needed for what you could consider OS-like functionality and that isn't provided.
+ scheduling
+ communicating with the host computer
+ failure handling and diagnostics
- Separately, a custom version of LLVM is used as a code generator (partly(?) at runtime!) for OpenCL. I think that AMD tries to "upstream" their LLVM changes but this is never soon enough.
- I think that nvidia also has a custom LLVM but does not try to upstream all of their goodies (LLVM is not copylefted).
I may be wrong about much or all of this. I would like to know an accurate, comprehensive, comprehensible source for this kind of information.
| Here is the official documentation for AMD though:
| https://llvm.org/docs/AMDGPUUsage.html
Thanks. I'll have a look.
| If you're using it for machine learning it may be helpful to be aware of
| it
You'd think so but few seem to bother. There's enough to get one's head around at the higher levels of abstraction.
Much ML seems to be done via cook-books.
| Hopefully that helps a little,
I'd love to hear a GTALUG talk about the lower levels. Perhaps a lightning talk next week would be a good place to start.
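To tie the three packages together: whichever one is installed, enabling device-placement logging on a tiny op shows whether the work actually lands on the GPU or quietly stays on the CPU. This is a rough sketch against the TF 1.x API (the same vintage as the log earlier in the thread); nothing in it is specific to any particular card.

    # device_check.py -- rough sketch: confirm whether ops run on GPU or CPU.
    # Works the same with tensorflow, tensorflow-gpu, or tensorflow-rocm (TF 1.x API).
    import tensorflow as tf

    a = tf.random.uniform([1000, 1000])
    b = tf.random.uniform([1000, 1000])
    c = tf.matmul(a, b)

    # log_device_placement prints one line per op naming the device that ran it,
    # e.g. ".../device:GPU:0" for a working CUDA or ROCm setup.
    config = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=config) as sess:
        sess.run(c)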

| From: Alex Volkov via talk <talk@gtalug.org>
| The other system I have is from 4 years ago, based on the AM3+ platform, so it
| only has PCIe v2 (there's a mention of PCIe v3 requirements in the ROCm
| documentation), probably no exposed CRAT tables, and sketchy IOMMU support.
| Just installing the card into it would likely not work, and upgrading the
| system to something that will support the ROCm stack properly would cost more
| than $500, whereas just getting a GTX off Kijiji would be less than $200.

I just assumed that my Haswell systems have PCIe v3.0, but I don't actually know.

<https://rocm.github.io/ROCmPCIeFeatures.html>

Googling for a way of testing for PCIe v3 from the Linux command line isn't too rewarding. In the output of "lspci -vv" I see 8GT/s so I think that I have PCIe 3.0.

LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <8us

According to https://en.wikipedia.org/wiki/PCI_Express:
1.0: 2.5 GT/s
2.0: 5.0 GT/s
3.0: 8.0 GT/s
4.0: 16.0 GT/s
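The negotiated and maximum link speeds are also exposed in sysfs, which avoids parsing lspci output. The sketch below is just that -- a sketch: it assumes a kernel recent enough to expose current_link_speed and max_link_speed per PCI device, and the device address is a placeholder (0000:01:00.0 is the GPU address from the TensorFlow log earlier in the thread).

    # pcie_speed.py -- sketch: read PCIe link speed for one device from sysfs.
    # Usage: python pcie_speed.py 0000:01:00.0
    import sys
    from pathlib import Path

    def link_info(pci_addr):
        dev = Path("/sys/bus/pci/devices") / pci_addr
        # The kernel reports speeds as strings such as "8.0 GT/s PCIe" (8 GT/s == PCIe 3.0).
        return {
            "current_speed": (dev / "current_link_speed").read_text().strip(),
            "max_speed": (dev / "max_link_speed").read_text().strip(),
            "current_width": (dev / "current_link_width").read_text().strip(),
        }

    if __name__ == "__main__":
        for key, value in link_info(sys.argv[1]).items():
            print(key + ": " + value)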

On Mon, Aug 12, 2019 at 2:12 AM D. Hugh Redelmeier via talk <talk@gtalug.org> wrote:
| From: Alex Volkov via talk <talk@gtalug.org>
| The other system I have is from 4 years ago, based on the AM3+ platform, so it
| only has PCIe v2 (there's a mention of PCIe v3 requirements in the ROCm
| documentation), probably no exposed CRAT tables, and sketchy IOMMU support.
| Just installing the card into it would likely not work, and upgrading the
| system to something that will support the ROCm stack properly would cost more
| than $500, whereas just getting a GTX off Kijiji would be less than $200.
I just assumed that my Haswell systems have PCIe v3.0, but I don't actually know.
<https://rocm.github.io/ROCmPCIeFeatures.html>
Googling for a way of testing for PCIe v3 from the linux command line isn't too rewarding. In the output of "lspci -vv" I see 8GT/s so I think that I have PCIe 3.0.
LnkCap: Port #2, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <256ns, L1 <8us
According to https://en.wikipedia.org/wiki/PCI_Express:
1.0: 2.5 GT/s
2.0: 5.0 GT/s
3.0: 8.0 GT/s
4.0: 16.0 GT/s
For reference, yes, that's correct, Hugh. If you look through the Intel Assembly Manual it probably mentions it there. Though I don't think that Haswell has the SHA-512 extension in hardware. I'm not sure when that was added, but I believe it was in a later version.

Nick

| From: xerofoify via talk <talk@gtalug.org>
| For reference, yes, that's correct, Hugh. If you look through the Intel
| Assembly Manual it probably mentions
| it there.

By "it", I understand you to mean "PCIe 3.0 support". That's a bit odd for an "Assembly Manual".

Do you mean this manual? <https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf>

"Intel(TM) 64 and IA-32 Architectures Software Developer's Manual" Volume 1 of 9.

I don't see PCIe mentioned there. Nor would I expect it. There is mention of PCI in an example of using the MOVNTDQA instruction.

| Though I don't think that Haswell has the SHA-512 extension
| in hardware. Not sure when that
| was added but I believe it was in a later version.

I don't see SHA-512 mentioned in that manual. I see 5.18 "INTEL(TM) SHA EXTENSIONS" but that does not cover SHA-512.

Is this described elsewhere?

Is this related to the PCIe version in some way?

On Mon, Aug 12, 2019 at 6:11 PM D. Hugh Redelmeier via talk <talk@gtalug.org> wrote:
| From: xerofoify via talk <talk@gtalug.org>
| For reference, yes, that's correct, Hugh. If you look through the Intel
| Assembly Manual it probably mentions
| it there.
By "it", I understand you to mean "PCIe 3.0 support". That's a bit odd for an "Assembly Manual".
Do you mean this manual? <https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf>
"Intel(TM) 64 and IA-32 ArchitecturesSoftware Developer's Manual" Volume 1 of 9.
I don't see PCIe mentioned there. Nor would I expect it. There is mention of PCI in an example of using the MOVNTDQA instruction.
It was odd before, but instructions can touch or swap with PCI, so that's why. PCI is not like USB or other protocols: it requires overhead on the CPU side, if that makes sense, including lanes/instructions to a lesser degree. It may not be mentioned in assembly manuals directly, but very likely in other hardware documentation.
| Though I don't think that Haswell has the SHA-512 extension
| in hardware. Not sure when that
| was added but I believe it was in a later version.
I don't see SHA-512 mentioned in that manual. I see 5.18 "INTEL(TM) SHA EXTENSIONS" but that does not cover SHA-512.
Is this described elsewhere?
That would be it. Sorry, my memory was incorrect; it's up to SHA-256 currently.
Is this related to the PCIe version in some way?

Sorry, it was just a side comment.
Nick

| From: xerofoify via talk <talk@gtalug.org>
|
| On Mon, Aug 12, 2019 at 6:11 PM D. Hugh Redelmeier via talk
| <talk@gtalug.org> wrote:
| > <https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf>
| >
| > "Intel(TM) 64 and IA-32 Architectures Software Developer's Manual"
| > Volume 1 of 9.
| >
| > I don't see PCIe mentioned there. Nor would I expect it. There is
| > mention of PCI in an example of using the MOVNTDQA instruction.
| >
| It was odd before, but instructions can touch or swap with PCI, so that's why. PCI
| is not like USB or other protocols: it requires overhead on the CPU side, if that
| makes sense, including lanes/instructions to a lesser degree. It may not be
| mentioned in assembly manuals directly, but very likely in other hardware
| documentation.

The architecture seen by a program is often separated from bus issues. PCI has historically been addressed as part of the memory address space (as opposed to the I/O address space).

Once caches were introduced, software needed to be able to make sure that it didn't cause misbehaviour in PCI bus operations. When talking to a device, you usually (but not always) wish the cache to be bypassed.

Historically on x86 (post-i486), you did that using the MTRRs (Memory Type Range Registers). I'm sure that has since been changed since there were too few of those. But the ideas are there. See 18.3.1 "Memory-Mapped I/O".

If you look at 12.10.3 "Streaming Load Hint Instruction", you will see a discussion of this issue and the MOVNTDQA instruction. That's the context of the example referencing PCI. There is no need for the PCIe version to bleed into the abstract x86 architecture.

BTW "WC" means "Write Combining". Memory so designated (e.g. by an MTRR) is uncached but the processor may combine writes. This, for example, is often used for accessing graphics card buffers. Without write combining, many more writes would be required.

Interestingly, on the machine I'm using to compose this email, /proc/mtrr shows 7 registers with write-back and one uncachable. None is write combining.
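/proc/mtrr is plain text, so summarizing what a machine reports takes only a few lines. The sketch below just tallies entries per memory type (write-back, uncachable, write-combining); it assumes nothing more than a kernel built with MTRR support and read access to /proc/mtrr.

    # mtrr_summary.py -- sketch: tally MTRR entries by memory type from /proc/mtrr.
    # A typical line looks like:
    #   reg00: base=0x000000000 (    0MB), size= 8192MB, count=1: write-back
    from collections import Counter

    def mtrr_types(path="/proc/mtrr"):
        counts = Counter()
        with open(path) as f:
            for line in f:
                # The memory type is the text after the last colon.
                counts[line.rsplit(":", 1)[-1].strip()] += 1
        return counts

    if __name__ == "__main__":
        for mem_type, n in mtrr_types().items():
            print(n, "x", mem_type)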

On Mon, Aug 12, 2019 at 9:01 PM D. Hugh Redelmeier via talk <talk@gtalug.org> wrote:
| From: xerofoify via talk <talk@gtalug.org>
|
| On Mon, Aug 12, 2019 at 6:11 PM D. Hugh Redelmeier via talk
| <talk@gtalug.org> wrote:
| > <https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-1-manual.pdf>
| >
| > "Intel(TM) 64 and IA-32 Architectures Software Developer's Manual"
| > Volume 1 of 9.
| >
| > I don't see PCIe mentioned there. Nor would I expect it. There is
| > mention of PCI in an example of using the MOVNTDQA instruction.
| >
| It was odd before, but instructions can touch or swap with PCI, so that's why. PCI
| is not like USB or other protocols: it requires overhead on the CPU side, if that
| makes sense, including lanes/instructions to a lesser degree. It may not be
| mentioned in assembly manuals directly, but very likely in other hardware
| documentation.
The architecture seen by a program is often separated from bus issues. PCI has historically been addressed as part of the memory address space (as opposed to the IO address space).
Once caches were introduced, software needed to be able to make sure that it didn't cause misbehaviour in PCI bus operations. When talking to a device, you usually (but not always) wish the cache to be bypassed.
Historically on x86 (post-i486), you did that using the MTRRs (Memory Type Range Registers). I'm sure that has since been changed since there were too few of those. But the ideas are there. See 18.3.1 "Memory-Mapped I/O".
If you look at 12.10.3 "Streaming Load Hint Instruction", you will see a discussion of this issue and the MOVNTDQA instruction. That's the context of the example referencing PCI. There is no need for the PCIe version to bleed into the abstract X86 architecture.
BTW "WC" means "Write Combining". Memory so-designated (e.g. by an MTRR) is uncached but the processor may combine writes. This, for example, is often used for accessing graphics card buffers. Without write combining, many more writes would be required.
Interestingly, on the machine I'm using to compose this email, /proc/mtrr shows 7 registers with write-back and one uncachable. None is write combining.
Hugh,

That's correct. I wasn't sure if the manual mentions it directly as related to PCI Express, but the correct way of doing this is DMA. DMA, or direct memory access -- and the amount that may be buffered from an I/O range these days -- is dependent on the memory model of the CPU. This includes the assembly used to access it. Again, this does matter to the discussion, as Alex is dealing with GPUs, and going across the PCI Express bus is very expensive; it's basically a cache miss for GPUs.

I'm not sure if MTRRs or PATs, which seem to be the current version, would help here, as that's for mapping a region of memory to the CPU, not dealing with PCI bus latency issues. You could attempt to map the region in memory, but then you would need to poll the GPU so often to get the data that it doesn't seem as good as DMA, which does not wait on the processor and is just filled directly.

Maybe I misspoke in stating that PCI was mentioned directly; rather, the memory models related to this are, and the amount that may be DMAed is dependent on the processor architecture.

Sorry for the misunderstanding,

Nick

Hey Nick,

See my replies below.

On 2019-08-06 8:09 p.m., xerofoify via talk wrote:
On Tue, Aug 6, 2019 at 4:23 PM Alex Volkov via talk <talk@gtalug.org> wrote:
Yes.
Unfortunately I went through the debugging process before I got to that paragraph. On the upside, I think I got a lightning talk out of it, which I'll try to present at the next meeting.
Alex.
Alex,

I don't know how much you're intending to do with that GPU. If you're just using Nvidia I can't help you, as mentioned, but if you're interested in GPU workloads I was looking at the AMDGPU backend for LLVM. I'm not sure if there is one that targets Nvidia cards, but it may be of interest to you, as you would be able to compile directly for the GPU rather than using an API to access it. I'm not sure about Nvidia, so double-check that.
Here is the official documentation for AMD though: https://llvm.org/docs/AMDGPUUsage.html
I haven't gotten that far into it -- I've just run a few ML tutorials and haven't created any models of my own yet. My knowledge is limited to trying out the tensorflow package and getting tensorflow-gpu (the nvidia bindings) verifiably working with the hardware.

Turns out the NVIDIA drivers have this nice feature of falling back to CPU processing when there's an error -- useful when you need to get things done at any cost, not so much when attempting to debug the issue.

So far I've mostly used the card for hw-accelerated h264 encoding through ffmpeg.
If you're using it for machine learning it may be helpful to be aware of it, as you could compile the libraries, if possible, for the GPU target rather than accessing the GPU indirectly through the CPU. Again, I'm not sure which libraries support this, but most of the popular ones should, and that may increase throughput a lot since it's direct assembly for the card, not abstracted.
Thanks for the advice -- I'm not far enough along in the process to use this information yet.

You seem to know a lot about optimizing workloads on GPUs; would you like to come to our meeting next Tuesday and give a 5-10 minute talk on this? -- https://gtalug.org/meeting/2019-08/
As for GPU memory, that may be an issue, as Hugh mentioned, depending on the size of the workload. I don't think it would matter for your tutorials, but going across the PCI bus is about as bad as cache misses are for CPUs, so it's best to avoid it if possible. If you were able to find a 6GB version, that would be more than enough for most workloads excluding professional ones. 1060s were shipped with either 3 or 6GB, so that may be something to check for the card you ordered. Retail, I recall it being about a $30-50 Canadian difference, and for double the RAM it was a good deal at the time if you bought one.
There seem to be a lot of gamers who upgraded to a 1080 selling used 1060 6GB cards for a reasonable price. I got the MSI GTX 1060 6GB version.

So far with h264 encoding I've noticed that there's a significant processing drop when the card finishes encoding a chunk of data and then saturates the PCI bus.

Alex.
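For anyone wanting to try the same kind of encoding, here is a hedged sketch of driving NVENC from ffmpeg. The input and output paths, preset, and bitrate are placeholders; the only real requirement is an ffmpeg build with NVENC support (the h264_nvenc encoder and the cuda hwaccel).

    # nvenc_encode.py -- sketch: hardware h264 encode via ffmpeg's NVENC encoder.
    # Assumes an ffmpeg build with NVENC support; file names are placeholders.
    import subprocess

    def encode_nvenc(src, dst, preset="slow", bitrate="8M"):
        cmd = [
            "ffmpeg",
            "-hwaccel", "cuda",       # decode on the GPU where possible
            "-i", src,
            "-c:v", "h264_nvenc",     # NVENC hardware encoder
            "-preset", preset,
            "-b:v", bitrate,
            "-c:a", "copy",           # leave the audio stream untouched
            dst,
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        encode_nvenc("input.mkv", "output.mp4")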
Hopefully that helps a little,
Nick
P.S. I'm not aware of one, but I'm assuming there is a gcc backend as well if you would prefer that for your development or learning.
On 2019-08-05 11:12 a.m., D. Hugh Redelmeier via talk wrote:
| From: Alex Volkov via talk <talk@gtalug.org>
| I have another system with a Ryzen 5 2400G and was hoping to run ROCm on it, but
| as it turns out, ROCm doesn't fully support AMD chips with built-in
| graphics. I can still install a discrete card into that system but the solution
| is not as cheap as getting a used GTX off Craigslist.
In January I saw cheap Radeon RX 580s on Kijiji too. I haven't looked recently.
One advantage of AMD over nvidia is that larger memories are more common.
It's a shame about ROCm's lack of APU support. Parts of it are there.
<https://rocm.github.io/hardware.html>
The iGPU in AMD APUs
The following APUs are not fully supported by the ROCm stack.
“Carrizo” and “Bristol Ridge” APUs
“Raven Ridge” APUs
These APUs are enabled in the upstream Linux kernel drivers and the ROCm Thunk. Support for these APUs is enabled in the ROCm OpenCL runtime. However, support for them is not enabled in our HCC compiler, HIP, or the ROCm libraries. In addition, because ROCm is currently focused on discrete GPUs, AMD does not make any claims of continued support in the ROCm stack for these integrated GPUs.
In addition, these APUs may not work due to OEM and ODM choices when it comes to key configuration parameters such as inclusion of the required CRAT tables and IOMMU configuration parameters in the system BIOS. As such, APU-based laptops, all-in-one systems, and desktop motherboards may not be properly detected by the ROCm drivers. You should check with your system vendor to see if these options are available before attempting to use an APU-based system with ROCm.

On Wed, Aug 7, 2019 at 11:25 AM Alex Volkov <subscriptions@flamy.ca> wrote:
Hey Nick,
See my replies below
Hey Alex,

Below are some final comments.
On 2019-08-06 8:09 p.m., xerofoify via talk wrote:
On Tue, Aug 6, 2019 at 4:23 PM Alex Volkov via talk <talk@gtalug.org> wrote:
Yes.
Unfortunately I went through the debugging process before I got to that paragraph. On the upside, I think I got a lightning talk out of it, which I'll try to present at the next meeting.
Alex.
Alex,

I don't know how much you're intending to do with that GPU. If you're just using Nvidia I can't help you, as mentioned, but if you're interested in GPU workloads I was looking at the AMDGPU backend for LLVM. I'm not sure if there is one that targets Nvidia cards, but it may be of interest to you, as you would be able to compile directly for the GPU rather than using an API to access it. I'm not sure about Nvidia, so double-check that.
Here is the official documentation for AMD though: https://llvm.org/docs/AMDGPUUsage.html
I haven't gotten that far into it -- I've just run a few ML tutorials and haven't created any models of my own yet. My knowledge is limited to trying out the tensorflow package and getting tensorflow-gpu (the nvidia bindings) verifiably working with the hardware.
Turns out NVIDIA drivers have this nice feature of falling back to CPU processing when there's an error -- this is useful when needing to get things done at any cost, not so much when attempting to debug the issue.
So far I mostly used the card for hw-accelerated h264 encoding through ffmpeg.
The only warning here -- and it may not matter for you personally -- is that the encoded picture isn't as good as what a CPU produces, at least to my knowledge. If you're encoding raw video or something like uncompressed Blu-ray quality it's a huge deal, but otherwise you may not notice. Again, if you're just encoding for Twitch or YouTube quality it's probably fine.
If you're using it for machine learning it may be helpful to be aware of it, as you could compile the libraries, if possible, for the GPU target rather than accessing the GPU indirectly through the CPU. Again, I'm not sure which libraries support this, but most of the popular ones should, and that may increase throughput a lot since it's direct assembly for the card, not abstracted.
Thanks for the advice -- I'm not far enough along in the process to use this information yet.
You seem to know a lot about optimizing workloads on GPUs; would you like to come to our meeting next Tuesday and give a 5-10 minute talk on this? -- https://gtalug.org/meeting/2019-08/
As for GPU memory, that may be an issue, as Hugh mentioned, depending on the size of the workload. I don't think it would matter for your tutorials, but going across the PCI bus is about as bad as cache misses are for CPUs, so it's best to avoid it if possible. If you were able to find a 6GB version, that would be more than enough for most workloads excluding professional ones. 1060s were shipped with either 3 or 6GB, so that may be something to check for the card you ordered. Retail, I recall it being about a $30-50 Canadian difference, and for double the RAM it was a good deal at the time if you bought one.
There seem to be a lot of gamers who upgraded to a 1080 selling used 1060 6GB cards for a reasonable price. I got the MSI GTX 1060 6GB version.
So far with h264 encoding I've noticed that there's a significant processing drop when the card finishes encoding a chunk of data and then saturates the PCI bus.
Alex.
Odd -- it should forward it in batches. That seems like a missed optimization to me.

Not sure if you're aware, but most aftermarket cards, i.e. from OEMs or board partners, are overclocked out of the box. I'm not sure how much that card is overclocked, but it could be up to 300MHz more than a reference card. I always recommend aftermarket unless validation of the cards is key, and then just go for a Quadro, as they're more validated.

Nick
Hopefully that helps a little,
Nick
P.S. I'm not aware of one, but I'm assuming there is a gcc backend as well if you would prefer that for your development or learning.
On 2019-08-05 11:12 a.m., D. Hugh Redelmeier via talk wrote:
| From: Alex Volkov via talk <talk@gtalug.org>
| I have another system with a Ryzen 5 2400G and was hoping to run ROCm on it, but
| as it turns out, ROCm doesn't fully support AMD chips with built-in
| graphics. I can still install a discrete card into that system but the solution
| is not as cheap as getting a used GTX off Craigslist.
In January I saw cheap Radeon RX 580s on Kijiji too. I haven't looked recently.
One advantage of AMD over nvidia is that larger memories are more common.
It's a shame about ROCm's lack of APU support. Parts of it are there.
<https://rocm.github.io/hardware.html>
The iGPU in AMD APUs
The following APUs are not fully supported by the ROCm stack.
“Carrizo” and “Bristol Ridge” APUs
“Raven Ridge” APUs
These APUs are enabled in the upstream Linux kernel drivers and the ROCm Thunk. Support for these APUs is enabled in the ROCm OpenCL runtime. However, support for them is not enabled in our HCC compiler, HIP, or the ROCm libraries. In addition, because ROCm is currently focused on discrete GPUs, AMD does not make any claims of continued support in the ROCm stack for these integrated GPUs.
In addition, these APUs may not work due to OEM and ODM choices when it comes to key configuration parameters such as inclusion of the required CRAT tables and IOMMU configuration parameters in the system BIOS. As such, APU-based laptops, all-in-one systems, and desktop motherboards may not be properly detected by the ROCm drivers. You should check with your system vendor to see if these options are available before attempting to use an APU-based system with ROCm.

On 2019-07-20 2:47 p.m., Alex Volkov via talk wrote:
I'm looking to buy a used Nvidia GeForce GTX 1060 to run some ML tutorials.
Did this work out for you? I find myself in the market for a CUDA-capable card to run Meshroom — https://alicevision.org/#meshroom — a well-regarded photogrammetry suite. It only works on CUDA-equipped systems.

I don't need to spend much. Technically, the package will run on my 2013 Samsung Chronos ultrabook with a GT 640M graphics card, but it's so slow and hot that it's not worth the bother.

cheers,
 Stewart

Hey Stewart,

I've been using the card mostly for video encoding thus far. I haven't had the time to do a lot of ML on it.

The short answer is yes, but it might be a bit of a pain to set it up. Try one of the systems supported by Nvidia to see if you need to jump through fewer hoops than I did.

I didn't do any benchmarks, so I can only say subjective things about it -- sped-up editing in kdenlive feels a lot faster; I needed to get a version of the program with HW acceleration disabled, and my experience with it was a lot more frustrating. Final video encoding speed doesn't seem to be that much faster, but I think I haven't yet tuned all of the parameters.

As I think I mentioned before, gamers are upgrading to the GTX 1080, and there are a lot of GTX 1060 6GB cards (6GB is important -- the original version came with 3GB) on the market which you can get for less than $200 used. If you have a desktop computer with a decent power supply and PCIe (v2 or v3), this is a pretty reasonable option.

I did a talk on how to get the card working back in August -- https://youtu.be/eMu7ynAwECY?t=2m53s

You need to jump through some hoops configuring CUDA; it might be worthwhile trying to install it on a supported system, i.e. Ubuntu LTS. There are also a lot of different libraries with different licensing from nvidia, so even when most things work, some might still not. The most recent example for me was scale_npp, the accelerated video rescaling I used for downscaling 4K and creating proxies -- https://developer.nvidia.com/ffmpeg Everything was working except for this one thing, and I had to recompile it myself. This is one of the things that seems to be at least twice as fast as doing it on an 8-core CPU. So if something isn't working or isn't giving you the expected performance, look in the logs.

Alex.

On 2019-10-02 8:55 p.m., Stewart C. Russell via talk wrote:
On 2019-07-20 2:47 p.m., Alex Volkov via talk wrote:
I'm looking to buy a used Nvidia GeForce GTX 1060 to run some ML tutorials.
Did this work out for you? I find myself in the market for a CUDA-capable card to run Meshroom — https://alicevision.org/#meshroom — a well-regarded photogrammetry suite. It only works on CUDA-equipped systems.
I don't need to spend much. Technically, the package will run on my 2013 Samsung Chronos ultrabook with a GT 640M graphics card, but it's so slow and hot that it's not worth the bother.
cheers,
 Stewart
participants (5):
- Alex Volkov
- Alex Volkov
- D. Hugh Redelmeier
- Stewart C. Russell
- xerofoify