Date: April 30, 2018
Author(s): Rob Williams
There hasn’t been a great deal of movement on the ProViz side of the graphics card market in recent months, so now seems like a great time to get up to speed on the current performance outlook. Equipped with 12 GPUs, one multi-GPU config, current drivers, and a gauntlet of tests, let’s find out which cards deserve your attention.
We’ve been keeping busy with workstation-related content at SmartKevin recently, with articles involving a performance look at Chaos Group’s upcoming V-Ray 4.0 and AMD’s Radeon ProRender, as well as a recap of what we learned at NVIDIA’s GTC 2018.
One thing we haven’t posted lately is an updated performance look across the entire fleet of workstation GPUs we have available to us. The last time we tackled WS GPU performance in any depth was following the launch of AMD’s Radeon RX Vega series, where we discovered that Vega isn’t to be messed with on the compute side.
In this updated look, the Vega 64 makes a return with updated numbers, and the RX 580 has joined in on the fun as well, to give us a look at non-Pro Polaris performance. That comes in addition to NVIDIA’s gaming-bound GeForce GTX 1080 Ti, and also the TITAN Xp (x2). For actual pro cards, we have the Quadro P2000, P4000, P5000, P6000, and Radeon Pro WX 3100, WX 4100, WX 5100, and WX 7100.
A handful of GPUs are missing from the list above, such as AMD’s Frontier Edition and WX 9100, as well as NVIDIA’s Quadro GV100 and TITAN V. The worst of these to leave out are the TITAN V and WX 9100, but I’ll be covering the possibilities of those cards as the article progresses.
As always, the tests we chose to run on these GPUs tackle many different scenarios, including rendering, encoding, crypto and other mathematics, viewport interactions, and a bit of gaming. There’s also something a bit special: this article introduces our first deep-learning benchmarks, which will pave the way for more comprehensive looks in the future.
To get a move on, let’s take a look at the current product stacks from both AMD and NVIDIA:
|GPU||Cores||Base MHz||Peak FP32||Memory||Bandwidth||TDP||Price|
|RX Vega 64||4096||1247||12.6 TFLOPS||8 GB 2||483.8 GB/s||295W||$499|
|RX 580||2304||1257||6.2 TFLOPS||8 GB 1||256 GB/s||185W||$229|
|Frontier||4096||1382||13.1 TFLOPS||16 GB 2||484 GB/s||300W||$999|
|WX 9100||4096||1200||12.3 TFLOPS||16 GB 3||484 GB/s||250W||$2,199|
|WX 7100||2304||900||5.73 TFLOPS||8 GB 1||224 GB/s||130W||$799|
|WX 5100||1792||926||3.89 TFLOPS||8 GB 1||160 GB/s||75W||$499|
|WX 4100||1024||925||2.46 TFLOPS||4 GB 1||96 GB/s||50W||$399|
|WX 3100||512||1219||1.25 TFLOPS||4 GB 1||96 GB/s||50W||$200|
|WX 2100||512||1219||1.25 TFLOPS||2 GB 1||48 GB/s||35W||$200|
|Notes||1 GDDR5; 2 HBM2; 3 HBM2 + ECC|
An italicized name means we don’t have that card for testing.
Prices listed as MSRP, retail price may vary.
We don’t have one here, but I feel like the Frontier Edition would be the best overall choice in AMD’s lineup for those who want top-end performance across a range of scenarios. I’m not entirely sure how its strengths compare to RX Vega, but given current GPU pricing, the Frontier Edition can typically be had for $900, which isn’t a massive premium over the Vega 64 at miner-fueled pricing.
I’d wager that the WX 9100 has more key optimizations than the Frontier Edition for higher-end workloads, but this information is hard to guess without access to the card. That card is an obvious choice for those with critical workloads, since the HBM2 has ECC capabilities.
For general compute where scenario-driven optimizations are not needed, RX Vega 64 is impossible to beat within AMD’s own lineup. You’ll see some great performance in specific areas throughout the article, especially with regards to OpenCL rendering and cryptography. In an SRP world, the Vega 64’s price tag delivers a healthy wallop of performance to the dollar.
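To put a rough number on "performance to the dollar", the sketch below divides each card's peak FP32 figure by its list price, using only the values from the table above. It is a back-of-the-envelope illustration, not a measured result: peak TFLOPS is a theoretical ceiling, street prices differ from SRP, and the function names are made up for this example.

```python
# Peak FP32 (TFLOPS) and list price (USD), copied from the AMD table above.
AMD_CARDS = {
    "RX Vega 64": (12.6, 499),
    "RX 580": (6.2, 229),
    "Frontier": (13.1, 999),
    "WX 9100": (12.3, 2199),
    "WX 7100": (5.73, 799),
    "WX 5100": (3.89, 499),
    "WX 4100": (2.46, 399),
    "WX 3100": (1.25, 200),
}


def gflops_per_dollar(tflops: float, price_usd: float) -> float:
    """Theoretical peak GFLOPS bought per dollar of list price."""
    return tflops * 1000.0 / price_usd


def rank_by_value(cards):
    """Return (name, GFLOPS per dollar) pairs, best value first."""
    scored = [(name, gflops_per_dollar(t, p)) for name, (t, p) in cards.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, value in rank_by_value(AMD_CARDS):
        print(f"{name:12s} {value:6.1f} GFLOPS/$")
```

On SRP alone, the RX 580 and Vega 64 land well ahead of the Radeon Pro cards by this metric, which says nothing about the driver optimizations the Pro cards are priced for.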
NVIDIA’s lineup is a bit larger than AMD’s, especially on the top-end, where two Volta cards sit.
|GPU||Cores||Base MHz||Peak FP32||Memory||Bandwidth||TDP||Price|
|TITAN V||5120||1200||14.9 TFLOPS||12 GB 2||653 GB/s||250W||$3,000|
|TITAN Xp||3840||1405||12.1 TFLOPS||12 GB 4||548 GB/s||250W||$1,199|
|GTX 1080 Ti||3584||1480||11.8 TFLOPS||11 GB 4||484 GB/s||250W||$649|
|GV100||5120||1200||14.9 TFLOPS||32 GB 3||870 GB/s||250W||$8,999|
|P6000||3840||1417||11.8 TFLOPS||24 GB 5||432 GB/s||250W||$4,999|
|P5000||2560||1607||8.9 TFLOPS||16 GB 5||288 GB/s||180W||$1,999|
|P4000||1792||1227||5.3 TFLOPS||8 GB 4||243 GB/s||105W||$799|
|P2000||1024||1370||3.0 TFLOPS||5 GB 4||140 GB/s||75W||$399|
|P1000||640||1354||1.9 TFLOPS||4 GB 4||80 GB/s||47W||$299|
|P620||512||1354||1.4 TFLOPS||2 GB 4||80 GB/s||40W||$199|
|P600||384||1354||1.2 TFLOPS||2 GB 4||64 GB/s||40W||$179|
|P400||256||1070||0.6 TFLOPS||2 GB 4||32 GB/s||30W||$139|
|Notes||1 GDDR5; 2 HBM2; 3 HBM2 + ECC; 4 GDDR5X; 5 GDDR5X + ECC|
An italicized name means we don’t have that card for testing.
Prices listed as MSRP, retail price may vary.
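The Peak FP32 column in both tables can be sanity-checked against the core counts. Assuming the usual convention that each core retires one fused multiply-add (two floating-point operations) per clock, the sketch below works out which clock each peak figure implies. The helper name is hypothetical, and the conclusion that the peaks are quoted at boost rather than base clocks is an inference from the arithmetic, not something the tables state.

```python
def implied_clock_mhz(peak_tflops: float, cores: int) -> float:
    """Clock (MHz) needed to hit peak_tflops at 2 FLOPs per core per cycle."""
    flops = peak_tflops * 1e12
    return flops / (2 * cores) / 1e6


# name: (cores, base MHz, peak FP32 TFLOPS), copied from the NVIDIA table above.
NVIDIA_CARDS = {
    "TITAN V": (5120, 1200, 14.9),
    "TITAN Xp": (3840, 1405, 12.1),
    "GTX 1080 Ti": (3584, 1480, 11.8),
    "P6000": (3840, 1417, 11.8),
    "P5000": (2560, 1607, 8.9),
    "P4000": (1792, 1227, 5.3),
}

if __name__ == "__main__":
    for name, (cores, base_mhz, tflops) in NVIDIA_CARDS.items():
        clock = implied_clock_mhz(tflops, cores)
        print(f"{name:12s} base {base_mhz} MHz, peak implies ~{clock:.0f} MHz")
```

For every row checked here, the implied clock sits above the listed base clock; the TITAN V, for instance, works out to roughly 1,455MHz against a 1,200MHz base.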
It’s hard to call a card like the TITAN V a good “bang-for-the-buck” with a $3,000 price tag, but it’s admittedly the most alluring card of the bunch to me – even more so than the GV100. Sure, that 32GB of ECC memory is nice, but given the Quadro-like performance delivered by the TITAN Xp in many cases, a TITAN V would offer a considerable performance gain over a card like the P6000. That’s ignoring the fact that the TITAN V includes Tensor cores as well, which, as we learned last month, are being used to complement NVIDIA’s AI-driven denoiser (but that’s only one of countless possibilities).
For easier-to-stomach pricing, the TITAN Xp at $1,200 offers tremendous value where more serious workloads are concerned, e.g. CATIA and Siemens NX (the latter of which exhibits a 20x performance boost over the GTX 1080 Ti). For raw compute, the 1080 Ti is an obvious choice, especially with its large 11GB of GDDR5X. With the performance delta so tight between the 1080 Ti and TITAN Xp, it’s little surprise that NVIDIA decided to transplant some Quadro optimizations to the higher-end option.
On the following pages, you’ll find the results of our WS GPU test gauntlet. As mentioned before, the tests chosen cover a wide range of scenarios, from rendering to compute, and include both synthetic benchmarks and tests with real-world applications from the likes of Adobe and Autodesk.
12 GPUs are being tested for this article, although because we had a second TITAN Xp on-hand, dual-GPU results will also appear throughout – if the dual-GPU configuration isn’t found in a chart, it means there was no performance scaling whatsoever across the multiple cards.
Here are the specs of the test machine:
|SmartKevin Workstation Test System|
|Processor||Intel Core i9-7980XE (18-core; 2.6GHz)|
|Motherboard||ASUS ROG STRIX X299-E GAMING|
|Memory||HyperX FURY (4x16GB; DDR4-2666 16-18-18)|
|Graphics||AMD Radeon RX Vega 64 8GB (Radeon 18.3.3)|
AMD Radeon RX 580 8GB (Radeon 18.3.3)
AMD Radeon Pro WX 7100 8GB (Radeon Pro 18.Q1)
AMD Radeon Pro WX 5100 8GB (Radeon Pro 18.Q1)
AMD Radeon Pro WX 4100 4GB (Radeon Pro 18.Q1)
AMD Radeon Pro WX 3100 4GB (Radeon Pro 18.Q1)
NVIDIA TITAN Xp 12GB (GeForce 391.01)
NVIDIA GeForce GTX 1080 Ti 11GB (GeForce 391.01)
NVIDIA Quadro P6000 24GB (Quadro 391.03)
NVIDIA Quadro P5000 16GB (Quadro 391.03)
NVIDIA Quadro P4000 8GB (Quadro 391.03)
NVIDIA Quadro P2000 5GB (Quadro 391.03)
|Storage||Kingston KC1000 960GB M.2 SSD|
|Power Supply||Corsair 80 Plus Gold AX1200|
|Chassis||Corsair Carbide 600C Inverted Full-Tower|
|Cooling||NZXT Kraken X62 AIO Liquid Cooler|
|Et cetera||Windows 10 Pro build 16299|
Ubuntu 16.04 (4.13 kernel)
|For an in-depth pictorial look at this build, head here.|
Benchmark results are categorized and spread across the following pages. On page 2, AMD’s ProRender and Chaos Group’s V-Ray take on Autodesk’s 3ds Max, while the Cadalyst benchmark is run through AutoCAD. Page 3 is home to our encode tests, as well as synthetic rendering benchmarks that you can run at home, for comparison’s sake.
SPEC produces so many benchmarks worthy of inclusion in our workstation GPU content that it’s earned itself its own page. So on page 4, SPECviewperf helps us gain an understanding of viewport performance across nine different viewsets. SPECapc 3ds Max 2015 and Maya 2017 finish things up with exhaustive tests in their namesake Autodesk products.
Like SPEC, Sandra’s test suite is large, so page 5 is dedicated to three of its tests: Cryptography, Financial Analysis, and Scientific Analysis. After a fair bit of research and tweaking, we’re proud to announce our very first set of deep learning benchmarks on page 6.
Some quick and dirty gaming benchmarks are featured on page 7: UL’s 3DMark and VRMark, as well as Unigine’s Superposition. Finally, the last page includes power results, as well as the final thoughts.
So without further ado, let’s get this train moving.
One of the best use cases for GPUs in rendering is to improve a scene’s lighting, something achieved through numerous ray tracing renderers found on the market, including AMD’s ProRender, and Chaos Group’s V-Ray.
ProRender can make use of NVIDIA hardware (but with a “warning”), and likewise, AMD will work with V-Ray – although I am still working on finding an ideal project that renders the same on both GeForce and Radeon.
Up first is ProRender in Autodesk’s 3ds Max 2017. 2018 is officially supported by AMD, even if it’s really difficult to find that out on the official website. I have not yet heard anything about support for 2019, but since ProRender development is still in full force, it’d be nice to see support added soon for Autodesk’s newest 3ds Max.
To test, the scene below was rendered with 250 iterations at 1080p. A production render would be better set to something like 2,500, but 250 is still plenty to separate the men from the boys in benchmarking. You can grab the project file yourself for free from the ProRender GitHub page.
Is it a surprise that AMD performs well here? As in, very well? By all appearances, ProRender doesn’t “favor” AMD’s hardware in the same way another renderer may run better on one architecture over another. ProRender runs entirely through OpenCL, so the better-tuned the hardware, the better the performance.
In this case, AMD’s strong OpenCL performance helps propel the Vega 64 to the top of the chart, sitting only behind the TITAN Xp. Another way to look at it is: Vega 64 also performs better than the more expensive GTX 1080 Ti. Not too shabby.
As you can see near the top of the chart, the dual-GPU configuration didn’t fare well at all. Based on what I’ve learned since testing, multi-GPU works fine for Radeon, so I’m not entirely sure what the problem was on the green side to weaken performance over a single card.
For a much more thorough look at ProRender performance, you should check out a dedicated article we posted a few weeks ago.
For V-Ray, Autodesk’s 3ds Max 2019 is used as the base. Whereas AMD’s Radeon ProRender supports a limited number of 3ds Max versions, Chaos supports every version all the way back to 2013, and with day one support for 2019, it’s safe to say Chaos’ developers are on top of things.
To test, the freely available ‘Teaset’ scene (found here) is rendered at 1080p with largely default settings. The changes involve the time limit and samples limit being disabled, and noise limit being set to 0.25, to allow a render to last just long enough to generate some meaningful results.
I need to do more testing with V-Ray on AMD hardware before introducing Radeons into this set of results, but initial experiences suggest that this particular project will not be suitable for all GPUs. On AMD hardware, elements of the scene don’t render properly, so some hooks seem to be off. V-Ray supports OpenCL just fine, so I ultimately would like to find a project that renders exactly the same across CUDA and OpenCL. Once I find that, I’ll introduce AMD into these results.
That all said, I know that this is a GPU-focused article, but hot damn can a good CPU contribute a lot to a ray traced render – at least in this particular case. In time, I’d like to test CPU+GPU rendering in a handful of projects to get a better overall look on things.
Clearly, the faster the GPU, the faster the render (at least this one), but it’d be silly to run a top-end GPU alongside a small CPU when both can complement each other so well.
For a much more thorough look at V-Ray performance, you should check out a dedicated article we posted a few weeks ago.
Some of SPEC’s benchmarks on the following page take a look at CAD performance, but AutoCAD is left out. So with the help of Cadalyst, a benchmark produced by the people at the website of the same name, the application’s 3D performance is tested (along with I/O and CPU, but that isn’t needed here).
It’s not too often that we see a clear divide like the one here, where the entire AMD and NVIDIA product stacks are separated. I don’t hear AMD talk about AutoCAD much, and this performance is probably why. This is not to say that it’s “poor”, but if you work with AutoCAD day in and day out, NVIDIA gets the clear performance nod.
To test the accelerated encoding perks of different GPUs, Adobe’s Premiere Pro 2018 is used. For production, the best use of GPUs is to render the countless number of filters, or accelerate the scaling down to lower resolutions. Encoding one 1080p video to another might not exhibit much of a speed-up (if one at all) on the GPU, but 4K to 1080p could benefit.
Three projects are used for testing here. The Music Video one came straight from NVIDIA, so it’s clearly optimized for CUDA, but encodes fine on OpenCL. The 4K (~300Mbps) and 8K (~1.1Gbps) RED files are processed as straight-forward reencodes to lower resolutions.
Dropping the requirement for rendering effects, AMD claws its way to the very top, beating out NVIDIA’s entire stack, by a single second (and yes, it’s repeatedly 1 second quicker in both tests). 1 second difference is obviously pretty minor in the grand scheme, even when you take a full-length project into account, but for both 4K and 8K, you’ll definitely want to go higher than the WX 5100 series.
AMD might kick Intel’s ass on the CPU side of Cinebench, but it falls short against NVIDIA on the GPU side. But, so does NVIDIA’s own GeForce lineup, which falls quite a bit behind the Quadro cards, and surprisingly, behind AMD’s WX 7100, RX 580, and Vega 64. For the ultimate performance, NVIDIA has the definite lead here; even the P2000 beats out AMD’s top offerings.
This V-Ray benchmark is different from the real-world scene render on the previous page. This test is completely free to download, and uses CUDA for NVIDIA, and OpenCL for AMD. I do admittedly question the accuracy of the latter, and the results above can highlight why.
An obvious discrepancy here is that the WX 7100 falls behind every lower-tier WX card. What’s more, I often get very different results from run to run on Radeon cards in general, whereas I don’t on NVIDIA. For example, one run on a single card might be 4 minutes, whereas the second run will be 3 minutes (or vice-versa; I’ve had it start low and go higher on the second run).
On AMD, take these results with a grain of salt. On all of the AMD cards, I ran the test as many times as it took to get a scalable result, but with the WX 7100, no amount of runs would change that result. It’s not the first time I’ve run into issues with V-Ray bench’s results, and it’s unfortunate that fresh issues have seemingly replaced the old ones. If you happen to run AMD, and don’t have this same issue (especially with the WX 7100 in particular), please leave a comment.
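One way to make "very different results from run to run" concrete is to compute the coefficient of variation (standard deviation divided by the mean) over repeated runs and flag anything above a tolerance. The sketch below is only an illustration of that idea: the 5% tolerance and the function names are assumptions for the example, not part of our test procedure.

```python
from statistics import mean, stdev


def coefficient_of_variation(times_s):
    """Sample standard deviation divided by the mean (dimensionless)."""
    if len(times_s) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return stdev(times_s) / mean(times_s)


def is_stable(times_s, max_cv=0.05):
    """True when run-to-run spread is within the chosen tolerance."""
    return coefficient_of_variation(times_s) <= max_cv


if __name__ == "__main__":
    # The 4-minute and 3-minute runs described above, in seconds.
    radeon_runs = [240.0, 180.0]
    print(f"CV = {coefficient_of_variation(radeon_runs):.1%}, "
          f"stable = {is_stable(radeon_runs)}")
```

A pair of runs that differ by a full minute works out to a spread of about 20%, far outside any tolerance that would let the result be charted with confidence.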
OctaneBench is a CUDA-only benchmark, hence the complete lack of Team Red. These results pretty much scale as one would expect, especially since Quadro has no apparent advantage here; the faster the hardware, the better the performance. Especially if you pile on additional GPUs.
With LuxMark, we once again see AMD’s strong OpenCL performance delivering great results, sliding the RX Vega 64 right in behind NVIDIA’s TITAN Xp. Oddly, the Quadro P5000 runs worse than the P4000 in two of the renders, which is something I can’t explain. I encountered the same results last summer, so whatever causes the difference isn’t something as simple as a driver issue.
When it comes to benchmarking hardware for serious use cases, there is no place better to look than SPEC. I’ve dubbed the folks there “the masters of benchmarking”, as each one of SPEC’s tools is meticulously crafted by professionals to deliver results as relevant and accurate as possible – a goal shared by us at SmartKevin.
Three SPEC suites are used for testing here, starting with SPECviewperf, for viewport performance across nine applications. Finally, SPECapc 3ds Max 2015 and Maya 2017 finish up the page to help us gauge performance in the respective Autodesk applications. We used to include SPECwpc, but it’s far better suited for testing complete systems, not individual components.
Results from 3ds Max through SPEC may seem redundant given the real-world testing in the same application a couple of pages ago, but it’s important to note the distinction between the tests: our tests exercise the rendering performance, whereas SPECviewperf tests the viewport performance inside of the application. AKA: the big window where the magic happens. If GPU A equals GPU B in rendering time, it may still differ in viewport performance, due to driver optimizations (one of the big reasons pro cards carry heftier price tags).
With the above tests, there’s no strong advantage for Quadro over GeForce, but there may be one for Radeon Pro over Radeon, based on the WX 7100 vs. Vega 64 Maya result. The 1080 Ti looks to be the best value card of this bunch, almost matching the performance of its bigger (twice as expensive) brother, and being noticeably better than the slower-clocked P6000.
The P6000 got slapped by the 1080 Ti in the previous set of tests, but it strikes back hard in the medical and energy tests. Given the scenarios, I’d assume that was because of the 24GB of memory, but my logs show that no more than 3.5GB of VRAM was used at any point during the entire viewperf run… so, I’d attribute it to general Quadro driver optimizations.
That especially seems likely when you consider the fact that the P5000 ranks so far ahead of Vega 64, even though the latter would defeat the former by a 5-10% margin in a gaming match-up. It’s clear where some NVIDIA driver R&D dollars have had focus.
Siemens NX is one of the highest-end CAD applications on the market, so it’s no surprise that NVIDIA has put a ton of focus into optimizing performance for it. Seeing as though SNX’s default renderer is based on NVIDIA’s CUDA-based Iray, some increased performance is expected, but the delta between top and bottom here is extreme.
Before moving on, don’t miss the sad-looking 1080 Ti sitting at the very bottom. GeForce gets absolutely no Siemens NX love, but TITAN gets love for both that and CREO – just not as much as Quadro, as we see the slower-clocked P6000 outpacing the TITAN Xp.
Like Siemens NX, Dassault’s CATIA and SolidWorks are competing high-end CAD applications, and once again, we can see the clear advantage Quadro users have. Whereas the TITAN Xp performed admirably against the P6000 in SNX, it falls to half the performance in SolidWorks. The CATIA result is less disjointed; interestingly, the P5000 matches the performance of the TITAN Xp there, but like the P6000, it offers incredible gains in SolidWorks.
If not for the RX 580 being included here, the fact that regular Radeons were not optimized for 3ds Max would escape me. AMD does have some clear optimization, as the slower WX 7100 outperforms the top Polaris card in the lineup. The Vega 64 can’t even match it, but comes a lot closer thanks to its added capability.
At 1080p, there’s another clear divide between AMD and NVIDIA, but things shake up to the most minor of degrees by swapping the positions of the Vega 64 and… lowly P2000. NVIDIA has clear advantages here as the P6000 outperforms the faster TITAN Xp once again.
SPEC’s Maya test paints a similar picture to the 3ds Max one, but AMD gained in some unexpected places. The WX 7100 outperforms every other Radeon, which makes me think that a card like the WX 9100 could at least match the top NVIDIA options here. At 4K, the Vega 64 takes the AMD crown.
Overall, NVIDIA leads the pack, not just with the Quadro offerings, but TITAN and GeForce as well. Optimizations in that camp mean you get huge value with a GPU like the GTX 1080 Ti, and a lot of value with mainstream Quadros like the P4000. But AMD offers a lot of value at the same performance spot, and even better at 4xAA. Only 1080 Ti, TITAN Xp, and P5000+ best it.
On the previous page, I mentioned that SPEC is an organization that crafts some of the best, most comprehensive benchmarks going, and in a similar vein, I can compliment SiSoftware. This is a company that thrives on offering support for certain technologies before those technologies are even available to the consumer. In that regard, its Sandra benchmark might seem a little bleeding-edge, but at the same time, its tests are established, refined, and accurate across multiple runs.
While Sandra offers a huge number of benchmarks, just three of the GPU ones are focused on: Cryptography, Financial Analysis, and Scientific Analysis. Some of the results are a bit too complex for a graph, so a handful of tables are coming your way.
It’s hard to talk about these results without first drawing attention to the top: Vega 64 manages to match SLI’d TITAN Xps (which actually scale properly) in the encrypt/decrypt test, improves upon the TITAN Xp in the crypto test, and comes close to the same GPU in the hashing test. Where crypto and hashing are concerned, nothing can touch the dual-GPU configuration, but this will come as a surprise to no GPU miner.
For cryptography, AMD easily wins here. The company must care an awful lot about crypto, because we’ve seen similar gains on the Ryzen side of things as well. For a fun comparison, look at the WX 3100 versus the Vega 64. The difference is close to being 1:10, but of course, they target very different customer bases. Even so, there’s a lot of value to be had from Vega.
|Sandra 2017 – Financial Analysis (FP32)|
|NVIDIA TITAN Xp x 2||26 G/s||4.4 M/s||11.1 M/s|
|NVIDIA TITAN Xp||14 G/s||2.3 M/s||5.7 M/s|
|NVIDIA GeForce GTX 1080 Ti||11.6 G/s||2.1 M/s||5.38 M/s|
|NVIDIA Quadro P6000||11.6 G/s||2.2 M/s||5.9 M/s|
|AMD Radeon RX Vega 64||9.3 G/s||2.7 M/s||4.2 M/s|
|NVIDIA Quadro P5000||7.8 G/s||1.7 M/s||4.2 M/s|
|NVIDIA Quadro P4000||6.6 G/s||845.6 k/s||2.2 M/s|
|AMD Radeon RX 580||5.8 G/s||1.5 M/s||2.3 M/s|
|AMD Radeon Pro WX 7100||5.3 G/s||1.3 M/s||1.9 M/s|
|NVIDIA Quadro P2000||3.8 G/s||653.7 k/s||1.6 M/s|
|AMD Radeon Pro WX 5100||3.7 G/s||530.3 k/s||736.2 k/s|
|AMD Radeon Pro WX 4100||2.2 G/s||497.8 k/s||728 k/s|
|AMD Radeon Pro WX 3100||2.5 G/s||320.6 k/s||467.4 k/s|
|Results in options-per-second. 1 GOPS = 1,000 MOPS; 1 MOPS = 1,000 kOPS.|
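Because the table mixes G/s, M/s, and k/s in the same column, it helps to normalize everything to plain options-per-second before comparing cards. The sketch below applies the conversion from the table's footnote; the function name and the cell format it expects (a number, a space, then the unit) are assumptions based on how the cells are written above.

```python
# Scale factors from the footnote: 1 G = 1,000 M; 1 M = 1,000 k.
UNIT_SCALE = {"k/s": 1e3, "M/s": 1e6, "G/s": 1e9}


def to_options_per_second(cell: str) -> float:
    """Convert a table cell such as '845.6 k/s' to options per second."""
    value, unit = cell.split()
    if unit not in UNIT_SCALE:
        raise ValueError(f"unknown unit: {unit!r}")
    return float(value) * UNIT_SCALE[unit]


if __name__ == "__main__":
    # Second results column: Quadro P4000 vs. GeForce GTX 1080 Ti.
    p4000 = to_options_per_second("845.6 k/s")
    gtx_1080_ti = to_options_per_second("2.1 M/s")
    print(f"1080 Ti is {gtx_1080_ti / p4000:.2f}x the P4000 in this column")
```

With the units flattened out, mixed-unit cells like the P4000's 845.6 k/s can be compared directly against the M/s figures above it.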
|Sandra 2017 – Financial Analysis (FP64)|
|NVIDIA TITAN Xp x 2||2.7 G/s||274 k/s||554 k/s|
|AMD Radeon RX Vega 64||2.1 G/s||181 k/s||515.1 k/s|
|NVIDIA TITAN Xp||1.5 G/s||143.4 k/s||297.2 k/s|
|NVIDIA GeForce GTX 1080 Ti||1.4 G/s||135.4 k/s||265.8 k/s|
|NVIDIA Quadro P6000||1.3 G/s||131.3 k/s||271.3 k/s|
|AMD Radeon RX 580||1.1 G/s||90.1 k/s||280.4 k/s|
|NVIDIA Quadro P5000||908.7 M/s||91.7 k/s||188.4 k/s|
|AMD Radeon Pro WX 7100||962.6 M/s||81.27 k/s||239.2 k/s|
|NVIDIA Quadro P4000||565.9 M/s||55.5 k/s||110.7 k/s|
|AMD Radeon Pro WX 5100||456.2 M/s||52.7 k/s||108.8 k/s|
|AMD Radeon Pro WX 4100||384 M/s||34 k/s||95.2 k/s|
|NVIDIA Quadro P2000||359.6 M/s||36 k/s||74.8 k/s|
|AMD Radeon Pro WX 3100||219.1 M/s||17.9 k/s||54.8 k/s|
|Results in options-per-second. 1 GOPS = 1,000 MOPS; 1 MOPS = 1,000 kOPS.|
Not a single one of the GPUs featured here supports proper double-precision performance (like a Radeon Instinct, NVIDIA Tesla, or Quadro GP100/GV100 would provide), so all of the performance on that side of the fence is kind of pointless, because no one revolves their work around FP64 and opts for capped hardware. That said, when we compare the Vega 64 to 1080 Ti, there are some clear advantages from the red team.
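To see how differently the two vendors' cards fall off when moving to double precision, the sketch below divides each card's FP32 result by its FP64 result, using the first results column of the two Financial Analysis tables above. The dictionary and function names are made up for the example, and these are measured benchmark ratios, not the architectural FP64 rates from a spec sheet.

```python
# First results column (G/s) from the FP32 and FP64 tables above.
FIRST_COLUMN_GPS = {
    "NVIDIA TITAN Xp": (14.0, 1.5),
    "NVIDIA GeForce GTX 1080 Ti": (11.6, 1.4),
    "NVIDIA Quadro P6000": (11.6, 1.3),
    "AMD Radeon RX Vega 64": (9.3, 2.1),
    "AMD Radeon RX 580": (5.8, 1.1),
}


def fp32_to_fp64_ratio(card: str) -> float:
    """How many times faster the card ran the FP32 test than the FP64 one."""
    fp32, fp64 = FIRST_COLUMN_GPS[card]
    return fp32 / fp64


if __name__ == "__main__":
    for card in FIRST_COLUMN_GPS:
        print(f"{card:28s} FP32 is {fp32_to_fp64_ratio(card):.1f}x FP64")
```

In this column the Radeons lose a factor of roughly 4 to 5 going to FP64, while the Pascal cards lose a factor of 8 to 9, which is why Vega 64 jumps ahead of the single TITAN Xp in the FP64 table.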
For the most part, raw performance pretty much dictates the ranking here, so NVIDIA’s TITAN Xp stays glued to the top for the single-precision tests. The rest of the cards slot in pretty much where we’d expect. Neither vendor seems to have exclusive optimizations for these computations, but overall, Vega 64 once again brings some great performance for its (SRP) price point.
|Sandra 2017 – Scientific Analysis (FP32)|
|GPU||GEMM||FFT||N-Body|
|NVIDIA TITAN Xp x 2||13 TFLOPS||503.2 GFLOPS||10.2 TFLOPS|
|NVIDIA TITAN Xp||6.8 TFLOPS||257.5 GFLOPS||5.2 TFLOPS|
|NVIDIA Quadro P6000||6.6 TFLOPS||157.2 GFLOPS||5.08 TFLOPS|
|NVIDIA GeForce GTX 1080 Ti||6 TFLOPS||216.3 GFLOPS||5 TFLOPS|
|AMD Radeon RX Vega 64||6 TFLOPS||326.9 GFLOPS||4.8 TFLOPS|
|NVIDIA Quadro P5000||4.6 TFLOPS||106.7 GFLOPS||3.5 TFLOPS|
|NVIDIA Quadro P4000||3.1 TFLOPS||128.8 GFLOPS||1.8 TFLOPS|
|AMD Radeon RX 580||3.5 TFLOPS||227.6 GFLOPS||3.2 TFLOPS|
|AMD Radeon Pro WX 7100||2.8 TFLOPS||205 GFLOPS||2.2 TFLOPS|
|NVIDIA Quadro P2000||1.9 TFLOPS||86 GFLOPS||1.6 TFLOPS|
|AMD Radeon Pro WX 5100||1.1 TFLOPS||143.2 GFLOPS||860.8 GFLOPS|
|AMD Radeon Pro WX 4100||1.1 TFLOPS||83 GFLOPS||875.3 GFLOPS|
|AMD Radeon Pro WX 3100||750.6 GFLOPS||69.5 GFLOPS||646.4 GFLOPS|
|GEMM = General Matrix Multiply; FFT = Fast Fourier Transform; N-Body = N-Body Simulation.|
|Sandra 2017 – Scientific Analysis (FP64)|
|GPU||GEMM||FFT||N-Body|
|NVIDIA TITAN Xp x 2||661.8 GFLOPS||365.5 GFLOPS||482.4 GFLOPS|
|AMD Radeon RX Vega 64||608.6 GFLOPS||154.8 GFLOPS||460.3 GFLOPS|
|NVIDIA TITAN Xp||357.2 GFLOPS||198.2 GFLOPS||279.3 GFLOPS|
|AMD Radeon RX 580||342.3 GFLOPS||88.7 GFLOPS||223.9 GFLOPS|
|NVIDIA GeForce GTX 1080 Ti||336.8 GFLOPS||166.5 GFLOPS||266.4 GFLOPS|
|NVIDIA Quadro P6000||322.7 GFLOPS||133.4 GFLOPS||252.6 GFLOPS|
|AMD Radeon Pro WX 7100||299.6 GFLOPS||81.7 GFLOPS||201.6 GFLOPS|
|NVIDIA Quadro P5000||225.5 GFLOPS||84.8 GFLOPS||180.8 GFLOPS|
|AMD Radeon Pro WX 5100||148.7 GFLOPS||58.9 GFLOPS||114.9 GFLOPS|
|NVIDIA Quadro P4000||133.7 GFLOPS||87 GFLOPS||113.9 GFLOPS|
|AMD Radeon Pro WX 4100||113.6 GFLOPS||33.2 GFLOPS||84.6 GFLOPS|
|NVIDIA Quadro P2000||89.1 GFLOPS||54.3 GFLOPS||83.7 GFLOPS|
|AMD Radeon Pro WX 3100||66.4 GFLOPS||33.4 GFLOPS||50.1 GFLOPS|
|GEMM = General Matrix Multiply; FFT = Fast Fourier Transform; N-Body = N-Body Simulation.|
Once again, the Vega 64 proved one of the mightiest options for the double-precision test, and it battles nicely with the (technically) faster GTX 1080 Ti. For single-precision, a single TITAN Xp proves best, but the more, the merrier: the scaling seen is fantastic.
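The scaling claim can be checked directly from the FP32 table: scaling efficiency is the dual-GPU result divided by twice the single-GPU result. The sketch below does that for the TITAN Xp rows. The column names follow the order given in the table footnote, which is an assumption, and the function name is hypothetical.

```python
def scaling_efficiency(single: float, multi: float, gpus: int = 2) -> float:
    """Fraction of ideal linear scaling achieved (1.0 = perfect)."""
    return multi / (single * gpus)


# (single TITAN Xp, dual TITAN Xp) from the FP32 Scientific Analysis table.
# Column order is assumed to match the footnote: GEMM, FFT, N-Body.
TITAN_XP_FP32 = {
    "GEMM (TFLOPS)": (6.8, 13.0),
    "FFT (GFLOPS)": (257.5, 503.2),
    "N-Body (TFLOPS)": (5.2, 10.2),
}

if __name__ == "__main__":
    for test, (single, dual) in TITAN_XP_FP32.items():
        print(f"{test:16s} {scaling_efficiency(single, dual):.1%} of ideal")
```

All three single-precision tests come out above 95% of ideal two-card scaling.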
In recent years, deep-learning seemed to come out of nowhere, yet it’s already become one of the most important aspects of our computing. To solve complicated problems, serious hardware is needed, and in many cases, a single GPU is not going to fit the bill.
Whether we’re using our processors to answer complex biology puzzles, find images quicker, or detect someone’s emotion, one of the beauties of deep-learning is that the available frameworks have been built with scalability in mind. Our tests here were run on one or two GPUs, but if we had 1,000, we’d still see expected scaling.
At the moment, only two deep-learning tests are included here (a similar GEMM and FFT test is on the previous page), both of which are built around CUDA. When we’re able to find a real-world deep-learning test that works equally on both AMD and NVIDIA, we’ll explore it. We are in early days for this kind of testing.
The results below may appear to be simple, but they’re of utmost importance for a lot of deep-learning work. In this case, GEMM (general matrix-to-matrix multiplication) is accelerated through CUDA’s cuBLAS library, enabling huge performance that makes CPUs look truly inept for the same kind of (extremely scalable) workload.
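For readers unfamiliar with how a GEMM result turns into a FLOPS figure: multiplying an m×k matrix by a k×n matrix takes m·n·k multiply-adds, or 2·m·n·k floating-point operations, and dividing that by the elapsed time gives the throughput. The sketch below shows the arithmetic with a deliberately naive pure-Python multiply on the CPU. It is not the cuBLAS-based test we ran, and the function names are invented for the illustration.

```python
import time


def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations in C = A x B for A (m x k) and B (k x n)."""
    return 2 * m * n * k


def matmul(a, b):
    """Naive GEMM on lists of lists; slow on purpose, for illustration only."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for row in a
    ]


def measure_gflops(size: int = 64) -> float:
    """Time one size x size x size multiply and convert to GFLOPS."""
    a = [[1.0] * size for _ in range(size)]
    b = [[2.0] * size for _ in range(size)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    return gemm_flops(size, size, size) / elapsed / 1e9


if __name__ == "__main__":
    print(f"Pure-Python GEMM: {measure_gflops():.4f} GFLOPS")
```

The number this prints on a CPU will be a tiny fraction of a single GFLOPS, which gives some sense of scale for the multi-TFLOPS results a GPU library delivers on the same operation.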
CUDA allows us to exercise GEMM using half-, single-, and double-precision, which means that any GPU supporting FP16 or FP64 is going to see tremendous performance gains. Unfortunately, none of the GPUs we have here support either, so for these results (and the Caffe2 ones, for that matter), I’m including NVIDIA’s own TITAN V results for the sake of being able to deliver a fuller picture. If you’re wondering if its inclusion actually matters, take a gander:
Clearly, any GPU that supports half- or double-precision is going to smash any that don’t, so naturally, the TITAN V, with its uncapped performance, cleans house, while scaling expectedly on the single-precision front. The TITAN V is spec’d for ~25 TFLOPS half-precision, but when the Tensors become engaged, the performance soars through the roof.
It’s worth noting that AMD’s RX Vega offers extremely good half-precision performance as well: about 20 TFLOPS with the Vega 64, or about 25 TFLOPS with boosting. Given the differences with Tensors, it seems clear that AMD could do well to take a similar route as NVIDIA with its Instinct GPUs.
We saw above that if a GPU has Tensor cores, the gains in deep-learning are going to be significant. Whereas those results were a simple performance number (despite it being a grueling, long test), this Caffe2 test helps give a better perspective to the gains across one or more GPUs, helping you make the right purchase decision if you have hundreds of thousands, or many millions of images to train with.
This particular scenario values the GPU’s memory highly; the larger the memory pool, the larger the batches can be. Using a batch size of 32, which means 32 images are sent to the neural network each iteration, VRAM usage is around 8.6 GB. This means that our basic test here couldn’t be run on the Quadro P2000 or P4000.
The more VRAM and compute available, the more images can be processed in a batch. If the size is doubled to 64, that increases the VRAM requirement to about 14.5 GB. Finally, a batch size of 96 requires about 20 GB. With all of that in mind, here are the results:
For such a simple chart, there’s an awful lot to talk about here. Let’s start with the P6000, since it has the largest amount of memory (24GB). To my surprise, the P6000 managed to scale just fine all the way up to a batch size of 96, thanks to its ample VRAM. As noted above, each batch size (32, 64, 96) requires more and more memory, and only the P6000 could handle a batch size of 96 on its own (without available Tensors).
That leads us to the TITAN Xp. As a single card, a batch size of 32 must be used (14.5 GB is too much for the 12GB buffer). However, because this is compute and not graphics, combining two TITAN Xps allows us to use those larger batch sizes – the GPUs simply split up the workload.
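The batch-size limits described above boil down to a simple lookup: does the VRAM needed for a card's share of the batch fit in its frame buffer? The sketch below encodes the three measured requirements and the cards' memory sizes. It assumes an even split across GPUs and that a GPU handling half of a batch needs the same memory as a single-GPU run at that smaller batch size; both are simplifications, and the function returns `None` rather than guessing when the per-GPU batch size wasn't measured. The TITAN V is left out because its Tensor cores change the memory picture.

```python
# Approximate VRAM needed per batch size (GB), as measured in the Caffe2 test.
VRAM_NEEDED_GB = {32: 8.6, 64: 14.5, 96: 20.0}

# VRAM per card (GB), from the spec table.
CARD_VRAM_GB = {
    "P2000": 5,
    "P4000": 8,
    "GTX 1080 Ti": 11,
    "TITAN Xp": 12,
    "P5000": 16,
    "P6000": 24,
}


def fits(card: str, batch_size: int, gpus: int = 1):
    """True/False if each GPU's share of the batch fits; None if unmeasured."""
    if batch_size % gpus:
        return None  # uneven split: outside what this sketch models
    needed = VRAM_NEEDED_GB.get(batch_size // gpus)
    if needed is None:
        return None  # no measurement for this per-GPU batch size
    return needed <= CARD_VRAM_GB[card]


if __name__ == "__main__":
    print("TITAN Xp, batch 64, 1 GPU :", fits("TITAN Xp", 64))
    print("TITAN Xp, batch 64, 2 GPUs:", fits("TITAN Xp", 64, gpus=2))
    print("P6000,    batch 96, 1 GPU :", fits("P6000", 96))
```

A batch of 96 split across two cards gives 48 images per GPU, which wasn't measured on its own, so the sketch reports that case as unknown instead of extrapolating.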
Then there’s the TITAN V. Its 12GB buffer escaped its fate as a bottleneck because of the Tensor cores; when the Tensor cores can work in conjunction with the regular CUDA cores, memory usage is reduced. To what extent, I’m not sure, but a test that’s 20 GB on one GPU and is able to run inside of 12 GB on the TITAN V is an impressive enough takeaway.
Based on these Caffe2 results, it seems likely that a GPU like the Quadro GV100 could greatly outperform the TITAN V thanks to its increased memory (32GB vs. 12GB).
Gaming performance is generally not a big focus for professional GPU lineups, but the fact of the matter is, they can game. That especially applies to the top-tier cards, as they usually perform pretty much identically to their gaming series brethren.
On the AMD side, Radeon Pro users can opt to use a Radeon Adrenalin gaming driver in addition to the standard RPro driver, which means that any gaming optimizations the company delivers to Radeon should also apply on Radeon Pro – as long as you are using the gaming driver at the time. However, that driver doesn’t seem to get updated too often; as of the time of writing, the most recent update was from December.
To my knowledge, NVIDIA’s professional drivers do not include the same per-title optimizations that the GeForce drivers do, so some performance might be lacking, but not too much.
The GeForce, TITAN, and Radeon cards in this lineup used their respective gaming drivers, while the professional cards were tested using their primary drivers (eg: Enterprise for Radeon Pro, and the latest WHQL for Quadro).
To get a quick gauge on the performance of our workstation GPU collection in gaming, we use UL’s (formerly Futuremark’s) 3DMark and Unigine’s Superposition.
Multi-GPU scaling isn’t perfect in every game, but when extra horsepower is properly taken advantage of, it can result in some very obvious gains. For single GPU, the TITAN Xp dominates, to be expected when you compare its spec sheet to the others. The 1080 Ti falls not too far behind that, but it keeps a fair distance ahead of the Vega 64.
Quadro doesn’t game better than GeForce, so these results expectedly scale. Well, for the most part. In the DX12 Time Spy test, the P6000 and 1080 Ti become equal, despite the Ti outperforming the P6000 in the DX11 Fire Strike test.
Note: These results are explicitly gaming-related, and do not represent the performance of virtual reality in enterprise applications.
VRMark’s Orange test represents the current state of VR, or perhaps where it was one year ago. Every GPU above the Vega 64 and P5000 will deliver great VR performance for current content, while the P4000 and RX 580 cut it close. No current GPU delivers suitable performance for the forward-looking Blue test.
Because multi-GPU for VR is virtually (heh) non-existent right now, not even VRMark will take advantage of the extra horsepower (which is why the SLI result is suddenly absent).
Like VRMark, Superposition looks stunning, but it fails to support multiple graphics cards. We still have at least one interesting result, though, in that the P5000 outperforms the Vega 64 in all but the heaviest test. This is a GTX 1080-equivalent card with more modest clocks, but it still performs great in comparison to its GeForce sibling.
To test the power consumption of our workstation graphics card fleet, we utilize a combination of hardware and software tools. Those include a Kill-a-Watt monitor (that plugs into the wall), and also SiSoftware’s Sandra to help push the GPUs to their limit – with the help of the GPU Processing test.
Once the test PC is booted to the desktop, it’s left to sit until the CPU and memory are effectively idle. At this point, the idle wattage is recorded off of the Kill-a-Watt, provided the reading is likewise stable. Sandra’s GP Processing test is then kicked off to stress the GPU, which is so effective that it only takes 30 – 60 seconds to get a stable peak.
To tackle the obvious “what?” in the chart above, the WX 4100 does in fact use less power than the WX 3100 when running Sandra’s general GPU processing test. The 1080 Ti and TITAN Xp are effectively equal in power draw, even though the TITAN Xp has additional cores. The power-hungriest card is AMD’s RX Vega 64, drawing considerably more power than the GTX 1080-esque P5000.
Meanwhile, the WX 7100 and P4000 class cards (and lower) basically sip power from the socket in comparison to the bigger cards. Overall, pretty expected results here.
A performance look at workstation graphics cards is far more difficult to summarize than with gaming options. Gaming GPUs typically scale pretty expectedly from game title to game title, but on the professional side of the market, the right optimizations can make a low-performance card outperform the competition’s high-end card. As I like to say, it pays to know your workload.
Both AMD and NVIDIA had many strengths, but to keep things simple, let’s tackle them one at a time.
With its GPUs, especially Vega, AMD delivers some explosive cryptography performance, pretty much matching the same kind of domination the company gives us with Zen (Threadripper is a real crypto beast). The company also exhibited great ray tracing performance with LuxMark and V-Ray on Vega, and perhaps not surprisingly, it slayed the Radeon ProRender test.
Overall, NVIDIA has more performance leads than AMD, thanks in part to the company’s aggressive and broad driver R&D efforts. In some cases, the Quadro line will perform an order of magnitude better than GeForce thanks to certain optimizations, such as in Siemens NX, which performs 22x better on the TITAN Xp, and 25x better on the Quadro P6000, than on the GTX 1080 Ti.
Both Autodesk’s 3ds Max and Maya tend to run better on NVIDIA, but in many cases, AMD’s (~$649 SRP) Radeon RX Vega 64 kept pretty close to the NVIDIA competition. That said, these are among the applications where lower-end Quadros (eg: P4000) perform better than the technically superior Vega 64. Again, “know your workload”.
On the deep-learning page, we saw what is made possible with Tensor cores, and while the entire focus of that page was on training and the like, those cores can also be used in rendering when using AI denoisers. NVIDIA has talked a lot about this technology since last fall, and recently, many renderer houses, such as Chaos Group, have announced plans to implement support for it. As with the DL/AI tests, Tensors can increase rendering denoising performance at least fivefold; it’s truly impressive to see.
Ultimately, there’s no such thing as a one-size-fits-all on the workstation side of the GPU market. In some cases, AMD performs better than NVIDIA, and vice versa. For general overall performance, NVIDIA offers the fastest GPU on the planet with its $3,000 TITAN V, which is 20-25% faster than the $1,200 TITAN Xp in single-precision workloads, so if money is no obstacle for you, it’d be hard to go wrong.
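For those where money is an obstacle, the same figures can be turned into a rough performance-per-dollar comparison. The sketch below uses only the prices and the 20-25% single-precision uplift quoted above; the TITAN Xp is set to 1.0 as an arbitrary baseline, and the function name is hypothetical.

```python
# Prices and the 20-25% uplift come from the article text; the
# baseline of 1.0 for the TITAN Xp is an arbitrary normalization.
TITAN_XP_PRICE = 1200
TITAN_V_PRICE = 3000


def relative_value(speedup, price, base_price):
    """Performance per dollar relative to the baseline card (1.0 = equal)."""
    return speedup / (price / base_price)


if __name__ == "__main__":
    for speedup in (1.20, 1.25):
        value = relative_value(speedup, TITAN_V_PRICE, TITAN_XP_PRICE)
        print(f"TITAN V at {speedup:.2f}x speed: {value:.2f}x the perf per dollar")
```

In single-precision terms, that works out to roughly half the performance per dollar of the TITAN Xp. This deliberately ignores the Tensor cores and anything else the TITAN V brings that a plain FP32 comparison cannot capture.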
On the AMD side, even though we didn’t have one for testing, the Frontier Edition, at about $900, seems like great value for workstation users. It’s effectively a Vega 64, but with twice as much memory (16GB), which as we found out on the deep-learning page, is important for those kinds of workloads. And since Vega has uncapped half-precision performance (but a lack of Tensors), it’s an alluring choice (especially when the Vega 64 is experiencing inflated pricing; it almost makes the FE a no-brainer).
Copyright © 2017 SmartKevin