HiSilicon's Kirin 950 proved to be a breakout product for the Huawei subsidiary, ultimately finding a home in many of Huawei's flagship phones, including the Mate 8, P9, P9 Plus, and Honor 8. Its big.LITTLE combination of four A72 and four A53 CPU cores manufactured on TSMC's 16nm FF+ FinFET process delivered excellent performance and efficiency. Somewhat surprisingly, it turned out to be one of the best, if not the best, implementations of ARM's IP we've seen.
Because of the 950's success, we were eager to see what improvements the Kirin 960 could offer. In our review of the Huawei Mate 9, the first device to use the new SoC, we saw gains in most of our performance and battery life tests relative to the Mate 8 and its Kirin 950 SoC. Now it's time to dive a little deeper and answer some of our remaining questions: How does IPC compare between the A73, A72, and other CPU cores? How is memory performance impacted by the A73's microarchitecture changes? Does CPU efficiency improve? How much more power do the extra GPU cores consume?
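HiSilicon High-End Kirin SoC Lineup

| SoC | Kirin 960 | Kirin 955 | Kirin 950 |
|---|---|---|---|
| CPU | 4x Cortex-A73 @ 2.36GHz, 4x Cortex-A53 @ 1.84GHz | 4x Cortex-A72 @ 2.52GHz, 4x Cortex-A53 @ 1.81GHz | 4x Cortex-A72 @ 2.30GHz, 4x Cortex-A53 @ 1.81GHz |
| GPU | ARM Mali-G71MP8, 1037MHz | ARM Mali-T880MP4, 900MHz | ARM Mali-T880MP4, 900MHz |
| Memory | 2x 32-bit LPDDR4 @ 1866MHz (29.9GB/s) | 2x 32-bit LPDDR3 @ 933MHz (14.9GB/s) or 2x 32-bit LPDDR4 @ 1333MHz (21.3GB/s) (hybrid controller) | 2x 32-bit LPDDR3 @ 933MHz (14.9GB/s) or 2x 32-bit LPDDR4 @ 1333MHz (21.3GB/s) (hybrid controller) |
| Interconnect | ARM CCI-550 | ARM CCI-400 | ARM CCI-400 |
| Storage | UFS 2.1 | eMMC 5.0 | eMMC 5.0 |
| ISP/Camera | Dual 14-bit ISP (Improved) | Dual 14-bit ISP, 940MP/s | Dual 14-bit ISP, 940MP/s |
| Encode/Decode | 2160p30 HEVC & H.264 Decode & Encode; 2160p60 HEVC Decode | 1080p H.264 Decode & Encode; 2160p30 HEVC Decode | 1080p H.264 Decode & Encode; 2160p30 HEVC Decode |
| Integrated Modem | Kirin 960 Integrated LTE (Category 12/13); DL = 600Mbps, 4x20MHz CA, 64-QAM; UL = 150Mbps, 2x20MHz CA, 64-QAM | Balong Integrated LTE (Category 6); DL = 300Mbps, 2x20MHz CA, 64-QAM; UL = 50Mbps, 1x20MHz CA, 16-QAM | Balong Integrated LTE (Category 6); DL = 300Mbps, 2x20MHz CA, 64-QAM; UL = 50Mbps, 1x20MHz CA, 16-QAM |
| Sensor Hub | i6 | i5 | i5 |
| Mfc. Process | TSMC 16nm FFC | TSMC 16nm FF+ | TSMC 16nm FF+ |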
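The memory bandwidth figures in the table follow from the bus configuration. As a rough sketch, assuming the quoted MHz values are I/O clock rates with two transfers per clock (double data rate) and 1GB = 10^9 bytes, the peak numbers can be reproduced like this. The function name is my own and not anything from HiSilicon's documentation.

```python
def peak_bandwidth_gbs(channels: int, bus_width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    Assumption: double data rate, i.e. two transfers per clock on each channel.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * 2  # DDR: two transfers per clock
    return channels * bytes_per_transfer * transfers_per_second / 1e9


if __name__ == "__main__":
    # Clock rates taken from the lineup table; all parts use 2x 32-bit channels.
    configs = {
        "Kirin 960 LPDDR4 @ 1866MHz": 1866,
        "Kirin 950/955 LPDDR3 @ 933MHz": 933,
        "Kirin 950/955 LPDDR4 @ 1333MHz": 1333,
    }
    for name, clock_mhz in configs.items():
        print(f"{name}: {peak_bandwidth_gbs(2, 32, clock_mhz):.1f} GB/s")
```

These are theoretical ceilings only; the measured bandwidth later in the article is well below them.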
The Kirin 960 is the first SoC to use ARM's latest A73 CPU cores, which seems fitting considering the Kirin 950 was the first to use ARM's A72. Its CPU core frequencies see a negligible increase relative to the Kirin 950: 1.81GHz to 1.84GHz for the four A53s and 2.30GHz to 2.36GHz for the four A73s. Setting the peak operating point for the A73 cores lower than the 2.52GHz used by Kirin 955's A72 cores, and lower still than the 2.8GHz that ARM targets for 16nm, is an interesting and deliberate choice by HiSilicon to limit the CPU's power envelope, allowing the bigger GPU to take a larger chunk.
We've already discussed the A73's microarchitecture in depth, so I'll just summarize a few of the highlights. For starters, the A73 stems from the A17 and does not belong to the A15/A57/A72 Austin family tree. This means the differences between the A72 and A73 are more substantial than the small change in product numbering would suggest, particularly in the CPU's front end.
The biggest difference is a reduction in decoder width, which is now 2-wide instead of 3-wide like the A72. This sounds like a downgrade on paper; however, there are likely some workloads where the A72's instruction fetch block fails to consistently saturate the decoder, so the actual performance impact of the A73's narrower decode stage may not be that severe.
In many cases, instruction dispatch throughput should actually improve relative to the A72. The A73's shorter pipeline reduces front-end latency, including 1-2 fewer cycles for the decoder, which can decode most instructions in a single cycle, and 1 fewer cycle for the fetch stage. The L1 instruction cache grows to 64KB and is optimized for better throughput, and changes to the instruction fetch block reduce instruction bubbles. ARM also says the A73 includes a new, more accurate branch predictor, with a larger BTAC (Branch Target Address Cache) structure and a new 64-entry "micro-BTAC" for accelerating branch prediction.
There are several other changes to the front end too, not to mention further along the pipeline, but it should be obvious by now that the A73 is a very different beast than the A72, grown from a different design philosophy. While the Austin family (A72) targeted industrial and low-power server applications in addition to mobile, the A73 focuses specifically on mobile, where power and area become an even higher priority. ARM says the A73 consumes 20%-30% less power than the A72 (same process, same frequency) and is up to 25% smaller (same process, same performance targets).
When it comes to Kirin 960's GPU, however, HiSilicon is clearly chasing performance instead of efficiency. With its previous SoCs, the Kirin 950/955 in particular, HiSilicon was criticized for using four-core Mali configurations while Samsung packed in eight or twelve Mali cores in its Exynos SoCs and Qualcomm squeezed more ALU resources into its Adreno GPUs. This was not entirely justified, though, because the Kirin 950's Mali-T880MP4 GPU was capable of playing nearly any game available at acceptable frame rates and the performance difference between the Mate 8 (Kirin 950), Samsung Galaxy S7 edge (Snapdragon 820), and Galaxy S7 (Exynos 8890) after reaching thermal equilibrium is minimal.
Whether in response to this criticism or to enable future use cases such as VR/AR, HiSilicon has significantly increased the Kirin 960's peak GPU performance. Not only is it the first to use ARM's latest Mali-G71 GPU, but it doubles core count to eight and boosts the peak frequency to 1037MHz, 15% higher than the 950's smaller GPU.
The Mali-G71 uses ARM's new Bifrost microarchitecture, which moves from an SIMD ISA that relied on Instruction Level Parallelism (ILP) to a scalar ISA designed to take advantage of Thread Level Parallelism (TLP) like modern desktop GPU architectures from Nvidia and AMD. I'm not going to explain the difference in depth here, but basically this change allows better utilization of the shader cores, increasing throughput and performance. ARM's previous Midgard microarchitecture needed to extract 4 instructions from a single thread and execute them concurrently to achieve full utilization of a single shader core, which is not easy to do consistently. In contrast, Bifrost can group 4 separate threads together on a shader core and execute a single instruction from each one, which is more in line with modern graphics and compute workloads.
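To make the utilization argument concrete, here is a deliberately simplified toy model. It is not ARM's scheduling logic: the function names and the example instruction mix are hypothetical, and it ignores divergence, latency hiding, and register pressure. It only counts how many of 4 lanes do useful work when the parallelism has to come from one thread (Midgard-style) versus from 4 threads running in lockstep (Bifrost-style).

```python
from math import ceil


def simd_utilization(independent_ops_per_step, width=4):
    """Lane utilization when a single thread must fill a `width`-wide unit.

    independent_ops_per_step[i] is how many independent scalar operations
    were found at step i of one thread (a stand-in for available ILP).
    """
    issue_slots = sum(ceil(n / width) for n in independent_ops_per_step) * width
    return sum(independent_ops_per_step) / issue_slots


def quad_utilization(independent_ops_per_step, active_threads=4, width=4):
    """Lane utilization when each lane runs one scalar op from its own thread.

    All threads execute the same instruction stream in lockstep, so lanes
    sit idle only when fewer than `width` threads are available.
    """
    scalar_ops_per_thread = sum(independent_ops_per_step)
    cycles = scalar_ops_per_thread * ceil(active_threads / width)
    return scalar_ops_per_thread * active_threads / (cycles * width)


if __name__ == "__main__":
    mix = [3, 1, 2, 4]  # hypothetical shader: independent ops available per step
    print(f"one thread filling 4 lanes: {simd_utilization(mix):.3f}")
    print(f"4 threads, one lane each:   {quad_utilization(mix):.3f}")
    print(f"only 2 threads available:   {quad_utilization(mix, active_threads=2):.3f}")
```

In this model the thread-grouped approach stays fully utilized as long as enough threads exist, which graphics workloads usually provide, while the single-thread approach depends on how much ILP the compiler can find at each step.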
Now that we have a better understanding of Kirin 960's design goals (better efficiency for the CPU and higher peak performance for the GPU) and a summary of the hardware changes HiSilicon made to achieve them, we're ready to see how the performance and power consumption of the Kirin 960 compares to the 950/955 and other recent SoCs.
We'll begin our Kirin 960 performance evaluation by investigating the A73's integer and floating-point IPC with some synthetic tests. Then we'll see how the changes to its memory system affect memory latency and bandwidth. Finally, after completing the lower-level tests, we'll see how Huawei's Mate 9 and its Kirin 960 SoC perform when running some real-world workloads.
Our first look at the A73's integer performance comes from SPECint2000, the integer component of the SPEC CPU2000 benchmark developed by the Standard Performance Evaluation Corporation. This collection of single-threaded tests allows us to compare IPC for competing CPU microarchitectures. The scores below are not officially validated numbers, which requires the test to be supervised by SPEC, but we've done our best to choose appropriate compiler flags and to get the tests to pass internal validation.
SPECint2000 - Estimated Scores (ARMv8 / AArch64)

| Test | Kirin 960 | Kirin 950 (% Advantage) | Exynos 7420 (% Advantage) | Snapdragon 821 (% Advantage) |
|---|---|---|---|---|
| 164.gzip | 1217 | 1094 (11.3%) | 940 (29.5%) | 1273 (-4.4%) |
| 175.vpr | 4118 | 3889 (5.9%) | 2857 (44.1%) | 1687 (144.1%) |
| 176.gcc | 2157 | 1864 (15.7%) | 1294 (66.7%) | 1746 (23.5%) |
| 181.mcf | 1118 | 664 (68.3%) | 928 (20.5%) | 1200 (-6.8%) |
| 186.crafty | 2222 | 2083 (6.7%) | 1176 (88.9%) | 1613 (37.8%) |
| 197.parser | 1395 | 1208 (15.5%) | 933 (49.5%) | 1059 (31.8%) |
| 252.eon | 3421 | 3333 (2.6%) | 2453 (39.5%) | 3714 (-7.9%) |
| 253.perlbmk | 1748 | 1651 (5.8%) | 1216 (43.8%) | 1513 (15.5%) |
| 254.gap | 5238 | 4583 (14.3%) | 3438 (52.4%) | 4231 (23.8%) |
| 255.vortex | 2111 | 1863 (13.3%) | 1473 (43.3%) | 1712 (23.3%) |
| 256.bzip2 | 1402 | 1220 (15.0%) | 1079 (29.9%) | 1172 (19.6%) |
| 300.twolf | 2479 | 2521 (-1.7%) | 1887 (31.4%) | 847 (192.6%) |
The Kirin 960's A73 CPU is about 11% faster on average than the Kirin 950's A72. In addition to the front-end changes discussed on the previous page and the changes to the memory system discussed in the next section, the A73's integer pipelines have undergone a few tweaks as well. Where the A72 had 3 integer ALUs (2 simple ALUs for basic operations such as addition and shifting and 1 dedicated multi-cycle ALU for complex operations such as multiplication, division, and multiply-accumulate), the A73 only has 2 integer ALUs that are capable of performing both basic and complex operations. This affects performance in different ways. For example, because only one of the A73's ALUs can handle multiplication while the other handles division, the time to execute multiply or division operations sees no change; however, while an ALU is occupied with a multi-cycle instruction, it cannot execute simple instructions like the A72's dedicated pipelines can, leading to a potential performance loss. Multiply-accumulate operations, which require both of the A73's pipelines, incur a similar penalty. It's not all bad, however. Workloads that perform parallel arithmetic or use certain other complex instructions can see double the execution throughput on A73 versus A72.
Note that the table above does not account for differences in CPU frequency. The Kirin 960's frequency advantage over the Kirin 950 and Snapdragon 821 is less than 3%, making these numbers easier to compare, but its advantage over the Exynos 7420 is a little over 12%. The chart below accounts for this by dividing the estimated SPECint2000 ratio score by CPU frequency, making IPC comparisons easier.
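For readers who want to reproduce the normalization, the sketch below divides an aggregate SPECint2000 score by peak CPU frequency. Two assumptions are mine: the per-test scores are combined with a geometric mean (the usual SPEC convention, though not stated here), and each core ran at the peak frequency listed in this article. The helper names are illustrative only.

```python
from math import exp, log

# Estimated SPECint2000 scores from the table above, in table order
# (164.gzip through 300.twolf).
SPECINT_SCORES = {
    "Kirin 960": [1217, 4118, 2157, 1118, 2222, 1395, 3421, 1748, 5238, 2111, 1402, 2479],
    "Kirin 950": [1094, 3889, 1864, 664, 2083, 1208, 3333, 1651, 4583, 1863, 1220, 2521],
    "Exynos 7420": [940, 2857, 1294, 928, 1176, 933, 2453, 1216, 3438, 1473, 1079, 1887],
    "Snapdragon 821": [1273, 1687, 1746, 1200, 1613, 1059, 3714, 1513, 4231, 1712, 1172, 847],
}

# Peak big-core frequencies in GHz as quoted in this article (assumed to
# be the frequency the tests actually ran at).
PEAK_GHZ = {
    "Kirin 960": 2.36,
    "Kirin 950": 2.30,
    "Exynos 7420": 2.10,
    "Snapdragon 821": 2.342,
}


def geomean(values):
    """Geometric mean, computed in log space to avoid huge products."""
    values = list(values)
    return exp(sum(log(v) for v in values) / len(values))


def score_per_ghz(soc):
    """Aggregate score divided by frequency: a rough proxy for IPC."""
    return geomean(SPECINT_SCORES[soc]) / PEAK_GHZ[soc]


if __name__ == "__main__":
    baseline = score_per_ghz("Kirin 960")
    for soc in SPECINT_SCORES:
        value = score_per_ghz(soc)
        print(f"{soc:15s} {value:7.1f} per GHz  (Kirin 960 is {baseline / value:.2f}x)")
```

Because the published scores are rounded, the ratios from this sketch may differ slightly from the charted values.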
Despite the substantial microarchitectural differences between the A73 and A72, the A73's integer IPC is only 11% higher than the A72's. This is likely the result of improvements in one area being partially offset by regressions in another. Still, assuming ARM's power reduction claims hold true, this is not a bad result.
The gap between the A73 and A57 increases to 29%. The integer performance for Qualcomm's custom Kryo core is well behind ARM's A73 and A72 cores, essentially matching the A57's IPC.
Geekbench 4 - Integer Performance (Single Threaded)

| Test | Kirin 960 | Kirin 950 (% Advantage) | Exynos 7420 (% Advantage) | Snapdragon 821 (% Advantage) |
|---|---|---|---|---|
| AES | 911.3 MB/s | 935.6 MB/s (-2.59%) | 795.8 MB/s (14.52%) | 559.1 MB/s (63.00%) |
| LZMA | 3.03 MB/s | 2.87 MB/s (5.69%) | 2.28 MB/s (33.33%) | 2.20 MB/s (38.09%) |
| JPEG | 16.1 Mpixels/s | 15.5 Mpixels/s (3.66%) | 14.1 Mpixels/s (13.95%) | 21.6 Mpixels/s (-25.62%) |
| Canny | 22.5 Mpixels/s | 26.8 Mpixels/s (-16.06%) | 23.6 Mpixels/s (-4.80%) | 30.3 Mpixels/s (-25.77%) |
| Lua | 1.70 MB/s | 1.55 MB/s (10.13%) | 1.20 MB/s (41.94%) | 1.47 MB/s (16.14%) |
| Dijkstra | 1.53 MTE/s | 1.14 MTE/s (33.53%) | 0.92 MTE/s (65.12%) | 1.39 MTE/s (9.57%) |
| SQLite | 51.6 Krows/s | 43.5 Krows/s (18.62%) | 34.0 Krows/s (51.99%) | 36.7 Krows/s (40.73%) |
| HTML5 Parse | 8.30 MB/s | 6.79 MB/s (22.19%) | 6.37 MB/s (30.25%) | 7.61 MB/s (9.02%) |
| HTML5 DOM | 2.17 Melems/s | 1.92 Melems/s (12.82%) | 1.26 Melems/s (72.91%) | 0.37 Melems/s (489.09%) |
| Histogram Equalization | 48.7 Mpixels/s | 57.0 Mpixels/s (-14.56%) | 50.6 Mpixels/s (-3.66%) | 51.2 Mpixels/s (-4.82%) |
| PDF Rendering | 44.8 Mpixels/s | 45.5 Mpixels/s (-1.47%) | 39.7 Mpixels/s (12.93%) | 53.0 Mpixels/s (-15.36%) |
| LLVM | 194.4 functions/s | 167.9 functions/s (15.76%) | 128.6 functions/s (51.14%) | 113.5 functions/s (71.20%) |
| Camera | 5.45 images/s | 5.45 images/s (0.00%) | 4.95 images/s (10.17%) | 7.19 images/s (-24.12%) |
The updated Geekbench 4 workloads give us a second look at integer IPC. Similar to the SPECint2000 results, we see Kirin 960 showing 5% to 15% gains over Kirin 950 in several of the tests, but there's a bit more variation overall. The Kirin 960 is actually slower than Kirin 950 in some tests, and, in the case of Canny and Histogram Equalization, its A73 is even slower than the Exynos 7420's A57. It also falls behind Qualcomm's Kryo in the JPEG, PDF Rendering, and Camera tests. The tests where the Kirin 960 does well—HTML5 Parse, HTML5 DOM, and SQLite—are very common workloads, though, which should translate into better real-world performance.
The chart above accounts for differences in CPU frequency, making it easier to directly compare IPC. Overall the A73 shows only about a 4% improvement over the A72 and about a 12% improvement over the A57 in this group of workloads, considerably less than what we saw in SPECint2000; however, with margins ranging from 33.5% in Dijkstra to -16.1% in Canny, it's impossible to make any sweeping statements about the A73's integer performance being better or worse than the A72's.
Qualcomm's Kryo CPU falls just behind the A57 once again despite posting better results in many of the Geekbench integer tests. Its poor performance in LLVM and HTML5 DOM weighs heavily on its overall score.
I've also included results for ARM's in-order A53 companion core. The A73's integer IPC is 1.7x to 2x higher overall, which illustrates why octa-core A53 SoCs are so much slower, particularly in Web browsing, than designs that use 2-4 big cores (A73/A72/A57) instead of 4 additional A53s.
Geekbench 4 - Floating Point Performance (Single Threaded)

| Test | Kirin 960 | Kirin 950 (% Advantage) | Exynos 7420 (% Advantage) | Snapdragon 821 (% Advantage) |
|---|---|---|---|---|
| SGEMM | 10.7 GFLOPS | 13.9 GFLOPS (-23.44%) | 11.9 GFLOPS (-10.36%) | 12.2 GFLOPS (-12.57%) |
| SFFT | 2.89 GFLOPS | 2.26 GFLOPS (27.73%) | 2.62 GFLOPS (10.39%) | 3.21 GFLOPS (-10.07%) |
| N-Body Physics | 838.4 Kpairs/s | 896.9 Kpairs/s (-6.52%) | 634.5 Kpairs/s (32.14%) | 1156.7 Kpairs/s (-27.51%) |
| Rigid Body Physics | 5891.4 FPS | 6497.4 FPS (-9.33%) | 4662.7 FPS (26.35%) | 7171.3 FPS (-17.85%) |
| Ray Tracing | 221.9 Kpixels/s | 216.9 Kpixels/s (2.30%) | 136.1 Kpixels/s (63.07%) | 298.3 Kpixels/s (-25.59%) |
| HDR | 7.46 Mpixels/s | 7.57 Mpixels/s (-1.45%) | 7.17 Mpixels/s (4.09%) | 10.8 Mpixels/s (-30.90%) |
| Gaussian Blur | 23.6 Mpixels/s | 28.6 Mpixels/s (-17.37%) | 24.4 Mpixels/s (-2.94%) | 48.5 Mpixels/s (-51.27%) |
| Speech Recognition | 12.8 Words/s | 8.9 Words/s (44.14%) | 10.2 Words/s (25.49%) | 10.9 Words/s (17.43%) |
| Face Detection | 501.2 Ksubs/s | 518.9 Ksubs/s (-3.42%) | 435.5 Ksubs/s (15.09%) | 685.0 Ksubs/s (-26.83%) |
With the exception of SFFT and Speech Recognition, the Kirin 960 is generally a little slower than the Kirin 950 in Geekbench 4's floating-point workloads. This is a bit of a surprise considering that the A73's NEON execution units are relatively unchanged from the A72's design, with reduced latency for specific instructions improving NEON performance by 5%, according to ARM. These results are even harder to interpret after factoring in the A73's lower-latency front end and improvements to its fetch block and memory subsystems. It's possible that some of these tests are limited by the A73's narrower decode stage, but given the variation in workloads, this is probably not true for every case. It will be interesting to see if A73 implementations from other SoC vendors show similar results.
After accounting for the differences in CPU frequency, floating-point IPC for the Kirin 960's A73 is 3% to 5% lower overall than the A72 but about 3% higher than the older A57. These results, which are a geometric mean of the floating-point subtest scores, are certainly closer to what I would expect, but hide the large performance variation from one workload to the next.
It's pretty obvious that floating-point performance was Qualcomm's focus for its custom Kryo core. While integer IPC was no better than ARM's A57, Kryo's floating-point IPC is 23% higher than the A72 in Geekbench 4, with particularly strong results in the Gaussian Blur and HDR tests.
ARM made several improvements to the A73's memory system. Both L1 caches see an increase in size, with the I-cache growing from 48KB (A72) to 64KB and the D-cache doubling in size to 64KB. The A73 includes several other changes, such as enhanced prefetching, that should improve cache performance too.
The A73 still has 2 AGUs like the A72, but they are now capable of both load and store operations instead of having each AGU dedicated to a single operation like in the A72, which should improve the issue rate into main system memory.
The Kirin 960's larger 64KB L1 cache maintains a steady latency of 1.27ns versus 1.74ns for the Kirin 950, a 27% improvement that far exceeds the 2.6% difference in CPU frequency, highlighting the A73's L1 cache improvements. L2 cache latency is essentially the same, but Kirin 960 again shows a 27% latency improvement over Kirin 950 when accessing main memory, which should be beneficial for the latency-sensitive CPUs.
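One way to see why the L1 result cannot be explained by clock speed is to convert the measured latencies into core cycles. The sketch below assumes both CPUs ran at their nominal peak frequencies during the test (2.36GHz and 2.30GHz), which is my assumption rather than something verified here.

```python
def latency_cycles(latency_ns: float, freq_ghz: float) -> float:
    """Convert a latency in nanoseconds to core clock cycles."""
    return latency_ns * freq_ghz  # ns * (cycles per ns)


def percent_reduction(old: float, new: float) -> float:
    """Percentage by which `new` is lower than `old`."""
    return (old - new) / old * 100


if __name__ == "__main__":
    a73_cycles = latency_cycles(1.27, 2.36)  # Kirin 960, assumed 2.36GHz
    a72_cycles = latency_cycles(1.74, 2.30)  # Kirin 950, assumed 2.30GHz
    print(f"A73 L1 latency: {a73_cycles:.2f} cycles")
    print(f"A72 L1 latency: {a72_cycles:.2f} cycles")
    print(f"Latency reduction: {percent_reduction(1.74, 1.27):.1f}%")
```

Under that assumption the numbers land very close to whole cycle counts, which is consistent with the A73 saving a full cycle of L1 load latency instead of simply benefiting from a higher clock.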
Memory bandwidth results are less definitive, however. The Kirin 960 shows up to a 30% improvement in L1 read bandwidth over the Kirin 950 depending on the access pattern used, although L1 write bandwidth is lower by nearly the same amount. The 960's L2 cache bandwidth is also lower for both read and write by up to about 30%.
The two graphs above, which show read and write bandwidth using NEON instructions with two threads, help to illustrate Kirin 960's memory bandwidth. When reading, Kirin 960's L1 cache outperforms the 950's, but bandwidth drops once it hits the L2 cache. The Kirin 950 outpaces the 960 when writing to both L1 and L2, only falling below the 960's bandwidth when writing to system memory. This reduction in cache bandwidth could help explain the Kirin 960's performance regression in several of Geekbench 4's floating-point tests.
Geekbench 4 - Memory Performance (Single Threaded)

| Test | Kirin 960 | Kirin 950 (% Advantage) | Exynos 7420 (% Advantage) | Snapdragon 821 (% Advantage) |
|---|---|---|---|---|
| Memory Copy | 4.55 GB/s | 3.67 GB/s (23.87%) | 3.61 GB/s (26.04%) | 7.82 GB/s (-41.84%) |
| Memory Latency | 12.1 Mops/s | 9.6 Mops/s (25.39%) | 5.6 Mops/s (115.67%) | 6.6 Mops/s (81.82%) |
| Memory Bandwidth | 15.5 GB/s | 9.2 GB/s (69.28%) | 7.5 GB/s (105.84%) | 13.5 GB/s (14.53%) |
While the Kirin 960's L1/L2 cache performance is mixed, it holds a clear advantage over the Kirin 950 when using system memory. Memory latency improves by 25%, about the same amount our internal testing shows, and memory bandwidth improves by 69%. The A73's two load/store AGUs are likely responsible for a large chunk of the additional memory bandwidth, with the Mate 9's higher memory bus frequency helping some too.
Now it's time to see how Kirin 960's lower-level CPU and memory results translate into real-world performance, keeping in mind that OEMs can influence the balance between performance and battery life in a number of ways, including adjusting thermal limits and parameters that govern CPU scheduler and DVFS behavior, which is one reason why two devices with the same SoC can perform differently.
PCMark includes several realistic workloads that stress the CPU, GPU, RAM, and NAND storage using Android API calls many common apps use. The Mate 9 and its Kirin 960 SoC land at the top of each chart, outpacing the Mate 8 and its Kirin 950 by 15% overall and the top-performing Snapdragon 821 phones by up to 20%.
The Mate 9's advantage over the Mate 8 is only 4% in the Web Browsing test, but it's still the fastest phone we've tested so far. Integer performance is not the Kryo CPU's strength, and in this integer-heavy test all of the Snapdragon 820/821 phones fall behind SoCs using ARM's A72 and A73 CPUs, with LeEco's Le Pro3, the highest performing Snapdragon 821 phone, finishing 18% slower than the Mate 9.
The Writing test performs a variety of operations, including PDF processing and file encryption (both integer workloads), along with some memory operations and even reading and writing some files to internal NAND, and it tends to generate frequent, short bursts of activity on the big CPU cores. This seems to suit the Mate 9 just fine, because it extends its performance advantage over the Mate 8 to 23%. There's a pretty big spread between the Snapdragon 820/821 phones; the LeEco Le Pro3, the best performer in the family, is 40% faster than the Galaxy S7 edge, a prime example of how other hardware components and OEM software tinkering can affect the overall user experience.
The Data Manipulation test is another primarily integer workload that measures how long it takes to parse chunks of data from several different file types and then records the frame rate while interacting with dynamic charts. In this test, the Mate 9 is 30% faster than the Mate 8 and 37% faster than the Pixel XL.
All of the Snapdragon 820/821 phones perform well in the Kraken JavaScript test, pulling ahead of the Mate 9 by a small margin. The P9 uses Kirin 955's 7% CPU frequency advantage to help it keep up with the Mate 9 in Kraken and JetStream. The Mate 9 still pulls ahead by 11% in WebXPRT 2015, though, and outperforms the Mate 8 by 10% to 19% in all three tests. The Moto Z Play Droid, the only phone in the charts to use an octa-core A53 CPU configuration, cannot even manage half the performance of the Mate 9, which is similar to what our integer IPC tests show.
The Kirin 960 showed mixed results in our lower-level CPU and memory testing, pulling ahead of the Kirin 950 in some areas while falling behind in others. But when looking at system level tests using real-world workloads, the Mate 9 and its Kirin 960 are the clear winners. There are many hardware and software layers between you and the SoC, which is why it's important not to use an SoC benchmark to test system performance, or a system benchmark such as PCMark to test CPU performance.
CPU Power Consumption
Taking into account Kirin 950's excellent performance and power efficiency and ARM's claim that its A73 CPU consumes 20%-30% less power than Kirin 950's A72 cores (same process, same frequency), it's only logical to expect Kirin 960 to be the new efficiency king. Before earning that distinction, however, the 960's A73 cores need to be physically implemented on silicon, and there are many factors—process and cell library selection, critical path optimizations, etc.—that ultimately determine processor efficiency.
To get a feel for CPU power consumption, I used a power virus with different thread counts to artificially load the cores. Using each device's onboard fuel gauge, the active power was calculated by subtracting the device's idle power, where it was doing nothing except displaying a static screen, from the total power for the given scenario. This method compensates for the power used by the display and other hardware components, but it's not perfect; there's no way to separate power consumed by certain necessary blocks, such as SoC interconnects, memory controllers, or DRAM, so the figures below include some additional overhead. This is especially true for the "1 Core" figures, where SoC interconnects and busses first ramp to higher frequencies.
System Active Power: CPU Load + Per CPU Core Increments (mW)

| SoC | 1 Core | 2 Cores | 3 Cores | 4 Cores |
|---|---|---|---|---|
| Kirin 960, Cortex-A73 @ 2.362GHz | 1812 | 2845 (+1033) | 4082 (+1237) | 5312 (+1230) |
| Kirin 955, Cortex-A72 @ 2.516GHz | 1755 | 2855 (+1100) | 4040 (+1185) | 5010 (+970) |
| Kirin 950, Cortex-A72 @ 2.304GHz | 1347 | 2091 (+744) | 2844 (+753) | 3711 (+867) |
| Exynos 7420, Cortex-A57 @ 2.1GHz | 1619 | 2969 (+1350) | 4186 (+1217) | 5486 (+1300) |
| Snapdragon 810 v2.1, Cortex-A57 @ 1.958GHz | 2396 | 5144 (+2748) | 8058 (+2914) | not allowed |
| Snapdragon 820, Kryo @ 2.150GHz / 1.594GHz | 2055 | 3330 (+1275, 2.150GHz) | 4147 (+817, 1.594GHz) | 4735 (+588, 1.594GHz) |
| Snapdragon 821, Kryo @ 2.342GHz / 1.594GHz | 1752 | 3137 (+1385, 2.342GHz) | 3876 (+739, 1.594GHz) | 4794 (+918, 1.594GHz) |
| Kirin 960, Cortex-A53 @ 1.844GHz | 654 | 885 (+231) | 1136 (+251) | 1435 (+299) |
| Kirin 935, Cortex-A53 @ 2.2GHz | 1062 | 1769 (+707) | 2587 (+818) | 3311 (+724) |
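The arithmetic behind the table is simple enough to sketch. Active power is total power minus idle power, and each per-core increment is the difference between consecutive core counts. The idle and total readings in the first example are hypothetical placeholders (the raw fuel gauge readings are not published here); the Kirin 960 active-power series is taken from the table.

```python
def active_power_mw(total_mw: float, idle_mw: float) -> float:
    """Active power: total system power minus idle (static screen) power."""
    return total_mw - idle_mw


def per_core_increments(active_by_core_count):
    """Extra power added by each additional loaded core.

    active_by_core_count[i] is the active power with i + 1 cores loaded.
    """
    return [
        more - fewer
        for fewer, more in zip(active_by_core_count, active_by_core_count[1:])
    ]


if __name__ == "__main__":
    # Hypothetical raw readings, for illustration only.
    print(active_power_mw(total_mw=2512, idle_mw=700))

    # Kirin 960 Cortex-A73 active power (mW) for 1 to 4 cores, from the table.
    kirin_960_a73 = [1812, 2845, 4082, 5312]
    print(per_core_increments(kirin_960_a73))
```

The first core's figure has no increment because it also carries the shared overhead (interconnect, memory controller, DRAM) that cannot be separated out.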
Surprisingly, the Kirin 960's big CPU cores consume more power than the Kirin 950's A72s—up to 43% more! This is a complete reversal from ARM's goals for the A73, which were to reduce power consumption and improve sustained performance by reducing the thermal envelope. There's no way for us to know for sure why the Kirin 960 uses more power at its highest operating point, but it's likely a combination of implementation and process.
The Kirin 950 uses TSMC's 16FF+ FinFET process, but HiSilicon switches to TSMC's 16FFC FinFET process for the Kirin 960. The newer 16FFC process reduces manufacturing costs and die area to make it competitive in mid- to low-end markets, giving SoC vendors a migration path from 28nm. It also claims to reduce leakage and dynamic power by being able to run below 0.6V, making it suitable for wearable devices and IoT applications. Devices targeting price-sensitive markets, along with ultra low-power wearable devices, tend to run at lower frequencies, however, not 2.36GHz like Kirin 960. It's possible that pushing the less performance-oriented 16FFC process, which targets lower voltages/frequencies, to higher frequencies that lie beyond its peak efficiency point may partially explain the higher power consumption relative to 16FF+.
The differences we're seeing between Kirin 960 and 950 are unlikely to come from the difference in process alone, however. Implementation plays an even bigger role and allows a semiconductor company to get the most performance/power/area from a given process. HiSilicon did a great job with the Kirin 950 on 16FF+, which is why its efficiency is so good. This was always going to be a tough act to follow, and despite the similarities between 16FF+ and 16FFC from a design perspective, it's still a different process with different requirements. It's impossible to say how close HiSilicon came to the optimal solution, though, because we have no other examples of A73 on 16FFC for comparison.
The Kirin 960's peak power figures are actually very close to what I measured for Kirin 955, the higher-clocked version of the Kirin 950. Its per-core increases are similar to the Exynos 7420's lower-frequency A57 cores too, only about 50mW less.
The Kirin 960's A73 cores consume less power than the two high-performance Kryo cores in Snapdragon 820/821, though, using up to 2.8W for two cores versus 3.1W to 3.3W for two Kryo cores. The quad-core Snapdragons' remaining two cores run at a lower peak frequency and consume less power, nullifying Kirin 960's power advantage when using 3-4 cores.
Despite the higher power consumption at the CPU's highest operating points, Huawei's Mate 9 actually does very well in our battery life tests. Its 13.25 hours of screen on time in our Wi-Fi Web Browsing test is a full 3 hours more than the Mate 8, and its nearly 10 hours in PCMark 2.0 is 27% better than the Mate 8. These real-world battery life results seem to be at odds with our CPU power measurements.
The graph above shows the Mate 9's total system power consumption while running the PCMark 2.0 performance tests (all radios were turned off and the display's brightness was calibrated to only 10 nits to better isolate the power consumption of the internal components). With the exception of some power spikes caused by increased activity while loading the next test, total power consumption remains below 3W and generally below 2W, well under the 5.3W we measured from Kirin 960's four big cores.
I'm showing this graph because most of the apps we use every day behave similarly to PCMark, where we see threads migrate from the little cores to the big cores and back again and DVFS working hard to match CPU frequency with load (actually, most apps would show significantly more CPU idle time, so PCMark is still a bit extreme in this regard). Many workloads will only use 1-2 big cores too, like we see here with PCMark. With only 2 cores at their max operating point, the Kirin 960 only consumes 754mW more power than Kirin 950 instead of 1601mW more when using 4 cores. So while CPU efficiency is certainly important, we need to frame it in terms of real-world workloads, and we also cannot forget the impact software (scheduler, CPUfreq, CPUidle) has on overall battery life.
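The 754mW and 1601mW gaps come straight from the active power measurements earlier on this page. A minimal sketch of that comparison, with dictionary and function names of my own choosing:

```python
# Active power in mW by number of loaded big cores, from the CPU power table.
KIRIN_960_A73_MW = {1: 1812, 2: 2845, 3: 4082, 4: 5312}
KIRIN_950_A72_MW = {1: 1347, 2: 2091, 3: 2844, 4: 3711}


def extra_power(cores: int):
    """Return (absolute mW, percent) by which Kirin 960 exceeds Kirin 950."""
    new = KIRIN_960_A73_MW[cores]
    old = KIRIN_950_A72_MW[cores]
    delta_mw = new - old
    return delta_mw, delta_mw / old * 100


if __name__ == "__main__":
    for cores in sorted(KIRIN_960_A73_MW):
        delta_mw, percent = extra_power(cores)
        print(f"{cores} core(s): +{delta_mw}mW ({percent:.1f}% more)")
```

The absolute gap more than doubles between 2 and 4 cores, which is why workloads that only wake one or two big cores hide much of the Kirin 960's peak power penalty.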
Looking at power alone can be misleading; a device may use more power than another, but if it completes the task in less time, it may actually use less energy, leading to longer battery life. For both of the graphs above, the phones' radios were turned off and their displays calibrated to only 10 nits (the lowest common setting) to keep different screen sizes and efficiencies from skewing the results.
In the first graph, which shows the total energy consumed by each phone when running the PCMark 2.0 performance tests, the Mate 9 consumes 16% more energy overall than the Mate 8 (despite my efforts to minimize display influence, the P9's energy consumption is slightly lower than the Mate 8's, which is likely because of its smaller screen). The Video and Photo Editing tests, which employ the GPU, show some of the biggest percent differences, but the Writing test, which makes frequent use of the CPU's big cores, also shows a larger than average difference. The LeEco Le Pro3 and its Snapdragon 821 SoC actually consume more energy than the Mate 9 in the Data Manipulation and Writing tests, where it has to use its 2 high-performance Kryo cores, but less in the Video and Photo Editing tests that use the GPU.
The second graph divides the PCMark score by the energy consumed to show efficiency. Because of the Mate 9's better performance, it's actually 7% more efficient than the Mate 8 in the Writing test and 17% more efficient in the Data Manipulation test. The Mate 9's GPU efficiency is the worst of the group, judging by its scores in the Video and Photo Editing tests. In contrast, the Pro3's Adreno 530 GPU posts the highest efficiency values in these tests.
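The relationship between power, energy, and efficiency used in these two graphs can be sketched in a few lines. Every number below is a hypothetical placeholder chosen to illustrate the point, not a measurement from any phone in this article, and the function names are mine.

```python
def energy_joules(avg_power_w: float, duration_s: float) -> float:
    """Energy consumed: average power multiplied by run time."""
    return avg_power_w * duration_s


def efficiency(score: float, energy_j: float) -> float:
    """Benchmark score per joule, the metric used in the second graph."""
    return score / energy_j


if __name__ == "__main__":
    # Hypothetical phones: B draws 20% more power but finishes sooner.
    phone_a = energy_joules(avg_power_w=2.0, duration_s=300)
    phone_b = energy_joules(avg_power_w=2.4, duration_s=230)
    print(f"Phone A: {phone_a:.0f} J, Phone B: {phone_b:.0f} J")

    # Hypothetical scores: higher score for less energy means better efficiency.
    print(f"Phone A efficiency: {efficiency(5000, phone_a):.2f} points/J")
    print(f"Phone B efficiency: {efficiency(6000, phone_b):.2f} points/J")
```

In this made-up case the higher-power phone comes out ahead on both energy and efficiency, the same effect that lets the Mate 9 beat the Mate 8 in the Writing and Data Manipulation efficiency results despite drawing more power.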
The Mate 9 lasts longer than the Mate 8 in the PCMark battery test despite its Kirin 960 SoC consuming more energy, so Huawei must have reduced energy consumption elsewhere to compensate. The display is the most obvious place to look, and the graph above clearly shows that the Mate 9's display is more efficient. At 200 nits, the value we use for our battery tests, the Mate 9 shows an estimated 19% power reduction. In the time it takes to run PCMark, this translates to 82 J of energy, nearly erasing the 102 J difference between the Mate 9 and Mate 8. I suspect the difference in display power may actually be a little bigger, but I lack the equipment to make a more precise measurement. This still does not account for all of the Mate 9's power savings, but a full accounting is beyond the scope of this article.
CPU Thermal Stability
Our CPU throttling test uses the same power virus we used above with two threads running on two of the big A73 CPU cores for a duration of about 30 minutes. The goal is to determine a device's ability to sustain peak CPU performance without throttling and potentially reducing user experience. This is a product of CPU power consumption, the device's ability to dissipate heat, and the device's thermal safety limits.
The Mate 8 and its Kirin 950 are able to sustain peak performance with two A72 cores indefinitely, a remarkable feat. The Mate 9 does not fare as well because of Kirin 960's elevated power use; however, it still manages to hold two of its A73 cores at peak frequency for 11.3 minutes and does not throttle enough to affect performance in a noticeable way for 20 minutes, which is still a very good result. I cannot think of any CPU-centric workloads for a phone that would load two big cores for anywhere near this long, so it's safe to say that CPU throttling is not a problem for the Mate 9. It will be interesting to see if this holds true for Huawei's smaller phones such as the P10, which will not be able to dissipate heat as readily as the big, aluminum Mate 9.
GPU Power Consumption
The Kirin 960 adopts ARM's latest Mali-G71 GPU, and unlike previous Kirin SoCs that tried to balance performance and power consumption by using fewer GPU cores, the 960's 8 cores show a clear focus on increasing peak performance. More cores also means more power and raises concerns about sustained performance.
We measure GPU power consumption using a method that's similar to what we use for the CPU. Running the GFXBench Manhattan 3.1 and T-Rex performance tests offscreen, we calculate the system load power by subtracting the device's idle power from its total active power while running each test, using each device's onboard fuel gauge to collect data.
GFXBench Manhattan 3.1 Offscreen Power Efficiency (System Load Power)

| Device | Mfc. Process | FPS | Avg. Power (W) | Perf/W Efficiency |
|---|---|---|---|---|
| LeEco Le Pro3 (Snapdragon 821) | 14LPP | 33.04 | 4.18 | 7.90 fps/W |
| Galaxy S7 (Snapdragon 820) | 14LPP | 30.98 | 3.98 | 7.78 fps/W |
| Xiaomi Redmi Note 3 (Snapdragon 650) | 28HPm | 9.93 | 2.17 | 4.58 fps/W |
| Meizu PRO 6 (Helio X25) | 20Soc | 9.42 | 2.19 | 4.30 fps/W |
| Meizu PRO 5 (Exynos 7420) | 14LPE | 14.45 | 3.47 | 4.16 fps/W |
| Nexus 6P (Snapdragon 810 v2.1) | 20Soc | 21.94 | 5.44 | 4.03 fps/W |
| Huawei Mate 8 (Kirin 950) | 16FF+ | 10.37 | 2.75 | 3.77 fps/W |
| Huawei Mate 9 (Kirin 960) | 16FFC | 32.49 | 8.63 | 3.77 fps/W |
| Galaxy S6 (Exynos 7420) | 14LPE | 16.62 | 4.63 | 3.59 fps/W |
| Huawei P9 (Kirin 955) | 16FF+ | 10.59 | 2.98 | 3.55 fps/W |
The Mate 9's 8.63W average is easily the highest of the group and simply unacceptable for an SoC targeted at smartphones. With the GPU consuming so much power, it's basically impossible for the GPU and even a single A73 CPU core to run at their highest operating points at the same time without exceeding a 10W TDP, a value more suitable for a large tablet. The Mate 9 allows its GPU to hit 1037MHz too, which is a little silly. For comparison, the Exynos 7420 on Samsung's 14LPE FinFET process, which also has an 8 core Mali GPU (albeit an older Mali-T760), only goes up to 772MHz, keeping its average power below 5W.
The Mate 9's average power is 3.1x higher than the Mate 8's, but because peak performance goes up by the same amount, efficiency turns out to be equal. Qualcomm's Adreno 530 GPU in Snapdragon 820/821 is easily the most efficient with this workload, and despite achieving about the same performance as Kirin 960, it uses less than half the power.
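These ratios can be rechecked directly from the Manhattan 3.1 figures. The Python sketch below uses the FPS and average power values from the table; the dictionary layout and function name are only illustrative. Because the inputs are already rounded to two decimals, a recomputed fps/W value can differ from the published column in the last digit.

```python
# (fps, average system load power in W) from the Manhattan 3.1 table.
results = {
    "Mate 8 (Kirin 950)": (10.37, 2.75),
    "Mate 9 (Kirin 960)": (32.49, 8.63),
    "Le Pro3 (Snapdragon 821)": (33.04, 4.18),
}


def fps_per_watt(fps, watts):
    """Efficiency metric used in the table: frames per second per watt."""
    return fps / watts


mate8_fps, mate8_w = results["Mate 8 (Kirin 950)"]
mate9_fps, mate9_w = results["Mate 9 (Kirin 960)"]
pro3_fps, pro3_w = results["Le Pro3 (Snapdragon 821)"]

power_ratio = mate9_w / mate8_w     # how much more power the Mate 9 draws
perf_ratio = mate9_fps / mate8_fps  # how much faster the Mate 9 is
pro3_power_share = pro3_w / mate9_w # Le Pro3 power relative to the Mate 9

print(f"Power: {power_ratio:.1f}x, performance: {perf_ratio:.1f}x")
print(f"Le Pro3 uses {pro3_power_share:.0%} of the Mate 9's power")
for name, (fps, watts) in results.items():
    print(f"{name}: {fps_per_watt(fps, watts):.2f} fps/W")
```

Power and performance both scale by roughly 3.1x, which is why the two Huawei phones land on nearly the same fps/W figure.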
GFXBench T-Rex Offscreen Power Efficiency (System Load Power)

| Device | Mfc. Process | FPS | Avg. Power (W) | Perf/W Efficiency |
|---|---|---|---|---|
| LeEco Le Pro3 (Snapdragon 821) | 14LPP | 94.97 | 3.91 | 24.26 fps/W |
| Galaxy S7 (Snapdragon 820) | 14LPP | 90.59 | 4.18 | 21.67 fps/W |
| Galaxy S7 (Exynos 8890) | 14LPP | 87.00 | 4.70 | 18.51 fps/W |
| Xiaomi Mi5 Pro (Snapdragon 820) | 14LPP | 91.00 | 5.03 | 18.20 fps/W |
| Apple iPhone 6s Plus (A9) [OpenGL] | 16FF+ | 79.40 | 4.91 | 16.14 fps/W |
| Xiaomi Redmi Note 3 (Snapdragon 650) | 28HPm | 34.43 | 2.26 | 15.23 fps/W |
| Meizu PRO 5 (Exynos 7420) | 14LPE | 55.67 | 3.83 | 14.54 fps/W |
| Xiaomi Mi Note Pro (Snapdragon 810 v2.1) | 20Soc | 57.60 | 4.40 | 13.11 fps/W |
| Nexus 6P (Snapdragon 810 v2.1) | 20Soc | 58.97 | 4.70 | 12.54 fps/W |
| Galaxy S6 (Exynos 7420) | 14LPE | 58.07 | 4.79 | 12.12 fps/W |
| Huawei Mate 8 (Kirin 950) | 16FF+ | 41.69 | 3.58 | 11.64 fps/W |
| Meizu PRO 6 (Helio X25) | 20Soc | 32.46 | 2.84 | 11.43 fps/W |
| Huawei P9 (Kirin 955) | 16FF+ | 40.42 | 3.68 | 10.98 fps/W |
| Huawei Mate 9 (Kirin 960) | 16FFC | 99.16 | 9.51 | 10.42 fps/W |
Things only get worse for Kirin 960 in T-Rex, where average power increases to 9.51W and GPU efficiency drops to the lowest value of any device we've tested. As another comparison point, the Exynos 8890 in Samsung's Galaxy S7, which uses a wider 12 core Mali-T880 GPU at up to 650MHz, averages 4.7W and is only 12% slower, making it 78% more efficient.
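The "12% slower" and "78% more efficient" figures follow from the T-Rex table values. Here is a small Python check using the published FPS and fps/W numbers for the two devices; the `percent_change` helper is a hypothetical name introduced for illustration.

```python
def percent_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


# Values taken from the T-Rex table.
kirin_960 = {"fps": 99.16, "watts": 9.51, "fps_per_w": 10.42}
exynos_8890 = {"fps": 87.00, "watts": 4.70, "fps_per_w": 18.51}

slower_pct = -percent_change(exynos_8890["fps"], kirin_960["fps"])
more_efficient_pct = percent_change(
    exynos_8890["fps_per_w"], kirin_960["fps_per_w"]
)

print(f"Exynos 8890 is {slower_pct:.0f}% slower")
print(f"Exynos 8890 is {more_efficient_pct:.0f}% more efficient")
# prints:
# Exynos 8890 is 12% slower
# Exynos 8890 is 78% more efficient
```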
All of the flagship SoCs we've tested from Apple, Qualcomm, and Samsung manage to stay below a 5W ceiling in this test, and even then these SoCs are unable to sustain peak performance for very long before throttling back because of heat buildup. Ideally, we like to see phones remain below 4W in this test, and pushing above 5W just does not make any sense.
The Kirin 960's higher power consumption has a negative impact on the Mate 9's battery life while gaming. It runs for 1 hour less than the Mate 8, a 22% reduction that would be more pronounced if the Mate 9 did not throttle back GPU frequency during the test. Ultimately, the Mate 9's runtime is similar to other flagship phones (with smaller batteries), while providing similar or better performance. To reconcile Kirin 960's high GPU power consumption with the Mate 9's acceptable battery life in our gaming test, we need to look more closely at its behavior over the duration of the test.
GPU Thermal Stability
The Mate 9 only maintains peak performance for about 1 minute before reducing GPU frequency, dropping frame rate to 21fps after 8 minutes, a 38% reduction relative to the peak value. It reaches equilibrium after about 30 minutes, with frame rate hovering around 19fps, which is still better than the phones using Kirin 950/955 that peak at 11.5fps with sustained performance hovering between 9-11fps. It's also as good as or better than phones using Qualcomm's Snapdragon 820/821 SoCs. The Moto Z Force Droid, for example, can sustain a peak performance of almost 18fps for 12 minutes, gradually reaching a steady-state frame rate of 14.5fps, and the LeEco Le Pro3 sustains 19fps after dropping from a peak value of 33fps.
In the lower chart, which shows how the Mate 9's GPU frequency and power consumption change during the first 15 minutes of the gaming battery test, we can see that once GPU frequency drops to 533MHz, average power consumption drops below 4W, a sustainable value that still results in performance on par with other flagship SoCs after they've throttled back too. This suggests that Huawei/HiSilicon should have chosen a more sensible peak operating point for Kirin 960's GPU of 650MHz to 700MHz. The only reason to push GPU frequency to 1037MHz (at least in a phone or tablet) is to make the device look better on a spec sheet and post higher peak scores in benchmarks.
Lowering GPU frequency would not improve Kirin 960's low GPU efficiency, however. Because we do not have any other Mali-G71 examples at this time, we cannot say if this is indicative of ARM's new GPU microarchitecture (I suspect not) or the result of HiSilicon's implementation and process choice.
HiSilicon's Kirin 950 delivered impressive performance and efficiency, raising our expectations for its successor. And on paper at least, the Kirin 960 seems better in every way. It incorporates ARM's latest IP, including A73 CPUs, the new Mali-G71 GPU with more cores, and a CCI-550 interconnect. It offers other improvements too, such as a new modem that supports higher LTE speeds and support for UFS 2.1 storage. But when it comes to performance and efficiency, the Kirin 960 improves in some areas and regresses in others.
The Kirin 960's A73 CPU cores are marginally faster than the 950's A72 cores when handling integer workloads, with a more noticeable lead over Qualcomm's Kryo and the older A57. When looking at floating-point IPC, the opposite is true, with Qualcomm's Kryo and Kirin 950's A72 cores posting better results than the 960's A73.
Some of this performance regression may be explained by Kirin 960's memory performance. Both latency and read bandwidth improve for its larger 64KB L1 cache, but write bandwidth is lower than Kirin 950's. The 960's L2 cache bandwidth is also lower for both read and write. Its latency to main memory improves by 25%, however, and bandwidth improves by an impressive 69%.
What's really disappointing (and puzzling) about Kirin 960, though, is that its CPU efficiency is actually worse than the 950's. ARM did a lot of work to reduce the A73's power consumption relative to the A72, but the Kirin 960's A73 cores see a substantial power increase over the 950's A72 cores. The poor efficiency numbers are likely a combination of HiSilicon's specific implementation and the switch to the 16FFC process. This was definitely an unexpected result considering the Mate 9's excellent battery life. Fortunately, Huawei was able to save power elsewhere, such as the display, to make up for the SoC's power increase, but it's difficult not to think about how much better the battery life could have been.
Power consumption for Kirin 960's GPU is even worse, with peak power numbers that are entirely inappropriate for a smartphone. Part of the problem is poor efficiency, again likely a combination of implementation and process, which is only made worse by an overly aggressive 1037MHz peak operating point that only serves to improve the spec sheet and benchmark results.
The Kirin 960 is difficult to categorize. It's definitely not a clear upgrade over the 950, but it does just enough things right that we cannot dismiss it outright either. For example, its generally improved integer performance and lower system memory latency give it an advantage over the 950 in many real-world workloads. We cannot completely condemn its GPU either, because its sustained performance, at least in the Mate 9's large aluminum chassis, is on par with or better than competing flagship phones, as is its battery life when gaming. Certainly the Mate 9 proves that Kirin 960 is a viable flagship SoC as long as Huawei puts in the effort to work around its flaws. But with a new generation of 10nm SoCs just around the corner, those flaws will only become more apparent.
Source:
HiSilicon Kirin 960: A Closer Look at Performance and Power