{"id":231819,"date":"2024-06-13T12:45:12","date_gmt":"2024-06-13T12:45:12","guid":{"rendered":"https:\/\/michigandigitalnews.com\/index.php\/2024\/06\/13\/nvidia-unveils-grouped-gemm-apis-in-cublas-12-5-to-boost-dl-and-hpc-performance\/"},"modified":"2025-06-25T17:17:13","modified_gmt":"2025-06-25T17:17:13","slug":"nvidia-unveils-grouped-gemm-apis-in-cublas-12-5-to-boost-dl-and-hpc-performance","status":"publish","type":"post","link":"https:\/\/michigandigitalnews.com\/index.php\/2024\/06\/13\/nvidia-unveils-grouped-gemm-apis-in-cublas-12-5-to-boost-dl-and-hpc-performance\/","title":{"rendered":"NVIDIA Unveils Grouped GEMM APIs in cuBLAS 12.5 to Boost DL and HPC Performance"},"content":{"rendered":"<p> [ad_1]<br \/>\n<\/p>\n<div>\n<figure class=\"figure mt-2\">&#13;<br \/>\n                        <a href=\"https:\/\/image.blockchain.news:443\/features\/D8E08E86F8EDBDDCD68414CF49BDD8B1401B11A69515DFF98E6B2B03EE9CF9D7.jpg\" data-glightbox=\"\" data-gallery=\"image-popup\">&#13;<br \/>\n                            <img decoding=\"async\" class=\"rounded\" src=\"https:\/\/image.blockchain.news:443\/features\/D8E08E86F8EDBDDCD68414CF49BDD8B1401B11A69515DFF98E6B2B03EE9CF9D7.jpg\" alt=\"NVIDIA Unveils Grouped GEMM APIs in cuBLAS 12.5 to Boost DL and HPC Performance\"\/>&#13;<br \/>\n&#13;<br \/>\n                        <\/a>&#13;<br \/>\n                    <\/figure>\n<p>The latest release of the NVIDIA cuBLAS library, version 12.5, brings significant updates aimed at enhancing the functionality and performance of deep learning (DL) and high-performance computing (HPC) workloads, according to NVIDIA Technical Blog. 
Key updates include the introduction of Grouped GEMM APIs, improved matrix multiplication (matmul) performance on NVIDIA Hopper (H100 and H200) and Ada (L40S) GPUs, and enhanced performance-tuning options.</p>
<h2>Grouped GEMM APIs</h2>
<p>The newly introduced Grouped GEMM APIs generalize the batched APIs by allowing matrix multiplications with different sizes, transpositions, and scaling factors to be grouped together and executed in a single kernel launch. This approach has shown a 1.2x speedup in certain scenarios, such as the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs.</p>
<p>Two new sets of APIs support Grouped GEMM:</p>
<ol>
<li><a rel="nofollow" href="https://docs.nvidia.com/cuda/cublas/index.html#cublas-t-gemmgroupedbatched">cublas&lt;t&gt;gemmGroupedBatched</a> for FP32 (including TF32) and FP64 precisions.</li>
<li><a rel="nofollow" href="https://docs.nvidia.com/cuda/cublas/index.html#cublasgemmgroupedbatchedex">cublasGemmGroupedBatchedEx</a> for FP16, BF16, FP32 (including TF32), and FP64 precisions.</li>
</ol>
<p>Both APIs support variable shapes, transpositions, and scaling factors. Examples can be found in the <a rel="nofollow" href="https://github.com/NVIDIA/CUDALibrarySamples">NVIDIA/CUDALibrarySamples</a> GitHub repository.</p>
<h2>Latest LLM Matmul Performance on NVIDIA H100, H200, and L40S GPUs</h2>
<p>Recent performance snapshots show significant speedups for the Llama 2 70B and GPT-3 training phases on NVIDIA H100, H200, and L40S GPUs. The H200 GPU, in particular, demonstrates nearly 3x and 5x speedups over the A100 for the Llama 2 70B and GPT-3 training phases, respectively.
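</p>
<p>The per-group semantics described in the Grouped GEMM section above (each group carrying its own shapes, transpose flags, and alpha/beta scaling factors, all submitted in one call) can be sketched in NumPy. This is only an illustration of the math the API computes, not the cuBLAS C interface or its single-kernel-launch execution; the <code>grouped_gemm</code> helper and its argument layout are invented for this example.</p>

```python
import numpy as np

def grouped_gemm(groups):
    """Illustrative grouped GEMM: every group has its own shapes,
    transpose flags, and scaling factors, yet all groups are handed
    over in one call (cuBLAS would fuse them into one kernel launch).
    Computes C = alpha * op(A) @ op(B) + beta * C for each group."""
    out = []
    for g in groups:
        A = g["A"].T if g["transa"] else g["A"]
        B = g["B"].T if g["transb"] else g["B"]
        out.append(g["alpha"] * (A @ B) + g["beta"] * g["C"])
    return out

# Two groups with different sizes, transpositions, and scaling factors.
groups = [
    {"A": np.ones((2, 3)), "B": np.ones((3, 4)), "C": np.zeros((2, 4)),
     "alpha": 1.0, "beta": 0.0, "transa": False, "transb": False},
    {"A": np.ones((5, 2)), "B": np.ones((5, 6)), "C": np.ones((2, 6)),
     "alpha": 2.0, "beta": 1.0, "transa": True, "transb": False},
]
results = grouped_gemm(groups)  # shapes (2, 4) and (2, 6)
```

<p>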
These improvements are measured without locking GPU clocks and account for the number of times each GEMM is repeated in the workload.</p>
<figure><img decoding="async" alt="speedup-gemm-only-fraction-e2e-workloads-2.png" src="https://developer-blogs.nvidia.com/wp-content/uploads/2024/06/speedup-gemm-only-fraction-e2e-workloads-2.png"/><figcaption><em>Figure 1. Speedup of the GEMM-only fraction of end-to-end workloads</em></figcaption></figure>
<h2>Library Performance and Benchmarking</h2>
<p>Several enhancements have been made to the runtime performance heuristics and the performance-tuning APIs. The cuBLAS library uses a recommender system at runtime to dispatch the fastest available configuration for any user-requested matmul. This system is trained on actual timing data from a wide range of problems and configurations.</p>
<figure><img decoding="async" alt="gemm-sampling-kernel-families-cublas.png" src="https://developer-blogs.nvidia.com/wp-content/uploads/2024/06/gemm-sampling-kernel-families-cublas.png"/><figcaption><em>Figure 2. Sampling of various GEMMs using multiple configurations in different kernel families</em></figcaption></figure>
<p>For advanced users, the <a rel="nofollow" href="https://docs.nvidia.com/cuda/cublas/#cublasltmatmulalgogetheuristic">cublasLtMatmulAlgoGetHeuristic</a> API enables performance tuning to achieve faster implementations. Examples of auto-tuning in cuBLAS can be found in the <a rel="nofollow" href="https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuBLASLt/LtSgemmSimpleAutoTuning">NVIDIA/CUDALibrarySamples</a> repository.</p>
<figure><a rel="nofollow" href="https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuBLASLt/LtSgemmSimpleAutoTuning"><img decoding="async" alt="auto-tuning-cublas-1.png" src="https://developer-blogs.nvidia.com/wp-content/uploads/2024/06/auto-tuning-cublas-1.png"/></a><figcaption><em>Figure 4.
An example of auto-tuning in cuBLAS</em></figcaption></figure>
<h2>Better Functionality and Performance in cuBLASLt</h2>
<p>Since cuBLAS 12.0, numerous enhancements have been introduced:</p>
<ol>
<li>Fused epilogue support parity between BF16 and FP16 precisions on NVIDIA Ampere and Ada.</li>
<li>Additional fused epilogues on NVIDIA Hopper and Ampere.</li>
<li>Support for FP8 on Ada GPUs, and performance updates for the Ada L4, L40, and L40S.</li>
<li>Removal of the M, N, and batch-size limitations of the cuBLASLt matmul API.</li>
<li>Improved performance of the heuristics cache for workloads with a high eviction rate.</li>
<li>Availability of cuBLAS symbols in the CUDA Toolkit symbols repository for Linux.</li>
</ol>
<p>For more information on cuBLAS, see the <a rel="nofollow" href="https://docs.nvidia.com/cuda/cublas/index.html">documentation</a> and <a rel="nofollow" href="https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuBLASLt">samples</a>.</p>
<p><span><i>Image source: Shutterstock</i></span></p>
<p>. . 
.</p>
</div>
<p><a href="https://blockchain.news/news/nvidia-grouped-gemm-apis-cublas-12-5">Source link</a></p>
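<p>As a closing aside on the Library Performance and Benchmarking section: the auto-tuning workflow it describes (enumerate candidate configurations via cublasLtMatmulAlgoGetHeuristic, time them, and dispatch the fastest) follows a generic try-then-choose pattern. A minimal Python sketch of that pattern, with invented names and plain NumPy operations standing in for GPU kernels:</p>

```python
import time
import numpy as np

def autotune(candidates, args, repeats=3):
    """Generic try-then-choose auto-tuning loop: time each candidate
    implementation on the real operands and return the name of the
    fastest one, mirroring how heuristic-suggested matmul algorithms
    can be benchmarked before dispatch."""
    timings = {}
    for name, fn in candidates.items():
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        timings[name] = (time.perf_counter() - start) / repeats
    return min(timings, key=timings.get)

# Candidate "algorithms" for the same matmul (stand-ins for the
# configurations a heuristics query would return).
A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
candidates = {
    "matmul": lambda a, b: a @ b,
    "einsum": lambda a, b: np.einsum("ij,jk->ik", a, b),
}
best = autotune(candidates, (A, B))  # name of the fastest candidate
```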