Compare commits


7 Commits

Author               SHA1        Message                                                     Date
Bernhard Stoeckner   3916d91c99  535.104.12                                                  2023-09-25 18:56:22 +02:00
Bernhard Stoeckner   a8e01be6b2  535.104.05                                                  2023-08-22 15:09:37 +02:00
Bernhard Stoeckner   12c0739352  535.98                                                      2023-08-08 18:28:38 +02:00
Bernhard Stoeckner   29f830f1bb  535.86.10                                                   2023-07-31 18:17:14 +02:00
Bernhard Stoeckner   337e28efda  535.86.05                                                   2023-07-18 16:00:22 +02:00
Bernhard Stoeckner   22a077c4fe  issue template: be clearer about issues with prop driver   2023-07-10 15:58:02 +02:00
Andy Ritger          26458140be  535.54.03                                                   2023-06-14 12:37:59 -07:00
47 changed files with 3189 additions and 220 deletions

View File

@@ -1,5 +1,8 @@
name: Report a functional bug 🐛
description: Functional bugs affect operation or stability of the driver and/or hardware.
description: |
Functional bugs affect operation or stability of the driver or hardware.
Bugs with the closed source driver must be reported on the forums (see link on New Issue page below).
labels:
- "bug"
body:
@@ -18,14 +21,12 @@ body:
description: "Which open-gpu-kernel-modules version are you running? Be as specific as possible: SHA is best when built from specific commit."
validations:
required: true
- type: dropdown
- type: checkboxes
id: sw_driver_proprietary
attributes:
label: "Does this happen with the proprietary driver (of the same version) as well?"
label: "Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver."
options:
- "Yes"
- "No"
- "I cannot test this"
- label: "I confirm that this does not happen with the proprietary driver package."
validations:
required: true
- type: input
@@ -42,6 +43,14 @@ body:
description: "Which kernel are you running? (output of `uname -a`, say if you built it yourself)"
validations:
required: true
- type: checkboxes
id: sw_host_kernel_stable
attributes:
label: "Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels."
options:
- label: "I am running on a stable kernel release."
validations:
required: true
- type: input
id: hw_gpu_type
attributes:
@@ -78,7 +87,10 @@ body:
id: bug_report_gz
attributes:
label: nvidia-bug-report.log.gz
description: "Please reproduce the problem, after that run `nvidia-bug-report.sh`, and attach the resulting nvidia-bug-report.log.gz here."
description: |
Please reproduce the problem, after that run `nvidia-bug-report.sh`, and attach the resulting nvidia-bug-report.log.gz here.
Reports without this file will be closed.
placeholder: You can usually just drag & drop the file into this textbox.
validations:
required: true

View File

@@ -1,14 +1,14 @@
blank_issues_enabled: false
contact_links:
- name: Report a bug with the proprietary driver
url: https://forums.developer.nvidia.com/c/gpu-graphics/linux/148
about: Bugs that aren't specific to the open source driver in this repository must be reported with the linked forums instead.
- name: Report a cosmetic issue
url: https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/categories/general
about: We are not currently accepting cosmetic-only changes such as whitespace, typos, or simple renames. You can still discuss and collect them on the boards.
- name: Ask a question
url: https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/categories/q-a
about: Unsure of what to click, where to go, what the process for your thing is? We're happy to help. Click to visit the discussion board and say hello!
- name: Report a bug with the proprietary driver
url: https://forums.developer.nvidia.com/c/gpu-graphics/linux/148
about: Bugs that aren't specific to the open source driver in this repository should be reported with the linked forums instead. If you are unsure on what kind of bug you have, feel free to open a thread in Discussions. We're here to help!
- name: Suggest a feature
url: https://github.com/NVIDIA/open-gpu-kernel-modules/discussions/categories/ideas
about: Please do not open Issues for feature requests; instead, suggest and discuss new features on the Github discussion board. If you have a feature you worked on and want to PR it, please also open a discussion before doing so.

View File

@@ -2,28 +2,18 @@
## Release 535 Entries
### [535.104.12] 2023-09-25
### [535.104.05] 2023-08-22
### [535.98] 2023-08-08
### [535.86.10] 2023-07-31
### [535.86.05] 2023-07-18
### [535.54.03] 2023-06-14
### [535.43.22] 2023-12-19
### [535.43.20] 2023-12-08
### [535.43.19] 2023-12-01
### [535.43.16] 2023-11-06
### [535.43.15] 2023-10-24
### [535.43.13] 2023-10-11
### [535.43.11] 2023-10-05
### [535.43.10] 2023-09-27
### [535.43.09] 2023-09-01
### [535.43.08] 2023-08-17
### [535.43.02] 2023-05-30
#### Fixed

View File

@@ -1,7 +1,7 @@
# NVIDIA Linux Open GPU Kernel Module Source
This is the source release of the NVIDIA Linux open GPU kernel modules,
version 535.43.22.
version 535.104.12.
## How to Build
@@ -17,7 +17,7 @@ as root:
Note that the kernel modules built here must be used with GSP
firmware and user-space NVIDIA GPU driver components from a corresponding
535.43.22 driver release. This can be achieved by installing
535.104.12 driver release. This can be achieved by installing
the NVIDIA GPU driver from the .run file using the `--no-kernel-modules`
option. E.g.,
@@ -180,7 +180,7 @@ software applications.
## Compatible GPUs
The open-gpu-kernel-modules can be used on any Turing or later GPU
(see the table below). However, in the 535.43.22 release,
(see the table below). However, in the 535.104.12 release,
GeForce and Workstation support is still considered alpha-quality.
To enable use of the open kernel modules on GeForce and Workstation GPUs,
@@ -188,7 +188,7 @@ set the "NVreg_OpenRmEnableUnsupportedGpus" nvidia.ko kernel module
parameter to 1. For more details, see the NVIDIA GPU driver end user
README here:
https://us.download.nvidia.com/XFree86/Linux-x86_64/535.43.22/README/kernel_open.html
https://us.download.nvidia.com/XFree86/Linux-x86_64/535.104.12/README/kernel_open.html
In the below table, if three IDs are listed, the first is the PCI Device
ID, the second is the PCI Subsystem Vendor ID, and the third is the PCI
@@ -658,7 +658,6 @@ Subsystem Device ID.
| NVIDIA A100-SXM4-80GB | 20B2 10DE 147F |
| NVIDIA A100-SXM4-80GB | 20B2 10DE 1622 |
| NVIDIA A100-SXM4-80GB | 20B2 10DE 1623 |
| NVIDIA PG509-210 | 20B2 10DE 1625 |
| NVIDIA A100-SXM-64GB | 20B3 10DE 14A7 |
| NVIDIA A100-SXM-64GB | 20B3 10DE 14A8 |
| NVIDIA A100 80GB PCIe | 20B5 10DE 1533 |
@@ -666,7 +665,6 @@ Subsystem Device ID.
| NVIDIA PG506-232 | 20B6 10DE 1492 |
| NVIDIA A30 | 20B7 10DE 1532 |
| NVIDIA A30 | 20B7 10DE 1804 |
| NVIDIA A30 | 20B7 10DE 1852 |
| NVIDIA A800-SXM4-40GB | 20BD 10DE 17F4 |
| NVIDIA A100-PCIE-40GB | 20F1 10DE 145F |
| NVIDIA A800-SXM4-80GB | 20F3 10DE 179B |
@@ -750,8 +748,6 @@ Subsystem Device ID.
| NVIDIA H100 PCIe | 2331 10DE 1626 |
| NVIDIA H100 | 2339 10DE 17FC |
| NVIDIA H800 NVL | 233A 10DE 183A |
| NVIDIA GH200 120GB | 2342 10DE 16EB |
| NVIDIA GH200 480GB | 2342 10DE 1809 |
| NVIDIA GeForce RTX 3060 Ti | 2414 |
| NVIDIA GeForce RTX 3080 Ti Laptop GPU | 2420 |
| NVIDIA RTX A5500 Laptop GPU | 2438 |

View File

@@ -72,7 +72,7 @@ EXTRA_CFLAGS += -I$(src)/common/inc
EXTRA_CFLAGS += -I$(src)
EXTRA_CFLAGS += -Wall $(DEFINES) $(INCLUDES) -Wno-cast-qual -Wno-error -Wno-format-extra-args
EXTRA_CFLAGS += -D__KERNEL__ -DMODULE -DNVRM
EXTRA_CFLAGS += -DNV_VERSION_STRING=\"535.43.22\"
EXTRA_CFLAGS += -DNV_VERSION_STRING=\"535.104.12\"
ifneq ($(SYSSRCHOST1X),)
EXTRA_CFLAGS += -I$(SYSSRCHOST1X)

View File

@@ -5743,23 +5743,25 @@ compile_test() {
compile_check_conftest "$CODE" "NV_IOASID_GET_PRESENT" "" "functions"
;;
mm_pasid_set)
mm_pasid_drop)
#
# Determine if mm_pasid_set() function is present
# Determine if mm_pasid_drop() function is present
#
# Added by commit 701fac40384f ("iommu/sva: Assign a PASID to mm
# on PASID allocation and free it on mm exit") in v5.18.
# Moved to linux/iommu.h in commit cd3891158a77 ("iommu/sva: Move
# PASID helpers to sva code") in v6.4.
#
# mm_pasid_set() function was added by commit
# 701fac40384f07197b106136012804c3cae0b3de (iommu/sva: Assign a
# PASID to mm on PASID allocation and free it on mm exit) in v5.18.
# (2022-02-15).
CODE="
#if defined(NV_LINUX_SCHED_MM_H_PRESENT)
#include <linux/sched/mm.h>
#endif
void conftest_mm_pasid_set(void) {
mm_pasid_set();
#include <linux/iommu.h>
void conftest_mm_pasid_drop(void) {
mm_pasid_drop();
}"
compile_check_conftest "$CODE" "NV_MM_PASID_SET_PRESENT" "" "functions"
compile_check_conftest "$CODE" "NV_MM_PASID_DROP_PRESENT" "" "functions"
;;
drm_crtc_state_has_no_vblank)

View File

@@ -81,7 +81,7 @@ NV_CONFTEST_FUNCTION_COMPILE_TESTS += set_memory_uc
NV_CONFTEST_FUNCTION_COMPILE_TESTS += set_pages_uc
NV_CONFTEST_FUNCTION_COMPILE_TESTS += ktime_get_raw_ts64
NV_CONFTEST_FUNCTION_COMPILE_TESTS += ioasid_get
NV_CONFTEST_FUNCTION_COMPILE_TESTS += mm_pasid_set
NV_CONFTEST_FUNCTION_COMPILE_TESTS += mm_pasid_drop
NV_CONFTEST_FUNCTION_COMPILE_TESTS += migrate_vma_setup
NV_CONFTEST_FUNCTION_COMPILE_TESTS += mmget_not_zero
NV_CONFTEST_FUNCTION_COMPILE_TESTS += mmgrab

View File

@@ -32,19 +32,23 @@
// For ATS support on aarch64, arm_smmu_sva_bind() is needed for
// iommu_sva_bind_device() calls. Unfortunately, arm_smmu_sva_bind() is not
// conftest-able. We instead look for the presence of ioasid_get() or
// mm_pasid_set(). ioasid_get() was added in the same patch series as
// arm_smmu_sva_bind() and removed in v6.0. mm_pasid_set() was added in the
// mm_pasid_drop(). ioasid_get() was added in the same patch series as
// arm_smmu_sva_bind() and removed in v6.0. mm_pasid_drop() was added in the
// same patch as the removal of ioasid_get(). We assume the presence of
// arm_smmu_sva_bind() if ioasid_get(v5.11 - v5.17) or mm_pasid_set(v5.18+) is
// arm_smmu_sva_bind() if ioasid_get(v5.11 - v5.17) or mm_pasid_drop(v5.18+) is
// present.
//
// arm_smmu_sva_bind() was added with commit
// 32784a9562fb0518b12e9797ee2aec52214adf6f and ioasid_get() was added with
// commit cb4789b0d19ff231ce9f73376a023341300aed96 (11/23/2020). Commit
// 701fac40384f07197b106136012804c3cae0b3de (02/15/2022) removed ioasid_get()
// and added mm_pasid_set().
#if UVM_CAN_USE_MMU_NOTIFIERS() && (defined(NV_IOASID_GET_PRESENT) || defined(NV_MM_PASID_SET_PRESENT))
#define UVM_ATS_SVA_SUPPORTED() 1
// and added mm_pasid_drop().
#if UVM_CAN_USE_MMU_NOTIFIERS() && (defined(NV_IOASID_GET_PRESENT) || defined(NV_MM_PASID_DROP_PRESENT))
#if defined(CONFIG_IOMMU_SVA)
#define UVM_ATS_SVA_SUPPORTED() 1
#else
#define UVM_ATS_SVA_SUPPORTED() 0
#endif
#else
#define UVM_ATS_SVA_SUPPORTED() 0
#endif
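
The net effect of the hunk above is that detecting a PASID helper (ioasid_get() or mm_pasid_drop()) is no longer sufficient on its own; the kernel must also be built with CONFIG_IOMMU_SVA. A quick summary of the outcomes (the macro names are the real ones, the table itself is only an illustration):

// UVM_CAN_USE_MMU_NOTIFIERS | ioasid_get()/mm_pasid_drop() | CONFIG_IOMMU_SVA | UVM_ATS_SVA_SUPPORTED()
// --------------------------+------------------------------+------------------+------------------------
//             1             |           present            |       set        |           1
//             1             |           present            |      unset       |           0  (new behavior in this change)
//             1             |           absent             |       any        |           0
//             0             |            any               |       any        |           0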

View File

@@ -36,25 +36,25 @@
// and then checked back in. You cannot make changes to these sections without
// corresponding changes to the buildmeister script
#ifndef NV_BUILD_BRANCH
#define NV_BUILD_BRANCH VK535_87
#define NV_BUILD_BRANCH r537_13
#endif
#ifndef NV_PUBLIC_BRANCH
#define NV_PUBLIC_BRANCH VK535_87
#define NV_PUBLIC_BRANCH r537_13
#endif
#if defined(NV_LINUX) || defined(NV_BSD) || defined(NV_SUNOS)
#define NV_BUILD_BRANCH_VERSION "rel/gpu_drv/r535/VK535_87-145"
#define NV_BUILD_CHANGELIST_NUM (33667820)
#define NV_BUILD_BRANCH_VERSION "rel/gpu_drv/r535/r537_13-267"
#define NV_BUILD_CHANGELIST_NUM (33312039)
#define NV_BUILD_TYPE "Official"
#define NV_BUILD_NAME "rel/gpu_drv/r535/VK535_87-145"
#define NV_LAST_OFFICIAL_CHANGELIST_NUM (33667820)
#define NV_BUILD_NAME "rel/gpu_drv/r535/r537_13-267"
#define NV_LAST_OFFICIAL_CHANGELIST_NUM (33312039)
#else /* Windows builds */
#define NV_BUILD_BRANCH_VERSION "VK535_87-22"
#define NV_BUILD_CHANGELIST_NUM (33665505)
#define NV_BUILD_BRANCH_VERSION "r537_13-7"
#define NV_BUILD_CHANGELIST_NUM (33274399)
#define NV_BUILD_TYPE "Official"
#define NV_BUILD_NAME "538.09"
#define NV_LAST_OFFICIAL_CHANGELIST_NUM (33665505)
#define NV_BUILD_NAME "537.39"
#define NV_LAST_OFFICIAL_CHANGELIST_NUM (33274399)
#define NV_BUILD_BRANCH_BASE_VERSION R535
#endif
// End buildmeister python edited section

View File

@@ -4,7 +4,7 @@
#if defined(NV_LINUX) || defined(NV_BSD) || defined(NV_SUNOS) || defined(NV_VMWARE) || defined(NV_QNX) || defined(NV_INTEGRITY) || \
(defined(RMCFG_FEATURE_PLATFORM_GSP) && RMCFG_FEATURE_PLATFORM_GSP == 1)
#define NV_VERSION_STRING "535.43.22"
#define NV_VERSION_STRING "535.104.12"
#else

View File

@@ -20,7 +20,7 @@
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/
#ifndef __gh100_dev_fb_h_
#define __gh100_dev_fb_h_
#define NV_PFB_NISO_FLUSH_SYSMEM_ADDR_SHIFT 8 /* */
@@ -29,4 +29,25 @@
#define NV_PFB_FBHUB_PCIE_FLUSH_SYSMEM_ADDR_HI 0x00100A38 /* RW-4R */
#define NV_PFB_FBHUB_PCIE_FLUSH_SYSMEM_ADDR_HI_ADR 31:0 /* RWIVF */
#define NV_PFB_FBHUB_PCIE_FLUSH_SYSMEM_ADDR_HI_ADR_MASK 0x000FFFFF /* ----V */
#define NV_PFB_PRI_MMU_L2TLB_ECC_UNCORRECTED_ERR_COUNT 0x00100E78 /* RW-4R */
#define NV_PFB_PRI_MMU_L2TLB_ECC_UNCORRECTED_ERR_COUNT 0x00100E78 /* RW-4R */
#define NV_PFB_PRI_MMU_L2TLB_ECC_UNCORRECTED_ERR_COUNT_TOTAL 15:0 /* RWEVF */
#define NV_PFB_PRI_MMU_L2TLB_ECC_UNCORRECTED_ERR_COUNT_TOTAL_INIT 0 /* RWE-V */
#define NV_PFB_PRI_MMU_L2TLB_ECC_UNCORRECTED_ERR_COUNT_UNIQUE 31:16 /* RWEVF */
#define NV_PFB_PRI_MMU_L2TLB_ECC_UNCORRECTED_ERR_COUNT_UNIQUE_INIT 0 /* RWE-V */
#define NV_PFB_PRI_MMU_HUBTLB_ECC_UNCORRECTED_ERR_COUNT 0x00100E8C /* RW-4R */
#define NV_PFB_PRI_MMU_HUBTLB_ECC_UNCORRECTED_ERR_COUNT 0x00100E8C /* RW-4R */
#define NV_PFB_PRI_MMU_HUBTLB_ECC_UNCORRECTED_ERR_COUNT_TOTAL 15:0 /* RWEVF */
#define NV_PFB_PRI_MMU_HUBTLB_ECC_UNCORRECTED_ERR_COUNT_TOTAL_INIT 0 /* RWE-V */
#define NV_PFB_PRI_MMU_HUBTLB_ECC_UNCORRECTED_ERR_COUNT_UNIQUE 31:16 /* RWEVF */
#define NV_PFB_PRI_MMU_HUBTLB_ECC_UNCORRECTED_ERR_COUNT_UNIQUE_INIT 0 /* RWE-V */
#define NV_PFB_PRI_MMU_FILLUNIT_ECC_UNCORRECTED_ERR_COUNT 0x00100EA0 /* RW-4R */
#define NV_PFB_PRI_MMU_FILLUNIT_ECC_UNCORRECTED_ERR_COUNT 0x00100EA0 /* RW-4R */
#define NV_PFB_PRI_MMU_FILLUNIT_ECC_UNCORRECTED_ERR_COUNT_TOTAL 15:0 /* RWEVF */
#define NV_PFB_PRI_MMU_FILLUNIT_ECC_UNCORRECTED_ERR_COUNT_TOTAL_INIT 0 /* RWE-V */
#define NV_PFB_PRI_MMU_FILLUNIT_ECC_UNCORRECTED_ERR_COUNT_UNIQUE 31:16 /* RWEVF */
#define NV_PFB_PRI_MMU_FILLUNIT_ECC_UNCORRECTED_ERR_COUNT_UNIQUE_INIT 0 /* RWE-V */
#endif // __gh100_dev_fb_h_
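
The new ECC counters above pack two 16-bit fields into each 32-bit register: TOTAL in bits 15:0 and UNIQUE in bits 31:16. A small sketch of decoding such a value with the DRF_VAL helper that the ECC-checking code later in this diff uses (the raw value below is made up):

#include "nvmisc.h"   // DRF_VAL (assumed available, as in kern_mem_sys_gh100.c further down)

NvU32 regVal = 0x00030002;   // hypothetical read of NV_PFB_PRI_MMU_L2TLB_ECC_UNCORRECTED_ERR_COUNT

// Field extraction follows the 15:0 / 31:16 layout defined above.
NvU32 total  = DRF_VAL(_PFB_PRI_MMU, _L2TLB_ECC, _UNCORRECTED_ERR_COUNT_TOTAL,  regVal);  // -> 2
NvU32 unique = DRF_VAL(_PFB_PRI_MMU, _L2TLB_ECC, _UNCORRECTED_ERR_COUNT_UNIQUE, regVal);  // -> 3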

View File

@@ -0,0 +1,29 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES
* SPDX-License-Identifier: MIT
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/
#ifndef __gh100_dev_fbpa_h_
#define __gh100_dev_fbpa_h_
#define NV_PFB_FBPA_0_ECC_DED_COUNT__SIZE_1 4 /* */
#define NV_PFB_FBPA_0_ECC_DED_COUNT(i) (0x009025A0+(i)*4) /* RW-4A */
#endif // __gh100_dev_fbpa_h_

View File

@@ -0,0 +1,33 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES
* SPDX-License-Identifier: MIT
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/
#ifndef __gh100_dev_ltc_h_
#define __gh100_dev_ltc_h_
#define NV_PLTCG_LTC0_LTS0_L2_CACHE_ECC_UNCORRECTED_ERR_COUNT 0x001404f8 /* RW-4R */
#define NV_PLTCG_LTC0_LTS0_L2_CACHE_ECC_UNCORRECTED_ERR_COUNT_TOTAL 15:0 /* RWIVF */
#define NV_PLTCG_LTC0_LTS0_L2_CACHE_ECC_UNCORRECTED_ERR_COUNT_TOTAL_INIT 0x0000 /* RWI-V */
#define NV_PLTCG_LTC0_LTS0_L2_CACHE_ECC_UNCORRECTED_ERR_COUNT_UNIQUE 31:16 /* RWIVF */
#define NV_PLTCG_LTC0_LTS0_L2_CACHE_ECC_UNCORRECTED_ERR_COUNT_UNIQUE_INIT 0x0000 /* RWI-V */
#endif // __gh100_dev_ltc_h_

View File

@@ -0,0 +1,52 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES
* SPDX-License-Identifier: MIT
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/
#ifndef __gh100_dev_nv_xpl_h_
#define __gh100_dev_nv_xpl_h_
#define NV_XPL_DL_ERR_COUNT_RBUF 0x00000a54 /* R--4R */
#define NV_XPL_DL_ERR_COUNT_RBUF__PRIV_LEVEL_MASK 0x00000b08 /* */
#define NV_XPL_DL_ERR_COUNT_RBUF_CORR_ERR 15:0 /* R-EVF */
#define NV_XPL_DL_ERR_COUNT_RBUF_CORR_ERR_INIT 0x0000 /* R-E-V */
#define NV_XPL_DL_ERR_COUNT_RBUF_UNCORR_ERR 31:16 /* R-EVF */
#define NV_XPL_DL_ERR_COUNT_RBUF_UNCORR_ERR_INIT 0x0000 /* R-E-V */
#define NV_XPL_DL_ERR_COUNT_SEQ_LUT 0x00000a58 /* R--4R */
#define NV_XPL_DL_ERR_COUNT_SEQ_LUT__PRIV_LEVEL_MASK 0x00000b08 /* */
#define NV_XPL_DL_ERR_COUNT_SEQ_LUT_CORR_ERR 15:0 /* R-EVF */
#define NV_XPL_DL_ERR_COUNT_SEQ_LUT_CORR_ERR_INIT 0x0000 /* R-E-V */
#define NV_XPL_DL_ERR_COUNT_SEQ_LUT_UNCORR_ERR 31:16 /* R-EVF */
#define NV_XPL_DL_ERR_COUNT_SEQ_LUT_UNCORR_ERR_INIT 0x0000 /* R-E-V */
#define NV_XPL_DL_ERR_RESET 0x00000a5c /* RW-4R */
#define NV_XPL_DL_ERR_RESET_RBUF_CORR_ERR_COUNT 0:0 /* RWCVF */
#define NV_XPL_DL_ERR_RESET_RBUF_CORR_ERR_COUNT_DONE 0x0 /* RWC-V */
#define NV_XPL_DL_ERR_RESET_RBUF_CORR_ERR_COUNT_PENDING 0x1 /* -W--T */
#define NV_XPL_DL_ERR_RESET_SEQ_LUT_CORR_ERR_COUNT 1:1 /* RWCVF */
#define NV_XPL_DL_ERR_RESET_SEQ_LUT_CORR_ERR_COUNT_DONE 0x0 /* RWC-V */
#define NV_XPL_DL_ERR_RESET_SEQ_LUT_CORR_ERR_COUNT_PENDING 0x1 /* -W--T */
#define NV_XPL_DL_ERR_RESET_RBUF_UNCORR_ERR_COUNT 16:16 /* RWCVF */
#define NV_XPL_DL_ERR_RESET_RBUF_UNCORR_ERR_COUNT_DONE 0x0 /* RWC-V */
#define NV_XPL_DL_ERR_RESET_RBUF_UNCORR_ERR_COUNT_PENDING 0x1 /* -W--T */
#define NV_XPL_DL_ERR_RESET_SEQ_LUT_UNCORR_ERR_COUNT 17:17 /* RWCVF */
#define NV_XPL_DL_ERR_RESET_SEQ_LUT_UNCORR_ERR_COUNT_DONE 0x0 /* RWC-V */
#define NV_XPL_DL_ERR_RESET_SEQ_LUT_UNCORR_ERR_COUNT_PENDING 0x1 /* -W--T */
#endif // __gh100_dev_nv_xpl_h__

View File

@@ -24,4 +24,7 @@
#ifndef __gh100_dev_xtl_ep_pri_h__
#define __gh100_dev_xtl_ep_pri_h__
#define NV_EP_PCFGM 0x92FFF:0x92000 /* RW--D */
#define NV_XTL_EP_PRI_DED_ERROR_STATUS 0x0000043C /* RW-4R */
#define NV_XTL_EP_PRI_RAM_ERROR_INTR_STATUS 0x000003C8 /* RW-4R */
#endif // __gh100_dev_xtl_ep_pri_h__

View File

@@ -21,3 +21,9 @@
* DEALINGS IN THE SOFTWARE.
*/
#define NV_CHIP_EXTENDED_SYSTEM_PHYSICAL_ADDRESS_BITS 52
#define NV_LTC_PRI_STRIDE 8192
#define NV_LTS_PRI_STRIDE 512
#define NV_FBPA_PRI_STRIDE 16384
#define NV_SCAL_LITTER_NUM_FBPAS 24
#define NV_XPL_BASE_ADDRESS 540672
#define NV_XTL_BASE_ADDRESS 593920
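
These litter constants are the strides that turn the single-unit ECC register offsets (added in the dev_fb.h, dev_fbpa.h, and dev_ltc.h headers above) into per-instance addresses, which is how kmemsysCheckEccCounts_GH100 walks the units later in this diff. A worked example of that address arithmetic, using only values from this series:

// FBPA DED counter j of FBPA instance i: NV_PFB_FBPA_0_ECC_DED_COUNT(j) + i * NV_FBPA_PRI_STRIDE
NvU32 fbpa1Ded0 = 0x009025A0 + (0 * 4) + (1 * 16384);    // = 0x009065A0

// L2 cache ECC counter of LTC 2 / LTS 1, using the 8192 and 512 strides above
NvU32 ltc2Lts1  = 0x001404f8 + (2 * 8192) + (1 * 512);   // = 0x001446F8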

View File

@@ -47,5 +47,17 @@
#define NV_XAL_EP_INTR_0_PRI_RSP_TIMEOUT 3:3
#define NV_XAL_EP_INTR_0_PRI_RSP_TIMEOUT_PENDING 0x1
#define NV_XAL_EP_SCPM_PRI_DUMMY_DATA_PATTERN_INIT 0xbadf0200
#define NV_XAL_EP_REORDER_ECC_UNCORRECTED_ERR_COUNT 0x0010f364 /* RW-4R */
#define NV_XAL_EP_REORDER_ECC_UNCORRECTED_ERR_COUNT_TOTAL 15:0 /* RWIUF */
#define NV_XAL_EP_REORDER_ECC_UNCORRECTED_ERR_COUNT_TOTAL_INIT 0x0000 /* RWI-V */
#define NV_XAL_EP_REORDER_ECC_UNCORRECTED_ERR_COUNT_UNIQUE 31:16 /* RWIUF */
#define NV_XAL_EP_REORDER_ECC_UNCORRECTED_ERR_COUNT_UNIQUE_INIT 0x0000 /* RWI-V */
#define NV_XAL_EP_P2PREQ_ECC_UNCORRECTED_ERR_COUNT 0x0010f37c /* RW-4R */
#define NV_XAL_EP_P2PREQ_ECC_UNCORRECTED_ERR_COUNT_TOTAL 15:0 /* RWIUF */
#define NV_XAL_EP_P2PREQ_ECC_UNCORRECTED_ERR_COUNT_TOTAL_INIT 0x0000 /* RWI-V */
#define NV_XAL_EP_P2PREQ_ECC_UNCORRECTED_ERR_COUNT_UNIQUE 31:16 /* RWIUF */
#define NV_XAL_EP_P2PREQ_ECC_UNCORRECTED_ERR_COUNT_UNIQUE_INIT 0x0000 /* RWI-V */
#endif // __gh100_pri_nv_xal_ep_h__

View File

@@ -1542,6 +1542,12 @@ nvswitch_reset_and_train_link_ls10
nvswitch_execute_unilateral_link_shutdown_ls10(link);
nvswitch_corelib_clear_link_state_ls10(link);
//
// When a link faults there could be a race between the driver requesting
// reset and MINION processing Emergency Shutdown. Minion will notify if
// such a collision happens and will deny the reset request, so try the
// request up to 3 times
//
do
{
status = nvswitch_request_tl_link_state_ls10(link,
@@ -1597,15 +1603,18 @@ nvswitch_reset_and_train_link_ls10
"%s: NvLink Reset has failed for link %d\n",
__FUNCTION__, link->linkNumber);
// Re-register links.
status = nvlink_lib_register_link(device->nvlink_device, link);
if (status != NVL_SUCCESS)
{
nvswitch_destroy_link(link);
return status;
}
return status;
}
status = nvswitch_launch_ALI_link_training(device, link, NV_FALSE);
if (status != NVL_SUCCESS)
{
NVSWITCH_PRINT(device, ERROR,
"%s: NvLink failed to request ACTIVE for link %d\n",
__FUNCTION__, link->linkNumber);
return status;
}
return NVL_SUCCESS;
}
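
The comment added above describes a bounded retry: the TL link-state request is reissued when MINION rejects it because an Emergency Shutdown collided with the reset, up to three attempts. A stand-alone sketch of that pattern follows; every name in it is a placeholder, since the real loop calls nvswitch_request_tl_link_state_ls10() and checks its NvlStatus.

#define DEMO_MAX_RESET_ATTEMPTS 3
#define DEMO_OK        0
#define DEMO_ERR_BUSY  1   // stand-in for "MINION denied the reset request"

static int demoRequestReset(int attempt)
{
    // Pretend the first attempt collides with an Emergency Shutdown.
    return (attempt == 0) ? DEMO_ERR_BUSY : DEMO_OK;
}

static int demoResetWithRetry(void)
{
    int status  = DEMO_ERR_BUSY;
    int attempt = 0;

    do
    {
        status = demoRequestReset(attempt);
        attempt++;
    } while (status == DEMO_ERR_BUSY && attempt < DEMO_MAX_RESET_ATTEMPTS);

    return status;   // DEMO_OK on the second attempt in this toy example
}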

View File

@@ -1345,7 +1345,6 @@ nvswitch_lib_initialize_device
NvU8 link_num;
nvlink_link *link = NULL;
NvBool is_blacklisted_by_os = NV_FALSE;
NvU64 mode;
if (!NVSWITCH_IS_DEVICE_ACCESSIBLE(device))
{
@@ -1508,18 +1507,6 @@ nvswitch_lib_initialize_device
nvswitch_reset_persistent_link_hw_state(device, link_num);
if(_nvswitch_corelib_get_dl_link_mode(link, &mode) != NVL_SUCCESS)
{
NVSWITCH_PRINT(device, ERROR, "%s: nvlipt_lnk_status: Failed to check link mode! LinkId %d\n",
__FUNCTION__, link_num);
}
else if(mode == NVLINK_LINKSTATE_FAULT)
{
NVSWITCH_PRINT(device, INFO, "%s: retraining LinkId %d\n",
__FUNCTION__, link_num);
nvswitch_reset_and_train_link(device, link);
}
}
retval = nvswitch_set_training_mode(device);
@@ -1623,6 +1610,10 @@ nvswitch_lib_post_init_device
)
{
NvlStatus retval;
NvlStatus status;
NvU32 link_num;
NvU64 mode;
nvlink_link *link;
if (!NVSWITCH_IS_DEVICE_INITIALIZED(device))
{
@@ -1634,7 +1625,7 @@ nvswitch_lib_post_init_device
{
return retval;
}
if (nvswitch_is_bios_supported(device))
{
retval = nvswitch_bios_get_image(device);
@@ -1670,6 +1661,41 @@ nvswitch_lib_post_init_device
(void)nvswitch_launch_ALI(device);
}
//
// There is an edge case where a hypervisor may not send same number
// of reset to switch and GPUs, so try to re-train links in fault
// if possible
//
for (link_num=0; link_num < nvswitch_get_num_links(device); link_num++)
{
// Sanity check
if (!nvswitch_is_link_valid(device, link_num))
{
continue;
}
status = nvlink_lib_get_link(device->nvlink_device, link_num, &link);
if (status != NVL_SUCCESS)
{
NVSWITCH_PRINT(device, ERROR, "%s: Failed to get link for LinkId %d\n",
__FUNCTION__, link_num);
continue;
}
// If the link is in fault then re-train
if(_nvswitch_corelib_get_dl_link_mode(link, &mode) != NVL_SUCCESS)
{
NVSWITCH_PRINT(device, ERROR, "%s: nvlipt_lnk_status: Failed to check link mode! LinkId %d\n",
__FUNCTION__, link_num);
}
else if(mode == NVLINK_LINKSTATE_FAULT)
{
NVSWITCH_PRINT(device, INFO, "%s: retraining LinkId %d\n",
__FUNCTION__, link_num);
nvswitch_reset_and_train_link(device, link);
}
}
return NVL_SUCCESS;
}

View File

@@ -121,7 +121,8 @@
#define NVLINK_FLA_PRIV_ERR (137)
#define ROBUST_CHANNEL_DLA_ERROR (138)
#define ROBUST_CHANNEL_FAST_PATH_ERROR (139)
#define ROBUST_CHANNEL_LAST_ERROR (ROBUST_CHANNEL_FAST_PATH_ERROR)
#define UNRECOVERABLE_ECC_ERROR_ESCAPE (140)
#define ROBUST_CHANNEL_LAST_ERROR (UNRECOVERABLE_ECC_ERROR_ESCAPE)
// Indexed CE reference

View File

@@ -1776,7 +1776,7 @@ typedef struct
// - Used for controlling CPU addresses in CUDA's unified CPU+GPU virtual
// address space
// - Only valid on NvRmMapMemory
// - Implemented on Unix but not VMware
// - Only implemented on Linux
#define NVOS33_FLAGS_MAP_FIXED 18:18
#define NVOS33_FLAGS_MAP_FIXED_DISABLE (0x00000000)
#define NVOS33_FLAGS_MAP_FIXED_ENABLE (0x00000001)
@@ -1794,10 +1794,9 @@ typedef struct
// - When combined with MAP_FIXED, this allows the client to exert
// significant control over the CPU heap
// - Used in CUDA's unified CPU+GPU virtual address space
// - Valid in nvRmUnmapMemory
// - Valid on NvRmMapMemory (specifies RM's behavior whenever the
// - Only valid on NvRmMapMemory (specifies RM's behavior whenever the
// mapping is destroyed, regardless of mechanism)
// - Implemented on Unix but not VMware
// - Only implemented on Linux
#define NVOS33_FLAGS_RESERVE_ON_UNMAP 19:19
#define NVOS33_FLAGS_RESERVE_ON_UNMAP_DISABLE (0x00000000)
#define NVOS33_FLAGS_RESERVE_ON_UNMAP_ENABLE (0x00000001)
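
MAP_FIXED and RESERVE_ON_UNMAP are single-bit fields (18:18 and 19:19) in the NvRmMapMemory flags word; combined, they let a client place a mapping at a fixed CPU VA and keep that VA region reserved when the mapping is destroyed. A hedged sketch of composing the two flags with the DRF helpers used elsewhere in this tree (DRF_DEF from nvmisc.h is assumed):

#include "nvos.h"     // NVOS33_FLAGS_* bit-field definitions shown above
#include "nvmisc.h"   // DRF_DEF (assumed available, as elsewhere in RM)

// Fixed-address mapping whose CPU VA stays reserved when the mapping is torn down.
// Equivalent to setting bit 18 and bit 19 of the NvRmMapMemory flags word.
NvU32 mapFlags = DRF_DEF(OS33, _FLAGS, _MAP_FIXED,        _ENABLE) |
                 DRF_DEF(OS33, _FLAGS, _RESERVE_ON_UNMAP, _ENABLE);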

View File

@@ -492,6 +492,17 @@ static void __nvoc_init_funcTable_OBJGPU_1(OBJGPU *pThis) {
pThis->__gpuWriteFunctionConfigRegEx__ = &gpuWriteFunctionConfigRegEx_GM107;
}
// Hal function -- gpuReadVgpuConfigReg
if (( ((chipHal_HalVarIdx >> 5) == 1UL) && ((1UL << (chipHal_HalVarIdx & 0x1f)) & 0x10000000UL) )) /* ChipHal: GH100 */
{
pThis->__gpuReadVgpuConfigReg__ = &gpuReadVgpuConfigReg_GH100;
}
// default
else
{
pThis->__gpuReadVgpuConfigReg__ = &gpuReadVgpuConfigReg_46f6a7;
}
// Hal function -- gpuGetIdInfo
if (( ((chipHal_HalVarIdx >> 5) == 1UL) && ((1UL << (chipHal_HalVarIdx & 0x1f)) & 0x10000000UL) )) /* ChipHal: GH100 */
{

View File

@@ -877,6 +877,7 @@ struct OBJGPU {
NV_STATUS (*__gpuReadFunctionConfigReg__)(struct OBJGPU *, NvU32, NvU32, NvU32 *);
NV_STATUS (*__gpuWriteFunctionConfigReg__)(struct OBJGPU *, NvU32, NvU32, NvU32);
NV_STATUS (*__gpuWriteFunctionConfigRegEx__)(struct OBJGPU *, NvU32, NvU32, NvU32, THREAD_STATE_NODE *);
NV_STATUS (*__gpuReadVgpuConfigReg__)(struct OBJGPU *, NvU32, NvU32 *);
void (*__gpuGetIdInfo__)(struct OBJGPU *);
void (*__gpuHandleSanityCheckRegReadError__)(struct OBJGPU *, NvU32, NvU32);
void (*__gpuHandleSecFault__)(struct OBJGPU *);
@@ -1427,6 +1428,8 @@ NV_STATUS __nvoc_objCreate_OBJGPU(OBJGPU**, Dynamic*, NvU32,
#define gpuWriteFunctionConfigReg_HAL(pGpu, function, reg, data) gpuWriteFunctionConfigReg_DISPATCH(pGpu, function, reg, data)
#define gpuWriteFunctionConfigRegEx(pGpu, function, reg, data, pThreadState) gpuWriteFunctionConfigRegEx_DISPATCH(pGpu, function, reg, data, pThreadState)
#define gpuWriteFunctionConfigRegEx_HAL(pGpu, function, reg, data, pThreadState) gpuWriteFunctionConfigRegEx_DISPATCH(pGpu, function, reg, data, pThreadState)
#define gpuReadVgpuConfigReg(pGpu, index, data) gpuReadVgpuConfigReg_DISPATCH(pGpu, index, data)
#define gpuReadVgpuConfigReg_HAL(pGpu, index, data) gpuReadVgpuConfigReg_DISPATCH(pGpu, index, data)
#define gpuGetIdInfo(pGpu) gpuGetIdInfo_DISPATCH(pGpu)
#define gpuGetIdInfo_HAL(pGpu) gpuGetIdInfo_DISPATCH(pGpu)
#define gpuHandleSanityCheckRegReadError(pGpu, addr, value) gpuHandleSanityCheckRegReadError_DISPATCH(pGpu, addr, value)
@@ -2970,6 +2973,16 @@ static inline NV_STATUS gpuWriteFunctionConfigRegEx_DISPATCH(struct OBJGPU *pGpu
return pGpu->__gpuWriteFunctionConfigRegEx__(pGpu, function, reg, data, pThreadState);
}
NV_STATUS gpuReadVgpuConfigReg_GH100(struct OBJGPU *pGpu, NvU32 index, NvU32 *data);
static inline NV_STATUS gpuReadVgpuConfigReg_46f6a7(struct OBJGPU *pGpu, NvU32 index, NvU32 *data) {
return NV_ERR_NOT_SUPPORTED;
}
static inline NV_STATUS gpuReadVgpuConfigReg_DISPATCH(struct OBJGPU *pGpu, NvU32 index, NvU32 *data) {
return pGpu->__gpuReadVgpuConfigReg__(pGpu, index, data);
}
void gpuGetIdInfo_GM107(struct OBJGPU *pGpu);
void gpuGetIdInfo_GH100(struct OBJGPU *pGpu);

View File

@@ -425,6 +425,28 @@ static void __nvoc_init_funcTable_KernelMemorySystem_1(KernelMemorySystem *pThis
pThis->__kmemsysRemoveAllAtsPeers__ = &kmemsysRemoveAllAtsPeers_GV100;
}
// Hal function -- kmemsysCheckEccCounts
if (( ((chipHal_HalVarIdx >> 5) == 1UL) && ((1UL << (chipHal_HalVarIdx & 0x1f)) & 0x10000000UL) )) /* ChipHal: GH100 */
{
pThis->__kmemsysCheckEccCounts__ = &kmemsysCheckEccCounts_GH100;
}
// default
else
{
pThis->__kmemsysCheckEccCounts__ = &kmemsysCheckEccCounts_b3696a;
}
// Hal function -- kmemsysClearEccCounts
if (( ((chipHal_HalVarIdx >> 5) == 1UL) && ((1UL << (chipHal_HalVarIdx & 0x1f)) & 0x10000000UL) )) /* ChipHal: GH100 */
{
pThis->__kmemsysClearEccCounts__ = &kmemsysClearEccCounts_GH100;
}
// default
else
{
pThis->__kmemsysClearEccCounts__ = &kmemsysClearEccCounts_56cd7a;
}
pThis->__nvoc_base_OBJENGSTATE.__engstateConstructEngine__ = &__nvoc_thunk_KernelMemorySystem_engstateConstructEngine;
pThis->__nvoc_base_OBJENGSTATE.__engstateStateInitLocked__ = &__nvoc_thunk_KernelMemorySystem_engstateStateInitLocked;

View File

@@ -222,6 +222,8 @@ struct KernelMemorySystem {
void (*__kmemsysNumaRemoveAllMemory__)(OBJGPU *, struct KernelMemorySystem *);
NV_STATUS (*__kmemsysSetupAllAtsPeers__)(OBJGPU *, struct KernelMemorySystem *);
void (*__kmemsysRemoveAllAtsPeers__)(OBJGPU *, struct KernelMemorySystem *);
void (*__kmemsysCheckEccCounts__)(OBJGPU *, struct KernelMemorySystem *);
NV_STATUS (*__kmemsysClearEccCounts__)(OBJGPU *, struct KernelMemorySystem *);
NV_STATUS (*__kmemsysStateLoad__)(POBJGPU, struct KernelMemorySystem *, NvU32);
NV_STATUS (*__kmemsysStateUnload__)(POBJGPU, struct KernelMemorySystem *, NvU32);
NV_STATUS (*__kmemsysStatePostUnload__)(POBJGPU, struct KernelMemorySystem *, NvU32);
@@ -323,6 +325,10 @@ NV_STATUS __nvoc_objCreate_KernelMemorySystem(KernelMemorySystem**, Dynamic*, Nv
#define kmemsysSetupAllAtsPeers_HAL(pGpu, pKernelMemorySystem) kmemsysSetupAllAtsPeers_DISPATCH(pGpu, pKernelMemorySystem)
#define kmemsysRemoveAllAtsPeers(pGpu, pKernelMemorySystem) kmemsysRemoveAllAtsPeers_DISPATCH(pGpu, pKernelMemorySystem)
#define kmemsysRemoveAllAtsPeers_HAL(pGpu, pKernelMemorySystem) kmemsysRemoveAllAtsPeers_DISPATCH(pGpu, pKernelMemorySystem)
#define kmemsysCheckEccCounts(pGpu, pKernelMemorySystem) kmemsysCheckEccCounts_DISPATCH(pGpu, pKernelMemorySystem)
#define kmemsysCheckEccCounts_HAL(pGpu, pKernelMemorySystem) kmemsysCheckEccCounts_DISPATCH(pGpu, pKernelMemorySystem)
#define kmemsysClearEccCounts(pGpu, pKernelMemorySystem) kmemsysClearEccCounts_DISPATCH(pGpu, pKernelMemorySystem)
#define kmemsysClearEccCounts_HAL(pGpu, pKernelMemorySystem) kmemsysClearEccCounts_DISPATCH(pGpu, pKernelMemorySystem)
#define kmemsysStateLoad(pGpu, pEngstate, arg0) kmemsysStateLoad_DISPATCH(pGpu, pEngstate, arg0)
#define kmemsysStateUnload(pGpu, pEngstate, arg0) kmemsysStateUnload_DISPATCH(pGpu, pEngstate, arg0)
#define kmemsysStatePostUnload(pGpu, pEngstate, arg0) kmemsysStatePostUnload_DISPATCH(pGpu, pEngstate, arg0)
@@ -733,6 +739,26 @@ static inline void kmemsysRemoveAllAtsPeers_DISPATCH(OBJGPU *pGpu, struct Kernel
pKernelMemorySystem->__kmemsysRemoveAllAtsPeers__(pGpu, pKernelMemorySystem);
}
void kmemsysCheckEccCounts_GH100(OBJGPU *pGpu, struct KernelMemorySystem *pKernelMemorySystem);
static inline void kmemsysCheckEccCounts_b3696a(OBJGPU *pGpu, struct KernelMemorySystem *pKernelMemorySystem) {
return;
}
static inline void kmemsysCheckEccCounts_DISPATCH(OBJGPU *pGpu, struct KernelMemorySystem *pKernelMemorySystem) {
pKernelMemorySystem->__kmemsysCheckEccCounts__(pGpu, pKernelMemorySystem);
}
NV_STATUS kmemsysClearEccCounts_GH100(OBJGPU *pGpu, struct KernelMemorySystem *pKernelMemorySystem);
static inline NV_STATUS kmemsysClearEccCounts_56cd7a(OBJGPU *pGpu, struct KernelMemorySystem *pKernelMemorySystem) {
return NV_OK;
}
static inline NV_STATUS kmemsysClearEccCounts_DISPATCH(OBJGPU *pGpu, struct KernelMemorySystem *pKernelMemorySystem) {
return pKernelMemorySystem->__kmemsysClearEccCounts__(pGpu, pKernelMemorySystem);
}
static inline NV_STATUS kmemsysStateLoad_DISPATCH(POBJGPU pGpu, struct KernelMemorySystem *pEngstate, NvU32 arg0) {
return pEngstate->__kmemsysStateLoad__(pGpu, pEngstate, arg0);
}

View File

@@ -460,6 +460,7 @@ struct MemoryManager {
NvBool bPmaEnabled;
NvBool bPmaInitialized;
NvBool bPmaForcePersistence;
NvBool bPmaAddrTree;
NvBool bClientPageTablesPmaManaged;
NvBool bScanoutSysmem;
NvBool bMixedDensityFbp;
@@ -2149,6 +2150,10 @@ static inline void memmgrSetClientPageTablesPmaManaged(struct MemoryManager *pMe
pMemoryManager->bClientPageTablesPmaManaged = val;
}
static inline NvBool memmgrIsPmaAddrTree(struct MemoryManager *pMemoryManager) {
return pMemoryManager->bPmaAddrTree;
}
static inline NvU64 memmgrGetRsvdMemoryBase(struct MemoryManager *pMemoryManager) {
return pMemoryManager->rsvdMemoryBase;
}

View File

@@ -808,7 +808,6 @@ static const CHIPS_RELEASED sChipsReleased[] = {
{ 0x20B2, 0x147f, 0x10de, "NVIDIA A100-SXM4-80GB" },
{ 0x20B2, 0x1622, 0x10de, "NVIDIA A100-SXM4-80GB" },
{ 0x20B2, 0x1623, 0x10de, "NVIDIA A100-SXM4-80GB" },
{ 0x20B2, 0x1625, 0x10de, "NVIDIA PG509-210" },
{ 0x20B3, 0x14a7, 0x10de, "NVIDIA A100-SXM-64GB" },
{ 0x20B3, 0x14a8, 0x10de, "NVIDIA A100-SXM-64GB" },
{ 0x20B5, 0x1533, 0x10de, "NVIDIA A100 80GB PCIe" },
@@ -816,7 +815,6 @@ static const CHIPS_RELEASED sChipsReleased[] = {
{ 0x20B6, 0x1492, 0x10de, "NVIDIA PG506-232" },
{ 0x20B7, 0x1532, 0x10de, "NVIDIA A30" },
{ 0x20B7, 0x1804, 0x10de, "NVIDIA A30" },
{ 0x20B7, 0x1852, 0x10de, "NVIDIA A30" },
{ 0x20BD, 0x17f4, 0x10de, "NVIDIA A800-SXM4-40GB" },
{ 0x20F1, 0x145f, 0x10de, "NVIDIA A100-PCIE-40GB" },
{ 0x20F3, 0x179b, 0x10de, "NVIDIA A800-SXM4-80GB" },
@@ -901,8 +899,6 @@ static const CHIPS_RELEASED sChipsReleased[] = {
{ 0x2331, 0x1626, 0x10de, "NVIDIA H100 PCIe" },
{ 0x2339, 0x17fc, 0x10de, "NVIDIA H100" },
{ 0x233A, 0x183a, 0x10de, "NVIDIA H800 NVL" },
{ 0x2342, 0x16eb, 0x10de, "NVIDIA GH200 120GB" },
{ 0x2342, 0x1809, 0x10de, "NVIDIA GH200 480GB" },
{ 0x2414, 0x0000, 0x0000, "NVIDIA GeForce RTX 3060 Ti" },
{ 0x2420, 0x0000, 0x0000, "NVIDIA GeForce RTX 3080 Ti Laptop GPU" },
{ 0x2438, 0x0000, 0x0000, "NVIDIA RTX A5500 Laptop GPU" },
@@ -1805,21 +1801,6 @@ static const CHIPS_RELEASED sChipsReleased[] = {
{ 0x2322, 0x17ee, 0x10DE, "NVIDIA H800-40C" },
{ 0x2322, 0x17ef, 0x10DE, "NVIDIA H800-80C" },
{ 0x2322, 0x1845, 0x10DE, "NVIDIA H800-1-20C" },
{ 0x2330, 0x187a, 0x10DE, "NVIDIA H100XM-1-10CME" },
{ 0x2330, 0x187b, 0x10DE, "NVIDIA H100XM-1-10C" },
{ 0x2330, 0x187c, 0x10DE, "NVIDIA H100XM-1-20C" },
{ 0x2330, 0x187d, 0x10DE, "NVIDIA H100XM-2-20C" },
{ 0x2330, 0x187e, 0x10DE, "NVIDIA H100XM-3-40C" },
{ 0x2330, 0x187f, 0x10DE, "NVIDIA H100XM-4-40C" },
{ 0x2330, 0x1880, 0x10DE, "NVIDIA H100XM-7-80C" },
{ 0x2330, 0x1881, 0x10DE, "NVIDIA H100XM-4C" },
{ 0x2330, 0x1882, 0x10DE, "NVIDIA H100XM-5C" },
{ 0x2330, 0x1883, 0x10DE, "NVIDIA H100XM-8C" },
{ 0x2330, 0x1884, 0x10DE, "NVIDIA H100XM-10C" },
{ 0x2330, 0x1885, 0x10DE, "NVIDIA H100XM-16C" },
{ 0x2330, 0x1886, 0x10DE, "NVIDIA H100XM-20C" },
{ 0x2330, 0x1887, 0x10DE, "NVIDIA H100XM-40C" },
{ 0x2330, 0x1888, 0x10DE, "NVIDIA H100XM-80C" },
{ 0x2331, 0x16d3, 0x10DE, "NVIDIA H100-1-10C" },
{ 0x2331, 0x16d4, 0x10DE, "NVIDIA H100-2-20C" },
{ 0x2331, 0x16d5, 0x10DE, "NVIDIA H100-3-40C" },

View File

@@ -0,0 +1,227 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2015-2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: MIT
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/
/*!
* @brief Implement PMA address tree
*
*/
#ifndef ADDRTREE_H
#define ADDRTREE_H
#ifdef __cplusplus
extern "C" {
#endif
#include "map_defines.h"
// Declare this before its definition because it refers to itself
typedef struct addrtree_node ADDRTREE_NODE;
struct addrtree_node
{
NvU32 level; // The level this node belongs to
NvU32 numChildren; // The number of children in the children array
NvU64 frame; // The first frame this node holds
NvU64 state[PMA_BITS_PER_PAGE]; // Tracks the actual state for each map
NvU64 seeChild[PMA_BITS_PER_PAGE]; // Whether this node is partially allocated
// If it is partially allocated, we must go to the children
// to find the correct information.
ADDRTREE_NODE *parent; // The node's parent
ADDRTREE_NODE *children; // Pointer to an array of children
};
typedef struct addrtree_level
{
NvU64 nodeCount; // Count of total number of nodes on this level
ADDRTREE_NODE *pNodeList; // Pointer to the start of the list of nodes on this level
NvU32 pageSizeShift; // Page size this level is tracking
NvU32 maxFramesPerNode; // The max number of this level frames per node
} ADDRTREE_LEVEL;
typedef struct pma_addrtree
{
NvU64 totalFrames; // Total number of 64KB frames being tracked
NvU32 levelCount; // Number of levels in this tree
ADDRTREE_LEVEL *levels; // List of levels in the tree
ADDRTREE_NODE *root; // Start of the node list
NvU64 numPaddingFrames; // Number of 64KB frames needed for padding for alignment
NvU64 frameEvictionsInProcess; // Count of frame evictions in-process
PMA_STATS *pPmaStats; // Point back to the public struct in PMA structure
NvBool bProtected; // The memory segment tracked by this tree is protected (VPR/CPR)
} PMA_ADDRTREE;
/*!
* @brief Initializes the addrtree for PMA uses
*
* Allocates the address tree structure for all the pages being managed in this tree.
* Address Tree implementation will use a default configuration for its own level
* structures.
*
* @param[in] numPages The number of pages being managed in this tree
* @param[in] addrBase The base address of this region. Required for addrtree alignment
* @param[in] pPmaStats Pointer to the PMA-wide stats structure
* @param[in] bProtected The tree tracks pages in protected memory
*
* @return PMA_ADDRTREE Pointer to the addrtree if succeeded, NULL otherwise
*/
void *pmaAddrtreeInit(NvU64 numFrames, NvU64 addrBase, PMA_STATS *pPmaStats, NvBool bProtected);
/*!
* @brief Destroys the addrtree and free the memory
*
* @param[in] pMap The addrtree to destroy
*
* @return void
*/
void pmaAddrtreeDestroy(void *pMap);
/*!
* @brief Get/set number of evicting frames
* Used for sanity checking in PMA layer as well as performance optimization
* for the map layer to scan faster.
*/
NvU64 pmaAddrtreeGetEvictingFrames(void *pMap);
void pmaAddrtreeSetEvictingFrames(void *pMap, NvU64 frameEvictionsInProcess);
/*!
* @brief Scans the addrtree for contiguous space that has the certain status.
*
* @param[in] pMap The addrtree to be scanned
* @param[in] addrBase The base address of this region
* @param[in] rangeStart The start of the restricted range
* @param[in] rangeEnd The end of the restricted range
* @param[in] numPages The number of pages we are scanning for
* @param[out] freeList A list of free frame numbers -- contains only 1 element
* @param[in] pageSize Size of one page
* @param[in] alignment Alignment requested by client
* @param[out] pagesAllocated Number of pages this call allocated
* @param[in] bSkipEvict Whether it's ok to skip the scan for evictable pages
*
* @return NV_OK if succeeded
* @return NV_ERR_IN_USE if found pages that can be evicted
* @return NV_ERR_NO_MEMORY if no available pages could be found
*/
NV_STATUS pmaAddrtreeScanContiguous(
void *pMap, NvU64 addrBase, NvU64 rangeStart, NvU64 rangeEnd,
NvU64 numPages, NvU64 *freelist, NvU64 pageSize, NvU64 alignment,
NvU64 *pagesAllocated, NvBool bSkipEvict, NvBool bReverseAlloc);
NV_STATUS pmaAddrtreeScanDiscontiguous(
void *pMap, NvU64 addrBase, NvU64 rangeStart, NvU64 rangeEnd,
NvU64 numPages, NvU64 *freelist, NvU64 pageSize, NvU64 alignment,
NvU64 *pagesAllocated, NvBool bSkipEvict, NvBool bReverseAlloc);
void pmaAddrtreePrintTree(void *pMap, const char* str);
/*!
* @brief Changes the state & attrib bits specified by mask
*
* Changes the state of the bits given the physical frame number
* TODO: all four interfaces need to be merged from PMA level so we can remove them!
*
* @param[in] pMap The addrtree to change
* @param[in] frameNum The frame number to change
* @param[in] newState The new state to change to
* @param[in] newStateMask Specific bits to write
*
* @return void
*/
void pmaAddrtreeChangeState(void *pMap, NvU64 frameNum, PMA_PAGESTATUS newState);
void pmaAddrtreeChangeStateAttrib(void *pMap, NvU64 frameNum, PMA_PAGESTATUS newState, NvBool writeAttrib);
void pmaAddrtreeChangeStateAttribEx(void *pMap, NvU64 frameNum, PMA_PAGESTATUS newState,PMA_PAGESTATUS newStateMask);
void pmaAddrtreeChangePageStateAttrib(void * pMap, NvU64 startFrame, NvU64 pageSize,
PMA_PAGESTATUS newState, NvBool writeAttrib);
/*!
* @brief Read the page state & attrib bits
*
* Read the state of the page given the physical frame number
*
* @param[in] pMap The addrtree to read
* @param[in] frameNum The frame number to read
* @param[in] readAttrib Read attribute bits as well
*
* @return PAGESTATUS of the frame
*/
PMA_PAGESTATUS pmaAddrtreeRead(void *pMap, NvU64 frameNum, NvBool readAttrib);
/*!
* @brief Gets the total size of specified PMA managed region.
*
* Gets the total size of current PMA managed region in the FB.
*
* @param[in] pMap Pointer to the addrtree for the region
* @param[in] pBytesTotal Pointer that will return total bytes for current region.
*
*/
void pmaAddrtreeGetSize(void *pMap, NvU64 *pBytesTotal);
/*!
* @brief Gets the size of the maximum free chunk of memory in specified region.
*
* Gets the size of the maximum free chunk of memory in the specified PMA managed
* region of the FB.
*
* @param[in] pMap Pointer to the addrtree for the region
* @param[in] pLargestFree Pointer that will return largest free in current region.
*
*/
void pmaAddrtreeGetLargestFree(void *pMap, NvU64 *pLargestFree);
/*!
* @brief Returns the address range that is completely available for eviction.
* - Should be ALLOC_UNPIN.
* In NUMA, OS manages memory and PMA will only track allocated memory in ALLOC_PIN
* and ALLOC_UNPIN state. FREE memory is managed by OS and cannot be tracked by PMA
* and hence PMA cannot consider FREE memory for eviction and can only consider frames
* in known state to PMA or eviction. ALLOC_PIN cannot be evicted and hence only ALLOC_UNPIN
* can be evictable.
*
*
* @param[in] pMap Pointer to the regmap for the region
* @param[in] addrBase Base address of the region
* @param[in] actualSize Size of the eviction range
* @param[in] pageSize Pagesize
* @param[out] evictStart Starting address of the eviction range
* @param[out] evictEnd End address of the eviction range.
*
* Returns:
* - NV_OK If there is evictable range of given size : actualSize
*
* - NV_ERR_NO_MEMORY if no contiguous range is evictable.
*/
NV_STATUS pmaAddrtreeScanContiguousNumaEviction(void *pMap, NvU64 addrBase,
NvLength actualSize, NvU64 pageSize, NvU64 *evictStart, NvU64 *evictEnd);
#ifdef __cplusplus
}
#endif
#endif // ADDRTREE_H
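
Taken together, the prototypes above give the life cycle of an address tree: initialize it with a frame count and base address, query or change page state by frame number, and destroy it. A hedged usage sketch against exactly these signatures; the sizes and base address are made-up values, PMA_STATS and the STATE_* values come from map_defines.h, and in practice PMA drives these entry points through the pMapInfo function pointers wired up in pmaInitialize later in this diff rather than calling them directly.

#include "addrtree.h"      // prototypes shown above
#include "map_defines.h"   // PMA_STATS, PMA_PAGESTATUS, STATE_* (assumed)

void demoAddrtreeLifecycle(void)
{
    PMA_STATS stats = {0};

    // Track 32768 x 64KB frames (2GB) starting at a made-up base address.
    void *pMap = pmaAddrtreeInit(32768, 0x100000000ULL, &stats, NV_FALSE);
    if (pMap == NULL)
        return;

    NvU64 bytesTotal = 0;
    pmaAddrtreeGetSize(pMap, &bytesTotal);            // expect 32768 * 64KB = 2GB

    // Read frame 0 without attribute bits, then mark it allocated-pinned.
    PMA_PAGESTATUS st = pmaAddrtreeRead(pMap, 0, NV_FALSE);
    (void)st;
    pmaAddrtreeChangeState(pMap, 0, STATE_PIN);

    pmaAddrtreeDestroy(pMap);
}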

View File

@@ -1,5 +1,5 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2015-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-FileCopyrightText: Copyright (c) 2015-2021 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: MIT
*
* Permission is hereby granted, free of charge, to any person obtaining a
@@ -22,7 +22,7 @@
*/
/*!
* @brief Contains common defines for regmap
* @brief Contains common defines between addrtree and regmap
*/
#ifndef MAP_DEFINES_H

View File

@@ -42,6 +42,7 @@
#include "nvport/nvport.h"
#include "regmap.h"
#include "addrtree.h"
#include "nvmisc.h"
#if defined(SRT_BUILD)
@@ -71,7 +72,7 @@ typedef struct SCRUB_NODE SCRUB_NODE;
#define PMA_INIT_NUMA NVBIT(2)
#define PMA_INIT_INTERNAL NVBIT(3) // Used after heap is removed
#define PMA_INIT_FORCE_PERSISTENCE NVBIT(4)
// unused
#define PMA_INIT_ADDRTREE NVBIT(5)
#define PMA_INIT_NUMA_AUTO_ONLINE NVBIT(6)
// These flags are used for querying PMA's config and/or state.

View File

@@ -83,11 +83,10 @@ MAKE_LIST(PoolPageHandleList, POOLALLOC_HANDLE);
/*!
* @brief Callback function to upstream allocators for allocating new pages
*
* This function can allocate multiple pages at a time
* This function only allocate 1 page at a time right now
*
* @param[in] ctxPtr Provides context to upstream allocator
* @param[in] pageSize Size of page to ask for from upstream
* @param[in] numPages Number of pages to allocate
* @param[in] pageSize Not really needed. For debugging only
* @param[out] pPage The output page handle from upstream
*
* @return NV_OK if successfully allocated NvF32 totalTest, doneTest, failTest; the page
@@ -97,7 +96,7 @@ MAKE_LIST(PoolPageHandleList, POOLALLOC_HANDLE);
*
*/
typedef NV_STATUS (*allocCallback_t)(void *ctxPtr, NvU64 pageSize,
NvU64 numPages, POOLALLOC_HANDLE *pPage);
POOLALLOC_HANDLE *pPage);
/*!
* @brief Callback function to upstream allocators for freeing unused pages
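
The doc change above narrows allocCallback_t: upstream allocation callbacks now hand back a single page per call, with pageSize kept mostly for debugging. A hedged stub matching the new signature; the context structure and its field are invented for illustration, and the header declaring POOLALLOC_HANDLE, NV_STATUS, and NvU64 is assumed to be included.

// Hypothetical upstream allocator context; not a real RM structure.
typedef struct
{
    NvU64 basePhysAddr;    // imaginary pool cursor, illustration only
} DEMO_UPSTREAM_CTX;

// Matches the updated allocCallback_t shape: one POOLALLOC_HANDLE filled per call.
static NV_STATUS demoAllocPageCallback(void *ctxPtr, NvU64 pageSize,
                                       POOLALLOC_HANDLE *pPage)
{
    DEMO_UPSTREAM_CTX *pCtx = (DEMO_UPSTREAM_CTX *)ctxPtr;

    if (pCtx == NULL || pPage == NULL)
        return NV_ERR_INVALID_ARGUMENT;

    //
    // POOLALLOC_HANDLE's members are not visible in this hunk; a real callback
    // would record the upstream allocation in *pPage here. Per the updated
    // comment above, pageSize is mainly useful for debugging.
    //
    pCtx->basePhysAddr += pageSize;

    return NV_OK;
}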

View File

@@ -826,6 +826,13 @@
//
// Type DWORD
// Controls enable of Address Tree memory tracking instead of regmap
// for the PMA memory manager.
//
#define NV_REG_STR_RM_ENABLE_ADDRTREE "RMEnableAddrtree"
#define NV_REG_STR_RM_ENABLE_ADDRTREE_YES (0x00000001)
#define NV_REG_STR_RM_ENABLE_ADDRTREE_NO (0x00000000)
#define NV_REG_STR_RM_SCRUB_BLOCK_SHIFT "RMScrubBlockShift"
// Type DWORD
// Encoding Numeric Value

View File

@@ -32,6 +32,7 @@
#include "published/hopper/gh100/dev_pmc.h"
#include "published/hopper/gh100/dev_xtl_ep_pcfg_gpu.h"
#include "published/hopper/gh100/pri_nv_xal_ep.h"
#include "published/hopper/gh100/dev_xtl_ep_pri.h"
#include "ctrl/ctrl2080/ctrl2080mc.h"
@@ -77,6 +78,28 @@ gpuReadBusConfigReg_GH100
return gpuReadBusConfigCycle(pGpu, index, pData);
}
/*!
* @brief Read the non-private registers on vGPU through mirror space
*
* @param[in] pGpu GPU object pointer
* @param[in] index Register offset in PCIe config space
* @param[out] pData Value of the register
*
* @returns NV_OK on success
*/
NV_STATUS
gpuReadVgpuConfigReg_GH100
(
OBJGPU *pGpu,
NvU32 index,
NvU32 *pData
)
{
*pData = GPU_REG_RD32(pGpu, DEVICE_BASE(NV_EP_PCFGM) + index);
return NV_OK;
}
/*!
* @brief Get GPU ID based on PCIE config reads.
* Also determine other properties of the PCIE capabilities.

View File

@@ -4941,12 +4941,19 @@ gpuReadBusConfigCycle_IMPL
NvU8 device = gpuGetDevice(pGpu);
NvU8 function = 0;
if (pGpu->hPci == NULL)
if (IS_PASSTHRU(pGpu))
{
pGpu->hPci = osPciInitHandle(domain, bus, device, function, NULL, NULL);
gpuReadVgpuConfigReg_HAL(pGpu, index, pData);
}
else
{
if (pGpu->hPci == NULL)
{
pGpu->hPci = osPciInitHandle(domain, bus, device, function, NULL, NULL);
}
*pData = osPciReadDword(pGpu->hPci, index);
*pData = osPciReadDword(pGpu->hPci, index);
}
return NV_OK;
}

View File

@@ -29,6 +29,7 @@
#include "gpu/conf_compute/conf_compute.h"
#include "gpu/fsp/kern_fsp.h"
#include "gpu/gsp/kernel_gsp.h"
#include "gpu/mem_sys/kern_mem_sys.h"
#include "gsp/gspifpub.h"
#include "vgpu/rpc.h"
@@ -523,6 +524,7 @@ kgspBootstrapRiscvOSEarly_GH100
{
KernelFalcon *pKernelFalcon = staticCast(pKernelGsp, KernelFalcon);
KernelFsp *pKernelFsp = GPU_GET_KERNEL_FSP(pGpu);
KernelMemorySystem *pKernelMemorySystem = GPU_GET_KERNEL_MEMORY_SYSTEM(pGpu);
NV_STATUS status = NV_OK;
// Only for GSP client builds
@@ -532,8 +534,16 @@ kgspBootstrapRiscvOSEarly_GH100
return NV_ERR_NOT_SUPPORTED;
}
// Clear ECC errors before attempting to load GSP
status = kmemsysClearEccCounts_HAL(pGpu, pKernelMemorySystem);
if (status != NV_OK)
{
NV_PRINTF(LEVEL_ERROR, "Issue clearing ECC counts! Status:0x%x\n", status);
}
// Setup the descriptors that GSP-FMC needs to boot GSP-RM
NV_ASSERT_OK_OR_RETURN(kgspSetupGspFmcArgs_HAL(pGpu, pKernelGsp, pGspFw));
NV_CHECK_OK_OR_GOTO(status, LEVEL_ERROR,
kgspSetupGspFmcArgs_HAL(pGpu, pKernelGsp, pGspFw), exit);
kgspSetupLibosInitArgs(pGpu, pKernelGsp);
@@ -562,7 +572,8 @@ kgspBootstrapRiscvOSEarly_GH100
{
NV_PRINTF(LEVEL_NOTICE, "Starting to boot GSP via FSP.\n");
pKernelFsp->setProperty(pKernelFsp, PDB_PROP_KFSP_GSP_MODE_GSPRM, NV_TRUE);
NV_ASSERT_OK_OR_RETURN(kfspSendBootCommands_HAL(pGpu, pKernelFsp));
NV_CHECK_OK_OR_GOTO(status, LEVEL_ERROR,
kfspSendBootCommands_HAL(pGpu, pKernelFsp), exit);
}
else
{
@@ -585,7 +596,7 @@ kgspBootstrapRiscvOSEarly_GH100
kfspDumpDebugState_HAL(pGpu, pKernelFsp);
}
return status;
goto exit;
}
}
@@ -606,7 +617,7 @@ kgspBootstrapRiscvOSEarly_GH100
kflcnRegRead_HAL(pGpu, pKernelFalcon, NV_PFALCON_FALCON_MAILBOX0));
NV_PRINTF(LEVEL_ERROR, "NV_PGSP_FALCON_MAILBOX1 = 0x%x\n",
kflcnRegRead_HAL(pGpu, pKernelFalcon, NV_PFALCON_FALCON_MAILBOX1));
return status;
goto exit;
}
// Start polling for libos logs now that lockdown is released
@@ -640,6 +651,11 @@ kgspBootstrapRiscvOSEarly_GH100
NV_PRINTF(LEVEL_INFO, "GSP FW RM ready.\n");
exit:
// If GSP fails to boot, check if there's any DED error.
if (status != NV_OK)
{
kmemsysCheckEccCounts_HAL(pGpu, pKernelMemorySystem);
}
NV_ASSERT(status == NV_OK);
return status;

View File

@@ -799,7 +799,7 @@ kgspHealthCheck_TU102
objDelete(pReport);
}
return bHealthy;
goto exit_health_check;
}
NvU32 mb0 = GPU_REG_RD32(pGpu, NV_PGSP_MAILBOX(0));
@@ -845,6 +845,12 @@ kgspHealthCheck_TU102
"********************************************************************************\n");
}
exit_health_check:
if (!bHealthy)
{
KernelMemorySystem *pKernelMemorySystem = GPU_GET_KERNEL_MEMORY_SYSTEM(pGpu);
kmemsysCheckEccCounts_HAL(pGpu, pKernelMemorySystem);
}
return bHealthy;
}

View File

@@ -251,6 +251,19 @@ _memmgrInitRegistryOverrides(OBJGPU *pGpu, MemoryManager *pMemoryManager)
}
}
if (osReadRegistryDword(pGpu, NV_REG_STR_RM_ENABLE_ADDRTREE, &data32) == NV_OK)
{
if (data32 == NV_REG_STR_RM_ENABLE_ADDRTREE_YES)
{
pMemoryManager->bPmaAddrTree = NV_TRUE;
NV_PRINTF(LEVEL_ERROR, "Enabled address tree for PMA via regkey.\n");
}
}
else if (RMCFG_FEATURE_PLATFORM_MODS)
{
pMemoryManager->bPmaAddrTree = NV_TRUE;
NV_PRINTF(LEVEL_ERROR, "Enabled address tree for PMA for MODS.\n");
}
}
NV_STATUS
@@ -2764,6 +2777,11 @@ memmgrPmaInitialize_IMPL
pmaInitFlags |= PMA_INIT_NUMA_AUTO_ONLINE;
}
if (memmgrIsPmaAddrTree(pMemoryManager))
{
pmaInitFlags |= PMA_INIT_ADDRTREE;
}
status = pmaInitialize(pPma, pmaInitFlags);
if (status != NV_OK)
{

File diff suppressed because it is too large.


View File

@@ -240,6 +240,26 @@ pmaInitialize(PMA *pPma, NvU32 initFlags)
pPma->bNuma = !!(initFlags & PMA_INIT_NUMA);
pPma->bNumaAutoOnline = !!(initFlags & PMA_INIT_NUMA_AUTO_ONLINE);
// If we want to run with address tree instead of regmap
if (initFlags & PMA_INIT_ADDRTREE)
{
pMapInfo->pmaMapInit = pmaAddrtreeInit;
pMapInfo->pmaMapDestroy = pmaAddrtreeDestroy;
pMapInfo->pmaMapChangeState = pmaAddrtreeChangeState;
pMapInfo->pmaMapChangeStateAttrib = pmaAddrtreeChangeStateAttrib;
pMapInfo->pmaMapChangeStateAttribEx = pmaAddrtreeChangeStateAttribEx;
pMapInfo->pmaMapChangePageStateAttrib = pmaAddrtreeChangePageStateAttrib;
pMapInfo->pmaMapRead = pmaAddrtreeRead;
pMapInfo->pmaMapScanContiguous = pmaAddrtreeScanContiguous;
pMapInfo->pmaMapScanDiscontiguous = pmaAddrtreeScanDiscontiguous;
pMapInfo->pmaMapGetSize = pmaAddrtreeGetSize;
pMapInfo->pmaMapGetLargestFree = pmaAddrtreeGetLargestFree;
pMapInfo->pmaMapScanContiguousNumaEviction = pmaAddrtreeScanContiguousNumaEviction;
pMapInfo->pmaMapGetEvictingFrames = pmaAddrtreeGetEvictingFrames;
pMapInfo->pmaMapSetEvictingFrames = pmaAddrtreeSetEvictingFrames;
NV_PRINTF(LEVEL_WARNING, "Going to use addrtree for PMA init!!\n");
}
}
pPma->pMapInfo = pMapInfo;
@@ -569,12 +589,22 @@ pmaAllocatePages
const NvU64 numFramesToAllocateTotal = framesPerPage * allocationCount;
NV_CHECK_OR_RETURN(LEVEL_ERROR, pPma != NULL, NV_ERR_INVALID_ARGUMENT);
NV_CHECK_OR_RETURN(LEVEL_ERROR, pPages != NULL, NV_ERR_INVALID_ARGUMENT);
NV_CHECK_OR_RETURN(LEVEL_ERROR, allocationCount != 0, NV_ERR_INVALID_ARGUMENT);
NV_CHECK_OR_RETURN(LEVEL_ERROR, allocationOptions != NULL, NV_ERR_INVALID_ARGUMENT);
NV_CHECK_OR_RETURN(LEVEL_ERROR, portUtilIsPowerOfTwo(pageSize), NV_ERR_INVALID_ARGUMENT);
NV_CHECK_OR_RETURN(LEVEL_ERROR, pageSize >= _PMA_64KB, NV_ERR_INVALID_ARGUMENT);
if (pPma == NULL || pPages == NULL || allocationCount == 0
|| (pageSize != _PMA_64KB && pageSize != _PMA_128KB && pageSize != _PMA_2MB && pageSize != _PMA_512MB)
|| allocationOptions == NULL)
{
if (pPma == NULL)
NV_PRINTF(LEVEL_ERROR, "NULL PMA object\n");
if (pPages == NULL)
NV_PRINTF(LEVEL_ERROR, "NULL page list pointer\n");
if (allocationCount == 0)
NV_PRINTF(LEVEL_ERROR, "count == 0\n");
if (pageSize != _PMA_64KB && pageSize != _PMA_128KB && pageSize != _PMA_2MB && pageSize != _PMA_512MB)
NV_PRINTF(LEVEL_ERROR, "pageSize=0x%llx (not 64K, 128K, 2M, or 512M)\n", pageSize);
if (allocationOptions == NULL)
NV_PRINTF(LEVEL_ERROR, "NULL allocationOptions\n");
return NV_ERR_INVALID_ARGUMENT;
}
flags = allocationOptions->flags;
evictFlag = !(flags & PMA_ALLOCATE_DONT_EVICT);
@@ -643,14 +673,23 @@ pmaAllocatePages
//
if (alignFlag)
{
NV_CHECK_OR_RETURN(LEVEL_ERROR,
portUtilIsPowerOfTwo(allocationOptions->alignment),
NV_ERR_INVALID_ARGUMENT);
NV_CHECK_OR_RETURN(LEVEL_ERROR,
allocationOptions->alignment >= _PMA_64KB,
NV_ERR_INVALID_ARGUMENT);
if (!NV_IS_ALIGNED(allocationOptions->alignment, _PMA_64KB) ||
!portUtilIsPowerOfTwo(allocationOptions->alignment))
{
NV_PRINTF(LEVEL_WARNING,
"alignment [%llx] is not aligned to 64KB or is not power of two.",
alignment);
return NV_ERR_INVALID_ARGUMENT;
}
alignment = allocationOptions->alignment;
alignment = NV_MAX(pageSize, allocationOptions->alignment);
if (!contigFlag && alignment > pageSize)
{
NV_PRINTF(LEVEL_WARNING,
"alignment [%llx] larger than the pageSize [%llx] not supported for non-contiguous allocs\n",
alignment, pageSize);
return NV_ERR_INVALID_ARGUMENT;
}
}
pinOption = pinFlag ? STATE_PIN : STATE_UNPIN;
@@ -797,6 +836,15 @@ pmaAllocatePages_retry:
curPages += numPagesAllocatedThisTime;
numPagesLeftToAllocate -= numPagesAllocatedThisTime;
//
// PMA must currently catch addrtree shortcomings and fail the request
// Just follow the no memory path for now to properly release locks
//
if (status == NV_ERR_INVALID_ARGUMENT)
{
status = NV_ERR_NO_MEMORY;
}
if (status == NV_ERR_IN_USE && !tryEvict)
{
//
@@ -1145,6 +1193,13 @@ pmaAllocatePagesBroadcast
)
{
if (pPma == NULL || pmaCount == 0 || allocationCount == 0
|| (pageSize != _PMA_64KB && pageSize != _PMA_128KB && pageSize != _PMA_2MB && pageSize != _PMA_512MB)
|| pPages == NULL)
{
return NV_ERR_INVALID_ARGUMENT;
}
return NV_ERR_GENERIC;
}
@@ -1163,9 +1218,11 @@ pmaPinPages
PMA_PAGESTATUS state;
framesPerPage = (NvU32)(pageSize >> PMA_PAGE_SHIFT);
NV_CHECK_OR_RETURN(LEVEL_ERROR,
(pPma != NULL) && (pageCount != 0) && (pPages != NULL),
NV_ERR_INVALID_ARGUMENT);
if (pPma == NULL || pageCount == 0 || pPages == NULL
|| (pageSize != _PMA_64KB && pageSize != _PMA_128KB && pageSize != _PMA_2MB && pageSize != _PMA_512MB))
{
return NV_ERR_INVALID_ARGUMENT;
}
portSyncSpinlockAcquire(pPma->pPmaLock);
@@ -1294,6 +1351,14 @@ pmaFreePages
NV_ASSERT(pageCount != 0);
NV_ASSERT(pPages != NULL);
if (pageCount != 1)
{
NV_ASSERT((size == _PMA_64KB) ||
(size == _PMA_128KB) ||
(size == _PMA_2MB) ||
(size == _PMA_512MB));
}
// Fork out new code path for NUMA sub-allocation from OS
if (pPma->bNuma)
{

@@ -1074,6 +1074,8 @@ pmaRegmapScanDiscontiguous
PMA_PAGESTATUS startStatus, endStatus;
PMA_REGMAP *pRegmap = (PMA_REGMAP *)pMap;
NV_ASSERT(alignment == pageSize);
framesPerPage = pageSize >> PMA_PAGE_SHIFT;
//

@@ -1,5 +1,5 @@
/*
* SPDX-FileCopyrightText: Copyright (c) 2021-2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-FileCopyrightText: Copyright (c) 2021-2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: MIT
*
* Permission is hereby granted, free of charge, to any person obtaining a
@@ -23,15 +23,24 @@
#include "core/core.h"
#include "gpu/gpu.h"
#include "nvtypes.h"
#include "os/os.h"
#include "kernel/gpu/mem_sys/kern_mem_sys.h"
#include "gpu/mem_mgr/mem_desc.h"
#include "gpu/bus/kern_bus.h"
#include "kernel/gpu/intr/intr.h"
#include "nverror.h"
#include "published/hopper/gh100/dev_fb.h"
#include "published/hopper/gh100/dev_ltc.h"
#include "published/hopper/gh100/dev_fbpa.h"
#include "published/hopper/gh100/dev_vm.h"
#include "published/hopper/gh100/pri_nv_xal_ep.h"
#include "published/hopper/gh100/dev_nv_xal_addendum.h"
#include "published/hopper/gh100/dev_nv_xpl.h"
#include "published/hopper/gh100/dev_xtl_ep_pri.h"
#include "published/hopper/gh100/hwproject.h"
#include "published/ampere/ga100/dev_fb.h"
NV_STATUS
kmemsysDoCacheOp_GH100
@@ -566,3 +575,168 @@ kmemsysSwizzIdToVmmuSegmentsRange_GH100
return NV_OK;
}
/*!
* Utility function used to read registers and ignore PRI errors
*/
static NvU32
_kmemsysReadRegAndMaskPriError
(
OBJGPU *pGpu,
NvU32 regAddr
)
{
NvU32 regVal;
regVal = osGpuReadReg032(pGpu, regAddr);
if ((regVal & GPU_READ_PRI_ERROR_MASK) == GPU_READ_PRI_ERROR_CODE)
{
return 0;
}
return regVal;
}
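A usage note on the helper above (hedged; the specific register is just one of the reads performed below): every count register is read through this wrapper so that a floorswept or otherwise inaccessible unit, which returns the PRI error pattern, contributes zero to the totals, and the count field is then extracted with DRF_VAL:

    /* Illustrative only: masked read followed by field extraction, as done below. */
    NvU32 raw   = _kmemsysReadRegAndMaskPriError(pGpu, NV_PFB_PRI_MMU_L2TLB_ECC_UNCORRECTED_ERR_COUNT);
    NvU32 count = DRF_VAL(_PFB_PRI_MMU, _L2TLB_ECC, _UNCORRECTED_ERR_COUNT_UNIQUE, raw);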
/*
* @brief Function that checks whether an ECC error has occurred by reading
* various count and interrupt registers. This function is not
* floorsweeping-aware, so PRI errors are ignored.
*/
void
kmemsysCheckEccCounts_GH100
(
OBJGPU *pGpu,
KernelMemorySystem *pKernelMemorySystem
)
{
NvU32 dramCount = 0;
NvU32 mmuCount = 0;
NvU32 ltcCount = 0;
NvU32 pcieCount = 0;
NvU32 regVal;
for (NvU32 i = 0; i < NV_SCAL_LITTER_NUM_FBPAS; i++)
{
for (NvU32 j = 0; j < NV_PFB_FBPA_0_ECC_DED_COUNT__SIZE_1; j++)
{
// DRAM count read
dramCount += _kmemsysReadRegAndMaskPriError(pGpu, NV_PFB_FBPA_0_ECC_DED_COUNT(j) + (i * NV_FBPA_PRI_STRIDE));
// LTC count read
regVal = _kmemsysReadRegAndMaskPriError(pGpu, NV_PLTCG_LTC0_LTS0_L2_CACHE_ECC_UNCORRECTED_ERR_COUNT +
(i * NV_LTC_PRI_STRIDE) + (j * NV_LTS_PRI_STRIDE));
ltcCount += DRF_VAL(_PLTCG_LTC0_LTS0, _L2_CACHE_ECC, _UNCORRECTED_ERR_COUNT_UNIQUE, regVal);
}
}
// L2TLB
regVal = _kmemsysReadRegAndMaskPriError(pGpu, NV_PFB_PRI_MMU_L2TLB_ECC_UNCORRECTED_ERR_COUNT);
mmuCount += DRF_VAL(_PFB_PRI_MMU, _L2TLB_ECC, _UNCORRECTED_ERR_COUNT_UNIQUE, regVal);
// HUBTLB
regVal = _kmemsysReadRegAndMaskPriError(pGpu, NV_PFB_PRI_MMU_HUBTLB_ECC_UNCORRECTED_ERR_COUNT);
mmuCount += DRF_VAL(_PFB_PRI_MMU, _HUBTLB_ECC, _UNCORRECTED_ERR_COUNT_UNIQUE, regVal);
// FILLUNIT
regVal = _kmemsysReadRegAndMaskPriError(pGpu, NV_PFB_PRI_MMU_FILLUNIT_ECC_UNCORRECTED_ERR_COUNT);
mmuCount += DRF_VAL(_PFB_PRI_MMU, _FILLUNIT_ECC, _UNCORRECTED_ERR_COUNT_UNIQUE, regVal);
// PCIE RBUF
regVal = _kmemsysReadRegAndMaskPriError(pGpu, NV_XPL_BASE_ADDRESS + NV_XPL_DL_ERR_COUNT_RBUF);
pcieCount += DRF_VAL(_XPL_DL, _ERR_COUNT_RBUF, _UNCORR_ERR, regVal);
// PCIE SEQ_LUT
regVal = _kmemsysReadRegAndMaskPriError(pGpu, NV_XPL_BASE_ADDRESS + NV_XPL_DL_ERR_COUNT_SEQ_LUT);
pcieCount += DRF_VAL(_XPL_DL, _ERR_COUNT_SEQ_LUT, _UNCORR_ERR, regVal);
// PCIE RE ORDER
regVal = _kmemsysReadRegAndMaskPriError(pGpu, NV_XAL_EP_REORDER_ECC_UNCORRECTED_ERR_COUNT);
pcieCount += DRF_VAL(_XAL_EP, _REORDER_ECC, _UNCORRECTED_ERR_COUNT_UNIQUE, regVal);
// PCIE P2PREQ
regVal = _kmemsysReadRegAndMaskPriError(pGpu, NV_XAL_EP_P2PREQ_ECC_UNCORRECTED_ERR_COUNT);
pcieCount += DRF_VAL(_XAL_EP, _P2PREQ_ECC, _UNCORRECTED_ERR_COUNT_UNIQUE, regVal);
// PCIE XTL DED error status
regVal = _kmemsysReadRegAndMaskPriError(pGpu, NV_XTL_BASE_ADDRESS + NV_XTL_EP_PRI_DED_ERROR_STATUS);
if (regVal != 0)
{
pcieCount += 1;
}
// PCIE XTL RAM error interrupt status
regVal = _kmemsysReadRegAndMaskPriError(pGpu, NV_XTL_BASE_ADDRESS + NV_XTL_EP_PRI_RAM_ERROR_INTR_STATUS);
if (regVal != 0)
{
pcieCount += 1;
}
// If counts > 0 or if poison interrupt pending, ECC error has occurred.
if (((dramCount + ltcCount + mmuCount + pcieCount) != 0) ||
intrIsVectorPending_HAL(pGpu, GPU_GET_INTR(pGpu), NV_PFB_FBHUB_POISON_INTR_VECTOR_HW_INIT, NULL))
{
nvErrorLog_va((void *)pGpu, UNRECOVERABLE_ECC_ERROR_ESCAPE,
"An uncorrectable ECC error detected "
"(possible firmware handling failure) "
"DRAM:%d, LTC:%d, MMU:%d, PCIE:%d", dramCount, ltcCount, mmuCount, pcieCount);
}
}
/*
* @brief Function that clears ECC error count registers.
*/
NV_STATUS
kmemsysClearEccCounts_GH100
(
OBJGPU *pGpu,
KernelMemorySystem *pKernelMemorySystem
)
{
NvU32 regVal = 0;
RMTIMEOUT timeout;
NV_STATUS status = NV_OK;
gpuClearFbhubPoisonIntrForBug2924523_HAL(pGpu);
for (NvU32 i = 0; i < NV_SCAL_LITTER_NUM_FBPAS; i++)
{
for (NvU32 j = 0; j < NV_PFB_FBPA_0_ECC_DED_COUNT__SIZE_1; j++)
{
osGpuWriteReg032(pGpu, NV_PFB_FBPA_0_ECC_DED_COUNT(j) + (i * NV_FBPA_PRI_STRIDE), 0);
osGpuWriteReg032(pGpu, NV_PLTCG_LTC0_LTS0_L2_CACHE_ECC_UNCORRECTED_ERR_COUNT + (i * NV_LTC_PRI_STRIDE) + (j * NV_LTS_PRI_STRIDE), 0);
}
}
// Reset MMU counts
osGpuWriteReg032(pGpu, NV_PFB_PRI_MMU_L2TLB_ECC_UNCORRECTED_ERR_COUNT, 0);
osGpuWriteReg032(pGpu, NV_PFB_PRI_MMU_HUBTLB_ECC_UNCORRECTED_ERR_COUNT, 0);
osGpuWriteReg032(pGpu, NV_PFB_PRI_MMU_FILLUNIT_ECC_UNCORRECTED_ERR_COUNT, 0);
// Reset XAL-EP counts
osGpuWriteReg032(pGpu, NV_XAL_EP_REORDER_ECC_UNCORRECTED_ERR_COUNT, 0);
osGpuWriteReg032(pGpu, NV_XAL_EP_P2PREQ_ECC_UNCORRECTED_ERR_COUNT, 0);
// Reset XTL-EP status registers
osGpuWriteReg032(pGpu, NV_XTL_BASE_ADDRESS + NV_XTL_EP_PRI_DED_ERROR_STATUS, ~0);
osGpuWriteReg032(pGpu, NV_XTL_BASE_ADDRESS + NV_XTL_EP_PRI_RAM_ERROR_INTR_STATUS, ~0);
// Reset XPL-EP error counters
regVal = DRF_DEF(_XPL, _DL_ERR_RESET, _RBUF_UNCORR_ERR_COUNT, _PENDING) |
DRF_DEF(_XPL, _DL_ERR_RESET, _SEQ_LUT_UNCORR_ERR_COUNT, _PENDING);
osGpuWriteReg032(pGpu, NV_XPL_BASE_ADDRESS + NV_XPL_DL_ERR_RESET, regVal);
// Wait for the error counter reset to complete
gpuSetTimeout(pGpu, GPU_TIMEOUT_DEFAULT, &timeout, 0);
for (;;)
{
status = gpuCheckTimeout(pGpu, &timeout);
regVal = osGpuReadReg032(pGpu, NV_XPL_BASE_ADDRESS + NV_XPL_DL_ERR_RESET);
if (FLD_TEST_DRF(_XPL, _DL_ERR_RESET, _RBUF_UNCORR_ERR_COUNT, _DONE, regVal) &&
FLD_TEST_DRF(_XPL, _DL_ERR_RESET, _SEQ_LUT_UNCORR_ERR_COUNT, _DONE, regVal))
break;
if (status != NV_OK)
return status;
}
return NV_OK;
}

@@ -1,4 +1,4 @@
/*
/*
* SPDX-FileCopyrightText: Copyright (c) 2016-2022 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
* SPDX-License-Identifier: MIT
*
@@ -193,7 +193,7 @@ struct RM_POOL_ALLOC_MEM_RESERVE_INFO
* pool.
*
* @param[in] pCtx Context for upstream allocator.
* @param[in] pageSize Page size to use when allocating from PMA
* @param[in] pageSize Only for debugging.
* @param[in] pPage Output page handle from upstream.
*
* @return NV_STATUS
@@ -203,23 +203,20 @@ allocUpstreamTopPool
(
void *pCtx,
NvU64 pageSize,
NvU64 numPages,
POOLALLOC_HANDLE *pPage
)
{
PMA_ALLOCATION_OPTIONS allocOptions = {0};
RM_POOL_ALLOC_MEM_RESERVE_INFO *pMemReserveInfo;
NvU64 i;
NvU64 *pPageStore = portMemAllocNonPaged(sizeof(NvU64) * numPages);
NV_STATUS status = NV_OK;
NV_STATUS status;
NV_ASSERT_OR_RETURN(NULL != pCtx, NV_ERR_INVALID_ARGUMENT);
NV_ASSERT_OR_RETURN(NULL != pPage, NV_ERR_INVALID_ARGUMENT);
NV_ASSERT_OR_RETURN(NULL != pPageStore, NV_ERR_NO_MEMORY);
// TODO: Replace the direct call to PMA with function pointer.
pMemReserveInfo = (RM_POOL_ALLOC_MEM_RESERVE_INFO *)pCtx;
allocOptions.flags = PMA_ALLOCATE_PINNED | PMA_ALLOCATE_PERSISTENT | PMA_ALLOCATE_FORCE_ALIGNMENT;
allocOptions.alignment = PMA_CHUNK_SIZE_64K;
allocOptions.flags = PMA_ALLOCATE_PINNED | PMA_ALLOCATE_PERSISTENT |
PMA_ALLOCATE_CONTIGUOUS;
if (pMemReserveInfo->bSkipScrub)
{
@@ -230,24 +227,20 @@ allocUpstreamTopPool
{
allocOptions.flags |= PMA_ALLOCATE_PROTECTED_REGION;
}
NV_ASSERT_OK_OR_GOTO(status,
pmaAllocatePages(pMemReserveInfo->pPma,
numPages,
pageSize,
&allocOptions,
pPageStore),
free_mem);
for (i = 0; i < numPages; i++)
status = pmaAllocatePages(pMemReserveInfo->pPma,
pMemReserveInfo->pmaChunkSize/PMA_CHUNK_SIZE_64K,
PMA_CHUNK_SIZE_64K,
&allocOptions,
&pPage->address);
if (status != NV_OK)
{
pPage[i].address = pPageStore[i];
pPage[i].pMetadata = NULL;
return status;
}
free_mem:
portMemFree(pPageStore);
return status;
pPage->pMetadata = NULL;
return status;
}
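To illustrate the new call above (a hedged sketch; the chunk size is only an example value), the top pool now requests one contiguous run of 64 KB frames covering the whole chunk and records only its base address, instead of a list of independent pages:

    /* Example: with a hypothetical 2 MiB pool chunk,                        */
    /* pmaChunkSize / PMA_CHUNK_SIZE_64K == 2 MiB / 64 KiB == 32,            */
    /* so a single allocation of 32 contiguous 64 KiB frames is made and     */
    /* pPage->address receives the base of that run.                         */
    NvU64 pmaChunkSize = 2u * 1024 * 1024;                  /* assumed example value */
    NvU64 numFrames    = pmaChunkSize / PMA_CHUNK_SIZE_64K; /* 32 */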
/*!
@@ -265,28 +258,17 @@ allocUpstreamLowerPools
(
void *pCtx,
NvU64 pageSize,
NvU64 numPages,
POOLALLOC_HANDLE *pPage
)
{
NV_STATUS status;
NvU64 i;
NV_ASSERT_OR_RETURN(NULL != pCtx, NV_ERR_INVALID_ARGUMENT);
NV_ASSERT_OR_RETURN(NULL != pPage, NV_ERR_INVALID_ARGUMENT);
for(i = 0; i < numPages; i++)
{
NV_ASSERT_OK_OR_GOTO(status,
poolAllocate((POOLALLOC *)pCtx, &pPage[i]),
cleanup);
}
return NV_OK;
cleanup:
for(;i > 0; i--)
{
poolFree((POOLALLOC *)pCtx, &pPage[i-1]);
}
status = poolAllocate((POOLALLOC *)pCtx, pPage);
NV_ASSERT_OR_RETURN(status == NV_OK, status);
return status;
}

@@ -1013,7 +1013,7 @@ _rmapiControlWithSecInfoTlsIRQL
NV_STATUS status;
THREAD_STATE_NODE threadState;
NvU8 stackAllocator[TLS_ISR_ALLOCATOR_SIZE];
NvU8 stackAllocator[2*TLS_ISR_ALLOCATOR_SIZE];
PORT_MEM_ALLOCATOR* pIsrAllocator = portMemAllocatorCreateOnExistingBlock(stackAllocator, sizeof(stackAllocator));
tlsIsrInit(pIsrAllocator);

@@ -1444,6 +1444,14 @@ _portMemAllocatorCreateOnExistingBlock
pAllocator->pTracking = NULL; // No tracking for this allocator
pAllocator->pImpl = (PORT_MEM_ALLOCATOR_IMPL*)(pAllocator + 1);
//
// PORT_MEM_BITVECTOR (pAllocator->pImpl) and PORT_MEM_ALLOCATOR_TRACKING (pAllocator->pImpl->tracking)
// are mutually exclusively used.
// When pAllocator->pTracking == NULL, the data in pAllocator->pImpl->tracking is not used; instead,
// pBitVector occupies the same memory location.
// When pAllocator->pImpl->tracking is in use, PORT_MEM_BITVECTOR is not used.
//
pBitVector = (PORT_MEM_BITVECTOR*)(pAllocator->pImpl);
pBitVector->pSpinlock = pSpinlock;
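Conceptually, the storage reuse described in the comment above behaves like a union of the two bookkeeping structures; a simplified, hypothetical sketch follows (the real code overlays them by casting pAllocator->pImpl rather than declaring a union):

    /* Hypothetical illustration of the overlapping storage described above. */
    typedef struct { int trackingState; }     EXAMPLE_TRACKING;
    typedef struct { NvU64 bits[4]; }         EXAMPLE_BITVECTOR;

    typedef union
    {
        EXAMPLE_TRACKING  tracking;  /* used when pTracking != NULL              */
        EXAMPLE_BITVECTOR bitVector; /* used by the on-existing-block allocator, */
                                     /* where pTracking == NULL                  */
    } EXAMPLE_IMPL_STORAGE;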
@@ -1544,6 +1552,10 @@ _portMemAllocatorAllocExistingWrapper
{
portSyncSpinlockRelease(pSpinlock);
}
if (pMem == NULL)
{
PORT_MEM_PRINT_ERROR("Memory allocation failed.\n");
}
return pMem;
}

@@ -262,12 +262,9 @@ poolReserve
NvU64 numPages
)
{
NvU64 i, freeLength, totalAlloc;
NV_STATUS status = NV_ERR_NO_MEMORY;
NvU64 i, freeLength;
allocCallback_t allocCb;
POOLALLOC_HANDLE *pPageHandle = NULL;
POOLNODE *pNode = NULL;
POOLALLOC_HANDLE pageHandle;
if (pPool == NULL || (pPool->callBackInfo).allocCb == NULL)
{
@@ -280,45 +277,32 @@ poolReserve
return NV_OK;
}
totalAlloc = numPages - freeLength;
allocCb = (pPool->callBackInfo).allocCb;
pPageHandle = PORT_ALLOC(pPool->pAllocator, totalAlloc * sizeof(POOLALLOC_HANDLE));
NV_ASSERT_OR_GOTO(pPageHandle != NULL, free_none);
NV_ASSERT_OK_OR_GOTO(status,
allocCb(pPool->callBackInfo.pUpstreamCtx, pPool->upstreamPageSize,
totalAlloc, pPageHandle),
free_page);
status = NV_ERR_NO_MEMORY;
for (i = 0; i < totalAlloc; i++)
for (i = 0; i < (numPages - freeLength); i++)
{
pNode = PORT_ALLOC(pPool->pAllocator, sizeof(POOLNODE));
NV_ASSERT_OR_GOTO(pNode != NULL, free_alloc);
if ((*allocCb)((pPool->callBackInfo).pUpstreamCtx,
pPool->upstreamPageSize, &pageHandle) == NV_OK)
{
POOLNODE *pNode;
listPrependExisting(&pPool->freeList, pNode);
pNode->pageAddr = pPageHandle[i].address;
pNode->bitmap = NV_U64_MAX;
pNode->pParent = pPageHandle[i].pMetadata;
pNode = PORT_ALLOC(pPool->pAllocator, sizeof(*pNode));
listPrependExisting(&pPool->freeList, pNode);
pNode->pageAddr = pageHandle.address;
pNode->bitmap = NV_U64_MAX;
pNode->pParent = pageHandle.pMetadata;
}
else
{
return NV_ERR_NO_MEMORY;
}
}
status = NV_OK;
freeLength = listCount(&pPool->freeList);
NV_ASSERT(freeLength == numPages);
goto free_page;
free_alloc:
for(; i < totalAlloc; i++)
{
pPool->callBackInfo.freeCb(pPool->callBackInfo.pUpstreamCtx,
pPool->upstreamPageSize, &pPageHandle[i]);
}
free_page:
PORT_FREE(pPool->pAllocator, pPageHandle);
free_none:
return status;
return NV_OK;
}
@@ -399,7 +383,7 @@ poolAllocate
//
if (FLD_TEST_DRF(_RMPOOL, _FLAGS, _AUTO_POPULATE, _ENABLE, pPool->flags))
{
if ((*allocCb)(pPool->callBackInfo.pUpstreamCtx, pPool->upstreamPageSize, 1, pPageHandle) == NV_OK)
if ((*allocCb)(pPool->callBackInfo.pUpstreamCtx, pPool->upstreamPageSize, pPageHandle) == NV_OK)
{
POOLNODE *pNode;

@@ -522,6 +522,7 @@ SRCS += src/kernel/gpu/mem_mgr/mem_scrub.c
SRCS += src/kernel/gpu/mem_mgr/mem_utils.c
SRCS += src/kernel/gpu/mem_mgr/method_notification.c
SRCS += src/kernel/gpu/mem_mgr/objheap.c
SRCS += src/kernel/gpu/mem_mgr/phys_mem_allocator/addrtree.c
SRCS += src/kernel/gpu/mem_mgr/phys_mem_allocator/numa.c
SRCS += src/kernel/gpu/mem_mgr/phys_mem_allocator/phys_mem_allocator.c
SRCS += src/kernel/gpu/mem_mgr/phys_mem_allocator/phys_mem_allocator_util.c

@@ -1,4 +1,4 @@
NVIDIA_VERSION = 535.43.22
NVIDIA_VERSION = 535.104.12
# This file.
VERSION_MK_FILE := $(lastword $(MAKEFILE_LIST))