linux.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author	Files	Lines
2025-11-11	drm/amdgpu: optimize timeout implemention in ras_eeprom_update_record_num	Tao Zhou	1	-5/+13
	The busy status returned by ras_eeprom_update_record_num may not be an error, increase timeout to exclude false busy status. Also add more comments to make the code readable. v2: define a macro for the timeout value. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-11	drm/amdgpu: add RAS bad page threshold handling for PMFW manages eeprom	Tao Zhou	1	-6/+30
	Check if bad page threshold is reached and take actions accordingly. v2: remove rma message sent to smu when pmfw manages eeprom. v3: add null pointer check for con. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-11	drm/amdgpu: try for more times if RAS bad page number is not updated	Tao Zhou	1	-1/+21
	RAS info update in PMFW is time cost, wait for it. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-11	drm/amdgpu: get RAS bad page address from MCA address	Tao Zhou	1	-2/+2
	Instead of from physical address. v2: add comment to make the code more readable Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-06	drm/amdgpu: skip writing eeprom when PMFW manages RAS data	Tao Zhou	1	-1/+19
	Only update bad page number in legacy eeprom write path. v2: add null pointer check for con. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-06	drm/amdgpu: add ras_eeprom_read_idx interface	Tao Zhou	1	-0/+51
	PMFW will manage RAS eeprom data by itself, add new interface to read eeprom data via PMFW, we can read part of records by setting index. v2: use IPID parse interface. pa is not used and set it to a fixed value. v3: optimize the null pointer check for IPID parse interface. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-06	drm/amdgpu: initialize max record count after table reset	Gangliang Xie	1	-0/+6
	initialize max record count and record offset after table reset Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-06	drm/amdgpu: add check function for pmfw eeprom	Gangliang Xie	1	-0/+44
	add check function for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-06	drm/amdgpu: add initialization function for pmfw eeprom	Gangliang Xie	1	-0/+39
	add initialization function for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-06	drm/amdgpu: adapt reset function for pmfw eeprom	Gangliang Xie	1	-25/+36
	adapt reset function for pmfw eeprom Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04	drm/amdgpu: add wrapper functions for pmfw eeprom interface	Gangliang Xie	1	-0/+98
	add wrapper functions for pmfw eeprom interface, for these interfaces to be easily and safely called Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04	drm/amdgpu: add function to check if pmfw eeprom is supported	Gangliang Xie	1	-2/+44
	add function to check if pmfw is supported, skip eeprom check and recover when pmfw eeprom is supported Signed-off-by: Gangliang Xie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-11-04	drm/amd/ras: Add CPER ring read for uniras	Xiang Liu	1	-1/+2
	Read CPER raw data from debugfs node "/sys/kernel/debug/dri/*/ amdgpu_ring_cper". Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-10-20	drm/amdgpu: Add ras module eeprom safety watermark check	YiPeng Chai	1	-0/+4
	Add ras module eeprom safety watermark check. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-29	drm/amdgpu: Notify pmfw bad page threshold exceeded	Xiang Liu	1	-2/+3
	Notify pmfw when bad page threshold is exceeded, no matter the module parameter 'bad_page_threshold' is set or not. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-08-06	drm/amdgpu: Generate BP threshold exceed CPER once threshold exceeded	Xiang Liu	1	-0/+4
	The bad pages threshold exceed CPER should be generated once threshold exceeded, no matter the bad_page_threshold setted or not. Signed-off-by: Xiang Liu <xiang.liu@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-28	drm/amdgpu: adjust the update of RAS bad page number	Tao Zhou	1	-4/+2
	One eeprom record may not map to unit number of bad pages, the accurate bad page number is gotten after bad page address check. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-07-15	drm/amdgpu: refine eeprom data check	ganglxie	1	-0/+28
	add eeprom data checksum check before driver unload. reset eeprom and save correct data to eeprom when check failed Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-30	drm/amdgpu: Convert from DRM_* to dev_*	Lijo Lazar	1	-34/+43
	Convert from generic DRM_* to dev_* calls to have device context info. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Asad Kamal <asad.kamal@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18	drm/amdgpu: refine usage of amdgpu_bad_page_threshold	ganglxie	1	-10/+9
	when amdgpu_bad_page_threshold == -1 or -2, driver will issue a warning message when threshold is reached and continue runtime services. Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-06-18	drm/amdgpu: clear pa and mca record counter when resetting eeprom	ganglxie	1	-0/+2
	clear pa and mca record counter when resetting eeprom, so that ras_num_bad_pages can be calculated correctly Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-11	drm/amdgpu: Set RAS EEPROM table version to v3 for umc v12_5	Candice Li	1	-0/+1
	Set RAS EEPROM table version to v3 for umc v12_5. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-08	drm/amdgpu: Reset RAS table if header is invalid	Lijo Lazar	1	-5/+7
	If a valid header is not found during RAS eeprom init, consider it as new and reset RAS table info. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-04-07	drm/amdgpu: Add basic validation for RAS header	Lijo Lazar	1	-3/+19
	If RAS header read from EEPROM is corrupted, it could result in trying to allocate huge memory for reading the records. Add some validation to header fields. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-19	drm/amdgpu: Add EEPROM I2C address support for smu v13_0_12	Candice Li	1	-0/+2
	Add EEPROM I2C address support for smu v13_0_12. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-18	drm/amdgpu: format old RAS eeprom data into V3 version	Tao Zhou	1	-12/+14
	Clear old data and save it in V3 format. v2: only format eeprom data for new ASICs. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-13	drm/amdgpu: Save PA of bad pages for old asics	ganglxie	1	-2/+7
	for old asics that do not support mca translating, we just save PA for them Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-03-07	drm/amdgpu: increase RAS bad page threshold	Tao Zhou	1	-3/+3
	For default policy, driver will issue an RMA event when the number of bad pages is greater than 8 physical rows, rather than reaches 8 physical rows, don't rely on threshold configurable parameters in default mode. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-25	drm/amdgpu: Change page/record number calculation based on nps	ganglxie	1	-10/+7
	save only one record to save eeprom space,and bad_page_num = pa_rec_num + mca_rec_num*16 Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-25	drm/amdgpu: Save nps to eeprom	ganglxie	1	-2/+6
	nps info saved together with bad page makes bad page parsing more efficient Signed-off-by: ganglxie <ganglxie@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12	drm/amdgpu: Add flag to make VBIOS read optional	Lijo Lazar	1	-1/+1
	Certain SOCs may not need much data from VBIOS. Some data like VBIOS version used will be missed but it doesn't affect functionality. Add a flag to make VBIOS image optional. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2025-02-12	drm/amdgpu: Update usage for bad page threshold	Hawking Zhang	1	-18/+23
	The driver's behavior varies based on the configuration of amdgpu_bad_page_threshold setting Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-18	drm/amdgpu: simplify return statement in amdgpu_ras_eeprom_init	Dheeraj Reddy Jonnalagadda	1	-1/+1
	Remove the logically dead code in the last return statement of amdgpu_ras_eeprom_init. The condition res < 0 is redundant since res is already checked for a negative value earlier. Replace return res < 0 ? res : 0; with return 0 to improve clarity. Fixes: 63d4c081a556 ("drm/amdgpu: Optimize EEPROM RAS table I/O") Closes: https://scan7.scan.coverity.com/#/project-view/52337/11354?selectedIssue=1602413 Signed-off-by: Dheeraj Reddy Jonnalagadda <dheeraj.linuxdev@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-10	drm/amdgpu: correct the calculation of RAS bad page	Tao Zhou	1	-13/+27
	After the introduction of NPS RAS, one bad page record on eeprom may be related to 1 or 16 bad pages, so the bad page record and bad page are two different concepts, define a new variable to store bad page number. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-10	drm/amdgpu: split ras_eeprom_init into init and check functions	Tao Zhou	1	-0/+20
	Init function is for ras table header read and check function is responsible for the validation of the header. Call them in different stages. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-10	drm/amdgpu: return error when eeprom checksum failed	Jinzhou Su	1	-2/+4
	Return eeprom table checksum error result, otherwise it might be overwritten by next call. V2: replace DRM_ERROR with dev_err Signed-off-by: Jinzhou Su <jinzhou.su@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-12-10	drm/amdgpu: add a flag to indicate UMC channel index version	Tao Zhou	1	-1/+10
	v1 (legacy way): store channel index within a UMC instance in eeprom v2: store global channel index in eeprom V2: only save the flag on eeprom, clear it after saving. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-09-18	drm/amdgpu: Fix a typo	Andrew Kreimer	1	-1/+1
	Fix a typo in comments. Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Kreimer <algonell@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-07-24	drm/amdgpu: Fix eeprom max record count	Stanley.Yang	1	-0/+3
	The eeprom table is empty before initializing, set eeprom table version first before initializing. Changed from V1: Reuse amdgpu_ras_set_eeprom_table_version function Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit 015b8a2fdf39a4c288ff24e7b715b8d9198e56dc)
2024-06-05	drm/amdgpu: add RAS is_rma flag	Tao Zhou	1	-4/+6
	Set the flag to true if bad page number reaches threshold. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-05-02	drm/amdgpu: Add smu v13_0_14 ip block	Hawking Zhang	1	-0/+2
	Add smu v13_0_14 ip block support Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Le Ma <Le.Ma@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-03-20	drm/amdgpu: Update setting EEPROM table version	Candice Li	1	-5/+17
	Use helper function instead of umc callback to set EEPROM table version. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-02-12	drm/amdgpu: send smu rma reason event in ras eeprom driver	Yang Wang	1	-0/+3
	send smu rma reason event to smu in ras eeprom driver. Signed-off-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-29	drm/amdgpu: Update EEPROM I2C address for smu v13_0_0	Candice Li	1	-0/+6
	Check smu v13_0_0 SKU type to select EEPROM I2C address. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-05	drm/amdgpu: Fix complex macros error	Srinivasan Shanmugam	1	-1/+1
	Fixes the below: ERROR: Macros with complex values should be enclosed in parentheses WARNING: macros should not use a trailing semicolon +#define amdgpu_inc_vram_lost(adev) atomic_inc(&((adev)->vram_lost_counter)); Cc: Christian König <christian.koenig@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-26	drm/amdgpu: change if condition for bad channel bitmap update	Tao Zhou	1	-2/+4
	The amdgpu_ras_eeprom_control.bad_channel_bitmap is u32 type, but the channel index could be larger than 32. For the ASICs whose channel number is more than 32, the amdgpu_dpm_send_hbm_bad_channel_flag interface is not supported, so we simply bypass channel bitmap update under this condition. v2: replace sizeof with BITS_PER_TYPE, we should check bit number instead of byte number. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20	drm/amdgpu: Use function for IP version check	Lijo Lazar	1	-2/+2
	Use an inline function for version check. Gives more flexibility to handle any format changes. Signed-off-by: Lijo Lazar <lijo.lazar@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-30	drm/amdgpu: Only support RAS EEPROM on dGPU platform	Candice Li	1	-1/+2
	RAS EEPROM device is only supported on dGPU platform for smu v13_0_6. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-15	drm/amdgpu: Add I2C EEPROM support on smu v13_0_6	Candice Li	1	-0/+2
	Support I2C EEPROM on smu v13_0_6. v2: Move IP_VERSION(13, 0, 6) ahead of IP_VERSION(13, 0, 10). Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-21	drm/amd: Avoid reading the VBIOS part number twice	Mario Limonciello	1	-4/+4
	The VBIOS part number is read both in amdgpu_atom_parse() as well as in atom_get_vbios_pn() and stored twice in the `struct atom_context` structure. Remove the first unnecessary read and move the `pr_info` line from that read into the second. v2: squash in unused variable removal Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>