aboutsummaryrefslogtreecommitdiff
path: root/include/uapi/linux/devlink.h
AgeCommit message (Collapse)AuthorFilesLines
2025-11-20devlink: support default values for param-get and param-setDaniel Zahka1-0/+3
Support querying and resetting to default param values. Introduce two new devlink netlink attrs: DEVLINK_ATTR_PARAM_VALUE_DEFAULT and DEVLINK_ATTR_PARAM_RESET_DEFAULT. The former is used to contain an optional parameter value inside of the param_value nested attribute. The latter is used in param-set requests from userspace to indicate that the driver should reset the param to its default value. To implement this, two new functions are added to the devlink driver api: devlink_param::get_default() and devlink_param::reset_default(). These callbacks allow drivers to implement default param actions for runtime and permanent cmodes. For driverinit params, the core latches the last value set by a driver via devl_param_driverinit_value_set(), and uses that as the default value for a param. Because default parameter values are optional, it would be impossible to discern whether or not a param of type bool has default value of false or not provided if the default value is encoded using a netlink flag type. For this reason, when a DEVLINK_PARAM_TYPE_BOOL has an associated default value, the default value is encoded using a u8 type. Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20251119025038.651131-4-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-11devlink: Introduce switchdev_inactive eswitch modeSaeed Mahameed1-0/+1
Adds DEVLINK_ESWITCH_MODE_SWITCHDEV_INACTIVE attribute to UAPI and documentation. Before having traffic flow through an eswitch, a user may want to have the ability to block traffic towards the FDB until FDB is fully programmed and the user is ready to send traffic to it. For example: when two eswitches are present for vports in a multi-PF setup, one eswitch may take over the traffic from the other when the user chooses. Before this take over, a user may want to first program the inactive eswitch and then once ready redirect traffic to this new eswitch. switchdev modes transition semantics: legacy->switchdev_inactive: Create switchdev mode normally, traffic not allowed to flow yet. switchdev_inactive->switchdev: Enable traffic to flow. switchdev->switchdev_inactive: Block traffic on the FDB, FDB and representros state and content is preserved. When eswitch is configured to this mode, traffic is ignored/dropped on this eswitch FDB, while current configuration is kept, e.g FDB rules and netdev representros are kept available, FDB programming is allowed. Example: # start inactive switchdev devlink dev eswitch set pci/0000:08:00.1 mode switchdev_inactive # setup TC rules, representors etc .. # activate devlink dev eswitch set pci/0000:08:00.1 mode switchdev Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20251108070404.1551708-2-saeed@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-08-26devlink: Make health reporter burst period configurableShahar Shitrit1-0/+2
Enable configuration of the burst period — a time window starting from the first error recovery, during which the reporter allows recovery attempts for each reported error. This feature is helpful when a single underlying issue causes multiple errors, as it delays the start of the grace period to allow sufficient time for recovering all related errors. For example, if multiple TX queues time out simultaneously, a sufficient burst period could allow all affected TX queues to be recovered within that window. Without this period, only the first TX queue that reports a timeout will undergo recovery, while the remaining TX queues will be blocked once the grace period begins. Configuration example: $ devlink health set pci/0000:00:09.0 reporter tx burst_period 500 Configuration example with ynl: ./tools/net/ynl/pyynl/cli.py \ --spec Documentation/netlink/specs/devlink.yaml \ --do health-reporter-set --json '{ "bus-name": "auxiliary", "dev-name": "mlx5_core.eth.0", "port-index": 65535, "health-reporter-name": "tx", "health-reporter-burst-period": 500 }' Signed-off-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250824084354.533182-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-23devlink: Fix excessive stack usage in rate TC bandwidth parsingCarolina Jubran1-2/+9
The devlink_nl_rate_tc_bw_parse function uses a large stack array for devlink attributes, which triggers a warning about excessive stack usage: net/devlink/rate.c: In function 'devlink_nl_rate_tc_bw_parse': net/devlink/rate.c:382:1: error: the frame size of 1648 bytes is larger than 1536 bytes [-Werror=frame-larger-than=] Introduce a separate attribute set specifically for rate TC bandwidth parsing that only contains the two attributes actually used: index and bandwidth. This reduces the stack array from DEVLINK_ATTR_MAX entries to just 2 entries, solving the stack usage issue. Update devlink selftest to use the new 'index' and 'bw' attribute names consistent with the YAML spec. Example usage with ynl with the new spec: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"index": 0, "bw": 50}, {"index": 1, "bw": 50}, {"index": 2, "bw": 0}, {"index": 3, "bw": 0}, {"index": 4, "bw": 0}, {"index": 5, "bw": 0}, {"index": 6, "bw": 0}, {"index": 7, "bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'bw': 50, 'index': 0}, {'bw': 50, 'index': 1}, {'bw': 0, 'index': 2}, {'bw': 0, 'index': 3}, {'bw': 0, 'index': 4}, {'bw': 0, 'index': 5}, {'bw': 0, 'index': 6}, {'bw': 0, 'index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Fixes: 566e8f108fc7 ("devlink: Extend devlink rate API with traffic classes bandwidth management") Reported-by: Arnd Bergmann <arnd@arndb.de> Closes: https://lore.kernel.org/netdev/20250708160652.1810573-1-arnd@kernel.org/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202507171943.W7DJcs6Y-lkp@intel.com/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Tested-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/1753175609-330621-1-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02devlink: Extend devlink rate API with traffic classes bandwidth managementCarolina Jubran1-0/+9
Introduce support for specifying relative bandwidth shares between traffic classes (TC) in the devlink-rate API. This new option allows users to allocate bandwidth across multiple traffic classes in a single command. This feature provides a more granular control over traffic management, especially for scenarios requiring Enhanced Transmission Selection. Users can now define a relative bandwidth share for each traffic class. For example, assigning share values of 20 to TC0 (TCP/UDP) and 80 to TC5 (RoCE) will result in TC0 receiving 20% and TC5 receiving 80% of the total bandwidth. The actual percentage each class receives depends on the ratio of its share value to the sum of all shares. Example: DEV=pci/0000:08:00.0 $ devlink port function rate add $DEV/vfs_group tx_share 10Gbit \ tx_max 50Gbit tc-bw 0:20 1:0 2:0 3:0 4:0 5:80 6:0 7:0 $ devlink port function rate set $DEV/vfs_group \ tc-bw 0:20 1:0 2:0 3:0 4:0 5:20 6:60 7:0 Example usage with ynl: ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-set --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1, "rate-tc-bws": [ {"rate-tc-index": 0, "rate-tc-bw": 50}, {"rate-tc-index": 1, "rate-tc-bw": 50}, {"rate-tc-index": 2, "rate-tc-bw": 0}, {"rate-tc-index": 3, "rate-tc-bw": 0}, {"rate-tc-index": 4, "rate-tc-bw": 0}, {"rate-tc-index": 5, "rate-tc-bw": 0}, {"rate-tc-index": 6, "rate-tc-bw": 0}, {"rate-tc-index": 7, "rate-tc-bw": 0} ] }' ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --do rate-get --json '{ "bus-name": "pci", "dev-name": "0000:08:00.0", "port-index": 1 }' output for rate-get: {'bus-name': 'pci', 'dev-name': '0000:08:00.0', 'port-index': 1, 'rate-tc-bws': [{'rate-tc-bw': 50, 'rate-tc-index': 0}, {'rate-tc-bw': 50, 'rate-tc-index': 1}, {'rate-tc-bw': 0, 'rate-tc-index': 2}, {'rate-tc-bw': 0, 'rate-tc-index': 3}, {'rate-tc-bw': 0, 'rate-tc-index': 4}, {'rate-tc-bw': 0, 'rate-tc-index': 5}, {'rate-tc-bw': 0, 'rate-tc-index': 6}, {'rate-tc-bw': 0, 'rate-tc-index': 7}], 'rate-tx-max': 0, 'rate-tx-priority': 0, 'rate-tx-share': 0, 'rate-tx-weight': 0, 'rate-type': 'leaf'} Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250629142138.361537-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-06devlink: define enum for attr types of dynamic attributesJiri Pirko1-0/+15
Devlink param and health reporter fmsg use attributes with dynamic type which is determined according to a different type. Currently used values are NLA_*. The problem is, they are not part of UAPI. They may change which would cause a break. To make this future safe, introduce a enum that shadows NLA_* values in it and is part of UAPI. Also, this allows to possibly carry types that are unrelated to NLA_* values. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jiri Pirko <jiri@nvidia.com> Link: https://patch.msgid.link/20250505114513.53370-3-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-04-08devlink: Support setting max_io_eqsParav Pandit1-0/+1
Many devices send event notifications for the IO queues, such as tx and rx queues, through event queues. Enable a privileged owner, such as a hypervisor PF, to set the number of IO event queues for the VF and SF during the provisioning stage. example: Get maximum IO event queues of the VF device:: $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 function: hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10 Set maximum IO event queues of the VF device:: $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32 $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 function: hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32 Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Parav Pandit <parav@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-11devlink: Add comments to use netlink gen toolWilliam Tu1-1/+4
Add the comment to remind people not to manually modify the net/devlink/netlink_gen.c, but to use tools/net/ynl/ynl-regen.sh to generate it. Signed-off-by: William Tu <witu@nvidia.com> Suggested-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20240310145503.32721-1-witu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-12-19devlink: add a command to set notification filter and use it for multicastsJiri Pirko1-0/+2
Currently the user listening on a socket for devlink notifications gets always all messages for all existing instances, even if he is interested only in one of those. That may cause unnecessary overhead on setups with thousands of instances present. User is currently able to narrow down the devlink objects replies to dump commands by specifying select attributes. Allow similar approach for notifications. Introduce a new devlink NOTIFY_FILTER_SET which the user passes the select attributes. Store these per-socket and use them for filtering messages during multicast send. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-10-23devlink: make devlink_flash_overwrite enum named oneJiri Pirko1-1/+1
Since this enum is going to be used in generated userspace file, name it properly. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20231021112711.660606-7-jiri@resnulli.us Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-09-17devlink: expose peer SF devlink instanceJiri Pirko1-0/+1
Introduce a new helper devl_port_fn_devlink_set() to be used by driver assigning a devlink instance to the peer devlink port function. Expose this to user over new netlink attribute nested under port function nest to expose devlink handle related to the port function. This is particularly helpful for user to understand the relationship between devlink instances created for SFs and the port functions they belong to. Note that caller of devlink_port_notify() needs to hold devlink instance lock, put the assertion to devl_port_fn_devlink_set() to make this requirement explicit. Also note the limitations that only allow to make this assignment for registered objects. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-27devlink: Expose port function commands to control IPsec packet offloadsDima Chumak1-0/+2
Expose port function commands to enable / disable IPsec packet offloads, this is used to control the port IPsec capabilities. When IPsec packet is disabled for a function of the port (default), function cannot offload IPsec packet operations (encapsulation and XFRM policy offload). When enabled, IPsec packet operations can be offloaded by the function of the port, which includes crypto operation (Encrypt/Decrypt), IPsec encapsulation and XFRM state and policy offload. Example of a PCI VF port which supports IPsec packet offloads: $ devlink port show pci/0000:06:00.0/1 pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0 function: hw_addr 00:00:00:00:00:00 roce enable ipsec_packet disable $ devlink port function set pci/0000:06:00.0/1 ipsec_packet enable $ devlink port show pci/0000:06:00.0/1 pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0 function: hw_addr 00:00:00:00:00:00 roce enable ipsec_packet enable Signed-off-by: Dima Chumak <dchumak@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-3-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-27devlink: Expose port function commands to control IPsec crypto offloadsDima Chumak1-0/+2
Expose port function commands to enable / disable IPsec crypto offloads, this is used to control the port IPsec capabilities. When IPsec crypto is disabled for a function of the port (default), function cannot offload any IPsec crypto operations (Encrypt/Decrypt and XFRM state offloading). When enabled, IPsec crypto operations can be offloaded by the function of the port. Example of a PCI VF port which supports IPsec crypto offloads: $ devlink port show pci/0000:06:00.0/1 pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0 function: hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto disable $ devlink port function set pci/0000:06:00.0/1 ipsec_crypto enable $ devlink port show pci/0000:06:00.0/1 pci/0000:06:00.0/1: type eth netdev enp6s0pf0vf0 flavour pcivf pfnum 0 vfnum 0 function: hw_addr 00:00:00:00:00:00 roce enable ipsec_crypto enable Signed-off-by: Dima Chumak <dchumak@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230825062836.103744-2-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-07devlink: Expose port function commands to control migratableShay Drory1-0/+3
Expose port function commands to enable / disable migratable capability, this is used to set the port function as migratable. Live migration is the process of transferring a live virtual machine from one physical host to another without disrupting its normal operation. In order for a VM to be able to perform LM, all the VM components must be able to perform migration. e.g.: to be migratable. In order for VF to be migratable, VF must be bound to VFIO driver with migration support. When migratable capability is enabled for a function of the port, the device is making the necessary preparations for the function to be migratable, which might include disabling features which cannot be migrated. Example of LM with migratable function configuration: Set migratable of the VF's port function. $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 function: hw_addr 00:00:00:00:00:00 migratable disable $ devlink port function set pci/0000:06:00.0/2 migratable enable $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 function: hw_addr 00:00:00:00:00:00 migratable enable Bind VF to VFIO driver with migration support: $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind Attach VF to the VM. Start the VM. Perform LM. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Shannon Nelson <shannon.nelson@amd.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-12-07devlink: Expose port function commands to control RoCEShay Drory1-0/+10
Expose port function commands to enable / disable RoCE, this is used to control the port RoCE device capabilities. When RoCE is disabled for a function of the port, function cannot create any RoCE specific resources (e.g GID table). It also saves system memory utilization. For example disabling RoCE enable a VF/SF saves 1 Mbytes of system memory per function. Example of a PCI VF port which supports function configuration: Set RoCE of the VF's port function. $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 function: hw_addr 00:00:00:00:00:00 roce enable $ devlink port function set pci/0000:06:00.0/2 roce disable $ devlink port show pci/0000:06:00.0/2 pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 function: hw_addr 00:00:00:00:00:00 roce disable Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-30devlink: support directly reading from region memoryJacob Keller1-0/+2
To read from a region, user space must currently request a new snapshot of the region and then read from that snapshot. This can sometimes be overkill if user space only reads a tiny portion. They first create the snapshot, then request a read, then destroy the snapshot. For regions which have a single underlying "contents", it makes sense to allow supporting direct reading of the region data. Extend the DEVLINK_CMD_REGION_READ to allow direct reading from a region if requested via the new DEVLINK_ATTR_REGION_DIRECT. If this attribute is set, then perform a direct read instead of using a snapshot. Direct read is mutually exclusive with DEVLINK_ATTR_REGION_SNAPSHOT_ID, and care is taken to ensure that we reject commands which provide incorrect attributes. Regions must enable support for direct read by implementing the .read() callback function. If a region does not support such direct reads, a suitable extended error message is reported. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17devlink: Introduce new attribute 'tx_weight' to devlink-rateMichal Wilczynski1-0/+2
To fully utilize offload capabilities of Intel 100G card QoS capabilities new attribute 'tx_weight' needs to be introduced. This attribute allows for usage of Weighted Fair Queuing arbitration scheme among siblings. This arbitration scheme can be used simultaneously with the strict priority. Introduce new attribute in devlink-rate that will allow for configuration of Weighted Fair Queueing. New attribute is optional. Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-17devlink: Introduce new attribute 'tx_priority' to devlink-rateMichal Wilczynski1-0/+1
To fully utilize offload capabilities of Intel 100G card QoS capabilities new attribute 'tx_priority' needs to be introduced. This attribute allows for usage of strict priority arbiter among siblings. This arbitration scheme attempts to schedule nodes based on their priority as long as the nodes remain within their bandwidth limit. Introduce new attribute in devlink-rate that will allow for configuration of strict priority. New attribute is optional. Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-28devlink: introduce framework for selftestsVikas Gupta1-0/+29
Add a framework for running selftests. Framework exposes devlink commands and test suite(s) to the user to execute and query the supported tests by the driver. Below are new entries in devlink_nl_ops devlink_nl_cmd_selftests_show_doit/dumpit: To query the supported selftests by the drivers. devlink_nl_cmd_selftests_run: To execute selftests. Users can provide a test mask for executing group tests or standalone tests. Documentation/networking/devlink/ path is already part of MAINTAINERS & the new files come under this path. Hence no update needed to the MAINTAINERS Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-07-26net: devlink: introduce nested devlink entity for line cardJiri Pirko1-0/+2
For the purpose of exposing device info and allow flash update which is going to be implemented in follow-up patches, introduce a possibility for a line card to expose relation to nested devlink entity. The nested devlink entity represents the line card. Example: $ devlink lc show pci/0000:01:00.0 lc 1 pci/0000:01:00.0: lc 1 state active type 16x100G nested_devlink auxiliary/mlxsw_core.lc.0 supported_types: 16x100G $ devlink dev show auxiliary/mlxsw_core.lc.0 auxiliary/mlxsw_core.lc.0 Signed-off-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-05Revert "Merge branch 'mlxsw-line-card-model'"Jakub Kicinski1-5/+0
This reverts commit 5e927a9f4b9f29d78a7c7d66ea717bb5c8bbad8e, reversing changes made to cfc1d91a7d78cf9de25b043d81efcc16966d55b3. The discussion is still ongoing so let's remove the uAPI until the discussion settles. Link: https://lore.kernel.org/all/20220425090021.32e9a98f@kernel.org/ Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20220504154037.539442-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-25devlink: introduce line card info get messageJiri Pirko1-0/+2
Allow the driver to provide per line card info get op to fill-up info, similar to the "devlink dev info". Example: $ devlink lc info pci/0000:01:00.0 lc 8 pci/0000:01:00.0: lc 8 versions: fixed: hw.revision 0 running: ini.version 4 Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-25devlink: introduce line card devices supportJiri Pirko1-0/+3
Line card can contain one or more devices that makes sense to make visible to the user. For example, this can be a gearbox with flash memory, which could be updated. Provide the driver possibility to attach such devices to a line card and expose those to user. Example: $ devlink lc show pci/0000:01:00.0 lc 8 pci/0000:01:00.0: lc 8 state active type 16x100G supported_types: 16x100G devices: device 0 device 1 device 2 device 3 Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-18devlink: implement line card active stateJiri Pirko1-0/+1
Allow driver to mark a line card as active. Expose this state to the userspace over devlink netlink interface with proper notifications. 'active' state means that line card was plugged in after being provisioned. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-18devlink: implement line card provisioningJiri Pirko1-0/+15
In order to be able to configure all needed stuff on a port/netdevice of a line card without the line card being present, introduce line card provisioning. Basically by setting a type, provisioning process will start and driver is supposed to create a placeholder for instances (ports/netdevices) for a line card type. Allow the user to query the supported line card types over line card get command. Then implement two netlink command SET to allow user to set/unset the card type. On the driver API side, add provision/unprovision ops and supported types array to be advertised. Upon provision op call, the driver should take care of creating the instances for the particular line card type. Introduce provision_set/clear() functions to be called by the driver once the provisioning/unprovisioning is done on its side. These helpers are not to be called directly due to the async nature of provisioning. Example: $ devlink port # No ports are listed $ devlink lc pci/0000:01:00.0: lc 1 state unprovisioned supported_types: 16x100G lc 2 state unprovisioned supported_types: 16x100G lc 3 state unprovisioned supported_types: 16x100G lc 4 state unprovisioned supported_types: 16x100G lc 5 state unprovisioned supported_types: 16x100G lc 6 state unprovisioned supported_types: 16x100G lc 7 state unprovisioned supported_types: 16x100G lc 8 state unprovisioned supported_types: 16x100G $ devlink lc set pci/0000:01:00.0 lc 8 type 16x100G $ devlink lc show pci/0000:01:00.0 lc 8 pci/0000:01:00.0: lc 8 state active type 16x100G supported_types: 16x100G $ devlink port pci/0000:01:00.0/0: type notset flavour cpu port 0 splittable false pci/0000:01:00.0/53: type eth netdev enp1s0nl8p1 flavour physical lc 8 port 1 splittable true lanes 4 pci/0000:01:00.0/54: type eth netdev enp1s0nl8p2 flavour physical lc 8 port 2 splittable true lanes 4 pci/0000:01:00.0/55: type eth netdev enp1s0nl8p3 flavour physical lc 8 port 3 splittable true lanes 4 pci/0000:01:00.0/56: type eth netdev enp1s0nl8p4 flavour physical lc 8 port 4 splittable true lanes 4 pci/0000:01:00.0/57: type eth netdev enp1s0nl8p5 flavour physical lc 8 port 5 splittable true lanes 4 pci/0000:01:00.0/58: type eth netdev enp1s0nl8p6 flavour physical lc 8 port 6 splittable true lanes 4 pci/0000:01:00.0/59: type eth netdev enp1s0nl8p7 flavour physical lc 8 port 7 splittable true lanes 4 pci/0000:01:00.0/60: type eth netdev enp1s0nl8p8 flavour physical lc 8 port 8 splittable true lanes 4 pci/0000:01:00.0/61: type eth netdev enp1s0nl8p9 flavour physical lc 8 port 9 splittable true lanes 4 pci/0000:01:00.0/62: type eth netdev enp1s0nl8p10 flavour physical lc 8 port 10 splittable true lanes 4 pci/0000:01:00.0/63: type eth netdev enp1s0nl8p11 flavour physical lc 8 port 11 splittable true lanes 4 pci/0000:01:00.0/64: type eth netdev enp1s0nl8p12 flavour physical lc 8 port 12 splittable true lanes 4 pci/0000:01:00.0/125: type eth netdev enp1s0nl8p13 flavour physical lc 8 port 13 splittable true lanes 4 pci/0000:01:00.0/126: type eth netdev enp1s0nl8p14 flavour physical lc 8 port 14 splittable true lanes 4 pci/0000:01:00.0/127: type eth netdev enp1s0nl8p15 flavour physical lc 8 port 15 splittable true lanes 4 pci/0000:01:00.0/128: type eth netdev enp1s0nl8p16 flavour physical lc 8 port 16 splittable true lanes 4 $ devlink lc set pci/0000:01:00.0 lc 8 notype Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-04-18devlink: add support to create line card and expose to userJiri Pirko1-0/+7
Extend the devlink API so the driver is going to be able to create and destroy linecard instances. There can be multiple line cards per devlink device. Expose this new type of object over devlink netlink API to the userspace, with notifications. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-01devlink: report maximum number of snapshots with regionsJacob Keller1-0/+2
Each region has an independently configurable number of maximum snapshots. This information is not reported to userspace, making it not very discoverable. Fix this by adding a new DEVLINK_ATTR_REGION_MAX_SNAPSHOST attribute which is used to report this maximum. Ex: $devlink region pci/0000:af:00.0/nvm-flash: size 10485760 snapshot [] max 1 pci/0000:af:00.0/device-caps: size 4096 snapshot [] max 10 pci/0000:af:00.1/nvm-flash: size 10485760 snapshot [] max 1 pci/0000:af:00.1/device-caps: size 4096 snapshot [] max 10 This information enables users to understand why a new region command may fail due to having too many existing snapshots. Reported-by: Gurucharan G <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-02devlink: Allow setting parent node of rate objectsDmytro Linkin1-0/+1
Refactor DEVLINK_CMD_RATE_{GET|SET} command handlers to support setting a node as a parent for another rate object (leaf or node) by means of new attribute DEVLINK_ATTR_RATE_PARENT_NODE_NAME. Extend devlink ops with new callbacks rate_{leaf|node}_parent_set() to set node as a parent for rate object to allow supporting drivers to implement rate grouping through devlink. Driver implementations are allowed to support leafs or node children only. Invoking callback with NULL as parent should be threated by the driver as unset parent action. Extend rate object struct with reference counter to disallow deleting a node with any child pointing to it. User should unset parent for the child explicitly. Example: $ devlink port function rate add netdevsim/netdevsim10/group1 $ devlink port function rate add netdevsim/netdevsim10/group2 $ devlink port function rate set netdevsim/netdevsim10/group1 parent group2 $ devlink port function rate show netdevsim/netdevsim10/group1 netdevsim/netdevsim10/group1: type node parent group2 $ devlink port function rate set netdevsim/netdevsim10/group1 noparent Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-02devlink: Introduce rate nodesDmytro Linkin1-0/+3
Implement support for DEVLINK_CMD_RATE_{NEW|DEL} commands that are used to create and delete devlink rate nodes. Add new attribute DEVLINK_ATTR_RATE_NODE_NAME that specify node name string. The node name is an alphanumeric identifier. No valid node name can be a devlink port index, eg. decimal number. Extend devlink ops with new callbacks rate_node_{new|del}() and rate_node_tx_{share|max}_set() to allow supporting drivers to implement ports rate grouping and setting tx rate of rate nodes through devlink. Expose devlink_rate_nodes_destroy() function to allow vendor driver do proper cleanup of internally allocated resources for the nodes if the driver goes down or due to any other reasons which requires nodes to be destroyed. Disallow moving device from switchdev to legacy mode if any node exists on that device. User must explicitly delete nodes before switching mode. Example: $ devlink port function rate add netdevsim/netdevsim10/group1 $ devlink port function rate set netdevsim/netdevsim10/group1 \ tx_share 10mbit tx_max 100mbit Add + set command can be combined: $ devlink port function rate add netdevsim/netdevsim10/group1 \ tx_share 10mbit tx_max 100mbit $ devlink port function rate show netdevsim/netdevsim10/group1 netdevsim/netdevsim10/group1: type node tx_share 10mbit tx_max 100mbit $ devlink port function rate del netdevsim/netdevsim10/group1 Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-02devlink: Allow setting tx rate for devlink rate leaf objectsDmytro Linkin1-0/+2
Implement support for DEVLINK_CMD_RATE_SET command with new attributes DEVLINK_ATTR_RATE_TX_{SHARE|MAX} that are used to set devlink rate shared/max tx rate values. Extend devlink ops with new callbacks rate_leaf_tx_{share|max}_set() to allow supporting drivers to implement rate control through devlink. New attributes are optional. Driver implementations are allowed to support either or both of them. Shared rate example: $ devlink port function rate set netdevsim/netdevsim10/0 tx_share 10mbit $ devlink port function rate show netdevsim/netdevsim10/0 netdevsim/netdevsim10/0: type leaf tx_share 10mbit Max rate example: $ devlink port function rate set netdevsim/netdevsim10/0 tx_max 100mbit $ devlink port function rate show netdevsim/netdevsim10/0 netdevsim/netdevsim10/0: type leaf tx_max 100mbit Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-02devlink: Introduce rate objectDmytro Linkin1-0/+11
Allow registering rate object for devlink ports with dedicated devlink_rate_leaf_{create|destroy}() API. Implement new netlink DEVLINK_CMD_RATE_GET command that is used to retrieve rate object info. Add new DEVLINK_CMD_RATE_{NEW|DEL} commands that are used for notifications when creating/deleting leaf rate object. Rate API is intended to be used for rate limiting of individual devlink ports (leafs) and their aggregates (nodes). Example: $ devlink port show pci/0000:03:00.0/0 pci/0000:03:00.0/1 $ devlink port function rate show pci/0000:03:00.0/0: type leaf pci/0000:03:00.0/1: type leaf Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-01-22devlink: Support get and set state of port functionParav Pandit1-0/+20
devlink port function can be in active or inactive state. Allow users to get and set port function's state. When the port function it activated, its operational state may change after a while when the device is created and driver binds to it. Similarly on deactivation flow. To clearly describe the state of the port function and its device's operational state in the host system, define state and opstate attributes. Example of a PCI SF port which supports a port function: $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev $ devlink port show pci/0000:06:00.0/65535: type eth netdev ens2f0np0 flavour physical port 0 splittable false $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88 pci/0000:08:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:00:00 state inactive opstate detached $ devlink port show pci/0000:06:00.0/32768 pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:88:88 state inactive opstate detached $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 state active $ devlink port show pci/0000:06:00.0/32768 -jp { "port": { "pci/0000:06:00.0/32768": { "type": "eth", "netdev": "ens2f0npf0sf88", "flavour": "pcisf", "controller": 0, "pfnum": 0, "sfnum": 88, "external": false, "splittable": false, "function": { "hw_addr": "00:00:00:00:88:88", "state": "active", "opstate": "attached" } } } } Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2021-01-22devlink: Introduce PCI SF port flavour and port attributeParav Pandit1-0/+5
A PCI sub-function (SF) represents a portion of the device similar to PCI VF. In an eswitch, PCI SF may have port which is normally represented using a representor netdevice. To have better visibility of eswitch port, its association with SF, and its representor netdevice, introduce a PCI SF port flavour. When devlink port flavour is PCI SF, fill up PCI SF attributes of the port. Extend port name creation using PCI PF and SF number scheme on best effort basis, so that vendor drivers can skip defining their own scheme. This is done as cApfNSfM, where A, N and M are controller, PCI PF and PCI SF number respectively. This is similar to existing naming for PCI PF and PCI VF ports. An example view of a PCI SF port: $ devlink port show pci/0000:06:00.0/32768 pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false function: hw_addr 00:00:00:00:88:88 state active opstate attached $ devlink port show pci/0000:06:00.0/32768 -jp { "port": { "pci/0000:06:00.0/32768": { "type": "eth", "netdev": "ens2f0npf0sf88", "flavour": "pcisf", "controller": 0, "pfnum": 0, "sfnum": 88, "splittable": false, "function": { "hw_addr": "00:00:00:00:88:88", "state": "active", "opstate": "attached" } } } } Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Vu Pham <vuhuong@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2020-12-16devlink: use _BITUL() macro instead of BIT() in the UAPI headerTobias Klauser1-1/+1
The BIT() macro is not available for the UAPI headers. Moreover, it can be defined differently in user space headers. Thus, replace its usage with the _BITUL() macro which is already used in other macro definitions in <linux/devlink.h>. Fixes: dc64cc7c6310 ("devlink: Add devlink reload limit option") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Link: https://lore.kernel.org/r/20201215102531.16958-1-tklauser@distanz.ch Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-24devlink: Fix reload stats structureMoshe Shemesh1-0/+2
Fix reload stats structure exposed to the user. Change stats structure hierarchy to have the reload action as a parent of the stat entry and then stat entry includes value per limit. This will also help to avoid string concatenation on iproute2 output. Reload stats structure before this fix: "stats": { "reload": { "driver_reinit": 2, "fw_activate": 1, "fw_activate_no_reset": 0 } } After this fix: "stats": { "reload": { "driver_reinit": { "unspecified": 2 }, "fw_activate": { "unspecified": 1, "no_reset": 0 } } Fixes: a254c264267e ("devlink: Add reload stats") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/1606109785-25197-1-git-send-email-moshe@mellanox.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-09devlink: Add remote reload statsMoshe Shemesh1-0/+1
Add remote reload stats to hold the history of actions performed due devlink reload commands initiated by remote host. For example, in case firmware activation with reset finished successfully but was initiated by remote host. The function devlink_remote_reload_actions_performed() is exported to enable drivers update on remote reload actions performed as it was not initiated by their own devlink instance. Expose devlink remote reload stats to the user through devlink dev get command. Examples: $ devlink dev show pci/0000:82:00.0: stats: reload: driver_reinit 2 fw_activate 1 fw_activate_no_reset 0 remote_reload: driver_reinit 0 fw_activate 0 fw_activate_no_reset 0 pci/0000:82:00.1: stats: reload: driver_reinit 1 fw_activate 0 fw_activate_no_reset 0 remote_reload: driver_reinit 1 fw_activate 1 fw_activate_no_reset 0 $ devlink dev show -jp { "dev": { "pci/0000:82:00.0": { "stats": { "reload": { "driver_reinit": 2, "fw_activate": 1, "fw_activate_no_reset": 0 }, "remote_reload": { "driver_reinit": 0, "fw_activate": 0, "fw_activate_no_reset": 0 } } }, "pci/0000:82:00.1": { "stats": { "reload": { "driver_reinit": 1, "fw_activate": 0, "fw_activate_no_reset": 0 }, "remote_reload": { "driver_reinit": 1, "fw_activate": 1, "fw_activate_no_reset": 0 } } } } } Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-09devlink: Add reload statsMoshe Shemesh1-0/+6
Add reload stats to hold the history per reload action type and limit. For example, the number of times fw_activate has been performed on this device since the driver module was added or if the firmware activation was performed with or without reset. Add devlink notification on stats update. Expose devlink reload stats to the user through devlink dev get command. Examples: $ devlink dev show pci/0000:82:00.0: stats: reload: driver_reinit 2 fw_activate 1 fw_activate_no_reset 0 pci/0000:82:00.1: stats: reload: driver_reinit 1 fw_activate 0 fw_activate_no_reset 0 $ devlink dev show -jp { "dev": { "pci/0000:82:00.0": { "stats": { "reload": { "driver_reinit": 2, "fw_activate": 1, "fw_activate_no_reset": 0 } } }, "pci/0000:82:00.1": { "stats": { "reload": { "driver_reinit": 1, "fw_activate": 0, "fw_activate_no_reset": 0 } } } } } Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-10-09