ConnectX-3 VF not working on Windows is caused by the old in-box mlx4 driver of PVE

It seems that the reason ConnectX-3 SR-IOV doesn't work in a Windows 10 VM on Proxmox VE 6.3 is PVE's rather old inbox mlx4 driver; to be exact, the driver that ships with the Linux kernel.

With this driver, the Windows VM recognises the virtual functions and they appear to be present, but they do not actually work: the device stops with a “Code 43” status. PVE logs messages like the following:

mlx4_core 0000:01:00.0: vhcr command:0x43 slave:1 failed with error:0, status -22
mlx4_core 0000:01:00.0: Received reset from slave:1
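
To spot these failures without reading the whole kernel log by eye, a small filter can help. The following is a minimal Python sketch, not a tested tool: the regular expression is only an assumption based on the two message shapes quoted in this post, and find_vhcr_failures is a hypothetical helper name. The status -22 is -EINVAL on Linux.

```python
import re

# Assumption: the pattern is derived only from the log lines quoted in this
# post; other kernel versions may format the message differently.
VHCR_FAILURE = re.compile(
    r"mlx4_core (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"vhcr command:?\s*(?P<cmd>0x[0-9a-f]+) slave:(?P<slave>\d+)"
    r".*failed with error:(?P<error>-?\d+), status (?P<status>-?\d+)"
)


def find_vhcr_failures(log_text):
    """Return one dict per VHCR failure line found in log_text."""
    failures = []
    for line in log_text.splitlines():
        match = VHCR_FAILURE.search(line)
        if match:
            failures.append({
                "pci": match["pci"],                # PCI address of the PF
                "command": int(match["cmd"], 16),   # VHCR command opcode
                "slave": int(match["slave"]),       # slave (VF) number
                "status": int(match["status"]),     # negative errno value
            })
    return failures
```

Feeding it the saved output of dmesg from the PVE host gives a list of the failed commands per slave, which makes it easier to see whether a VM start triggered the problem.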

I looked through the “Mellanox OFED for Linux Archived Bug Fixes” document and found that Internal Ref 1178129 closely resembled my situation.

Description: Fixed an issue that prevented Windows virtual machines running over MLNX_OFED Linux hypervisors from operating ConnectX-3 IB ports.

When such failures occurred, the following message (or similar) appeared in the Linux HV message log when users attempted to start up a Windows VM running a ConnectX-3 VF:

“mlx4_core 0000:81:00.0: vhcr command 0x1a slave:1 in_param 0x793000 in_mod=0x210 op_mod=0x0 failed with error:0, status -22”

One difference is that the description mentions “IB ports” while my environment uses “Eth ports”, but the situation is otherwise very similar. This issue was fixed in MLNX_OFED v4.2-1.2.0.0. The latest driver is v4.9-2.2.4.0 LTS as of January 10, 2021, so it should have been solved years ago.

I didn't know why I ran into the problem, but then I found that the version of the Linux inbox driver corresponds to v4.0. What the hell… Someone filed a bug report on the Kernel.org Bugzilla, but nobody cares.
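
To make the version gap concrete, here is a minimal Python sketch that compares a driver version string with the release that contains the fix. It is an illustration under assumptions: the helper names are hypothetical, the comparison simply treats every run of digits as one numeric field, and reading /sys/module/mlx4_core/version assumes the loaded module exposes a version string there. Whether the inbox version numbering lines up with MLNX_OFED numbering is also only as reliable as my own observation above.

```python
import re

# MLNX_OFED release named in this post as containing the fix.
FIXED_IN = "4.2-1.2.0.0"


def version_key(version):
    """Turn a string such as '4.9-2.2.4.0' into (4, 9, 2, 2, 4, 0)."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def predates_fix(driver_version, fixed_in=FIXED_IN):
    """Return True if driver_version sorts before the fixed release."""
    return version_key(driver_version) < version_key(fixed_in)


def read_module_version(path="/sys/module/mlx4_core/version"):
    """Read the loaded module's version string.

    Assumption: the module exposes its version at this sysfs path.
    """
    with open(path) as handle:
        return handle.read().strip()
```

For example, predates_fix("4.0") is true while predates_fix("4.9-2.2.4.0") is false, which matches what I saw: the inbox driver is older than the fix, and the driver I built is newer.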

I had no choice but to build the latest driver myself, and then the ConnectX-3 VF worked without any trouble. Come on, I just fell into a pitfall!

However, it is not a perfect solution: sometimes no traffic flows even though the VF appears to be working fine. I guess this problem is caused by something going wrong during device initialization. If it happens, restart the VM several times until the connection comes up. Once the device works, it seems to keep working as long as the VM is running.

Another thing that puzzles me is that the Sent/Received byte counters in the Windows network status dialog are weird. The received bytes counter is always zero even though the network is actually communicating. Some applications seem to think the network is down when the VF is used. For example, iTunes fails to connect to the Gracenote server because of “no network.” I'm not sure whether these two things are related. With virtio-net, everything works perfectly without these problems, so it is not that my network is bad.

There is little information about SR-IOV on the Internet, so I'm in a fog.


(Updated: 2021-12-07)

I read the Known Issues section of the WinOF v5.50.5400 release notes and found Internal Ref 1297888, which is exactly the “packet counter” issue. So that issue is a Windows driver bug.

Its workaround is listed as N/A, so I have to wait for a fix, but I wonder whether it will ever come. The ConnectX-3 series is already in the LTS phase anyway. I wish I could get a ConnectX-4 at a low price.