Ilya Maximets [Tue, 16 Dec 2025 20:33:47 +0000 (21:33 +0100)]
ovsdb: raft: Discard pre-vote replies during the actual election.
I am not sure how we could receive a pre-vote reply for a term of the
actual vote, but if somehow it does happen, we must not accept it, as
the same server can vote differently in the pre-vote and the real vote
and so we may end up with more than one elected leader.
Ignore the pre-vote reply during the actual elections and warn the user
if this ever happens, so we could investigate further.
Found while investigating a report with a cluster with two elected
leaders. It may not be the cause of the issue and, as stated above,
I'm not even sure if receiving a pre-vote for the actual election term
is possible. But it's better to cover this case explicitly, as the
flag in the reply is not used today.
Fixes: 85634fd58004 ("ovsdb: raft: Support pre-vote mechanism to deal with disruptive server.") Acked-by: Han Zhou <hzhou@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Mon, 15 Dec 2025 19:33:14 +0000 (20:33 +0100)]
netdev-afxdp: Require kernel v5.4+ and need-wakeup.
Kernels with AF_XDP support below v5.4 are not supported and are not
used in any major distributions. Using kernels that old with AF_XDP
that was rapidly changing at the time is also not a good idea in
general. Let's drop support for them. This allows to have the
XDP_USE_NEED_WAKEUP flag to be always on, so we can simplify the code
and stop guessing how many packets kernel can send in one go.
The next milestone may be to require libxdp 1.2+ and kernel 5.18+, so
we can simplify acinclude.m4 and avoid checking for bpf_xdp_detach and
bpf_xdp_query_id API. But it feels a little bit too early for that,
as it excludes Ubuntu 22.04, even if it's probably not a great idea
to run AF_XDP there either. The gains from moving the pole to 5.18
are just a couple of checks in the code, so not doing that for now.
ofproto-dpif-xlate: Log clarification for dp_hash method.
From Open vSwitch 2.10 onwards, dp_hash is the default selection method
for select groups. When this method is used, packet recirculation occurs
to compute the hash at the datapath level. Previously, this case was
logged as "no live bucket", which was misleading.
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2023-December/052860.html Fixes: 53cc166ae5fe ("xlate: Use dp_hash for select groups.") Signed-off-by: Alexandra Rukomoinikova <arukomoinikova@k2.cloud> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Kevin Traynor [Thu, 11 Dec 2025 10:53:22 +0000 (10:53 +0000)]
dpdk: Update to use v25.11.
This commit adds support for DPDK v25.11.
It updates the CI script and documentation.
The following commit was previously added and is required for
compatibility with DPDK 25.11. cd2ff530d740 ("dpdk: Convert dpdk-lcore-mask to DPDK lcore args.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: David Marchand <david.marchand@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Ilya Maximets <i.maximets@ovn.org>
Simon Horman [Tue, 9 Dec 2025 10:37:47 +0000 (10:37 +0000)]
NEWS: Deprecate running user-space with OOT Kernel module.
It is some time now since the OOT kernel module was removed. It is yet
longer since it was deprecated. And it is no longer present in any
supported versions.
However, running user-space with the OOT kernel module, from unsupported
releases, is still supported.
This notice deprecates that support as of the v3.7 release.
With the intention to remove it in the v3.8 release.
As discussed at OVS+OVN Conference 2025.
Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Simon Horman <horms@ovn.org>
Simon Horman [Thu, 11 Dec 2025 17:31:25 +0000 (17:31 +0000)]
dpif-netlink-rtnl: Tighten probe for OOT Kernel tunnels.
It is common to run make check as an unprivileged user.
And in that case calling dpif_netlink_rtnl_create() to add a tunnel
fails due to insufficient privileges. Which is assumed
to mean that OOT Kernel tunnels are used.
Reverse this assumption and assume they are not being used.
An assumption in making this change is that the distinction
is unimportant in terms of run-time tunnel usage as as
make check doesn't use kernel tunnels.
However, a follow-up patch will add a deprecation notice, which prints
in the case that OOT Kernel tunnels are used. It does not seem useful
to print in the case of make check. And doing so breaks existing tests
run by that target.
Ilya Maximets [Wed, 3 Dec 2025 14:45:31 +0000 (15:45 +0100)]
ovsdb-idl: Fix returning non-existent rows from uuid lookup.
IDL may contain deleted or orphan rows that should never be visible
to the user. However, ovsdb_idl_get_row_for_uuid() function simply
looks up the row by UUID and returns it without checking if the row
actually exists in the IDL. This is causing a crash on assertion
failure in ovn-controller when it looks up and finds port binding
records that were already deleted:
5 vlog_abort at lib/vlog.c:1325
6 ovs_assert_failure at lib/util.c:90
7 ovsdb_idl_txn_write__.constprop.0 at lib/ovsdb-idl.c:3650
8 ovsdb_idl_txn_write at lib/ovsdb-idl.c:3742
9 sbrec_port_binding_set_up at lib/ovn-sb-idl.c:39665
10 port_binding_set_down at controller/binding.c:3700
11 if_status_mgr_update at controller/if-status.c:645
12 main at controller/ovn-controller.c:7544
7 ovsdb_idl_txn_write__.constprop.0 at lib/ovsdb-idl.c:3650
3650 ovs_assert(row->new_datum != NULL);
Can be easily reproduced with ovs-vsctl:
$ ovs-vsctl add-br br-int
$ ovs-vsctl del-br br-int \
-- set bridge $(ovs-vsctl get bridge br-int _uuid) other-config:a=b
ovs-vsctl: lib/ovsdb-idl.c:2673:
assertion row->new_datum != NULL failed in ovsdb_idl_read()
Aborted (core dumped)
Fix that by adding an extra check for row existence like in IDL
iterators, so deleted or orphan rows can no longer be found.
Fixes: 979821c0a6b0 ("ovsdb-idl: Allow clients to modify records without using structs.") Reported-by: Dumitru Ceara <dceara@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Acked-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Wed, 3 Dec 2025 11:01:05 +0000 (12:01 +0100)]
ovsdb: transaction: Fix logging order of duplicate index rows.
The message about conflicting rows is trying to order the two rows
in a consistent manner, so the log messages do not change in tests.
But it fails to do so, because the order of columns in the column
set depends on the order of columns inside the hash map, which
depends on the hash function and the internal implementation details
of the hash map. This results in random test failures, when two
rows end up in the opposite order.
Uncovered while testing a different hash map implementation, but the
failure is technically possible even without any changes in the code,
e.g., by running on a different CPU architecture or with different
compiler flags.
Fix that by introducing a new function that constructs the column
set with columns in a predictable order and without UUID columns that
have random values in most cases and so not actually comparable.
Ilya Maximets [Fri, 14 Nov 2025 12:24:59 +0000 (13:24 +0100)]
dns-resolve: Do not treat never accessed responses as expired.
If a daemon has multiple remotes to connect to and it already reached
pretty high backoff interval on those connections, it is possible that
it will never be able to connect, if the DNS TTL value is too low.
For example, if ovn-controller has 3 remotes in the ovn-remote
configuration, where each is presented as a host name, the following
is happening when it reaches default max backoff of 8 seconds:
1. Tries to connect to the first remote - sends async DNS request.
2. Since jsonrpc/reconnect modules are not really aware of this, they
just treat connection as temporarily failed - 8 second backoff.
3. After the backoff - switching to a new remote - sending DNS request.
4. Temporarily failing - 8 second backoff.
5. Switching to the third remote - sending DNS request.
6. Temporarily failing - 8 second backoff.
7. Finally trying the first remote again - checking DNS.
8. There is a processed response, but it is already 24 seconds old.
9. If DNS TTL is lower than 24 seconds - consider expired - send
a new DNS request.
10. Go to step 2.
With that, if DNS TTL is lower than 3x of the backoff interval, the
process will never be able to connect without some external help to
break the loop.
A proper solution for this should include:
1. Making jsonrpc and reconnect and all the users of these modules
aware of the DNS request being made. This means introduction of
a new RECONNECT state for DNS request and not switching to a new
target while we're still in this state.
2. Making the poll loop state machine properly react to DNS responses
by waiting on the file descriptor provided by the unbound library.
However, such solution will be very invasive to the code structure
and all the involved libraries, so it may not be something that we
would want to backport as a bug fix to stable branches.
Instead, making a much simpler change to allow use of never previously
accessed DNS replies for a short period of time, so the loop can be
broken. It's not caching if we just requested the value, but didn't
use it yet, it's a "transaction in progress" situation in which we can
use the response even if TTL is zero. Without a proper solution though
we can't be sure that the process will ever look at the result of
asynchronous request, so we need to have an upper limit for such
"transactions in progress". Limiting them to a fairly arbitrary, but
big enough, value of 5 minutes. In the worst case where the address
actually goes stale in between our request and the first access, we'll
try to use the stale value once and then re-request right away on
failure to connect.
This solution seems reasonable and simple enough to backport to stable
branches while working on the proper solution on main.
MJ Ponsonby [Thu, 13 Nov 2025 15:08:30 +0000 (15:08 +0000)]
dpif: Maintain upcall_pid, mtu and hash.
The commit in the fixes tag extended struct dpif_execute with an
upcall_pid member.
This member was left uninitialized in dpif_execute_helper_cb(),
leading to Open vSwitch providing the kernel with a PID for which
it does not have a listener, we fix this by extending the
execute_helper_aux struct to include the previous dpif_execute struct
which we use to carry over various values such as the mtu, hash and
upcall_pid.
This condition is caught by one of the existing system-traffic
tests which is extended to explicitly check for presence of lost
upcalls in the datapath, however only with linux kernel >= 6.17.0-5.5
and gcc < 15.
Fixes: 0d9dc8e9ca4a ("dpif-netlink: Provide original upcall pid in 'execute' commands.") Fixes: 27130224cd5c ("dpif-netlink: Allow MRU packet attribute.") Fixes: 0442bfb11d6c ("ofproto-dpif-upcall: Echo HASH attribute back to datapath.") Fixes: f9d30390392c ("dpif: Fix use of uninitialized execute hash.") Signed-off-by: MJ Ponsonby <mj.ponsonby@canonical.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Thu, 30 Oct 2025 10:56:54 +0000 (11:56 +0100)]
dpif-netlink: Fix probing for broken meters on Linux v5.10+.
If someone creates a lot of meters (>1017) in the kernel datapath
and then re-starts ovs-vswitchd, OVS fails to probe meter support
and refuses to install any meters afterwards:
dpif_netlink|INFO|The kernel module has a broken meter implementation.
The reason is that probing for broken meters relies on creating
two meters with high meter IDs. Inside the kernel, however, the
meter table is not a real hash table since v5.10 and commit:
c7c4c44c9a95 ("net: openvswitch: expand the meters supported number")
Instead, it's just an array with meter IDs mapped to indexes with a
simple modulo operation. This array can expand, but only when it is
full. There is no handling of collisions, so if the meter at
ID % size is not the right one, then the lookup just fails. This is
fine as long as userspace creates meters with densely packed IDs,
which is the case for ovs-vswitchd... except for probing.
While probing, we attempt to create meters 54545401 and 54545402.
Without expanding the table size, these map onto 1017 and 1018.
So, creation of these meters fails if there are already meters with
IDs 1017 or 1018. At this point OVS declares meter implementation
broken.
Ideally, we should make the "hash" table in the kernel handle
collisions and otherwise be a more or less proper hash table. But
we can also improve probing in userspace and avoid this issue by
choosing lower numbered meter IDs and trying to get them from the
kernel first before trying to create. Choosing high values at the
top of the 0-1023 range, so they are guaranteed to fit into the
minimal size table in the kernel (1024). If one of these already
exists and has a proper ID, then meters are likely working fine
and we don't need to install new ones for probing. If these meters
are not in the kernel, then we can try to create and check.
This logic should work fine for older or future kernels with a
proper hash table as well.
There is no Fixes tag here as the check was correct at the moment
it was introduced. It's the kernel change that broke it.
Ilya Maximets [Thu, 6 Nov 2025 12:14:44 +0000 (13:14 +0100)]
windows: Fix absolute-header.m4 for newer msys2 appearing as cygwin.
MSYS2 is trying to move closer to cygwin and as part of that effort
they changed reporting of the system triplets from msys to cygwin:
https://www.msys2.org/news/#2025-06-20-replacing-x86_64-pc-msys-with-x86_64-pc-cygwin
This broke our builds in AppVeyor, since detection of absolute paths
for system headers is now using the wrong path separator list:
checking absolute name of <stdio.h>... ""
checking absolute name of <string.h>... ""
Fix by adding cygwin to the list for Windows-style paths. With that,
we can properly detect the headers again:
checking absolute name of <stdio.h>...
"C:\\Program Files (x86)\\Windows Kits\\10\\include\\10.0.22000.0\\ucrt\\stdio.h"
checking absolute name of <string.h>...
"C:\\Program Files (x86)\\Windows Kits\\10\\include\\10.0.22000.0\\ucrt\\string.h"
We used the 'Previous' variant of the image for AppVeyor before to
avoid this issue, but now even the previous image has the new MSYS2
and CI is fully broken without the fix.
Acked-by: Mike Pattrick <mkp@redhat.com> Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
David Marchand [Wed, 29 Oct 2025 15:23:44 +0000 (16:23 +0100)]
netdev: Fix reported rate for Linux and BSD ports.
As the link speeds have evolved over the years, the list of supported
rates is obsolete in OVS.
Yet a catch all value (NETDEV_F_OTHER) is available, so use it.
Fixes: 6240c0b4c80e ("netdev: Add netdev_get_speed() to netdev API.") Acked-by: Mike Pattrick <mkp@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: David Marchand <david.marchand@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Fri, 24 Oct 2025 16:56:09 +0000 (18:56 +0200)]
ipf: Work around thread safety warnings with clang 21.
In a few places the list sizes are checked without holding the lock.
While this probably speeds up the processing a bit, it's technically
incorrect as the access is not atomic. Caught by thread-safety
analyzer of clang 21, as it now properly supports GUARDED annotations
on structure fields:
lib/ipf.c:1117:33:
error: reading the value pointed to by 'frag_exp_list' requires
holding any mutex [-Werror,-Wthread-safety-analysis]
1117 | if (ovs_list_is_empty(&ipf->frag_exp_list)) {
| ^
lib/ipf.c:1154:33:
error: reading the value pointed to by 'reassembled_pkt_list'
requires holding any mutex [-Werror,-Wthread-safety-analysis]
1154 | if (ovs_list_is_empty(&ipf->reassembled_pkt_list)) {
| ^
Not trying to fix the actual non-atomic access to the list as getting
the wrong result in these particular calls will not meaningfully
impact the correctness of the code, but taking the lock before checking
will likely reduce performance.
For now, just working around the problem by turning off thread safety
analysis for these particular calls. The new _unsafe() access function
is also a bit more obvious in a way that it's more clear that these
calls should not be relied upon for logic that requires precision.
Ilya Maximets [Fri, 24 Oct 2025 16:59:06 +0000 (18:59 +0200)]
ipf: Fix potential deadlock with the clean thread on exit.
There is a potential deadlock between the thread destroying the ipf
and the clean thread as they are using the latch and the ipf lock in
the opposite order. The latch itself is thread-safe, so we should
not be holding the lock while setting it and waiting for the clean
thread that may be waiting for the lock.
Yunjian Wang [Thu, 23 Oct 2025 01:35:24 +0000 (09:35 +0800)]
dpif-netlink: Fix memory leak when re-add vport channel.
When vport_add_channel() is called duplicate, the resources for
previously specified sock was not freed. This issue only occurs
when per-cpu dispatch is disabled. This patch fixes this issue.
Reported by Address Sanitizer.
Direct leak of 60 byte(s) in 3 object(s) allocated from:
0 0xffffb3658080 in malloc (/usr/lib64/libasan.so.6+0xa9080)
1 0x922630 in xmalloc__ lib/util.c:141
2 0x922718 in xmalloc lib/util.c:176
3 0x9c67e4 in nl_sock_create lib/netlink-socket.c:147
4 0x94cb6c in create_nl_sock lib/dpif-netlink.c:283
5 0x950bec in dpif_netlink_port_add__ lib/dpif-netlink.c:978
6 0x951a20 in dpif_netlink_port_add_compat lib/dpif-netlink.c:1101
7 0x951cd0 in dpif_netlink_port_add lib/dpif-netlink.c:1147
8 0x616354 in dpif_port_add lib/dpif.c:602
9 0x49f424 in port_add ofproto/ofproto-dpif.c:4144
10 0x44d51c in ofproto_port_add ofproto/ofproto.c:2204
11 0x416914 in iface_do_create vswitchd/bridge.c:2203
12 0x416dbc in iface_create vswitchd/bridge.c:2246
13 0x40e1d0 in bridge_add_ports__ vswitchd/bridge.c:1225
14 0x40e290 in bridge_add_ports vswitchd/bridge.c:1241
15 0x40cc6c in bridge_reconfigure vswitchd/bridge.c:952
16 0x420884 in bridge_run vswitchd/bridge.c:3440
17 0x42f3d0 in main vswitchd/ovs-vswitchd.c:137
Reproduce steps:
ovs-vsctl add-br br-ovs
ovs-vsctl add-port br-ovs test -- set interface test type=internal
ip netns add ns_test
ip link set test netns ns_test
ip netns del ns_test
ifconfig br-ovs up
sleep 1
ifconfig br-ovs down
sleep 1
ovs-vsctl del-br br-ovs
Fixes: 1579cf677fcb ("dpif-linux: Implement the API functions to allow multiple handler threads read upcall.") Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Fri, 17 Oct 2025 17:11:07 +0000 (19:11 +0200)]
configure: Disable clang 21 warning for uninitialized const fields.
'default-const-init-field-unsafe' warning complains about allocating
structures on stack if they have any const fields. We use this for
classifier rules that are embedded into many other structures:
ofproto/ofproto.c:4980:26:
error: default initialization of an object of type
'struct rule_criteria' with const member leaves the
object uninitialized [-Werror,-Wdefault-const-init-field-unsafe]
4980 | struct rule_criteria criteria;
| ^
./lib/classifier.h:356:15:
note: member 'priority' declared 'const' here
356 | const int priority;
| ^
Priority is marked as const as it is not supposed to be changed after
the rule creation, as that would break the classifier logic. Any code
that changes the value is clearly seen, as it requires the explicit
CONST_CAST. Initialization functions are the only places where this is
happening. And these functions are always called for all the instances
allocated on stack, but clang fails to recognize this.
I believe, described code pattern is useful, so it's better to turn
off the warning instead of making the field non-const. Alternative
would be to use an explicit initializer for every stack allocation
to silence the warning, but it feels like a bit too much.
Ilya Maximets [Fri, 17 Oct 2025 17:11:06 +0000 (19:11 +0200)]
thread: Convert init/destroy lock functions to use non-const arguments.
'-Wuninitialized-const-pointer' complains about passing uninitialized
data into a function via a const pointer. That's a valid concern,
since normally the function accepting a const argument wouldn't write
to it, but will likely use the data.
However, we do have some cases of such a coding pattern in OVS today
and they generate a ton of warnings with clang 21.
The main offenders of this are all the types of locks in OVS: mutex,
rwlock, spinlock. All the locking functions accept const pointers
and then do a CONST_CAST inside. This is useful for the cases where
we have a lock protecting a structure and want to have some read-only
function that accepts this structure, e.g.:
struct my_type {
struct ovs_mutex mutex;
int value;
};
int get_value(const struct my_type *p)
{
int res;
ovs_mutex_lock(&p->mutex);
res = p->value;
ovs_mutex_unlock(&p->mutex);
return res;
}
If the ovs_mutex_lock() required non-const argument, then we'd have
to use CONST_CAST for every lock/unlock, or we would have to make the
original pointer 'p' non-const, cascading the change through all the
callers of this "read-only" functions.
However, it's totally reasonable for the mutex init and destroy to
not have a constant argument, since we're explicitly trying to change
it and the initialization and destruction likely not happening for
the constant parent structure.
Let's convert both init and destroy API for all the locks to accept
non-const arguments. This makes clang happy as we're no longer
passing in the uninitialized locks as const.
This technically changes the API, but we're relaxing the restriction,
so it shouldn't be a concern. All the code that used these functions
before should still work fine.
Ilya Maximets [Fri, 17 Oct 2025 17:11:05 +0000 (19:11 +0200)]
treewide: Fix clang 21 thread-safety warnings for init/create/destroy.
Clang 21 now properly implements GUARDED annotations for structure
fields, and so it complains about those fields being accessed in
init/create/destroy type of functions that normally do not take any
locks, as the data is just allocated and not available to other
threads. Add some locks where it is simple to do, so we don't have
to turn off the analysis entirely, in other places just turn off the
analysis.
While at it, adding a couple of missed annotations that clang started
to complain about due to actual support for the GUARDED annotation on
structure fields. Added some unused arguments to a couple of functions
to be able to specify which lock is required. It seemed better than
turning off the analysis.
Ilya Maximets [Fri, 17 Oct 2025 17:11:04 +0000 (19:11 +0200)]
treewide: Remove OVS_GUARDED from RCU-based structure fields.
This annotation never worked on structure fields, but it does now with
clang 21. The problem is that structures like cmap or rculist are
designed for lockless access for readers and so we're not holding any
locks while reading. That makes new clang generate thread-safety
warnings:
lib/dpif-netdev.c:3632:40:
warning: reading the value pointed to by 'flow_table' requires
holding any mutex [-Wthread-safety-analysis]
3632 | &pmd->flow_table) {
| ^
ssl: Support for SNI extension in clients for ssl/tls.
This commit introduces Server Name Indication (SNI) support for all OVS
SSL/TLS clients, enabling clients to specify which hostname they are
attempting to reach during the SSL handshake. This is essential for
connecting through proxies, load balancers, and service meshes where
the connection endpoint differs from the intended server name.
An example use case is trying to connect to a ovsdb-server through
a istio proxy gateway in kubernetes environments.
stream-ssl: Fix missing OPT_SSL_CIPHERSUITES in STREAM_SSL_CASES.
The OPT_SSL_CIPHERSUITES option was defined in SSL_OPTION_ENUMS and handled
in STREAM_SSL_OPTION_HANDLERS, but was missing from STREAM_SSL_CASES.
This could cause tools using STREAM_SSL_CASES to not recognize
--ssl-ciphersuites as a valid SSL option.
Fixes: 4d09d6b48ef5 ("stream-ssl: Add explicit support for configuring TLSv1.3.") Signed-off-by: Gurucharan Shetty <guru@ovn.org> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Thu, 9 Oct 2025 09:21:26 +0000 (11:21 +0200)]
github: Add a job to build on ubuntu-14.04.
While adding new code we frequently miss that certain compiler features
may be relatively new, or more often that some system headers are not
available or do not have certain definitions in them. This results in
builds failing on older systems.
Adding a new CI job that runs inside Ubuntu 14.04 container, which is
the oldest Ubuntu that is in the "legacy support" mode:
https://ubuntu.com/about/release-cycle
This image has GCC 4.8 that is missing a lot of modern features and
it's based on Linux v3.13 kernel that also has a lot of definitions
missing in uAPI headers, compared to modern ones. This makes this
image a good candidate for a baseline "old distribution" testing.
This job can't cover everything and there will be different
configurations and distributions that may still fail, especially if
they have custom backports or some packages much newer than others.
But it should cover the vast majority of potential issues.
Since we're running inside a very old container, we can't use any of
the pre-defined GitHub workflows like 'checkout' or 'cache', as they
are based on Node.js that is built for much newer version of Ubuntu
and so requires much newer glibc to run. Hence doing everything
manually.
Need to disable SSL, as we require OpenSSL 1.1.1+, which can probably
be built, but it seems like a bit of a waste of time to re-build so
many large things from sources. Need to build a newer python though,
as python >= 3.7 is required in order to build OVS.
Building python 3.12 because it's the same as in other tests. We could
also find and choose the latest 3.12.z release automatically, but it's
much less code to just manually stick to the current latest 3.12.11.
There should be no reason to update it frequently.
Ilya Maximets [Thu, 9 Oct 2025 09:21:25 +0000 (11:21 +0200)]
test-psample: Fix missing field initializer warnings on older GCC.
GCC 4.8 complains for some reason:
tests/test-psample.c: In function 'run':
tests/test-psample.c:217:12:
error: missing initializer for field 'packet' of 'struct sample'
[-Werror=missing-field-initializers]
struct sample sample = {};
^
While it is a little strange to complain, the {} initializer is a C++
thing and a GNU extension, so we should not be using it. It's only
available in C23 standard.
Also, the initialization is not even necessary here, all the fields
will be initialized later with sample_clear(), as long as it covers
all the fields (rate was missing).
Ilya Maximets [Thu, 9 Oct 2025 09:21:24 +0000 (11:21 +0200)]
test-netlink-policy: Fix missing field initializer warnings on older GCC.
For some reason GCC 4.8 doesn't like this style of initialization,
complaining that nla_type is missing in the initializer:
tests/test-netlink-policy.c: In function 'test_nl_policy_parse_ll_addr':
tests/test-netlink-policy.c:60:9:
error: missing initializer for field 'nla_type' of 'struct nlattr'
[-Werror=missing-field-initializers]
.nlattr.nla_type = TEST_POLICY_ATTR,
^
Let's use a normal designated initializer for the nested structure
to avoid this issue.
Fixes: 2f2ae5b6bdef ("tests: Fix endianness in netlink policy test fixtures.") Acked-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Thu, 9 Oct 2025 09:21:22 +0000 (11:21 +0200)]
netlink: Fix build on kernels < 4.1 due to missing RTA_VIA.
Current code assumes that if the rtnetlink.h exists, then it must
have definitions for RTA_VIA and 'struct rtvia'. This is causing
build failures on older systems:
lib/netlink.c: In function 'min_attr_len':
lib/netlink.c:832:38:
error: invalid application of 'sizeof' to incomplete type 'struct rtvia'
case NL_A_RTA_VIA: return sizeof(struct rtvia) + sizeof(struct in_addr);
^
The code structure overall is not great, we should not include Linux
specific headers in the common netlink.c. The only reason for the
inclusion though is the structure size. We can just put a number
instead, as it is already done for all other types. It should never
change, as long as kernel uAPI is stable.
Structure and the RTA_VIA should be defined in the Linux-specific
route-table.c, in case rtnetlink.h doesn't have a definition for them.
Ilya Maximets [Thu, 9 Oct 2025 09:21:21 +0000 (11:21 +0200)]
m4: Don't add -mno-avx512f if compiler doesn't support it.
When building using an older compiler that doesn't support avx512,
we are adding '-mno-avx512f' into the command line, which results
with a build failure:
gcc: error: unrecognized command line option '-mno-avx512f'
This is a case, for example, while trying to build OVS with GCC 4.8.
Fix that by avoiding binutils check when compiler doesn't understand
-mavx512f.
Later in the call chain there is also an explicit check for -mavx512f
support, but it will just use the cached result, so it's not a problem.
Ales Musil [Tue, 7 Oct 2025 06:27:28 +0000 (08:27 +0200)]
ovsdb-idl: Add a way to assert IDL txn as read only.
Add a way to assert IDL txn as read only, that is useful in cases when
we want to ensure that the program doesn't write anything into the
txn. It is done via assert because this is considered a bug which
should be fixed in the IDL caller.
Acked-by: Dumitru Ceara <dceara@redhat.com> Signed-off-by: Ales Musil <amusil@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Mike Pattrick [Tue, 7 Oct 2025 15:16:53 +0000 (11:16 -0400)]
offload-dpdk: Don't use 24bit value as 32bit.
Currently while handling the 24 bit VNI field in VXLAN code,
netdev-offload-dpdk will actually read and write 32 bits. The byte that
is overwritten is reserved and supposed to be set to zero anyways, so
this is mostly harmless.
However, Openscanhub correctly identified this as a buffer overrun. Now
use ovs_16aligned_be32 to access the field as a 32bit integer.
Reported-at: https://issues.redhat.com/browse/FDP-1122 Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Ilya Maximets [Tue, 7 Oct 2025 20:04:36 +0000 (22:04 +0200)]
treewide: Fix unused result warnings in NDEBUG builds.
With NDEBUG, all assertions are just ignoring the result of the
condition and this is causing build warnings when the condition
is a result of a function marked with OVS_WARN_UNUSED_RESULT:
lib/dp-packet.c: In function 'dp_packet_ol_send_prepare':
lib/dp-packet.c:580:28:
error: ignoring return value of 'dp_packet_l4_proto_sctp' declared
with attribute 'warn_unused_result' [-Werror=unused-result]
580 | ovs_assert(dp_packet_l4_proto_sctp(p));
./include/openvswitch/util.h:58:40:
note: in definition of macro 'ovs_assert'
58 | #define ovs_assert(CONDITION) ((void) (CONDITION))
| ^~~~~~~~~
lib/dp-packet.c:621:24:
error: ignoring return value of 'dp_packet_inner_l4_proto_sctp'
declared with attribute 'warn_unused_result' [-Werror=unused-result]
621 | ovs_assert(dp_packet_inner_l4_proto_sctp(p));
./include/openvswitch/util.h:58:40:
note: in definition of macro 'ovs_assert'
58 | #define ovs_assert(CONDITION) ((void) (CONDITION))
| ^~~~~~~~~
lib/dp-packet.c:634:20:
error: ignoring return value of 'dp_packet_l4_proto_udp' declared
with attribute 'warn_unused_result' [-Werror=unused-result]
634 | ovs_assert(dp_packet_l4_proto_udp(p));
./include/openvswitch/util.h:58:40:
note: in definition of macro 'ovs_assert'
58 | #define ovs_assert(CONDITION) ((void) (CONDITION))
| ^~~~~~~~~
cc1: all warnings being treated as errors
Let's export the ignore() function as ovs_ignore() and use it for
the ovs_assert in case of NDEBUG build. This will silence the
warnings.
Most files already import "util.h", so there is no need to add
the public "openvswitch/util.h" as well.
Ilya Maximets [Tue, 7 Oct 2025 14:43:16 +0000 (16:43 +0200)]
ofproto-dpif.at: Wait for logs in select group tests.
These tests are executing group updates and then check the logs, but
it is possible that logs are not yet flushed to the disk. That is
causing sporadic failures like this one:
+awk: stddev.awk:10: fatal: division by zero attempted
This happens when we run the awk script to check the bucket load
before the logs actually appear in the file.
Fix that by always waiting for the logs.
FreeBSD also has a long standing issue with async IO where some logs
can be re-ordered and jammed together. This mostly happens in the
'equal weights' test due to high amount of logging. Changing the
pattern from 'ofproto_dpif|DBG|.*Bucket' to a shorter '|DBG|.*Bucket'
should solve most of the issues of this kind in this particular test.
Ivan Burnin [Thu, 18 Sep 2025 09:06:27 +0000 (17:06 +0800)]
checkpatch: Set explicit encoding in do_authors_exist.
The UTF-8 encoding system is used by default on most operating systems, but
not all of them. This can cause the ’24:checkpatch - AUTHORS.rst existence’
test to fail with a UnicodeDecodeError. All Python versions < 3.15 are
affected, resolved in PEP 686 (https://peps.python.org/pep-0686/)
Fixes: a6ccd111552d ("checkpatch: Add new check-authors-file option to checkpatch.py.") Acked-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Ivan Burnin <iburnin@k2.cloud> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
tests: Revalidate on flow change in tunnel-push-pop.
The `tunnel_push_pop - action` test case may fail due to a race
condition between the revalidator modifying flows and dummy `receive`
action. Other test cases are also potentially affected.
To fix the race for this test case and other test cases in the same
file, this patch will call revalidator/wait every time a flow is
modified before sending a packet through a bridge using `receive`.
tests: Ignore no tunnel port error on vswitch stop.
The test failure occurs quite consistently (~50% of the runs) when using
a musl static build. It always happens if I add `sync` or `sleep` before
vswitchd shutdown. The warning message can be seen in the
retained vswitchd log (-d) even when the test case passes.
I believe a race condition between vswitchd flushing the log to disc and
the test code reading from the log file explains why the test case
doesn't fail consistently.
tests: Don't fail when test path contains 'p2' string.
The `grep` pattern was so broad that it allowed to match any string that
contained `p2`. This is problematic because the path to service log and
sockets is part of log message. It made it so that the test case failed
whenever executed from a directory that has `p2` in any of its absolute
path chain.
tests: Gracefully handle EADDRINUSE string from musl.
Different libc implementations may have different string representations
for the same errno. Service shutdown routines verify that no
*unexpected* warnings are seen in service logs. This makes the test case
fail because a slightly different warning is observed.
This patch makes the test case gracefully handle the musl libc
representation.
When nsec == 0, ofp_print_duration will not add a fractional part. This
may happen either when you are very lucky, or when you use a libc
without nanosecond precision.
appveyor: More robust OpenSSL installation wrt path changes.
AppVeyor had an issue where C:\OpenSSL-Win64 folder didn't exist for
some reason and the installation of OpenSSL was placing it into
C:\Program Files\OpenSSL-Win64 instead as well. After that we couldn't
find the libraries and the build failed:
https://help.appveyor.com/discussions/problems/38517-cannot-find-path-copenssl-win64-because-it-does-not-exist
It's better if we check the paths and install into specific location to
be more resilient to this kind of environment issues, i.e. make less
assumptions.
While at it, fixing the slash type for the path. Remove-Item somehow
accepts the "wrong" one, but a backslash is more native in paths on
Windows. We use forward slash while in msys2 shell, but should not use
it in PowerShell environment.
Kevin Traynor [Thu, 11 Sep 2025 13:33:13 +0000 (14:33 +0100)]
dpdk: Convert dpdk-lcore-mask to DPDK lcore args.
OVS currently uses other_config:dpdk-lcore-mask <coremask> directly
in DPDK rte_eal_init() with '-c <coremask>' argument.
'-c' argument is now deprecated from DPDK and will be removed in
DPDK 25.11, so OVS will no longer be able to use the '-c <coremask>'
argument.
Convert dpdk-lcore-mask core mask to a core list that can be used with
'--lcores' and add some tests.
The core list is validated to prevent invalid cores being passed to
DPDK rte_eal_init().
Using the '--lcores' argument also adds compatibility for using a core
in the core mask that is greater than the max lcore, similar to commit fe53b478f86e ("dpdk: Fix main lcore on systems with many cores.")
Signed-off-by: Kevin Traynor <ktraynor@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Reviewed-by: David Marchand <david.marchand@redhat.com>
David Marchand [Wed, 10 Sep 2025 14:46:36 +0000 (16:46 +0200)]
odp-execute: Fix packet length check for TSO packets.
TSO packets were incorrectly treated as too big by the check_pkt_len
action with the userspace datapath.
Adjust the check by looking at the requested segment size.
David Marchand [Wed, 10 Sep 2025 15:25:34 +0000 (17:25 +0200)]
netdev-dpdk: Fix IP checksum with net/virtio.
OVS checks a netdev checksum offload capabilities before calling its
transmit helper.
Yet the OVS transmit preparation helper for DPDK drivers requests IP
checksum (setting RTE_MBUF_F_TX_IP_CSUM) regardless of the DPDK driver
support if some L4 checksum is requested.
This is problematic since net/virtio does support L4 checksum offload,
but does not support nor announce IP checksum offload and this DPDK
driver tx prepare helper (calling helper shared with other DPDK drivers)
clears the ip checksum in packet data if RTE_MBUF_F_TX_IP_CSUM is set.
The DPDK mbuf API mandates IP csum when TCP segmentation is requested.
However, IP csum is not required when TCP or other L4 checksum is
requested.
While most DPDK drivers implement both IP and L4 checksums (and this is
likely why the bug was never caught), let's avoid requesting an
unsupported offload.
netdev-native-tnl: Fix DF bit not being extracted into tunnel flags.
OVS adds matches on the DF/CSUM/etc bits of the tunnel info flags, but
the DF bit is never actually extracted from the outer IP header during
the tunnel decapsulation. This is not a huge problem, as we'll only
match on what was parsed out, since matching on DF flag is not exposed
through OpenFlow to the users. And since the bit is never extracted,
we'll just always match on -df, which sort of "works", because the bit
is never extracted and so it is never set. However, this causes
misleading -df matches in the datapath flow dumps even if the packets
actually have the DF bit set, which it is by default.
Fix that by actually extracting the bit from the outer header while
decapsulating tunneled traffic.
Aaron Conole [Wed, 13 Aug 2025 18:05:56 +0000 (14:05 -0400)]
bond: Do not flag a revalidation when adjusting time.
When adjusting bond parameters, any adjustment is considered sufficient
for triggering a rebalance. This is a very simplistic config update
scheme that triggers a complete rebalance even if the time adjustment
would move the next expiration out beyond the last calculated expiration.
For the interval parameter only, we can simply recalculate the expiry
deadline and let the next bond_run() event do the rebalance if needed.
Even if the recalculation would cause the deadline to have occurred in
the past, it should execute on the next bond_run() anyway. This is
still okay, as the rebalance interval timeout may not result in a
full rebalance anyway.
We're not installing anything from FreeBSD-kmods or any other non-main
repository, but it seems to have mismatching packages with the latest
Google Cloud image that appears to not be updated yet:
# pkg update -f
Updating FreeBSD repository catalogue...
Fetching meta.conf: . done
Fetching data.pkg: .......... done
Processing entries: .......... done
FreeBSD repository update completed. 36502 packages processed.
Updating FreeBSD-kmods repository catalogue...
Fetching meta.conf: . done
Fetching data.pkg: .. done
Processing entries:
Newer FreeBSD version for package parallels-tools:
To ignore this error set IGNORE_OSVERSION=yes
- package: 1403506
- running userland: 1403505
Ignore the mismatch and continue? [y/N]:
pkg: repository FreeBSD-kmods contains packages for wrong OS version:
FreeBSD:14:amd64
Processing entries...
Unable to update repository FreeBSD-kmods
Error updating repositories!
Not updating FreeBSD-kmods should avoid this issue. In the worst case
the next 'pkg install' will fail, but it shouldn't really ever happen
as we do not install any kernel modules or packages that depend on
any kernel modules.
ovsdb: transaction: Silence dereference after null check warning.
Coverity reported a possible null dereference of datum->values
when handling value-type weak references in find_and_add_weak_refs().
In practice, datum->values must be non-NULL for value-type weak references.
Adding ovs_assert(datum->values) documents this invariant and ensures that
we catch any unexpected misuse.
No change to runtime behavior in production builds; this only improves
safety and satisfies static analysis.
ovsdb: raft: Silence dereference after null check warning.
Coverity flagged a potential null dereference of txnid
when calling raft_next_entry() in ovsdb_storage_read().
To prevent this, raft_next_entry() now checks whether txnid is non-NULL
before attempting to write to it. This preserves existing behavior
when txnid is valid, while safely handling the case where it is NULL.
No change to normal operation; this only prevents potential crashes
if a NULL txnid is passed.
ovsdb: relay: Silence dereference before null check warning.
Coverity reported a possible null dereference of ctx->db
inside the event processing loop in ovsdb_relay_run().
In practice, ctx->db is guaranteed to be non-NULL for valid relay_ctx
structures. Adding ovs_assert(ctx->db) documents this invariant and
ensures debug builds catch any misuse.
No change in runtime behavior; this only improves code safety and
satisfies static analysis.
ovsdb: trigger: Silence dereference after null check warning.
Coverity reported a possible null dereference of t->reply
when handling errors in the "committing" state for "transact" requests.
In practice, t->reply is always non-NULL at this point, but static analysis
cannot infer this guarantee. Adding ovs_assert(t->reply) documents the
invariant for developers and ensures debug builds will catch any misuse.
No change in runtime behavior; this only improves code safety and
satisfies static analysis.
ovsdb: monitor: Silence dereference after null check warning.
This patch resolves a Coverity-reported issue where a dereference
could occur after a null check in ovsdb_monitor_row_update_type().
An additional condition has been added to ensure a safe early exit
in ovsdb_monitor_compose_row_update().
ofproto: Fix missing lock when reading oftable->vacancy_up/down.
Coverity reported an issue where the oftable->vacancy_up/down
members were read without holding the ofproto_mutex.
This patch ensures the mutex is acquired before accessing these
members and adds OVS_REQUIRES(ofproto_mutex) to the affected
function to enforce proper locking.
Acked-by: Ilya Maximets <i.maximets@ovn.org> Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Fixes: bab86012066c ("Implement Vacancy Events for OFPMP_TABLE_DESC.")
Linda Wang [Fri, 22 Aug 2025 09:46:20 +0000 (17:46 +0800)]
netdev-offload-dpdk: Fix vport hw-offload stat.
For vport devices hardware offload counters were being incremented
on the physical device but decremented on the vport, causing
incorrect statistics that didn't return to zero when flows are
removed.
Fix by using physdev instead of netdev for counter decrements to match
the increment behavior.
Fixes: 0e6366c2399d ("netdev-offload-dpdk: Implement hw-offload statistics read.") Signed-off-by: Linda Wang <linda.wang@jaguarmicro.com> Signed-off-by: Kevin Traynor <ktraynor@redhat.com>
Ilya Maximets [Tue, 26 Aug 2025 12:44:02 +0000 (14:44 +0200)]
tests: system-traffic: Fix flaky floating IP test.
The last ACK from the 3-WAY handshake in this test goes to userspace
(miss upcall) after going through conntrack because +est traffic is
handled differently from +new in this set of OpenFlow rules.
The sender though proceeds with sending the data that may end up
re-ordered with the aforementioned ACK. Since connection is very
short, it is possible that this one ACK will be delivered even after
the connection is already closed, i.e., after all the data is sent
and the connection termination sequence (FIN-ACK-FIN-ACK) is done.
Delivery in this case triggers an RST reply. And RST transitions
TIME_WAIT into CLOSE (CLOSING in OVS terms), causing the test failure.
It's not really possible to fully avoid the packet re-ordering as it
is part of the upcall mechanism. But there is a couple ways to avoid
it in this particular case, e.g., if we predict how the +est packet
will look like and install the datapath flow for +est while processing
the original +new, or by storing the upcall PID in the kernel for the
whole time of action execution and not only for one packet we're
executing actions for (TCP handshake replies are happening in the
context of the initiator, which is OVS handler thread in our case).
But these are large changes that will not help with test failures
on older branches / older kernels. For now, fixing the test instead.
The intention in the test was to check that the state is terminal at
the end of the connection, i.e., that our actions do not leave
established or incomplete conntrack entries. So it should be fine to
check for both TIME_WAIT and CLOSING, as both of these are terminal
states.
Mike Pattrick [Mon, 18 Aug 2025 16:58:31 +0000 (12:58 -0400)]
conntrack: Add support for IPv4 EPRT and EPSV.
Recent versions of Curl combine the EPSV and EPRT commands with IPv4
connections instead of the PASV and PORT commands. EPSV and EPRT were
added to FTP for IPv6 support but these commands also support IPv4. Most
software still uses the old PASV and PORT commands in IPv4 connections
but recent versions of curl will default to EPSV and EPRT for all
connections. This patch adds support for these extended commands, and
added tests for both with and without them.
Checkpatch is used to do spell checking for C files, currently. However,
python code is also a routine part of the Open vSwitch suite, and it should
also receive spell checking where possible.
This patch adds an initial implementation of spell checking to checkpatch
that will catch all of the comments starting with '#', and some of the
doc-string lines. Future work would implement a more robust python
parser that will handle doc-string entries.
Acked-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Eli Oliver <eoliver@redhat.com> Co-Authored-by: Aaron Conole <aconole@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com>
Mike Pattrick [Wed, 13 Aug 2025 19:37:11 +0000 (15:37 -0400)]
ipf: Complete l4 checksum before frag out.
Previously during NAT actions we could calculate a checksum on a packet,
if this checksum was valid we would mark it as such for offloading. Then
when modifying the packet we would check if we set the checksum as
partial. This worked fine for most packets, but fragmented packets can
not complete their checksum because we process these packets one at a
time instead of all at once. This breaks NAT of fragmented packets in
userspace conntrack.
This patch resolves any outstanding checksums on the reassembled
fragment before adding the fragments to a batch.
Reported-at: https://issues.redhat.com/browse/FDP-1571 Reported-by: Hekai Wang <hewang@redhat.com> Fixes: e36793e11fe8 ("dp-packet: Resolve unknown checksums.") Acked-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Eli Britstein [Mon, 14 Jul 2025 11:29:06 +0000 (14:29 +0300)]
ofproto: Move group-modify to mod_start instead of mod_finish.
Upon modifying a group, the following steps occur:
1. ofproto_group_mod_start()->modify_group_start():
Find an old group object, create a new one.
2. ofproto_bump_tables_version()
3. ofproto_group_mod_finish():
Modify the new group object with buckets etc.
At step #3, the new group object is already in use by revalidators,
that may read incorrect data while being modified.
Instead, move the group modification of the new object to step #1.
Fixes: 0a8f6beb54ab ("ofproto-dpif: Fix dp_hash mapping after select group modification.") Acked-by: Gaetan Rivet <gaetanr@nvidia.com> Acked-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Eli Britstein <elibr@nvidia.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
checkpatch: Separate out the built-in spelling words.
Checkpatch contains a large list of spelling words, currently 326. The
in-built list is quite cumbersome, and includes duplicate entries
('netdev' and 'revalidation'). Managing this list will continue to grow
difficult over time, and really shouldn't be so difficult. To make the
maintenance of OVS specific spelling words easier, create a new file
that will contain the spelling words and move the list into that file
(sorted). This patch adds 'FreeBSD' and 'gcloud' to the list as it
was tested against 80d723736b64 ("cirrus: Update to FreeBSD 14.3 and 13.5."). It also
adds "apis", "enablement", "ifdef", and "tnl" after the patch proposed
by Eelco at the referenced link.
ovs-vsctl: Exit with error if postdb checks report errors.
Today the exit code refers to the execution of the change
in the database. However, when not using parameter --no-wait
(default), the ovs-vsctl also checks if OVSDB transactions
are successfully recorded and reload by ovs-vswitchd. In this
case, an error message is printed if there is a problem during
the reload, like for example the one below:
# ovs-vsctl add-port br0 gre0 -- \
set interface gre0 type=ip6gre options:key=100 \
options:remote_ip=2001::2
ovs-vsctl: Error detected while setting up 'gre0': could not \
add network device gre0 to ofproto (Address family not supported\
by protocol). See ovs-vswitchd log for details.
ovs-vsctl: The default log directory is "/var/log/openvswitch".
# echo $?
0
This patch changes to exit with specific error code 160
(ERROR_BAD_ARGUMENTS) on Windows and 65 (EX_DATAERR) on
Linux or BSD if errors were reported during the reload.
This change may break existing scripts because ovs-vsctl will
start to fail when before it was succeeding. However, if an
error is printed, then it is likely that the change was not
functional anyway.
Mike Pattrick [Wed, 19 Jun 2024 12:54:53 +0000 (08:54 -0400)]
ovsdb: Use table indexes if available for ovsdb_query().
Currently all OVSDB database queries except for UUID lookups all result
in linear lookups over the entire table, even if an index is present.
This patch modifies ovsdb_query() to attempt an index lookup first, if
possible. If no matching indexes are present then a linear index is
still conducted.
To test this, I set up an ovsdb database with a variable number of rows
and timed the average of how long ovsdb-client took to query a single
row. The first two tests involved a linear scan that didn't match any
rows, so there was no overhead associated with sending or encoding
output. The post-patch linear scan was a worst case scenario where the
table did have an appropriate index but the conditions made its usage
impossible. The indexed lookup test was for a matching row, which did
also include overhead associated with a match. The results are included
in the table below.
I also tested the performance of ovsdb_query() by wrapping it in a loop
and measuring the time it took to perform 1000 linear scans on 1, 10,
100k, and 200k rows. This test showed that the new index checking code
did not slow down worst case lookups to a statistically detectable
degree.
Reported-at: https://issues.redhat.com/browse/FDP-590 Signed-off-by: Mike Pattrick <mkp@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
dpif: Fix infinite netlink loop in dpif_execute_helper_cb.
When a meter action is encountered and stored in the auxiliary
structure, and subsequently, a non-meter action is processed
within a nested list during callback execution, an infinite
loop is triggered.
This patch maintains the current behavior but stores all
required meter actions in an ofpbuf for deferred execution.
Adrian Moreno [Thu, 10 Jul 2025 10:13:03 +0000 (12:13 +0200)]
docs: Specify retis dependency on USDT probes.
Retis' "--ovs-track" option relies on USDT probes being available.
Fixes: 22732c0e6770 ("tests: Add support for running system tests under retis.") Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
dpif-netlink: Provide original upcall pid in 'execute' commands.
When a packet enters kernel datapath and there is no flow to handle it,
packet goes to userspace through a MISS upcall. With per-CPU upcall
dispatch mechanism, we're using the current CPU id to select the
Netlink PID on which to send this packet. This allows us to send
packets from the same traffic flow through the same handler.
The handler will process the packet, install required flow into the
kernel and re-inject the original packet via OVS_PACKET_CMD_EXECUTE.
While handling OVS_PACKET_CMD_EXECUTE, however, we may hit a
recirculation action that will pass the (likely modified) packet
through the flow lookup again. And if the flow is not found, the
packet will be sent to userspace again through another MISS upcall.
However, the handler thread in userspace is likely running on a
different CPU core, and the OVS_PACKET_CMD_EXECUTE request is handled
in the syscall context of that thread. So, when the time comes to
send the packet through another upcall, the per-CPU dispatch will
choose a different Netlink PID, and this packet will end up processed
by a different handler thread on a different CPU.
The process continues as long as there are new recirculations, each
time the packet goes to a different handler thread before it is sent
out of the OVS datapath to the destination port. In real setups the
number of recirculations can go up to 4 or 5, sometimes more.
There is always a chance to re-order packets while processing upcalls,
because userspace will first install the flow and then re-inject the
original packet. So, there is a race window when the flow is already
installed and the second packet can match it inside the kernel and be
forwarded to the destination before the first packet is re-injected.
But the fact that packets are going through multiple upcalls handled
by different userspace threads makes the reordering noticeably more
likely, because we not only have a race between the kernel and a
userspace handler (which is hard to avoid), but also between multiple
userspace handlers.
For example, let's assume that 10 packets got enqueued through a MISS
upcall for handler-1, it will start processing them, will install the
flow into the kernel and start re-injecting packets back, from where
they will go through another MISS to handler-2. Handler-2 will install
the flow into the kernel and start re-injecting the packets, while
handler-1 continues to re-inject the last of the 10 packets, they will
hit the flow installed by handler-2 and be forwarded without going to
the handler-2, while handler-2 still re-injects the first of these 10
packets. Given multiple recirculations and misses, these 10 packets
may end up completely mixed up on the output from the datapath.
Let's provide the original upcall PID via the new netlink attribute
OVS_PACKET_ATTR_UPCALL_PID. This way the upcall triggered during the
execution will go to the same handler. Packets will be enqueued to
the same socket and re-injected in the same order. This doesn't
eliminate re-ordering as stated above, since we still have a race
between the kernel and the handler thread, but it allows to eliminate
races between multiple handlers.
The openvswitch kernel module ignores unknown attributes for the
OVS_PACKET_CMD_EXECUTE, so it's safe to provide it even on older
kernels.
If an MLD packet is not large enough to contain the
message-specific data, it may lead to a NULL pointer access.
This patch fixes the issue by adding appropriate length checks.
Ilya Maximets [Mon, 30 Jun 2025 17:09:57 +0000 (19:09 +0200)]
tests: Add support for running system tests under retis.
Retis is very useful for debugging our system tests or debugging
kernel issues through our system tests. This change adds a convenient
way to run any kernel system test with the retis capture on the
background. E.g.:
make check-kernel OVS_TEST_WITH_RETIS=yes TESTSUITEFLAGS='167 -d'
Retis 1.5 is required, since we're using ifdump profile, and it also
will mount debugfs for us in case of running in a different namespace.
It should be available in $PATH.
In addition to just capturing the retis.data, we're also running the
capture with --print to print all the events as they appear, and
producing the sorted output in the end. This makes it easier to work
across systems with different versions of retis and saves time for
running the sort manually. The raw data is still available for
advanced processing, if needed.
Not specifying any particular collector, capturing everything that's
enabled by default. OVS tracking is turned on by default.
Since OVS tracking is used, it's required to start retis after the
kernel datapath is created, otherwise it will fail to obtain the map
of upcall PIDs. That's why we need to start it after the bridge is
created.
Only adding support for kernel-related test suites for now. For
userspace test suites it may also be useful at some point, but
currently that requires running without --ovs-track and isn't too
important.
Startup of the retis capture adds significant amount of time to each
test, so not running it by default.
Ilya Maximets [Mon, 30 Jun 2025 11:21:49 +0000 (13:21 +0200)]
seq: Fix deadlock with the time_init.
There is an ABBA deadlock between time_init() and seq_wait():
Thread 1:
poll_block()
time_poll()
time_init()
pthread_once() <-- lock A
do_time_init()
seq_create()
pthread_mutex_lock(seq_mutex) <-- lock B
Thread 2:
seq_wait(different seqno)
pthread_mutex_lock(seq_mutex) <-- lock B
poll_immediate_wake()
poll_timer_wait()
time_msec()
time_init()
pthread_once() <-- lock A
This is likely the same deadlock Intel CI saw last year before the lab
was shut down.
The issue should not happen with normal applications as those would
normally have the time module initialized early in the process before
waiting on any sequence numbers, but it happens in the test-barrier
application from time to time causing the test suite to hang.
Fix that by making sure we're not calling poll_immediate_wake() under
the seq_mutex. The time and seq modules are independent and it's hard
to ensure the dependency without exporting some of their internals.
Instead re-defining the prototype of the poll_immediate_wake_at(),
adding the thread safety annotation, so we have some basic protection
from this deadlock if the code ever changes. Compiler will warn on
the prototype mismatch as well if it ever happens, so it's not a big
problem. Having this prototype also gives us a spot in the code where
we can place a comment explaining the locking order.