Tanvi Jagtap [Wed, 24 Dec 2025 16:50:56 +0000 (08:50 -0800)]
[PH2][Trivial] Tidy up client code - Part 3
Moving code from header to cc file.
Trying to manage the class Http2ClientTransport, which has become 500+ lines and hard to work with. It needs to be reordered as well (in the next PR).
Sergii Tkachenko [Wed, 24 Dec 2025 07:46:56 +0000 (23:46 -0800)]
[Fix][Build] Move xds-protos templates to the new path (#41297)
In #41261, templates weren't moved to the new path `templates/py_xds_protos`, so they didn't render the updated version in `py_xds_protos/grpc_version.py` correctly.
Tanvi Jagtap [Wed, 24 Dec 2025 06:36:51 +0000 (22:36 -0800)]
[PH2][Trivial] Tidy up client code - Part 2
Moving code from header to cc file.
Trying to manage the class Http2ClientTransport which has become 500+ lines and hard to work with.
Notably, this fixes [rules_python repl](https://rules-python.readthedocs.io/en/latest/repl.html):
```
$ bazel --quiet run @rules_python//python/bin:repl
INFO: Running bazel wrapper (see //tools/bazel for details), bazel version 8.0.1 will be used instead of system-wide bazel installation.
Target @@rules_python//python/bin:repl up-to-date:
bazel-bin/external/rules_python/python/bin/repl
bazel-bin/external/rules_python/python/bin/repl_py.py
Python 3.13.11 (main, Dec 6 2025, 02:15:39) [Clang 17.0.0 (clang-1700.3.19.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
```
yuanweiz [Tue, 23 Dec 2025 20:10:06 +0000 (12:10 -0800)]
[bzlmod] Add check to ensure requested versions match selected versions. (#41244)
Bzlmod uses the Minimal Version Selection algorithm for building the dependency graph (see https://bazel.build/external/module#version-selection), which can cause resolved version numbers to be higher than the requested versions. This may lead to subtle bugs and hide behavioral differences between WORKSPACE and MODULE.bazel settings.
This PR does a few things:
* Explicitly turn on --check_direct_dependencies=error for bzlmod tests, so a version mismatch is now an error.
* Bump versions in MODULE.bazel to fix tests in `tools/bazelify_tests/test/bazel_build_with_bzlmod_linux.sh`.
* Update bzl extensions accordingly to minimize the difference between WORKSPACE and bzlmod settings.
yuanweiz [Tue, 23 Dec 2025 19:23:41 +0000 (11:23 -0800)]
Mark BCR Release PRs as ready by default. (#41282)
In our current settings, the release pull request is created as a draft with the author set to `grpc-bot`, making it impossible to auto-merge. An example:
https://github.com/bazelbuild/bazel-central-registry/pull/6872.
Tanvi Jagtap [Mon, 22 Dec 2025 18:02:04 +0000 (10:02 -0800)]
[PH2][Trivial] Tidy up client code - Part 1
Moving code from header to cc file.
Trying to manage the class Http2ClientTransport which has become 500+ lines and hard to work with.
Kai-Hsun Chen [Thu, 18 Dec 2025 22:54:59 +0000 (14:54 -0800)]
[python] aio: fix race condition causing `asyncio.run()` to hang forever during the shutdown process (#40989)
# Root cause
* gRPC AIO creates a Unix domain socket pair, and the current thread passes the read socket to the event loop for reading, while the write socket is passed to a thread for polling events and writing a byte into the socket.
* However, during the shutdown process, the event loop stops reading the read socket without closing it before the polling thread receives the final event to exit the thread.
* The shutdown process will hang if (1) the event loop stops reading the read socket before the polling thread receives the final event to exit the thread, and (2) the polling thread is stuck in the `write` syscall.
* The `write` syscall may get stuck at [sock_alloc_send_pskb](https://elixir.bootlin.com/linux/v5.15/source/net/core/sock.c#L2463) when there is not enough socket buffer space for the write socket. Hence, the polling thread hangs at `write` and cannot continue to the next iteration to retrieve the final event. As a result, the event loop no longer reads the read socket, so the available buffer space for the write socket never increases. Therefore, the current thread hangs when waiting for the polling thread to `join()`.
* `asyncio` will shut down the default executor (`ThreadPoolExecutor`) when `asyncio.run(...)` finishes. Hence, it hangs because some threads can't join (see the standalone sketch below).
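A minimal standalone C++ sketch of the blocking condition described above (this is not gRPC code; it only demonstrates how a blocking `write()` on a full `SOCK_STREAM` socketpair makes `join()` hang when the reader stops reading):
```cpp
// Minimal sketch only: a writer thread fills a socketpair whose reader never
// drains it. Once the kernel send buffer is full, write() blocks and
// poller.join() never returns -- the same shape as the shutdown hang above.
// Note: this program intentionally hangs.
#include <sys/socket.h>
#include <unistd.h>

#include <thread>
#include <vector>

int main() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return 1;

  std::thread poller([&] {
    std::vector<char> chunk(4096, '1');
    // Mirrors the "4096 bytes per write" reproduction below: with no reader
    // draining fds[0], this loop eventually blocks inside write().
    for (int i = 0; i < 1024; ++i) {
      if (write(fds[1], chunk.data(), chunk.size()) < 0) break;
    }
  });

  // Simulate the event loop that stopped reading fds[0] without closing it:
  // read() is never called, so the buffer never drains and join() hangs.
  poller.join();

  close(fds[0]);
  close(fds[1]);
  return 0;
}
```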
# Reproduction
* Step 0: Reduce the socket buffer size to increase the probability of reproducing the issue.
```sh
sysctl -w net.core.rmem_default=8192
```
* Step 1: Manually update `unistd.write(fd, b'1', 1)` to `unistd.write(fd, b'1' * 4096, 4096)`. The goal is to make writes (4096 bytes per write) faster than reads (1 byte per read), thereby nearly filling the write buffer.
https://github.com/grpc/grpc/blob/8e67cb088d3709ae74c1ff31d1655bea6c2b86c0/src/python/grpcio/grpc/_cython/_cygrpc/aio/completion_queue.pyx.pxi#L31
* Step 2: Create an `aio.insecure_channel` and use it to send 100 requests with at most 10 in-flight requests. After all requests finish, the shutdown process will be triggered, and it's highly likely to hang if you follow Steps 0 and 1 correctly. In my case, the reproduction script reproduced the issue 10 out of 10 times.
* Step 3: If it hangs, check the following information:
* `ss -xpnm state connected | grep $PID` => You will find two sockets that belong to the same socket pair; one has non-zero bytes in the read buffer while the other has non-zero bytes in the write buffer. In addition, the write buffer should be close to `net.core.rmem_default`.
* Check the stack of the `_poller_thread` by running `cat /proc/$PID/task/$TID/stack`. The thread is stuck at `sock_alloc_send_pskb` because there is not enough buffer space to finish the `write` syscall.
* Use GDB to find the `_poller_thread` and make sure it's stuck at `write()`, then print its `$rdi` to confirm that the FD is the one with a non-zero write buffer in the socket.
# Test
Follow Steps 0, 1, and 2 in the 'Reproduction' section with this PR applied. It did not hang in any of 10 runs.
- Regen requirements.bazel.lock with Python 3.9
- bump isort to 6.0.1 (except in pylint, which needs to be updated separately)
- fix python version specifiers for black, isort, pylint, and typeguard
- fix default ignore patterns for isort and pylint
- consistent debug info: python version, pip list
- consistent virtualenv naming: `.venv-ci-*`
- bazel: bump typeguard to 4.4.2
- bazel: bumped gevent to `25.9.1`, greenlet to `3.2.4` to support Python 3.13, closes #40685
- bazel: bump pyyaml for python 3.14 support
- bazel: take care of temporary pins to support 3.8-based CIs
Bazel RBE CIs were upgraded in the following changelists and currently run Python 3.10:
- cl/845778848
- cl/845816768
Sergii Tkachenko [Thu, 18 Dec 2025 02:16:21 +0000 (18:16 -0800)]
[Fix][CI] Fix master Bazel RBE jobs running on vanilla Ubuntu 22 (#41251)
Fixes
```pytb
+ python3 ./tools/run_tests/python_utils/upload_rbe_results.py --invocation_id=c9453d05-8c0a-43bc-abb8-1b5d34a163b8
Traceback (most recent call last):
File "/tmpfs/altsrc/github/grpc/./tools/run_tests/python_utils/upload_rbe_results.py", line 31, in <module>
import big_query_utils
File "/tmpfs/altsrc/github/grpc/tools/gcp/utils/big_query_utils.py", line 21, in <module>
from apiclient import discovery
ModuleNotFoundError: No module named 'apiclient'
```
Tanvi Jagtap [Mon, 15 Dec 2025 06:09:41 +0000 (22:09 -0800)]
[PH2][FlowControl][Bug] Adding missing flow control plumbing
This is a hack. The actual fix needs some work in the Call V3 stack, which is scheduled for later.
Sergii Tkachenko [Sat, 13 Dec 2025 08:50:58 +0000 (00:50 -0800)]
[Fix][CI] grpc_bazel_rbe_nonbazel job: align kokoro and bazel timeouts (#41231)
Target `//tools/bazelify_tests/test:cpp_distribtest_cmake_aarch64_cross_linux` seems to go over Bazel's `--test_timeout` limit from time to time.
The Bazel `--test_timeout` flag was initially introduced in #38123 and set 30 minutes below Kokoro's job `timeout_mins`. Since then, we've increased Kokoro's timeout several times without making corresponding changes to Bazel's `test_timeout`.
This PR updates Bazel's test timeout to align with Kokoro's job timeout, and adds a reminder to keep those in sync. In addition, I've broken `BAZEL_FLAGS` into multiple lines for better readability; the proto text format spec [allows it](https://protobuf.dev/reference/protobuf/textformat-spec/#string).
Gregory Cooke [Thu, 11 Dec 2025 18:25:03 +0000 (10:25 -0800)]
[Test] Don't need any versioning block for SPIFFE tests (#41218)
This should further fix the flakiness in the portability tests.
Specifically, we shouldn't have left this compiler directive in the code in the previous PR https://github.com/grpc/grpc/pull/41205
Tanvi Jagtap [Thu, 11 Dec 2025 06:46:36 +0000 (22:46 -0800)]
[PH2][Flake] Fix settings flake, add LOGs
1. Fix settings flake by allowing some more buffer for promise scheduling delays.
2. Increasing timeouts to more reasonable numbers. With the time buffer at 0.5, it fails once in 20000 runs; reducing it to 0.4 makes it pass all of 100000 runs, which is good enough.
3. Adding LOGs to help debug another flow-control-related bug.
These were missed with the initial implementation of call tracing in channelz, but luckily our fuzzers found them. Add the calls, and a regression test.
Gregory Cooke [Wed, 10 Dec 2025 05:23:02 +0000 (21:23 -0800)]
[Testing] Fix spiffe portability (#41205)
Fix a few issues when building with different OpenSSL versions:
OpenSSL 1.0.2 - some copied CRL-related test code made assumptions that are not valid for these tests.
OpenSSL 1.1.1 - the regex check is too sensitive; only do the regex check for BoringSSL.
OpenSSL 3 - we thought the Invalid UTF8-SAN behavior should cause handshake failures for OpenSSL 3 here and included different behavior, but that is still what is breaking. Let's revert that change.
The issue is with the
`//tools/bazelify_tests/test:runtests_cpp_linux_dbg_gcc_8_build_only`
target, which is a part of the portability suite
(`//tools/bazelify_tests/test:portability_tests_linux`). With gcc-8,
building the `buildtests_cxx` make target either times out or fails with
`collect2: fatal error: ld terminated with signal 9`.
I've investigated this as an OOM issue (a common cause of `collect2:
fatal error: ld terminated`), but increasing memory limits does not
help. I've updated RBE stack from `n1-standard-16` (60 GB RAM) to
`e2-standard-32` (128 GB RAM) with no effect. Increasing various job
timeouts (kokoro, bazel, target, etc) didn't help either. See PR #41028
for more details and other attempts at root-causing.
The most important part of portability tests is to verify that gRPC can
be built with all supported compilers. Since we are having a problem
with building the tests with gcc-8, we've decided to stop covering the
tests for that compiler.
Specifically, this PR changes `runtests_c*_linux_dbg_gcc_8_build_only`
bazel target to skip building test make targets (via
`--cmake_configure_extra_args=-DgRPC_BUILD_TESTS=OFF`), and only build
`grpc++` make target. See `build_cxx.sh`:
https://github.com/grpc/grpc/blob/cb2db8fc21b31ac322d463dff5b7eff9fbbab97d/tools/run_tests/helper_scripts/build_cxx.sh#L49-L55
Notes and observations:
- Only gcc-8 and only cpp version is affected:
- Portability tests for other gcc versions have no problems building
`buildtests_cxx` of their corresponding
`runtests_c*_linux_dbg_gcc_*_build_only`.
- The C version of gcc-8 portability test
(`runtests_c_linux_dbg_gcc_8_build_only`) has no issues building tests
([sample run with full target
log](https://btx.cloud.google.com/invocations/0b3d41e7-3cf2-4ff8-b6d5-2bc0d52179cd/targets/%2F%2Ftools%2Fbazelify_tests%2Ftest:runtests_c_linux_dbg_gcc_8_build_only;config=815e4ca9071c7e1d8ca72b9c87c1347399a51eb1246eb9c49dd54d9a24ef5cba/tests)).
- However, unfortunately, this change skips the test targets for
`runtests_c_linux_dbg_gcc_8_build_only` too.
- We already had the logic to skip tests for gcc-7, but for a different
reason: #37257
This is needed for gRFC A105 (https://github.com/grpc/proposal/pull/516). Specifically, see the "Interaction with xDS Circuit Breaking" section.
It's possible for an LB pick to be happening at the same time as the subchannel sees its underlying connection fail. In this case, the picker can return a subchannel, but when the channel tries to start a call on the subchannel, the call creation fails, because there is no underlying connection. In that case, the channel will queue the pick, on the assumption that the LB policy will soon notice that the subchannel has been disconnected and return a new picker, at which point the queued pick will be re-attempted with that new picker.
When the picker returns a complete pick, it can optionally return a `SubchannelCallTracker` object that allows it to see when the subchannel call starts and ends. In the current API, when the channel successfully creates a call on the subchannel, it will immediately call `Start()`, and then when the subchannel call later ends, it will call `Finish()`. However, when the race condition described above occurs, the `SubchannelCallTracker` object will be destroyed without `Start()` or `Finish()` ever having been called. This API allows us to handle call counter incrementing and decrementing for things like xDS circuit breaking: we check the counter in the picker to see that it's currently below the limit, we increment the counter in `Start()`, and decrement it in `Finish()`. If the subchannel call never starts, then the counter never gets incremented.
With the introduction of connection scaling functionality in the subchannel, this approach will no longer work, because the call may be queued inside the subchannel rather than being immediately started on a connection, and the channel can't tell if that is going to happen. In other words, there's no longer any benefit to the `Start()` method, because it will no longer actually indicate that the call is being started on a connection. As a result, I am removing that method from the API.
For xDS circuit breaking in the xds_cluster_impl LB policy, we are now incrementing the call counter in the picker, and the `SubchannelCallTracker` object will decrement it when either `Finish()` is called or when the object is destroyed, whichever comes first.
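A hedged sketch of that counting scheme (class and method names here are illustrative stand-ins, not the actual gRPC core types): the picker checks and increments the counter, and the call tracker decrements it exactly once, from either `Finish()` or its destructor, whichever comes first.
```cpp
// Illustrative sketch only; CircuitBreakerCallCounter and CallTrackerSketch
// are stand-ins for the real xds_cluster_impl types.
#include <atomic>
#include <memory>

class CircuitBreakerCallCounter {
 public:
  // Called from the picker: succeed only while under the circuit-breaking limit.
  bool TryIncrement(int limit) {
    int current = calls_.load(std::memory_order_relaxed);
    while (current < limit) {
      if (calls_.compare_exchange_weak(current, current + 1)) return true;
    }
    return false;  // over the limit: the picker fails the pick instead
  }
  void Decrement() { calls_.fetch_sub(1, std::memory_order_relaxed); }

 private:
  std::atomic<int> calls_{0};
};

class CallTrackerSketch {
 public:
  explicit CallTrackerSketch(std::shared_ptr<CircuitBreakerCallCounter> counter)
      : counter_(std::move(counter)) {}

  // If the subchannel call runs to completion, Finish() releases the slot.
  void Finish() { MaybeDecrement(); }

  // If the call never starts (the race described above), the tracker is just
  // destroyed, and the destructor releases the slot instead.
  ~CallTrackerSketch() { MaybeDecrement(); }

 private:
  void MaybeDecrement() {
    if (!decremented_.exchange(true)) counter_->Decrement();
  }

  std::shared_ptr<CircuitBreakerCallCounter> counter_;
  std::atomic<bool> decremented_{false};
};
```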
For grpclb, the `Start()` method was used in an ugly hack to handle ownership of the client stats object between the grpclb policy and the client load reporting filter. The LB policy passes a pointer to this object down to the filter via client initial metadata, which contains a raw pointer and does not hold a ref. To handle ownership, the LB policy returns a `SubchannelCallTracker` that holds a ref to the client stats object, but when `Start()` is called, it releases that ref, on the assumption that the client load reporting filter will subsequently take ownership. I've replaced this with a slightly cleaner approach whereby the call tracker always holds a ref to the client stats object, thus guaranteeing that the client stats object exists when the client load reporting filter sees it, and the client load reporting filter takes its own ref when it runs. (An even cleaner approach would be to instead pass the client stats object to the filter via a call attribute, similar to how we pass the xDS cluster name from the ConfigSelector to the LB policy tree, but it doesn't seem worth putting that much effort into grpclb at this point.)
Craig Tiller [Mon, 8 Dec 2025 17:41:15 +0000 (09:41 -0800)]
[chaotic-good] Deadline fixes (#41190)
* Increase test connection deadline to account for CI slowness
* Add an experiment to use the handshaker deadline instead of the hard-coded deadline (since the hard-coded deadline is likely a bug)
Mark D. Roth [Fri, 5 Dec 2025 20:24:49 +0000 (12:24 -0800)]
[pick_first] go CONNECTING when selected subchannel goes CONNECTING or TF (#41029)
Needed as part of gRFC A105 (https://github.com/grpc/proposal/pull/516).
Currently, when the selected subchannel leaves READY state, the only possible state it can move to is IDLE, and pick_first handles that by itself going IDLE. However, as part of A105, we are going to introduce the possibility of the subchannel going from READY to either CONNECTING or TRANSIENT_FAILURE, and in those two cases we want pick_first to go back into CONNECTING and start a new happy eyeballs pass. This PR introduces an experiment that adds that behavior.
While I was at it, I noticed an existing misfeature. There are two cases where pick_first will go IDLE, which is done by calling [`GoIdle()`](https://github.com/grpc/grpc/blob/24b25a0baa72a658cc37d1db28f77513a9670ea2/src/core/load_balancing/pick_first/pick_first.cc#L610):
1. The case mentioned above, where the selected subchannel goes from READY to IDLE (`GoIdle()` is called from [`SubchannelState::OnConnectivityStateChange()`](https://github.com/grpc/grpc/blob/24b25a0baa72a658cc37d1db28f77513a9670ea2/src/core/load_balancing/pick_first/pick_first.cc#L784)).
2. The case where pick_first already has a selected subchannel and receives a new address list, but none of the subchannels in the new list report READY. In this case, pick_first knows that the currently selected subchannel is for an address that is not present in the new address list, so it unrefs the selected subchannel and goes IDLE (`GoIdle()` is called from [`SubchannelData::OnConnectivityStateChange()`](https://github.com/grpc/grpc/blob/24b25a0baa72a658cc37d1db28f77513a9670ea2/src/core/load_balancing/pick_first/pick_first.cc#L859)).
The code in `GoIdle()` currently requests a re-resolution, which is the right behavior for case 1. However, it doesn't really make sense to do this for case 2, since we have just received a fresh resolver update in that case. Therefore, as part of this experiment, I am moving the code that triggers the re-resolution out of `GoIdle()` and directly into `SubchannelState::OnConnectivityStateChange()`, where it will occur only for case 1.
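A simplified sketch of the resulting state handling, assuming the experiment is enabled (function names are illustrative; this is not the actual pick_first code):
```cpp
// Simplified sketch; GoIdle/RequestReresolution/ReportConnecting/
// StartHappyEyeballsPass are illustrative stand-ins for the real pick_first
// internals.
#include <grpc/grpc.h>  // grpc_connectivity_state

class PickFirstSketch {
 public:
  void OnSelectedSubchannelStateChange(grpc_connectivity_state new_state) {
    switch (new_state) {
      case GRPC_CHANNEL_IDLE:
        // Case 1 above: request re-resolution here (moved out of GoIdle(), so
        // case 2 no longer triggers a redundant re-resolution) and go IDLE.
        RequestReresolution();
        GoIdle();
        break;
      case GRPC_CHANNEL_CONNECTING:
      case GRPC_CHANNEL_TRANSIENT_FAILURE:
        // New experimental behavior: report CONNECTING and start a fresh
        // happy-eyeballs pass instead of going IDLE.
        ReportConnecting();
        StartHappyEyeballsPass();
        break;
      default:
        break;
    }
  }

 private:
  void GoIdle() { /* report IDLE to the channel (sketch) */ }
  void RequestReresolution() { /* ask the resolver for fresh addresses (sketch) */ }
  void ReportConnecting() { /* report CONNECTING to the channel (sketch) */ }
  void StartHappyEyeballsPass() { /* begin connecting to the address list (sketch) */ }
};
```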
Tanvi Jagtap [Fri, 5 Dec 2025 18:04:10 +0000 (10:04 -0800)]
[PH2][Refactor]
Pausing and restarting of the ReadLoop now happens in a separate class.
We could generalize and re-use this mechanism elsewhere, but that is a task for later.
Akshit Patel [Fri, 5 Dec 2025 09:13:23 +0000 (01:13 -0800)]
[PH2][ChannelArg] Adding support for GRPC_ARG_HTTP2_INITIAL_SEQUENCE_NUMBER. This CL also modifies the error message returned when the last stream is closed and the transport cannot create any new streams.
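For reference, a brief C++ client-side usage sketch of this channel argument (the value 11 is only an illustration; `GRPC_ARG_HTTP2_INITIAL_SEQUENCE_NUMBER` and the `ChannelArguments` API are existing gRPC names):
```cpp
#include <memory>
#include <string>

#include <grpcpp/grpcpp.h>

// Create a channel whose client-initiated HTTP/2 streams start at a custom
// stream ID instead of the default. Client stream IDs must be odd per HTTP/2.
std::shared_ptr<grpc::Channel> MakeChannelWithInitialStreamId(
    const std::string& target) {
  grpc::ChannelArguments args;
  args.SetInt(GRPC_ARG_HTTP2_INITIAL_SEQUENCE_NUMBER, 11);  // illustrative value
  return grpc::CreateCustomChannel(target, grpc::InsecureChannelCredentials(),
                                   args);
}
```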
Tanvi Jagtap [Thu, 4 Dec 2025 15:26:54 +0000 (07:26 -0800)]
[PH2][Settings][Refactor]
1. Moved on_receive_settings callback logic into SettingsPromiseManager.
2. Stall reads until the first peer settings are processed.
3. Encapsulated security frame settings logic within SettingsPromiseManager.
Akshit Patel [Thu, 4 Dec 2025 08:40:59 +0000 (00:40 -0800)]
[PH2][Bug] Fix call to `BeginCloseStream` from `HandleError`.
`HandleError` is called from a transport promise when some stream/connection error is encountered. Hence, when stream trailing metadata is passed to the call stack, it MUST be passed with a cancelled status.
Tanvi Jagtap [Wed, 3 Dec 2025 15:21:53 +0000 (07:21 -0800)]
[PH2] Misc items
1. Move `SourceConstructed` to after the party is instantiated.
2. Update TODOs and comments.
3. Add debug info where Mark (@roth) had left a TODO.
4. Rename GetActiveStreamCount to GetActiveStreamCountLocked
Aananth V [Wed, 3 Dec 2025 05:39:00 +0000 (21:39 -0800)]
Chaotic Good: Verify Peer in Chaotic Good Handshake during Data Endpoint creation
Since Chaotic Good enables using a group of TCP connections as a composite channel we need to ensure that all TCP connections are established with the same peer. In this change, we store a Ref to the `grpc_auth_context` of the Connection that created the Control Endpoint and compare it to the `grpc_auth_context` of the Connection requesting each Data Endpoint using the [Injectable Peer Comparison API](https://github.com/grpc/grpc/pull/39610). If no peer comparison API is installed, the identity verification will not be performed.
The updated Chaotic Good handshake is as follows (changed steps are **bolded**):
First the control channel is established:
1. ALTS/TLS/LOAS/PSP: Each new TCP connection goes through the “normal” security handshakes for gRPC, checking certificates, establishing identity
2. A Chaotic Good Settings frame is sent from the client, with data_channel == 0
3. The server processes the received Settings frame, creates N pending data connections, and responds with a Settings frame with a randomly generated set of connection ids: 1 per requested data connection. **The created PendingDataConnections hold a reference to the Control Channel’s grpc_auth_context.**
4. The client processes the received Settings frame and creates one data connection per received connection_id.
For each data channel requested:
1. The TCP connection proceeds as usual (same as 1 above)
2. The Settings frame sent will relay the connection_id for this data channel, with data_channel == 1
3. The server responds with a Settings frame with data_channel == 1.
4. **Finally, server looks up the association for this connection_id and verifies the equivalence of the current connection’s grpc_auth_context and the stored grpc_auth_context of the control channel.**
- **If lookup is successful and peer is equivalent, we bind the connection with that chaotic good channel.**
- **Else, we abort the connection.**
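A hedged sketch of step 4 above (`PendingDataConnection` and `PeersAreEquivalent()` are illustrative stand-ins for the chaotic-good bookkeeping and for the injectable peer comparison API from https://github.com/grpc/grpc/pull/39610; the real names may differ):
```cpp
#include <map>
#include <memory>
#include <string>

struct grpc_auth_context;  // real type lives in gRPC core; opaque here

// Hypothetical hook standing in for the injectable peer comparison API.
bool PeersAreEquivalent(const grpc_auth_context* control_peer,
                        const grpc_auth_context* data_peer);

// Step 3 (server side): each pending data connection keeps a reference to the
// control channel's grpc_auth_context.
struct PendingDataConnection {
  std::shared_ptr<grpc_auth_context> control_auth_context;
};

// Step 4: when a data endpoint arrives with a connection_id, look up the
// pending connection and verify the peers match before binding it.
bool VerifyDataEndpoint(
    const std::map<std::string, PendingDataConnection>& pending,
    const std::string& connection_id,
    const grpc_auth_context* data_auth_context) {
  auto it = pending.find(connection_id);
  if (it == pending.end()) return false;  // unknown connection_id: abort
  return PeersAreEquivalent(it->second.control_auth_context.get(),
                            data_auth_context);
}
```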
Track allocations in tsi_zero_copy_grpc_protector towards ResourceQuota.
This change introduces a `set_allocator` method to the `tsi_zero_copy_grpc_protector` vtable and API. The ALTS zero-copy frame protector implementation is updated to use a provided allocator callback (`tsi_zero_copy_grpc_protector_allocator_cb`) for allocating protected and unprotected slices, falling back to `GRPC_SLICE_MALLOC` if no custom allocator is set.
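A rough sketch of the fallback behavior (the callback signature shown here is an assumption for illustration; `GRPC_SLICE_MALLOC` is the real gRPC macro, while the actual `tsi_zero_copy_grpc_protector_allocator_cb` type may differ):
```cpp
#include <grpc/slice.h>

#include <cstddef>

// Assumed shape of the allocator callback, for this sketch only.
typedef grpc_slice (*allocator_cb_sketch)(void* user_data, size_t length);

struct protector_sketch {
  allocator_cb_sketch allocator = nullptr;  // installed via set_allocator()
  void* allocator_user_data = nullptr;
};

// Allocate a slice for a protected/unprotected frame: prefer the configured
// allocator (which can attribute the memory to a ResourceQuota), otherwise
// fall back to the legacy GRPC_SLICE_MALLOC path.
grpc_slice allocate_frame_slice(protector_sketch* protector, size_t length) {
  if (protector->allocator != nullptr) {
    return protector->allocator(protector->allocator_user_data, length);
  }
  return GRPC_SLICE_MALLOC(length);
}
```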
Akshit Patel [Wed, 3 Dec 2025 02:54:40 +0000 (18:54 -0800)]
[PH2][E2E] Fix a race condition in stream_data_queue
`stream_id_` is currently accessed by both the Enqueue and Dequeue operations, resulting in the race. Technically, in the Enqueue flow `stream_id_` is only used for logs, which is redundant, so that usage is being removed.
Tanvi Jagtap [Tue, 2 Dec 2025 14:49:23 +0000 (06:49 -0800)]
[PH2][Settings][Refactor] Step 3.3
1. Removes unused includes of http2_settings_manager.h
2. Moves settings ACK handling into SettingsPromiseManager from Http2SettingsManager
3. Deletes `MaybeSendAck` related tests from http2_settings_test.cc
4. Moved tests as-is into settings_timeout_manager_test.cc from http2_transport_test.cc
Aananth V [Tue, 2 Dec 2025 13:32:29 +0000 (05:32 -0800)]
Include GlobalCollectionScope in StatsPluginGroup::GetCollectionScope.
Also adds a requirement that the Collection Scope returned by StatsPlugin::GetCollectionScope is a Root Scope (i.e. has no parents). This is to avoid Diamond structures in the DAG (doesn't fix the problem entirely but is a good failsafe for now).
Tanvi Jagtap [Tue, 2 Dec 2025 08:37:13 +0000 (00:37 -0800)]
[PH2][Settings][Refactor] Step 3.1
This CL refactors HTTP/2 settings ACK handling by moving the did_previous_settings_promise_resolve_ flag from Http2SettingsManager to Http2SettingsPromiseManager. did_previous_settings_promise_resolve_ is now fully managed by Http2SettingsPromiseManager so other classes don't need to check it or set it.
Step 2.2
Move the Http2SettingsManager object into SettingsPromiseManager; Http2ClientTransport will now use Http2SettingsManager via SettingsPromiseManager.
Tanvi Jagtap [Tue, 2 Dec 2025 05:28:07 +0000 (21:28 -0800)]
[PH2][Bug][Stream]
1. Fixes a bug by preventing DATA frame processing on streams that have not yet received initial metadata.
2. Minor refactoring of existing code.
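An illustrative guard for item 1 (the names here are stand-ins, not the actual PH2 code):
```cpp
#include <cstdint>

// Sketch: DATA frames are only processed once the stream has received its
// initial metadata; otherwise the frame is rejected.
struct StreamStateSketch {
  uint32_t stream_id = 0;
  bool received_initial_metadata = false;
};

enum class DataFrameResult { kProcessed, kRejected };

DataFrameResult ProcessDataFrame(StreamStateSketch& stream) {
  if (!stream.received_initial_metadata) {
    // Before this fix, DATA could incorrectly be processed at this point.
    return DataFrameResult::kRejected;
  }
  // ... normal DATA frame handling ...
  return DataFrameResult::kProcessed;
}
```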
Tanvi Jagtap [Sat, 29 Nov 2025 19:11:02 +0000 (11:11 -0800)]
[PH2][Settings][Refactor] Move MaybeGetSettingsAndSettingsAckFrames
Make MaybeGetSettingsAndSettingsAckFrames a data member of class SettingsPromiseManager.
Tanvi Jagtap [Fri, 28 Nov 2025 11:23:28 +0000 (03:23 -0800)]
[PH2][Settings][Refactor] Step 4 : Rename
Step 1 : https://github.com/grpc/grpc/pull/41103
Step 2, 3 : WIP
Step 4 : (This PR)
Rename variables and functions to eliminate the common confusion between SENT and RECEIVED settings. The current structure and naming make them hard to differentiate; we have really wasted a LOT of time here.