Sergii Tkachenko [Sat, 13 Dec 2025 08:50:58 +0000 (00:50 -0800)]
[Fix][CI] grpc_bazel_rbe_nonbazel job: align kokoro and bazel timeouts (#41231)
Target `//tools/bazelify_tests/test:cpp_distribtest_cmake_aarch64_cross_linux` seems to go over Bazel's `--test_timeout` limit from time to time.
Bazel `--test_timeout` flag was initially introduced in #38123 and set to 30 minutes below Kokoro's job `timeout_mins`. Since then, we've increased Kokoro's timeout several times without making corresponding changes to bazel's `test_timeout`.
This PR updates Bazel's test timeout to align with Kokoro's job timeout, and adds a reminder to keep the two in sync. In addition, I've broken `BAZEL_FLAGS` across multiple lines for better readability; the proto text format spec [allows it](https://protobuf.dev/reference/protobuf/textformat-spec/#string).
Gregory Cooke [Thu, 11 Dec 2025 18:25:03 +0000 (10:25 -0800)]
[Test] Don't need any versioning block for SPIFFE tests (#41218)
This should further fix the flakiness in the portability tests.
Specifically, we shouldn't have left this compiler directive in the code in the previous PR https://github.com/grpc/grpc/pull/41205
Tanvi Jagtap [Thu, 11 Dec 2025 06:46:36 +0000 (22:46 -0800)]
[PH2][Flake] Fix settings flake, add LOGs
1. Fix settings flake by allowing some more buffer for promise scheduling delays.
2. Increasing timeouts to some reasonable numbers. If I keep the time buffer at 0.5, it fails once in 20000 runs. Reducing it to 0.4 makes it pass all 100000 runs, which is good enough.
3. Adding LOGs to help to debug another flow control related bug
These were missed with the initial implementation of call tracing in channelz, but luckily our fuzzers found them. Add the calls, and a regression test.
Gregory Cooke [Wed, 10 Dec 2025 05:23:02 +0000 (21:23 -0800)]
[Testing] Fix spiffe portability (#41205)
Fix a few issues when building with different OpenSSL versions
OpenSSL1.0.2 - we had copied some CRL-related test code whose assumptions are not valid for these tests.
OpenSSL1.1.1 - The regex is too sensitive, so only do the regex check for BoringSSL
OpenSSL3 - We thought the Invalid UTF8-SAN behavior should cause handshake failures for OpenSSL3 here and included different behavior, but that is still what is breaking. Let's revert that change.
The issue is with the
`//tools/bazelify_tests/test:runtests_cpp_linux_dbg_gcc_8_build_only`
target, which is a part of the portability suite
(`//tools/bazelify_tests/test:portability_tests_linux`). With gcc-8,
building `buildtests_cxx` make target either times out, or fails with
`collect2: fatal error: ld terminated with signal 9`.
I've investigated this as an OOM issue (a common cause of `collect2:
fatal error: ld terminated`), but increasing memory limits does not
help. I've updated RBE stack from `n1-standard-16` (60 GB RAM) to
`e2-standard-32` (128 GB RAM) with no effect. Increasing various job
timeouts (kokoro, bazel, target, etc) didn't help either. See PR #41028
for more details and other attempts at root-causing.
The most important part of portability tests is to verify that gRPC can
be built with all supported compilers. Since we are having a problem
with building the tests with gcc-8, we've decided to stop covering the
tests for that compiler.
Specifically, this PR changes `runtests_c*_linux_dbg_gcc_8_build_only`
bazel target to skip building test make targets (via
`--cmake_configure_extra_args=-DgRPC_BUILD_TESTS=OFF`), and only build
`grpc++` make target. See `build_cxx.sh`:
https://github.com/grpc/grpc/blob/cb2db8fc21b31ac322d463dff5b7eff9fbbab97d/tools/run_tests/helper_scripts/build_cxx.sh#L49-L55
Notes and observations:
- Only gcc-8 and only cpp version is affected:
- Portability tests for other gcc versions have no problems building
`buildtests_cxx` of their corresponding
`runtests_c*_linux_dbg_gcc_*_build_only`.
- The C version of gcc-8 portability test
(`runtests_c_linux_dbg_gcc_8_build_only`) has no issues building tests
([sample run with full target
log](https://btx.cloud.google.com/invocations/0b3d41e7-3cf2-4ff8-b6d5-2bc0d52179cd/targets/%2F%2Ftools%2Fbazelify_tests%2Ftest:runtests_c_linux_dbg_gcc_8_build_only;config=815e4ca9071c7e1d8ca72b9c87c1347399a51eb1246eb9c49dd54d9a24ef5cba/tests)).
- However, unfortunately, this change skips the test targets for
`runtests_c_linux_dbg_gcc_8_build_only` too.
- We already had the logic to skip tests for gcc-7, but for a different
reason: #37257
This is needed for gRFC A105 (https://github.com/grpc/proposal/pull/516). Specifically, see the "Interaction with xDS Circuit Breaking" section.
It's possible for an LB pick to be happening at the same time as the subchannel sees its underlying connection fail. In this case, the picker can return a subchannel, but when the channel tries to start a call on the subchannel, the call creation fails, because there is no underlying connection. In that case, the channel will queue the pick, on the assumption that the LB policy will soon notice that the subchannel has been disconnected and return a new picker, at which point the queued pick will be re-attempted with that new picker.
When the picker returns a complete pick, it can optionally return a `SubchannelCallTracker` object that allows it to see when the subchannel call starts and ends. In the current API, when the channel successfully creates a call on the subchannel, it will immediately call `Start()`, and then when the subchannel call later ends, it will call `Finish()`. However, when the race condition described above occurs, the `SubchannelCallTracker` object will be destroyed without `Start()` or `Finish()` ever having been called. This API allows us to handle call counter incrementing and decrementing for things like xDS circuit breaking: we check the counter in the picker to see that it's currently below the limit, we increment the counter in `Start()`, and decrement it in `Finish()`. If the subchannel call never starts, then the counter never gets incremented.
With the introduction of connection scaling functionality in the subchannel, this approach will no longer work, because the call may be queued inside of the subchannel rather than being immediately started on a connection, and the channel can't tell if that is going to happen. In other words, there's no longer any benefit to the `Start()` method, because it will no longer actually indicate that the call is actually being started on a connection. As a result, I am removing that method from the API.
For xDS circuit breaking in the xds_cluster_impl LB policy, we are now incrementing the call counter in the picker, and the `SubchannelCallTracker` object will decrement it when either `Finish()` is called or when the object is destroyed, whichever comes first.
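The "decrement on `Finish()` or destruction, whichever comes first" behavior can be sketched roughly as below. This is a minimal illustration only: `CallCounter`, `CountingCallTracker`, and `TryPick` are hypothetical names and do not reflect the real `SubchannelCallTracker` interface or the actual xds_cluster_impl code.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical stand-in for the shared call counter.
struct CallCounter {
  std::atomic<uint32_t> in_flight{0};
};

// Hypothetical tracker (not the real SubchannelCallTracker interface).
// The picker has already incremented the counter before creating it.
class CountingCallTracker {
 public:
  explicit CountingCallTracker(std::shared_ptr<CallCounter> counter)
      : counter_(std::move(counter)) {}

  // Covers the case where the call never starts and Finish() is never called.
  ~CountingCallTracker() { Release(); }

  // Called when the subchannel call ends.
  void Finish() { Release(); }

 private:
  // Decrements at most once, no matter how many times it is invoked.
  void Release() {
    if (counter_ != nullptr) {
      counter_->in_flight.fetch_sub(1);
      counter_.reset();
    }
  }

  std::shared_ptr<CallCounter> counter_;
};

// Hypothetical picker step: reserve a slot up front, or return nullptr
// (meaning "reject the pick") when the limit has already been reached.
std::unique_ptr<CountingCallTracker> TryPick(
    const std::shared_ptr<CallCounter>& counter, uint32_t max_requests) {
  if (counter->in_flight.fetch_add(1) >= max_requests) {
    counter->in_flight.fetch_sub(1);  // Undo the reservation.
    return nullptr;
  }
  return std::make_unique<CountingCallTracker>(counter);
}
```

Because the decrement is tied to the tracker's lifetime, a pick that gets queued and retried with a new picker releases its slot when the old tracker is destroyed.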
For grpclb, the `Start()` method was used in an ugly hack to handle ownership of the client stats object between the grpclb policy and the client load reporting filter. The LB policy passes a pointer to this object down to the filter via client initial metadata, which contains a raw pointer and does not hold a ref. To handle ownership, the LB policy returns a `SubchannelCallTracker` that holds a ref to the client stats object, but when `Start()` is called, it releases that ref, on the assumption that the client load reporting filter will subsequently take ownership. I've replaced this with a slightly cleaner approach whereby the call tracker always holds a ref to the client stats object, thus guaranteeing that the client stats object exists when the client load reporting filter sees it, and the client load reporting filter takes its own ref when it runs. (An even cleaner approach would be to instead pass the client stats object to the filter via a call attribute, similar to how we pass the xDS cluster name from the ConfigSelector to the LB policy tree, but it doesn't seem worth putting that much effort into grpclb at this point.)
Craig Tiller [Mon, 8 Dec 2025 17:41:15 +0000 (09:41 -0800)]
[chaotic-good] Deadline fixes (#41190)
* Increase test connection deadline to account for CI slowness
* Add experiment to use handshaker deadline instead of hard coded deadline (since this is likely a bug)
Mark D. Roth [Fri, 5 Dec 2025 20:24:49 +0000 (12:24 -0800)]
[pick_first] go CONNECTING when selected subchannel goes CONNECTING or TF (#41029)
Needed as part of gRFC A105 (https://github.com/grpc/proposal/pull/516).
Currently, when the selected subchannel leaves READY state, the only possible state it can move to is IDLE, and pick_first handles that by itself going IDLE. However, as part of A105, we are going to introduce the possibility of the subchannel going from READY to either CONNECTING or TRANSIENT_FAILURE, and in those two cases we want pick_first to go back into CONNECTING and start a new happy eyeballs pass. This PR introduces an experiment that adds that behavior.
While I was at it, I noticed an existing misfeature. There are two cases where pick_first will go IDLE, which is done by calling [`GoIdle()`](https://github.com/grpc/grpc/blob/24b25a0baa72a658cc37d1db28f77513a9670ea2/src/core/load_balancing/pick_first/pick_first.cc#L610):
1. The case mentioned above, where the selected subchannel goes from READY to IDLE (`GoIdle()` is called from [`SubchannelState::OnConnectivityStateChange()`](https://github.com/grpc/grpc/blob/24b25a0baa72a658cc37d1db28f77513a9670ea2/src/core/load_balancing/pick_first/pick_first.cc#L784)).
2. The case where pick_first already has a selected subchannel and receives a new address list, but none of the subchannels in the new list report READY. In this case, pick_first knows that the currently selected subchannel is for an address that is not present in the new address list, so it unrefs the selected subchannel and goes IDLE (`GoIdle()` is called from [`SubchannelData::OnConnectivityStateChange()`](https://github.com/grpc/grpc/blob/24b25a0baa72a658cc37d1db28f77513a9670ea2/src/core/load_balancing/pick_first/pick_first.cc#L859)).
The code in `GoIdle()` currently requests a re-resolution, which is the right behavior for case 1. However, it doesn't really make sense to do this for case 2, since we have just received a fresh resolver update in that case. Therefore, as part of this experiment, I am moving the code that triggers the re-resolution out of `GoIdle()` and directly into `SubchannelState::OnConnectivityStateChange()`, where it will occur only for case 1.
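The shape of that change, reduced to a toy model, might look like the following. `PickFirstSketch` and its members are hypothetical and only mirror the two cases described above; the real pick_first code is considerably more involved, and the ordering of the two calls in case 1 is an assumption.

```cpp
#include <cassert>

// Hypothetical, heavily simplified model of the two paths into GoIdle().
class PickFirstSketch {
 public:
  // Case 1: the selected subchannel went from READY to IDLE.
  // The re-resolution request now lives here instead of inside GoIdle().
  void OnSelectedSubchannelIdle() {
    RequestReresolution();
    GoIdle();
  }

  // Case 2: a new address list arrived and none of its subchannels is READY.
  // A fresh resolver update was just received, so no re-resolution is needed.
  void OnNewAddressListWithoutReady() { GoIdle(); }

  int reresolution_requests() const { return reresolution_requests_; }
  bool idle() const { return idle_; }

 private:
  void RequestReresolution() { ++reresolution_requests_; }
  void GoIdle() { idle_ = true; }  // No longer triggers re-resolution.

  int reresolution_requests_ = 0;
  bool idle_ = false;
};
```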
Tanvi Jagtap [Fri, 5 Dec 2025 18:04:10 +0000 (10:04 -0800)]
[PH2][Refactor]
The Pausing and Restarting of the ReadLoop happens in a separate class.
We could generalize and re-use this mechanism elsewhere, but that is a task for later.
Akshit Patel [Fri, 5 Dec 2025 09:13:23 +0000 (01:13 -0800)]
[PH2][ChannelArg] Adding support for GRPC_ARG_HTTP2_INITIAL_SEQUENCE_NUMBER. This CL also modifies the error message returned when the last stream is closed and the transport cannot create any new streams.
Tanvi Jagtap [Thu, 4 Dec 2025 15:26:54 +0000 (07:26 -0800)]
[PH2][Settings][Refactor]
1. Moved on_receive_settings callback logic into SettingsPromiseManager.
2. Stall reads until the first peer settings are processed.
3. Encapsulated security frame settings logic within SettingsPromiseManager.
Akshit Patel [Thu, 4 Dec 2025 08:40:59 +0000 (00:40 -0800)]
[PH2][Bug] Fix call to `BeginCloseStream` from `HandleError`.
`HandleError` is called from a transport promise when a stream or connection error is encountered. Hence, when a stream's trailing metadata is passed to the call stack, it MUST be passed with a cancelled status.
Tanvi Jagtap [Wed, 3 Dec 2025 15:21:53 +0000 (07:21 -0800)]
[PH2] Misc items
1. Move `SourceConstructed` to after the party is instantiated.
2. Update TODOs and comments.
3. Add debug info where Mark (@roth) had left a TODO.
4. Rename GetActiveStreamCount to GetActiveStreamCountLocked
Aananth V [Wed, 3 Dec 2025 05:39:00 +0000 (21:39 -0800)]
Chaotic Good: Verify Peer in Chaotic Good Handshake during Data Endpoint creation
Since Chaotic Good enables using a group of TCP connections as a composite channel we need to ensure that all TCP connections are established with the same peer. In this change, we store a Ref to the `grpc_auth_context` of the Connection that created the Control Endpoint and compare it to the `grpc_auth_context` of the Connection requesting each Data Endpoint using the [Injectable Peer Comparison API](https://github.com/grpc/grpc/pull/39610). If no peer comparison API is installed, the identity verification will not be performed.
The updated Chaotic Good handshake is as follows (changed steps are in **bold**):
First the control channel is established:
1. ALTS/TLS/LOAS/PSP: Each new TCP connection goes through the “normal” security handshakes for gRPC, checking certificates, establishing identity
2. A Chaotic Good Settings frame is sent from the client, with data_channel == 0
3. The server processes the received Settings frame, creates N pending data connections, and responds with a Settings frame with a randomly generated set of connection ids: 1 per requested data connection. **The created PendingDataConnections hold a reference to the Control Channel’s grpc_auth_context.**
4. The client processes the received Settings frame and creates one data connection per received connection_id.
For each data channel requested:
1. The TCP connection proceeds as usual (same as 1 above)
2. The Settings frame sent will relay the connection_id for this data channel, with data_channel == 1
3. The server responds with a Settings frame with data_channel == 1.
4. **Finally, the server looks up the association for this connection_id and verifies the equivalence of the current connection’s grpc_auth_context and the stored grpc_auth_context of the control channel.**
- **If lookup is successful and peer is equivalent, we bind the connection with that chaotic good channel.**
- **Else, we abort the connection.**
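The server-side bookkeeping for the bolded steps can be sketched as follows. This is an illustration under assumptions, not the actual implementation: `AuthContext` stands in for `grpc_auth_context`, `PeerComparator` stands in for the injectable peer comparison API (whose real signature may differ), and the string connection id type and the choice to consume the pending entry on every lookup are also assumptions.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for grpc_auth_context.
struct AuthContext {
  std::string protocol;
  std::string peer_identity;
};

// Hypothetical comparator type; an empty comparator means "none installed".
using PeerComparator =
    std::function<bool(const AuthContext&, const AuthContext&)>;

class PendingDataConnections {
 public:
  explicit PendingDataConnections(PeerComparator comparator)
      : comparator_(std::move(comparator)) {}

  // Control channel, step 3: remember the control channel's auth context
  // for each connection id handed out to the client.
  void Add(const std::string& connection_id, AuthContext control_ctx) {
    pending_[connection_id] = std::move(control_ctx);
  }

  // Data channel, step 4: returns true to bind the connection to the
  // channel, false to abort it.
  bool Bind(const std::string& connection_id, const AuthContext& data_ctx) {
    auto it = pending_.find(connection_id);
    if (it == pending_.end()) return false;  // Unknown id: abort.
    // With no comparator installed, identity verification is skipped.
    const bool peer_ok = !comparator_ || comparator_(it->second, data_ctx);
    pending_.erase(it);  // Assumption: each id is usable only once.
    return peer_ok;
  }

 private:
  PeerComparator comparator_;
  std::unordered_map<std::string, AuthContext> pending_;
};
```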
Track allocations in tsi_zero_copy_grpc_protector towards ResourceQuota.
This change introduces a `set_allocator` method to the `tsi_zero_copy_grpc_protector` vtable and API. The ALTS zero-copy frame protector implementation is updated to use a provided allocator callback (`tsi_zero_copy_grpc_protector_allocator_cb`) for allocating protected and unprotected slices, falling back to `GRPC_SLICE_MALLOC` if no custom allocator is set.
Akshit Patel [Wed, 3 Dec 2025 02:54:40 +0000 (18:54 -0800)]
[PH2][E2E] Fix a race condition in stream_data_queue
The `stream_id_` is currently accessed both in Enqueue and Dequeue operations resulting in the race. Technically, in the Enqueue flow `stream_id_` is only used for logs which is redundant and hence being removed.
Tanvi Jagtap [Tue, 2 Dec 2025 14:49:23 +0000 (06:49 -0800)]
[PH2][Settings][Refactor] Step 3.3
1. Removes unused includes of http2_settings_manager.h
2. Moves settings ACK handling into SettingsPromiseManager from Http2SettingsManager
3. Deletes `MaybeSendAck` related tests from http2_settings_test.cc
4. Moved tests as-is into settings_timeout_manager_test.cc from http2_transport_test.cc
Aananth V [Tue, 2 Dec 2025 13:32:29 +0000 (05:32 -0800)]
Include GlobalCollectionScope in StatsPluginGroup::GetCollectionScope.
Also adds a requirement that the Collection Scope returned by StatsPlugin::GetCollectionScope is a Root Scope (i.e. has no parents). This is to avoid Diamond structures in the DAG (doesn't fix the problem entirely but is a good failsafe for now).
Tanvi Jagtap [Tue, 2 Dec 2025 08:37:13 +0000 (00:37 -0800)]
[PH2][Settings][Refactor] Step 3.1
This CL refactors HTTP/2 settings ACK handling by moving the did_previous_settings_promise_resolve_ flag from Http2SettingsManager to Http2SettingsPromiseManager. did_previous_settings_promise_resolve_ is now fully managed by Http2SettingsPromiseManager so other classes don't need to check it or set it.
Step 2.2
Move the Http2SettingsManager object into SettingsPromiseManager; Http2ClientTransport will then use Http2SettingsManager via SettingsPromiseManager.
Tanvi Jagtap [Tue, 2 Dec 2025 05:28:07 +0000 (21:28 -0800)]
[PH2][Bug][Stream]
1. Fixes a bug by preventing DATA frame processing on streams that have not yet received initial metadata.
2. Minor refactoring of existing code.
Tanvi Jagtap [Sat, 29 Nov 2025 19:11:02 +0000 (11:11 -0800)]
[PH2][Settings][Refactor] Move MaybeGetSettingsAndSettingsAckFrames
Make MaybeGetSettingsAndSettingsAckFrames a data member of class SettingsPromiseManager.
Tanvi Jagtap [Fri, 28 Nov 2025 11:23:28 +0000 (03:23 -0800)]
[PH2][Settings][Refactor] Step 4 : Rename
Step 1 : https://github.com/grpc/grpc/pull/41103
Step 2, 3 : WIP
Step 4 : (This PR)
Rename variables and functions to eliminate the common confusion between SENT and RECEIVED settings. The current structure and naming make it hard to differentiate them. We really have wasted a LOT of time here.
[Python] Disable layering check in grpc_tools:protoc_lib (#41142)
Python Bazel tests have been failing since yesterday after layering check was enabled in grpcio_tools build in commit: https://github.com/grpc/grpc/commit/756389e9e75ba93d7316ef9eae2ca83126ad9f94
Temporarily disabling it after discussing IRL with @rishesh007
```
  void TypicalTransportFunction(){
    ... other non-settings work ...
    object1.DetailedWork1();
    object2.DetailedWork2();
    object3.DetailedWork3();
    ... other non-settings work ...
  }
};
```
New Design
```
class Http2ClientTransport{
  SettingsPromiseManager settings_manager_;
  void TypicalTransportFunction(){
    ... other non-settings work ...
    settings_manager_.SomeWork();
    ... other non-settings work ...
  }
};
class SettingsPromiseManager{
  Http2SettingsManager settings_;
};
```
Refactor Step 1
1. Merge class `SettingsTimeoutManager` and `PendingIncomingSettings` into a new class named `SettingsPromiseManager`
2. Replace usage of `PendingIncomingSettings` and `SettingsTimeoutManager` with usage of `SettingsPromiseManager`
3. Replace `pending_incoming_settings_` with `transport_settings_`
Future Steps
1. Step 2 : Move the `Http2SettingsManager` object into `SettingsPromiseManager`; `Http2ClientTransport` will then use `Http2SettingsManager` via `SettingsPromiseManager`
2. Step 3 : Earlier, the `Http2ClientTransport` class handled the interactions between `Http2SettingsManager`, `SettingsTimeoutManager`, and `PendingIncomingSettings` in the transport. Move this into our new `SettingsPromiseManager` class. This will make the transport lean. This PR will need careful review of the business logic. This will also make multiple permutations of settings easily testable and debuggable.
3. Step 4 : Rename variables and functions to eliminate the common confusion between SENT and RECEIVED settings. The current structure and naming make it hard to differentiate them. We really have wasted a LOT of time here.
4. Step 5 : Write unit tests for `SettingsPromiseManager` class, modelling scenarios similar to how the transport will be using the settings. Also add missing tests to `Http2SettingsManager` if needed.
Tanvi Jagtap [Thu, 27 Nov 2025 04:26:24 +0000 (20:26 -0800)]
[PH2][E2E] E2E . Multiple Changes
1. Enable logging for 2 flaking HPack tests
2. Writing a new function which will enable logging for PH2 for flaking tests
3. Splitting the CANCEL and DEADLINE test suites so that these can be switched on and off separately.
Akshit Patel [Wed, 26 Nov 2025 02:42:51 +0000 (18:42 -0800)]
[PH2][E2E] Fix channelZ AddData race with transport deletion.
This CL moves `SourceDestructing` from the destructor to `Orphan`. It is possible that `AddData` call tries to take a ref on the transport while the transport is being destructed (before `SourceDestructing` is invoked). Calling `SourceDestructing` from `Orphan` ensures that `AddData` is not called after dropping the external transport ref.
Tanvi Jagtap [Wed, 26 Nov 2025 02:02:43 +0000 (18:02 -0800)]
[PH2][Settings] Multiple changes
1. Complete the ProcessHttp2SettingsFrame function
2. Applying the incoming settings in the MultiplexerLoop and sending an ACK for incoming settings
3. Managing initial window size settings for acked settings (this was missed in previous PR).
4. Decoupling ApplyIncomingSettings from OnSettingsReceived
Tanvi Jagtap [Tue, 25 Nov 2025 15:37:06 +0000 (07:37 -0800)]
[PH2][Bug] Move transport loop spawning out of the constructor
Spawning transport loops from the Http2ClientTransport constructor creates a race condition. An initialization error can trigger a shutdown, causing the transport to be destroyed from within its own constructor.
This CL moves the loop-spawning logic to a new public method, SpawnTransportLoops(). The Chttp2Connector now calls this method after the transport is fully constructed. This ensures a clean separation between object construction and the start of asynchronous operations, preventing premature closure and potential bugs.
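The resulting two-phase pattern looks roughly like this. `TransportSketch` and `CreateAndStartTransport` are hypothetical names used only for illustration; the real transport spawns promise-based loops rather than flipping a flag.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for the transport.
class TransportSketch {
 public:
  // Phase 1: the constructor only initializes state. It starts no
  // asynchronous work, so nothing can tear the object down mid-construction.
  TransportSketch() = default;

  // Phase 2: invoked by the owner once construction has fully completed.
  void SpawnTransportLoops() { loops_running_ = true; }

  bool loops_running() const { return loops_running_; }

 private:
  bool loops_running_ = false;
};

// Hypothetical connector step: construct first, then start the loops.
std::unique_ptr<TransportSketch> CreateAndStartTransport() {
  auto transport = std::make_unique<TransportSketch>();
  transport->SpawnTransportLoops();
  return transport;
}
```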
Tanvi Jagtap [Fri, 21 Nov 2025 09:34:31 +0000 (01:34 -0800)]
[PH2][Settings] MaybeSpawnWaitForSettingsTimeout
This PR takes care of
1. Sending a SETTINGS frame to the peer.
2. Starting a timer to wait for the ACK
3. Processing the SETTINGS ACK received from the peer.
This does NOT include sending a SETTINGS ACK or processing a received SETTINGS frame.
Changes :
1. Renamed MarkPeerSettingsResolved to MarkPeerSettingsPromiseResolved, and SpawnWaitForSettingsTimeout to MaybeSpawnWaitForSettingsTimeout.
2. Moved all functions to the cc file
3. Added an if check to MaybeSpawnWaitForSettingsTimeout to prevent incorrect spawning when no settings have been sent.
4. Some plumbing.
PiperOrigin-RevId: 835120786
Akshit Patel [Thu, 20 Nov 2025 08:42:52 +0000 (00:42 -0800)]
[PH2] Handle unknown stream IDs. This CL addresses the following:
1. Receiving a HEADERS/CONTINUATION/DATA/WINDOW_UPDATE frame with an unexpected stream ID is now treated as a connection error, per the RFC.
Tanvi Jagtap [Wed, 19 Nov 2025 11:30:58 +0000 (03:30 -0800)]
[PH2][Common][Refactor] IncomingMetadataTracker
1. Moving out incoming header state and management into class IncomingMetadataTracker
2. Fixing bug in CloseStream. The state should not be altered in this case.
3. Two parameters to function ParseAndDiscardHeaders were actually data members, so I removed them. ParseAndDiscardHeaders will access the data members directly.
4. Fixing clang issues.
5. Moving helpers from header_assembler_test into the common test class.
Aananth V [Wed, 19 Nov 2025 08:09:18 +0000 (00:09 -0800)]
Set security protocol type in AuthContext.
The Injectable Peer Comparison API added in https://github.com/grpc/grpc/pull/39610 uses the `protocol_` field of the `grpc_auth_context` to 1) look up the registered comparators, and 2) perform an initial comparison to ensure that the two compared auth contexts have the same protocol. However, this field is currently unset for all types of credentials.
This change populates the `protocol` field in `grpc_auth_context` with the name of the security connector type after the peer check in the security handshaker. E2E Tests are updated to verify that the `AuthContext` contains the correct protocol type.