Tanvi Jagtap [Wed, 3 Dec 2025 15:21:53 +0000 (07:21 -0800)]
[PH2] Misc items
1. Move `SourceConstructed` to after the party is instantiated.
2. Update TODOs and comments.
3. Add debug info where Mark (@roth) had left a TODO.
4. Rename GetActiveStreamCount to GetActiveStreamCountLocked
Aananth V [Wed, 3 Dec 2025 05:39:00 +0000 (21:39 -0800)]
Chaotic Good: Verify Peer in Chaotic Good Handshake during Data Endpoint creation
Since Chaotic Good enables using a group of TCP connections as a composite channel we need to ensure that all TCP connections are established with the same peer. In this change, we store a Ref to the `grpc_auth_context` of the Connection that created the Control Endpoint and compare it to the `grpc_auth_context` of the Connection requesting each Data Endpoint using the [Injectable Peer Comparison API](https://github.com/grpc/grpc/pull/39610). If no peer comparison API is installed, the identity verification will not be performed.
The updated Chaotic Good handshake is as follows (changed steps are in **bold**):
First the control channel is established:
1. ALTS/TLS/LOAS/PSP: Each new TCP connection goes through the “normal” security handshakes for gRPC, checking certificates, establishing identity
2. A Chaotic Good Settings frame is sent from the client, with data_channel == 0
3. The server processes the received Settings frame, creates N pending data connections, and responds with a Settings frame with a randomly generated set of connection ids: 1 per requested data connection. **The created PendingDataConnections hold a reference to the Control Channel’s grpc_auth_context.**
4. The client processes the received Settings frame and creates one data connection per received connection_id.
For each data channel requested:
1. The TCP connection proceeds as usual (same as 1 above)
2. The Settings frame sent will relay the connection_id for this data channel, with data_channel == 1
3. The server responds with a Settings frame with data_channel == 1.
4. **Finally, the server looks up the association for this connection_id and verifies the equivalence of the current connection’s grpc_auth_context and the stored grpc_auth_context of the control channel.**
- **If lookup is successful and peer is equivalent, we bind the connection with that chaotic good channel.**
- **Else, we abort the connection.**
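The lookup-and-verify step for data channels can be sketched roughly as follows. This is a minimal illustration, not the Chaotic Good implementation: `AuthContext`, `PeerComparator`, `PendingDataConnections` and `BindResult` are hypothetical stand-ins for `grpc_auth_context`, the injectable comparison API and the server's bookkeeping, and consuming a connection id on its first use is an assumption.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for grpc_auth_context, reduced to two fields.
struct AuthContext {
  std::string protocol;
  std::string peer_identity;
};

// Hypothetical stand-in for an installed peer comparator.
using PeerComparator =
    std::function<bool(const AuthContext&, const AuthContext&)>;

enum class BindResult { kBound, kUnknownConnectionId, kPeerMismatch };

class PendingDataConnections {
 public:
  // Pass an empty comparator to model "no peer comparison API installed".
  explicit PendingDataConnections(PeerComparator comparator)
      : comparator_(std::move(comparator)) {}

  // Control channel, step 3: record which peer requested this connection id.
  void Add(const std::string& connection_id,
           std::shared_ptr<const AuthContext> control_peer) {
    assert(control_peer != nullptr);
    pending_[connection_id] = std::move(control_peer);
  }

  // Data channel, step 4: look up the id, then verify the peer.
  BindResult Bind(const std::string& connection_id,
                  const AuthContext& data_peer) {
    auto it = pending_.find(connection_id);
    if (it == pending_.end()) return BindResult::kUnknownConnectionId;
    std::shared_ptr<const AuthContext> control_peer = std::move(it->second);
    pending_.erase(it);  // Assumption: an id is consumed by its first use.
    if (comparator_ && !comparator_(*control_peer, data_peer)) {
      return BindResult::kPeerMismatch;  // The caller aborts the connection.
    }
    return BindResult::kBound;
  }

 private:
  PeerComparator comparator_;
  std::map<std::string, std::shared_ptr<const AuthContext>> pending_;
};
```

With an empty comparator, `Bind` succeeds for any known connection id, which mirrors the "verification will not be performed" fallback described above.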
Track allocations in tsi_zero_copy_grpc_protector towards ResourceQuota.
This change introduces a `set_allocator` method to the `tsi_zero_copy_grpc_protector` vtable and API. The ALTS zero-copy frame protector implementation is updated to use a provided allocator callback (`tsi_zero_copy_grpc_protector_allocator_cb`) for allocating protected and unprotected slices, falling back to `GRPC_SLICE_MALLOC` if no custom allocator is set.
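A rough sketch of the "use the callback if set, otherwise fall back" pattern is shown below. The callback shape here is hypothetical and is not the real `tsi_zero_copy_grpc_protector_allocator_cb` signature; `std::malloc` stands in for `GRPC_SLICE_MALLOC`, and `QuotaCounter` is an illustrative substitute for charging a ResourceQuota.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical callback shape; the real TSI callback type may differ.
using AllocatorCb = void* (*)(std::size_t length, void* user_data);

class FrameBufferSource {
 public:
  // Installs a custom allocator. Passing nullptr restores the fallback.
  void set_allocator(AllocatorCb cb, void* user_data) {
    cb_ = cb;
    user_data_ = user_data;
  }

  // Uses the custom allocator when present, plain malloc otherwise.
  void* Allocate(std::size_t length) {
    if (cb_ != nullptr) return cb_(length, user_data_);
    return std::malloc(length);
  }

 private:
  AllocatorCb cb_ = nullptr;
  void* user_data_ = nullptr;
};

// Illustrative accounting object standing in for a resource quota.
struct QuotaCounter {
  std::size_t bytes_charged = 0;
};

// Example callback: charge the counter, then allocate.
inline void* CountingAlloc(std::size_t length, void* user_data) {
  assert(user_data != nullptr);
  static_cast<QuotaCounter*>(user_data)->bytes_charged += length;
  return std::malloc(length);
}
```

Routing every allocation through one `Allocate` function keeps the fallback decision in a single place, so callers do not need to know whether an allocator was installed.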
Akshit Patel [Wed, 3 Dec 2025 02:54:40 +0000 (18:54 -0800)]
[PH2][E2E] Fix a race condition in stream_data_queue
The `stream_id_` member is currently accessed in both the Enqueue and Dequeue operations, which causes the race. In the Enqueue flow, `stream_id_` is used only for logging, which is redundant, so that usage is removed.
Tanvi Jagtap [Tue, 2 Dec 2025 14:49:23 +0000 (06:49 -0800)]
[PH2][Settings][Refactor] Step 3.3
1. Removes unused includes of http2_settings_manager.h
2. Moves settings ACK handling into SettingsPromiseManager from Http2SettingsManager
3. Deletes `MaybeSendAck` related tests from http2_settings_test.cc
4. Moved tests as-is into settings_timeout_manager_test.cc from http2_transport_test.cc
Aananth V [Tue, 2 Dec 2025 13:32:29 +0000 (05:32 -0800)]
Include GlobalCollectionScope in StatsPluginGroup::GetCollectionScope.
Also adds a requirement that the Collection Scope returned by StatsPlugin::GetCollectionScope is a Root Scope (i.e. has no parents). This is to avoid Diamond structures in the DAG (doesn't fix the problem entirely but is a good failsafe for now).
Tanvi Jagtap [Tue, 2 Dec 2025 08:37:13 +0000 (00:37 -0800)]
[PH2][Settings][Refactor] Step 3.1
This CL refactors HTTP/2 settings ACK handling by moving the did_previous_settings_promise_resolve_ flag from Http2SettingsManager to Http2SettingsPromiseManager. did_previous_settings_promise_resolve_ is now fully managed by Http2SettingsPromiseManager so other classes don't need to check it or set it.
Step 2.2
Move the Http2SettingsManager object into SettingsPromiseManager, so that Http2ClientTransport uses Http2SettingsManager via SettingsPromiseManager.
Tanvi Jagtap [Tue, 2 Dec 2025 05:28:07 +0000 (21:28 -0800)]
[PH2][Bug][Stream]
1. Fixes a bug by preventing DATA frame processing on streams that have not yet received initial metadata.
2. Minor refactoring of existing code.
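The guard in item 1 might look something like the sketch below. `StreamState` and `DataFrameResult` are hypothetical names, and how the real transport surfaces the rejection (stream error, connection error, or something else) is not stated in this commit, so `kRejected` is only a placeholder.

```cpp
#include <cassert>
#include <string>

enum class DataFrameResult { kAccepted, kRejected };

class StreamState {
 public:
  void OnInitialMetadata() { received_initial_metadata_ = true; }

  // DATA is only processed once initial metadata has been seen.
  DataFrameResult OnDataFrame(const std::string& payload) {
    if (!received_initial_metadata_) return DataFrameResult::kRejected;
    message_bytes_ += payload;
    return DataFrameResult::kAccepted;
  }

  const std::string& message_bytes() const { return message_bytes_; }

 private:
  bool received_initial_metadata_ = false;
  std::string message_bytes_;
};

// Usage: early DATA is rejected and leaves no bytes behind.
inline void ExampleDataBeforeMetadata() {
  StreamState stream;
  assert(stream.OnDataFrame("early") == DataFrameResult::kRejected);
  stream.OnInitialMetadata();
  assert(stream.OnDataFrame("hello") == DataFrameResult::kAccepted);
  assert(stream.message_bytes() == "hello");
}
```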
Tanvi Jagtap [Sat, 29 Nov 2025 19:11:02 +0000 (11:11 -0800)]
[PH2][Settings][Refactor] Move MaybeGetSettingsAndSettingsAckFrames
Make MaybeGetSettingsAndSettingsAckFrames a member function of class SettingsPromiseManager.
Tanvi Jagtap [Fri, 28 Nov 2025 11:23:28 +0000 (03:23 -0800)]
[PH2][Settings][Refactor] Step 4 : Rename
Step 1 : https://github.com/grpc/grpc/pull/41103
Step 2, 3 : WIP
Step 4 : (This PR)
Rename variables and functions to remove the common confusion between SENT and RECEIVED settings. The current structure and naming make them hard to differentiate. We really have wasted a LOT of time here.
[Python] Disable layering check in grpc_tools:protoc_lib (#41142)
Python Bazel tests have been failing since yesterday after layering check was enabled in grpcio_tools build in commit: https://github.com/grpc/grpc/commit/756389e9e75ba93d7316ef9eae2ca83126ad9f94
Temporarily disabling it after discussing IRL with @rishesh007
```
  void TypicalTransportFunction(){
    ... other non-settings work ...
    object1.DetailedWork1();
    object2.DetailedWork2();
    object3.DetailedWork3();
    ... other non-settings work ...
  }
};
```
New Design
```
class Http2ClientTransport{
  SettingsPromiseManager settings_manager_;
  void TypicalTransportFunction(){
    ... other non-settings work ...
    settings_manager_.SomeWork();
    ... other non-settings work ...
  }
};
class SettingsPromiseManager{
  Http2SettingsManager settings_;
```
Refactor Step 1
1. Merge class `SettingsTimeoutManager` and `PendingIncomingSettings` into a new class named `SettingsPromiseManager`
2. Replace usage of `PendingIncomingSettings` and `SettingsTimeoutManager` with usage of `SettingsPromiseManager`
3. Replace `pending_incoming_settings_` with `transport_settings_`
Future Steps
1. Step 2 : Move the `Http2SettingsManager` object into `SettingsPromiseManager`, so that `Http2ClientTransport` uses `Http2SettingsManager` via `SettingsPromiseManager`.
2. Step 3 : Earlier, the `Http2ClientTransport` class handled the interactions between `Http2SettingsManager`, `SettingsTimeoutManager`, and `PendingIncomingSettings` in the transport. Move this into our new `SettingsPromiseManager` class. This will make the transport lean. This PR will need careful review of the business logic. It will also make multiple permutations of settings easy to test and debug.
3. Step 4 : Rename variables and functions to remove the common confusion between SENT and RECEIVED settings. The current structure and naming make them hard to differentiate. We really have wasted a LOT of time here.
4. Step 5 : Write unit tests for `SettingsPromiseManager` class, modelling scenarios similar to how the transport will be using the settings. Also add missing tests to `Http2SettingsManager` if needed.
Tanvi Jagtap [Thu, 27 Nov 2025 04:26:24 +0000 (20:26 -0800)]
[PH2][E2E] E2E: Multiple Changes
1. Enable logging for 2 flaking HPack tests
2. Write a new function that enables PH2 logging for flaking tests
3. Split the CANCEL and DEADLINE test suites so that they can be switched on and off separately.
Akshit Patel [Wed, 26 Nov 2025 02:42:51 +0000 (18:42 -0800)]
[PH2][E2E] Fix channelZ AddData race with transport deletion.
This CL moves `SourceDestructing` from the destructor to `Orphan`. It is possible for an `AddData` call to try to take a ref on the transport while the transport is being destructed (before `SourceDestructing` is invoked). Calling `SourceDestructing` from `Orphan` ensures that `AddData` is not called after dropping the external transport ref.
Tanvi Jagtap [Wed, 26 Nov 2025 02:02:43 +0000 (18:02 -0800)]
[PH2][Settings] Multiple changes
1. Complete the ProcessHttp2SettingsFrame function
2. Applying the incoming settings in the MultiplexerLoop and sending an ACK for incoming settings
3. Managing initial window size settings for acked settings (this was missed in previous PR).
4. Decoupling ApplyIncomingSettings from OnSettingsReceived
Tanvi Jagtap [Tue, 25 Nov 2025 15:37:06 +0000 (07:37 -0800)]
[PH2][Bug] Move transport loop spawning out of the constructor
Spawning transport loops from the Http2ClientTransport constructor creates a race condition. An initialization error can trigger a shutdown, causing the transport to be destroyed from within its own constructor.
This CL moves the loop-spawning logic to a new public method, SpawnTransportLoops(). The Chttp2Connector now calls this method after the transport is fully constructed. This ensures a clean separation between object construction and the start of asynchronous operations, preventing premature closure and potential bugs.
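The construct-then-start split can be illustrated with the sketch below. `ExampleTransport` and `CreateAndStart` are hypothetical, and the `init_will_fail` flag merely simulates an initialization error; only the method name `SpawnTransportLoops` comes from this commit.

```cpp
#include <cassert>
#include <memory>

class ExampleTransport {
 public:
  // The constructor only records state. It starts nothing that could fail
  // and tear the object down while it is still being built.
  explicit ExampleTransport(bool init_will_fail)
      : init_will_fail_(init_will_fail) {}

  // Starts asynchronous work. A failure here closes an object that is
  // already fully constructed.
  void SpawnTransportLoops() {
    assert(!loops_running_);
    if (init_will_fail_) {
      closed_ = true;
      return;
    }
    loops_running_ = true;
  }

  bool loops_running() const { return loops_running_; }
  bool closed() const { return closed_; }

 private:
  bool init_will_fail_;
  bool loops_running_ = false;
  bool closed_ = false;
};

// Plays the connector's role: construct first, then start the loops.
inline std::unique_ptr<ExampleTransport> CreateAndStart(bool init_will_fail) {
  auto transport = std::make_unique<ExampleTransport>(init_will_fail);
  transport->SpawnTransportLoops();
  return transport;
}
```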
Tanvi Jagtap [Fri, 21 Nov 2025 09:34:31 +0000 (01:34 -0800)]
[PH2][Settings] MaybeSpawnWaitForSettingsTimeout
This PR takes care of
1. Sending a SETTINGS frame to the peer.
2. Starting a timer to wait for the ACK
3. Processing the SETTINGS ACK received from the peer.
This does NOT include sending a SETTINGS ACK or processing a received SETTINGS frame.
Changes :
1. Renamed MarkPeerSettingsResolved to MarkPeerSettingsPromiseResolved, and SpawnWaitForSettingsTimeout to MaybeSpawnWaitForSettingsTimeout.
2. Moved all functions to the cc file
3. Added an if check to MaybeSpawnWaitForSettingsTimeout to prevent incorrect spawning when no settings have been sent.
4. Some plumbing.
PiperOrigin-RevId: 835120786
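The guard added in change 3 can be pictured with the small sketch below. `SettingsAckTimer` and its members are hypothetical, and the counter replaces the real promise-based timer; only the name `MaybeSpawnWaitForSettingsTimeout` is taken from this commit.

```cpp
#include <cassert>

class SettingsAckTimer {
 public:
  void OnSettingsSent() { awaiting_ack_ = true; }
  void OnSettingsAckReceived() { awaiting_ack_ = false; }

  // Returns true if a wait for the ACK was started.
  bool MaybeSpawnWaitForSettingsTimeout() {
    if (!awaiting_ack_) return false;  // No SETTINGS sent: nothing to wait for.
    ++waits_spawned_;
    return true;
  }

  int waits_spawned() const { return waits_spawned_; }

 private:
  bool awaiting_ack_ = false;
  int waits_spawned_ = 0;
};

// Usage: the wait is only spawned between sending SETTINGS and the ACK.
inline void ExampleSettingsAckTimer() {
  SettingsAckTimer timer;
  assert(!timer.MaybeSpawnWaitForSettingsTimeout());
  timer.OnSettingsSent();
  assert(timer.MaybeSpawnWaitForSettingsTimeout());
  timer.OnSettingsAckReceived();
  assert(!timer.MaybeSpawnWaitForSettingsTimeout());
}
```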
Akshit Patel [Thu, 20 Nov 2025 08:42:52 +0000 (00:42 -0800)]
[PH2] Handle unknown stream IDs. This CL addresses the following:
1. A HEADERS/CONTINUATION/DATA/WINDOW_UPDATE frame with an unexpected stream ID is now treated as a connection error, per the RFC.
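As a simplified sketch of this check, assuming a plain set of known stream IDs: `StreamTable` and `FrameDisposition` are hypothetical names, and the real transport's rules for which IDs count as "expected" (for example, recently closed streams) are likely more nuanced than a set lookup.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>

enum class FrameDisposition { kDispatchToStream, kConnectionError };

class StreamTable {
 public:
  void AddStream(std::uint32_t stream_id) { streams_.insert(stream_id); }
  void RemoveStream(std::uint32_t stream_id) { streams_.erase(stream_id); }

  // Applies to stream-scoped frames such as HEADERS or DATA.
  FrameDisposition OnStreamFrame(std::uint32_t stream_id) const {
    return streams_.count(stream_id) > 0 ? FrameDisposition::kDispatchToStream
                                         : FrameDisposition::kConnectionError;
  }

 private:
  std::unordered_set<std::uint32_t> streams_;
};

// Usage: a frame for an ID the table has never seen fails the connection.
inline void ExampleUnknownStreamId() {
  StreamTable table;
  table.AddStream(1);
  assert(table.OnStreamFrame(1) == FrameDisposition::kDispatchToStream);
  assert(table.OnStreamFrame(7) == FrameDisposition::kConnectionError);
}
```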
Tanvi Jagtap [Wed, 19 Nov 2025 11:30:58 +0000 (03:30 -0800)]
[PH2][Common][Refactor] IncomingMetadataTracker
1. Moving out incoming header state and management into class IncomingMetadataTracker
2. Fixing a bug in CloseStream. The state should not be altered in this case.
3. Two parameters to function ParseAndDiscardHeaders were actually data members. So I removed them. ParseAndDiscardHeaders will access the data members directly.
4. Fixing clang issues.
5. Moving helpers from header_assembler_test into the common test class.
Aananth V [Wed, 19 Nov 2025 08:09:18 +0000 (00:09 -0800)]
Set security protocol type in AuthContext.
The Injectable Peer Comparison API added in https://github.com/grpc/grpc/pull/39610 uses the `protocol_` field of the `grpc_auth_context` to 1) look up the registered comparators, and 2) perform an initial comparison to ensure that the two compared auth contexts have the same protocol. However, this field is currently unset for all types of credentials.
This change populates the `protocol` field in `grpc_auth_context` with the name of the security connector type after the peer check in the security handshaker. E2E Tests are updated to verify that the `AuthContext` contains the correct protocol type.
Aananth V [Tue, 18 Nov 2025 08:19:54 +0000 (00:19 -0800)]
Introduce UpDownCounter instrument type.
This change adds a new instrument type, `UpDownCounter`, to the gRPC telemetry system. Unlike a standard `Counter`, an `UpDownCounter` can be incremented and decremented. It can be thought of as a UInt Gauge that is stored rather than being queried when the MetricsQuery is run.
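To make the "stored rather than queried" idea concrete, here is a minimal sketch of such an instrument. This is not the gRPC telemetry API: the class layout and method names are assumptions, and the unsigned storage follows the "UInt Gauge" analogy, so callers are assumed never to decrement below zero.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

class UpDownCounter {
 public:
  void Increment(std::uint64_t delta = 1) {
    value_.fetch_add(delta, std::memory_order_relaxed);
  }

  // Assumption: callers never decrement below zero (unsigned storage).
  void Decrement(std::uint64_t delta = 1) {
    value_.fetch_sub(delta, std::memory_order_relaxed);
  }

  // The current value is stored, so reading it needs no callback.
  std::uint64_t value() const {
    return value_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<std::uint64_t> value_{0};
};

// Usage: track something that rises and falls, such as open connections.
inline void ExampleUpDownCounter() {
  UpDownCounter open_connections;
  open_connections.Increment();
  open_connections.Increment();
  open_connections.Decrement();
  assert(open_connections.value() == 1);
}
```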
Mark D. Roth [Tue, 18 Nov 2025 05:09:26 +0000 (21:09 -0800)]
[subchannel connector] pass initial MAX_CONCURRENT_STREAMS value from connector (#41064)
This is needed for A105 (https://github.com/grpc/proposal/pull/516).
The subchannel will wind up getting the transport's MAX_CONCURRENT_STREAMS value via the new StateWatcher API that I added in #40952. However, because the subchannel does not start that watch until after it has a connection and reports READY to the LB policy, this means that RPCs can start on the subchannel before the subchannel knows the transport's MAX_CONCURRENT_STREAMS value. This can cause us to incorrectly scale up the number of connections when we shouldn't.
To avoid that race, this PR changes the notify_on_receive_settings hook to pass the transport's initial MAX_CONCURRENT_STREAMS value back to the connector, which in turn passes it back to the subchannel. This will allow the subchannel to use that initial value when dispatching RPCs until it receives the first notification from the StateWatcher.
I would ideally like to completely remove this bespoke notify_on_receive_settings hook and instead have the connector use the new StateWatcher API, but that would require a bit more refactoring work than I want to do right now.
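The effect of seeding the subchannel with the initial value can be sketched as below. `SubchannelStreamLimiter` is a hypothetical class, not the subchannel code; it only shows why having a limit available from the first RPC matters, with later watcher notifications replacing the seed value.

```cpp
#include <cassert>
#include <cstdint>

class SubchannelStreamLimiter {
 public:
  // initial_max is the value the connector passed back with the connection.
  explicit SubchannelStreamLimiter(std::uint32_t initial_max)
      : max_concurrent_streams_(initial_max) {}

  // Called on each later notification from the state watcher.
  void OnMaxConcurrentStreamsUpdate(std::uint32_t new_max) {
    max_concurrent_streams_ = new_max;
  }

  // Returns true if the RPC may start on this connection now.
  bool TryStartRpc() {
    if (in_flight_ >= max_concurrent_streams_) return false;
    ++in_flight_;
    return true;
  }

  void OnRpcDone() {
    assert(in_flight_ > 0);
    --in_flight_;
  }

 private:
  std::uint32_t max_concurrent_streams_;
  std::uint32_t in_flight_ = 0;
};
```

Without the seed value, the limiter would have no number to compare against until the first watcher notification arrived, which is the window the PR closes.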
siddharth nohria [Mon, 17 Nov 2025 08:14:37 +0000 (00:14 -0800)]
Server Wide Max Outstanding Streams: Add Build changes (#41076)
Allow servers to set max outstanding streams limit per server. This pull request only adds the BUILD changes required for this. The core logic will follow in a later PR.
This PR moves the macro to a separate bazel profile config called `postmortem`, which is not enabled by default.
Instead, this config will be enabled in all remote CIs via tools/remote_build/include/test_config_common.bazelrc: https://github.com/grpc/grpc/blob/ba4984e8a0d21270a6cfc0481efd2de1595601d9/tools/remote_build/include/test_config_common.bazelrc#L26-L27
For the list of affected CI jobs, see my comment on this PR.
Niraj Nepal [Fri, 14 Nov 2025 12:42:09 +0000 (04:42 -0800)]
[ruby] Fix version comparison for the ruby_abi_version symbol for ruby 4 compatibility (#41061)
The next version of Ruby will be 4.0.0. Previously, development versions didn't load properly due to grpc.so not exporting the ruby_abi_version symbol. Correct the version comparison logic so we export the symbol on version 4.0.
Mark D. Roth [Fri, 14 Nov 2025 01:25:24 +0000 (17:25 -0800)]
[transport] add new watcher API to be used by subchannel (#40952)
This adds a new transport state watcher API. The normal connectivity state watcher API is not what we really want in the transport, since we don't expect to see any state-change event except for disconnection, and when that happens, we want to see a lot more info about the disconnection than is available via a connectivity state watch (see [gRFC A94](https://github.com/grpc/proposal/blob/master/A94-subchannel-otel-metrics.md)). In addition, we also need to get reports of the peer's MAX_CONCURRENT_STREAMS setting as part of implementing connection scaling (see WIP [gRFC A105](https://github.com/grpc/proposal/pull/516)).
This new API goes directly from the subchannel to the transport, bypassing the filter stack. This is consistent with our desire to remove the transport op API in the filter stack as part of the promise migration.
Eventually, this API should be used on the server side too, but that's a project for another day.
As part of this, we also change the way that keepalive data is sent from the subchannel to the channel. This will also be needed as part of A105, where we need to propagate keepalive info to the channel even when the subchannel's connectivity state does not change.
Atan Bhardwaj [Thu, 13 Nov 2025 03:33:10 +0000 (19:33 -0800)]
Refactor `proto_reflection_descriptor_database` in util to use `absl::flat_hash_map` and `absl::flat_hash_set` instead of `std::unordered_map` and `std::unordered_set` for potential performance improvements. This also involves including the necessary absl headers and updating the `BUILD` file.
Replace `.insert()` with `.emplace()` for `missing_symbols_` in `proto_reflection_descriptor_database.cc`.
Mark D. Roth [Tue, 11 Nov 2025 01:11:48 +0000 (17:11 -0800)]
[subchannel] remove no-op code to remove channelz linkage (#41041)
This code appears to be a no-op. It removes the `Subchannel`'s channelz node as a parent of the `ConnectedSubchannel`'s channelz node when the `ConnectedSubchannel` becomes disconnected. However, the `ConnectedSubchannel`'s channelz node [is exactly the same node as the `Subchannel`'s channelz node](https://github.com/grpc/grpc/blob/4a97f0950a5fbe422f8cc2965c3055fa2d70444e/src/core/client_channel/subchannel.cc#L842), so this is attempting to remove a node from its own parent list -- and it should never be present in its own parent list in the first place, so this will be a no-op.
This code was originally intended to remove the linkage between the subchannel node and the *socket* node (owned by the transport, not the `ConnectedSubchannel`) when the transport reports disconnection to the subchannel. However, it looks like it was accidentally broken in c675ef88599d9a86684c68ff90cb5420b38f9703 (cl/786454401) as part of switching from recording children to recording parents.
Rather than fixing this, however, I think we should just remove this code. The transport will report a "disconnection" to the subchannel when it receives a GOAWAY, but the connection will actually remain alive for a short time afterwards if there are still RPCs in flight on the connection, and I think we want channelz to show the connection as still being associated with the subchannel until it actually gets closed, since that provides useful context about where the connection came from. (That behavior would have been hard to achieve when we were still recording children instead of parents.)
Note that the subchannel may wind up creating a new connection before the old connection is actually closed, but the channelz data model allows multiple sockets per subchannel for exactly this case. Java and Go both keep the socket associated with the subchannel until it's closed, so this change makes the behavior consistent across languages.
Mark D. Roth [Fri, 7 Nov 2025 23:40:13 +0000 (15:40 -0800)]
[client channel] combine two subchannel data structures into one (#40880)
Currently, the client channel maintains two different data structures for tracking subchannels:
- `subchannel_refcount_map_`, which tracks the number of subchannel wrappers per subchannel. This is used to determine when to create and remove channelz node linkage.
- `subchannel_wrappers_`, which tracks the set of all subchannel wrappers across all subchannels. This was originally introduced back in #20039 as part of propagating the health check service name to subchannels, but we switched to using a different approach for the health check service name in #26441. However, in between those two, we started using this data structure for handling keepalive time in #23313 -- which is actually somewhat inefficient, since we may wind up setting the keepalive time on the underlying subchannel more than once in the case where there is more than one subchannel wrapper per subchannel (which happens frequently during updates).
This PR combines these two data structures into one: a map from subchannel to a set of subchannel wrappers for that subchannel. This is used both for channelz node updates and keepalive propagation -- and it's more efficient for the latter, because we can now update each subchannel exactly once.
This also paves the way for a subsequent change that will be needed as part of the MAX_CONCURRENT_STREAMS design.
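The combined structure can be sketched as a map from subchannel to its set of wrappers. The types below are hypothetical simplifications (raw pointers, an integer keepalive), not the client channel's actual classes; the point is that one map answers both "is this the first or last wrapper?" and "which subchannels need a keepalive update?".

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical, simplified stand-ins for the real classes.
struct Subchannel {
  int keepalive_ms = 0;
  int keepalive_updates = 0;
  void SetKeepaliveTime(int ms) {
    keepalive_ms = ms;
    ++keepalive_updates;
  }
};

struct SubchannelWrapper {
  Subchannel* subchannel;
};

class WrapperRegistry {
 public:
  // Returns true for the first wrapper of a subchannel, which is the
  // point where channelz linkage would be created.
  bool Add(SubchannelWrapper* wrapper) {
    assert(wrapper != nullptr);
    std::set<SubchannelWrapper*>& wrappers = map_[wrapper->subchannel];
    const bool first = wrappers.empty();
    wrappers.insert(wrapper);
    return first;
  }

  // Returns true when the last wrapper of a subchannel is removed, which
  // is the point where channelz linkage would be dropped.
  bool Remove(SubchannelWrapper* wrapper) {
    assert(wrapper != nullptr);
    auto it = map_.find(wrapper->subchannel);
    if (it == map_.end()) return false;
    it->second.erase(wrapper);
    if (!it->second.empty()) return false;
    map_.erase(it);
    return true;
  }

  // Updates each subchannel exactly once, however many wrappers it has.
  void SetKeepaliveTime(int ms) {
    for (auto& entry : map_) entry.first->SetKeepaliveTime(ms);
  }

 private:
  std::map<Subchannel*, std::set<SubchannelWrapper*>> map_;
};
```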
Luwei Ge [Fri, 7 Nov 2025 21:55:29 +0000 (13:55 -0800)]
[core][credentials] share the common http token fetching logic between oauth2 and jwt call creds (#40907)
Those two call creds only differ in how they parse the token from the HTTP response. So I changed the meaning of the `on_done_` in the `FetchRequest` to also include the parsing logic and let each credentials prepare this callback that includes the parsing and the one given to the `FetchToken` call.
The HTTP response status handling of both credentials was switched to follow the `JwtTokenFetcherCredentials`, as that supports better retry behavior.
This approach inevitably changes the error message that the JWT token fetcher returns slightly, but it should still provide useful info to the caller.
NOTE: This PR does not touch the `ExternalFetchRequest` in `external_account_credentials.h`. That type of credentials does not always make HTTP requests. It makes multiple hops to get the final token, so it introduced the `FetchBody` class. I find it challenging to unify all three cases.
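The shape of the shared logic, one fetch routine with a per-credential parsing callback, can be sketched like this. Everything here is hypothetical: `HttpResponse`, `FetchToken` and the two toy parsers are illustrations, the body formats are invented, and the real code's status handling and retry behavior are richer than a single status check.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

struct HttpResponse {
  int status;
  std::string body;
};

using HttpGet = std::function<HttpResponse()>;
using TokenParser =
    std::function<std::optional<std::string>(const std::string& body)>;

// Shared part: perform the request and check the status the same way for
// every credential type. Only the parser differs between callers.
inline std::optional<std::string> FetchToken(const HttpGet& http_get,
                                             const TokenParser& parse) {
  assert(http_get && parse);
  const HttpResponse response = http_get();
  if (response.status != 200) return std::nullopt;
  return parse(response.body);
}

// Toy OAuth2-style parser (invented format): "access_token=<token>".
inline std::optional<std::string> ParsePrefixedToken(const std::string& body) {
  const std::string prefix = "access_token=";
  if (body.compare(0, prefix.size(), prefix) != 0) return std::nullopt;
  return body.substr(prefix.size());
}

// Toy JWT-style parser (invented format): the whole body is the token.
inline std::optional<std::string> ParseRawToken(const std::string& body) {
  if (body.empty()) return std::nullopt;
  return body;
}
```

Each credential type would supply its own parser while reusing `FetchToken` unchanged, which is the deduplication this PR describes.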
This refactors the call buffering code for the v1 stack, which avoids some repetition between the resolver queue and the LB pick queue. This code will also be used in the future in the subchannel as part of implementing the MAX_CONCURRENT_STREAMS design.
As part of this, I also eliminated the subclassing in the v1 client channel implementation, which has not been necessary since the v2 code was removed.
[BCR] Update BCR maintainers in the metadata template. (#41024)
Update the template used to export BCR releases. Fixes the issue
with emeritus owners getting pinged/assigned for BCR gRPC PRs.
Once this is merged, @yuanweiz will manually update the corresponding file:
https://github.com/bazelbuild/bazel-central-registry/blob/1c61954df7e15bfbd03eef8accc7ae4e98ec548b/modules/grpc/metadata.json
- Offboard @veblush, @eugeneo, @hork, @yashkt, @apolcyn, @stanleycheung
- Add @yuanweiz, BCR integration owner
- Add @asheshvidyut, now owns Ruby and other wrapped languages