Tanvi Jagtap [Fri, 28 Nov 2025 11:23:28 +0000 (03:23 -0800)]
[PH2][Settings][Refactor] Step 4 : Rename
Step 1 : https://github.com/grpc/grpc/pull/41103
Step 2, 3 : WIP
Step 4 : (This PR)
Rename variables and functions to eliminate the long-standing confusion between SENT and RECEIVED settings. The current structure and naming make the two hard to differentiate, and we have wasted a lot of time on this.
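The direction-disambiguated naming described above can be sketched as follows. This is a minimal illustrative model only; every type and method name here (`SettingsNamingSketch`, `SetLocalSettingsToSend`, etc.) is invented for the example and is not an actual gRPC identifier:

```cpp
#include <cassert>
#include <cstdint>

struct Http2SettingsSketch {
  uint32_t initial_window_size = 65535;
};

// Sketch: keep the settings we SENT to the peer separate from the settings
// we RECEIVED from the peer, with the direction spelled out in every name.
class SettingsNamingSketch {
 public:
  // Settings this endpoint sent to the peer; they take effect only after
  // the peer ACKs them.
  void SetLocalSettingsToSend(Http2SettingsSketch s) { local_settings_sent_ = s; }
  void OnPeerAckedLocalSettings() { local_settings_acked_ = local_settings_sent_; }

  // Settings the peer sent to us; we apply them and then ACK.
  void OnSettingsReceivedFromPeer(Http2SettingsSketch s) { peer_settings_ = s; }

  const Http2SettingsSketch& acked_local_settings() const { return local_settings_acked_; }
  const Http2SettingsSketch& peer_settings() const { return peer_settings_; }

 private:
  Http2SettingsSketch local_settings_sent_;
  Http2SettingsSketch local_settings_acked_;
  Http2SettingsSketch peer_settings_;
};
```

With names like these, code that reads `acked_local_settings()` can never be mistaken for code operating on the peer's settings.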
[Python] Disable layering check in grpc_tools:protoc_lib (#41142)
Python Bazel tests have been failing since yesterday after layering check was enabled in grpcio_tools build in commit: https://github.com/grpc/grpc/commit/756389e9e75ba93d7316ef9eae2ca83126ad9f94
Temporarily disabling it after discussing IRL with @rishesh007
```
void TypicalTransportFunction(){
... other non-settings work ...
object1.DetailedWork1();
object2.DetailedWork2();
object3.DetailedWork3();
... other non-settings work ...
}
};
```
New Design
```
class Http2ClientTransport{
SettingsPromiseManager settings_manager_;
void TypicalTransportFunction(){
... other non-settings work ...
settings_manager_.SomeWork();
... other non-settings work ...
}
};
class SettingsPromiseManager{
Http2SettingsManager settings_;
};
```
Refactor Step 1
1. Merge class `SettingsTimeoutManager` and `PendingIncomingSettings` into a new class named `SettingsPromiseManager`
2. Replace usage of `PendingIncomingSettings` and `SettingsTimeoutManager` with usage of `SettingsPromiseManager`
3. Replace `pending_incoming_settings_` with `transport_settings_`
Future Steps
1. Step 2 : Move the `Http2SettingsManager` object into `SettingsPromiseManager`, so that `Http2ClientTransport` uses `Http2SettingsManager` only via `SettingsPromiseManager`
2. Step 3 : Earlier, the `Http2ClientTransport` class itself handled the interactions between `Http2SettingsManager`, `SettingsTimeoutManager`, and `PendingIncomingSettings`. Move this logic into the new `SettingsPromiseManager` class to make the transport lean. This PR will need careful review of the business logic. It will also make many permutations of settings easy to test and debug.
3. Step 4 : Rename variables and functions to eliminate the long-standing confusion between SENT and RECEIVED settings. The current structure and naming make the two hard to differentiate, and we have wasted a lot of time on this.
4. Step 5 : Write unit tests for `SettingsPromiseManager` class, modelling scenarios similar to how the transport will be using the settings. Also add missing tests to `Http2SettingsManager` if needed.
Tanvi Jagtap [Thu, 27 Nov 2025 04:26:24 +0000 (20:26 -0800)]
[PH2][E2E] Multiple changes
1. Enable logging for 2 flaky HPack tests
2. Add a new function that enables PH2 logging for flaky tests
3. Split the CANCEL and DEADLINE test suites so that they can be switched on and off separately.
Akshit Patel [Wed, 26 Nov 2025 02:42:51 +0000 (18:42 -0800)]
[PH2][E2E] Fix channelZ AddData race with transport deletion.
This CL moves `SourceDestructing` from the destructor to `Orphan`. It is possible that `AddData` call tries to take a ref on the transport while the transport is being destructed (before `SourceDestructing` is invoked). Calling `SourceDestructing` from `Orphan` ensures that `AddData` is not called after dropping the external transport ref.
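The race fix above can be sketched with a deliberately simplified model; the class and its boolean flag here are invented for illustration, and only `Orphan` and the `SourceDestructing` idea come from the CL:

```cpp
#include <cassert>

// Sketch: the transport unregisters itself as a channelz data source in
// Orphan(), i.e. when the last *external* ref is dropped, instead of in the
// destructor. After that point AddData() can no longer try to take a ref on
// a transport that is already being destroyed.
class TransportSketch {
 public:
  void Orphan() {
    source_destructed_ = true;  // previously done in the destructor
  }
  // AddData only proceeds while the transport is still registered.
  bool AddData() { return !source_destructed_; }

 private:
  bool source_destructed_ = false;
};
```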
Tanvi Jagtap [Wed, 26 Nov 2025 02:02:43 +0000 (18:02 -0800)]
[PH2][Settings] Multiple changes
1. Complete the ProcessHttp2SettingsFrame function
2. Applying the incoming settings in the MultiplexerLoop and sending an ACK for incoming settings
3. Managing initial window size settings for acked settings (this was missed in previous PR).
4. Decoupling ApplyIncomingSettings from OnSettingsReceived
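The ordering in items 1, 2, and 4 can be sketched as a tiny state machine; all types here are invented stand-ins, assuming (per the list above) that incoming SETTINGS are queued by the read path, applied in the multiplexer loop, and only then ACKed:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct SettingSketch { uint16_t id; uint32_t value; };

// Sketch: OnSettingsReceived only records the frame (decoupled from
// application); the multiplexer/write loop applies it and queues the ACK.
class SettingsFrameSketch {
 public:
  void OnSettingsReceived(std::vector<SettingSketch> s) {
    pending_ = std::move(s);
    have_pending_ = true;
  }
  // Called from the write loop: apply, then ACK. Returns whether work ran.
  bool MultiplexerLoopStep() {
    if (!have_pending_) return false;
    applied_ = pending_;  // the ApplyIncomingSettings step
    have_pending_ = false;
    ack_queued_ = true;   // queue the SETTINGS ACK for the peer
    return true;
  }
  bool ack_queued() const { return ack_queued_; }
  int applied_count() const { return static_cast<int>(applied_.size()); }

 private:
  std::vector<SettingSketch> pending_, applied_;
  bool have_pending_ = false;
  bool ack_queued_ = false;
};
```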
Tanvi Jagtap [Tue, 25 Nov 2025 15:37:06 +0000 (07:37 -0800)]
[PH2][Bug] Move transport loop spawning out of the constructor
Spawning transport loops from the Http2ClientTransport constructor creates a race condition. An initialization error can trigger a shutdown, causing the transport to be destroyed from within its own constructor.
This CL moves the loop-spawning logic to a new public method, SpawnTransportLoops(). The Chttp2Connector now calls this method after the transport is fully constructed. This ensures a clean separation between object construction and the start of asynchronous operations, preventing premature closure and potential bugs.
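The two-phase initialization can be sketched as below; the types and the connector function are illustrative, with only `SpawnTransportLoops` taken from the CL:

```cpp
#include <cassert>

// Sketch: the constructor only builds state; the connector starts the async
// loops afterwards via a separate public method, so a shutdown triggered
// during initialization can never destroy the object from inside its own
// constructor.
class TransportSketch2 {
 public:
  TransportSketch2() = default;  // no loops spawned here
  void SpawnTransportLoops() { loops_running_ = true; }
  bool loops_running() const { return loops_running_; }

 private:
  bool loops_running_ = false;
};

// What the connector-side call sequence looks like in this sketch.
inline TransportSketch2 ConnectSketch() {
  TransportSketch2 t;       // construction fully completes first...
  t.SpawnTransportLoops();  // ...then asynchronous work starts
  return t;
}
```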
Tanvi Jagtap [Fri, 21 Nov 2025 09:34:31 +0000 (01:34 -0800)]
[PH2][Settings] MaybeSpawnWaitForSettingsTimeout
This PR takes care of
1. Sending a SETTINGS frame to the peer.
2. Starting a timer to wait for the ACK
3. Processing the SETTINGS ACK received from the peer.
This does NOT include sending a SETTINGS ACK or processing a received SETTING frame.
Changes :
1. Renamed MarkPeerSettingsResolved to MarkPeerSettingsPromiseResolved, and SpawnWaitForSettingsTimeout to MaybeSpawnWaitForSettingsTimeout
2. Moved all functions to the cc file
3. Added an if check to MaybeSpawnWaitForSettingsTimeout to prevent incorrect spawning when no settings have been sent.
4. Some plumbing.
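The guard from change 3 can be sketched as a small state machine; the class and its flags are invented for illustration, with only the two renamed method names taken from the PR:

```cpp
#include <cassert>

// Sketch: the timeout waiter is spawned only if a SETTINGS frame is actually
// in flight, and at most once per frame.
class SettingsTimeoutSketch {
 public:
  void OnSettingsSent() { settings_in_flight_ = true; }

  // Returns whether a timeout waiter was actually spawned.
  bool MaybeSpawnWaitForSettingsTimeout() {
    if (!settings_in_flight_ || waiter_spawned_) return false;
    waiter_spawned_ = true;
    return true;
  }

  // The ACK arrived from the peer: resolve the promise, cancel the wait.
  void MarkPeerSettingsPromiseResolved() {
    settings_in_flight_ = false;
    waiter_spawned_ = false;
  }

 private:
  bool settings_in_flight_ = false;
  bool waiter_spawned_ = false;
};
```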
PiperOrigin-RevId: 835120786
Akshit Patel [Thu, 20 Nov 2025 08:42:52 +0000 (00:42 -0800)]
[PH2] Handle unknown stream IDs. This CL addresses the following:
1. Receiving a HEADERS/CONTINUATION/DATA/WINDOW_UPDATE frame with an unexpected stream ID is now treated as a connection error, as required by the RFC.
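A minimal sketch of that rule, assuming a simplified stream-ID check (the real transport consults its stream map; the enum and function here are invented). RFC 9113 §5.1 makes frames on streams in an invalid state a connection error of type PROTOCOL_ERROR:

```cpp
#include <cassert>
#include <cstdint>

enum class FrameErrorSketch { kOk, kConnectionError };

// Sketch: a HEADERS/CONTINUATION/DATA/WINDOW_UPDATE frame for a stream id
// above anything this endpoint has seen is a connection error.
inline FrameErrorSketch ValidateStreamIdSketch(uint32_t stream_id,
                                               uint32_t max_known_stream_id) {
  if (stream_id > max_known_stream_id) {
    return FrameErrorSketch::kConnectionError;
  }
  return FrameErrorSketch::kOk;
}
```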
Tanvi Jagtap [Wed, 19 Nov 2025 11:30:58 +0000 (03:30 -0800)]
[PH2][Common][Refactor] IncomingMetadataTracker
1. Moving out incoming header state and management into class IncomingMetadataTracker
2. Fixing a bug in CloseStream: the state should not be altered in this case.
3. Two parameters of ParseAndDiscardHeaders were actually data members, so I removed them; ParseAndDiscardHeaders now accesses the data members directly.
4. Fixing clang issues.
5. Moving helpers from header_assembler_test into the common test class.
Aananth V [Wed, 19 Nov 2025 08:09:18 +0000 (00:09 -0800)]
Set security protocol type in AuthContext.
The Injectable Peer Comparison API added in https://github.com/grpc/grpc/pull/39610 uses the `protocol_` field of the `grpc_auth_context` to 1) Lookup the registered comparators, and 2) Perform an initial comparison to ensure that the two compared auth contexts have the same protocol. However, this field is currently unset for all types of credentials.
This change populates the `protocol` field in `grpc_auth_context` with the name of the security connector type after the peer check in the security handshaker. E2E Tests are updated to verify that the `AuthContext` contains the correct protocol type.
Aananth V [Tue, 18 Nov 2025 08:19:54 +0000 (00:19 -0800)]
Introduce UpDownCounter instrument type.
This change adds a new instrument type, `UpDownCounter`, to the gRPC telemetry system. Unlike a standard `Counter`, an `UpDownCounter` can be incremented and decremented. It can be thought of as a UInt Gauge that is stored rather than being queried when the MetricsQuery is run.
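The semantics described above can be sketched as follows; the class is an illustrative stand-in, not the actual gRPC telemetry type:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: unlike a monotonic Counter, an UpDownCounter supports both
// increment and decrement, and its current value is stored rather than
// computed when metrics are queried.
class UpDownCounterSketch {
 public:
  void Increment(int64_t delta) { value_ += delta; }
  void Decrement(int64_t delta) { value_ -= delta; }
  int64_t value() const { return value_; }  // read the stored value

 private:
  int64_t value_ = 0;
};
```

A natural use is tracking a level, e.g. currently open streams: increment on open, decrement on close.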
Mark D. Roth [Tue, 18 Nov 2025 05:09:26 +0000 (21:09 -0800)]
[subchannel connector] pass initial MAX_CONCURRENT_STREAMS value from connector (#41064)
This is needed for A105 (https://github.com/grpc/proposal/pull/516).
The subchannel will wind up getting the transport's MAX_CONCURRENT_STREAMS value via the new StateWatcher API that I added in #40952. However, because the subchannel does not start that watch until after it has a connection and reports READY to the LB policy, this means that RPCs can start on the subchannel before the subchannel knows the transport's MAX_CONCURRENT_STREAMS value. This can cause us to incorrectly scale up the number of connections when we shouldn't.
To avoid that race, this PR changes the notify_on_receive_settings hook to pass the transport's initial MAX_CONCURRENT_STREAMS value back to the connector, which in turn passes it back to the subchannel. This will allow the subchannel to use that initial value when dispatching RPCs until it receives the first notification from the StateWatcher.
I would ideally like to completely remove this bespoke notify_on_receive_settings hook and instead have the connector use the new StateWatcher API, but that would require a bit more refactoring work than I want to do right now.
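The race fix can be sketched as below, assuming a simplified subchannel model (all names here are invented; only the idea of an initial value superseded by StateWatcher updates comes from the PR):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Sketch: the notify-on-receive-settings hook carries the transport's initial
// MAX_CONCURRENT_STREAMS back to the subchannel, which uses it until the
// first StateWatcher notification arrives.
class SubchannelSketch {
 public:
  // Hook invoked once the first SETTINGS frame is seen on the connection.
  void OnInitialSettings(uint32_t max_concurrent_streams) {
    if (!watched_value_.has_value()) initial_value_ = max_concurrent_streams;
  }
  // Later, StateWatcher reports supersede the initial value.
  void OnStateWatcherUpdate(uint32_t max_concurrent_streams) {
    watched_value_ = max_concurrent_streams;
  }
  uint32_t EffectiveMaxConcurrentStreams() const {
    return watched_value_.value_or(initial_value_);
  }

 private:
  uint32_t initial_value_ = 0xffffffffu;  // effectively unlimited until told
  std::optional<uint32_t> watched_value_;
};
```

This closes the window where RPCs dispatched before the first watch notification would otherwise see no limit at all.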
siddharth nohria [Mon, 17 Nov 2025 08:14:37 +0000 (00:14 -0800)]
Server Wide Max Outstanding Streams: Add Build changes (#41076)
Allow servers to set max outstanding streams limit per server. This pull request only adds the BUILD changes required for this. The core logic will follow in a later PR.
This PR moves the macro to a separate bazel profile config called `postmortem`, which is not enabled by default.
Instead, this config will be enabled in all remote CIs via tools/remote_build/include/test_config_common.bazelrc: https://github.com/grpc/grpc/blob/ba4984e8a0d21270a6cfc0481efd2de1595601d9/tools/remote_build/include/test_config_common.bazelrc#L26-L27
For the list of affected CI jobs, see my comment on this PR.
Niraj Nepal [Fri, 14 Nov 2025 12:42:09 +0000 (04:42 -0800)]
[ruby] Fix version comparison for the ruby_abi_version symbol for ruby 4 compatibility (#41061)
The next version of Ruby will be 4.0.0. Previously, development versions didn't load properly due to grpc.so not exporting the ruby_abi_version symbol. Correct the version comparison logic so we export the symbol on version 4.0.
Mark D. Roth [Fri, 14 Nov 2025 01:25:24 +0000 (17:25 -0800)]
[transport] add new watcher API to be used by subchannel (#40952)
This adds a new transport state watcher API. The normal connectivity state watcher API is not what we really want in the transport, since we don't expect to see any state-change event except for disconnection, and when that happens, we want to see a lot more info about the disconnection than is available via a connectivity state watch (see [gRFC A94](https://github.com/grpc/proposal/blob/master/A94-subchannel-otel-metrics.md)). In addition, we also need to get reports of the peer's MAX_CONCURRENT_STREAMS setting as part of implementing connection scaling (see WIP [gRFC A105](https://github.com/grpc/proposal/pull/516)).
This new API goes directly from the subchannel to the transport, bypassing the filter stack. This is consistent with our desire to remove the transport op API in the filter stack as part of the promise migration.
Eventually, this API should be used on the server side too, but that's a project for another day.
As part of this, we also change the way that keepalive data is sent from the subchannel to the channel. This will also be needed as part of A105, where we need to propagate keepalive info to the channel even when the subchannel's connectivity state does not change.
Atan Bhardwaj [Thu, 13 Nov 2025 03:33:10 +0000 (19:33 -0800)]
Refactor `proto_reflection_descriptor_database` in util to use `absl::flat_hash_map` and `absl::flat_hash_set` instead of `std::unordered_map` and `std::unordered_set` for potential performance improvements. This also involves including the necessary absl headers and updating the `BUILD` file.
Replace `.insert()` with `.emplace()` for `missing_symbols_` in `proto_reflection_descriptor_database.cc`.
Mark D. Roth [Tue, 11 Nov 2025 01:11:48 +0000 (17:11 -0800)]
[subchannel] remove no-op code to remove channelz linkage (#41041)
This code appears to be a no-op. It removes the `Subchannel`'s channelz node as a parent of the `ConnectedSubchannel`'s channelz node when the `ConnectedSubchannel` becomes disconnected. However, the `ConnectedSubchannel`'s channelz node [is exactly the same node as the `Subchannel`'s channelz node](https://github.com/grpc/grpc/blob/4a97f0950a5fbe422f8cc2965c3055fa2d70444e/src/core/client_channel/subchannel.cc#L842), so this is attempting to remove a node from its own parent list -- and it should never be present in its own parent list in the first place, so this will be a no-op.
This code was originally intended to remove the linkage between the subchannel node and the *socket* node (owned by the transport, not the `ConnectedSubchannel`) when the transport reports disconnection to the subchannel. However, it looks like it was accidentally broken in c675ef88599d9a86684c68ff90cb5420b38f9703 (cl/786454401) as part of switching from recording children to recording parents.
Rather than fixing this, however, I think we should just remove this code. The transport will report a "disconnection" to the subchannel when it receives a GOAWAY, but the connection will actually remain alive for a short time afterwards if there are still RPCs in flight on the connection, and I think we want channelz to show the connection as still being associated with the subchannel until it actually gets closed, since that provides useful context about where the connection came from. (That behavior would have been hard to achieve when we were still recording children instead of parents.)
Note that the subchannel may wind up creating a new connection before the old connection is actually closed, but the channelz data model allows multiple sockets per subchannel for exactly this case. Java and Go both keep the socket associated with the subchannel until it's closed, so this change makes the behavior consistent across languages.
Mark D. Roth [Fri, 7 Nov 2025 23:40:13 +0000 (15:40 -0800)]
[client channel] combine two subchannel data structures into one (#40880)
Currently, the client channel maintains two different data structures for tracking subchannels:
- `subchannel_refcount_map_`, which tracks the number of subchannel wrappers per subchannel. This is used to determine when to create and remove channelz node linkage.
- `subchannel_wrappers_`, which tracks the set of all subchannel wrappers across all subchannels. This was originally introduced back in #20039 as part of propagating the health check service name to subchannels, but we switched to using a different approach for the health check service name in #26441. However, in between those two, we started using this data structure for handling keepalive time in #23313 -- which is actually somewhat inefficient, since we may wind up setting the keepalive time on the underlying subchannel more than once in the case where there is more than one subchannel wrapper per subchannel (which happens frequently during updates).
This PR combines these two data structures into one: a map from subchannel to a set of subchannel wrappers for that subchannel. This is used both for channelz node updates and keepalive propagation -- and it's more efficient for the latter, because we can now update each subchannel exactly once.
This also paves the way for a subsequent change that will be needed as part of the MAX_CONCURRENT_STREAMS design.
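The combined structure can be sketched with simplified types (string keys and int wrappers stand in for the real subchannel and wrapper pointers; function names are invented):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Sketch: one map from subchannel to its set of wrappers replaces the
// separate refcount map and the flat wrapper set. Channelz linkage is driven
// by the set becoming non-empty/empty, and keepalive can be applied once per
// subchannel (once per map key) instead of once per wrapper.
using SubchannelMapSketch = std::map<std::string, std::set<int>>;

// Returns true when this is the first wrapper for the subchannel,
// i.e. when channelz node linkage should be created.
inline bool AddWrapperSketch(SubchannelMapSketch& m,
                             const std::string& subchannel, int wrapper) {
  auto& wrappers = m[subchannel];
  bool first = wrappers.empty();
  wrappers.insert(wrapper);
  return first;
}

// Returns true when the last wrapper for the subchannel was removed,
// i.e. when channelz node linkage should be removed.
inline bool RemoveWrapperSketch(SubchannelMapSketch& m,
                                const std::string& subchannel, int wrapper) {
  auto it = m.find(subchannel);
  if (it == m.end()) return false;
  it->second.erase(wrapper);
  if (!it->second.empty()) return false;
  m.erase(it);
  return true;
}
```

Iterating the map keys then touches each subchannel exactly once for keepalive, even when it has several wrappers.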
Luwei Ge [Fri, 7 Nov 2025 21:55:29 +0000 (13:55 -0800)]
[core][credentials] share the common http token fetching logic between oauth2 and jwt call creds (#40907)
Those two call creds only differ in how they parse the token from the HTTP response. So I changed the meaning of the `on_done_` in the `FetchRequest` to also include the parsing logic and let each credentials prepare this callback that includes the parsing and the one given to the `FetchToken` call.
The HTTP response status handling of both credentials was switched to follow the `JwtTokenFetcherCredentials`, as that supports better retry behavior.
This approach inevitably changed a little bit the error msg the JWT token fetcher returns, but it should still provide useful info to the caller.
NOTE: This PR does not touch the `ExternalFetchRequest` in `external_account_credentials.h`. That type of credentials does not always make HTTP requests; it makes multiple hops to get the final token, so it introduced the `FetchBody` class. I find it challenging to unify all three cases.
This refactors the call buffering code for the v1 stack, which avoids some repetition between the resolver queue and the LB pick queue. This code will also be used in the future in the subchannel as part of implementing the MAX_CONCURRENT_STREAMS design.
As part of this, I also eliminated the subclassing in the v1 client channel implementation, which has not been necessary since the v2 code was removed.
[BCR] Update BCR maintainers in the metadata template. (#41024)
Update the template used to export BCR releases. Fixes the issue
with emeritus owners getting pinged/assigned for BCR gRPC PRs.
Once this is merged, @yuanweiz will manually update corresponding
https://github.com/bazelbuild/bazel-central-registry/blob/1c61954df7e15bfbd03eef8accc7ae4e98ec548b/modules/grpc/metadata.json
- Offboard @veblush, @eugeneo, @hork, @yashkt, @apolcyn, @stanleycheung
- Add @yuanweiz, BCR integration owner
- Add @asheshvidyut, now owns Ruby and other wrapped languages
Tanvi Jagtap [Tue, 4 Nov 2025 14:48:38 +0000 (06:48 -0800)]
[PH2][Debugging][Logging] Adding logs
It is painful to debug http2_client_transport_test without these logs. Very hard to see which frames got serialized in which order.
[Fix][Python] Increase timeout for linux distribtests jobs for release and master branch (#41000)
The newly migrated pyproject.toml build system (in #40833) seems to have much slower build times compared to the previous setup.py builds. This difference has also been noted by a few users ([Ref1](https://github.com/pypa/pip/issues/7294), [Ref2](https://zameermanji.com/blog/2021/6/14/building-wheels-with-pip-is-getting-slower/)), most likely due to the overhead of creating isolated build environments for each build.
Some new runtimes for each variant target noted from different recent runs below:
```
2025-11-01 10:19:02,937 PASSED: build_artifact.python_musllinux_1_2_x86_cp312-cp312 [time=4267.1sec, retries=0:0]
2025-11-01 10:19:05,501 PASSED: build_artifact.python_musllinux_1_2_x86_cp311-cp311 [time=4269.7sec, retries=0:0]
2025-11-01 10:19:07,152 PASSED: build_artifact.python_musllinux_1_2_x86_cp310-cp310 [time=4287.0sec, retries=0:0]
```
The builds take anywhere between 70-90 minutes for each target. We have a total of 30 targets currently, and 12 targets are built in parallel per batch. So with about 3 batches of 12 targets each and a worst-case time of 90 minutes per batch, the `build_artifact` phase itself should take about 4.5 hours to complete, followed by the remaining `package` and `distribtest` phases, which should be comparatively shorter.
So about 7 hours should be enough, but to keep an extra buffer I am increasing the timeout to 8 hours.
Tanvi Jagtap [Mon, 3 Nov 2025 08:31:20 +0000 (00:31 -0800)]
[PH2][FlowControl][Bug] Stall ReadLoop
Prioritize sending flow control updates over reading data. If we continue reading while Urgent flow control updates are pending, we might exhaust the flow control window. This prevents us from sending window updates to the peer, causing the peer to block unnecessarily while waiting for flow control tokens.
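The prioritization above can be sketched as a single loop-step decision; the enum and function are invented for illustration:

```cpp
#include <cassert>

enum class LoopActionSketch { kSendWindowUpdate, kReadData };

// Sketch: if an urgent flow-control update is pending, the read loop stalls
// and the window update is written first, instead of consuming more of the
// flow-control window by reading further data.
inline LoopActionSketch NextActionSketch(bool urgent_update_pending) {
  return urgent_update_pending ? LoopActionSketch::kSendWindowUpdate
                               : LoopActionSketch::kReadData;
}
```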
Ashesh Vidyut [Sat, 1 Nov 2025 02:22:46 +0000 (19:22 -0700)]
[Ruby] Fix Ruby Distribution Test Linux (#40978)
### Description
Since October 25, 2025, the Distribution Tests Ruby Linux job has been failing. On the same day a new version of [rake-compiler-dock](https://rubygems.org/gems/rake-compiler-dock) was released: `1.10.0`.
To temporarily fix the job, pin `rake-compiler-dock` to the previous version, `1.9.1`.
Sergii Tkachenko [Thu, 30 Oct 2025 20:53:32 +0000 (13:53 -0700)]
[interop] Add v1.76.0 for C++, Ruby, PHP (#40975)
Python version not updated because the image build is broken by a runtime dependency issue. The fix #40959 can't be backported retroactively to the tag.
Siddharth Nohria [Thu, 30 Oct 2025 20:19:16 +0000 (13:19 -0700)]
[Part 1] Resource Quota Write memory tracking.
Pass a MemoryAllocator to the serialize function, so that the write memory can be allocated towards Resource Quota accounting. Add templates for SerializationTraits, to allow implementations to continue using an implementation of Serialize, which does not take the allocator as a parameter. This change is a no-op for now, because all the callers of Serialize pass nullptr for the allocator.
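The compatibility shim described above can be sketched with a detection trait; all names here are invented, and this only illustrates the general overload-dispatch technique, not the actual SerializationTraits templates:

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

struct MemoryAllocatorSketch { int bytes_tracked = 0; };

// Detect whether T has Serialize(MemoryAllocatorSketch*).
template <typename T, typename = void>
struct HasAllocatorSerialize : std::false_type {};
template <typename T>
struct HasAllocatorSerialize<
    T, std::void_t<decltype(std::declval<T>().Serialize(
           static_cast<MemoryAllocatorSketch*>(nullptr)))>> : std::true_type {};

// Sketch: callers always pass an allocator; implementations that still have
// the old allocator-free signature keep working via overload detection.
template <typename T>
int SerializeSketch(T& msg, MemoryAllocatorSketch* alloc) {
  if constexpr (HasAllocatorSerialize<T>::value) {
    return msg.Serialize(alloc);  // new path: memory counted against quota
  } else {
    return msg.Serialize();       // legacy path: no allocator parameter
  }
}

struct NewStyleMsg {
  int Serialize(MemoryAllocatorSketch* a) {
    if (a) a->bytes_tracked += 8;  // charge the write toward the quota
    return 8;
  }
};
struct OldStyleMsg {
  int Serialize() { return 4; }  // pre-existing allocator-free signature
};
```

Passing `nullptr` for the allocator (as all current callers do) keeps the change a no-op, matching the PR description.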
Tanvi Jagtap [Thu, 30 Oct 2025 07:14:58 +0000 (00:14 -0700)]
[PH2][Flowcontrol] Multiple changes
1. Calling TriggerWriteCycle after WindowUpdate
2. Using the Finish() function which was earlier not used in the transport
3. Making internal flow control class objects non-copyable, non-movable and non-assignable.
4. Adding missing headers
Craig Tiller [Thu, 30 Oct 2025 04:08:18 +0000 (21:08 -0700)]
Export gRPC counters --> OTEL
This CL starts the work to export the new instrument domains system to OTEL.
Only counters are attempted at this point; the work is behind an experiment, and that experiment is disabled.