Arjan Singh Bal [Tue, 28 Oct 2025 16:59:58 +0000 (22:29 +0530)]
pickfirst: Remove old pickfirst (#8672)
Fixes: #8561
Addresses: #6472
The new pickfirst has been the default since [gRPC Go
v1.71.0](https://github.com/grpc/grpc-go/releases/tag/v1.71.0) and all
reported bugs have been fixed. This PR removes the old pickfirst policy
completely.
The exported symbols in the `pickfirstleaf` package are retained with a
deprecation notice for removal after one release.
RELEASE NOTES:
* pickfirst: Remove the old `pick_first` LB policy. The new `pick_first`
has been the default since v1.71.0.
Arjan Singh Bal [Tue, 28 Oct 2025 06:40:03 +0000 (12:10 +0530)]
mem: Remove Reader interface and export the concrete struct (#8669)
This PR changes the exported slice reader from an interface to a
concrete struct.
This approach follows the precedent set by standard library packages,
such as [`bufio`'s `bufio.Reader`](https://pkg.go.dev/bufio#Reader).
This interface was not intended for users to implement, and gRPC does
not plan to provide alternative implementations. Users who require an
interface for abstraction or testing can define one in their own
packages.
This change provides two main advantages:
* Performance: It avoids a couple of heap allocations per stream that
were previously required to hold the interface value.
* Maintainability: Adding new methods to the concrete struct is a
backward-compatible change, whereas adding methods to an interface is a
breaking change.
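The trade-off above can be sketched as follows. This is a minimal illustration with hypothetical names (`SliceReader`, `byteSource`, `drain`), not the actual `mem` package API: the library exports a concrete struct, and a caller that wants an abstraction declares a small interface in its own package.

```go
package main

import "fmt"

// SliceReader is a hypothetical stand-in for a concrete reader type
// exported by a library. It is NOT the real mem package API.
type SliceReader struct {
	data []byte
	pos  int
}

// Remaining reports how many bytes are left to read.
func (r *SliceReader) Remaining() int { return len(r.data) - r.pos }

// Read copies up to len(p) bytes into p and advances the reader.
func (r *SliceReader) Read(p []byte) int {
	n := copy(p, r.data[r.pos:])
	r.pos += n
	return n
}

// byteSource is declared by the caller, not by the library. Any type
// with these methods satisfies it, including *SliceReader and test fakes.
type byteSource interface {
	Remaining() int
	Read(p []byte) int
}

// drain reads everything from src in chunks of up to 4 bytes.
func drain(src byteSource) []byte {
	var out []byte
	buf := make([]byte, 4)
	for src.Remaining() > 0 {
		n := src.Read(buf)
		out = append(out, buf[:n]...)
	}
	return out
}

func main() {
	r := &SliceReader{data: []byte("hello, world")}
	fmt.Println(string(drain(r)))
}
```

Because the interface lives on the caller's side, the library can later add methods to the struct without breaking this code.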
## Benchmarks
```sh
# test command
$ go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
-compression=off -maxConcurrentCalls=200 -trace=off \
-reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}"
```
Pranjali-2501 [Mon, 27 Oct 2025 08:21:01 +0000 (13:51 +0530)]
xds/googlec2p: support custom bootstrap config per channel. (#8648)
xds/googlec2p: Fix channel-specific xDS bootstrap configurations by
allowing xdsclient creation with per-target config. Removes global
fallback config usage, enabling multiple distinct xDS clients to coexist
in the same process.
### Notable implementation details
* `grpc.lb.backend_service` is not implemented yet (marked as optional
in the gRFC)
* modifies the tests to make sure we can cover all the cases for
`enforced`/`unenforced` without repeating the test setup.
RELEASE NOTES:
* outlierdetection: add metrics for enforced
(grpc.lb.outlier_detection.ejections_enforced) and unenforced
(grpc.lb.outlier_detection.ejections_unenforced) outlier ejections.
Arjan Singh Bal [Tue, 21 Oct 2025 05:14:02 +0000 (10:44 +0530)]
credentials/tls: Remove environment variable for disabling ALPN (#8660)
Related issue: https://github.com/grpc/grpc-go/issues/434
RELEASE NOTES:
* credentials/tls: Remove the `GRPC_ENFORCE_ALPN_ENABLED` environment
variable. ALPN is now enforced by default. Users who must disable ALPN
enforcement can temporarily use the [experimental transport
credentials](https://pkg.go.dev/google.golang.org/grpc@v1.76.0/experimental/credentials).
These experimental credentials will be removed in an upcoming release;
users who depend on them must vendor this version of gRPC or copy the
relevant code into their own codebase.
Arjan Singh Bal [Fri, 17 Oct 2025 06:49:47 +0000 (12:19 +0530)]
stats: Re-use objects while calling multiple Handlers (#8639)
This PR improves performance by eliminating heap allocations when
multiple stats handlers are configured.
Previously, iterating through a list of handlers caused one heap
allocation per handler for each RPC. This change introduces a Handler
that combines multiple Handlers and implements the `Handler` interface.
The combined handler delegates calls to the handlers it contains.
This approach allows gRPC clients and servers to operate as if there
were only a single `Handler` registered, simplifying the internal logic
and removing the per-RPC allocation overhead. To avoid any performance
impact when stats are disabled, the combined `Handler` is only created
when at least one handler is registered.
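A rough sketch of the combined-handler idea, assuming a simplified one-method `Handler` interface (the real `stats.Handler` has more methods, and `combinedHandler`, `newCombinedHandler`, and `countingHandler` are hypothetical names used only for illustration):

```go
package main

import "fmt"

// Handler is a simplified, hypothetical stand-in for a stats handler
// interface; the real stats.Handler has more methods.
type Handler interface {
	HandleRPC(event string)
}

// combinedHandler fans each call out to the handlers it contains and
// itself satisfies Handler, so callers see a single handler.
type combinedHandler struct {
	handlers []Handler
}

func (c *combinedHandler) HandleRPC(event string) {
	for _, h := range c.handlers {
		h.HandleRPC(event)
	}
}

// newCombinedHandler returns nil when no handlers are registered, so
// the hot path can skip stats work with a single nil check.
func newCombinedHandler(hs ...Handler) Handler {
	if len(hs) == 0 {
		return nil
	}
	return &combinedHandler{handlers: hs}
}

// countingHandler counts the events it receives (demo only).
type countingHandler struct{ n int }

func (c *countingHandler) HandleRPC(string) { c.n++ }

func main() {
	a, b := &countingHandler{}, &countingHandler{}
	h := newCombinedHandler(a, b)
	if h != nil {
		h.HandleRPC("begin")
		h.HandleRPC("end")
	}
	fmt.Println(a.n, b.n)
}
```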
# Tested
Since the existing benchmarks don't register stats handlers, I modified the
benchmark to add 2 stats handlers each on the server and client
(https://github.com/grpc/grpc-go/pull/8639/commits/36ba616d7c40deb1cb79d5c8d1636057f94fc88a).
```sh
# test command
go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
-compression=off -maxConcurrentCalls=200 -trace=off \
-reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}"
```
There is a new version of the envoy protos (v1.35.0) that was released
recently that contains proto changes required for the ext_authz support
that I'm currently working on.
Ran the following command twice (as mentioned in our release docs):
```
for x in $(find . -name 'go.mod' | xargs dirname | sort); do
pushd "${x}"
go get -u ./...
go mod tidy -compat=1.24
popd
done
```
The cancel method is not called in
[`deleteStream`](https://github.com/grpc/grpc-go/blob/2d922719c02bb46f34482d592c35e72dc4a9ad92/internal/transport/http2_server.go#L1302).
This change invokes `deleteStream` through `closeStream` in the flaking
test to ensure the stream is always cancelled to avoid leaking timers.
This PR *only* changes the implementation of the listener resource type
to adhere to the external xdsclient API. The other resource type
implementations will be handled in subsequent PRs, and once all resource
type implementations have switched to the external xdsclient API, we can
get rid of some of existing APIs.
eshitachandwani [Tue, 14 Oct 2025 09:01:27 +0000 (14:31 +0530)]
delegatingresolver: add default port to addresses (#8613)
Fixes: https://github.com/grpc/grpc-go/issues/8607
RELEASE NOTES:
- Fixes a bug where the default port 443 was not added to port-less
addresses sent to the proxy.
- Adds a new environment variable
`GRPC_EXPERIMENTAL_ENABLE_DEFAULT_PORT_FOR_PROXY_TARGET`, set by default,
that controls adding a default port to addresses sent to the proxy.
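The general technique of adding a default port to a port-less address can be sketched with the standard `net` package. The helper name `withDefaultPort` is hypothetical and this is not the delegating resolver's actual code; it only illustrates the behavior the release note describes.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// withDefaultPort returns addr unchanged when it already carries a
// port, and otherwise appends defaultPort. Hypothetical helper.
func withDefaultPort(addr, defaultPort string) string {
	if _, port, err := net.SplitHostPort(addr); err == nil && port != "" {
		return addr // an explicit port is already present
	}
	host := addr
	// Strip the brackets of a bare IPv6 literal such as "[::1]" so that
	// JoinHostPort does not add a second pair.
	if strings.HasPrefix(host, "[") && strings.HasSuffix(host, "]") {
		host = host[1 : len(host)-1]
	}
	return net.JoinHostPort(host, defaultPort)
}

func main() {
	for _, a := range []string{"example.com", "example.com:8080", "::1", "[::1]:50051"} {
		fmt.Println(withDefaultPort(a, "443"))
	}
}
```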
xdsclient: fix the flaky ADS stream restart test (#8631)
The ADS stream restart test can be flaky for the following reason:
- It requests a CDS resource and unrequests it before the stream breaks.
- And then once the stream restarts, it verifies that this resource is
not requested again.
- But the ACK for this resource may or may not be received at the
management server before the stream breaks. This can falsely cause the
test to conclude that the resource was re-requested after the restart.
This PR changes the test in the following ways:
- Use a single resource
- Verify ACK before the stream is restarted
Arjan Singh Bal [Thu, 9 Oct 2025 04:19:03 +0000 (09:49 +0530)]
transport: Replace closures with interfaces to avoid heap allocations (#8630)
In Go, creating a closure results in a heap allocation if the compiler
determines the closure might outlive the function in which it was
created. This change removes two such closures, replacing them with
interfaces that are implemented by the `ClientStream` and `ServerStream`
structs.
While this pattern may slightly reduce readability, the performance
benefit is worthwhile, as this transport code is executed for every new
stream. This reduces allocs/unary RPC by 2.5%.
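As an illustration of the pattern, here is a sketch with hypothetical names (`windowUpdater`, `stream`, `readWithClosure`, `readWithInterface`); it is not the transport's actual code. A callback passed as a closure captures variables in a closure object that may escape to the heap, whereas passing an existing pointer as an interface value reuses that pointer.

```go
package main

import "fmt"

// windowUpdater is a hypothetical callback interface. Passing a *stream
// as this interface stores only the existing pointer, so no separate
// closure object is needed.
type windowUpdater interface {
	updateWindow(n int)
}

type stream struct {
	id     int
	window int
}

func (s *stream) updateWindow(n int) { s.window += n }

// readWithClosure takes a func value. A closure passed here captures
// its variables and may be heap-allocated if it escapes.
func readWithClosure(n int, onRead func(int)) { onRead(n) }

// readWithInterface takes the stream itself through an interface.
func readWithInterface(n int, u windowUpdater) { u.updateWindow(n) }

func main() {
	s := &stream{id: 1}
	readWithClosure(10, func(n int) { s.window += n })
	readWithInterface(10, s)
	fmt.Println(s.window)
}
```

Whether a given closure actually allocates depends on the compiler's escape analysis, so any saving should be confirmed with a benchmark or `go build -gcflags=-m`.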
## Testing
```sh
# test command
go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
-compression=off -maxConcurrentCalls=500 -trace=off \
-reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}" -recvBufferPool=simple
```
Arjan Singh Bal [Thu, 9 Oct 2025 04:03:51 +0000 (09:33 +0530)]
benchmark/benchmain: Enable buffer pooling by default (#8638)
This PR enables buffer pooling in the benchmark test to align with the
library's current default configuration.
The benchmark originally disabled buffer pooling because the feature was
opt-in when introduced (#5862). Since buffer pooling is now enabled by
default (#7356), this change ensures the benchmark accurately measures
the performance of gRPC's default behavior.
Madhav Bissa [Wed, 8 Oct 2025 20:28:35 +0000 (01:58 +0530)]
client: ignore http status header for gRPC streams (#8548)
Fixes https://github.com/grpc/grpc-go/issues/8486
When a gRPC response is received with content type application/grpc, we
then do not expect any information in the http status and the status
information needs to be conveyed by gRPC status only.
In case of a missing gRPC status, we will return an Internal error instead
of Unknown, in accordance with https://grpc.io/docs/guides/status-codes/
Changes:
- Ignore http status in case of content type application/grpc
- Change the default rawStatusCode to return Internal for missing grpc
status
RELEASE NOTES:
* client: Ignore the HTTP header status for gRPC streams and return
Internal error for missing gRPC status.
xds: Store WeightedClusters as a slice instead of a map inside the Route (#8632)
Reasons for this change:
- The `WeightedClusters` field is never used as a map.
- Weighted clusters are stored as a list in the original envoy proto as
well.
- In tests that require deterministic WRR behavior, we use the
`testutils.NewTestWRR` to get rid of the randomness. But the output
still depends on the order in which items are added to the WRR. Maps in
Go are non-deterministic.
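A small sketch of why a slice helps here, using a hypothetical `weightedCluster` type rather than the real xDS resource struct: ranging over a slice always visits items in insertion order, so the order in which they are added to a WRR is reproducible, which is not guaranteed when ranging over a map.

```go
package main

import "fmt"

// weightedCluster is a hypothetical stand-in for a route's weighted
// cluster entry.
type weightedCluster struct {
	Name   string
	Weight uint32
}

// addOrder returns the cluster names in the order in which they would
// be added to a WRR. With a slice this order is stable across runs.
func addOrder(clusters []weightedCluster) []string {
	names := make([]string, 0, len(clusters))
	for _, c := range clusters {
		names = append(names, c.Name)
	}
	return names
}

func main() {
	clusters := []weightedCluster{
		{Name: "A", Weight: 75},
		{Name: "B", Weight: 25},
	}
	fmt.Println(addOrder(clusters))
}
```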
Elric [Tue, 7 Oct 2025 22:25:07 +0000 (07:25 +0900)]
transport: Increment metrics only when the stream is active (#8573)
Fixes: https://github.com/grpc/grpc-go/issues/8529
This PR ensures that metrics are incremented only when the stream is
active, i.e. when it is found in the activeStreams map.
#### as-is
- `deleteStream` incremented channelz metrics every time it was called,
even when the stream had already been removed from activeStreams or had
never existed in it.
#### to-be
- Added check to ensure metrics are only incremented once when a stream
is actually removed from activeStreams.
RELEASE NOTES:
* server: Fix a bug that caused overcounting of channelz metrics for
successful and failed streams.
Arjan Singh Bal [Tue, 7 Oct 2025 09:41:48 +0000 (15:11 +0530)]
transport: Reduce pointer usage in Stream structs (#8624)
The pprof profiles for unary RPC benchmarks indicate significant time
spent in `runtime.mallocgc` and `runtime.gcBgMarkWorker`. This indicates
gRPC is spending significant CPU cycles allocating or garbage
collecting.
This change reduces the number of pointer fields in the structs that
represent client and server streams. This will reduce the number of memory
allocations (faster) and also reduce pressure on the garbage collector
(faster garbage collections) since the GC doesn't need to scan
non-pointer fields. For structs which were stored as pointers to ensure
values are not copied, a `noCopy` struct is embedded that will cause `go
vet` to fail if copies are performed. Non-pointer fields are also moved
to the end of the struct to improve allocation speed.
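A minimal sketch of the `noCopy` idiom, assuming hypothetical `stream` and `header` types: `go vet`'s copylocks check reports copies of any struct containing a type that has `Lock` and `Unlock` methods, so a zero-size marker field is enough to catch accidental copies of a struct that now stores values inline instead of behind pointers.

```go
package main

import "fmt"

// noCopy is a zero-size marker. go vet's copylocks check flags copies
// of structs that contain it because it has Lock and Unlock methods.
type noCopy struct{}

func (*noCopy) Lock()   {}
func (*noCopy) Unlock() {}

// header is a hypothetical value that used to be stored as *header.
type header struct{ method string }

// stream stores hdr inline, which saves one allocation and gives the
// garbage collector one less pointer to follow.
type stream struct {
	noCopy noCopy
	hdr    header
	id     uint32
}

// newStream allocates the stream and its header in a single object.
func newStream(id uint32, method string) *stream {
	return &stream{id: id, hdr: header{method: method}}
}

func main() {
	s := newStream(1, "/pkg.Service/Method")
	fmt.Println(s.id, s.hdr.method)
	// Writing `c := *s` here would still compile, but go vet would
	// report that the assignment copies a lock value.
}
```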
## Results
There are improvements in QPS, latency and allocs/op for unary RPCs.
```sh
# test command
go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
-compression=off -maxConcurrentCalls=500 -trace=off \
-reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}" -recvBufferPool=simple
```
Gregory Cooke [Fri, 3 Oct 2025 19:32:56 +0000 (15:32 -0400)]
testing: SPIFFE Bundle Maps - Swap to a real unsupported key type (#8626)
EC Keys are actually supported. The test using this file previously
failed because we had `EC` in the `kty` field, but not the associated
`crv`, `x`, and `y` values. Change this test to use an actual
unsupported key type.
When using the health producer for health checks, and the health package
is not imported by the application, a no-op health producer is used
without logging any errors. This PR adds an error log similar to the one
for the old health checks started by the subchannel.
Arjan Singh Bal [Fri, 3 Oct 2025 04:05:19 +0000 (09:35 +0530)]
pickfirstleaf: fix bug in address de-duplication (#8611)
Due to a bug in the new pickfirst balancer, it wasn't de-duplicating
addresses in the resolver update. The only user visible impact of this
seems to be less frequent picker updates after the first pass in happy
eyeballs and incorrect interleaving of IPv4/IPv6 addresses during the
first happy eyeballs pass.
RELEASE NOTES:
* balancer/pickfirst: Fix a bug where duplicate addresses were not being ignored as intended.
examples/health: fix markdown formatting and improve content (#8625)
This PR fixes some markdown formatting issues flagged by an internal
tool (when attempting to import recent changes into google3). It also
makes minor improvements to the content.
xdsclient: fix race in ADS stream flow control causing indefinite blocking (#8605)
Fixes https://github.com/grpc/grpc-go/issues/8594
The above issue clearly describes the condition under which the race
manifests. The changes in this PR are as follows:
- Remove the `readyCh` field in the flow control that was previously
used to block when waiting for flow control. Instead use a condition
variable.
- Have two bits of state inside the flow control:
- One to indicate if there is a pending update that is awaiting
consumption by all watchers
- One to indicate that the stream is closed
- The flow control object no longer needs to be recreated every time a
new stream is created
- The flow control object is stopped when the `adsStreamImpl` is stopped
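The design described in the list above can be sketched roughly as follows. The type and method names (`flowControl`, `setPending`, `onDone`, `stop`, `wait`) are hypothetical and the real `xdsclient` implementation may differ; the sketch only shows two bits of state guarded by a mutex with a condition variable in place of a channel.

```go
package main

import (
	"fmt"
	"sync"
)

// flowControl is a hypothetical sketch, not the actual xdsclient type.
type flowControl struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending bool // an update is awaiting consumption by all watchers
	stopped bool // the stream is closed
}

func newFlowControl() *flowControl {
	fc := &flowControl{}
	fc.cond = sync.NewCond(&fc.mu)
	return fc
}

// setPending records that an update was handed to the watchers.
func (fc *flowControl) setPending() {
	fc.mu.Lock()
	defer fc.mu.Unlock()
	fc.pending = true
}

// onDone records that all watchers consumed the update.
func (fc *flowControl) onDone() {
	fc.mu.Lock()
	defer fc.mu.Unlock()
	fc.pending = false
	fc.cond.Broadcast()
}

// stop marks the stream as closed and releases any waiter.
func (fc *flowControl) stop() {
	fc.mu.Lock()
	defer fc.mu.Unlock()
	fc.stopped = true
	fc.cond.Broadcast()
}

// wait blocks while an update is pending and the stream is open. It
// returns false if the stream was stopped.
func (fc *flowControl) wait() bool {
	fc.mu.Lock()
	defer fc.mu.Unlock()
	for fc.pending && !fc.stopped {
		fc.cond.Wait()
	}
	return !fc.stopped
}

func main() {
	fc := newFlowControl()
	fc.setPending()
	done := make(chan bool)
	go func() { done <- fc.wait() }()
	fc.onDone()
	fmt.Println(<-done)
}
```

Because the waiter re-checks the state under the lock, it does not matter whether `onDone` or `stop` runs before or after `wait` starts, which is the property a one-shot ready channel makes harder to guarantee.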
This PR also makes other minor changes:
- Fix a flaky test by ensuring that the test stream implementation
unblocks from a `Recv` call when the underlying stream context is
cancelled
- Couple of logging improvements
RELEASE NOTES:
- xdsclient: fix a race in the ADS stream implementation that could
result in resource-not-found errors, causing the gRPC client channel to
move to `TransientFailure`
examples: improve interceptor example with better markdown formatting (#8612)
This PR changes the existing example to use fenced code blocks instead
of backticks to show the interceptor signatures. This greatly improves
how the document is rendered. The PR also changes the word `overload` to
`override` because it is the latter that we are showcasing here, and Go
does not support overloading anyway.
eshitachandwani [Wed, 1 Oct 2025 19:24:25 +0000 (00:54 +0530)]
xds/cdsbalancer: change tests to use xds resolver (#8579)
Change the tests in xds cds balancer to use xds resolver instead of
manual resolver
This change is being done as part of gRFC [A74 : xDS Config
tears](https://github.com/grpc/proposal/blob/master/A74-xds-config-tears.md).
This is to make sure the tests pass after the change too.
xds/resolver: minor cleanup in the config selector implementation (#8609)
This PR injects the dependencies of `configSelector` at creation time,
instead of passing a reference to the `xdsResolver` and having the
former directly access fields from the latter.
On connection breakage, the pickfirst leaf balancer enters idle and
returns an `Idle picker` that calls the balancer's `ExitIdle` method
only the first time `Pick` is called. The following sequence of events
will cause the balancer to get stuck in `Idle` state:
1. Existing connection breaks, SubConn [requests re-resolution and
reports
IDLE](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/clientconn.go#L1388-L1393).
In turn PF updates the ClientConn state to IDLE with an `Idle picker`.
1. An RPC is made, triggering `balancer.ExitIdle` through the idle
picker. The balancer attempts to re-connect the failed SubConn.
1. The resolver produces a new endpoint list, removing the endpoint used
by the existing SubConn. PF removes the existing SubConn. Since the
balancer didn't update the ClientConn state to CONNECTING yet, pickfirst
thinks that it's still in IDLE and doesn't start connecting to the new
endpoints.
1. New RPC requests trigger the idle picker, but it's a no-op since it
only [triggers the balancer's ExitIdle method
once](https://github.com/grpc/grpc-go/blob/bb71072094cf533965450c44890f8f51c671c393/balancer/pickfirst/pickfirstleaf/pickfirstleaf.go#L663).
## Fix
This change moves the ClientConn into Connecting immediately when the
`ExitIdle` method is called. This ensures that the balancer continues to
re-connect when a new endpoint list is produced by the resolver.
RELEASE NOTES:
* balancer/pickfirst: Fix bug that can cause balancer to get stuck in
`IDLE` state on connection failure.
transport: Invoke `net.Conn.SetWriteDeadline` in `http2_client.Close` (#8534)
Fixes: #8425
This PR adds a call to `net.Conn.SetWriteDeadline`, as discussed in
https://github.com/grpc/grpc-go/issues/8425#issuecomment-3057938248.
Additionally, it updates the previous call to `SetReadDeadline` to log
any non-nil error value (this doesn't affect behavior but proved helpful
in some earlier debugging).
RELEASE NOTES:
* client: Set a read deadline when closing a transport to prevent it
from blocking indefinitely on a broken connection.
pickfirstleaf: Fix shuffling of addresses in resolver updates without endpoints (#8610)
The new `pick_first`, which is the default, doesn't shuffle the
addresses at all for resolver updates that are missing the `Endpoints`
field. This change fixes that. Since [gRPC automatically sets the
missing
`Endpoints`](https://github.com/grpc/grpc-go/blob/1059e84f885bf7ed65b3b1a4fbe914360d8ab5b1/resolver_wrapper.go#L136-L138),
occurrence of this bug should be uncommon in practice.
RELEASE NOTES:
* balancer/pick_first: When configured, shuffle addresses in resolver
updates that lack endpoints. Since gRPC automatically adds endpoints to
resolver updates, this bug should only affect implementers of custom LB
policies that use pick_first for delegation but don't forward the
endpoints.
Evan Jones [Thu, 25 Sep 2025 17:53:20 +0000 (13:53 -0400)]
examples/features/health: Clarify docs for health import (#8597)
The google.golang.org/grpc/health package must be imported for client
health checking to work. I somehow missed this, even though it is in the
README, the client example, and the health package docs. Attempt to make
it clearer with a few extra mentions, since it is quite hard to debug
this misconfiguration.
* Remove deprecated grpc.WithBlock function
* Make service config const since it isn't modified
xdsclient: improve fallback test involving three servers (#8604)
The existing fallback test that involves three servers is flaky. The
reason for the flake is because some of the resources have the same name
in different servers. The listener resource is expected to have the same
name across the different management servers, but we generally expect
the other resources to have different names.
See the following from the gRFC:
- In
https://github.com/grpc/proposal/blob/master/A71-xds-fallback.md#reservations-about-using-the-fallback-server-data,
we have the following:
```
We have no guarantee that a combination of resources from different xDS servers form a valid cohesive
configuration, so we cannot make this determination on a per-resource basis. We need any given gRPC
channel or server listener to only use the resources from a single server.
```
- In
https://github.com/grpc/proposal/blob/master/A71-xds-fallback.md#config-tears,
we have the following:
```
Config tears happen when the client winds up using some combination of resources from the primary and
fallback servers at the same time, even though that combination of resources was never validated to work
together. In theory, this can cause correctness issues where we might send traffic to the wrong location or
the wrong way, or it can cause RPCs to fail. Note that this can happen only when the primary and fallback
server use the same resource names.
```
This PR ensures that all the different management servers have different
resource names for all resources except the listener. Also, ran the test
on forge 100K times with no failures.
This PR also improves a couple of logs that I found useful when
debugging the failures.
opentelemetry: Remove chatty log in client (#8606)
Removing this debug log to reduce noise. This log fires on every RPC
call but provides no useful debugging value. The action it logs (adding
callInfo to the context) is part of the normal flow, and the message
contains no helpful variables.
benchmark: Hold read+write lock while updating server state (#8601)
The `lastResetTime` and `rusageLastReset` fields in the
`benchmarkServer` are written while holding a read lock. This can result
in concurrent modifications. This change replaces the `RWMutex` with a
regular `Mutex` to avoid such problems. This lock is acquired a couple
of times during the entire test run, so contention is not a major
concern.
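A minimal sketch of the corrected locking, with a simplified hypothetical `benchmarkServer` (only one timestamp and a counter): every writer takes the exclusive lock, so concurrent resets cannot interleave. With only a read lock held, several writers could run at once.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// benchmarkServer is a simplified, hypothetical version of the struct
// described above.
type benchmarkServer struct {
	mu            sync.Mutex
	lastResetTime time.Time
	resets        int
}

// reset writes shared fields, so it needs the exclusive lock.
func (s *benchmarkServer) reset(now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastResetTime = now
	s.resets++
}

// runConcurrentResets calls reset from n goroutines and returns the
// number of resets that were recorded.
func runConcurrentResets(n int) int {
	s := &benchmarkServer{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.reset(time.Now())
		}()
	}
	wg.Wait()
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.resets
}

func main() {
	fmt.Println(runConcurrentResets(100))
}
```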
encoding: Add a test-only function for temporarily registering compressors (#8587)
Fixes: https://github.com/grpc/grpc-go/issues/7960
This PR adds a function that allows tests to register a compressor with
arbitrary names and un-register them at the end of the test. This
prevents the compressor names from showing up in the encoding header in
subsequent tests. Previously, tests were using the name of the existing
compressor "gzip" and re-registering the original compressor to
work around this problem.
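The register-then-unregister pattern can be sketched as below. The `registry` map and `registerForTest` function are hypothetical and are not the function added by this PR; in a real test the returned function would typically be passed to `t.Cleanup`.

```go
package main

import "fmt"

// registry is a hypothetical stand-in for a global compressor registry,
// mapping a compressor name to a description of its implementation.
var registry = map[string]string{}

// registerForTest registers name and returns a function that removes it
// again, so the name does not leak into later tests.
func registerForTest(name, impl string) (unregister func()) {
	registry[name] = impl
	return func() { delete(registry, name) }
}

func main() {
	unregister := registerForTest("test-only-codec", "fake")
	_, during := registry["test-only-codec"]
	unregister()
	_, after := registry["test-only-codec"]
	fmt.Println(during, after)
}
```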
xdsclient: fix TestConcurrentReportLoad to not run for 10s (#8598)
While working on the fix for the xDS client unsubscribe/resubscribe
race, I noticed that the tests in the `internal/xds/xdsclient/tests/`
directory were taking about a minute to run. Upon inspection I found
that `TestConcurrentReportLoad` was running for the configured test
timeout duration of `10s`, but was not failing.
This PR fixes the test to run in a short duration. It also makes a
couple of other cleanups that I noticed when fixing this test.
xdsclient/tests: move fallback tests to separate directory (#8600)
Currently, tests in the `internal/xds/xdsclient/tests` package can take
close to a minute to run. Almost half of that time is taken by the
fallback tests which actually have to run longer because they have to
wait for connections to go down and come up and for these events to be
detected by the code (before fallback is triggered).
Splitting the fallback tests into a separate directory reduces the time
by almost half since tests from these two packages can now run in
parallel.
We *could* possibly add a way for tests to add some dial options (to be
used when dialing the management server), and thereby reduce the time
spent in exponential backoff before connections are reattempted (during
the fallback process). But this would require a non-trivial amount of
work, and could make the code more complicated. The change in this PR
seems like a good bang for the buck.
flowcontrol: change variable names for better understanding (#8578)
This PR aims to improve some variable names for better understanding.
Before the change, it took time for users to think about why there's a
`b` variable.
benchmark: Avoid spawning a goroutine per unary call (#8591)
The benchmark client is presently spawning a new goroutine per unary
call and blocking on its completion. Since the spawning goroutine is
blocked, it is more efficient to do the work in the spawning goroutine
itself. This change has the following effect on the [benchmark
performance](https://grafana-dot-grpc-testing.appspot.com/):
1. Unary 8-core: 184k QPS to 233k QPS (+26%)
2. Unary 30-core: 403k QPS to 624k QPS (+54%)
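The difference between the two call styles can be sketched as follows, using hypothetical helpers `callSpawning` and `callInline` instead of the benchmark client's real code. Both complete the call before returning, but the first pays for creating and scheduling a goroutine that provides no extra concurrency.

```go
package main

import "fmt"

// callSpawning runs do in a new goroutine and blocks until it finishes.
// The caller is idle the whole time, so the goroutine adds only cost.
func callSpawning(do func()) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		do()
	}()
	<-done
}

// callInline runs do directly on the calling goroutine.
func callInline(do func()) { do() }

func main() {
	calls := 0
	callSpawning(func() { calls++ })
	callInline(func() { calls++ })
	fmt.Println(calls)
}
```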
## Tested
* Ran the benchmark on the same GKE cluster to repro the results from
the dashboard.
* Created a docker image with the changes in this PR. Re-ran the
benchmark with the new image.
vet: add line numbers of offending lines to the output (#8593)
When vet fails because of offending whitespace, the output currently
only lists the offending file. This change adds the line number to the
output to make it easier on the developer to fix the issue.
credentials: Remove TODO from public godoc (#8589)
The TODO comment with a GitHub user's name shows up in the [public
godoc](https://pkg.go.dev/google.golang.org/grpc@v1.75.1/credentials#PerRPCCredentials).
Since this is a stable API, changing it now doesn't seem feasible, so
this change removes it completely.
client: minor improvements to log messages (#8564)
Couple of minor improvements to log messages from the gRPC channel
The improvements are:
- Log the target URI when we log a message for the creation of a gRPC
channel
- Separate the channelz identifier (which could be something like
`[Channel #X]` or `[Channel X][Subchannel Y]` etc) from the actual
message being logged with a space
Part one for https://github.com/grpc/proposal/pull/492 (A97).
This is done in a new `credentials/jwt` package to provide file-based
PerRPCCallCredentials. It can be used beyond XDS. The package handles
token reloading, caching, and validation as per A97.
There will be a separate PR which uses it in `xds/bootstrap`.
Whilst implementing the above, I considered `credentials/oauth` and
`credentials/xds` packages instead of creating a new one. The former
package has `NewJWTAccessFromKey` and `jwtAccess` which seem very
relevant at first. However, I think the `jwtAccess` behaviour seems more
tailored towards Google services. Also, the refresh, caching, and error
behaviour for A97 is quite different than what's already there and
therefore a separate implementation would have still made sense.
WRT `credentials/xds`, it could have been extended to both handle
transport and call credentials. However, this is a bit at odds with A97
which says that the implementation should be non-XDS specific and, from
reading between the lines, usable beyond XDS.
I think the current approach makes review easier but because of the
similarities with the other two packages, it is a bit confusing to
navigate. Please let me know whether the structure should change.
Relates to https://github.com/istio/istio/issues/53532
xds/resolver_test: fix flaky test ResolverBadServiceUpdate_NACKedWithoutCache (#8521)
Fixes: #8435
### root cause of issue:
- I think there was a race condition in the channel communication between
the xDS resolver and the test infrastructure
- insufficient buffer size: original channels (stateCh and errCh) had
only buffer size of 1
- blocking sends: When buffer is full, the resolver would block trying
to send the next update
- test deadlock: test infra might be waiting for a specific update while
the resolver was blocked trying to send a different update, creating a
deadlock
2) Non-blocking send pattern:
``` go
select {
case stateCh <- s: // the resolver tries to send the update
default: // if the channel is full, drain the old message and retry
select {
case <-stateCh:
stateCh <- s
default:
}
}
```
- this drains old messages, preventing the resolver from blocking and keeping only the latest update.
3) Cleanup with draining goroutines:
``` go
go func() {
for range stateCh { } // Drain any remaining messages
}()
```
- it ensures the resolver never blocks on sends and prevents `goroutine leaks` during test cleanup.
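For reference, here is a self-contained variant of the "keep only the latest update" send. It is a sketch under the assumption of a single sender and a buffered channel, and it differs slightly from the snippet above: the drain and the retry are both non-blocking, so the sender cannot block even if the receiver empties the channel between the two steps. The name `sendLatest` is hypothetical.

```go
package main

import "fmt"

// sendLatest delivers v on ch without blocking. If the buffer is full,
// the oldest queued value is discarded first. Assumes a single sender.
func sendLatest(ch chan int, v int) {
	select {
	case ch <- v:
		return
	default:
	}
	// Buffer full: drop one stale value if it is still there.
	select {
	case <-ch:
	default:
	}
	// Retry without blocking; with a single sender there is now room.
	select {
	case ch <- v:
	default:
	}
}

func main() {
	ch := make(chan int, 1)
	sendLatest(ch, 1)
	sendLatest(ch, 2)
	sendLatest(ch, 3)
	fmt.Println(<-ch)
}
```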
internal/buffer: set closed flag when closing channel in the Load method (#8575)
## Description
This PR fixes a bug in the `Unbounded.Load()` method where the `closed`
flag was not being set to `true` when the channel was closed.
## Problem
In the `Load()` method, when the condition `b.closing && !b.closed` is
met, the code closes the channel but doesn't update the `closed` flag.
This creates an inconsistent state where:
- The channel is closed (no more data can be sent)
- But `b.closed` remains `false`
This inconsistency could potentially cause issues in code that relies on
the `closed` flag to determine the buffer's state.
## Solution
Added `b.closed = true` before `close(b.c)` in the `else if` branch of
the `Load()` method to ensure the closed flag accurately reflects the
buffer's state.
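The effect of the fix can be illustrated with a simplified, hypothetical buffer (`unbounded` with an `int` channel; the real `Unbounded` type has more fields and methods). Without setting the flag, a second `load` call after closing would try to close the channel again.

```go
package main

import (
	"fmt"
	"sync"
)

// unbounded is a hypothetical, simplified version of the buffer
// described above.
type unbounded struct {
	mu      sync.Mutex
	c       chan int
	backlog []int
	closing bool
	closed  bool
}

// load moves the next backlog item to the channel, or closes the
// channel once the buffer is closing and the backlog is empty.
func (b *unbounded) load() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.backlog) > 0 {
		select {
		case b.c <- b.backlog[0]:
			b.backlog = b.backlog[1:]
		default:
		}
	} else if b.closing && !b.closed {
		b.closed = true // keep the flag in sync with the channel state
		close(b.c)
	}
}

func main() {
	b := &unbounded{c: make(chan int, 1), closing: true}
	b.load()
	b.load() // safe: the closed flag prevents a second close
	_, open := <-b.c
	fmt.Println(b.closed, open)
}
```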
## Changes
- **File**: `internal/buffer/unbounded.go`
- **Method**: `Load()`
- **Line**: 86
- **Change**: Added `b.closed = true` before closing the channel
## Testing
- ✅ All existing tests pass
- ✅ No linter errors introduced
- ✅ The fix ensures consistent state between channel closure and closed
flag
## Impact
This is a bug fix that improves the correctness of the `Unbounded`
buffer implementation without changing its public API or behavior from a
user perspective.
Roy Salame [Mon, 15 Sep 2025 05:21:51 +0000 (01:21 -0400)]
encoding/proto: enable use cached size option (#8569)
Enable UseCachedSize in proto marshal to eliminate redundant size
computation
Fixes: https://github.com/grpc/grpc-go/issues/8570
The proto message size was previously being computed twice: once before
marshalling and again during the marshalling call itself. In
high-throughput workloads, this duplicated computation is expensive.
By enabling `UseCachedSize` on `MarshalOptions`, we reuse the size
calculated immediately before marshalling, avoiding the second call to
`proto.Size`.
In our application, the redundant size call accounted for ~12% of total
CPU time. With this change, we eliminate that overhead while preserving
correctness.
transport: avoid slice reallocation during header creation (#8547)
This PR improves the size estimate while pre-allocating `headerFields`
to avoid reallocations, which pprof showed were responsible for ~4% of
total memory allocations. This change improves performance, increasing
QPS by 1% while reducing bytes/op by 4% and latencies by 0.3-4%.
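The general technique is to size the slice once, up front, so that later `append` calls never have to grow it. The sketch below uses hypothetical names (`headerField`, `buildHeaders`) and a made-up set of fixed fields; it is not the transport's actual size estimate.

```go
package main

import "fmt"

// headerField is a hypothetical name/value pair.
type headerField struct{ Name, Value string }

// buildHeaders pre-allocates room for the fixed fields plus all of the
// metadata entries, so no append below reallocates the backing array.
func buildHeaders(path string, metadata []headerField) []headerField {
	const fixedFields = 3 // :method, :scheme, :path
	hf := make([]headerField, 0, fixedFields+len(metadata))
	hf = append(hf,
		headerField{Name: ":method", Value: "POST"},
		headerField{Name: ":scheme", Value: "http"},
		headerField{Name: ":path", Value: path},
	)
	hf = append(hf, metadata...)
	return hf
}

func main() {
	md := []headerField{{Name: "x-a", Value: "1"}, {Name: "x-b", Value: "2"}}
	hf := buildHeaders("/pkg.Service/Method", md)
	fmt.Println(len(hf), cap(hf))
}
```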
Revert "stats/opentelemetry: record retry attempts from clientStream (#8342)" (#8571)
This introduced flakiness in a test -
Test/TraceSpan_WithRetriesAndNameResolutionDelay
Failure:
https://github.com/grpc/grpc-go/actions/runs/17614152882/job/50042942932?pr=8547
Related issue: https://github.com/grpc/grpc-go/issues/8299
GoogleC2P: remove dependency on metadata server for IPv6 node metadata (#8550)
Remove reliance on the metadata server since its result is no longer
needed; hardcode IPv6 support in node metadata instead.
Related c++ change: https://github.com/grpc/grpc/pull/40571
Note we preserve prior behavior in case experiment `NewPickFirstEnabled`
is disabled, because our testing/qualification has not covered that
being disabled.
xds: move env var check for HTTP CONNECT metadata parsing to endpoint and locality parsing functions (#8551)
Currently, the env var check for parsing HTTP CONNECT metadata (A86) is
inside the function that parses custom metadata,
`validateAndConstructMetadata`.
This PR moves the check to the endpoint and locality parsing functions,
`parseEndpoint` and the top-level `parseEDSRespProto` which is where
localities are parsed. This allows multiple env vars to control
different custom metadata keys. We already support two custom metadata
keys (A76 and A86) and we plan to support more (A83).
This PR also ensures that the custom metadata used for ring_hash key
(A76) uses the recently added `StructMetadataValue` type. This ensures
that metadata parsing happens only once.
Since the location of the env var check is moved, the tests are also
restructured a little. This PR groups the custom metadata parsing tests
into three groups: one for success cases when the env var is turned on,
one for success cases when the env var is turned off, and one for
failure cases when the env var is turned on.
Use new-style atomic APIs instead of the old ones in the
`ignoreResolveNowClientConn` type.
The changes made in this PR improve the code in the following ways:
* Ergonomics: Method-based API vs function-based, no pointer management
needed
* Safety: Type safety prevents mixing atomic/non-atomic operations,
eliminates pointer errors
* Clarity: The `atomic.Uint32` type makes atomic intent explicit from
declaration
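A short sketch of the new-style API, using a hypothetical `resolveNowGate` type rather than the real `ignoreResolveNowClientConn`: the `atomic.Uint32` field exposes `Load` and `Store` as methods, so there is no `&field` to pass around and no way to read the value non-atomically by accident.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// resolveNowGate is a hypothetical type used only for illustration.
type resolveNowGate struct {
	// Old style would be: ignore uint32, accessed through
	// atomic.StoreUint32(&g.ignore, 1) and atomic.LoadUint32(&g.ignore).
	ignore atomic.Uint32 // 1 means ResolveNow calls are dropped
}

// setIgnore turns dropping of ResolveNow calls on or off.
func (g *resolveNowGate) setIgnore(v bool) {
	if v {
		g.ignore.Store(1)
		return
	}
	g.ignore.Store(0)
}

// shouldIgnore reports whether ResolveNow calls are currently dropped.
func (g *resolveNowGate) shouldIgnore() bool {
	return g.ignore.Load() != 0
}

func main() {
	g := &resolveNowGate{}
	fmt.Println(g.shouldIgnore())
	g.setIgnore(true)
	fmt.Println(g.shouldIgnore())
}
```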
Fixes: https://github.com/grpc/grpc-go/issues/8485
RELEASE NOTES:
* client: Ignore http headers with status 1xx and `END_STREAM` flag
unset.
* client: Fail RPCs with status `INTERNAL` instead of `UNKNOWN` on
receiving http headers with status 1xx and `END_STREAM` flag set.
transport: allow stream cancellation on the server when blocked on flow control (#8528)
Fixes: #8517
This change allows `t.closeStream()` to be executed even if the stream
state is `done`. This is required to allow streams to be cancelled or
timed out. See issue for detailed root cause.
RELEASE NOTES:
* server: Fix bug preventing streams from being cancelled or timed out
when blocked on flow control.
eshitachandwani [Sat, 30 Aug 2025 14:24:14 +0000 (19:54 +0530)]
xdsclient: Fix race in SetWatchExpiryTimeoutForTesting (#8526)
Fixes: #8525
There is a race in
[SetWatchExpiryTimeoutForTesting](https://github.com/grpc/grpc-go/blob/fa0d6583208033fe4f69d359f80286736fd121d0/internal/xds/clients/xdsclient/xdsclient.go#L121)
which is used to override the watch expiry timeout of XDSClient for
testing. Currently it just sets the watchExpiryTimeout of the XDSClient
to the provided value without a mutex each time we call
[NewClientForTesting](https://github.com/grpc/grpc-go/blob/fa0d6583208033fe4f69d359f80286736fd121d0/internal/xds/xdsclient/pool.go#L116C16-L116C35)
which might or might not create a new XDSClient if one is already there.
Fix: Add a new field `WatchExpiryTimeout` to the xdsclient
[config](https://github.com/grpc/grpc-go/blob/30645d521be375d13fa4cb2baa0d2561ca44c342/internal/xds/clients/xdsclient/xdsconfig.go#L28)
which will now be used instead of `internal.WatchExpiryTImeout`.
cjqzhao [Fri, 29 Aug 2025 16:57:00 +0000 (09:57 -0700)]
xds: add metadata registry (#8537)
Following
[A83](https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md)
and
[A86](https://github.com/grpc/proposal/blob/master/A86-xds-http-connect.md),
this adds a registry for custom metadata received in xDS protos for the
purpose of converting the received metadata into internal
representations.
eshitachandwani [Fri, 29 Aug 2025 03:48:05 +0000 (09:18 +0530)]
xds/resolver: change tests to update all resources (#8539)
Change the tests in the xds resolver to update all resources in the
management server instead of only the listener and route resources.
This change is being done as part of gRFC [A74 : xDS Config
tears](https://github.com/grpc/proposal/blob/master/A74-xds-config-tears.md).
This is to make sure the tests pass after the change too.
eshitachandwani [Tue, 26 Aug 2025 05:29:33 +0000 (10:59 +0530)]
xdsclient: create LRSClient at time of initialisation (#8483)
Fixes: https://github.com/grpc/grpc-go/issues/8474
The race is in
[ReportLoad](https://github.com/grpc/grpc-go/blob/9186ebd774370e3b3232d1b202914ff8fc2c56d6/xds/internal/xdsclient/clientimpl_loadreport.go#L35C2-L44C21)
function of clientImpl. The implementation was recently changed as the
part of [xds client
migration](https://github.com/grpc/grpc-go/commit/082a9275c79a9d78fdaa4a93018e5e53a4a3af18).
The
[comment](https://github.com/grpc/grpc-go/blob/85240a5b02defe7b653ccba66866b4370c982b6a/xds/internal/xdsclient/clientimpl.go#L86C2-L87C16)
says that `lrsclient.LRSClient` should be initialized only at creation
time but that was not the case. It was being initialized at the time of
calling `ReportLoad` function.
RELEASE NOTES:
- lrsclient:
- Fix a race condition where the `LRSClient` was not initialized at
creation time but it was being initialized at the time of calling the
`ReportLoad` function.
- Creating an `LRSClient` no longer requires a node ID.
Pranjali-2501 [Mon, 25 Aug 2025 19:24:23 +0000 (00:54 +0530)]
client: Roll-forward PR #8278 (with changes): Restore the existing behavior to return io.EOF on repeated RecvMsg() calls for client-streaming RPCs (#8523)
Changes:
- Modifies client.RecvMsg() so that successive calls after the stream ends
return io.EOF.
- Adds extra state to track calls to client.recvmsg (required to return
Cardinality Violation only when zero responses are received).
RELEASE NOTES:
* client: Return status code INTERNAL when a server sends 0 response
messages for a unary or client streaming RPC.
The change being reverted here (#8369) is a prime suspect for a race
that can show up with the following sequence of events:
- create a new gRPC channel with the `xds:///` scheme
- make an RPC
- close the channel
- repeat (possibly from multiple goroutines)
The observable behavior from the race is that the xDS client thinks that
a Listener resource is removed by the control plane when it clearly is
not. This results in the user's gRPC channel moving to TRANSIENT_FAILURE
and subsequent RPC failures.
The reason the above mentioned PR is not being rolled back using `git
revert` is because the xds directory structure has changed significantly
since the time the PR was originally merged. Manually performing the
revert seemed much easier.
RELEASE NOTES:
* xdsclient: Revert a change that introduces a race with xDS resource
processing, leading to RPC failures
Arjan Singh Bal [Thu, 21 Aug 2025 06:50:13 +0000 (12:20 +0530)]
transport: ensure header mutex is held while copying trailers in handler_server (#8519)
Fixes: https://github.com/grpc/grpc-go/issues/8514
The mutex that guards the trailers should be held while copying the
trailers. We do lock the mutex in [the regular gRPC server
transport](https://github.com/grpc/grpc-go/blob/9ac0ec87ca2ecc66b3c0c084708aef768637aef6/internal/transport/http2_server.go#L1140-L1142),
but have missed it in the std lib http/2 transport. The only place in
`writeStatus()` where a write happens is when the status contains a proto.
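The pattern described above can be sketched as follows. This is an illustration only: `stream`, `hdrMu`, `addTrailer` and `trailerCopy` are hypothetical names, and the trailer map is a plain `map[string][]string` rather than the grpc-go metadata type.

```go
package main

import (
	"fmt"
	"sync"
)

// stream is a hypothetical stand-in for a transport stream.
type stream struct {
	hdrMu   sync.Mutex // guards trailer
	trailer map[string][]string
}

// addTrailer writes to the trailer map while holding the mutex.
func (s *stream) addTrailer(key, value string) {
	s.hdrMu.Lock()
	defer s.hdrMu.Unlock()
	if s.trailer == nil {
		s.trailer = make(map[string][]string)
	}
	s.trailer[key] = append(s.trailer[key], value)
}

// trailerCopy holds the same mutex for the whole copy, so a concurrent
// addTrailer cannot mutate the map while it is being iterated.
func (s *stream) trailerCopy() map[string][]string {
	s.hdrMu.Lock()
	defer s.hdrMu.Unlock()
	out := make(map[string][]string, len(s.trailer))
	for k, v := range s.trailer {
		out[k] = append([]string(nil), v...)
	}
	return out
}

func main() {
	s := &stream{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.addTrailer("x-id", fmt.Sprint(i))
			_ = s.trailerCopy()
		}(i)
	}
	wg.Wait()
	fmt.Println(len(s.trailerCopy()["x-id"]))
}
```

Without the lock in `trailerCopy`, running this under `go run -race` would be expected to report a data race between the map write and the map iteration.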
eunsang [Tue, 19 Aug 2025 17:05:46 +0000 (02:05 +0900)]
xds: move all functionality from `xds/internal` to `internal/xds` (#8515)
Fixes grpc#7290, ensuring that only user-facing functionality remains in
the top-level xds package.
Updates all import paths and aliases to reference the new internal/xds
package, using aliases (e.g., `internal` → `xds` or `xdsinternal`) where
needed to minimize changes to call sites.
No functional changes intended; this is purely a package path
reorganization.
eshitachandwani [Mon, 18 Aug 2025 05:15:30 +0000 (10:45 +0530)]
xds/cdsbalancer: increase buffer size of requested resource channel in test (#8467)
RELEASE NOTES: N/A
Fixes: https://github.com/grpc/grpc-go/issues/8462
The main issue was that the requests were getting dropped since we use a
[non-blocking
send](https://github.com/grpc/grpc-go/blob/a5e7cd6d4c2c31b1e6649789c2ddc9a82ad6b5fa/xds/internal/balancer/cdsbalancer/cdsbalancer_test.go#L222C5-L227C6)
for resources in test along with buffer size of just
[one](https://github.com/grpc/grpc-go/blob/a5e7cd6d4c2c31b1e6649789c2ddc9a82ad6b5fa/xds/internal/balancer/cdsbalancer/cdsbalancer_test.go#L210)
which was resulting in resource request updates being dropped if the
receiver is not executing at the exact moment.
Fix:
Changed `setupManagementServer` to take the `listener` and the
`OnStreamReq` function as parameters, and added a blocking send in
`TestWatcher` whenever a cluster resource is requested.
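To illustrate why a non-blocking send on a channel with a buffer of one can drop updates, here is a small sketch. `trySend` is a hypothetical helper and the string payloads are placeholders; this is not the code from the test.

```go
package main

import "fmt"

// trySend performs a non-blocking send and reports whether the value was
// accepted. When the buffer is full and no receiver is waiting, the default
// case runs and the value is silently dropped.
func trySend(ch chan<- string, v string) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan string, 1) // buffer of one, as in the failing setup

	fmt.Println(trySend(ch, "cluster-A")) // fills the buffer
	fmt.Println(trySend(ch, "cluster-B")) // dropped: nobody has received yet

	// A blocking send has no default case, so it waits for the receiver
	// instead of dropping the value.
	done := make(chan struct{})
	go func() {
		defer close(done)
		fmt.Println(<-ch)
		fmt.Println(<-ch)
	}()
	ch <- "cluster-C"
	<-done
}
```

The blocking send trades the risk of a dropped update for the requirement that a receiver eventually reads from the channel.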
xdsclient: schedule serializer callback from the authority instead of from the xdsChannel (#8498)
This is a small code change that simplifies how a callback is scheduled.
The `xdsChannel` will no longer directly access the serializer inside
the `authority` type. Instead, the authority type will now handle the
scheduling itself. This makes the code cleaner and moves the scheduling
logic to where it belongs.
grpcsync: use context.AfterFunc to close buffer after context canceled in CallbackSerializer (#8489)
[The current minimum supported Go version is now
1.23](https://github.com/grpc/grpc-go/blob/62ec29fd9b3f9ea3cea6dc08a31e837aa92678b7/go.mod#L3).
`context.AfterFunc` is therefore available to everyone using the latest
grpc-go version, so we can address this pending TODO.
`context.AfterFunc` would invoke the given function for both _immediate_
context cancelation and timer-based context cancelation (`WithTimeout`,
`WithDeadline`). So I think this change is safe.
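A small sketch of the behavior relied on here, assuming a hypothetical helper `closeOnDone` in place of the actual `CallbackSerializer` code: the function registered with `context.AfterFunc` runs both for an explicit cancel and for a timeout.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// closeOnDone (hypothetical) arranges for closeFn to run once ctx is done.
// The returned channel is closed after closeFn has returned.
func closeOnDone(ctx context.Context, closeFn func()) <-chan struct{} {
	ran := make(chan struct{})
	context.AfterFunc(ctx, func() {
		closeFn()
		close(ran)
	})
	return ran
}

// ranWithin reports whether ch was closed within d.
func ranWithin(ch <-chan struct{}, d time.Duration) bool {
	select {
	case <-ch:
		return true
	case <-time.After(d):
		return false
	}
}

// demoCancel cancels the context explicitly.
func demoCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	ran := closeOnDone(ctx, func() {})
	cancel()
	return ranWithin(ran, time.Second)
}

// demoTimeout lets the context expire on its own.
func demoTimeout() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	ran := closeOnDone(ctx, func() {})
	return ranWithin(ran, time.Second)
}

func main() {
	fmt.Println("explicit cancel:", demoCancel())
	fmt.Println("timeout:", demoTimeout())
}
```

Compared with a dedicated goroutine blocked on `<-ctx.Done()`, this avoids keeping a goroutine alive for the lifetime of the context.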
This PR updates Prometheus-related dependencies in grpc-go to fix
compatibility issues caused by recent API changes in
github.com/prometheus/otlptranslator.
Complementing the broader dependency updates made in PR #8497.
Oleksandr Redko [Tue, 12 Aug 2025 06:39:40 +0000 (09:39 +0300)]
grpclb: simplify stringifying of IPv6 with net.JoinHostPort (#8503)
This PR simplifies IP address handling in
`lbBalancer.processServerList`.
From [net.JoinHostPort](https://pkg.go.dev/net#JoinHostPort):
> JoinHostPort combines host and port into a network address of the form
"host:port". If host contains a colon, as found in literal IPv6
addresses, then JoinHostPort returns "[host]:port".
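For illustration, a minimal sketch of the quoted behavior. `hostPort` is a hypothetical wrapper, not the code in `lbBalancer.processServerList`, and the addresses are documentation examples.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// hostPort (hypothetical) formats a host and a numeric port. JoinHostPort
// adds the brackets that IPv6 literals need, so no manual check for a colon
// in the host is required.
func hostPort(host string, port int) string {
	return net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	fmt.Println(hostPort("10.0.0.1", 8080))    // 10.0.0.1:8080
	fmt.Println(hostPort("2001:db8::1", 8080)) // [2001:db8::1]:8080
}
```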