pickfirstleaf: Fix shuffling of addresses in resolver updates without endpoints (#8610)
The new `pick_first`, which is the default, doesn't shuffle the
addresses at all for resolver updates that are missing the `Endpoints`
field. This change fixes that. Since [gRPC automatically sets the
missing
`Endpoints`](https://github.com/grpc/grpc-go/blob/1059e84f885bf7ed65b3b1a4fbe914360d8ab5b1/resolver_wrapper.go#L136-L138),
this bug should be uncommon in practice.
RELEASE NOTES:
* balancer/pick_first: When configured, shuffle addresses in resolver
updates that lack endpoints. Since gRPC automatically adds endpoints to
resolver updates, this bug should only affect implementers of custom LB
policies that use pick_first for delegation but don't forward the
endpoints.
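The fallback described above can be sketched as follows. This is a minimal illustration, not the pick_first code: the `update` type, the `shuffled` function, and the injected `shuffle` parameter are all hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// update is a hypothetical stand-in for a resolver update; it is not the
// grpc-go resolver.State type.
type update struct {
	Addresses []string   // flat address list
	Endpoints [][]string // each endpoint groups one or more addresses
}

// shuffled returns a shuffled copy of u. When Endpoints is empty it falls
// back to shuffling Addresses, which is the case the fix covers. The shuffle
// function is injected so that callers (and tests) control the randomness.
func shuffled(u update, shuffle func(n int, swap func(i, j int))) update {
	out := update{
		Addresses: append([]string(nil), u.Addresses...),
		Endpoints: append([][]string(nil), u.Endpoints...),
	}
	if len(out.Endpoints) > 0 {
		shuffle(len(out.Endpoints), func(i, j int) {
			out.Endpoints[i], out.Endpoints[j] = out.Endpoints[j], out.Endpoints[i]
		})
		return out
	}
	shuffle(len(out.Addresses), func(i, j int) {
		out.Addresses[i], out.Addresses[j] = out.Addresses[j], out.Addresses[i]
	})
	return out
}

func main() {
	u := update{Addresses: []string{"10.0.0.1:443", "10.0.0.2:443", "10.0.0.3:443"}}
	fmt.Println(shuffled(u, rand.Shuffle).Addresses)
}
```

Injecting the shuffle function keeps the sketch deterministic under test; production code would pass a real random shuffle.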
Evan Jones [Thu, 25 Sep 2025 17:53:20 +0000 (13:53 -0400)]
examples/features/health: Clarify docs for health import (#8597)
The google.golang.org/grpc/health package must be imported for client
health checking to work. I somehow missed this, even though it is in the
README, the client example, and the health package docs. Attempt to make
it clearer with a few extra mentions, since it is quite hard to debug
this misconfiguration.
* Remove deprecated grpc.WithBlock function
* Make service config const since it isn't modified
xdsclient: improve fallback test involving three servers (#8604)
The existing fallback test that involves three servers is flaky because
some of the resources have the same name
in different servers. The listener resource is expected to have the same
name across the different management servers, but we generally expect
the other resources to have different names.
See the following from the gRFC:
- In
https://github.com/grpc/proposal/blob/master/A71-xds-fallback.md#reservations-about-using-the-fallback-server-data,
we have the following:
```
We have no guarantee that a combination of resources from different xDS servers form a valid cohesive
configuration, so we cannot make this determination on a per-resource basis. We need any given gRPC
channel or server listener to only use the resources from a single server.
```
- In
https://github.com/grpc/proposal/blob/master/A71-xds-fallback.md#config-tears,
we have the following:
```
Config tears happen when the client winds up using some combination of resources from the primary and
fallback servers at the same time, even though that combination of resources was never validated to work
together. In theory, this can cause correctness issues where we might send traffic to the wrong location or
the wrong way, or it can cause RPCs to fail. Note that this can happen only when the primary and fallback
server use the same resource names.
```
This PR ensures that all the different management servers have different
resource names for all resources except the listener. Also, ran the test
on forge 100K times with no failures.
This PR also improves a couple of logs that I found useful when
debugging the failures.
opentelemetry: Remove chatty log in client (#8606)
Removing this debug log to reduce noise. This log fires on every RPC
call but provides no useful debugging value. The action it logs (adding
callInfo to the context) is part of the normal flow, and the message
contains no helpful variables.
benchmark: Hold read+write lock while updating server state (#8601)
The `lastResetTime` and `rusageLastReset` fields in the
`benchmarkServer` are written while holding a read lock. This can result
in concurrent modifications. This change replaces the `RWMutex` with a
regular `Mutex` to avoid such problems. This lock is acquired a couple
of times during the entire test run, so contention is not a major
concern.
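A minimal sketch of the resulting locking shape, assuming a hypothetical `serverState` type (the real `benchmarkServer` has different fields and methods):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// serverState is a hypothetical stand-in for the benchmark server's state.
type serverState struct {
	mu            sync.Mutex // plain Mutex: every critical section here writes
	lastResetTime time.Time
	resets        int
}

// reset records a new reset time and returns the previous one. Because it
// writes to the guarded fields, it must hold the exclusive lock.
func (s *serverState) reset() time.Time {
	s.mu.Lock()
	defer s.mu.Unlock()
	prev := s.lastResetTime
	s.lastResetTime = time.Now()
	s.resets++
	return prev
}

// count returns how many resets have happened so far.
func (s *serverState) count() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.resets
}

func main() {
	s := &serverState{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.reset()
		}()
	}
	wg.Wait()
	fmt.Println("resets:", s.count())
}
```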
encoding: Add a test-only function for temporarily registering compressors (#8587)
Fixes: https://github.com/grpc/grpc-go/issues/7960
This PR adds a function that allows tests to register a compressor with
arbitrary names and un-register them at the end of the test. This
prevents the compressor names from showing up in the encoding header in
subsequent tests. Previously, tests were using the name of the existing
compressor "gzip" and re-registering the original compressor to
work around this problem.
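The register-then-unregister idea can be sketched as below. The `registry` map and `registerForTesting` are hypothetical names for this illustration and do not reflect the actual function added to the encoding package.

```go
package main

import "fmt"

// registry is a hypothetical name-to-compressor map; strings stand in for
// real compressor implementations.
var registry = map[string]string{"gzip": "builtin gzip"}

// registerForTesting registers c under name and returns a function that
// restores the registry to its previous state.
func registerForTesting(name, c string) (unregister func()) {
	prev, had := registry[name]
	registry[name] = c
	return func() {
		if had {
			registry[name] = prev
			return
		}
		delete(registry, name)
	}
}

func main() {
	unregister := registerForTesting("test-codec", "fake compressor")
	fmt.Println(len(registry)) // two entries while the test runs
	unregister()
	fmt.Println(len(registry)) // back to one entry afterwards
}
```

In a real test, the returned function would typically be handed to `t.Cleanup` so that it runs even when the test fails.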
xdsclient: fix TestConcurrentReportLoad to not run for 10s (#8598)
While working on the fix for the xDS client unsubscribe/resubscribe
race, I noticed that the tests in the `internal/xds/xdsclient/tests/`
directory were taking about a minute to run. Upon inspection I found
that `TestConcurrentReportLoad` was running for the configured test
timeout duration of `10s`, but was not failing.
This PR fixes the test to run in a short duration. It also makes a
couple of other cleanups that I noticed when fixing this test.
xdsclient/tests: move fallback tests to separate directory (#8600)
Currently, tests in the `internal/xds/xdsclient/tests` package can take
close to a minute to run. Almost half of that time is taken by the
fallback tests which actually have to run longer because they have to
wait for connections to go down and come up and for these events to be
detected by the code (before fallback is triggered).
Splitting the fallback tests into a separate directory reduces
the time by almost half since tests from these two packages can now run in
parallel.
We *could* possibly add a way for tests to add some dial options (to be
used when dialing the management server), and thereby reduce the time
spent in exponential backoff before connections are reattempted (during
the fallback process). But this would require a non-trivial amount of
work, and could make the code more complicated. The change in this PR
seems like a good bang for the buck.
flowcontrol: change variable names for better understanding (#8578)
This PR aims to improve some variable names for better understanding.
Before the change, it took time for users to think about why there's a
`b` variable.
benchmark: Avoid spawning a goroutine per unary call (#8591)
The benchmark client is presently spawning a new goroutine per unary
call and blocking on its completion. Since the spawning goroutine is
blocked, it is more efficient to do the work in the spawning goroutine
itself. This change has the following effect on the [benchmark
performance](https://grafana-dot-grpc-testing.appspot.com/):
1. Unary 8-core: 184k QPS to 233k QPS (+26%)
2. Unary 30-core: 403k QPS to 624k QPS (+54%)
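The two shapes being compared look roughly like this. `unaryCall` is a hypothetical stand-in for a blocking unary RPC, and the sketch shows only the control flow, not the benchmark client itself.

```go
package main

import "fmt"

// unaryCall is a hypothetical stand-in for one blocking unary RPC.
func unaryCall(i int) int { return i * 2 }

// runSpawning mirrors the old shape: spawn a goroutine per call and then
// block until it finishes, so the goroutine adds no concurrency.
func runSpawning(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		done := make(chan int)
		go func(i int) { done <- unaryCall(i) }(i)
		sum += <-done
	}
	return sum
}

// runInline mirrors the new shape: do the work in the calling goroutine.
func runInline(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += unaryCall(i)
	}
	return sum
}

func main() {
	fmt.Println(runSpawning(10), runInline(10))
}
```

Both functions compute the same result; the inline version skips the goroutine creation and channel handoff on every call.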
## Tested
* Ran the benchmark on the same GKE cluster to repro the results from
the dashboard.
* Created a docker image with the changes in this PR. Re-ran the
benchmark with the new image.
vet: add line numbers of offending lines to the output (#8593)
When vet fails because of offending whitespace, the output currently
only lists the offending file. This change adds the line number to the
output to make it easier on the developer to fix the issue.
credentials: Remove TODO from public godoc (#8589)
The TODO comment with a GitHub user's name shows up in the [public
godoc](https://pkg.go.dev/google.golang.org/grpc@v1.75.1/credentials#PerRPCCredentials).
Since this is a stable API, changing it now doesn't seem feasible, so
this change removes it completely.
client: minor improvements to log messages (#8564)
A couple of minor improvements to log messages from the gRPC channel.
The improvements are:
- Log the target URI when we log a message for the creation of a gRPC
channel
- Separate the channelz identifier (which could be something like
`[Channel #X]` or `[Channel X][Subchannel Y]` etc) from the actual
message being logged with a space
Part one for https://github.com/grpc/proposal/pull/492 (A97).
This is done in a new `credentials/jwt` package to provide file-based
PerRPCCallCredentials. It can be used beyond XDS. The package handles
token reloading, caching, and validation as per A97.
There will be a separate PR which uses it in `xds/bootstrap`.
Whilst implementing the above, I considered `credentials/oauth` and
`credentials/xds` packages instead of creating a new one. The former
package has `NewJWTAccessFromKey` and `jwtAccess` which seem very
relevant at first. However, I think the `jwtAccess` behaviour seems more
tailored towards Google services. Also, the refresh, caching, and error
behaviour for A97 is quite different than what's already there and
therefore a separate implementation would have still made sense.
WRT `credentials/xds`, it could have been extended to both handle
transport and call credentials. However, this is a bit at odds with A97
which says that the implementation should be non-XDS specific and, from
reading between the lines, usable beyond XDS.
I think the current approach makes review easier but because of the
similarities with the other two packages, it is a bit confusing to
navigate. Please let me know whether the structure should change.
Relates to https://github.com/istio/istio/issues/53532
xds/resolver_test: fix flaky test ResolverBadServiceUpdate_NACKedWithoutCache (#8521)
Fixes: #8435
### root cause of issue:
- I think there was a race condition in the channel communication between
the xDS resolver and the test infrastructure
- Insufficient buffer size: the original channels (stateCh and errCh) had
a buffer size of only 1
- Blocking sends: when the buffer is full, the resolver blocks trying
to send the next update
- Test deadlock: the test infra might be waiting for a specific update while
the resolver is blocked trying to send a different update, creating a
deadlock
2) Non-blocking send pattern:
``` go
select {
case stateCh <- s: // the resolver tries to send the update
default: // if the channel is full, drain the old message and retry
	select {
	case <-stateCh:
		stateCh <- s
	default:
	}
}
```
- This drains old messages, which prevents the resolver from blocking and keeps only the latest update.
3) Cleanup with draining goroutines:
``` go
go func() {
	for range stateCh { } // Drain any remaining messages
}()
```
- This ensures the resolver never blocks on sends and prevents goroutine leaks during test cleanup.
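A self-contained variation of the keep-latest send pattern is sketched below. `sendLatest` is a hypothetical helper, and unlike the snippet above it also makes the retry non-blocking, which is an assumption of this sketch rather than what the test does.

```go
package main

import "fmt"

// sendLatest delivers v without ever blocking. If the buffer is full, it
// discards one buffered value to make room, so a receiver sees the newest
// update instead of a stale one.
func sendLatest(ch chan int, v int) {
	select {
	case ch <- v:
	default:
		select {
		case <-ch: // drop the oldest buffered value
		default:
		}
		select {
		case ch <- v:
		default: // another sender refilled the buffer; drop v rather than block
		}
	}
}

func main() {
	ch := make(chan int, 1)
	for i := 1; i <= 3; i++ {
		sendLatest(ch, i)
	}
	fmt.Println(<-ch) // with a single sender, only the last value remains
}
```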
internal/buffer: set closed flag when closing channel in the Load method (#8575)
## Description
This PR fixes a bug in the `Unbounded.Load()` method where the `closed`
flag was not being set to `true` when the channel was closed.
## Problem
In the `Load()` method, when the condition `b.closing && !b.closed` is
met, the code closes the channel but doesn't update the `closed` flag.
This creates an inconsistent state where:
- The channel is closed (no more data can be sent)
- But `b.closed` remains `false`
This inconsistency could potentially cause issues in code that relies on
the `closed` flag to determine the buffer's state.
## Solution
Added `b.closed = true` before `close(b.c)` in the `else if` branch of
the `Load()` method to ensure the closed flag accurately reflects the
buffer's state.
## Changes
- **File**: `internal/buffer/unbounded.go`
- **Method**: `Load()`
- **Line**: 86
- **Change**: Added `b.closed = true` before closing the channel
## Testing
- ✅ All existing tests pass
- ✅ No linter errors introduced
- ✅ The fix ensures consistent state between channel closure and closed
flag
## Impact
This is a bug fix that improves the correctness of the `Unbounded`
buffer implementation without changing its public API or behavior from a
user perspective.
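To illustrate why the flag matters, here is a simplified sketch of an unbounded buffer. It is a hypothetical reduction written for this note, not the code in `internal/buffer/unbounded.go`; the type and method names are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// unbounded is a simplified, hypothetical unbounded buffer.
type unbounded struct {
	mu      sync.Mutex
	c       chan int
	backlog []int
	closing bool // closeBuffer was called
	closed  bool // the channel has actually been closed
}

func newUnbounded() *unbounded { return &unbounded{c: make(chan int, 1)} }

// put sends v directly when nothing is pending, otherwise queues it.
func (b *unbounded) put(v int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.closing {
		return
	}
	if len(b.backlog) == 0 {
		select {
		case b.c <- v:
			return
		default:
		}
	}
	b.backlog = append(b.backlog, v)
}

// load moves the next backlog item onto the channel; call it after each
// receive. Once the backlog is empty and the buffer is closing, it closes
// the channel exactly once.
func (b *unbounded) load() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.backlog) > 0 {
		select {
		case b.c <- b.backlog[0]:
			b.backlog = b.backlog[1:]
		default:
		}
	} else if b.closing && !b.closed {
		b.closed = true // without this, a second load would close b.c again
		close(b.c)
	}
}

// closeBuffer closes the channel now if nothing is pending; otherwise the
// close is deferred to load.
func (b *unbounded) closeBuffer() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.closing {
		return
	}
	b.closing = true
	if len(b.backlog) == 0 {
		b.closed = true
		close(b.c)
	}
}

func main() {
	b := newUnbounded()
	b.put(1)
	b.put(2)
	b.closeBuffer()
	for v := range b.c {
		fmt.Println(v)
		b.load()
	}
	b.load() // safe: the closed flag prevents a double close
}
```

In this sketch, omitting `b.closed = true` in `load` would make the final `load` call close the channel a second time and panic.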
Roy Salame [Mon, 15 Sep 2025 05:21:51 +0000 (01:21 -0400)]
encoding/proto: enable use cached size option (#8569)
Enable UseCachedSize in proto marshal to eliminate redundant size
computation
Fixes: https://github.com/grpc/grpc-go/issues/8570
The proto message size was previously being computed twice: once before
marshalling and again during the marshalling call itself. In
high-throughput workloads, this duplicated computation is expensive.
By enabling `UseCachedSize` on `MarshalOptions`, we reuse the size
calculated immediately before marshalling, avoiding the second call to
`proto.Size`.
In our application, the redundant size call accounted for ~12% of total
CPU time. With this change, we eliminate that overhead while preserving
correctness.
transport: avoid slice reallocation during header creation (#8547)
This PR improves the size estimate while pre-allocating `headerFields`
to avoid reallocations, which pprof showed were responsible for ~4% of
total memory allocations. This change improves performance, increasing
QPS by 1% while reducing bytes/op by 4% and latencies by 0.3-4%.
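The general technique, sizing a slice up front so that `append` never has to grow it, can be sketched as follows. The `headerField` type, the fixed field set, and `buildHeaders` are hypothetical; the transport's actual estimate is not reproduced here.

```go
package main

import "fmt"

// headerField is a hypothetical stand-in for an HPACK header field.
type headerField struct{ name, value string }

// buildHeaders counts the fields it is about to emit and allocates the
// slice once with that capacity, so the appends below never reallocate.
func buildHeaders(md map[string][]string) []headerField {
	const fixed = 3 // assumed number of always-present fields in this sketch
	n := fixed
	for _, vs := range md {
		n += len(vs)
	}
	hf := make([]headerField, 0, n)
	hf = append(hf,
		headerField{":method", "POST"},
		headerField{":scheme", "http"},
		headerField{"content-type", "application/grpc"},
	)
	for k, vs := range md {
		for _, v := range vs {
			hf = append(hf, headerField{k, v})
		}
	}
	return hf
}

func main() {
	hf := buildHeaders(map[string][]string{"x-trace": {"a", "b"}, "x-user": {"c"}})
	fmt.Println(len(hf), cap(hf))
}
```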
Revert "stats/opentelemetry: record retry attempts from clientStream (#8342)" (#8571)
This introduced flakiness in a test -
Test/TraceSpan_WithRetriesAndNameResolutionDelay
Failure:
https://github.com/grpc/grpc-go/actions/runs/17614152882/job/50042942932?pr=8547
Related issue: https://github.com/grpc/grpc-go/issues/8299
GoogleC2P: remove dependency on metadata server for IPv6 node metadata (#8550)
Remove reliance on the metadata server since its result is no longer
needed; hardcode IPv6 support in node metadata instead.
Related c++ change: https://github.com/grpc/grpc/pull/40571
Note we preserve prior behavior in case experiment `NewPickFirstEnabled`
is disabled, because our testing/qualification has not covered that
being disabled.
xds: move env var check for HTTP CONNECT metadata parsing to endpoint and locality parsing functions (#8551)
Currently, the env var check for parsing HTTP CONNECT metadata (A86) is
inside the function that parses custom metadata,
`validateAndConstructMetadata`.
This PR moves the check to the endpoint and locality parsing functions,
`parseEndpoint` and the top-level `parseEDSRespProto` which is where
localities are parsed. This allows multiple env vars to control
different custom metadata keys. We already support two custom metadata
keys (A76 and A86) and we plan to support more (A83).
This PR also ensures that the custom metadata used for ring_hash key
(A76) uses the recently added `StructMetadataValue` type. This ensures
that metadata parsing happens only once.
Since the location of the env var check is moved, the tests are also
restructured a little. This PR groups the custom metadata parsing tests
into three groups: one for success cases when the env var is turned on,
one for success cases when the env var is turned off, and one for
failure cases when the env var is turned on.
Use new-style atomic APIs instead of the old ones in the
`ignoreResolveNowClientConn` type.
The changes made in this PR improve the code in the following ways:
* Ergonomics: Method-based API vs function-based, no pointer management
needed
* Safety: Type safety prevents mixing atomic/non-atomic operations,
eliminates pointer errors
* Clarity: The `atomic.Uint32` type makes atomic intent explicit from
declaration
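A side-by-side sketch of the two styles, using hypothetical `oldStyle` and `newStyle` types rather than the actual `ignoreResolveNowClientConn` fields:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// b2u converts a bool to the uint32 representation stored atomically.
func b2u(v bool) uint32 {
	if v {
		return 1
	}
	return 0
}

// oldStyle uses the function-based API: a plain uint32 field that must be
// accessed through pointer-taking atomic functions everywhere.
type oldStyle struct{ ignore uint32 }

func (o *oldStyle) set(v bool) { atomic.StoreUint32(&o.ignore, b2u(v)) }
func (o *oldStyle) get() bool  { return atomic.LoadUint32(&o.ignore) != 0 }

// newStyle uses atomic.Uint32: the type itself only exposes atomic methods,
// so a non-atomic read or write cannot be written by accident.
type newStyle struct{ ignore atomic.Uint32 }

func (n *newStyle) set(v bool) { n.ignore.Store(b2u(v)) }
func (n *newStyle) get() bool  { return n.ignore.Load() != 0 }

func main() {
	var o oldStyle
	var n newStyle
	o.set(true)
	n.set(true)
	fmt.Println(o.get(), n.get())
}
```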
Fixes: https://github.com/grpc/grpc-go/issues/8485
RELEASE NOTES:
* client: Ignore http headers with status 1xx and `END_STREAM` flag
unset.
* client: Fail RPCs with status `INTERNAL` instead of `UNKNOWN` on
receiving http headers with status 1xx and `END_STREAM` flag set.
transport: allow stream cancellation on the server when blocked on flow control (#8528)
Fixes: #8517
This change allows `t.closeStream()` to be executed even if the stream
state is `done`. This is required to allow streams to be cancelled or
timed out. See the issue for the detailed root cause.
RELEASE NOTES:
* server: Fix bug preventing streams from being cancelled or timed out
when blocked on flow control.
eshitachandwani [Sat, 30 Aug 2025 14:24:14 +0000 (19:54 +0530)]
xdsclient: Fix race in SetWatchExpiryTimeoutForTesting (#8526)
Fixes: #8525
There is a race in
[SetWatchExpiryTimeoutForTesting](https://github.com/grpc/grpc-go/blob/fa0d6583208033fe4f69d359f80286736fd121d0/internal/xds/clients/xdsclient/xdsclient.go#L121)
which is used to override the watch expiry timeout of XDSClient for
testing. Currently it just sets the watchExpiryTimeout of the XDSClient
to the provided value without a mutex each time we call
[NewClientForTesting](https://github.com/grpc/grpc-go/blob/fa0d6583208033fe4f69d359f80286736fd121d0/internal/xds/xdsclient/pool.go#L116C16-L116C35)
which might or might not create a new XDSClient if one is already there.
Fix: Add a new field `WatchExpiryTimeout` to the xdsclient
[config](https://github.com/grpc/grpc-go/blob/30645d521be375d13fa4cb2baa0d2561ca44c342/internal/xds/clients/xdsclient/xdsconfig.go#L28)
which will now be used instead of `internal.WatchExpiryTImeout`.
cjqzhao [Fri, 29 Aug 2025 16:57:00 +0000 (09:57 -0700)]
xds: add metadata registry (#8537)
Following
[A83](https://github.com/grpc/proposal/blob/master/A83-xds-gcp-authn-filter.md)
and
[A86](https://github.com/grpc/proposal/blob/master/A86-xds-http-connect.md),
this adds a registry for custom metadata received in xDS protos for the
purpose of converting the received metadata into internal
representations.
eshitachandwani [Fri, 29 Aug 2025 03:48:05 +0000 (09:18 +0530)]
xds/resolver: change tests to update all resources (#8539)
Change the tests in xds resolver to update all resources in management
server instead of only listener and route resource.
This change is being done as part of gRFC [A74 : xDS Config
tears](https://github.com/grpc/proposal/blob/master/A74-xds-config-tears.md).
This is to make sure the tests pass after the change too.
eshitachandwani [Tue, 26 Aug 2025 05:29:33 +0000 (10:59 +0530)]
xdsclient: create LRSClient at time of initialisation (#8483)
Fixes: https://github.com/grpc/grpc-go/issues/8474
The race is in
[ReportLoad](https://github.com/grpc/grpc-go/blob/9186ebd774370e3b3232d1b202914ff8fc2c56d6/xds/internal/xdsclient/clientimpl_loadreport.go#L35C2-L44C21)
function of clientImpl. The implementation was recently changed as
part of [xds client
migration](https://github.com/grpc/grpc-go/commit/082a9275c79a9d78fdaa4a93018e5e53a4a3af18).
The
[comment](https://github.com/grpc/grpc-go/blob/85240a5b02defe7b653ccba66866b4370c982b6a/xds/internal/xdsclient/clientimpl.go#L86C2-L87C16)
says that `lrsclient.LRSClient` should be initialized only at creation
time but that was not the case. It was being initialized at the time of
calling `ReportLoad` function.
RELEASE NOTES:
- lrsclient:
- Fix a race condition where the `LRSClient` was not initialized at
creation time but it was being initialized at the time of calling the
`ReportLoad` function.
- Creating an `LRSClient` no longer requires a node ID.
Pranjali-2501 [Mon, 25 Aug 2025 19:24:23 +0000 (00:54 +0530)]
client: Roll-forward PR #8278(with changes): Restore the existing behavior to return io.EOF on repeated RecvMsg() calls for client-streaming RPCs (#8523)
Changes:
- Modifies client.RecvMsg() so that successive calls after the stream ends
return io.EOF.
- Adds extra state to track calls to client.recvmsg (required to return
Cardinality Violation only in the zero-response case)
RELEASE NOTES:
* client: Return status code INTERNAL when a server sends 0 response
messages for a unary or client streaming RPC.
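The intended behavior can be modeled with a small sketch. `clientStream`, `recv`, and `errNoResponse` are hypothetical names, and the ordering of errors shown here is an assumption based on the description above, not a copy of the grpc-go implementation.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errNoResponse is a hypothetical stand-in for a status error with code
// INTERNAL.
var errNoResponse = errors.New("INTERNAL: cardinality violation: no response message")

// clientStream is a hypothetical model of the receive side of a
// client-streaming RPC after the server has closed the stream.
type clientStream struct {
	responses  []string // messages the server sent before closing
	recvCalled bool     // extra state: has recv been called before?
}

// recv returns the next response. A zero-response stream yields the
// cardinality violation only on the first call; every later call at the end
// of the stream returns io.EOF.
func (s *clientStream) recv() (string, error) {
	first := !s.recvCalled
	s.recvCalled = true
	if len(s.responses) > 0 {
		msg := s.responses[0]
		s.responses = s.responses[1:]
		return msg, nil
	}
	if first {
		return "", errNoResponse
	}
	return "", io.EOF
}

func main() {
	s := &clientStream{responses: []string{"ok"}}
	for i := 0; i < 3; i++ {
		msg, err := s.recv()
		fmt.Printf("%q %v\n", msg, err)
	}
}
```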
The change being reverted here (#8369) is a prime suspect for a race
that can show up with the following sequence of events:
- create a new gRPC channel with the `xds:///` scheme
- make an RPC
- close the channel
- repeat (possibly from multiple goroutines)
The observable behavior from the race is that the xDS client thinks that
a Listener resource is removed by the control plane when it clearly is
not. This results in the user's gRPC channel moving to TRANSIENT_FAILURE
and subsequent RPC failures.
The above-mentioned PR is not being rolled back using `git revert`
because the xds directory structure has changed significantly
since the PR was originally merged. Manually performing the
revert seemed much easier.
RELEASE NOTES:
* xdsclient: Revert a change that introduces a race with xDS resource
processing, leading to RPC failures
Arjan Singh Bal [Thu, 21 Aug 2025 06:50:13 +0000 (12:20 +0530)]
transport: ensure header mutex is held while copying trailers in handler_server (#8519)
Fixes: https://github.com/grpc/grpc-go/issues/8514
The mutex that guards the trailers should be held while copying the
trailers. We do lock the mutex in [the regular gRPC server
transport](https://github.com/grpc/grpc-go/blob/9ac0ec87ca2ecc66b3c0c084708aef768637aef6/internal/transport/http2_server.go#L1140-L1142),
but have missed it in the std lib http/2 transport. The only place where
a write happens in `writeStatus()` is when the status contains a proto.
eunsang [Tue, 19 Aug 2025 17:05:46 +0000 (02:05 +0900)]
xds: move all functionality from `xds/internal` to `internal/xds` (#8515)
Fixes grpc#7290, ensuring that only user-facing functionality remains in
the top-level xds package.
Updates all import paths and aliases to reference the new internal/xds
package, using aliases (e.g., `internal` → `xds` or `xdsinternal`) where
needed to minimize changes to call sites.
No functional changes intended; this is purely a package path
reorganization.
eshitachandwani [Mon, 18 Aug 2025 05:15:30 +0000 (10:45 +0530)]
xds/cdsbalancer: increase buffer size of requested resource channel in test (#8467)
RELEASE NOTES: N/A
Fixes: https://github.com/grpc/grpc-go/issues/8462
The main issue was that the requests were getting dropped since we use a
[non-blocking
send](https://github.com/grpc/grpc-go/blob/a5e7cd6d4c2c31b1e6649789c2ddc9a82ad6b5fa/xds/internal/balancer/cdsbalancer/cdsbalancer_test.go#L222C5-L227C6)
for resources in test along with buffer size of just
[one](https://github.com/grpc/grpc-go/blob/a5e7cd6d4c2c31b1e6649789c2ddc9a82ad6b5fa/xds/internal/balancer/cdsbalancer/cdsbalancer_test.go#L210)
which was resulting in resource request updates being dropped if the
receiver is not executing at the exact moment.
Fix:
Changed `setupManagementServer` to take the `listener` and the `OnStreamReq`
function as parameters, and added a blocking send in `TestWatcher`
whenever a cluster resource is requested.
xdsclient: schedule serializer callback from the authority instead of from the xdsChannel (#8498)
This is a small code change that simplifies how a callback is scheduled.
The `xdsChannel` will no longer directly access the serializer inside
the `authority` type. Instead, the authority type will now handle the
scheduling itself. This makes the code cleaner and moves the scheduling
logic to where it belongs.
grpcsync: use context.AfterFunc to close buffer after context canceled in CallbackSerializer (#8489)
[The minimum supported Go version is now
1.23](https://github.com/grpc/grpc-go/blob/62ec29fd9b3f9ea3cea6dc08a31e837aa92678b7/go.mod#L3),
so `context.AfterFunc` is available to everyone using the latest grpc-go
release, and this pending TODO can now be resolved.
`context.AfterFunc` invokes the given function for both _immediate_
context cancelation and timer-based context cancelation (`WithTimeout`,
`WithDeadline`), so I think this change is safe.
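A minimal sketch of the pattern, assuming a hypothetical `buffer` type in place of the serializer's real callback queue:

```go
package main

import (
	"context"
	"fmt"
)

// buffer is a hypothetical stand-in for the serializer's callback queue.
type buffer struct{ done chan struct{} }

// Close marks the buffer as closed.
func (b *buffer) Close() { close(b.done) }

// newCancelableBuffer returns a buffer that is closed once the returned
// cancel function is called. context.AfterFunc replaces a dedicated
// goroutine that would otherwise wait on ctx.Done().
func newCancelableBuffer() (*buffer, context.CancelFunc) {
	ctx, cancel := context.WithCancel(context.Background())
	b := &buffer{done: make(chan struct{})}
	context.AfterFunc(ctx, b.Close)
	return b, cancel
}

func main() {
	b, cancel := newCancelableBuffer()
	cancel()
	<-b.done // AfterFunc runs b.Close in its own goroutine after cancelation
	fmt.Println("buffer closed")
}
```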
This PR updates Prometheus-related dependencies in grpc-go to fix
compatibility issues caused by recent API changes in
github.com/prometheus/otlptranslator.
Complementing the broader dependency updates made in PR #8497.
Oleksandr Redko [Tue, 12 Aug 2025 06:39:40 +0000 (09:39 +0300)]
grpclb: simplify stringifying of IPv6 with net.JoinHostPort (#8503)
This PR simplifies IP address handling in
`lbBalancer.processServerList`.
From [net.JoinHostPort](https://pkg.go.dev/net#JoinHostPort):
> JoinHostPort combines host and port into a network address of the form
"host:port". If host contains a colon, as found in literal IPv6
addresses, then JoinHostPort returns "[host]:port".
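For illustration, a small helper built on `net.JoinHostPort`; the name `hostPort` is hypothetical and is not the code in `lbBalancer.processServerList`.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// hostPort formats an IP and port as a dialable address, letting
// net.JoinHostPort add the brackets that IPv6 literals need.
func hostPort(ip net.IP, port int) string {
	return net.JoinHostPort(ip.String(), strconv.Itoa(port))
}

func main() {
	fmt.Println(hostPort(net.ParseIP("10.0.0.1"), 443))    // 10.0.0.1:443
	fmt.Println(hostPort(net.ParseIP("2001:db8::1"), 443)) // [2001:db8::1]:443
}
```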
Chris Staite [Mon, 14 Jul 2025 17:52:09 +0000 (18:52 +0100)]
credentials: allow audience to be configured (#8421) (#8442)
There are competing specifications around whether a method should be included in a JWT audience or not. For example #4713 specifically excluded the method referencing https://google.aip.dev/auth/4111 whereas GCE IAP requires the full URI https://cloud.google.com/iap/docs/authentication-howto.
In order to facilitate both methods, we introduce a new environment variable, namely GRPC_AUDIENCE_IS_FULL_PATH, to allow the method stripping to be disabled. This defaults to the existing behaviour of stripping the method, but can be set to avoid this.
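The toggle can be sketched as below. Only the variable name `GRPC_AUDIENCE_IS_FULL_PATH` comes from the description above; the accepted values (parsed here with `strconv.ParseBool`), the helper names, and the choice to strip only the final path segment are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// audienceFor derives a JWT audience from a full method URI such as
// "https://host/pkg.Service/Method". With fullPath unset it drops the final
// path segment (an assumption about what "stripping the method" means).
func audienceFor(uri string, fullPath bool) string {
	if fullPath {
		return uri
	}
	if i := strings.LastIndex(uri, "/"); i >= 0 {
		return uri[:i]
	}
	return uri
}

// fullPathFromEnv reads GRPC_AUDIENCE_IS_FULL_PATH. Treating it as a
// ParseBool-style value is an assumption; unset or invalid means false.
func fullPathFromEnv() bool {
	v, err := strconv.ParseBool(os.Getenv("GRPC_AUDIENCE_IS_FULL_PATH"))
	return err == nil && v
}

func main() {
	uri := "https://example.com/pkg.Service/Method"
	fmt.Println(audienceFor(uri, fullPathFromEnv()))
}
```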