504 errors when Nginx proxy_pass points at an AWS ALB

Posted on 2024-01-23

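Some of our backend services are being containerized. Because of legacy constraints, the existing gateway and related infrastructure cannot be moved into the k8s cluster in one step, so as a temporary workaround we use Nginx proxy_pass to forward traffic to an AWS ALB.
After running this way for a while, Nginx started returning 504 errors. The backend services all checked out as healthy, and requests sent directly to the ALB were answered normally. That investigation is the subject of this post.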

Problem description

Our upstream configuration was as follows:

location /xxx-service/ {
  proxy_pass http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/xxx-service/;
}

Reloading Nginx restored normal service, but the same problem came back after a while. The Nginx error log showed the following:

[error] 297612#297612: *2235585 no live upstreams while connecting to upstream, client: 3.0.xx.183, server: xxx.xxx.xxx, request: "GET /health HTTP/1.1", upstream: "http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/", host: "xxx.xxx.xxx"

... # some time after the reload, the error appears again

[error] 297612#297612: *2235596 no live upstreams while connecting to upstream, client: 210.3.xx.148, server: xxx.xxx.xxx, request: "GET /health HTTP/1.1", upstream: "http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/", host: "xxx.xxx.xxx"

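Troubleshooting

The key detail in the description above is that nginx -s reload restores normal service. That suggests the problem lies in Nginx rather than in the gateway application.
While going through the Nginx error log, I noticed that the upstream IP addresses changed after a reload. That gave me a direction: the problem was probably in how Nginx resolves the domain name in proxy_pass.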

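Some reading confirmed this:
When stock Nginx uses proxy_pass with an upstream given as a domain name, it performs a DNS query for that name once, when the configuration is loaded. It then keeps the resolved addresses and does not query again until the next configuration reload or restart.
An AWS ALB, on the other hand, is a managed elastic load balancer, and by default its IP addresses change from time to time: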
About dynamic change of IP address when using ELB | AWS
Application Load Balancer IP Change Event

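So when the ALB's IP addresses change, Nginx is unaware of the new DNS records and keeps sending traffic to the old addresses instead of the new upstream. Connections to those stale addresses time out, which surfaces as 504 errors, and once every cached address has been marked as failed Nginx logs no live upstreams.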
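
To make the failure mode concrete, here is a minimal Python sketch (not something from our setup) that compares a set of addresses captured at load time with what DNS returns now. The names resolve_ipv4 and stale_addresses, and the choice to pass the resolver in as a parameter, are assumptions made for illustration only.

```python
import socket
from typing import Callable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the IPv4 addresses the system resolver currently gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (address, port) pair.
    return {info[4][0] for info in infos}


def stale_addresses(
    cached: Set[str],
    hostname: str,
    resolve: Callable[[str], Set[str]] = resolve_ipv4,
) -> Set[str]:
    """Addresses that were cached at load time but are no longer present in DNS."""
    return cached - resolve(hostname)
```

If stale_addresses returns a non-empty set for the addresses Nginx resolved at its last reload, Nginx is still holding at least one address that the ALB no longer publishes.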

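Solution

With the cause identified, the fix is to make Nginx pick up the latest DNS records whenever the ALB's DNS changes, so that traffic is routed correctly.

Should we reload Nginx on a timer, then? That is hardly elegant, and there is a better way to get the same result:

Dynamic resolution with a variable

The official Nginx documentation for Module ngx_http_proxy_module contains this passage: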

Parameter value can contain variables. In this case, if an address is specified as a domain name, the name is searched among the described server groups, and, if not found, is determined using a resolver.

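In other words, the proxy_pass value can contain a variable, in which case Nginx looks up the IP address by sending a DNS query through the configured resolver.

That makes things simple. We can change the configuration to the following: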

server {
  listen 80;
  listen 443 ssl;
  server_name xxx.xxx;

  # Specify the DNS resolver
  resolver 8.8.8.8;

  # Define a variable named lb_upstream
  set $lb_upstream http://gateway-service-alb-xxx.xxx.elb.amazonaws.com;

  location / {
    # Pass the upstream to proxy_pass as a variable
    proxy_pass $lb_upstream;
  }
  # ...
}

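With this in place, Nginx uses the resolver to resolve the proxy_pass name dynamically. Our expectation was that Nginx would send a DNS query on every request to get the latest records.

To check that guess, we captured DNS traffic on the server running Nginx:

# Capture packets on port 53 (DNS) on all interfaces and keep only those involving our resolver, 8.8.8.8.
# -l keeps tcpdump's output line-buffered so that grep sees each packet as it arrives.
sudo tcpdump -l -i any -n 'udp port 53 or tcp port 53' | grep '8.8.8.8.53'

With tcpdump running, we sent requests to Nginx and got output similar to this: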

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes


14:42:32.020783 IP 10.0.60.121.34798 > 8.8.8.8.53: 9593+ A? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
14:42:32.020801 IP 10.0.60.121.34798 > 8.8.8.8.53: 37392+ AAAA? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
14:42:32.026549 IP 8.8.8.8.53 > 10.0.60.121.34798: 9593 2/0/0 A x.x.x.x, A x.x.x.2x (119) 
14:42:32.071872 IP 8.8.8.8.53 > 10.0.60.121.34798: 37392 0/1/0 (169)

Hmm, that is not quite right: Nginx was not sending a DNS query for the proxy_pass name on every request. Then I remembered an option described in the resolver documentation:

By default, nginx caches answers using the TTL value of a response. An optional valid parameter allows overriding it:
resolver 127.0.0.1 [::1]:5353 valid=30s;

In other words, the Nginx resolver honors the DNS TTL by default, and the TTL on an AWS ALB domain name is 60s by default:

dig gateway-service-alb-xxx.xxx.elb.amazonaws.com  @8.8.8.8


; <<>> DiG 9.16.1-Ubuntu <<>> gateway-service-alb-xxx.xxx.elb.amazonaws.com @8.8.8.8
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 36168
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;gateway-service-alb-xxx.xxx.elb.amazonaws.com. IN A

;; ANSWER SECTION:
gateway-service-alb-xxx.xxx.elb.amazonaws.com. 60 IN A x.x.x.x
gateway-service-alb-xxx.xxx.elb.amazonaws.com. 60 IN A x.x.x.2x

;; Query time: 4 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Wed Jan 24 15:56:47 CST 2024
;; MSG SIZE  rcvd: 130

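Let's add this parameter to the configuration above and check again:

# Specify the DNS resolver
resolver 8.8.8.8 valid=1s;

After nginx -s reload, another capture showed a DNS query for each request arriving 1s apart, which confirmed our guess: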

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes

14:43:32.020783 IP 10.0.60.121.34798 > 8.8.8.8.53: 9593+ A? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
14:43:32.020801 IP 10.0.60.121.34798 > 8.8.8.8.53: 37392+ AAAA? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
14:43:32.026549 IP 8.8.8.8.53 > 10.0.60.121.34798: 9593 2/0/0 A x.x.x.x, A x.x.x.2x (119) 
14:43:32.071872 IP 8.8.8.8.53 > 10.0.60.121.34798: 37392 0/1/0 (169)


14:43:33.020783 IP 10.0.60.121.34798 > 8.8.8.8.53: 9593+ A? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
14:43:33.020801 IP 10.0.60.121.34798 > 8.8.8.8.53: 37392+ AAAA? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
14:43:33.026549 IP 8.8.8.8.53 > 10.0.60.121.34798: 9593 2/0/0 A x.x.x.x, A x.x.x.2x (119) 
14:43:33.071872 IP 8.8.8.8.53 > 10.0.60.121.34798: 37392 0/1/0 (169)



14:43:34.020783 IP 10.0.60.121.34798 > 8.8.8.8.53: 9593+ A? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
14:43:34.020801 IP 10.0.60.121.34798 > 8.8.8.8.53: 37392+ AAAA? http://gateway-service-alb-xxx.xxx.elb.amazonaws.com. (87)
14:43:34.026549 IP 8.8.8.8.53 > 10.0.60.121.34798: 9593 2/0/0 A x.x.x.x, A x.x.x.2x (119) 
14:43:34.071872 IP 8.8.8.8.53 > 10.0.60.121.34798: 37392 0/1/0 (169)

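So do we really need to set the valid parameter by hand?
Not in this case. For the scenario in this post it is enough to follow the DNS TTL of the ALB domain name. Overly frequent DNS queries are not a good thing: they add unnecessary overhead, and the ALB's TTL is not ours to decide anyway.
Tuning valid by hand only makes sense for cases such as DDNS, where the latest DNS record has to be picked up quickly.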
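
The trade-off between following the TTL and forcing a short valid can be illustrated with a small Python model of a caching resolver. This is only a sketch of the caching rule quoted from the resolver documentation, not Nginx's actual implementation; the class name ResolverCache, its fields, and the query callback are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A DNS answer as seen by the cache: (addresses, ttl_seconds).
Answer = Tuple[List[str], int]


class ResolverCache:
    """Caches answers for the record's TTL, or for `valid` seconds if it is set."""

    def __init__(self, query: Callable[[str], Answer], valid: Optional[int] = None):
        self.query = query      # function that performs the real DNS query
        self.valid = valid      # optional override, like `valid=` on the resolver directive
        self.queries = 0        # number of queries actually sent upstream
        self._cache: Dict[str, Tuple[List[str], float]] = {}

    def lookup(self, name: str, now: float) -> List[str]:
        entry = self._cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]     # still fresh: answer from cache, no DNS traffic
        addresses, ttl = self.query(name)
        self.queries += 1
        lifetime = self.valid if self.valid is not None else ttl
        self._cache[name] = (addresses, now + lifetime)
        return addresses
```

In this model, one lookup per second for two minutes against a record with a 60s TTL sends 2 queries when the TTL is followed and 120 queries with valid=1, which is the extra overhead described above.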

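A side effect of using a variable in proxy_pass

The variable solves the DNS resolution problem, but it introduces a new one: when the location is something other than / and the proxy_pass argument is a variable, proxy_pass does not behave quite as we expected.

proxy_pass without a variable

When proxy_pass does not use a variable and has no trailing /: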

location /a/ {
    proxy_pass http://127.0.0.1:8080;
}

A request for nginx/a/b/c is forwarded to http://127.0.0.1:8080/a/b/c.

When proxy_pass has a trailing /:

location /a/ {
    proxy_pass http://127.0.0.1:8080/; # note the trailing /
}

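For a request to nginx/a/b/c, Nginx replaces the part of the URI matched by the location with the URI given in proxy_pass, so the request is forwarded to http://127.0.0.1:8080/b/c. The matched /a/ is gone.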

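proxy_pass with a variable

Once we switch to a variable in proxy_pass for dynamic resolution, as described above, the behavior becomes the following.
When proxy_pass uses a variable without a trailing /:

# Specify the DNS resolver
resolver 8.8.8.8;

# Define a variable named lb_upstream
set $lb_upstream http://gateway-service-alb-xxx.xxx.elb.amazonaws.com;

location /a/ {
    proxy_pass $lb_upstream;
}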

A request for nginx/a/b/c is forwarded to http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/a/b/c. This matches the behavior of proxy_pass without a variable, as expected.

When proxy_pass uses a variable and the upstream variable has a trailing /:

# Specify the DNS resolver
resolver 8.8.8.8;

# Define a variable named lb_upstream
set $lb_upstream http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/;  # note the trailing / here

location /a/ {
    proxy_pass $lb_upstream;
}

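A request for nginx/a/b/c is forwarded straight to http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/. That is neither the /b/c we wanted nor /a/b/c: the request goes to /. The reason is that when proxy_pass contains variables and a URI is specified, that URI is passed to the server as is, replacing the original request URI.

So how do we get the request forwarded to /b/c as intended? The answer is to leave the trailing / off the variable and rewrite the URI inside the location with rewrite instead:

# Specify the DNS resolver
resolver 8.8.8.8;

# Define a variable named lb_upstream
set $lb_upstream http://gateway-service-alb-xxx.xxx.elb.amazonaws.com;

location /a/ {
    # Strip the /a/ prefix ourselves, then proxy the rewritten URI
    rewrite ^/a/(.*) /$1 break;
    proxy_pass $lb_upstream;
}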

With this configuration, a request for nginx/a/b/c is forwarded to http://gateway-service-alb-xxx.xxx.elb.amazonaws.com/b/c.
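
The four cases above, plus the rewrite workaround, can be summarized in a small Python model. It is a simplified sketch that covers only prefix locations and ignores query strings; the function names upstream_path and rewrite_break are made up for illustration and are not part of Nginx.

```python
import re
from typing import Optional


def rewrite_break(pattern: str, replacement: str, path: str) -> str:
    """Model of `rewrite <pattern> <replacement> break;` (use \\1 where nginx uses $1)."""
    return re.sub(pattern, replacement, path, count=1)


def upstream_path(
    location: str,
    request_path: str,
    proxy_pass_path: Optional[str],
    uses_variable: bool,
) -> str:
    """Path sent to the upstream for a prefix location.

    proxy_pass_path is the URI part of proxy_pass ("/" for a trailing slash,
    None when proxy_pass is only scheme://host).
    """
    if not proxy_pass_path:
        # No URI part: the (possibly rewritten) request path is passed through.
        return request_path
    if uses_variable:
        # Variable plus URI part: the URI is sent as is, replacing the request path.
        return proxy_pass_path
    if not request_path.startswith(location):
        raise ValueError("request path does not match the location prefix")
    # Static proxy_pass with URI part: the matched prefix is replaced by that URI.
    return proxy_pass_path + request_path[len(location):]
```

For example, upstream_path("/a/", "/a/b/c", "/", True) gives "/", while first applying rewrite_break(r"^/a/(.*)", r"/\1", "/a/b/c") and then calling upstream_path with no URI part gives "/b/c".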

Other approaches

Besides the native Nginx solution above, there are plenty of other options:

ngx_http_upstream_dynamic_module

Alibaba's Tengine implements a dynamic upstream module:
ngx_http_upstream_dynamic_module | Tengine

The ‘fail_timeout’ parameter specifies how long time tengine considers the DNS server as unavailiable if a DNS query fails for a server in the upstream. In this period of time, all requests comming will follow what ‘fallback’ specifies.

The following configuration is all that is needed:

upstream backend {
    dynamic_resolve fallback=stale fail_timeout=30s;

    server a.com;
    server b.com;
}

server {
    ...

    # proxy_pass is only valid inside a location block
    location / {
        proxy_pass http://backend;
    }
}

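This module provides a fallback mechanism. If you are already running Tengine, it is a fairly elegant solution.
Note that starting with Tengine 2.3 this module is no longer built in, so with later versions you may need to compile it in yourself.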

Using ngx_upstream_jdomain

ngx_upstream_jdomain | Nginx
ngx_upstream_jdomain | Github
By default, this module performs a DNS lookup every second.

Using nginx-upstream-dynamic-servers

nginx-upstream-dynamic-servers
This module resolves the name once at startup and afterwards re-resolves according to the TTL.

Nginx Plus

Nginx Plus is the commercial edition and offers dynamic resolution as a feature:
http-load-balancer

References

Tengine Github
ngx_http_upstream_dynamic_module | Tengine
resolver
Module ngx_http_proxy_module
Nginx with dynamic upstreams
NGINX proxy_pass to ELB with Variable