Evolution of the technical architecture of the cloud development gateway

Tencent CloudBase (TCB) is a cloud-native integrated development environment and tool platform from Tencent Cloud. It gives developers highly available, auto-scaling back-end cloud services, including serverless compute, storage, and hosting, and can be used to build many kinds of applications in the cloud, such as Mini Programs, Official Accounts, web applications, and Flutter clients.

This article describes how the gateway architecture for cloud development was designed and migrated, and why it evolved from a two-tier architecture to a single-tier architecture. It may serve as a useful reference for similar gateways in the industry.

Introduction

"Cloud Development Gateway" is an Envoy-based secure access gateway for apps, WeChat/WeChat Enterprise Mini Programs, and Official Account H5/web pages. It provides capabilities such as cloud development private links, traffic management, and weak-network acceleration, giving applications secure, stable, cloud-native access.

The secure link is the core capability of the Cloud Development Gateway (hereinafter referred to as the gateway). It uses a private link to provide the underlying support for protecting user traffic and resisting crawlers.

1.1 Problems with HTTPS

HTTPS adoption among websites worldwide has reportedly passed 90%. HTTPS already encrypts business traffic with TLS, and with the encryption strength of TLS 1.2/1.3 and today's computing power, brute-force cracking is practically impossible. Is it still necessary for the gateway to encrypt business traffic again with a private link? In fact it is, because an attacker targeting HTTPS can use a man-in-the-middle (MITM) attack to obtain the plaintext traffic between client and server.

To decrypt HTTPS traffic through MITM, the client must trust a third-party root certificate issued by the man in the middle, and installing a root certificate has its own barriers. First, the attack usually requires the attacker and the victim to be on the same LAN. Second, trusting the attacker's root certificate requires the victim's cooperation as well as administrator or root privileges. For these reasons, attackers do not usually choose this method.

Because the conditions for MITM are so demanding, it is uncommon in real attacks. Attackers more often use Msfvenom to generate payloads and trick the victim into executing them, thereby gaining privileges on the machine. Since MITM is hard to exploit directly, can this risk be ignored in our business scenarios? For ordinary businesses, HTTPS can indeed be considered sufficient, but in fields with high security requirements HTTPS alone is not enough. Examples include prices on e-commerce platforms and appointment-slot ("number source") information on registration platforms: a competitor can use MITM to monitor a rival's product prices in real time and adjust its own prices to keep a low-price advantage, and scalpers can query available slots in real time and grab them as soon as they are released.

1.2 Measures for MITM

Since MITM does pose some risk to the business, how is it usually mitigated?

Use mTLS for mutual authentication.
Use SSL pinning to verify the binding between the domain name and the certificate.
Check if the user's network is using a proxy or VPN (commonly used by banking apps).
Encrypt the business traffic and transmit it over a private link.

mTLS is essentially mutual authentication of the client and server certificates. It can usually be implemented in an app, but real businesses generally need to support every client type, and in H5/Web and WeChat Mini Program scenarios the client has no permission to read certificate information. SSL pinning has a similar problem. Moreover, whether it is mTLS or SSL pinning, certificate verification usually relies on APIs provided by the operating system, and those APIs are easy to bypass with tools such as Xposed and Frida. Checking whether the user is behind a proxy is also unreliable: the check again depends on system APIs, and an attacker can bypass it with a TUN virtual NIC.

Using a private link to encrypt business data solves the problems above and offers better compatibility and scalability: even in multi-client scenarios, the same solution handles them all. Besides securing the link, the private link also hides the server's real business interfaces. Requests from automated crawlers and attack scripts are rejected directly by the gateway because they do not know the private link's protocol.

Two-tier architecture design

All traffic passing through the gateway is HTTP (L7) traffic. A standard HTTP request consists of three parts: the request line, the request headers, and the request body. An HTTP response likewise consists of three parts: the status line, the response headers, and the response body.

In practice, some customers put sensitive information such as signatures in URL parameters for authentication, and others put sensitive information in request headers. If only the request body is encrypted, the user's authentication information can still be intercepted and tampered with. The gateway therefore has to protect not only the request body but also the request headers and the request line. Likewise, the status line, response headers, and response body of the business response should be protected.

If each part of the request were simply encrypted and forwarded to the gateway, which decrypts it in the same way, the problem would seem to be solved. But such a request is no longer standard HTTP, so the traffic is downgraded from L7 to L4. Using L4 is reasonable for apps, but the H5/Web scenario is constrained: modern browsers still do not support raw socket connections, which makes it hard to unify the app and H5/Web architectures.

2.1 Service traffic encapsulation

Apps, H5/Web pages, and WeChat Mini Programs all support HTTP requests, so the underlying transport of the architecture also needs to be based on HTTP. For security, the business request line, request headers, and request body all have to be encrypted, so an HTTP-in-HTTP transport is the better fit: the business request line, headers, and body are encrypted and placed in the body of the private-link request before it is forwarded. Combined with a serialization format such as Protocol Buffers to compact the request data, this keeps performance high and payloads small.

SDKs can be distributed for each type of client and integrated into the business. The business client issues HTTP requests through the SDK, and the SDK handles encryption of the request. The SDK can also add instrumentation (tracking points), so that when a business failure occurs it can be located sooner with the help of logs and alerting.

2.2 Early Architecture Design

Traffic forwarded to the gateway has to be decrypted before it can be processed further, so the first idea in the early design was to add a dedicated encryption and decryption layer. The corresponding design is:

Client-side HTTP -> Gateway SDK -> Encryption and Decryption Module -> Gateway cluster (underlying Envoy, usually corresponding to DownStream) -> Back-to-origin service (commonly referred to as Upstream)

The encryption and decryption module is CPU-intensive, so it is best deployed as a cluster; with a reasonably configured HPA (Horizontal Pod Autoscaler) it can basically meet business needs. Beyond CPU, some special scenarios also need attention. In a flash sale, for example, request latency may rise because the business services are under heavier load. Higher latency means connections pile up in a short period, and the number of requests may grow further, until some hop on the link is overloaded and denies service entirely. Rising upstream latency can therefore have catastrophic consequences. In this scenario, requests from the encryption and decryption module to the gateway cluster should go through a connection pool that reuses connections as much as possible, with appropriate timeouts and keep-alive durations configured so that requests fail fast. Whether requests beyond the pool's maximum connection count should be rejected or should evict earlier connections also has to be decided based on the actual business scenario.

The two-tier architecture adapts well to the early business scenarios, but there are some drawbacks:

The encryption and decryption module lacks the necessary health checks and "all-dead" logic (handling for the case where every instance is reported unhealthy)
The monitoring information of the encryption and decryption module is incomplete, and business metrics have to be actively registered with Prometheus. Adding new metrics requires republishing the service
Increased resource costs and maintenance costs

In the two-tier architecture, the encryption and decryption module is the first hop of the entire link, so it bears the brunt of high-concurrency traffic. However, neither encryption and decryption performance nor the number of connections is the main bottleneck. The module handles each request in its own goroutine, so its performance is well assured, and C10k has long been a non-issue for Go. When the module acts as a client to the gateway cluster, however, its source IP is fixed. Given the connection four-tuple (source IP, source port, destination IP, destination port), abnormal conditions can exhaust the local ports so that no new connections can be created. With the pooling described above in place, there is no need to worry much about this.

Single-tier architecture design

The gateway uses open-source Envoy to forward traffic, and Envoy itself offers rich monitoring information and mature health-check logic. So can the encryption and decryption module be folded into Envoy? Absolutely, but some technical difficulties have to be solved first. In the two-tier architecture, the traffic Envoy processes is the business traffic itself, so it can apply centralized rate limiting based on certain headers, dynamically add and remove headers, or attach risk levels based on certain information.

In a single-tier architecture, the traffic that Envoy actually handles is the outer layer of HTTP in HTTP traffic, that is, the traffic of the private link, so the following issues need to be addressed:

How does Envoy integrate encryption and decryption modules?
How does Envoy parse the business traffic out of each request and rewrite the private-link request, that is, replace the private link traffic with the business traffic?
How can decryption of the request be guaranteed to run before the rate-limiting logic, and encryption of the response after the rate-limiting logic?

In addition, because HTTP in HTTP is used, the cookies and cross-origin information carried by the inner business traffic also have to be considered:

How can the business's Set-Cookie take effect normally when a private link is used?
How should the business's cross-origin headers (for example Access-Control-Allow-Origin and Access-Control-Allow-Headers) be handled correctly?

The single-tier architecture is also the direction in which the various cloud development gateways are being unified, so besides private links it must also be compatible with direct public-network access and WebSocket scenarios.

3.1 Envoy's Interceptors

Envoy provides a variety of interceptors (Envoy Filters), which can dynamically filter, modify, and listen to certain fields, and can implement more complex business logic through Envoy Filters. The most commonly used interceptors are Lua Filter, External Processing Filter, etc.

The Lua Filter is relatively lightweight and is often used for simple business scenarios. However, Envoy uses multiple worker threads and each worker has its own Lua execution environment, which means the Lua Filter has no truly global variables. In addition, the Lua Filter runs synchronously for each request, so Envoy's performance can degrade significantly if network I/O is required. As a result, the gateway uses the Lua Filter only for a small number of simple request modifications or where performance requirements are extremely high.

The External Processing Filter (hereinafter referred to as the gRPC interceptor) exposes a gRPC interface for remote calls and can dynamically modify almost all request and response data, which is exactly what the gateway's private link scenario needs. The gRPC interceptor splits a request into four sequential gRPC calls:

ProcessingRequest_RequestHeaders, request header processing.
ProcessingRequest_RequestBody, request body processing.
ProcessingRequest_ResponseHeaders, response header processing.
ProcessingRequest_ResponseBody, response body processing.

After an encrypted request reaches the gateway, the interceptor first receives the RequestHeaders message. Handling here is relatively simple: it determines whether the request is an OPTIONS preflight probe or an internal health check and, if so, returns 204 directly, dynamically adding the required cross-origin headers according to the request's Origin. The RequestBody message carries the complete business request, which is first decrypted and then run through an HTTP parser to obtain the business request line, request headers, and body. The parsed information then overwrites the headers and body of the outer request. After the move to a single-tier gateway, Envoy is the first hop of the entire link, so it also needs to overwrite the request's X-Forwarded-For with the connection's remote address to prevent forgery.

Response header handling is basically the same as request header handling, with one difference: Set-Cookie can appear multiple times, so the values need to be merged here. The ResponseBody has to be repackaged and encrypted before it is returned. Because the business status code may indicate an error while the private link itself should return normally, the business status code must not be copied onto the private-link response, otherwise the request would be interrupted.
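The request-side steps above can be sketched in Go with only the standard library. This is a simplified stand-in for the logic an external processor might run, not Envoy's ext_proc API: `preflight`, `parseInner`, the origin allow-list, and the chosen CORS header values are all hypothetical, and decryption is assumed to have already produced the plaintext bytes.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"net"
	"net/http"
)

// preflight answers an OPTIONS probe directly with 204 and, for an allowed
// origin, the cross-origin headers the browser needs. The third return value
// reports whether the request was handled here.
func preflight(method, origin string, allowed map[string]bool) (int, http.Header, bool) {
	if method != http.MethodOptions {
		return 0, nil, false
	}
	h := http.Header{}
	if allowed[origin] {
		h.Set("Access-Control-Allow-Origin", origin)
		h.Set("Access-Control-Allow-Methods", "POST, OPTIONS")
		h.Set("Access-Control-Allow-Headers", "Content-Type")
		h.Set("Vary", "Origin")
	}
	return http.StatusNoContent, h, true
}

// parseInner parses the decrypted bytes as an HTTP/1.1 request and returns the
// business request plus its body. Any client-supplied X-Forwarded-For is
// replaced with the real peer address, since the gateway is the first hop.
func parseInner(plaintext []byte, remoteAddr string) (*http.Request, []byte, error) {
	req, err := http.ReadRequest(bufio.NewReader(bytes.NewReader(plaintext)))
	if err != nil {
		return nil, nil, err
	}
	body, err := io.ReadAll(req.Body)
	if err != nil {
		return nil, nil, err
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr // address had no port
	}
	req.Header.Set("X-Forwarded-For", host)
	return req, body, nil
}

func main() {
	plaintext := []byte("GET /api/slots?dept=7 HTTP/1.1\r\n" +
		"Host: booking.example.com\r\n" +
		"X-Forwarded-For: 1.2.3.4\r\n\r\n")
	req, body, err := parseInner(plaintext, "203.0.113.9:51234")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.RequestURI(), req.Host, len(body))
	fmt.Println("X-Forwarded-For:", req.Header.Get("X-Forwarded-For"))
}
```

The parsed method, path, headers, and body are what would replace the outer private-link request before the later filters run.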

3.2 Order of interceptors

The gRPC interceptor solves traffic encryption and decryption, but the cooperation of multiple filters still has to be handled. Envoy executes interceptors top-down on the request path; response processing runs in the opposite order, bottom-up.

Therefore, on the request path the traffic is first preprocessed by the Lua request filter and then decrypted by the private link's gRPC interceptor, so that by the time it reaches the rate-limiting and anti-abuse ("waterproof wall") filters it is already business data. On the way back, the response is likewise preprocessed by the Lua response filter and then re-encapsulated by the private link's gRPC interceptor, which completes the entire link.

Summary

In general, adding an intermediate layer can solve the problems a business runs into, but every extra layer also brings new problems: the middle tier consumes computing resources and needs multiple replicas for high availability, which lowers the ROI of the whole system. In the gateway scenario, merging the encryption and decryption layer into the gateway access layer turned the two-tier architecture into a single-tier one. The gateway architecture is further unified, and logging, monitoring, and alerting can be reused directly. In the current context, it is the better solution.

Author: Li Haoyu

Source-WeChat Official Account: Tencent Cloud Developer

Source: https://mp.weixin.qq.com/s/ghIQ7IX2WitzMdDSzwcq1w
