Large scale blocking of TLS-based censorship circumvention tools in China

This page collects comments by a GitHub user named HyeonSeungri, who said he was an employee of an Internet censorship vendor and disclosed many details about the GFW's strategies in the net4people/bbs thread "Large scale blocking of TLS-based censorship circumvention tools in China". He later deleted all of his comments and his account. All of the content below is from webpages archived by the Wayback Machine:

1

hello, all. i am working for a censorship vendor company. my company is a censorship member of guangzhou international internet exchange. i can confirm that some of the things you mentioned are correct. this tls in tls detection system is not realtime censorship. it automatically collects data connections with highspeed transmission or cumulative traffic greater than a preset value. these pcap packets are sent to different vendors for detection, just like the popular covid-19 pcr test. if a vendor reports that there is proxy data in the pcap, we push a rule to the edge bypass routing facility near the user for bgp flowspec reroute. these images were not sent by firewall operations staff, and it is certain that these vendors violated some confidentiality policies. based on the existing data, these vendors can only detect the fingerprints of tls1.2 and tls1.3. so using legacy tls protocol like tls1.0, tls1.1 is a good choice, you can also use sm algorithm, these protocols will not be detected. of course, there is only one way to avoid this detection, and that is to abandon e2e: use self-signed certificates to re-sign the proxied websites after decryption on the server side, and then send the plaintext to the client through a single tls layer. this ensures that tls in tls is not detected.
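
Editorial sketch: the selection step described in this comment (collect only connections with high-speed transmission or cumulative traffic above a preset value, then hand them off for offline analysis) can be illustrated as below. This is a minimal sketch, not the vendor's system. The threshold values, the `FlowStats` type, and the function names are all hypothetical, since the comment gives no actual numbers or implementation details.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the source only says "highspeed transmission"
# and "cumulative traffic greater than a preset value".
RATE_THRESHOLD_BPS = 50_000_000          # assumed rate cutoff, bits per second
VOLUME_THRESHOLD_BYTES = 1_000_000_000   # assumed cumulative cutoff, bytes


@dataclass
class FlowStats:
    """Per-connection counters (hypothetical structure)."""
    src: str
    dst: str
    total_bytes: int
    duration_s: float

    @property
    def rate_bps(self) -> float:
        # Average rate over the life of the connection, in bits per second.
        if self.duration_s <= 0:
            return 0.0
        return 8 * self.total_bytes / self.duration_s


def should_sample(flow: FlowStats) -> bool:
    """True if the flow is fast OR has moved a lot of data in total."""
    return (flow.rate_bps > RATE_THRESHOLD_BPS
            or flow.total_bytes > VOLUME_THRESHOLD_BYTES)


def select_for_offline_analysis(flows):
    """Return the flows that would be captured and sent on for analysis."""
    return [f for f in flows if should_sample(f)]


flows = [
    FlowStats("10.0.0.1", "203.0.113.5", 1_000_000, 10.0),          # small, slow
    FlowStats("10.0.0.2", "203.0.113.6", 100_000_000, 1.0),         # fast burst
    FlowStats("10.0.0.3", "203.0.113.7", 2_000_000_000, 10_000.0),  # slow, bulky
]
selected = select_for_offline_analysis(flows)
```

Because selection happens after the fact on accumulated counters, this kind of pipeline is asynchronous by design, which is consistent with the comment's point that the system is not realtime.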

2

And I'm surprised that the legacy TLS 1.0, TLS 1.1 are not detected by the vendors. Would you mind explaining more about it? Don't all versions of TLS implementations have corresponding fingerprints?

sslv3, tls1.0 and tls1.1 have no significant fingerprint.
new features of tls1.2 and tls1.3 make them very distinctive.
the popularization of tls1.2 and tls1.3 by well-known websites makes detection easier.
you cannot change the tls protocol of the target website when you visit using a proxy, except that the website you visit does not use tls.
only detected when you're transferring.

3

I think self-signed certificates make the server more suspicious

internet transmission needs to use a trusted certificate. transmission between the proxy server and the proxy client needs to go through a single layer of tls, and the proxy client uses a self-signed certificate to issue a certificate for the website you access. this avoids another tls handshake inside the tls masquerade tunnel.

4

@Gowee @rfbzs

many large foreign companies in shenzhen that need to use legacy sslvpn. these vpn are still using legacy protocols such as sslv3 and tls1.0. they often complain about sslvpn, the authorities require us to add these to the whitelist.
stateowned enterprises and their foreign branches usually use sslvpn to communicate. These enterprises authorities are required to use sm cipher suites, they use firewall hardware made in china, using tls 1.1, tls 1.2 and sm cipher suites, are also in the whitelist.
instructions provided by the vendor: sslv3, tls1.0, tls1.1 have no available detection features, and the false positive rate is high, may damage the Internet, and few people use it, so it can be ignored
even if your proxy tunnel use tls1.0, if you access a tls1.2 website through the proxy, this can be detected

5

even if your proxy tunnel use tls1.0, if you access a tls1.2 website through the proxy, this can be detected

So downgrading the TLS version is pointless

they don't care what proxy protocol you use, they only detect multiple tls interaction characteristics of the same tcp connection. according to the vendor's instructions, they can still detect tls in tls with 40% random padding, but the detection takes a very long time.
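
Editorial sketch: the comment does not define "40% random padding". One possible reading, assumed here purely for illustration, is that each tunneled record is extended by a random amount of up to 40% of its original length. The function name `pad_lengths` and the `max_ratio` parameter are hypothetical.

```python
import random


def pad_lengths(record_lengths, max_ratio=0.4, rng=None):
    """Add a random amount of padding (0..max_ratio of each length).

    Assumption: "40% random padding" is read as "up to 40% extra bytes
    per record". The source does not confirm this interpretation.
    """
    rng = rng or random.Random()
    padded = []
    for n in record_lengths:
        extra = rng.randint(0, int(n * max_ratio))  # inclusive on both ends
        padded.append(n + extra)
    return padded


# Illustrative record sizes only; not taken from any real capture.
original = [517, 1400, 1400, 230, 80]
padded = pad_lengths(original, rng=random.Random(1))
```

In this sketch the padding changes each record's size but leaves the number of records and their order untouched.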

6

they can still detect tls in tls with 40% random padding

There are papers claiming higher numbers in experimental setup. Without false positive metrics on real world large volume datasets, it doesn't mean anything, can't be used in production, etc.

vendors can apply it to production at any cost if required by the competent authority. this tls in tls detection is an example. they forced the vendor to apply it to production before a certain date. this detection system was only used in the test environment before. the vendor urgently produced an automated traffic sampling and rule push for production. but, there are not enough gpus for production now.

7

Also, I have never heard that any public websites are using SM or some other similar cipher suites. And I bet the market share of it is trivial or just zero.

RFC8998 TLS1.3_SM4_GCM_SM3 is now rapidly promoted by the authorities as a policy requirement, and as such, it's in the whitelist.

8

The text shown on this diagram is nonsense. How could TLS 1.3 use RSA 2048 key exchange at all?

@Gowee what they mean is probably that the server key length for these tls characteristics is RSA2048, maybe the server certificate private key length. these diagrams are simply explained by the vendor, based on the data, for the competent authorities who do not understand the internet, so they will not be very rigorous.

9

. Is it related to the SM cipher you mentioned?

@rfbzs yes, sm cipher includes sm2 sm3 sm4

Aren't there anything, like, acceptance criteria before production?

there are no acceptance criteria. once it meets the Internet censorship requirements of the competent authority, it can be applied to production

Okay, if firewall policymaking is regionalized on province basis, the quality control could be even lower because the impact from false positives is also regionalized. It's an interesting analogy to COVID-19 PCR testing because that also has this story of elevated, technically unsound, acceptance for false positives due to regional politics, e.g. 层层加码.

different competent authorities in different regions, such as the Communications Administration, the Internet Police, and other organizations. A fun fact, there are a lot of people in the internet censorship agencies in many regions who don't know anything about the internet.

The missing piece here for me is who is creating the training labels for evaluating vendor performance? One has to maintain a very large farm of real world devices and be expert in all kinds of circumvention tools and have an unfettered data link to google.com, twitter.com, youtube.com, etc, without firewalls. It does make sense that in these screenshots the traffic capture sharing systems are all located near national internet gateways. The training label providers should be located very near these sharing systems.

there are different vendors everywhere. many vendors are located in beijing and shanghai, and they have public lines that are not censored. these internet censorship facilities have different names in different places, and there are different deployment methods. there are centralized deployment methods and edge deployment methods. usually, the vendor will arrange a team of about 10-30 people to live in the data center in each place.

10

such as the Communications Administration, the Internet Police

My new guess is this highly coordinated top-down architecture of marketization would not happen without the 2018 reorg when CNCERT was assigned to report to CAC, which has vast powers to enact deep reforms like this.

the Communications Administration and the MIIT do not conduct Internet censorship. internet censorship is carried out by the Central Cyberspace Administration of the Communist Party of China, which is not a state agency. now there are more anti-fraud agencies everywhere, and they censor the Internet together, often resulting in conflict with each other.

11

A fact: this tls in tls detection mechanism was not developed by a Chinese company. It was bought from Russian companies, which have many powerful products for Internet censorship.

12

These pcap sharing systems are probably an endpoint of a greater pipeline. I can imagine some kind of ensemble learning methods that can handle bad results from a single vendor, for which reason lack of comprehensive testing in single vendors is less likely to result in actual catastrophic failure.

The real source of value of a massive ML system like this comes from the data, the training sets, the ground truth. Generating positive samples is relatively easy, but generating negative samples of sufficient coverage is hard. I'm curious how they can capture the long tail of the entire Internet.

Internet traffic sampling has been rapidly deployed on the edge and closer to the user, such as bras, olt mirroring, and onu in the user's home. for small samples, they will manually mark some IPs and continue sampling for a period of time, such as 3 months, 6 months and a year, and these pcaps will be provided to the vendor for processing.

13

the popularization of tls1.2 and tls1.3 by well-known websites makes detection easier.

Would you mind explaining the reason?

because the websites you access through the proxy have tls1.2 and tls1.3 websites, the tls1.2 and tls1.3 characteristics are quite obvious, so they don't care what proxy protocol you use, they only detect multiple tls interaction characteristics of the same tcp connection. if you access a tls1.2 and tls1.3 website through the proxy, this can be detected.

14

The reason uTLS won't be useful is that only the tls proxy tunnel carries the uTLS fingerprint. accessing tls1.2 and tls1.3 websites through the tls proxy tunnel on the same tcp connection will show your browser's tls fingerprint. it's a distinct fingerprint, and it indicates that you are using a tls masquerade proxy tunnel. a product provided by Russian companies can detect encrypted tls fingerprints, so when you access tls1.2 and tls1.3 websites through any known proxy protocol, it can be detected.

15

[complement] Some months ago (maybe two or three), on China Mobile's network, proxies could not transmit traffic even while the websites themselves could open normally.

I also noticed the same fact that my disguised website keeps accessible while its TLS-based proxy function is not working. This makes it more plausible that they may find a way to detect the TLS-over-TLS protocol.

if your ip and port are marked for tls tunnel proxy behavior, at ordinary times they are used to test a set of tools that automatically block obfuscated/encrypted inner tls fingerprint packets, that is, to test blocking tls in tls packets without affecting normal website access, such as cloudflare, akamai, azure cdn. because of certain events, this plan was disrupted.

16

detect

emulating popular browser fingerprints is not exactly the same as the browser fingerprints that users use for proxy access. detecting encrypted tls fingerprints does not require breaking tls encryption.

17

Internet traffic sampling has been rapidly deployed on the edge and closer to the user, such as bras, olt mirroring, and onu in the user's home

This does not guarantee clean negative labels are collected. The users could use any circumvention tools. That would poison the already sensitive base rate (in Bayesian sense).

a part of the sample is the vendor's own sampling using different proxy protocols, and the other part is manually marked, and continuously samples the manually marked traffic. they also buy some proxy services, get their ips, sample those ips for a long time, and finally send these proxy service providers to the police.

18

it has been abused now, such as police, anti-fraud, isp, which is also the reason for the inconsistency of the situation everywhere. so, the system of Internet exchange will be restored in 2021, and now edge and Internet exchange are used at the same time. for security, they use a nationally interconnected private management network.

19

a part of the negative sample is the vendor's own sampling using different proxy protocols

No, positive sample means samples of the proxy targets that you want to detect. Negative sample means everything else, browsers, RPCs, Bittorrent, realtime messagings, industrial protocols, and a long tail that I cannot imagine, which is the hardest part to simulate realistically. You do want to maintain some level of low false positive rate so to not indiscriminately detecting non-proxy traffic. If you have 50% positive samples and 50% negative samples as your training sets, it's called the base rate fallacy.

this is provided to CAC by a Russian company. our company is not used for sample processing, so I don't know the specific content. we only send pcaps to these vendors, they push the results asynchronously, and everything I know is from the syndicated meeting. maybe eversec company and netpower company have this information. they don't care about wrong judgment or wrong blocking, unless there is a complaint from a large company.
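
Editorial sketch: the base rate concern raised in the question above can be made concrete with Bayes' rule. The detector rates and base rates below are hypothetical numbers chosen only to show the effect; none of them come from the thread.

```python
def precision(tpr: float, fpr: float, base_rate: float) -> float:
    """P(flow is really a proxy | detector flagged it), by Bayes' rule.

    tpr:       true positive rate, P(flag | proxy)
    fpr:       false positive rate, P(flag | not proxy)
    base_rate: fraction of all flows that really are proxy traffic
    """
    true_pos = tpr * base_rate
    false_pos = fpr * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)


# Hypothetical detector: 99% TPR, 1% FPR.
balanced = precision(0.99, 0.01, 0.5)     # evaluated on a 50/50 test set
realistic = precision(0.99, 0.01, 0.001)  # 1 proxy flow per 1000 flows

print(f"50/50 test set: {balanced:.3f}")
print(f"0.1% base rate: {realistic:.3f}")
```

With these assumed numbers, the same detector that looks about 99% precise on a balanced test set would be right for only about 9% of its flags when proxy traffic is 0.1% of all flows, which is why false positive rates measured on balanced training sets say little about production behavior.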

20

@HyeonSeungri Can you reveal something about the protocols that use UDP transport and whether the vendor is now able to perform detection processing? For example https://github.com/HyNetwork/hysteria .

udp is a different vendor, I don't know, our company only processing tcp

21

@HyeonSeungri

based on the existing data, these vendors can only detect the fingerprints of tls1.2 and tls1.3. so using legacy tls protocol like tls1.0, tls1.1 is a good choice, you can also use sm algorithm, these protocols will not be detected.

many large foreign companies in shenzhen that need to use legacy sslvpn. these vpn are still using legacy protocols such as sslv3 and tls1.0. they often complain about sslvpn, the authorities require us to add these to the whitelist.

stateowned enterprises and their foreign branches usually use sslvpn to communicate. These enterprises authorities are required to use sm cipher suites, they use firewall hardware made in china, using tls 1.1, tls 1.2 and sm cipher suites, are also in the whitelist.

instructions provided by the vendor: sslv3, tls1.0, tls1.1 have no available detection features, and the false positive rate is high, may damage the Internet, and few people use it, so it can be ignored

even if your proxy tunnel use tls1.0, if you access a tls1.2 website through the proxy, this can be detected

I think I did not get the point here.

By whitelist, I think it means those traffic are exempted from censorship to some extent to avoid undesired collateral damage.

For SSL VPN, there must be tons of different types of traffic inside the tunnel, where TLS traffic would constitute a large portion, which inevitably resulting in TLS over (insecure) TLS/SSL.

If these insecure TLS versions are indeed whitelisted, I think TLS over TLS would also be exempted? Then this assumption would contradict "even if your proxy tunnel use tls1.0, if you access a tls1.2 website through the proxy, this can be detected".

If TLS over TLS would still be identified and blocked when an insecure TLS version is used as the outer camouflage, the whitelist is meaningless in the context of the discussion. What is the benefit to use an insecure TLS version then? We could just use the secure TLS 1.2/1.3 as the outer protocol and manage to mitigate the leakage of traffic patterns from inside the tunnel.

they can detect these tls1.2 traffic in the tls1.0 tunnel, but won't be pushed to edge rules to enforce port blocking, at least for now. because there are many enterprises in shenzhen that contribute a lot of gdp and need to use sslvpn based on tls1.0, this will lead to a large number of enterprise complaints, so this will not be blocked at least in guangdong, they are protected by the guangdong government. some foreignfunded enterprise employees need to connect to the headquarters sslvpn at any location, so some special combinations of tls tunnels such as fortigate sslvpn, juniper sslvpn, and anyconnect are in the shenzhen whitelist, at least for now. there are many companies that need ipsec to connect to global branches such as microsoft, so ipsec is also whitelisted, because their ip change frequently, so they are whitelisted by protocol instead of ip whitelist.

22

the whitelist depends on whether there are many foreign-funded enterprises in your local area. If there are few foreign-funded enterprises in your locality, there is almost no whitelist and there are strict Internet censorship policies. if there are many foreign-funded enterprises in your locality, such as shenzhen and shanghai, they will have many, many internet protocols on the whitelist that will not be censored, and have very easy internet censorship policies. in short, if you live somewhere like shenzhen or shanghai, you will rarely be censored by tls in tls, unless your proxy traffic exceeds a reasonable range for your own use. for example, shenzhen is sampling 10-20GB per 100GB of internet traffic for internet censorship, guangzhou is sampling 25GB per 100GB of internet traffic for internet censorship, and my colleagues in shanghai told me that they sample 1-5GB per 100GB for internet censorship, incredible.
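
Editorial sketch: the per-city sampling figures quoted in this comment can be turned into sampled volumes for a given amount of traffic. The ratios are the ones stated in the comment; the function name and the example traffic totals are hypothetical.

```python
# (low, high) GB sampled per 100 GB of traffic, as stated in the comment.
SAMPLING_GB_PER_100GB = {
    "shenzhen": (10, 20),
    "guangzhou": (25, 25),
    "shanghai": (1, 5),
}


def sampled_volume_gb(city: str, total_gb: float):
    """Return the (low, high) range of sampled volume in GB."""
    low, high = SAMPLING_GB_PER_100GB[city]
    return total_gb * low / 100, total_gb * high / 100


# Hypothetical example: 1000 GB of traffic in each city.
for city in SAMPLING_GB_PER_100GB:
    low, high = sampled_volume_gb(city, 1000)
    print(f"{city}: {low:.0f}-{high:.0f} GB sampled per 1000 GB")
```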

23

Meaning, they do care about false positives, but the cost is weighted by its economic impact, and signals of economic impact come from big corporate users, not individuals. They cannot afford to and do not care to preemptively test for individual user experience. And to avoid having too many complaints from businesses, they do need an acceptable baseline of false positives a priori.

yes, enterprises with large GDP contributions and all state-owned enterprises have the right to complain, and no one cares about other companies and individuals. the principle followed is: the stronger the economy, the less censorship.

24

of course, there are many service providers and some ips that forward proxy traffic in shanghai, guangzhou, dongguan and shenzhen. these are not exempted, such as cn2 forwarding and iplc forwarding. this special traffic or peer-to-peer traffic will be separately included in stricter internet censorship. it is not covered by any whitelist, and no exemptions can be applied.

25

If tls is encrypted, such as shadowsocks, the outer layer again using tls, this tls in shadowsocks in tls, whether there are still obvious features can be detected?

this can be detected. according to the Russian vendor's instructions, they can detect tls1.2/1.3 in tunnel in tunnel in tunnel in tunnel, unless the website you visit does not use tls. they can also detect shadowsocks, and the formerly famous shadowsocks realtime blocking algorithm is provided by them. in short, if you access tls1.2/1.3 websites through a proxy tunnel, they can be detected. Russia has strong computer science capabilities, they can even censor data in ipsec.

26

Which company is that, RDP.RU?

this is just one of the vendors. it's impossible to list all of them based on some principles, but it is certain that roskomnadzor and its affiliates are some commercial companies derived from KGB. they have strong technical reserves for internet censorship and are the institutions most capable of internet censorship. their sales already occupy half of the market, such as india, iran, germany, france, uk, indonesia, singapore, thailand, myanmar, korea, japan and many other countries. in fact, many countries have purchased chinese and russian internet surveillance equipment, and they only monitor the internet with little or no censorship. these devices can also be used for other purposes, such as anti-piracy, anti-anonymity, anti-fraud, tracking anonymous people, tracking network attacks, etc. so almost every country has these devices installed, and no one can be anonymous on the internet. an interesting example: if you are in china and, through a proxy connection to twitter, make some remarks that the authorities deem inappropriate, their police can always find you in 1-2 weeks and say hello at your home. in fact, many authorities have the ability to track a person's actual identity in telegram. they can match you in the telegram proxy data traffic through a large number of surveillance telegram bots, using characteristics such as time, length, etc. they can match your messages, the real local ip of your proxy connection, and the real address you used to send messages, as long as you are in their country and make requests through their Internet. the more messages you send, the more accurate the match.

27

I asked some colleagues and they said that this is based on the characteristics of space and time, the real time is constant, and the speed of time passing is constant, but after being collected as a data packet set, the computer can replay the space and time of the data packet transmission , high-speed playback, low-speed playback, the computer can perform countless deductions for this period of time, with the characteristics of data packet noise, data packet length, data packet combination, combination of specific bytes of data packets, etc.

some examples

when you and your friends are talking in the room, there is loud music and sounds in the background. a person outside the door can record it and play it back repeatedly. after eliminating the interference through technology, extract the content of your conversation as much as possible.
when you are watching an adult/porn movie, you have encountered a hateful mosaic, but you can remove the mosaic by imagining what is under the mosaic, or by deep learning.

28

Hello guys, the message I posted above will be deleted on 22/10/10 00:00:00+8 to prevent unnecessary trouble for me. I am sending this because the authorities have used many, many government annual budgets for covid-19 prevention and control, and they have no funds left in their coffers, so we civil servants and employees of state-owned enterprises have been paid less, and for more than a year we have been paid only a minimum salary of about $400 per month, which is not enough to cover the cost of living. Since August we were notified to stop paying salaries temporarily, no more monthly salaries will be paid together until sometime in 2023. Mind you, this was a stellar job that paid over $12,000 per month in 2019, which is a very well paying job locally. But now I have nothing left and I have to make monthly payments on my house loan. So I got angry and told you guys something I shouldn't have disclosed.
