File downloads are truncated on Three Broadband

electricworry
Active

I'm looking for some verification about whether the issue I have is isolated to me, my area, or if it's a general Three-wide problem as I think it is.

I use Three 5G broadband and I'm about 50 metres away from the gNodeB so I've got excellent uninterrupted signal. It's not a Layer 1 problem I'm facing. The problem I have is that TCP connections are terminated prematurely (i.e. a RST packet is sent) before all data is received. Here's a simple test to verify if you have the problem or not.
The following command will attempt to download an 8MiB file (all NULs) from a website in AWS. It should work the same on Linux, macOS, and modern Windows computers. For me, it fails with the error "curl: (18) transfer closed with XXXXXX bytes remaining to read", which is the problem.

curl -H "Connection: close" https://electricworry.net/test-8 -o test-curl

If you're not comfortable connecting to my server, the following third-party download test should produce the same result (it does for me!):

curl -H "Connection: close" https://files.testfile.org/ZIPC/15MB-Corrupt-Testfile.Org.zip -o test-curl
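
If curl's error message is not visible (for example when the download runs inside a script), the size of the saved file can be checked instead. This is only a minimal sketch: the function name check_size is made up for illustration, and the expected size assumes the first test file is exactly 8 MiB (8 * 1024 * 1024 bytes). The size of the third-party file is not stated here, so it would have to be looked up separately.

```shell
#!/bin/sh
# Assumed size of the 8 MiB test file, in bytes (8 * 1024 * 1024).
EXPECTED=8388608

# check_size FILE EXPECTED_BYTES  (hypothetical helper)
# Prints "complete" when FILE has exactly EXPECTED_BYTES bytes,
# otherwise prints how much arrived and returns 1.
check_size() {
    actual=$(wc -c < "$1" | tr -d ' ')
    if [ "$actual" -eq "$2" ]; then
        echo "complete"
    else
        echo "truncated: got $actual of $2 bytes"
        return 1
    fi
}

# Usage (hits the network, so left commented out):
# curl -H "Connection: close" https://electricworry.net/test-8 -o test-curl
# check_size test-curl "$EXPECTED"
```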

When I tested, I collected a packet capture at both sides and I can see that my server sends the whole 8MiB file in the TLS session and then terminates the connection with a RST packet at the end (which it does because we sent a "Connection: close" header). However on my client side, only half of the file comes through before the session is impolitely terminated.

Would people on Three 5G broadband mind testing please to help confirm/deny whether this is a general problem or an individual one?

I've done a lot of testing over the past month and I've got a hypothesis.

  • Comparing the server and client packet captures, the packets do not match up; the sequence and ack numbers - though they start the same - end up being very different. It appears that something in the middle is buffering the stream and ACKing the packets on my behalf.
  • The problem only happens when I'm on my Three 5G Broadband service. If I take my laptop into work, the problem is gone. The problem doesn't occur when I use my Giffgaff mobile as a hotspot either.
  • The problem exists on all websites (I suffer *a lot* from Ubuntu APT packages being half-downloaded and rejected on my workstation).
  • Since the clocks on my server and client are synchronised as closely as possible with NTP, I can compare the progress of the stream at both sides. When my server has finished transmitting (and received the final ACK) it correctly sends a RST packet according to the standard. However, at that same time the client has not received the whole stream (we're about half-way) and I certainly haven't sent an ACK for it. Then a RST comes in, tearing down the session before it's finished and truncating the download.
  • The problem only happens if "Connection: close" header is used. If "Connection: keep-alive" is used, then it's the responsibility of the client to terminate the connection once it's done. In this case, no problem! However, a lot of things don't use that. A web browser generally uses keep-alive for efficiency - hence 99% of users won't encounter or know about the problem - but a lot of systems (e.g. APT, Ansible) will use "close", which is why it's such a problem for me in my work.
  • Changing APN and PDP type in the router has zero impact; it doesn't matter whether I'm using IPv4, IPv6, IPv4v6, APN "3internet", "3secure", or "three.co.uk". The problem for me is general.
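
The "close" versus "keep-alive" comparison in the list above can be repeated with a small script. This is a sketch under assumptions, not the exact procedure used for the tests described here: the helper names describe_exit and compare_headers are invented, and the only curl behaviour relied on is that exit status 18 means a partial file, matching the "curl: (18)" error quoted earlier.

```shell
#!/bin/sh
# describe_exit CODE  (hypothetical helper)
# Translate a curl exit status into a short verdict.
describe_exit() {
    case "$1" in
        0)  echo "ok" ;;
        18) echo "truncated" ;;          # curl exit 18 = partial file
        *)  echo "other error ($1)" ;;
    esac
}

# compare_headers URL  (hypothetical helper)
# Fetch URL once per Connection header value and report the outcome.
compare_headers() {
    for mode in close keep-alive; do
        rc=0
        curl -s -H "Connection: $mode" "$1" -o /dev/null || rc=$?
        echo "Connection: $mode -> $(describe_exit "$rc")"
    done
}

# Usage (hits the network, so left commented out):
# compare_headers https://electricworry.net/test-8
```

On an affected connection the expectation, going by the description above, would be "truncated" for close and "ok" for keep-alive; on an unaffected one both should be "ok".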

Ultimately, my hypothesis is that Three have some sort of connection buffering to optimise the user experience or maybe to prevent wasted re-transmissions, but there's a glaring bug in it whereby it resets the connection and discards the buffer it holds for the session once the server has closed the connection. This would make sense for an ISP based solely on a Radio Access Network, because if clients exist in grey spots where the connection can go down momentarily much of the time, it is helpful to buffer the lost packets for the clients rather than have the server spamming their link with retries of the unACKd packets (and further polluting the radio waves). So I think Three ACKing the packets on my behalf is by design. Only the implementation is bad, and it mistakenly assumes it can throw away the buffer when the server terminates the connection.

Any help/testing/solidarity would be much appreciated because Three technical support have been zero help since I raised it with them over a month ago. I sent over detailed evidence, but all they can muster is a call occasionally to incorrectly restate the problem and ask if I'm still having it. Really awful experience; I've never seen a team so completely unable to escalate to responsible people who might actually be able to help eventually.

35 REPLIES
jr0
Rising star

The TCP session/stream is between the client and the server. Even if you had a load balancer or a proxy in the middle, I would say it would still replicate/forward the client and server answers/packets (in other words, it would not decide to tear down the connection on its own). However, a firewall could tear down a connection/stream.

Question: at TCP level, do you see the packets arriving in order or out of order at the client? Check the TCP sequence numbers, or if using Wireshark you can use "Follow TCP Stream".

I'm just wondering if the teardown arrives first (before the rest of the download) and the client then correctly assumes the socket/stream/resource is unusable. I'm also wondering whether the teardown is sent independently and doesn't really have a sequence number.

electricworry
Active

Hi @jr0. You're potentially veering off into not helpful territory here because we can get excited about the specifics of what's happening but I don't want to fill this topic with an epic amount of data when Three haven't even acknowledged a problem yet - the test cases that I and @TheEnglishman have provided should be enough to get some movement. Remember this happens 100% of the time for me on Three with the ZTE router and never over my Giffgaff mobile hotspot, my old BT FTTC broadband, at work etc.

However, to give proper respect to your questions I will try and answer as best as I can:

  • At the TCP level I do sometimes see packets out of order. However that's not really enough to cause a problem as my OS will park them until the packet with the correct sequence comes in. They are sufficiently in order that it's not an area of doubt.
  • More interestingly the acknowledgements the server sees are very different to the acknowledgements I send. For example, the sequence numbers start the same at both ends (server sends packet seq=1309 len=1153, and I send ack=2462). The server sees the same coming in.
    But later on, after more packets, I send an ack=3928. The server got completely different acks: 2620, 5236, etc. The window values are completely different too; my client OS starts with a modest window value of 64k, but immediately the server receives a window of around 4MB. Something in the middle is heavily mangling Layer 4. If you could see the packet captures from both sides you would agree something in the middle is mangling/aggregating it.
  • Regarding your hypothesis that out of order packets could be part of the problem, remember that the SERVER only sends the reset packet once it's completed sending data AND it's received the ACK from the CLIENT. It has to wait for that ACK because further retransmissions will be required if an ACK isn't forthcoming.
    Indeed it has received an ACK for that data, just not from me! Because I've only ACKd half of the download and suddenly a RST comes in tearing it down. And that RST packet the CLIENT receives is not some magical out of order packet from the future. Its Seq number is about half of what it should be, and it also has different TCP flags set. (The SERVER sends FIN, ACK, and the CLIENT receives FIN, PSH, ACK.)
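
The ACK mismatch described in the list above can be checked mechanically once the ACK numbers from each capture are in a text file, one per line. The sketch below is illustrative only: foreign_acks is a made-up helper, and the capture and output file names in the commented usage are hypothetical. It also assumes both captures include the start of the connection, so that the relative numbers shown by the capture tool line up.

```shell
#!/bin/sh
# foreign_acks CLIENT_ACKS SERVER_ACKS  (hypothetical helper)
# Each input file holds one ACK number per line.
# Prints the ACK numbers the server received that the client never sent.
foreign_acks() {
    c=$(mktemp)
    s=$(mktemp)
    sort -u "$1" > "$c"
    sort -u "$2" > "$s"
    # comm -13 keeps only the lines unique to the second (server) file.
    comm -13 "$c" "$s"
    rm -f "$c" "$s"
}

# Usage (file names are placeholders; extract the client-to-server ACKs
# from each capture first, for example with tshark):
# tshark -r client.pcap -Y 'tcp.dstport == 443' -T fields -e tcp.ack > client-acks.txt
# tshark -r server.pcap -Y 'tcp.dstport == 443' -T fields -e tcp.ack > server-acks.txt
# foreign_acks client-acks.txt server-acks.txt
```

Lost ACKs alone would make the server's list a subset of the client's, so any output here points at something in the path generating ACKs of its own.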

I'll not get into a debate about whether it's only a firewall that would engage in this sort of behaviour and a load-balancer would not, mainly because I don't know. However, from networking we know anything is possible. If you want to see example packet captures I am happy to send them - you will agree something is seriously wrong - but I don't want to discuss any further here as the furthest I've gotten with Three support is pretty much "everything is ok, but we'll call you back once a week to ask if you've still got a problem".

jr0
Rising star

I still recall my TCP/IP lectures, and I have had similar problems where TCP streams were terminated by neither the server nor the client, because the TCP timeouts were hours whereas the TCP timeout on the firewall was minutes.

Is your server hosted in a cloud service or under the desk and using BT broadband?

You have seen that I tried and succeeded in downloading WP plugins through my 3 hotspot with no problems. And like englishman and you said, it looks like the problem might be on the 5G router/hub.

Have you tried to put the Three SIM into your phone and create a hotspot and retest?

electricworry
Active

@jr0 wrote:

I still recall my TCP/IP lectures, and I have had similar problems where TCP streams were terminated by neither the server nor the client, because the TCP timeouts were hours whereas the TCP timeout on the firewall was minutes.

Is your server hosted in a cloud service or under the desk and using BT broadband?


My server is hosted on AWS. I do not have any other servers to test on. We can find other servers that have the same issue, but we might not know where they're hosted (e.g. the WordPress example).

Yep, you can have a firewall that decides to kill a TCP connection because of a timeout. Generally that happens if there's no data flowing and it can't keep arbitrary connections open forever (since its session table is finite). That's not relevant here as data is flowing, and I've shown that when it's the server that closes the connection ("Connection: close") the stream is truncated, but when the server waits for the client to do the closing ("Connection: keep-alive") it's fine.


You have seen that I tried and succeeded in downloading WP plugins through my 3 hotspot with no problems. And like englishman and you said, it looks like the problem might be on the 5G router/hub.


Yes, I can see you're unaffected.


Have you tried to put the Three SIM into your phone and create a hotspot and retest?


I have not. It's something I could do and I may at some point out of curiosity. It really is a good idea and I thank you for the suggestion! However, I do not have time right now and I would really like Three to engage on the issue. Even if it is the ZTE router (which is more and more plausible now we've been discussing it) I still need a router that works.

jr0
Rising star

Well, I didn't use curl nor set the HTTP header to "Connection: close", as browsers don't do that... I'm still wondering why you are enforcing this option when the default response from most servers seems to be to time out the connection...
When I asked earlier about the TCP order, I don't think I explained myself in the right way. I know the OS will reassemble the packets in the right order, but what I meant is: imagine the connection is being closed prematurely on the client side (e.g. because of packet loss or retransmission...).
Do you see the client (curl) acknowledge the TLS Close Notify by sending its own Close Notify after receiving all the data (the file)?
I'm looking at the curl man page; have you tried the --retry and -C options? As it stands, curl will abort the download rather than trying to recover.

Does the ZTE router support IP Passthrough Mode, and/or can you disable its firewall?

...above are just ideas that I would try...
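
As a rough illustration of the -C idea, a wrapper can keep resuming a truncated download until curl reports success. This is a workaround sketch, not a fix for the underlying problem, and it rests on assumptions: resume_download is an invented name, the default of 5 attempts is arbitrary, and resuming with "-C -" only works when the server supports range requests.

```shell
#!/bin/sh
# resume_download URL OUTFILE [MAX_TRIES]  (hypothetical helper)
# Re-runs curl until it exits successfully or MAX_TRIES is reached.
resume_download() {
    url=$1
    out=$2
    tries=${3:-5}
    n=1
    while [ "$n" -le "$tries" ]; do
        # "-C -" asks curl to continue from the current size of OUTFILE.
        if curl -s -H "Connection: close" -C - "$url" -o "$out"; then
            echo "done after $n attempt(s)"
            return 0
        fi
        n=$((n + 1))
    done
    echo "gave up after $tries attempts"
    return 1
}

# Usage (hits the network, so left commented out):
# resume_download https://electricworry.net/test-8 test-curl
```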

electricworry
Active

@jr0 wrote:

Well, I didn't use curl nor set the HTTP header to "Connection: close", as browsers don't do that... I'm still wondering why you are enforcing this option when the default response from most servers seems to be to time out the connection...


The curl example is demonstrative. There's a lot of stuff out there that isn't a web browser and that will use the "Connection: close" header by default. And there's a lot of stuff that doesn't send the header at all by default, but in the case that @TheEnglishman sent (downloading from the WordPress server) the server forces "Connection: close". I'm not being contrary with the example; it's a simple proof.

If I had said, "take this Ansible playbook to install a package and it fails", then people would assume that there's something wrong with Ansible because "look my browser works".

The problem is with Three (or the router they supply). I've demonstrated that and @TheEnglishman has confirmed it. Please stop.