If you start with the problem of how to create a reliable stream of data on top of an unreliable datagram layer, then the solution that comes out will look virtually identical to TCP. It just is the right solution for the job.
The three drawbacks of the original TCP algorithm were the window size (the maximum value is just too small for today's speeds), poor handling of missing packets (addressed by extensions such as selective ACK), and the fact that it manages only one stream at a time, while some applications want multiple streams that don't block each other. You could use multiple TCP connections, but that adds its own overhead, so SCTP and QUIC were designed to address those issues.
The congestion control algorithm is not part of the on-the-wire protocol, it's just some code on each side of the connection that decides when to (re)send packets to make the best use of the available bandwidth. Anything that implements a reliable stream on top of datagrams needs to implement such an algorithm. The original ones (Reno, Vegas, etc) were very simple but already did a good job, although back then network equipment didn't have large buffers. A lot of research is going into making better algorithms that handle large buffers, large roundtrip times, varying bandwidth needs and also being fair when multiple connections share the same bandwidth.
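To make that concrete, here's a toy sketch (Python, made-up constants) of the additive-increase/multiplicative-decrease loop at the heart of Reno-style congestion control. Real stacks layer slow start thresholds, fast retransmit, pacing, etc. on top, but the core feedback loop is about this simple:

    # Toy AIMD loop: cwnd is how many segments we may have in flight.
    # Probe upward until loss, then back off multiplicatively.
    cwnd = 10.0            # initial congestion window, in segments
    ssthresh = 64.0        # slow-start threshold, in segments

    def on_ack():
        global cwnd
        if cwnd < ssthresh:
            cwnd += 1.0            # slow start: +1 per ACK, doubles each RTT
        else:
            cwnd += 1.0 / cwnd     # congestion avoidance: ~+1 per RTT

    def on_loss():
        global cwnd, ssthresh
        ssthresh = max(cwnd / 2, 2.0)  # multiplicative decrease
        cwnd = ssthresh                # simplified Reno-style recovery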
I'll take flak for saying it, but I feel web developers are partially at fault for laziness on this one. I've often seen them trigger a swath of connections (e.g. for uncoordinated async events), when carefully managed multiplexing over one or a handful will do just fine.
E.g. in prehistoric times I wrote a JavaScript library that let you queue up several downloads over one stream, with control over prioritization and cancelability.
It was used in a GreaseMonkey script on a popular dating website, to fetch thumbnails and other details of all your matches in the background. Hovering over a match would bring up all their photos, and if some hadn't been retrieved yet they'd immediately move to the top of the queue. I intentionally wanted to limit the number of connections, to avoid oversaturating the server or the user's bandwidth. Idle time was used to prefetch all matches on the page (IIRC in a sensible order responsive to your scroll location). If you picked a large enough pagination, then stepped away to top up your coffee, by the time you got back you could browse through all of your recent matches instantly, without waiting for any server roundtrip lag.
It was pretty slick. I realize these days modern stacks give you multiplexing for free, but to put in context this was created in the era before even JQuery was well-known.
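(For flavor, the core trick looks something like this toy Python sketch; the original was JavaScript, and every name here is hypothetical. A single worker drains a priority queue whose entries can be re-prioritized at any time, which is how a hover could jump a match's photos ahead of the idle prefetcher:)

    import heapq, itertools, threading

    class FetchQueue:
        """Toy re-prioritizable download queue over one connection."""

        def __init__(self, fetch):
            self.fetch = fetch                   # callable(url) -> bytes
            self.heap = []                       # [priority, seq, url] entries
            self.entries = {}                    # url -> live heap entry
            self.counter = itertools.count()
            self.lock = threading.Lock()

        def enqueue(self, url, priority=10):     # lower number = sooner
            with self.lock:
                entry = [priority, next(self.counter), url]
                self.entries[url] = entry
                heapq.heappush(self.heap, entry)

        def bump(self, url):
            # Jump the queue, e.g. when the user hovers over a match.
            with self.lock:
                old = self.entries.get(url)
                if old:
                    old[2] = None                # tombstone the old entry
            self.enqueue(url, priority=0)

        def run(self):
            # Single worker = a single connection's worth of requests.
            while True:
                with self.lock:
                    if not self.heap:
                        return
                    _, _, url = heapq.heappop(self.heap)
                    if url is not None:
                        self.entries.pop(url, None)
                if url is not None:              # skip tombstoned entries
                    self.fetch(url)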
Funny story, I shared it with one of my matches and she found it super useful but was a bit surprised that, in a way, I was helping my competition. Turned out OK... we're still together nearly two decades later and now she generously jokes I invented Tinder before it was a thing.
Sure, you can reimplement multiplexing on the application level, but it just makes more sense to do it on the transport level, so that people don't have to do it in JavaScript.
This is wonderful to hear.
I have a naive question. Is this the reason most websites/web servers absolutely need CDNs (apart from their edge capabilities): because CDNs understand caching much better than a web developer does? But I would think the person closer to the user's access patterns would know the optimal caching strategy.
CDNs became popular back in the old days, when some people thought that if two websites were using jquery-1.2.3.min.js, a CDN could cache it and the second site would load quicker. These days, browsers don't do that; they'll ignore cached assets from other websites because it somehow helps to protect user privacy, and they value privacy over performance in this case.
There are some reasons CDNs might be helpful. Edge capability probably is the most important one. Another reason is that serving lots of static data might be a complicated task for a small website, so it makes sense to offload it to a specialised service. These days, CDNs went beyond static data. They can hide your backend, so public user won't know its address and can't DDoS it. They can handle TLS for you. They can filter bots, tor and people from countries you don't like. All in a few clicks in the dashboard, no need to implement complicated solutions.
But nothing you couldn't write yourself in a few days, really.
[Not a web dev but] I thought each site gets a handful of connections (4) to each host and more requests would have to wait to use one of them. That's pretty close to what I'd want with a reasonably fast connection.
That's basically right. Back when I made this, many servers out there still limited you to just 2 (or sometimes even 1) concurrent connections. As sites became more media-heavy that number trended up. HTTP/2 can handle many concurrent streams on one connection; I'm not sure if you get as fine-grained control as with the library I wrote (maybe!).
> If you start with the problem of how to create a reliable stream of data on top of an unreliable datagram layer, then the solution that comes out will look virtually identical to TCP.
I'll add that at the time of TCP's writing, the telephone people far outnumbered everyone else in the packet switching vs circuit switching debate. TCP gives you a virtual circuit over a packet switched network as a pair of reliable-enough independent byte streams over IP. This idea, that the endpoints could implement reliability through retransmission, came from an earlier French network, Cyclades, and ends up being a core principle of IP networks.
We're still "suffering" from the latency and jitter effects of the packet switching victory. (The debate happened before my time, and I don't know if I would have really agreed with circuit switching.) Latency and jitter on the modern Internet are very "best effort", emphasis on "effort".
True, but with circuit switching, we'd probably still be paying by the minute, so most of these jittery/bufferbloated connections would not exist in the first place.
Also, circuit switching is harder (well, more expensive) to do at scale, especially with different providers (probably a reason the traditional telecoms pushed it so hard - to protect their traditional positions). Even modern circuit technologies like MPLS are mostly contained to within a network (though there can be and is cross-networking peering) and aren't as connection oriented as previous circuits like ATM or Frame Relay.
Circuit switching is not harder to do, it's simply less efficient. In the PSTN and ISDN world, circuits consumed bandwidth whether or not they were actively in use. There was no statistical multiplexing as a result.
Circuit switching packets means carrying metadata about the circuit rather than simply using the destination MAC or IP address to figure out routing along the way. ATM took this to an extreme, with nearly 10% protocol overhead (48 bytes of payload in a 53-byte cell) and 22 bytes of wasted space in the last ATM cell for a 1500-byte Ethernet packet. That inefficiency is what really hurt. Sadly the ATM legacy lives on in GPON and XGSPON, which as a result require gobs of memory per port for frame reassembly (128 ONUs x 8 priorities x 9KB for jumbo frames = 9MB per port, worst case); EPON / 10GEPON avoid this and are far better protocols.
MPLS also has certain issues that are solved by using the IPv6 next header feature which avoids having to push / pop headers (modifying the size of the packet which has implications for buffering and the associated QoS issues making the hardware more complex) in the transport network. MPLS labels made sense at the time of introduction in the early 2000s when transport network hardware was able to utilize a small table to look up the next hop of a frame instead of doing a full route lookup. The hardware constraints of those early days requiring small SRAMs have effectively gone away since modern ASICs have billions of transistors which make on chip route tables sufficient for many use-cases.
The telephone people were basically right with their criticisms of TCP/IP such as:
What about QoS? Jitter, bandwidth, latency, fairness guarantees? What about queuing delay? What about multiplexing and tunneling? Traffic shaping and engineering? What about long-haul performance? Easy integration with optical circuit networks? etc. ATM addressed these issues, but TCP/IP did not.
All of these things showed up again once you tried to do VOIP and video conferencing, and in core ISPs as well as access networks, and they weren't (and in many cases still aren't) easy to solve.
If that is true, then why did the telcos rapidly move the entire backbone of the telephone network to IP in the 1990s?
And why are they trying to persuade regulators to let them get rid of the remaining (peripheral) part of the old circuit-switched network, i.e., to phase out old-school telephone hardware, requiring all customers to have IP phone hardware?
They moved to IP because it was improving faster in speed and commoditization vs. ATM. But in order to make it work, they had to figure out how to make QoS work on IP networks, which wasn't easy. It still isn't easy (see: crappy zoom calls.)
Modern circuit switched networks use optics rather than the legacy copper circuits which date back to telegraphy.
Packet switching is cheaper; even though it can't make any guarantees about latency and bandwidth the way circuit switching could, it uses scarce long-haul bandwidth more efficiently. I regularly see people falling off video calls, like, multiple times a week. So, in some ways, it's a worse product, but costs much less.
You can criticize something and still select it as the best option. I do this daily with Apple. If you can’t find a flaw in a technical solution you probably aren’t looking close enough.
TCP has another unfixable flaw - it cannot be properly secured. Writing a security layer on top of TCP can at most detect, not avoid, attacks.
It is very easy for a malicious actor anywhere in the network to inject data into a connection. By contrast, it is much harder for a malicious actor to break the legitimate traffic flow ... except for the fact that TCP RST grants any rando the power to upgrade "inject" to "break". This is quite common in the wild for any traffic that does not look like HTTP, even when both endpoints are perfectly healthy.
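(To illustrate how low the bar is, here's a sketch with scapy, a real packet-crafting library; every address, port, and sequence number below is hypothetical. An attacker who can see, or successfully guess, the connection's 4-tuple and an in-window sequence number can kill the connection with one packet. Needs root:)

    from scapy.all import IP, TCP, send  # pip install scapy

    # Forged RST for someone else's connection. If seq lands inside the
    # victim's receive window, the kernel tears the connection down.
    rst = IP(src="192.0.2.10", dst="203.0.113.7") / TCP(
        sport=443,          # must match the victim connection's 4-tuple
        dport=51312,
        flags="R",
        seq=0x1A2B3C4D,     # on-path attackers just copy this from traffic
    )
    send(rst, verbose=False)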
Blocking TCP RST packets using your firewall will significantly improve reliability, but this still does not protect you from more advanced attackers who cause a desynchronization with forged sequence numbers carrying a nonempty payload.
As a result, it is mandatory for every application to support a full-blown "resume on a separate connection" operation, which is complicated and hairy and also immediately runs into the additional flaw that TCP is very slow to start.
---
While not an outright flaw, I also think it has become clear by now that it is highly suboptimal for "address" and "port" to be separate notions.
"... some applications want multiple streams that don't block each other. You could use multiple TCP connections, but that adds its own overhead, so SCTP and QUIC were designed to address those issues."
Other applications work just fine with a single TCP connection
If I am using TCP for DNS, for example, and I am retrieving data from a single host such as a DNS cache, I can send multiple queries over a single TCP connection and receive multiple responses over the same single TCP connection, out of order. No blocking.^1 If the cache (application) supports it, this is much faster than receiving answers sequentially, and it's more efficient and polite than opening multiple TCP connections
1. I do this every day outside the browser with DNS over TLS (DoT) using something like streamtcp from NLNet Labs. I'm not sure that QUIC is faster, server support for QUIC is much more limited, but QUIC may have other advantages
I also do it with DNS over HTTPS (DoH), outside the browser, using HTTP/1.1 pipelining, but there I receive answers sequentially. I'm still not convinced that HTTP/2 is faster for this particular use case, i.e., downloading data from a single host using multiple HTTP requests (compared to something like integrating online advertising into websites, for example)
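(For the curious, here's roughly what that pipelining looks like in Python with dnspython, a real library; the resolver address is an example. RFC 1035 frames each DNS message over TCP with a 2-byte length prefix, and the 16-bit query ID is what lets you match out-of-order answers back to queries:)

    import socket, struct
    import dns.message  # dnspython

    names = ["example.com", "example.net", "example.org"]
    queries = {q.id: q for q in (dns.message.make_query(n, "A") for n in names)}

    s = socket.create_connection(("192.0.2.53", 53))  # example resolver
    for q in queries.values():             # fire off all queries at once
        wire = q.to_wire()
        s.sendall(struct.pack("!H", len(wire)) + wire)

    def read_exact(sock, n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise EOFError("stream closed mid-message")
            buf += chunk
        return buf

    for _ in names:                        # answers may arrive in any order
        (length,) = struct.unpack("!H", read_exact(s, 2))
        resp = dns.message.from_wire(read_exact(s, length))
        print(resp.id, resp.answer)        # match back to a query by ID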
> I can send multiple queries over a single TCP connection and receive multiple responses over the same single TCP connection, out of order.
This is because DoT allows the DNS server to resolve queries concurrently and send query responses out of order.
However, this is an application layer feature, not a transport layer one. The underlying TCP packets still have to arrive in order and therefore are subject to blocking.
> I can send multiple queries over a single TCP connection and receive multiple responses over the same single TCP connection, out of order. No blocking.
You're missing the point. You have one TCP connection, and the server sends you response1 and then response2. Now if response1 gets lost or delayed due to network conditions, you must wait for response1 to be retransmitted before you can read response2. That is blocking, no way around it. It has nothing to do with advertising(?), and the other protocols mentioned don't have this drawback.
I work on an application that does a lot of high frequency networking in a TCP-like custom framework. Our protocol guarantees ordering per "channel", so you can send request1 on channel 1 and request2 on channel 2 and receive the responses in any order. (But if you send request1 and then request2 on the same channel, you'll get them back in order.)
It’s a trade off, and there’s a surprising amount of application code involved on the receiving side in the application waiting for state to be updated on both channels. I definitely prefer it, but it’s not without its tradeoffs.
Yeah, the fact that the congestion control algorithm isn't part of the wire protocol was very ahead of its time and, in retrospect, gave the protocol much-needed flexibility. OTOH a lot of college courses about TCP don't really emphasize this fact, and many people I've interacted with thought that TCP had a single defined congestion control algorithm.
> how to create a reliable stream of data on top of an unreliable datagram layer, then the solution that comes out will look virtually identical to TCP. It just is the right solution for the job
A stream of bytes made sense in the 1970s for remote terminal emulation. It still sort of makes sense for email, where a partial message is useful (though downloading headers in bulk followed by full message on demand probably makes more sense.)
But in 2025 much of communication involves messages that aren't useful if you only get part of them. It's also a pain to have to serialize messages into a byte stream and then deserialize the byte stream back into messages (see: gRPC etc.), and the byte stream ordering is costly, doesn't work well with multipathing, and doesn't provide much benefit if you are only delivering complete messages.
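Every application protocol ends up hand-rolling some variant of the same framing workaround. A minimal sketch (gRPC's wire format is essentially this plus a compression flag):

    import struct

    def send_message(sock, payload: bytes):
        # Length-prefix framing: 4-byte big-endian length, then the message.
        sock.sendall(struct.pack("!I", len(payload)) + payload)

    def recv_message(sock) -> bytes:
        # Re-discover the boundaries TCP erased: read the prefix, then
        # keep reading until the whole message has arrived.
        def read_exact(n):
            buf = b""
            while len(buf) < n:
                chunk = sock.recv(n - len(buf))
                if not chunk:
                    raise ConnectionError("stream closed mid-message")
                buf += chunk
            return buf
        (length,) = struct.unpack("!I", read_exact(4))
        return read_exact(length)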
TCP without congestion control isn't particularly useful. As you note, traditional TCP congestion control doesn't respond well to reordering. Also, TCP's congestion control traditionally doesn't distinguish between intentional packet drops (e.g. due to buffer overflow) and packet loss (e.g. due to corruption). This means, for example, that it can't be used directly over networks with wireless links (which is why wi-fi has its own link-layer retransmission).
TCP's traditional congestion control is designed to fill buffers up until packets are dropped, leading to undesirable buffer bloat issues.
TCP's traditional congestion control algorithms (additive increase/multiplicative decrease on drop) also have the poor property that your data rate tends to drop as RTT increases.
TCP wasn't designed for hardware offload, which can lead to software bottlenecks and/or increased complexity when you do try to offload it to hardware.
TCP's three-way handshake is costly for one-shot RPCs, and slow start means that short flows may never make it out of slow start, neutralizing benefits from high-speed networks.
TCP is also poor for mobility. A connection breaks when your IP address changes, and there is no easy way to migrate it. Most TCP APIs expose IP addresses at the application layer, which causes additional brittleness.
Additionally, TCP is poorly suited for optical/WDM networks, which support dedicated bandwidth (signal/channel bandwidth as well as data rate), and are becoming more important in datacenters and as interconnects for GPU clusters.
Poor for high speed connections (compared to when TCP was invented) or very unreliable connections.
When I started at university, the ftp speed from the US during daytime was 500 bytes per second! You don't have many unacknowledged packets in such a connection.
Back then even a 1 megabits/sec connection was super high speed and very expensive.
Might be obvious in hindsight, but it was not clear at all back then that congestion is manageable this way. There were legitimate concerns that it would all just melt down.
There are a lot of design alternatives possible to TCP within the "create a reliable stream of data on top of an unreliable datagram layer" space:
• Full-duplex connections are probably a good idea, but certainly are not the only way, or the most obvious way, to create a reliable stream of data on top of an unreliable datagram layer. TCP's predecessor NCP was half-duplex.
• TCP itself also supports a half-duplex mode—even if one end sends FIN, the other end can keep transmitting as long as it wants. This was probably also a good idea, but it's certainly not the only obvious choice.
• Sequence numbers on messages or on bytes?
• Wouldn't it be useful to expose message boundaries to applications, the way 9P, SCTP, and some SNA protocols do?
• If you expose message boundaries to applications, maybe you'd also want to include a message type field? Protocol-level message-type fields have been found to be very useful in Ethernet and IP, and in a sense the port-number field in UDP is also a message-type field.
• Do you really need urgent data?
• Do servers need different port numbers? TCPMUX is a straightforward way of giving your servers port names, like in CHAOSNET, instead of port numbers. It only creates extra overhead at connection-opening time, assuming you have the moral equivalent of file descriptor passing on your OS. The only limitation is that you have to use different client ports for multiple simultaneous connections to the same server host. But in TCP everyone uses different client ports for different connections anyway. TCPMUX itself incurs an extra round-trip time delay for connection establishment, because the requested server name can't be transmitted until the client's ACK packet, but if you incorporated it into TCP, you'd put the server name in the SYN packet. If you eliminate the server port number in every TCP header, you can expand the client port number to 24 or even 32 bits.
• Alternatively, maybe network addresses should be assigned to server processes, as in Appletalk (or IP-based virtual hosting before HTTP/1.1's Host: header, or, for TLS, before SNI became widespread), rather than assigning network addresses to hosts and requiring port numbers or TCPMUX to distinguish multiple servers on the same host?
• Probably SACK was actually a good idea and should have always been the default? SACK gets a lot easier if you ack message numbers instead of byte numbers.
• Why is acknowledgement reneging allowed in TCP? That was a terrible idea.
• It turns out that measuring round-trip time is really important for retransmission, and TCP has no way of measuring RTT on retransmitted packets, which can pose real problems for correcting a ridiculously low RTT estimate, which results in excessive retransmission. (See the RTO estimator sketch after this list.)
• Do you really need a PUSH bit? C'mon.
• A modest amount of overhead in the form of erasure-coding bits would permit recovery from modest amounts of packet loss without incurring retransmission timeouts, which is especially useful if your TCP-layer protocol requires a modest amount of packet loss for congestion control, as TCP does.
• Also you could use a "congestion experienced" bit instead of packet loss to detect congestion in the usual case. (TCP did eventually acquire CWR and ECE, but not for many years.)
• The fact that you can't resume a TCP connection from a different IP address, the way you can with a Mosh connection, is a serious flaw that seriously impedes nodes from moving around the network.
• TCP's hardcoded timeout of 5 minutes is also a major flaw. Wouldn't it be better if the application could set that to 1 hour, 90 minutes, 12 hours, or a week, to handle intermittent connectivity, such as with communication satellites? Similarly for very-long-latency datagrams, such as those relayed by single LEO satellites. Together this and the previous flaw have resulted in TCP largely being replaced for its original session-management purpose with new ad-hoc protocols such as HTTP magic cookies, protocols which use TCP, if at all, merely as a reliable datagram protocol.
• Initial sequence numbers turn out not to be a very good defense against IP spoofing, because that wasn't their original purpose. Their original purpose was preventing the erroneous reception of leftover TCP segments from a previous incarnation of the connection that have been bouncing around routers ever since; this purpose would be better served by using a different client port number for each new connection. The ISN namespace is far too small for current LFNs anyway, so we had to patch over the hole in TCP with timestamps and PAWS.
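On the RTT-measurement point above, a sketch of the standard estimator (RFC 6298-style smoothing, plus Karn's rule of discarding samples for retransmitted segments, which is exactly why a ridiculously low initial estimate is so slow to correct):

    class RtoEstimator:
        # RFC 6298-style retransmission timeout estimation, simplified.
        K, ALPHA, BETA = 4, 1 / 8, 1 / 4

        def __init__(self):
            self.srtt = None      # smoothed RTT
            self.rttvar = None    # RTT variance estimate
            self.rto = 1.0        # initial RTO, seconds

        def sample(self, rtt, was_retransmitted):
            if was_retransmitted:
                return self.rto   # Karn's rule: ambiguous sample, discard
            if self.srtt is None:                 # first measurement
                self.srtt, self.rttvar = rtt, rtt / 2
            else:                                 # update rttvar before srtt
                self.rttvar = (1 - self.BETA) * self.rttvar + \
                              self.BETA * abs(self.srtt - rtt)
                self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
            # 200 ms floor is a Linux-ism; the RFC says 1 second.
            self.rto = max(self.srtt + self.K * self.rttvar, 0.2)
            return self.rto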
• Full-duplex connections are probably a good idea, but certainly are not the only way, or the most obvious way, to create a reliable stream of data on top of an unreliable datagram layer. TCP itself also supports a half-duplex mode—even if one end sends FIN, the other end can keep transmitting as long as it wants. This was probably also a good idea, but it's certainly not the only obvious choice.
Much of that comes from the original applications being FTP and TELNET.
• Sequence numbers on messages or on bytes?
Bytes, because the whole TCP message might not fit in an IP packet. This is the MTU problem.
• Wouldn't it be useful to expose message boundaries to applications, the way 9P, SCTP, and some SNA protocols do?
Early on, there were some message-oriented, rather than stream-oriented, protocols on top of IP. Most of them died out. RDP was one such.
Another was QNet.[2]
Both still have assigned IP protocol numbers, but I doubt that a RDP packet would get very far across today's internet.
This was a real gap. TCP is not a great message-oriented protocol.
• Do you really need urgent data?
The purpose of urgent data is so that when your slow Teletype is typing away, and the recipient wants it to stop, there's a way to break in. See [1], p. 8.
• It turns out that measuring round-trip time is really important for retransmission, and TCP has no way of measuring RTT on retransmitted packets, which can pose real problems for correcting a ridiculously low RTT estimate, which results in excessive retransmission.
Yes, reliable RTT is a problem.
• Do you really need a PUSH bit? C'mon.
It's another legacy thing to make TELNET work on slow links. Is it even supported any more?
• A modest amount of overhead in the form of erasure-coding bits would permit recovery from modest amounts of packet loss without incurring retransmission timeouts, which is especially useful if your TCP-layer protocol requires a modest amount of packet loss for congestion control, as TCP does.
• Also you could use a "congestion experienced" bit instead of packet loss to detect congestion in the usual case. (TCP did eventually acquire CWR and ECE, but not for many years.)
Originally, there was ICMP Source Quench for that, but Berkeley didn't put it in BSD, so nobody used it. Nobody was sure when to send it or what to do when it was received.
• The fact that you can't resume a TCP connection from a different IP address, the way you can with a Mosh connection, is a serious flaw that seriously impedes nodes from moving around the network.
That would require a security system to prevent hijacking sessions.
> The fact that you can't resume a TCP connection from a different IP address, the way you can with a Mosh connection, is a serious flaw that seriously impedes nodes from moving around the network
This 100% !! And basically the reason mosh had to be created in the first place (and it probably wasn't easy.) Unfortunately mosh only solves the problem for ssh. Exposing fixed IP addresses to the application layer probably doesn't help either.
So annoying that TCP tends to break whenever you switch wi-fi networks or switch from wi-fi to cellular. (On iPhones at least you have MPTCP, but that requires server-side support.)
AppleTalk didn't get much love for its broadcast (or possibly multicast?) based service discovery protocol - but of course that is what inspired mDNS. I believe AppleTalk's LAN addresses were always dynamic (like 169.x IP addresses), simplifying administration and deployment.
I tend to think that one of the reasons linux containers are needed for network services is that DNS traditionally only returns an IP address (rather than address + port) so each service process needs to have its own IP address, which in linux requires a container or at least a network namespace.
AppleTalk also supported a reliable transaction (basically request-response RPC) protocol (ATP) and a session protocol, which I believe were used for Mac network services (printing, file servers, etc.) Certainly easier than serializing/deserializing byte streams.
Does "session protocol" mean that it provided packet retransmission and reordering, like TCP? How does that save you serializing and deserializing byte streams?
I agree that, given the existing design of IP and TCP, you could get much of the benefit of first-class addresses for services by using, for example, DNS-SD, and that is what ZeroConf does. (It is not a coincidence that the DNS-SD RFC was written by a couple of Apple employees.) But, if that's the way you're going to be finding endpoints to initiate connections to, there's no benefit to having separate port numbers and IP addresses. And IP addresses are far scarcer than just requiring a Linux container or a network namespace: there are only 2³² of them. But it is rare to find an IP address that is listening on more than 64 of its 2¹⁶ TCP ports, so in an alternate history where you moved those 16 bits from the port number to the IP address, we would have one thousandth of the IP-address crunch that we do.
Historically, possibly the reason that it wasn't done this way is that port numbers predated the DNS by about 10 years.
Mockapetris's DNS RFCs are from 01983, although I think I've talked to people who installed DNS a year or two before that. Port numbers were first proposed in RFC 38 in 01970 https://datatracker.ietf.org/doc/html/rfc38
> The END and RDY must specify relevant sockets in addition to the link number. Only the local socket name need be supplied
> Connections are named by a pair of sockets. Sockets are 40 bit names which are known throughout the network. Each host is assigned a private subset of these names, and a command which requests a connection names one socket which is local to the requesting host and one local to the receiver of the request.
> Sockets are polarized; even numbered sockets are receive sockets; odd numbered ones are send sockets. One of each is required to make a connection.
In RFC 129 in 01971 we see discussion about whether socketnames should include host numbers and/or user numbers, still with the low-order bit indicating the socket's gender (emissive or receptive). https://datatracker.ietf.org/doc/html/rfc129
RFC 147 later that year https://datatracker.ietf.org/doc/html/rfc147 discusses within-machine port numbers and how they should or should not relate to the socketnames transmitted in NCP packets:
> Previous network papers postulated that a process running under control of the host's operating system would have access to a number of ports. A port might be a physical input or output device, or a logical I/O device (...)

> A socket has been defined to be the identification of a port for machine to machine communication through the ARPA network. Sockets allocated to each host must be uniquely associated with a known process or be undefined. The name of some sockets must be universally known and associated with a known process operating with a specified protocol. (e.g., a logger socket, RJE socket, a file transfer socket). The name of other sockets might not be universally known, but given in a transmission over a universally known socket (e.g. the socket pair specified by the transmission over the logger socket under the Initial Connection Protocol (ICP)). In any case, communication over the network is from one socket to another socket, each socket being identified with a process running at a known host.
RFC 167 the same year https://datatracker.ietf.org/doc/html/rfc167 proposes that socketnames not be required to be unique network-wide but just within a host. It also points out that you really only need the socketname during the initial connection process, if you have some other way of knowing which packets belong to which connections:
> Although fields will be helpful in dealing with socket number allocation, it is not essential that such field designations be uniform over the network. In all network transactions the 32-bit socket number is handled with its 8-bit host number. Thus, if hosts are able to maintain uniqueness and repeatability internally, socket numbers in the network as a whole will also be unique and repeatable. If a host fails to do so, only connections with that offending host are affected.

> Because the size, use, and character of systems on the network are so varied, it would be difficult if not impossible to come up with an agreed upon particular division of the 32-bit socket number. Hosts have different internal restrictions on the number of users, processes per user, and connections per process they will permit.

> It has been suggested that it may not be necessary to maintain socket uniqueness. It is contended that there is really no significant use made of the socket number after a connection has been established. The only reason a host must now save a socket number for the life of a connection is to include it in the CLOSE of that connection.
> Initial Connection will be as per the Official Initial Connection Protocol, Documents #2, NIC 7101, to a standard socket not yet assigned. A candidate socket number would be socket #5.
> I would like to collect information on the use of socket numbers for "standard" service programs. For example Loggers (telnet servers) Listen on socket 1. What sockets at your host are Listened to by what programs?

> Recently Dick Watson suggested assigning socket 5 for use by a mail-box protocol (RFC196). Does any one object ? Are there any suggestions for a method of assigning sockets to standard programs? Should a subset of the socket numbers be reserved for use by future standard protocols?

> Please phone or mail your answers and comments to (...)
Amusingly in retrospect, Postel did not include an email address, presumably because they didn't have email working yet.
FTP's assignment to port 3 was confirmed in RFC 265 in November:
> Socket 3 is the standard preassigned socket number on which the cooperating file transfer process at the serving host should "listen". The connection establishment will be in accordance with the standard initial connection protocol, establishing a full-duplex connection.
> I propose that there be a czar (me ?) who hands out official socket numbers for use by standard protocols. This czar should also keep track of and publish a list of those socket numbers where host specific services can be obtained. I further suggest that the initial allocation be as follows:
    Sockets    Assignment
    0-63       Network wide standard functions
    64-127     Host specific functions
    128-239    Reserved for future use
    240-255    Any experimental function
> and within the network wide standard functions the following particular assignment be made:
So, internet port numbers in their current form are from 01971 (several years before the split between TCP and IP), and DNS is from about 01982.
In December of 01972, Postel published RFC 433 https://www.rfc-editor.org/rfc/rfc433.html, obsoleting the RFC 349 list with a list including chargen and some other interesting services:
    Socket  Assignment
    1       Telnet
    3       File Transfer
    5       Remote Job Entry
    7       Echo
    9       Discard
    19      Character Generator [e.g. TTYTST]
    65      Speech Data Base @ ll-tx-2 (74)
    67      Datacomputer @ cca (31)
    241     NCP Measurement
    243     Survey Measurement
    245     LINK
The gap between 9 and 19 is unexplained.
RFC 503 https://www.rfc-editor.org/rfc/rfc503.html from 01973 has a longer list (including systat, datetime, and netstat), but also listing which services were running on which ARPANet hosts, 33 at that time. So RFC 503 contained a list of every server process running on what would later become the internet.
Skipping RFC 604, RFC 739 from 01977 https://www.rfc-editor.org/rfc/rfc739.html is the first one that shows the modern port number assignments (still called "socket numbers") for FTP and Telnet, though those presumably dated back a couple of years at that point:
    Specific Assignments:

    Decimal  Octal  Description                      References
    -------  -----  -----------                      ----------
             Network Standard Functions
    1        1      Old Telnet                       [6]
    3        3      Old File Transfer                [7,8,9]
    5        5      Remote Job Entry                 [10]
    7        7      Echo                             [11]
    9        11     Discard                          [12]
    11       13     Who is on or SYSTAT
    13       15     Date and Time
    15       17     Who is up or NETSTAT
    17       21     Short Text Message
    19       23     Character generator or TTYTST    [13]
    21       25     New File Transfer                [1,14,15]
    23       27     New Telnet                       [1,16,17]
    25       31     Distributed Programming System   [18,19]
    27       33     NSW User System w/COMPASS FE     [20]
    29       35     MSG-3 ICP                        [21]
    31       37     MSG-3 Authentication             [21]
Etc. This time I have truncated the list. It also has Finger on port 79.
You say, "My understanding is that DNS can potentially provide port numbers, but this is not widely used or supported." DNS SRV records have existed since 01996 (proposed by Troll Tech and Paul Vixie in RFC 2052 https://www.rfc-editor.org/rfc/rfc2052), but they're really only widely used in XMPP, in SIP, and in ZeroConf, which was Apple's attempt to provide the facilities of AppleTalk on top of TCP/IP.
> The Stream Control Transmission Protocol (SCTP) is a computer networking communications protocol in the transport layer of the Internet protocol suite. Originally intended for Signaling System 7 (SS7) message transport in telecommunication, the protocol provides the message-oriented feature of the User Datagram Protocol (UDP) while ensuring reliable, in-sequence transport of messages with congestion control like the Transmission Control Protocol (TCP). Unlike UDP and TCP, the protocol supports multihoming and redundant paths to increase resilience and reliability.
[…]
> SCTP may be characterized as message-oriented, meaning it transports a sequence of messages (each being a group of bytes), rather than transporting an unbroken stream of bytes as in TCP. As in UDP, in SCTP a sender sends a message in one operation, and that exact message is passed to the receiving application process in one operation. In contrast, TCP is a stream-oriented protocol, transporting streams of bytes reliably and in order. However TCP does not allow the receiver to know how many times the sender application called on the TCP transport passing it groups of bytes to be sent out. At the sender, TCP simply appends more bytes to a queue of bytes waiting to go out over the network, rather than having to keep a queue of individual separate outbound messages which must be preserved as such.
> The term multi-streaming refers to the capability of SCTP to transmit several independent streams of chunks in parallel, for example transmitting web page images simultaneously with the web page text. In essence, it involves bundling several connections into a single SCTP association, operating on messages (or chunks) rather than bytes.
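(On Linux with the SCTP kernel module loaded, the socket API exposes this directly. A sketch, with a hypothetical peer address, showing only the preserved message boundaries rather than multi-streaming:)

    import socket

    # SOCK_SEQPACKET + IPPROTO_SCTP: reliable and congestion-controlled
    # like TCP, but message-oriented -- each send() below arrives at the
    # peer as exactly one message, unlike a TCP byte stream.
    s = socket.socket(socket.AF_INET, socket.SOCK_SEQPACKET, socket.IPPROTO_SCTP)
    s.connect(("192.0.2.1", 9999))   # example host running an SCTP service
    s.send(b"one message")
    s.send(b"another message")       # boundary between the two is preserved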
Wait, can you actually just use IP? Can I just make up a packet and send it to a host across the Internet? I'd think that all the intermediate routers would want to have an opinion about my packet, caring, at the very least, that it's either TCP or UDP.
Core routers don't inspect that field; NAT/ISP boxes can. I believe that with two suitably dedicated Linux servers it is very possible to send and receive a single custom IP packet between them, even using 253 or 254 (= "Use for experimentation and testing", RFC 3692) as the protocol number.
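(A sketch of exactly that with raw sockets on Linux, using the RFC 3692 experimental protocol number; needs root, and both addresses are examples. Without IP_HDRINCL the kernel builds the IP header for you:)

    import socket

    PROTO_EXP = 253   # "use for experimentation and testing" (RFC 3692)

    # Sender: the kernel wraps our payload in an IP header, protocol 253.
    tx = socket.socket(socket.AF_INET, socket.SOCK_RAW, PROTO_EXP)
    tx.sendto(b"hello, protocol 253", ("198.51.100.7", 0))  # port is meaningless

    # Receiver (run on the destination host): gets the whole IP packet.
    rx = socket.socket(socket.AF_INET, socket.SOCK_RAW, PROTO_EXP)
    packet, src = rx.recvfrom(65535)
    ihl = (packet[0] & 0x0F) * 4     # IP header length, in bytes
    payload = packet[ihl:]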
We're about half-way to exhausted, but a huge chunk of the ones assigned are long deprecated and/or proprietary technologies and could conceivably be reassigned. Assignment now is obviously a lot more conservative than it was in the 1980s.
There is sometimes drama with it, though. A while back, the OpenBSD guys created CARP as a fully open source router failover protocol, but couldn't get an official IP protocol number and ended up using the same one as VRRP. There's also a lot of historical animosity that some companies got numbers for proprietary protocols (eg Cisco got one for its then-proprietary EIGRP).
Probably by use of some type of options. Up to 320 bits, so I think there is a reasonable amount of space there for a good while. Of course, this makes for really messy processing, but with current hardware it's not impossible.
It uses them a little differently -- in IPv4, there is one protocol per packet, while in IPv6, "protocols" can be chained in a mechanism called extension headers -- but this actually makes the problem of number exhaustion more acute.
What if extension headers made it better? We could come up with a protocol consisting solely of a larger Next Header field and chain this pseudo header with the actual payload whenever the protocol number is > 255. The same idea could also be used in IPv4.
I didn't mean to imply otherwise. But, as you say, this is equally applicable to IPv4 and IPv6. There were a lot of issues solved by IPv6, but "have even more room for non-TCP/UDP transports" wasn't one of them (and didn't need to be, tbqh).
This is an interesting list; it makes you appreciate just how many obscure protocols have died out in practice. Evolution in networks seems to mimic evolution in nature quite well.
> caring, at the very least, that it's either TCP or UDP.
You left out ICMP, my favourite! (And a lot more important in IPv6 than in v4.)
Another pretty well known protocol that is neither TCP nor UDP is IPsec. (Which is really two new IP protocols.) People really did design proper IP protocols still in the 90s.
> Can I just make up a packet and send it to a host across the Internet?
You should be able to. But if you are on a corporate network with a really strict firewalling router that only forwards traffic it likes, then likely not. There are also really crappy home routers which give similar problems from the other end of enterpriseness.
NAT also destroyed much of the end-to-end principle. If you don't have a real IP address and rely on a NAT router to forward your data, it needs to be in a protocol the router recognizes.
Anyway, for the past two decades people have grown tired of that and just pile hacks on top of TCP or UDP instead. That's sad. Or who am I kidding? Really it's on top of HTTP. HTTP will likely live on long past anything IP.
There is little point in inventing new protocols, given how low the overhead of UDP is. That's just 8 bytes per packet, and it enables going through NAT. Why come up with a new transport layer protocol, when you can just use UDP framing?
Agreed. Building a custom protocol seems “hard” to many folks who are doing it without any fear on top of HTTP. The wild shenanigans I’ve seen with headers, query params and JSON make me laugh a little. Everything as text is _actually_ hard.
A part of the problem with UDP is the lack of good platforms and tooling. Examples as well. I’m trying to help with that, but it’s an uphill battle for sure.
> NAT also destroyed much of the end-to-end principle. If you don't have a real IP address and relies on a NAT router to forward your data, it needs to be in a protocol the router recognizes.
Not necessarily. Many protocols can survive being NATed if they don't carry IP/port related information inside their payload. FTP is a famous counterexample - it uses a control channel (TCP21) which contains commands to open data channels (TCP20), and those commands specify IP:port pairs, so, depending on the protocol, a NAT router has to rewrite them and/or open ports dynamically and/or create NAT entries on the fly. A lot of other stuff has no need for that and will happily go through without any rewriting.
I think we agree. Of course a NAT router with an application proxy such as FTP or SIP can relay and rewrite traffic as needed.
TCP and UDP have port numbers that the NAT software can extract and keep state tables for, so we can send the return traffic to its intended destination.
For unknown IP protocols that is not possible. It may at best act like a network diode, which is one way of violating the end-to-end principle.
Actually the observation about ports being mostly a TCP/UDP feature is a very good point I had failed to consider. This would indeed greatly limit the ability of a NAT gateway - it could keep just a state table of IP src/dst pairs and just direct traffic back to its source, but it's indeed very crude. Thanks for bringing it up!
Of course NAT allows application layer protocols layered on TCP or UDP to pass through without the NAT understanding the application layer – otherwise, NATted networks would be entirely broken.
The end-to-end principle at the IP layer (i.e. having the IP forwarding layer be agnostic to the transport layer protocols above it) is still violated.
Have to say that I don't encounter any problems pinging hosts in AWS.
If any host is firewalling out ICMP then it won't be pingable but that does not depend on the hosting provider. AWS is no better or worse than any other in that regard, IME.
If there's no form of NAT or transport layer processing along your path between endpoints, you shouldn't have an issue. But NAT and transport and application layer load balancing are very common on the net these days, so YMMV.
> I'd think that all the intermediate routers would want to have an opinion about my packet, caring, at the very least, that it's either TCP or UDP.
They absolutely don't. Routers are layer 3 devices; TCP & UDP are layer 4. The only impact is that the ECMP flow hashes will have less entropy, but that's purely an optimization thing.
Note TCP, UDP and ICMP are nowhere near all the protocols you'll commonly see on the internet — at minimum, SCTP, GRE, L2TP and ESP are reasonably widespread (even a tiny fraction of traffic is still a giant number considering internet scales).
You can send whatever protocol number with whatever contents your heart desires. Whether the other end will do anything useful with it is another question.
> They absolutely don't. Routers are layer 3 devices;
Idealized routers are, yes.
Actual IP paths these days usually involve at least one NAT, and these will absolutely throw away anything other than TCP, UDP, and if you're lucky ICMP.
See nearby comment about terminology. Either we're discussing odd IP protocols, then the devices you're describing aren't just "routers" (and particularly what you're describing is not part of a "router"), or we're not discussing IP protocols, then we're not having this thread.
And note the GP talked about "intermediate routers". That's the ones in a telco service site or datacenter by my book.
As far as I'm aware, sure you can. TCP packets and UDP datagrams are wrapped in IP datagrams, and it's the job of an IP network to ship your data from point A (sender) to point B (receiver). Nodes along the way might do so-called "deep packet inspection" to snoop on the payload of your IP datagrams (for various reasons, not all nefarious), but they don't need to do that to do the basic job of routing. From a semantic standpoint, the information in the TCP and UDP headers (as part of the IP payload) is only there to govern interactions between the two endpoint parties. (For instance, the "port" of a TCP or UDP packet is a node-local identifier for one of many services that might exist at the IP address the packet was routed to, allowing many services to coexist at the same node.)
Hmm, I thought intermediate routers use the TCP packet's bits for congestion control, no? Though I guess they can probably just use the destination IP for that.
Most intermediate routers don't care much. Lookup the destination IP in the routing table, forward to the next hop, no time for anything else.
Classic congestion control is done on the sender alone. The router's job is simply to drop packets when the queue is too large.
Maybe the router supports ECN, so if there's a queue going to the next hop, it will look for protocol specific ECN headers to manipulate.
Some network elements do more than the usual routing work. A traffic shaper might have per-user queues with outbound bandwidth limits. A network accelerator may effectively reterminate TCP in hopes of increasing acheivable bandwidth.
Often, the router has an aggregated connection to the next hop, so it'll use a hash on the addresses in the packet to choose which of the underlying connections to use. That hash could be based on many things, but it's not uncommon to use tcp or udp port numbers if available. This can also be used to choose between equally scored next hops, and that's why you often see several different paths during a traceroute. Using port numbers is helpful to balance connections from IP A to IP B over multiple links. If you use an unknown protocol, even if it is multiplexed into ports or similar (like tcp and udp), the different streams will likely always hash onto the same link, so you won't be able to exceed the bandwidth of a single link, and a damaged or congested link will affect all or none of your connections.
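(The idea in miniature, as a Python sketch; real routers do this in hardware with their own hash functions, and all addresses here are examples. With ports in the key, each TCP/UDP flow can land on a different member link; an unknown protocol collapses everything between two hosts onto one link:)

    import zlib

    def ecmp_link(src_ip, dst_ip, proto, sport, dport, n_links):
        # Hash the 5-tuple (or effectively the 3-tuple, if the router
        # can't read ports out of an unknown protocol) onto one link.
        key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
        return zlib.crc32(key) % n_links

    # Two TCP connections between the same hosts can use different links:
    print(ecmp_link("10.0.0.1", "10.0.0.2", 6, 40001, 443, 4))
    print(ecmp_link("10.0.0.1", "10.0.0.2", 6, 40002, 443, 4))
    # An unknown protocol has no readable ports, so every stream between
    # these two hosts hashes onto the same link:
    print(ecmp_link("10.0.0.1", "10.0.0.2", 253, 0, 0, 4))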
They probably can do deep/shallow packet inspection for that purpose (being one of the non-nefarious applications I alluded to), but that's not to say their correct functioning relies on it. Those routers also need to support at least UDP, and UDP provides almost no extra information at that level -- just the source and destination ports (so, perhaps QoS prioritization) and the inner payload's length and checksum (so, perhaps dropping bad packets quickly).
If middleware decides to do packet inspection, it better make sure that any behavioral differences (relative to not doing any inspection) is strictly an optimization and does not impact the correctness of the link.
Also, although I'm not a network operator by any stretch, my understanding is that TCP congestion control is primarily a function of the endpoints of the TCP link, not the IP routers along the way. As Wikipedia explains [0]:
> Per the end-to-end principle, congestion control is largely a function of internet hosts, not the network itself.
Yep it's full of IP protocols other than the well-known TCP, UDP and ICMP (and, if you ever had the displeasure of learning IPSEC, its AH and ESP).
A bunch of multicast stuff (IGMP, PIM)
A few routing protocols (OSPF, but notably not BGP which just uses TCP, and (usually) not MPLS which just goes over the wire - it sits at the same layer as IP and not above it)
A few VPN/encapsulation solutions like GRE, IP-in-IP, L2TP and probably others I can't remember
The reason you wouldn't do that is IP doesn't give you a mechanism to share an IP address with multiple processes on a host, it just gets your packets to a particular host.
As soon as you start thinking about having multiple services on a host you end up with the idea of having a service id or "port"
UDP or UDP Lite gives you exactly that at the cost of 8 bytes, so there's no real value in not just putting everything on top of UDP
They shouldn't; the whole point is that the IP header is enough to route packets between endpoints, and only the endpoints should care about any higher layer protocols. But unfortunately some routers do, and if you have NAT then the NAT device needs to examine the TCP or UDP header to know how to forward those packets.
Yeah, this is basically what I was wondering, why QUIC used UDP instead of their own protocol if it's so straightforward. It seems like the answer may be "it's not as interference-free as they'd like it".
UDP pretty much just tacks a source/destination port pair onto every IP datagram, so its primary function is to allow multiple independent UDP peers to coexist on the same IP host. (That is, UDP just multiplexes an IP link.) UDP as a protocol doesn't add any additional network guarantees or services on top of IP.
QUIC is still "their own protocol", just implemented as another protocol nested inside a UDP envelope, the same way that HTTP is another protocol typically nested inside a TCP connection. It makes some sense that they'd piggyback on UDP, since (1) it doesn't require an additional IP protocol header code to be assigned by IANA, (2) QUIC definitely wants to coexist with other services on any given node, and (3) it allows whatever middleware analyses that exist for UDP to apply naturally to QUIC applications.
(Regarding (3) specifically, I imagine NAT in particular requires cooperation from residential gateways, including awareness of both the IP and the TCP/UDP port. Allowing a well-known outer UDP header to surface port information, instead of re-implementing ports somewhere in the QUIC header, means all existing NAT implementations should work unchanged for QUIC.)
Yeah, so... You can do it. But only for some values of you. In a NAT world, the NAT needs to understand the protocol so that it can adjust the core multiplexing in order to adjust addresses. A best effort NAT could let one internal IP at a time connect to each external IP on an unknown protocol, but that wouldn't work for QUIC: Google expects multiple clients behind a NAT to connect to its service IPs. It often works for IP tunneling protocols, where at most one connection to an external IP isn't super restrictive. But even then, many NATs won't pass unknown IP protocols at all.
Most firewalls will drop unknown IP protocols. Many will drop a lot of TCP; some drop almost all UDP. This is why so much stuff runs over tcp ports 80 and 443; it's almost always open. QUIC/HTTP/3 encourages opening of udp/443, so it's a good port to run unrelated things over too.
Also, SCTP had similar goals to QUIC and never got much deployment or support in OSes, NATs, firewalls, etc. It's a clear win to just use UDP and get something that will just work on a large portion of networks.
It's effectively impossible to use anything other than TCP or UDP these days.
Some people here will argue that it actually really is, and that everybody experiencing issues is just on a really weird connection or using broken hardware, but those weird connections and bad hardware make up the overwhelming majority of Internet connections these days.
Using UDP means QUIC support is as "easy" as adding it to the browser and server software. To add it as a separate protocol would have involved every OS needing to add support for it into its networking stack, and that would have taken ages and involved more politics. The main reason QUIC was created was so that Google could more effectively push ads and add tracking, remember. The incentives were not there for others to implement it.
When it comes to QUIC, QUIC works best with unstable end-user internet (designed for http3 for the mobile age). Most end-user internet access is behind various layers of CGNAT. The way that NAT works is by using your port numbers to increase the address space. If you have 2^32 IPv4 addresses, you have 2^48 IPv4 address+port pairs. All these NAT middleboxes speak TCP and UDP only.
Additionally, firewalls are also designed to filter out any weird packets. If the packet doesn't look like you wanted to receive it, it's dropped. It usually does this by tracking open ports just like NAT, therefore many firewalls also don't trust custom protocols.
You can call the things mangling IP addresses and TCP/UDP ports what you want, but that will unfortunately not make them go away and stop throwing away non-TCP/UDP traffic.
We're discussing nonstandard IP protocols. In that context, your home router is a CPE, and not described by the term "router" without further qualifiers, because that's the level the discussion is at. I'm happy to call it a router when talking to the neighbors, when I'm not discussing IP protocols with them.
There are many routers that don't care at all about what's going through them. But there aren't any firewalls that don't route anymore (not even at the endpoints).
That would be IP over some lower level physical layer, not some custom content stuffed into an IP packet :)
(It's absolutely worth reading some of those old April Fools' RFCs, by the way [0]. I'm a big fan of RFC 2324, which introduced HTTP response code 418 "I'm a teapot".)
TCP being the “default” meant it was chosen when the need for ordering and uniform reliability wasn’t there. That was fine but left systems working less well than they could have with more carefully chosen underpinnings. With HTTP/3 gaining traction, and HTTP being the “next level up default choice” things potentially get better. The issue I see is that QUIC is far more complex, and the new power is fantastic for a few but irrelevant to most.
UDP has its place as well, and if we have more simple and effective solutions like WireGuard’s handshake and encryption on top of it we’d be better off as an industry.
The congestion control algorithm in TCP has some interesting effects on throughput that a lot of developers aren’t aware of.
For example, sending some data on a fresh TCP connection is slow, and the “ramp up time” to the bandwidth of the network is almost entirely determined by the latency.
Amazing speed ups can be achieved in a data centre network by shaving microseconds off the round trip time!
Similarly, many (all?) TCP stacks count segments, not bytes, when determining this ramp up rate. This means that jumbo frames can provide 6x the bandwidth during this period!
If you read about the network design of AWS, they put a lot of effort into low switching latency and enabling jumbo frames.
The real pros do this kind of network tuning, everyone else wonders why they don’t get anywhere near 10 Gbps through a 10 Gbps link.
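A back-of-the-envelope sketch of why (a simplified model: the window doubles each RTT from an initial 10 segments, ignoring delayed ACKs and pacing; the MSS values are typical, not universal):

    import math

    def ramp_time(target_bps, rtt_s, mss=1448, initcwnd=10):
        """Rough time for slow start to reach a target rate."""
        needed = target_bps / 8 * rtt_s / mss        # segments in flight needed
        if needed <= initcwnd:
            return 0.0
        return math.log2(needed / initcwnd) * rtt_s  # one doubling per RTT

    # 10 Gbps: shaving RTT helps twice -- fewer segments in flight are
    # needed AND each doubling round completes sooner.
    print(ramp_time(10e9, 1e-3))            # ~6.4 ms at 1 ms RTT
    print(ramp_time(10e9, 100e-6))          # ~0.31 ms at 100 us RTT (20x better)
    # Jumbo frames cut the segment count ~6x during this period:
    print(ramp_time(10e9, 1e-3, mss=8948))  # ~3.8 ms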
No. TCP likes zero packet loss (connected), and it understands 100% packet loss (disconnected). Its weakness is scenarios (semiconnected) in which packet loss is constantly fluctuating between substantial and nearly-total. It doesn't know what is going on, and it may cope or it may not, because its designers did not envision a future in which most networks have a semiconnected last mile; but that is where we are. Without things like forward error correction, TCP would be nearly useless over wireless. It is interesting to envision a layer-4 protocol that would incorporate FEC-like capabilities.
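The simplest flavor of such a scheme, sketched below: one XOR parity packet per group of k data packets lets the receiver rebuild any single lost packet without waiting for a retransmission (real FEC schemes such as Reed-Solomon or RaptorQ tolerate more loss per group):

    def parity(packets):
        """XOR parity over a group of equal-length packets."""
        out = bytearray(len(packets[0]))
        for p in packets:
            for i, b in enumerate(p):
                out[i] ^= b
        return bytes(out)

    def recover_one(received, parity_pkt, k):
        """received: {index: packet} with at most one of 0..k-1 missing."""
        missing = [i for i in range(k) if i not in received]
        if len(missing) == 1:
            buf = bytearray(parity_pkt)
            for p in received.values():
                for i, b in enumerate(p):
                    buf[i] ^= b              # XOR of survivors + parity
            received[missing[0]] = bytes(buf)
        return received

    group = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
    p = parity(group)
    got = {0: b"pkt0", 1: b"pkt1", 3: b"pkt3"}   # packet 2 was lost
    assert recover_one(got, p, 4)[2] == b"pkt2"  # rebuilt, no timeout needed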
if you went back to 1981 and said 'yeah, this is great. but what we really want to do is not have an internet, but kind of a piecewise internet. instead of a global address we'll use addresses that have a narrower scope. and naturally as consequence of this we'll need to start distinguishing between nodes that everyone can reach, service nodes, and nodes that no one can reach - client nodes. and as a consequence of this we'll start building links that are asymmetric in bandwidth, since one direction is only used for requests and acks and not any data volume.'
they would have looked at you and asked straight out what you hoped to gain by making these things distinguished, because it certainly complicates things.
It's trivial to develop your own protocols on top of IP.
It was trivial even 15 years ago in Python (without any libraries), just handcrafted packets (ARP, IP, etc).
> The internet is incredible. It’s nearly impossible to keep people away from.
Well ... he seems very motivated. I am more skeptical.
For instance, Google via chrome controls a lot of the internet, even more so via its search engine, AI, youtube and so forth.
Even aside from this people's habits changed. In the 1990s everyone and their Grandma had a website. Nowadays ... it is a bit different. We suddenly have horrible blogging sites such as medium.com, pestering people with popups. Of course we also had popups in the 1990s, but the diversity was simply higher. Everything today is much more streamlined it seems. And top-down controlled. Look at Twitter, owned by a greedy and selfish billionaire. And the US president? Super-selfish too. We lost something here in the last some 25 years.
You're talking about the web, which is merely an app with the internet as its platform. We can scrap it and still use the internet to build a different one.
It’s worth considering how the tiny computers of the era forced a simple clean design. IPv6 was designed starting in the early 90s and they couldn’t resist loading it up with extensions, though the core protocol remains fine and is just IP with more bits. (Many of the extensions are rarely if ever used.)
If the net were designed today it would be some complicated monstrosity where every packet was reminiscent of X.509 in terms of arcane complexity. It might even have JSON in it. It would be incredibly high overhead and we’d see tons of articles about how someone made it fast by leveraging CPU vector instructions or a GPU to parse it.
This is called Eroom’s law, or Moore’s law backwards, and it is very real. Bigger machines let programmers and designers loose to indulge their desire to make things complicated.
IPsec was a big one that's now borderline obsolete, though it is still used for VPNs and was backported to IPv4.
Many networking folks, including myself, consider IPv6 router advertisements and SLAAC to be inferior in practice to DHCPv6, and think it would have been better to leave IP assignment out of the spec, as it was in v4. Right now we have a mess where a lot of nets prefer or require DHCPv6, but some vendors, like apparently Android, refuse to support it.
The rules about how V6 addresses are chopped up and assigned are wasteful and dumb. The entire V4 space could have been mapped onto /32 and an encapsulation protocol made to allow V4 to carry V6, providing a seamless upgrade path that does not require full upgrade of the whole core, but that would have been too logical. Every machine should get like a /96 so it can use 32 bits of space to address apps, VMs, containers, etc. As it stands we waste 64 bits of the space to make SLAAC possible, as near as I can tell. The SLAAC tail must have wagged the dog in that people thought this feature was cool enough to waste 8 bytes per packet.
The V6 header allows extension bits that are never used and blocked by most firewalls. There’s really no point in them existing since middle boxes effectively freeze the base protocol in stone.
Those are some of the big ones.
Basically all they should have done was make IPs 64 or 128 bits and left everything else alone. But I think there was a committee.
As it stands we have what we have and we should just treat V6 as IP128 and ignore the rest. I’m still in favor of the upgrade. V4 is too small, full stop. If we don’t enlarge the addresses we will completely lose end to end connectivity as a supported feature of the network.
> Every machine should get like a /96 so it can use 32 bits of space to address apps, VMs, containers, etc.
You can just SLAAC some more addresses for whatever you want. Although hopefully you don't use more than the ~ARP~ NDP table size on your router; then things get nasty. This should be trivial for VMs, and could be made possible for containers and apps.
> The V6 header allows extension bits that are never used and blocked by most firewalls. [...] Basically all they should have done was make IPs 64 or 128 bits and left everything else alone.
This feels contradictory... IPv4 also had extension headers that were mostly unused and disallowed. V6 changed the header extension mechanism, but offers the same opportunities to try things that might work on one network but probably won't work everywhere.
I can easily spot it's an AI written article, because it actually explains the technology in understandable human language. A human would have written it the way it was either presented to them in university or in bloated IT books: absolutely useless.
I can easily spot it's an AI written comment, because it actually explains their idea in understandable human language and brings nothing to the discussion. A human would have written it the way they understand it and bring their opinions along: absolutely useless.
At first I wanted to give this the benefit of the doubt as sarcasm, but a skim through the comment history suggests it's just a committed anti-AI agenda.
Personally, I found the tone of the article quite genuine, and the video at the end made a compelling case for it. Though I do wonder whether you commented having actually read it.
Edit: I can't downvote but if I could it probably would have been better than this comment!
But you have to include giant libraries, and the kernel can't see the traffic to better manage timing, etc.
There are some reasons CDNs might be helpful. Edge capability probably is the most important one. Another reason is that serving lots of static data might be a complicated task for a small website, so it makes sense to offload it to a specialised service. These days, CDNs went beyond static data. They can hide your backend, so public user won't know its address and can't DDoS it. They can handle TLS for you. They can filter bots, tor and people from countries you don't like. All in a few clicks in the dashboard, no need to implement complicated solutions.
But nothing you couldn't write yourself in a few days, really.
I'll add that at the time of TCP's design, the telephone people far outnumbered everyone else in the packet switching vs circuit switching debate. TCP gives you a virtual circuit over a packet-switched network, as a pair of reliable-enough independent byte streams over IP. The idea that the endpoints could implement reliability through retransmission came from an earlier French network, Cyclades, and ends up being a core principle of IP networks.
Circuit-switching packets means carrying metadata about the circuit rather than simply using the destination MAC or IP address to figure out routing along the way. ATM took this to an extreme, with nearly 10% protocol overhead (48 bytes of payload in a 53-byte cell) and some 22 bytes of wasted space in the last ATM cell of a 1500-byte Ethernet packet. That inefficiency is what really hurt. Sadly, the ATM legacy lives on in GPON and XGSPON; as a result they require gobs of memory per port for frame reassembly (128 ONUs x 8 priorities x 9 KB for jumbo frames = 9 MB per port, worst case), whereas EPON / 10GEPON, which are far better protocols, do not.
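For concreteness, here's the cell-tax arithmetic as a quick sketch. The exact padding depends on the encapsulation (the 8-byte AAL5 trailer is assumed here; LLC/SNAP would shift it), which is why different people quote slightly different waste numbers:

```python
# ATM "cell tax": a packet is chopped into 53-byte cells carrying 48 bytes
# of payload each; the last cell is padded out to a full cell.
import math

def atm_cells(payload_bytes, trailer=8):        # 8-byte AAL5 trailer assumed
    total = payload_bytes + trailer
    cells = math.ceil(total / 48)
    on_wire = cells * 53
    padding = cells * 48 - total
    return cells, on_wire, padding

cells, wire, pad = atm_cells(1500)
print(f"{cells} cells, {wire} bytes on the wire for 1500 bytes of payload,")
print(f"{pad} bytes of padding in the last cell "
      f"({100 * (wire - 1500) / wire:.1f}% total overhead)")
```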
MPLS also has certain issues that the IPv6 next-header feature solves: it avoids having to push/pop headers in the transport network, which changes the size of the packet, has implications for buffering and the associated QoS, and makes the hardware more complex. MPLS labels made sense at their introduction in the early 2000s, when transport hardware could use a small table to look up the next hop of a frame instead of doing a full route lookup. Those early hardware constraints, which demanded small SRAMs, have effectively gone away: modern ASICs have billions of transistors, making on-chip route tables sufficient for many use cases.
What about QoS? Jitter, bandwidth, latency, fairness guarantees? What about queuing delay? What about multiplexing and tunneling? Traffic shaping and engineering? What about long-haul performance? Easy integration with optical circuit networks? etc. ATM addressed these issues, but TCP/IP did not.
All of these things showed up again once you tried to do VOIP and video conferencing, and in core ISPs as well as access networks, and they weren't (and in many cases still aren't) easy to solve.
Also MPLS is basically a virtual circuit network.
And why are they trying to persuade regulators to let them get rid of the remaining (peripheral) part of the old circuit-switched network, i.e., to phase out old-school telephone hardware, requiring all customers to have IP phone hardware?
Modern circuit switched networks use optics rather than the legacy copper circuits which date back to telegraphy.
It is very easy for a malicious actor anywhere in the network to inject data into a connection. By contrast, it is much harder for a malicious actor to break the legitimate traffic flow ... except for the fact that TCP RST grants any rando the power to upgrade "inject" to "break". This is quite common in the wild for any traffic that does not look like HTTP, even when both endpoints are perfectly healthy.
Blocking TCP RST packets with your firewall will significantly improve reliability, but this still does not protect you from more advanced attackers, who can cause a desynchronization with forged sequence numbers carrying a nonempty payload.
As a result, it is mandatory for every application to support a full-blown "resume on a separate connection" operation, which is complicated and hairy and also immediately runs into the additional flaw that TCP is very slow to start.
---
While not an outright flaw, I also think it has become clear by now that it is highly suboptimal for "address" and "port" to be separate notions.
Other applications work just fine with a single TCP connection.
If I am using TCP for DNS, for example, and I am retrieving data from a single host such as a DNS cache, I can send multiple queries over a single TCP connection and receive multiple responses over that same single connection, out of order. No blocking.^1 If the cache (application) supports it, this is much faster than receiving answers sequentially, and it's more efficient and polite than opening multiple TCP connections.
1. I do this every day outside the browser with DNS over TLS (DoT), using something like streamtcp from NLnet Labs. I'm not sure that QUIC is faster, and server support for QUIC is much more limited, but QUIC may have other advantages.
I also do it with DNS over HTTPS (DoH), outside the browser, using HTTP/1.1 pipelining, but there I receive answers sequentially. I'm still not convinced that HTTP/2 is faster for this particular use case, i.e., downloading data from a single host using multiple HTTP requests (as opposed to something like integrating online advertising into websites).
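A minimal sketch of that pattern over plain DNS-over-TCP (stdlib only, no TLS; 8.8.8.8 is just an example resolver, and DoT would wrap the same framing in TLS on port 853): several queries are written back-to-back on one connection, and each response is matched by its 16-bit message ID, so answers can be consumed in whatever order the server sends them.

```python
# Pipeline several DNS queries over one TCP connection (RFC 7766 framing:
# each message is prefixed by a 2-byte length). Responses are matched by
# message ID, so out-of-order replies are fine.
import socket, struct

def build_query(qid, name):
    header = struct.pack(">HHHHHH", qid, 0x0100, 1, 0, 0, 0)  # RD=1, 1 question
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)          # QTYPE=A, QCLASS=IN

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("connection closed")
        buf += chunk
    return buf

names = {1: "example.com", 2: "example.org", 3: "example.net"}
with socket.create_connection(("8.8.8.8", 53)) as s:
    for qid, name in names.items():                 # write all queries at once
        msg = build_query(qid, name)
        s.sendall(struct.pack(">H", len(msg)) + msg)
    for _ in names:                                 # read replies in any order
        (length,) = struct.unpack(">H", recv_exact(s, 2))
        reply = recv_exact(s, length)
        (qid,) = struct.unpack(">H", reply[:2])
        print(f"got answer for query {qid} ({names[qid]}): {length} bytes")
```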
This is because DoT allows the DNS server to resolve queries concurrently and send query responses out of order.
However, this is an application layer feature, not a transport layer one. The underlying TCP packets still have to arrive in order and therefore are subject to blocking.
You're missing the point. You have one TCP connection, and the server sends you response1 and then response2. Now if response1 gets lost or delayed due to network conditions, you must wait for response1 to be retransmitted before you can read response2. That is head-of-line blocking, no way around it. It has nothing to do with advertising(?), and the other protocols mentioned don't have this drawback.
It’s a trade off, and there’s a surprising amount of application code involved on the receiving side in the application waiting for state to be updated on both channels. I definitely prefer it, but it’s not without its tradeoffs.
A stream of bytes made sense in the 1970s for remote terminal emulation. It still sort of makes sense for email, where a partial message is useful (though downloading headers in bulk followed by full message on demand probably makes more sense.)
But in 2025 much of communication involves messages that aren't useful if you only get part of them. It's also a pain to have to serialize messages into a byte stream and then deserialize the byte stream into messages (see: gRPC etc.) and the byte stream ordering is costly, doesn't work well with multipathing, and doesn't provide much benefit if you are only delivering complete messages.
TCP without congestion control isn't particularly useful. As you note, traditional TCP congestion control doesn't respond well to reordering. Also, TCP's congestion control traditionally doesn't distinguish between intentional packet drops (e.g. due to buffer overflow) and packet loss (e.g. due to corruption). This means, for example, that it can't be used directly over networks with lossy wireless links (which is why Wi-Fi has its own link-layer retransmission).
TCP's traditional congestion control is designed to fill buffers up until packets are dropped, leading to undesirable buffer bloat issues.
TCP's traditional congestion control algorithms (additive increase/multiplicative decrease on drop) also have the poor property that your data rate tends to drop as RTT increases.
TCP wasn't designed for hardware offload, which can lead to software bottlenecks and/or increased complexity when you do try to offload it to hardware.
TCP's three-way handshake is costly for one-shot RPCs, and slow start means that short flows may never make it out of slow start, neutralizing benefits from high-speed networks.
TCP is also poor for mobility. A connection breaks when your IP address changes, and there is no easy way to migrate it. Most TCP APIs expose IP addresses at the application layer, which causes additional brittleness.
Additionally, TCP is poorly suited for optical/WDM networks, which support dedicated bandwidth (signal/channel bandwidth as well as data rate), and are becoming more important in datacenters and as interconnects for GPU clusters.
etc.
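The RTT-unfairness point above is easy to see in a toy AIMD model (a sketch, not any real stack's algorithm; the per-round loss probability and parameters are illustrative): the window grows by one segment per RTT, so for the same loss rate a longer-RTT flow settles at a much lower wall-clock throughput.

```python
# Toy AIMD: +1 segment per RTT, halve on loss. Same random loss rate,
# different RTTs -> the long-RTT flow averages far less throughput.
import random

def aimd_avg_rate(rtt_s, loss_prob, mss=1500, rounds=50_000, seed=1):
    rng = random.Random(seed)
    cwnd, total_segments = 1.0, 0.0
    for _ in range(rounds):
        total_segments += cwnd
        if rng.random() < loss_prob * cwnd:   # chance of a loss this round trip
            cwnd = max(cwnd / 2, 1.0)         # multiplicative decrease
        else:
            cwnd += 1.0                       # additive increase
    return total_segments * mss * 8 / (rounds * rtt_s)   # bits per second

for rtt_ms in (10, 40, 160):
    rate = aimd_avg_rate(rtt_ms / 1000, loss_prob=1e-4)
    print(f"RTT {rtt_ms:>3} ms -> ~{rate / 1e6:.1f} Mbit/s")
```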
> poor handling of missing packets
so it was poor at the exact thing it was designed for?
Consider how much network speeds (and bandwidth-delay products) have grown compared to when TCP was invented.
When I started at university, the FTP speed from the US during daytime was 500 bytes per second! You don't have many unacknowledged packets in such a connection.
Back then even a 1 megabits/sec connection was super high speed and very expensive.
• Full-duplex connections are probably a good idea, but certainly are not the only way, or the most obvious way, to create a reliable stream of data on top of an unreliable datagram layer. TCP's predecessor NCP was half-duplex.
• TCP itself also supports a half-duplex mode—even if one end sends FIN, the other end can keep transmitting as long as it wants. This was probably also a good idea, but it's certainly not the only obvious choice.
• Sequence numbers on messages or on bytes?
• Wouldn't it be useful to expose message boundaries to applications, the way 9P, SCTP, and some SNA protocols do?
• If you expose message boundaries to applications, maybe you'd also want to include a message type field? Protocol-level message-type fields have been found to be very useful in Ethernet and IP, and in a sense the port-number field in UDP is also a message-type field.
• Do you really need urgent data?
• Do servers need different port numbers? TCPMUX is a straightforward way of giving your servers port names, like in CHAOSNET, instead of port numbers. It only creates extra overhead at connection-opening time, assuming you have the moral equivalent of file descriptor passing on your OS. The only limitation is that you have to use different client ports for multiple simultaneous connections to the same server host. But in TCP everyone uses different client ports for different connections anyway. TCPMUX itself incurs an extra round-trip time delay for connection establishment, because the requested server name can't be transmitted until the client's ACK packet, but if you incorporated it into TCP, you'd put the server name in the SYN packet. If you eliminate the server port number in every TCP header, you can expand the client port number to 24 or even 32 bits.
• Alternatively, maybe network addresses should be assigned to server processes, as in Appletalk (or IP-based virtual hosting before HTTP/1.1's Host: header, or, for TLS, before SNI became widespread), rather than assigning network addresses to hosts and requiring port numbers or TCPMUX to distinguish multiple servers on the same host?
• Probably SACK was actually a good idea and should have always been the default? SACK gets a lot easier if you ack message numbers instead of byte numbers.
• Why is acknowledgement reneging allowed in TCP? That was a terrible idea.
• It turns out that measuring round-trip time is really important for retransmission, and TCP has no way of measuring RTT on retransmitted packets, which can pose real problems for correcting a ridiculously low RTT estimate, which results in excessive retransmission.
• Do you really need a PUSH bit? C'mon.
• A modest amount of overhead in the form of erasure-coding bits would permit recovery from modest amounts of packet loss without incurring retransmission timeouts, which is especially useful if your TCP-layer protocol requires a modest amount of packet loss for congestion control, as TCP does.
• Also you could use a "congestion experienced" bit instead of packet loss to detect congestion in the usual case. (TCP did eventually acquire CWR and ECE, but not for many years.)
• The fact that you can't resume a TCP connection from a different IP address, the way you can with a Mosh connection, is a serious flaw that seriously impedes nodes from moving around the network.
• TCP's hardcoded timeout of 5 minutes is also a major flaw. Wouldn't it be better if the application could set that to 1 hour, 90 minutes, 12 hours, or a week, to handle intermittent connectivity, such as with communication satellites? Similarly for very-long-latency datagrams, such as those relayed by single LEO satellites. Together this and the previous flaw have resulted in TCP largely being replaced for its original session-management purpose with new ad-hoc protocols such as HTTP magic cookies, protocols which use TCP, if at all, merely as a reliable datagram protocol.
• Initial sequence numbers turn out not to be a very good defense against IP spoofing, because that wasn't their original purpose. Their original purpose was preventing the erroneous reception of leftover TCP segments from a previous incarnation of the connection that have been bouncing around routers ever since; this purpose would be better served by using a different client port number for each new connection. The ISN namespace is far too small for current LFNs anyway, so we had to patch over the hole in TCP with timestamps and PAWS.
Much of that comes from the original applications being FTP and TELNET.
• Sequence numbers on messages or on bytes?
Bytes, because the whole TCP message might not fit in an IP packet. This is the MTU problem.
• Wouldn't it be useful to expose message boundaries to applications, the way 9P, SCTP, and some SNA protocols do?
Early on, there were some message-oriented, rather than stream-oriented, protocols on top of IP. Most of them died out. RDP (the Reliable Data Protocol) was one such. Another was QNet.[2] Both still have assigned IP protocol numbers, but I doubt that an RDP packet would get very far across today's internet.
This was a gap; TCP is not a great message-oriented protocol.
• Do you really need urgent data?
The purpose of urgent data is so that when your slow Teletype is typing away, and the recipient wants it to stop, there's a way to break in. See [1], p. 8.
• It turns out that measuring round-trip time is really important for retransmission, and TCP has no way of measuring RTT on retransmitted packets, which can pose real problems for correcting a ridiculously low RTT estimate, which results in excessive retransmission.
Yes, reliable RTT is a problem.
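For reference, here's the standard estimator modern stacks use (RFC 6298), sketched in Python. Karn's rule is the part that discards samples from retransmitted segments, which is exactly why a ridiculously low initial estimate can be slow to correct: the very retransmissions it causes yield no usable measurements.

```python
# RFC 6298 RTT estimation: smoothed RTT plus variance, with Karn's rule
# (samples from retransmitted segments are discarded as ambiguous).

class RttEstimator:
    ALPHA, BETA, K, G = 1 / 8, 1 / 4, 4, 0.001   # G = clock granularity (s)

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0                            # initial RTO, per the RFC

    def on_ack(self, rtt_sample, was_retransmitted=False):
        if was_retransmitted:
            return                                # Karn's rule: drop the sample
        if self.srtt is None:                     # first measurement
            self.srtt = rtt_sample
            self.rttvar = rtt_sample / 2
        else:
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt_sample))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt_sample
        self.rto = max(1.0, self.srtt + max(self.G, self.K * self.rttvar))

    def on_timeout(self):
        self.rto = min(self.rto * 2, 60.0)        # exponential backoff, capped

est = RttEstimator()
for sample in (0.120, 0.100, 0.300, 0.110):
    est.on_ack(sample)
print(f"SRTT={est.srtt:.3f}s RTO={est.rto:.3f}s")
```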
• Do you really need a PUSH bit? C'mon.
It's another legacy thing to make TELNET work on slow links. Is it even supported any more?
• A modest amount of overhead in the form of erasure-coding bits would permit recovery from modest amounts of packet loss without incurring retransmission timeouts, which is especially useful if your TCP-layer protocol requires a modest amount of packet loss for congestion control, as TCP does.
• Also you could use a "congestion experienced" bit instead of packet loss to detect congestion in the usual case. (TCP did eventually acquire CWR and ECE, but not for many years.)
Originally, there was ICMP Source Quench for that, but Berkeley didn't put it in BSD, so nobody used it. Nobody was sure when to send it or what to do when it was received.
• The fact that you can't resume a TCP connection from a different IP address, the way you can with a Mosh connection, is a serious flaw that seriously impedes nodes from moving around the network.
That would require a security system to prevent hijacking sessions.
[1] https://archive.org/stream/rfc854/rfc854.txt_djvu.txt
[2] https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers
This 100% !! And basically the reason mosh had to be created in the first place (and it probably wasn't easy). Unfortunately mosh only solves the problem for ssh. Exposing fixed IP addresses to the application layer probably doesn't help either.
So annoying that TCP tends to break whenever you switch wi-fi networks or switch from wi-fi to cellular. (On iPhones at least you have MPTCP, but that requires server-side support.)
I tend to think that one of the reasons Linux containers are needed for network services is that DNS traditionally only returns an IP address (rather than address + port), so each service process needs to have its own IP address, which in Linux requires a container or at least a network namespace.
AppleTalk also supported a reliable transaction (basically request-response RPC) protocol (ATP) and a session protocol, which I believe were used for Mac network services (printing, file servers, etc.) Certainly easier than serializing/deserializing byte streams.
I agree that, given the existing design of IP and TCP, you could get much of the benefit of first-class addresses for services by using, for example, DNS-SD, and that is what ZeroConf does. (It is not a coincidence that the DNS-SD RFC was written by a couple of Apple employees.) But, if that's the way you're going to be finding endpoints to initiate connections to, there's no benefit to having separate port numbers and IP addresses. And IP addresses are far scarcer than just requiring a Linux container or a network namespace: there are only 2³² of them. But it is rare to find an IP address that is listening on more than 64 of its 2¹⁶ TCP ports, so in an alternate history where you moved those 16 bits from the port number to the IP address, we would have one thousandth of the IP-address crunch that we do.
Historically, possibly the reason that it wasn't done this way is that port numbers predated the DNS by about 10 years.
The higher level protocols were built on ATP which was message based.
ADSP was a stream protocol that could be used for remote terminal access or other applications where byte streams actually made sense.
> Historically, possibly the reason that it wasn't done this way is that port numbers predated the DNS by about 10 years.
Predated or postdated?
My understanding is that DNS can potentially provide port numbers, but this is not widely used or supported.
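Right: the mechanism is the SRV record (RFC 2782), which carries a port alongside a target name. A sketch using the third-party dnspython package (assumed installed; the record name is just an example of the _service._proto.domain format):

```python
# Look up host+port for a service via a DNS SRV record (RFC 2782).
import dns.resolver  # pip install dnspython

answers = dns.resolver.resolve("_xmpp-client._tcp.jabber.org", "SRV")
for rr in sorted(answers, key=lambda r: (r.priority, -r.weight)):
    print(f"connect to {rr.target.to_text()} port {rr.port} "
          f"(priority {rr.priority}, weight {rr.weight})")
```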
Mockapetris's DNS RFCs are from 01983, although I think I've talked to people who installed DNS a year or two before that. Port numbers were first proposed in RFC 38 in 01970 https://datatracker.ietf.org/doc/html/rfc38
> The END and RDY must specify relevant sockets in addition to the link number. Only the local socket name need be supplied
and given actual numbers in RFC 54, also in 01970 https://datatracker.ietf.org/doc/html/rfc54
> Connections are named by a pair of sockets. Sockets are 40 bit names which are known throughout the network. Each host is assigned a private subset of these names, and a command which requests a connection names one socket which is local to the requesting host and one local to the receiver of the request.
> Sockets are polarized; even numbered sockets are receive sockets; odd numbered ones are send sockets. One of each is required to make a connection.
In RFC 129 in 01971 we see discussion about whether socketnames should include host numbers and/or user numbers, still with the low-order bit indicating the socket's gender (emissive or receptive). https://datatracker.ietf.org/doc/html/rfc129
RFC 147 later that year https://datatracker.ietf.org/doc/html/rfc147 discusses within-machine port numbers and how they should or should not relate to the socketnames transmitted in NCP packets:
> Previous network papers postulated that a process running under control of the host's operating system would have access to a number of ports. A port might be a physical input or output device, or a logical I/O device (...)
> A socket has been defined to be the identification of a port for machine to machine communication through the ARPA network. Sockets allocated to each host must be uniquely associated with a known process or be undefined. The name of some sockets must be universally known and associated with a known process operating with a specified protocol. (e.g., a logger socket, RJE socket, a file transfer socket). The name of other sockets might not be universally known, but given in a transmission over a universally known socket (e.g. the socket pair specified by the transmission over the logger socket under the Initial Connection Protocol (ICP)). In any case, communication over the network is from one socket to another socket, each socket being identified with a process running at a known host.
RFC 167 the same year https://datatracker.ietf.org/doc/html/rfc167 proposes that socketnames not be required to be unique network-wide but just within a host. It also points out that you really only need the socketname during the initial connection process, if you have some other way of knowing which packets belong to which connections:
> Although fields will be helpful in dealing with socket number allocation, it is not essential that such field designations be uniform over the network. In all network transactions the 32-bit socket number is handled with its 8-bit host number. Thus, if hosts are able to maintain uniqueness and repeatability internally, socket numbers in the network as a whole will also be unique and repeatable. If a host fails to do so, only connections with that offending host are affected.
> Because the size, use, and character of systems on the network are so varied, it would be difficult if not impossible to come up with an agreed upon particular division of the 32-bit socket number. Hosts have different internal restrictions on the number of users, processes per user, and connections per process they will permit.
> It has been suggested that it may not be necessary to maintain socket uniqueness. It is contended that there is really no significant use made of the socket number after a connection has been established. The only reason a host must now save a socket number for the life of a connection is to include it in the CLOSE of that connection.
RFC 172 in June https://datatracker.ietf.org/doc/html/rfc172 proposes using port 3 for the second version of FTP:
> [6] It seems that socket 1 has been assigned to logger. Socket 3 seems a reasonable choice for File Transfer.
This updates the first version in RFC 114 in April https://datatracker.ietf.org/doc/html/rfc114 which said:
> [16] It seems that socket 1 has been assigned to logger and socket 5 to NETRJS. Socket 3 seems a reasonable choice for the file transfer process.
RFC 196 the same year https://datatracker.ietf.org/doc/html/rfc196 proposes to use port 5 to receive mail and/or print jobs:
> Initial Connection will be as per the Official Initial Connection Protocol, Documents #2, NIC 7101, to a standard socket not yet assigned. A candidate socket number would be socket #5.
In RFC 204 in August https://www.rfc-editor.org/rfc/rfc204.html Postel publishes the first list of port number assignments:
> I would like to collect information on the use of socket numbers for "standard" service programs. For example Loggers (telnet servers) Listen on socket 1. What sockets at your host are Listened to by what programs?
> Recently Dick Watson suggested assigning socket 5 for use by a mail-box protocol (RFC196). Does any one object ? Are there any suggestions for a method of assigning sockets to standard programs? Should a subset of the socket numbers be reserved for use by future standard protocols?
> Please phone or mail your answers and comments to (...)
Amusingly in retrospect, Postel did not include an email address, presumably because they didn't have email working yet.
FTP's assignment to port 3 was confirmed in RFC 265 in November:
> Socket 3 is the standard preassigned socket number on which the cooperating file transfer process at the serving host should "listen". The connection establishment will be in accordance with the standard initial connection protocol, establishing a full-duplex connection.
In May of 01972 Postel published a list as RFC 349 https://www.rfc-editor.org/rfc/rfc349.html:
> I propose that there be a czar (me ?) who hands out official socket numbers for use by standard protocols. This czar should also keep track of and publish a list of those socket numbers where host specific services can be obtained. I further suggest that the initial allocation be as follows:
> and within the network wide standard functions the following particular assignment be made: [list omitted]

Note that ports 7 and 9 are still assigned to echo and discard in /etc/services, although Telnet and FTP got moved to ports 23 and 21, respectively. So, internet port numbers in their current form are from 01971 (several years before the split between TCP and IP), and DNS is from about 01982.

In December of 01972, Postel published RFC 433 https://www.rfc-editor.org/rfc/rfc433.html, obsoleting the RFC 349 list with a list including chargen and some other interesting services; the gap between 9 and 19 is unexplained.

RFC 503 https://www.rfc-editor.org/rfc/rfc503.html from 01973 has a longer list (including systat, datetime, and netstat), but also lists which services were running on which ARPANet hosts, 33 at that time. So RFC 503 contained a list of every server process running on what would later become the internet.
Skipping RFC 604, RFC 739 from 01977 https://www.rfc-editor.org/rfc/rfc739.html is the first one that shows the modern port number assignments (still called "socket numbers") for FTP and Telnet, though those presumably dated back a couple of years at that point. (This time I have truncated the list; it also has Finger on port 79.)

You say, "My understanding is that DNS can potentially provide port numbers, but this is not widely used or supported." DNS SRV records have existed since 01996 (proposed by Troll Tech and Paul Vixie in RFC 2052 https://www.rfc-editor.org/rfc/rfc2052), but they're really only widely used in XMPP, in SIP, and in ZeroConf, which was Apple's attempt to provide the facilities of AppleTalk on top of TCP/IP.
The Linux kernel supports it, but at least when I tried it, the relevant modules were disabled on most distros.
> The Stream Control Transmission Protocol (SCTP) is a computer networking communications protocol in the transport layer of the Internet protocol suite. Originally intended for Signaling System 7 (SS7) message transport in telecommunication, the protocol provides the message-oriented feature of the User Datagram Protocol (UDP) while ensuring reliable, in-sequence transport of messages with congestion control like the Transmission Control Protocol (TCP). Unlike UDP and TCP, the protocol supports multihoming and redundant paths to increase resilience and reliability.
[…]
> SCTP may be characterized as message-oriented, meaning it transports a sequence of messages (each being a group of bytes), rather than transporting an unbroken stream of bytes as in TCP. As in UDP, in SCTP a sender sends a message in one operation, and that exact message is passed to the receiving application process in one operation. In contrast, TCP is a stream-oriented protocol, transporting streams of bytes reliably and in order. However TCP does not allow the receiver to know how many times the sender application called on the TCP transport passing it groups of bytes to be sent out. At the sender, TCP simply appends more bytes to a queue of bytes waiting to go out over the network, rather than having to keep a queue of individual separate outbound messages which must be preserved as such.
> The term multi-streaming refers to the capability of SCTP to transmit several independent streams of chunks in parallel, for example transmitting web page images simultaneously with the web page text. In essence, it involves bundling several connections into a single SCTP association, operating on messages (or chunks) rather than bytes.
* https://en.wikipedia.org/wiki/Stream_Control_Transmission_Pr...
The only good answer is "a reliability layer on top of UDP"; fortunately everybody is now rallying around QUIC as the choice for that.
Core routers don't inspect that field; NAT/ISP boxes can. I believe that with two suitable dedicated Linux servers it is very possible to send and receive a single custom IP packet between them, even using 253 or 254 (= "Use for experimentation and testing", RFC 3692) as the protocol number.
To save a skim (though it's an interesting list!), protocol codes 253 and 254 are suitable "for experimentation and testing".
There is sometimes drama with it, though. A while back, the OpenBSD guys created CARP as a fully open source router failover protocol, but couldn't get an official protocol number and ended up using the same one as VRRP. There's also a lot of historical animosity that some companies got numbers for proprietary protocols (e.g. Cisco got one for its then-proprietary EIGRP).
https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers
I do hope we'll have stopped using IPv4 by then... But well, a decade after address exhaustion we are still on it, so who knows?
It uses them a little differently -- in IPv4, there is one protocol per packet, while in IPv6, "protocols" can be chained in a mechanism called extension headers -- but this actually makes the problem of number exhaustion more acute.
You left out ICMP, my favourite! (And a lot more important in IPv6 than in v4.)
Another pretty well-known protocol that is neither TCP nor UDP is IPsec (which is really two new IP protocols, AH and ESP). People really did design proper IP protocols still in the 90s.
> Can I just make up a packet and send it to a host across the Internet?
You should be able to. But if you are on a corporate network with a really strict firewalling router that only forwards traffic it likes, then likely not. There are also really crappy home routers which give similar problems from the other end of enterpriseness.
NAT also destroyed much of the end-to-end principle. If you don't have a real IP address and rely on a NAT router to forward your data, it needs to be in a protocol the router recognizes.
Anyway, for the past two decades people have grown tired of that and just pile hacks on top of TCP or UDP instead. That's sad. Or who am I kidding? Really it's on top of HTTP. HTTP will likely live on long past anything IP.
A part of the problem with UDP is the lack of good platforms and tooling. Examples as well. I’m trying to help with that, but it’s an uphill battle for sure.
Not necessarily. Many protocols can survive being NATed if they don't carry IP/port-related information inside their payload. FTP is a famous counterexample: it uses a control channel (TCP port 21) containing commands that open data channels (TCP port 20), and those commands specify IP:port pairs, so, depending on the protocol, a NAT router has to rewrite them and/or open ports dynamically and/or create NAT entries on the fly. A lot of other stuff has no need for that and will happily go through without any rewriting.
TCP and UDP have port numbers that the NAT software can extract and keep state tables for, so we can send the return traffic to its intended destination.
For unknown IP protocols that is not possible. At best it can act like a network diode, which is one way of violating the end-to-end principle.
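A toy sketch of why ports matter here (illustrative only, not real NAT code; the addresses are documentation examples): the translator keys a per-flow table on the transport 5-tuple, which it can only build if it understands the layer-4 header. An unknown protocol has no ports, so there is nothing to key the table on.

```python
# Toy NAPT table: outbound flows get a public port; return traffic is
# matched on (proto, public port, remote addr, remote port) and mapped back.

PUBLIC_IP = "198.51.100.1"       # example/documentation address

nat_table = {}                    # (proto, pub_port, dst_ip, dst_port) -> (src_ip, src_port)
next_port = 40000

def outbound(proto, src_ip, src_port, dst_ip, dst_port):
    global next_port
    pub_port = next_port; next_port += 1
    nat_table[(proto, pub_port, dst_ip, dst_port)] = (src_ip, src_port)
    return PUBLIC_IP, pub_port                  # rewritten source on the wire

def inbound(proto, src_ip, src_port, dst_port):
    key = (proto, dst_port, src_ip, src_port)
    return nat_table.get(key)                   # None -> drop: no matching flow

pub = outbound("tcp", "192.168.1.7", 51512, "93.184.216.34", 443)
assert inbound("tcp", "93.184.216.34", 443, pub[1]) == ("192.168.1.7", 51512)
```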
The end-to-end principle at the IP layer (i.e. having the IP forwarding layer be agnostic to the transport layer protocols above it) is still violated.
Even ICMP has a hard time traversing NATs and firewalls these days, for largely bad reasons. Try pinging anything in AWS, for example...
If any host is firewalling out ICMP then it won't be pingable but that does not depend on the hosting provider. AWS is no better or worse than any other in that regard, IME.
You might have more luck with an IPv6 packet.
They absolutely don't. Routers are layer 3 devices; TCP & UDP are layer 4. The only impact is that the ECMP flow hashes will have less entropy, but that's purely an optimization thing.
Note TCP, UDP and ICMP are nowhere near all the protocols you'll commonly see on the internet — at minimum, SCTP, GRE, L2TP and ESP are reasonably widespread (even a tiny fraction of traffic is still a giant number considering internet scales).
You can send whatever protocol number with whatever contents your heart desires. Whether the other end will do anything useful with it is another question.
Idealized routers are, yes.
Actual IP paths these days usually involve at least one NAT, and these will absolutely throw away anything other than TCP, UDP, and if you're lucky ICMP.
And note the GP talked about "intermediate routers". Those are the ones in a telco service site or datacenter, by my book.
Classic congestion control is done on the sender alone. The router's job is simply to drop packets when the queue is too large.
Maybe the router supports ECN, so if there's a queue going to the next hop, it will look for protocol specific ECN headers to manipulate.
Some network elements do more than the usual routing work. A traffic shaper might have per-user queues with outbound bandwidth limits. A network accelerator may effectively reterminate TCP in hopes of increasing achievable bandwidth.
Often, the router has an aggregated connection to the next hop, so it'll use a hash on the addresses in the packet to choose which of the underlying links to use. That hash could be based on many things, but it's not uncommon to use TCP or UDP port numbers if available. The same hash can also choose between equally scored next hops, which is why you often see several different paths during a traceroute. Using port numbers helps balance connections from IP A to IP B over multiple links. If you use an unknown protocol, even one that is multiplexed into ports or something similar (like TCP and UDP), the different streams will likely always hash onto the same link: you won't be able to exceed the bandwidth of a single link, and a damaged or congested link will affect all or none of your connections.
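A sketch of that selection logic (toy version in Python; real routers hash header fields in silicon, and the hash function and field choice vary by vendor). With ports in the mix, distinct connections spread across member links; with an unknown protocol, everything between two hosts collapses onto one link:

```python
# Toy ECMP/LAG link selection: hash the flow identifier onto one of N links.
import hashlib

def pick_link(n_links, src_ip, dst_ip, proto, sport=0, dport=0):
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    digest = hashlib.sha256(key).digest()        # any stable hash works
    return int.from_bytes(digest[:4], "big") % n_links

# Four TCP connections between the same two hosts spread over 4 links:
print([pick_link(4, "10.0.0.1", "10.0.0.2", 6, 50000 + i, 443) for i in range(4)])
# An unknown protocol (no visible ports) always lands on the same link:
print([pick_link(4, "10.0.0.1", "10.0.0.2", 253) for _ in range(4)])
```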
If middleware decides to do packet inspection, it had better make sure that any behavioral difference (relative to not doing any inspection) is strictly an optimization and does not impact the correctness of the link.
Also, although I'm not a network operator by any stretch, my understanding is that TCP congestion control is primarily a function of the endpoints of the TCP link, not the IP routers along the way. As Wikipedia explains [0]:
> Per the end-to-end principle, congestion control is largely a function of internet hosts, not the network itself.
[0]: https://en.wikipedia.org/wiki/TCP_congestion_control
A bunch of multicast stuff (IGMP, PIM)
A few routing protocols (OSPF, but notably not BGP which just uses TCP, and (usually) not MPLS which just goes over the wire - it sits at the same layer as IP and not above it)
A few VPN/encapsulation solutions like GRE, IP-in-IP, L2TP and probably others I can't remember
As usual, Wikipedia has got you covered, much better than my own recollection: https://en.wikipedia.org/wiki/List_of_IP_protocol_numbers
Behind a NA(P)T, you can obviously only use those protocols that the translator knows how to remap ports for.
As soon as you start thinking about having multiple services on a host, you end up with the idea of a service ID, or "port".
UDP or UDP-Lite gives you exactly that at the cost of 8 bytes, so there's no real value in not just putting everything on top of UDP.
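Those 8 bytes really are the whole header. A sketch (checksum shown as zero, which IPv4 permits; actually computing it requires the IP pseudo-header):

```python
# The entire UDP header: source port, destination port, length, checksum.
# 8 bytes of overhead buys you multiplexing by port number.
import struct

def udp_header(sport, dport, payload_len, checksum=0):
    return struct.pack("!HHHH", sport, dport, 8 + payload_len, checksum)

hdr = udp_header(50000, 53, payload_len=29)
assert len(hdr) == 8
```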
*The protocol.
QUIC is still "their own protocol", just implemented as another protocol nested inside a UDP envelope, the same way that HTTP is another protocol typically nested inside a TCP connection. It makes some sense that they'd piggyback on UDP, since (1) it doesn't require an additional IP protocol header code to be assigned by IANA, (2) QUIC definitely wants to coexist with other services on any given node, and (3) it allows whatever middleware analyses that exist for UDP to apply naturally to QUIC applications.
(Regarding (3) specifically, I imagine NAT in particular requires cooperation from residential gateways, including awareness of both the IP and the TCP/UDP port. Allowing a well-known outer UDP header to surface port information, instead of re-implementing ports somewhere in the QUIC header, means all existing NAT implementations should work unchanged for QUIC.)
Most firewalls will drop unknown IP protocols. Many will drop a lot of TCP; some drop almost all UDP. This is why so much stuff runs over tcp ports 80 and 443; it's almost always open. QUIC/HTTP/3 encourages opening of udp/443, so it's a good port to run unrelated things over too.
Also, given that SCTP had similar goals to QUIC and never got much deployment or support in OSes, NATs, firewalls, etc., it's a clear win to just use UDP and get something that will just work on a large portion of networks.
Some people here will argue that it actually really is, and that everybody experiencing issues is just on a really weird connection or using broken hardware, but those weird connections and bad hardware make up the overwhelming majority of Internet connections these days.
Additionally, firewalls are also designed to filter out any weird packets. If the packet doesn't look like you wanted to receive it, it's dropped. It usually does this by tracking open ports just like NAT, therefore many firewalls also don't trust custom protocols.
There are many routers that don't care at all about what's going through them. But these days there's hardly a path left that doesn't pass through a firewall somewhere (not even at the endpoints).
https://en.wikipedia.org/wiki/IP_over_Avian_Carriers
(It's absolutely worth reading some of those old April Fools' RFCs, by the way [0]. I'm a big fan of RFC 2324, which introduced HTTP response code 418 "I'm a teapot"; RFC 7168 later extended it to tea.)
[0]: https://en.wikipedia.org/wiki/April_Fools%27_Day_Request_for...