Even though WebRTC is a p2p technology, a signaling server is always required to carry the initial offer generated by one client to the other, so that they can connect.
It is always required because WebRTC isn't completely p2p. The signaling server relays the first offer from one client to the other; after that, the clients can carry the entire connection on their own.
It's inexpensive and can be set up with a few lines of socket.io code.
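To make the relay role concrete, here is a minimal sketch of what a signaling server does. It is an in-memory stand-in, not socket.io code: the `SignalingRelay` class, the `SignalMessage` shape, and the room and peer names are all hypothetical and chosen only for illustration. In a real deployment, a library such as socket.io would provide the transport between the browser and the server.

```typescript
// Hypothetical in-memory stand-in for a signaling server. It only relays
// opaque messages between peers in the same room; it never touches media.
type SignalKind = "offer" | "answer" | "candidate";

interface SignalMessage {
  kind: SignalKind;
  from: string;    // id of the sending peer
  payload: string; // e.g. an SDP blob or a serialized ICE candidate
}

type Listener = (msg: SignalMessage) => void;

class SignalingRelay {
  // room name -> (peer id -> callback that delivers a message to that peer)
  private rooms = new Map<string, Map<string, Listener>>();

  join(room: string, peerId: string, onMessage: Listener): void {
    const peers = this.rooms.get(room) ?? new Map<string, Listener>();
    peers.set(peerId, onMessage);
    this.rooms.set(room, peers);
  }

  // Forward a message to everyone in the room except the sender.
  // Returns how many peers received it.
  send(room: string, msg: SignalMessage): number {
    const peers = this.rooms.get(room);
    if (!peers) return 0;
    let delivered = 0;
    peers.forEach((listener, peerId) => {
      if (peerId !== msg.from) {
        listener(msg);
        delivered++;
      }
    });
    return delivered;
  }
}
```

The relay never interprets the payload. Once the offer and the answer have been exchanged this way, the peers no longer need it for the media itself.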
A NAT traversal server is typically required in production scenarios. Before IPv6 there weren't enough IP addresses to go around, so sysadmins set up NATs, which translate requests between a public address and the non-routable private addresses of the internal network. As a result, the address exposed to the web is different from the IP address the machine actually has.
In certain scenarios, firewall restrictions block some UDP ports and the connection might not go through.
To solve this, WebRTC gives us the ICE (Interactive Connectivity Establishment) protocol. It allows a client behind a NAT or a firewall to talk to another client that may or may not be behind such a device.
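There are two types of servers that handle ICE transactions: STUN servers, which let a client discover its public address, and TURN servers, which relay the traffic when a direct connection isn't possible.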
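As a rough illustration, the two server types used with ICE (STUN for discovering the public address, TURN for relaying) are usually handed to the client as a list of ICE servers. The sketch below builds such a configuration. The host names `stun.example.org` and `turn.example.org`, the helper `buildPeerConfig`, and the local interfaces are assumptions made for this example; the field names (`iceServers`, `urls`, `username`, `credential`) follow the configuration object accepted by the browser's `RTCPeerConnection`.

```typescript
// Local types mirroring the shape of the browser's ICE server configuration,
// declared here so the sketch is self-contained.
interface IceServerEntry {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface PeerConfig {
  iceServers: IceServerEntry[];
}

// Hypothetical helper: the server addresses are placeholders, not real services.
function buildPeerConfig(turnUser: string, turnSecret: string): PeerConfig {
  return {
    iceServers: [
      // STUN: lets the client learn the address it has on the public side of the NAT.
      { urls: "stun:stun.example.org:3478" },
      // TURN: relays the traffic when no direct path can be found.
      { urls: "turn:turn.example.org:3478", username: turnUser, credential: turnSecret },
    ],
  };
}
```

In a browser, the returned object could be passed as `new RTCPeerConnection(buildPeerConfig(user, secret))`.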
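Media servers are usually needed in multi-party scenarios, since the cost of transferring multiple streams grows very quickly: without a server, every client has to send its stream to every other client, so the total number of streams grows quadratically with the number of participants.
There are several approaches to improve this, but none of them is perfect, and the right choice depends on the needs of the application.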
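The growth in cost can be worked out with a little arithmetic. The sketch below assumes a full mesh, where every client sends its stream directly to every other client; the function names are made up for this example.

```typescript
// Full mesh: each of the n clients uploads one copy of its stream to every other client.
function meshUploadsPerClient(n: number): number {
  return Math.max(n - 1, 0);
}

// Total one-way streams in flight across the whole call: n * (n - 1).
function meshTotalStreams(n: number): number {
  return n * meshUploadsPerClient(n);
}
```

For 4 participants this gives 12 streams in total, and for 10 participants it gives 90, with each client uploading 9 copies of its own video.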
MCU (Multipoint Control Unit)
All the streams are sent to a single server, which merges them and sends the combined stream back to the peers. It's a simple approach, but it comes at a high cost: we can't set up the layouts as we prefer, and it's computationally expensive, which reduces performance and quality and increases latency.
SFU (Selective Forwarding Unit)
A much more flexible approach, where each client sends one stream and the SFU copies that stream and forwards it to the other peers. It's lightweight since we're not merging streams; the cost we pay is for decryption and lots of bandwidth. It's a lot more scalable and flexible.
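The difference between the two server types shows up in how many streams each client has to send and receive. The following sketch counts them per client under simplifying assumptions: one stream per participant, with a serverless mesh included as a baseline. The `Topology` type and `streamsPerClient` function are hypothetical names for this example.

```typescript
type Topology = "mesh" | "mcu" | "sfu";

interface ClientLoad {
  up: number;   // streams the client uploads
  down: number; // streams the client downloads
}

function streamsPerClient(topology: Topology, participants: number): ClientLoad {
  const others = Math.max(participants - 1, 0);
  switch (topology) {
    case "mesh":
      // No media server: one upload and one download per other participant.
      return { up: others, down: others };
    case "mcu":
      // One upload to the server, one merged stream back (if anyone else is in the call).
      return { up: 1, down: others > 0 ? 1 : 0 };
    case "sfu":
      // One upload to the server, one forwarded stream per other participant.
      return { up: 1, down: others };
  }
}
```

With 5 participants, a client in a mesh uploads 4 streams, while with an MCU or an SFU it uploads only 1. The SFU client still downloads 4 separate streams, which is where the bandwidth cost comes from; the MCU client downloads 1, at the price of the server doing the merging.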
Simulcast (SFU with Simulcast)
Each client simultaneously sends more than one media feed: generally one high-bitrate feed and one or more lower-bitrate feeds, kind of like thumbnails.
The logic of the SFU can be set to something like: the main person who's talking is forwarded in HD quality and the other streams are sent in SD quality for thumbnails, etc.
This can save lots of bandwidth compared to a normal SFU and gives a much better user experience. It's much more scalable and flexible.
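The forwarding rule described here can be sketched as a small selection function. This is only an illustration of the idea, not how any particular SFU implements it: the `Participant` shape, the two layer names, and the bitrates (1500 kbps for the high layer, 150 kbps for the low one) are assumptions picked for the example.

```typescript
type Layer = "high" | "low";

interface Participant {
  id: string;
  isActiveSpeaker: boolean;
}

// Assumed bitrates per simulcast layer, in kbps (illustrative values only).
const ASSUMED_KBPS: Record<Layer, number> = { high: 1500, low: 150 };

// For one receiver, decide which layer of every other participant to forward:
// the active speaker in high quality, everyone else as a low-bitrate thumbnail.
function selectLayers(receiverId: string, participants: Participant[]): Map<string, Layer> {
  const plan = new Map<string, Layer>();
  for (const p of participants) {
    if (p.id === receiverId) continue; // never forward a client's own stream back to it
    plan.set(p.id, p.isActiveSpeaker ? "high" : "low");
  }
  return plan;
}

// Estimated downlink for a receiver under the assumed bitrates.
function downlinkKbps(plan: Map<string, Layer>): number {
  let total = 0;
  plan.forEach((layer) => {
    total += ASSUMED_KBPS[layer];
  });
  return total;
}
```

Under these assumed bitrates, a receiver in a four-person call with one active speaker downloads 1500 + 150 + 150 = 1800 kbps, compared with 4500 kbps if all three remote streams were forwarded in high quality.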
Gateway servers are for applications that allow users to connect not only with browsers but also with other devices such as telephones. A gateway server basically acts as a middleman that translates the streams into the appropriate formats and transfers them over the different protocols.