
Basically, one user (P1) runs a server and sends frame/audio data out to all the clients (you could even use UDP for this instead of TCP). P2/3/4 simply send their input presses back to the server, which is listening on a socket. P2/3/4 could even be on low-end machines and essentially "stream" the game from P1, who has a high-end machine. All P2/3/4 do is provide input.
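To make the idea a bit more concrete, here is a rough Python sketch of the host side. Everything in it is an assumption for illustration: the 7-byte input message layout, the `Host` class, and the fake "frame" payload are all made up, and real code would push these bytes over a TCP or UDP socket and send actual encoded video/audio.

```python
import struct

# Hypothetical wire format for one input message, in network byte order:
# player number (1 byte), frame counter (4 bytes), button bitmask (2 bytes).
INPUT_FORMAT = "!BIH"


def pack_input(player, frame, buttons):
    """What P2/3/4 would send to the host for each input update."""
    return struct.pack(INPUT_FORMAT, player, frame, buttons)


def unpack_input(data):
    """What the host does with bytes read from its listening socket."""
    return struct.unpack(INPUT_FORMAT, data)


class Host:
    """P1's side: the only place where game state lives."""

    def __init__(self, players=(1, 2, 3, 4)):
        self.frame = 0
        self.buttons = {p: 0 for p in players}

    def receive(self, data):
        # Remember the latest button state reported by a remote player.
        player, _frame, buttons = unpack_input(data)
        if player in self.buttons:
            self.buttons[player] = buttons

    def step(self):
        # Stand-in for "emulate one frame and encode video/audio":
        # here the frame is just a counter plus one byte per player.
        self.frame += 1
        payload = bytes(self.buttons[p] & 0xFF for p in sorted(self.buttons))
        return struct.pack("!I", self.frame) + payload
```

The clients never run the game themselves; they only call something like `pack_input` and display whatever `step` produced on the host.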
The only downside to this approach is that P1 has an advantage on a high-latency network: P2/3/4 will experience input lag that grows with network latency (and gets worse when bandwidth is limited), while P1 plays locally with none. However, their input will remain perfectly in sync, since the game only ever runs on P1's machine.
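As a back-of-the-envelope illustration of that lag, here is a tiny Python estimate. It assumes a deliberately simple model that is not from any real implementation: a remote player's input needs one trip to the host, the resulting frame needs one trip back, and the frame also takes time to squeeze through the link. Encoding, decoding, and the host's own frame time are ignored, and the function name and parameters are made up.

```python
def estimated_input_lag_ms(rtt_ms, frame_bytes, bandwidth_bytes_per_s):
    """Rough lag a remote player sees between a press and its frame.

    Assumed model: round-trip time plus the time needed to transfer
    one encoded frame at the given bandwidth.
    """
    transfer_ms = frame_bytes * 1000.0 / bandwidth_bytes_per_s
    return rtt_ms + transfer_ms
```

For example, under this model a 40 ms round trip with 20,000-byte frames on a 1,000,000 bytes/s link gives about 60 ms of lag for P2/3/4, while P1 sees none of it.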