A while ago I mentioned that I use Android-10 with the built-in SIP stack, that the Google stack was pretty buggy, and that I had to fix it simply to get it to function without disconnecting all the time. Since then I’ve forward-ported my fixes to Android-11 (the jejb-11 branch in the repositories) by using LineageOS-19.1. However, another major deficiency in the Google SIP stack is its complete lack of security: both the SIP signalling and the media streams are unencrypted, meaning they can be intercepted and tapped by pretty much anyone in the network path running tcpdump. Why this is so, particularly for a company that keeps touting its security credentials, is anyone’s guess. I personally suspect they added SIP in Android-4 with a view to basing Google Voice on it, decided later that proprietary VoIP protocols were the way to go, but got stuck with people actually using the SIP stack for other calling services, so they couldn’t rip it out and instead simply neglected it, hoping it would die quietly due to lack of features and updates.
This blog post is a guide to how I took the fully unsecured Google SIP stack and added security to it. It also gives a brief overview of some of the security protocols you need to understand to get secure VoIP working.
What is SIP
What I’m calling SIP (but is really a VoIP system using SIP) is a protocol stack consisting of several pieces. SIP (Session Initiation Protocol), RFC 3261, is really only one piece: it is the “signalling” layer, meaning that call initiation, response and parameters are all communicated this way. However, SIP alone isn’t enough for a complete VoIP stack; once a call moves to in progress, there must be an agreement on where the media streams are and how they’re encoded. This piece is called an SDP (Session Description Protocol) agreement and is usually negotiated in the body of the SIP INVITE and response messages. Finally, once agreement is reached, the actual media stream for call audio goes over a different protocol called RTP (Real-time Transport Protocol).
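To make the layering concrete, here is a minimal Python sketch (not code from the Android stack) that splits a SIP INVITE into its headers and SDP body and then reads the audio endpoint out of the SDP. The addresses, names, tags and function names are all made up for illustration, and the parser deliberately handles only the `c=` and `m=audio` lines.

```python
def split_sip_message(message):
    """Split a SIP message into (start line, headers dict, body)."""
    head, _, body = message.partition("\r\n\r\n")
    lines = head.split("\r\n")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return lines[0], headers, body


def audio_endpoint_from_sdp(sdp):
    """Return (address, port, transport protocol, payload formats) for audio."""
    address = port = proto = None
    formats = []
    for line in sdp.splitlines():
        if line.startswith("c="):
            # c=<nettype> <addrtype> <connection-address>
            address = line[2:].split()[2]
        elif line.startswith("m=audio "):
            # m=<media> <port> <proto> <fmt> ...
            fields = line[2:].split()
            port, proto, formats = int(fields[1]), fields[2], fields[3:]
    return address, port, proto, formats


# Hypothetical example data: an unencrypted offer for PCMU audio.
SDP_BODY = "\r\n".join([
    "v=0",
    "o=alice 2890844526 2890844526 IN IP4 192.0.2.10",
    "s=-",
    "c=IN IP4 192.0.2.10",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0",
    "a=rtpmap:0 PCMU/8000",
]) + "\r\n"

INVITE = "\r\n".join([
    "INVITE sip:bob@example.com SIP/2.0",
    "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bK776asdhds",
    "Max-Forwards: 70",
    "From: <sip:alice@example.com>;tag=1928301774",
    "To: <sip:bob@example.com>",
    "Call-ID: a84b4c76e66710@192.0.2.10",
    "CSeq: 314159 INVITE",
    "Contact: <sip:alice@192.0.2.10>",
    "Content-Type: application/sdp",
    f"Content-Length: {len(SDP_BODY)}",
    "",
    SDP_BODY,
])
```

Calling `split_sip_message(INVITE)` hands back the SDP as the message body, and `audio_endpoint_from_sdp` on that body says where the RTP stream should be sent and which codec numbers are on offer.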
How Google did SIP
The trick to adding protocols fast is to take them from someone else (if you’re open source, this is encouraged), so Google actually chose the NIST-SIP Java stack (which later became the JAIN-SIP stack) as the basis for SIP in Android. However, that only covered signalling and they had to plumb it in to the Android Phone model. One essential glue piece is frameworks/opt/net/voip, which supplies the SDP negotiating layer and interfaces the network codec to the phone audio. This isn’t quite enough because the telephony service and the Dialer also need to be involved to do the account setup and call routing. It always interested me that SIP was essentially special cased inside these services and apps instead of being a plug in, but that’s due to the fact that some of the classes that need extending to add phone protocols are internal only; presumably so only manufacturers can add phone features.
Securing the SIP signalling is pretty easy, following the time-honoured path of sending messages over TLS instead of in the clear simply by using the TLS wrappering technique of secure sockets and, indeed, this is how RFC 3261 says to do it. However, even this minor re-engineering proved unnecessary because the NIST-SIP stack was already TLS capable; it simply wasn’t allowed to be activated that way by the configuration options Google presented. A simple 10 line patch in a couple of repositories (frameworks/opt/net/voip) fixed this and the SIP stack messaging was secured, leaving only the voice stream insecure.
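The "secure sockets" wrapping idea is generic, so it can be sketched outside Android. The following Python snippet is only an illustration of the technique, not the Java patch itself: it opens a TCP connection and wraps it in TLS, after which SIP messages are written to the socket exactly as before. The function names are hypothetical; 5061 is the default port RFC 3261 gives for SIP over TLS.

```python
import socket
import ssl

SIP_TLS_PORT = 5061  # default port for SIP over TLS (RFC 3261)


def make_sip_tls_context(cafile=None):
    """Build a client context that verifies the server certificate and name."""
    return ssl.create_default_context(cafile=cafile)


def open_sip_tls_connection(host, port=SIP_TLS_PORT, cafile=None, timeout=10.0):
    """Connect to a SIP server and wrap the TCP socket in TLS."""
    context = make_sip_tls_context(cafile)
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        # From here on, SIP messages are sent with sendall() as usual;
        # only the transport underneath has changed.
        return context.wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
```

The point of the sketch is that nothing in the SIP message format changes; the stack only needs to be told to use the wrapped transport.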
As I said above, the Google frameworks/opt/net/voip does all the SDP negotiation. This isn’t actually part of SIP. The SDP negotiation is conducted over SIP messages (which means it’s secured thanks to the above) but how this should be done isn’t part of the SIP RFC. Instead SDP has its own RFC 4566, which is what the class follows (mainly for codec and port negotiation). You’d think that if it’s already secured by SIP, there’s no additional problem but, unfortunately, using SRTP as the audio stream requires the exchange of additional security parameters, which were added to SDP by RFC 4568. To incorporate this into the Google SIP stack, it has to be integrated into the voip class. The essential additions in this RFC are a separate media description protocol (RTP/SAVP) for the secure stream and the addition of a set of tagged a=crypto: lines for key negotiation.
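As a rough picture of what those two additions look like on the wire, here is a small Python sketch that emits the audio media lines of an SDP offer with and without security. The function names are hypothetical, the codec is fixed to PCMU for brevity, and only the AES_CM_128_HMAC_SHA1_80 suite is shown, whose inline key is a 16 byte master key followed by a 14 byte master salt, base64 encoded.

```python
import base64
import os
from typing import List, Optional

# AES_CM_128_HMAC_SHA1_80: 16 byte master key plus 14 byte master salt.
MASTER_KEY_LEN = 16
MASTER_SALT_LEN = 14


def new_inline_key() -> str:
    """Generate random key||salt material, base64 encoded for a=crypto."""
    material = os.urandom(MASTER_KEY_LEN + MASTER_SALT_LEN)
    return base64.b64encode(material).decode("ascii")


def audio_media_lines(port: int, inline_key: Optional[str] = None) -> List[str]:
    """Return the m= and a= lines for one PCMU audio stream.

    Without a key the plain RTP/AVP profile is offered; with a key the
    profile becomes RTP/SAVP and a tagged a=crypto line carries the key.
    """
    if inline_key is None:
        return [f"m=audio {port} RTP/AVP 0", "a=rtpmap:0 PCMU/8000"]
    return [
        f"m=audio {port} RTP/SAVP 0",
        "a=rtpmap:0 PCMU/8000",
        f"a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:{inline_key}",
    ]
```

Because the key travels in the clear inside the SDP body, this only makes sense when the SIP messages carrying it are themselves protected by TLS.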
As will be a common theme: not all of RFC 4568 has to be implemented to get a secure RTP stream. It goes into great detail about key lifetime and master key indexes, neither of which are used by the asterisk SIP stack (which is the one my phone communicates with) so they’re not implemented. Briefly, it is best practice in TLS to rekey the transport periodically, so part of key negotiation should be key lifetime (actually, this isn’t as important to SRTP as it is to TLS, see below, which is why asterisk ignores it) and the people writing the spec thought it would be great to have a set of keys to choose from instead of just a single one (The Master Key Identifier) but realistically that simply adds a load of complexity for not much security benefit and, again, is ignored by asterisk which uses a single key.
In the end, it was a case of adding a new class for parsing the a=crypto: lines of SDP and doing a loop in the audio protocol for RTP/SAVP if TLS were set as the transport. This ended up being a ~400 line patch.
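The real parser is a Java class inside the voip code; the Python sketch below is not that class, only an illustration of how little of RFC 4568 is needed when lifetime and MKI are ignored. The class and function names are hypothetical, and only the two AES_CM_128 suites are accepted.

```python
import base64
from typing import NamedTuple, Optional

# Expected length of master key || master salt for each supported suite.
KEY_AND_SALT_LEN = {
    "AES_CM_128_HMAC_SHA1_80": 30,
    "AES_CM_128_HMAC_SHA1_32": 30,
}


class CryptoAttribute(NamedTuple):
    tag: int
    suite: str
    key_and_salt: bytes


def parse_crypto_line(line: str) -> CryptoAttribute:
    """Parse 'a=crypto:<tag> <suite> inline:<base64>[|lifetime][|MKI:len]'."""
    prefix = "a=crypto:"
    if not line.startswith(prefix):
        raise ValueError("not an a=crypto line")
    fields = line[len(prefix):].split()
    if len(fields) < 3:
        raise ValueError("truncated a=crypto line")
    tag, suite, key_params = fields[0], fields[1], fields[2]
    if suite not in KEY_AND_SALT_LEN:
        raise ValueError("unsupported crypto suite: " + suite)
    # Several keys may be separated by ';'; this sketch uses only the first.
    method, _, key_info = key_params.split(";")[0].partition(":")
    if method != "inline":
        raise ValueError("unsupported key method: " + method)
    # Anything after the first '|' is lifetime or MKI, which is ignored here.
    material = base64.b64decode(key_info.split("|")[0], validate=True)
    if len(material) != KEY_AND_SALT_LEN[suite]:
        raise ValueError("wrong key length for " + suite)
    return CryptoAttribute(int(tag), suite, material)


def select_crypto(sdp: str) -> Optional[CryptoAttribute]:
    """Return the first a=crypto line this sketch understands, if any."""
    for line in sdp.splitlines():
        if line.startswith("a=crypto:"):
            try:
                return parse_crypto_line(line)
            except ValueError:
                continue
    return None
```

The answer to an offer echoes the tag of the chosen line, which is why the tag is kept alongside the key material.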
RTP itself is governed by RFC 3550 which actually contains two separate stream descriptions: the actual media over RTP and a control protocol over RTCP. RTCP is mostly used for multi-party and video calls (where you want reports on reception quality to up/downshift the video resolution) and really serves no purpose for audio, so it isn’t implemented in the Google SIP stack (and isn’t really used by asterisk for audio only either).
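For reference, the fixed RTP header from RFC 3550 is only 12 bytes, and everything after it is codec payload. The Python sketch below packs and unpacks that header; the names are hypothetical, and header extensions and padding are deliberately not handled.

```python
import struct
from typing import NamedTuple

# Fixed RTP header: flags, marker/payload type, sequence, timestamp, SSRC.
RTP_HEADER = struct.Struct("!BBHII")


class RtpHeader(NamedTuple):
    payload_type: int
    marker: bool
    sequence: int
    timestamp: int
    ssrc: int


def pack_rtp(header: RtpHeader, payload: bytes) -> bytes:
    """Build an RTP packet with version 2 and no padding, extension or CSRCs."""
    first = 2 << 6
    second = (0x80 if header.marker else 0) | (header.payload_type & 0x7F)
    fixed = RTP_HEADER.pack(first, second, header.sequence,
                            header.timestamp, header.ssrc)
    return fixed + payload


def unpack_rtp(packet: bytes):
    """Return (RtpHeader, payload); header extensions are not handled."""
    if len(packet) < RTP_HEADER.size:
        raise ValueError("packet shorter than the RTP header")
    first, second, sequence, timestamp, ssrc = RTP_HEADER.unpack_from(packet)
    if first >> 6 != 2:
        raise ValueError("not RTP version 2")
    # Skip any CSRC identifiers that follow the fixed header.
    offset = RTP_HEADER.size + 4 * (first & 0x0F)
    header = RtpHeader(second & 0x7F, bool(second & 0x80),
                       sequence, timestamp, ssrc)
    return header, packet[offset:]
```

The sequence number and SSRC fields matter later, because SRTP reuses them when building its per-packet cipher state.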
When it comes to securing RTP (and RTCP) you’d think the time honoured mechanism (using secure sockets) would have applied although, since RTP is transmitted over UDP, one would have to use DTLS instead of TLS. Apparently the IETF did consider this, but elected to define a new protocol instead (or actually two: SRTP and SRTCP) in RFC 3711. One of the problems with this new protocol is that it also defines a new ciphersuite (AES_CM_…) which isn’t found in any of the standard SSL implementations. Although the AES_CM ciphers are very similar in operation to the AES_GCM ciphers of TLS (Indeed AES_GCM was adopted for SRTP in a later RFC 7714) they were never incorporated into the TLS ciphersuite definition.
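To give a flavour of what makes AES_CM its own thing, RFC 3711 builds each packet's counter-mode IV from the session salt, the SSRC and the packet index. The Python sketch below covers only that step. The function names are made up, the AES encryption of the resulting counter blocks is left out because the Python standard library has no AES, and the session salt is assumed to have already been derived from the master salt by the RFC's key derivation function.

```python
def packet_index(roc: int, seq: int) -> int:
    """48 bit SRTP packet index: rollover counter followed by the RTP sequence."""
    return (roc << 16) | seq


def aes_cm_iv(session_salt: bytes, ssrc: int, index: int) -> bytes:
    """IV = (k_s * 2^16) XOR (SSRC * 2^64) XOR (i * 2^16), as a 16 byte block.

    The low 16 bits are left at zero; they act as the block counter that is
    incremented for each 16 byte chunk of keystream within one packet.
    """
    if len(session_salt) != 14:
        raise ValueError("session salt must be 112 bits")
    k_s = int.from_bytes(session_salt, "big")
    iv = (k_s << 16) ^ (ssrc << 64) ^ (index << 16)
    return iv.to_bytes(16, "big")
```

Since the IV depends on the SSRC and the packet index, every packet gets a distinct keystream from a single session key without any extra per-packet negotiation.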
So now there are two problems: adding code for the new protocol and performing the new encryption/decryption scheme. Fortunately, there already exists a library (libsrtp) that can do this and, even more fortunately, it’s shipped in Android (external/libsrtp2), although it looks to be one of those throwaway additions where the library hasn’t really been updated since it was added (for cuttlefish gcastv2) in 2019 and so is still at a pre 2.3.0 version (I did check and there don’t look to be any essential bug fixes missing vs upstream, so it seems usable as is).
One of the really great things about libsrtp is that it has srtp_protect and srtp_unprotect functions which transform RTP to SRTP and vice versa, so it’s easily possible to layer this library directly into an existing RTP implementation. When doing this you have to remember that the encryption also includes authentication, so the size of the packet expands, which is why the initial allocation size of the buffers has to be increased. One of the not so great things is that it implements all its own crypto primitives including AES and SHA1 (which most cryptographers think is always a bad idea) but on the plus side, it’s the same library asterisk uses so how much of a real problem could this be …
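The packet growth is easy to see in isolation. The following Python sketch is not libsrtp and does no encryption at all; it only reproduces the authentication step of RFC 3711 for the HMAC_SHA1_80 suites, where an HMAC-SHA1 over the packet plus the rollover counter is truncated to 80 bits and appended. The function names are hypothetical and the session authentication key is assumed to have been derived already.

```python
import hashlib
import hmac

AUTH_TAG_LEN = 10  # 80 bit tag for the ..._HMAC_SHA1_80 suites


def srtp_auth_tag(auth_key: bytes, packet: bytes, roc: int) -> bytes:
    """HMAC-SHA1 over packet || ROC, truncated to the tag length."""
    mac = hmac.new(auth_key, packet + roc.to_bytes(4, "big"), hashlib.sha1)
    return mac.digest()[:AUTH_TAG_LEN]


def append_auth_tag(auth_key: bytes, packet: bytes, roc: int = 0) -> bytes:
    """Sender side: the packet grows by AUTH_TAG_LEN bytes."""
    return packet + srtp_auth_tag(auth_key, packet, roc)


def check_and_strip_auth_tag(auth_key: bytes, protected: bytes, roc: int = 0) -> bytes:
    """Receiver side: verify the trailing tag and return the original packet."""
    if len(protected) < AUTH_TAG_LEN:
        raise ValueError("packet too short to carry an auth tag")
    packet, tag = protected[:-AUTH_TAG_LEN], protected[-AUTH_TAG_LEN:]
    if not hmac.compare_digest(tag, srtp_auth_tag(auth_key, packet, roc)):
        raise ValueError("authentication failed")
    return packet
```

A 172 byte PCMU packet (12 byte header plus 160 bytes of audio) therefore becomes 182 bytes on the wire, so any buffer sized for plain RTP needs at least that much extra room.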
Following the simple layering approach, I constructed a patch to do the RTP<->SRTP transform in the JNI code if a key is passed in, so now everything just works and setting asterisk to SRTP only confirms the phone is able to send and receive encrypted audio streams. This ends up being a ~140 line patch.
So where does DTLS come in?
Anyone spending any time at all looking at protocols which use RTP, like webRTC, sees RTP and DTLS always mentioned in the same breath. Even asterisk has support for DTLS, so why is this? The answer is that if you use RTP outside the SIP framework, you still need a way of agreeing on the keys using SDP. That key agreement must be protected (and can’t go over RTCP because that would cause a chicken and egg problem) so implementations like webRTC use DTLS to exchange the initial SDP offer and answer negotiation. This is actually referred to as DTLS-SRTP even though it’s an initial DTLS handshake followed by SRTP (with no further DTLS in sight). However, this DTLS handshake is completely unnecessary for SIP, since the SDP handshake can be done over TLS protected SIP messaging instead (although I’ve yet to find anyone who can convincingly explain why this initial handshake has to go over DTLS and not TLS like SIP … I suspect it has something to do with wanting the protocol to be all UDP and go over the same initial port).
This whole exercise ended up producing less than 1000 lines in patches and taking a couple of days over Christmas to complete. That’s actually much simpler and way less time than I expected (given the complexity in the RFCs involved), which is why I didn’t look at doing this until now. I suppose the next thing I need to look at is reinserting the SIP stack into Android-12, but I’ll save that for when Android-11 falls out of support.