Category Archives: Linux

General Linux and Open Source related blog posts

Figuring out how ipsec transforms work in Linux

The basics of ipsec

ipsec has two “protocols”: Authentication Header (AH), which means the packet is fully authenticated (or integrity protected) by an HMAC but not encrypted; and Encapsulating Security Payload (ESP), where the packet is encrypted but not necessarily integrity protected. (ipsec was invented before AEAD ciphers, so previously you used both protocols to ensure confidentiality and integrity but, in the modern world, you can use ESP with an AEAD cipher and dispense entirely with AH.) Once you have the protocol set, you encapsulate in either transport or tunnel mode. Transport mode means that the protocol headers are simply added to an existing IP packet, so the source and destination addresses remain the same: AH adds just a header, while ESP adds a header and replaces the payload with its encrypted form. Tunnel mode means that the entire IP packet is encapsulated and a new outer source and destination address is added (this is sometimes referred to as an ipsec VPN).

Understanding ipsec flows

This diagram should help you understand how ipsec transforms work. There are two aspects to this: policy, which basically does accept/reject and tagging, and state, which does the encode/decode.

The square boxes are the firewall filters and the ellipses are the ip xfrm policy and state transforms. xfrm decode is unconditionally activated whenever an ipsec packet reaches the input flow (provided there’s a matching state rule). Output encoding only occurs if a matching output policy says it should (otherwise the packet is passed unencoded). The xfrm policy fwd has no matching encode/decode, so it’s not possible to ipsec transform forwarded packets.

ip xfrm policy

To decapsulate, there is no requirement for a policy: every ipsec packet coming into the INPUT flow will be checked for a state match and decapsulated if one is found. However, for the packet to progress further you may need a policy. For transport mode, the decapsulated packet will go back around the input loop, so a dir in policy likely isn’t required. For tunnel mode, the decapsulated packet will likely traverse the FORWARD table, so a dir fwd policy plus a firewall rule is likely required to permit the packet. Forwarded packets also hit the dir out policy.

A policy is specified with a selector, which contains a set of matches (the only mandatory part of which is the direction dir) and which may match on partial addresses; an action (block or allow, with allow being default); and a template (tmpl), which can specify the encoding (for dir out) or additional rules based on encapsulation.

Encoding Templates

Encoding only applies to dir out policies. Transport mode simply requires a statement of which encapsulation to use (proto ah or esp) and doesn’t require IP addresses in the ID section. In tunnel mode, the template must also have the source and destination outer IP addresses (the current source and destination become the inner addresses).

Every packet matching an encoding policy must also have a corresponding ip xfrm state match to specify the encapsulation parameters. Note that if a state transform is missing, the kernel will signal this on a netlink socket (which you can monitor with ip xfrm monitor). This socket is mostly used by ipsec toolkits to add state transforms just in time.

Other Policy Templates

For the in and fwd directions, the template acts as an additional filter on what packets to allow and what to block. For instance, if only decapsulated packets should be forwarded, then there should be a policy like

ip xfrm policy add dst net/netmask dir fwd tmpl proto ah mode tunnel level required

Which says the only allowed packets are those which were encapsulated in tunnel mode. Note that for this policy to be reached at all, the FORWARD table must allow the packets to pass. For most network security people, having a blanket forward permit rule is anathema, so they often achieve the same thing by applying a firewall mark in decapsulation (the output-mark option of ip xfrm state) and only allowing marked packets to pass the FORWARD chain (which dispenses with the need for a xfrm dir fwd policy). The two levers for controlling filter policies are the action (allow or block; the default is allow, which is why this statement usually doesn’t appear) and the template level (required or use). The default level is required, which means that for the allow rule to match the packet must have been decapsulated (level use means pass regardless of decapsulation status).

Security Parameters Index (SPI) and reqid

For ipsec to work, every encapsulated packet must have an SPI value. You can specify this in the state transformation. The standards (RFC2409, RFC4303, etc) specify that SPI values 1-255 are “reserved”. Additionally the standards allow SPI value 0 to be used internally, which the Linux Kernel takes advantage of. SPI is mostly used to distinguish packet streams from the same host for complex ipsec policy, and doesn’t have much use in a simple policy situation. However, you must provide a value that isn’t in the range 0-255, otherwise strange things can happen. In particular 0, the value you’ll get if you don’t specify spi, often causes the packet to get lost after decapsulation, so always specify a large number for spi. reqid is a label which is attached to an unencapsulated packet that effectively remembers what the SPI value was; it’s mostly used as a label based discriminator in policy template to state transforms.

For the purposes of the following example I’ll simply use the randomly chosen 4321 for the spi value (but you could choose anything outside the 0-255 range).
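The SPI rule above is easy to trip over when scripting these commands, so a small guard can help. This is only a minimal sketch in POSIX shell: check_spi is a hypothetical helper name, not part of iproute2, and it only checks that a decimal value lies outside the 0-255 range.

```shell
#!/bin/sh
# check_spi is a hypothetical helper, not part of iproute2.
# It succeeds only for a decimal SPI outside the 0-255 range.
check_spi() {
    case $1 in
        ''|*[!0-9]*)
            echo "spi must be a decimal number" >&2
            return 2 ;;
    esac
    if [ "$1" -le 255 ]; then
        echo "spi $1 is in the 0-255 range, pick a larger value" >&2
        return 1
    fi
    return 0
}

# Example: only report a value as usable when the check passes
check_spi 4321 && echo "spi 4321 is usable"
```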

ip xfrm state

Unlike policy, which attaches to a particular location (in, out or fwd), state is location agnostic and the same state match could theoretically be used either to encapsulate or to decapsulate. State matching also isn’t subnet based: the address matching is either exact or fully wild card (match everything). However, a state encapsulation transform rule must match on dst and a decapsulation one may match on either src or dst, but must have an exact match on one or other. The main thing the state specifies is the algorithm to encapsulate (see man ip-xfrm for a full list). Remember you also must specify spi. The only other thing you might want to specify is the sel parameter. The selector applies to the inner address of decapsulated packets and is to ensure that a mode tunnel packet is going to an address you approve of.

Simple Example: HMAC authentication between two nodes

Assume a [4321::]/64 subnet with two nodes [4321::1] and [4321::2]. To set up authenticated headers one way (from 1->2) you need a policy specifying AH (can be specific or subnet based, so this is subnet)

ip xfrm policy add dst 4321::/64 dir out tmpl proto ah mode transport

Followed by a state that’s specific to the destination (using random spi 4321 and short key 1234):

ip xfrm state add dst 4321::2 proto ah spi 4321 auth "hmac(sha1)" 1234 mode transport

If you ping from [4321::1] and run tcpdump on [4321::2] you’ll see

IP6 4321::1 > 4321::2: AH(spi=0x000010e1,seq=0x1b,icv=0x530cdd96149288da7a35fc6d): ICMP6, echo request, id 4, seq 104, length 64

But nothing will come back until you add on [4321::2]

ip xfrm state add dst 4321::2 proto ah spi 4321 auth "hmac(sha1)" 1234 mode transport

Which will cause a ping response to be seen. Note the ping packet has an authentication header, but the response is a simple icmp6 response packet (no AH) demonstrating ipsec can be set up asymmetrically.
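tcpdump prints the SPI in hexadecimal, so it is worth confirming that the 0x000010e1 in the capture above really is the 4321 given to ip xfrm state. A quick sketch (spi_hex is a hypothetical helper name):

```shell
#!/bin/sh
# spi_hex is a hypothetical helper: print a decimal SPI the way it
# appears in the tcpdump output (8 hex digits with a 0x prefix).
spi_hex() {
    printf '0x%08x\n' "$1"
}

spi_hex 4321    # prints 0x000010e1
```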

Example: Private network for Cloud Nodes

Assume we have N nodes with public IP addresses [4321::1]…[4321::N] (which could be provided by the cloud overlay or simply by virtue of the physical network the nodes are on) and we want to connect them in a private mesh network using encryption. There are two ways of doing this: the first is to encrypt all traffic between the nodes on the public network using transport mode and the second would be to set up overlay tunnels between the nodes (this latter can be used even if the public addresses aren’t on a single network segment).

Simple Transport Mode Encryption

Firstly each node needs a policy to require encryption both to and from the private network

ip xfrm policy add dst 4321::0/64 dir out tmpl proto esp
ip xfrm policy add src 4321::/64 dir in tmpl proto esp

And then, for each node, an encrypt and a decrypt state transform (the aes cipher type is taken from the key length, so I’ve chosen a 128 bit key “1234567890123456”)

ip xfrm state add dst 4321::1 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode transport
ip xfrm state add src 4321::1 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode transport
...
ip xfrm state add dst 4321::N proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode transport
ip xfrm state add src 4321::N proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode transport

This encryption scheme has one key for the entire network, but you could use 1 key per node if you wished (although this wouldn’t necessarily increase security that much). Note that what’s described above is not an overlay network because it relies on using the characteristics of the underlying network (in this case that all nodes are on an IPv6 /64 segment) to do opportunistic transport encryption. To get a true single subnet overlay on top of a disjoint network there must be some sort of tunnel. One way to get the tunnel is simply to use ipsec in tunnel mode, but another is to set up gre tunnels (or another, not necessarily trusted, network overlay which the cloud can likely provide) for the virtual overlay and then use ipsec in transport mode to ensure the packets are always encrypted.

Tunnel Mode Overlay Network

For this example, we’ll allow unencrypted packets to flow over a routed network between [4321::1] and [4322::2] (assume network device eth0 on each node) but set up an overlay network on [6666::N]/64 which is fully encrypted. Firstly, each node requires a local address addition for the [6666::N] address. So on node [4321::1] do

ip addr add 6666::1/128 dev eth0

Now add policies and state transforms in both directions (in this case a dir in policy is required otherwise the decapsulated packets won’t get sent up the input flow):

# required policy for encapsulation
ip xfrm policy add dst 6666::2 dir out tmpl src 4321::1 dst 4322::2 proto esp mode tunnel
# state transform for encapsulation
ip xfrm state add src 4321::1 dst 4322::2 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode tunnel
# policy to allow passing of decapsulated packets
ip xfrm policy add dst 6666::1 dir in tmpl proto esp mode tunnel level required
# automatic decapsulation.  sel ensures addresses after decapsulation
ip xfrm state add src 4322::2 dst 4321::1 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode tunnel sel src 6666::2 dst 6666::1

And on the other node [4322::2] do the same in reverse

ip addr add 6666::2/128 dev eth0
ip xfrm policy add dst 6666::1 dir out tmpl src 4322::2 dst 4321::1 proto esp mode tunnel
ip xfrm state add src 4322::2 dst 4321::1 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode tunnel
ip xfrm policy add dst 6666::2 dir in tmpl proto esp mode tunnel level required
ip xfrm state add src 4321::1 dst 4322::2 proto esp spi 4321 enc "cbc(aes)" 1234567890123456 mode tunnel sel src 6666::1 dst 6666::2

Note there’s no need to add any routing entries because the decapsulation is point to point (incoming decapsulated packets always end up in the input flow destined for the local [6666::N] address). With the above rules you should be able to ping from [6666::1] to [6666::2] and tcpdump should show fully encrypted packets going over the wire.

Obviously, you can add more nodes but each time you have to add rules for all the other nodes, making this an N(N-1) scaling problem. The need for a specific source and destination template in the out policy means you must have one for each connection. The in policy can be subnet based. The reason why people use ipsec toolkits is that they can add the transforms just in time for the subset of nodes you’re actually communicating with rather than having to add all the rules up front.
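To make the scaling concrete, here is a hedged dry-run sketch that prints (but does not execute) the tunnel mode rules one node would need for a list of peers, following the shape of the two node example. The function name overlay_rules, the PUBLIC=OVERLAY argument format and the third node (4323::3 with overlay address 6666::3) are all hypothetical, and the spi and key are the demonstration values from the text, not secure choices.

```shell
#!/bin/sh
# Dry run only: print the ip xfrm commands, do not execute them.
SPI=4321
KEY=1234567890123456

# usage: overlay_rules LOCAL_PUBLIC LOCAL_OVERLAY PEER_PUBLIC=PEER_OVERLAY...
overlay_rules() {
    lpub=$1
    lovl=$2
    shift 2
    # a single dir in policy on the local overlay address covers every peer
    echo "ip xfrm policy add dst $lovl dir in tmpl proto esp mode tunnel level required"
    for peer in "$@"; do
        ppub=${peer%%=*}    # part before the '='
        povl=${peer#*=}     # part after the '='
        # out policy and encapsulation state: one pair per peer
        echo "ip xfrm policy add dst $povl dir out tmpl src $lpub dst $ppub proto esp mode tunnel"
        echo "ip xfrm state add src $lpub dst $ppub proto esp spi $SPI enc \"cbc(aes)\" $KEY mode tunnel"
        # decapsulation state, with sel pinning the inner addresses
        echo "ip xfrm state add src $ppub dst $lpub proto esp spi $SPI enc \"cbc(aes)\" $KEY mode tunnel sel src $povl dst $lovl"
    done
}

# Rules for node [4321::1] talking to two peers
overlay_rules 4321::1 6666::1 4322::2=6666::2 4323::3=6666::3
```

Each node ends up with three rules per peer plus the one in policy, which is where the N(N-1) growth across the whole mesh comes from.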

Final Example: correctly keyed AH Inbound Packet Acceptance

The final example is me trying to penetrate my router firewall when on an external IPv6 connection. I have a class /60 set of IPv6 space, so each of my systems has its own IPv6 address but, as is usual, inbound packets in the NEW state are blocked. It occurred to me that I should be able to use ipsec AH (no real need for encryption since most of the protocols I use are encrypted anyway) to accept packets to an internal destination with the NEW state. This would be an asymmetric use of ipsec because inbound would have AH but return wouldn’t.

My initial thought was to use AH in transport mode, but as you can see from the diagram above that won’t work because the router merely forwards the packets and to get to a state decapsulation on the router they have to go up the INPUT flow.

The next attempt used tunnel mode, so the packet was aimed at the router and then the inner destination was the real node. The next problem was the fwd policy to permit this: the xfrm policy has no connection tracker and return packets from connections originating in the interior node also have to pass this filter. The solution to this conundrum is to install a level use policy and rely on the FORWARD table firewall rules to allow RELATED,ESTABLISHED and MARKed packets (so the decapsulation can add the MARK to pass this rule).

IPSEC on the Router

Assume my external router address is [4321::1] and my internal network is [4444::]/60. On the router, I install a catch all state transform

ip xfrm state add dst 4321::1 proto ah spi 4321 auth "hmac(sha1)" 1234 mode tunnel sel dst 4444::/60 output-mark 0x1

But I also need a policy to permit the decapsulated packets (and the state RELATED,ESTABLISHED unencapsulated ones) to pass:

ip xfrm policy add dst 4444::/60 dir fwd tmpl proto ah spi 4321 mode tunnel level use

And finally I need an addition to the firewall rules to allow packets in state NEW but with mark 0x1 to pass

ip6tables -A FORWARD -m conntrack --ctstate NEW -m mark --mark 0x1/0x1 -j ACCEPT

Which should be placed directly after the RELATED,ESTABLISHED state check.

Since there’s no encapsulation on outbound, the return packets simply pass through the firewall as normal. This means that any external entity wishing to use this AH packet acceptance simply needs a policy and state to tunnel

ip xfrm policy add dst 4444::/60 dir out tmpl proto ah dst 4321::1 mode tunnel
ip xfrm state add dst 4321::1 spi 4321 proto ah auth "hmac(sha1)" 1234 mode tunnel

And with that, any machine known by inner IPv6 address can be reached (for an IPv6 connected remote machine).

Conclusion

Hopefully this post has demystified some of the ip xfrm rules for you. I’m afraid the commands have a huge range of options, so I’ve only covered the essential ones above and there are still loads of interesting but not at all well documented ones remaining, but thanks to the examples you should have some scope now for playing with them.

Securing a Rooted Android Phone

A Brief Tour of Android Verified Boot (AVB)

We’ll actually be covering the 2.0 version of AVB, but that’s what most phones use today. The proprietary bootloader of a Pixel (the fastboot capable one you get to with adb reboot bootloader) uses a vbmeta partition to find the boot/recovery system and from there either enter recovery or boot the standard OS. vbmeta contains hashes for both this boot partition and the system partition. If your phone is unlocked the bootloader will simply boot the partitions vbmeta points to without any verification. If it is locked, the vbmeta partition must be signed by a key the phone knows. Pixel phones have two keyslots: a built in one which houses either the Google key or an OEM one and the custom slot, which is blank. In the unlocked mode, you can flash your own key into the custom slot using the fastboot flash avb_custom_key command.

The vbmeta partition also contains a boot flags region which tells the bootloader how to boot the OS. The two flags the OS knows about are in external/avb/libavb/avb_vbmeta_image.h:

/* Flags for the vbmeta image.
 *
 * AVB_VBMETA_IMAGE_FLAGS_HASHTREE_DISABLED: If this flag is set,
 * hashtree image verification will be disabled.
 *
 * AVB_VBMETA_IMAGE_FLAGS_VERIFICATION_DISABLED: If this flag is set,
 * verification will be disabled and descriptors will not be parsed.
 */
typedef enum {
  AVB_VBMETA_IMAGE_FLAGS_HASHTREE_DISABLED = (1 << 0),
  AVB_VBMETA_IMAGE_FLAGS_VERIFICATION_DISABLED = (1 << 1)
} AvbVBMetaImageFlags;

If the first flag is set then dm-verity is disabled and if the second one is set, the bootloader doesn’t pass the hash of the vbmeta partition on the kernel command line. In a standard LineageOS build, both these flags are set.

The reason for passing the vbmeta hash on the command line is so the android init process can load the vbmeta partition, hash it and verify against what the bootloader passed in, thus confirming there hasn’t been a time of check to time of use (TOCTOU) security breach. The init process cannot verify the signature for itself because the vbmeta signing public key isn’t built into the OS (which allows the OS to be signed after the images are built).

The description of the AVB_VBMETA_IMAGE_FLAGS_HASHTREE_DISABLED flag is slightly wrong. A standard android build always seems to calculate the dm-verity hash tree and insert it into the vbmeta partition (where it is verified by the vbmeta signature) it’s just that if this flag is set, the android init process won’t load the dm-verity hash tree and the system partition will thus be mutable.
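Since the two flags are single bits in one integer, a numeric flags value can be decoded with shell arithmetic. The sketch below is illustrative only: decode_vbmeta_flags is a hypothetical helper, and the messages simply restate the behaviour described above. With both flags set, as in a standard LineageOS build, the value is 3.

```shell
#!/bin/sh
# decode_vbmeta_flags is a hypothetical helper; bit 0 and bit 1 follow
# the AvbVBMetaImageFlags enum quoted above.
decode_vbmeta_flags() {
    flags=$1
    if [ $((flags & 1)) -ne 0 ]; then
        echo "HASHTREE_DISABLED set: init will not load the dm-verity hash tree"
    else
        echo "HASHTREE_DISABLED clear: dm-verity is enabled"
    fi
    if [ $((flags & 2)) -ne 0 ]; then
        echo "VERIFICATION_DISABLED set: vbmeta hash not passed on the command line"
    else
        echo "VERIFICATION_DISABLED clear: vbmeta hash is passed to init"
    fi
}

decode_vbmeta_flags 3
```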

Creating and Using your own custom Boot Key

Obviously android doesn’t use any standard format for its keys, so you have to create your own RSA 2048 (the literature implies it will work with 4096 as well but I haven’t tried it) AVB custom key using, say, openssl, then use avbtool (found in external/avb as a python script) to convert your RSA public key to a form that can be flashed in the phone:

avbtool extract_public_key --key pubkey.pem --output pkmd.bin

This can then be flashed to the unlocked phone (in the bootloader fastboot) with

fastboot flash avb_custom_key pkmd.bin

And you’re all set up to boot a custom signed OS.
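The key creation step mentioned above could look something like the following sketch. It assumes openssl is installed; make_avb_key is a hypothetical wrapper, and the file names avb.key and pubkey.pem are just the ones used elsewhere in this post.

```shell
#!/bin/sh
# make_avb_key is a hypothetical wrapper around two standard openssl
# commands; it writes avb.key and pubkey.pem into the given directory.
make_avb_key() {
    dir=${1:-.}
    # 2048 bit RSA private key: keep this secret, it signs vbmeta
    openssl genrsa -out "$dir/avb.key" 2048 || return 1
    # matching public key in PEM form
    openssl rsa -in "$dir/avb.key" -pubout -out "$dir/pubkey.pem"
}

# usage: make_avb_key /path/to/private
```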

Signing your LineageOS

There is a wrinkle to this: to re-sign the OS, you need the target-files.zip intermediate build, not the ROM install file. Unfortunately, this is pretty big (38GB for lineage-19.1) and doesn’t seem to be available for download any more. If you can find it, you can re-sign the stock LineageOS, but if not you have to build it yourself. Instructions for both building and re-signing can be found here. You need to follow this but in addition you must add two extra flags to the sign_target_files_apks command:

--avb_vbmeta_key=/path/to/private/avb.key
--avb_vbmeta_algorithm=SHA256_RSA2048

Which will ensure the vbmeta partition is signed with the key you created above.

Optionally Enabling dm-verity

If you want to enable dm-verity, you have to change the vbmeta flags to 0 (enable both hashtree and vbmeta verification) before you execute the signing command above. These flags are stored in the META/misc_info.txt file which you can extract from target-files.zip with

unzip target-files.zip META/misc_info.txt

And once done you can vi this file to find the line

avb_vbmeta_args=--flags 3 --padding_size 4096 --rollback_index 1804212800

If you update the 3 to 0 this will unset the two disable flags and allow you to do a dm-verity verified boot. Then use zip to replace this updated file

zip -u target-files.zip META/misc_info.txt

And then proceed with signing the updated target-files.zip
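If you would rather script the edit than use vi, something along these lines should work. It is only a sketch: set_vbmeta_flags is a hypothetical helper, and it assumes the avb_vbmeta_args line has the same shape as the one shown above.

```shell
#!/bin/sh
# set_vbmeta_flags is a hypothetical helper: rewrite the number after
# --flags on the avb_vbmeta_args line and leave everything else alone.
# usage: set_vbmeta_flags FILE VALUE
set_vbmeta_flags() {
    file=$1
    value=$2
    sed "s/^\(avb_vbmeta_args=.*--flags \)[0-9][0-9]*/\1$value/" "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# usage, between the unzip and zip -u steps:
#   set_vbmeta_flags META/misc_info.txt 0
```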

Wrinkle for Android-12 (lineage-19.1) and above

For all these versions, this patch ensures that if the vbmeta was signed then the vbmeta hash must be verified, otherwise the system will crash in early init, so you have no choice and must alter the avb_vbmeta_args above to either --flags 1 or --flags 0 so the vbmeta hash is passed in to init. Since you have to alter the flags anyway, you might as well enable dm-verity (set to 0) at the same time.

Re-Lock the Bootloader

Once you have installed both your custom keys and your custom signed boot image, you are ready to re-lock the bootloader. Beware that some phones will erase your data partition when you do this (the Google advice says they shouldn’t, but not all manufacturers bother to follow it), so make sure you have a backup (or are playing with a newly rooted phone).

fastboot flashing lock

Check with a reboot (the phone should now display a yellow warning triangle saying it is booting a custom OS rather than the orange unsigned OS one). If everything goes according to plan you can enter the developer settings and set “OEM Unlocking” to disabled, and your phone can no longer be unlocked without your say so.

Conclusions and Caveats

Following the above instructions, you can update your phone so it will verify images you signed with your AVB key, turn on dm-verity if you wish and generally make your phone much more secure. However, remember that you haven’t erased the original AVB key, so the phone can still be updated to an image signed with that key and, worse, the recovery partition of LineageOS is modified to allow rollback, so it will allow the flashing of any signed image without triggering an erase of the data partition. There are also a few more problems like, thanks to a bug in AOSP, the recovery version of fastboot will actually allow commands that are usually forbidden if the phone is locked.

Debugging Android Early Boot Failures

Back in my blog post about Securing the Google SIP Stack, I did say I’d look at re-enabling SIP in Android-12, so with a view to doing that I tried building and booting LineageOS 19.1, but it crashed really early in the boot sequence (after the boot splash but before the boot animation started). It turns out that information on debugging the android early boot sequence is a bit scarce, so I thought I should write a post about how I did it just in case it helps someone else who’s struggling with a similar early boot problem.

How I usually Build and Boot Android

My builds are standard LineageOS with my patches to fix SIP and not much else. However, I do replace the debug keys with my signing keys and I also have an AVB key installed in the phone’s third party keyslot with which I sign the vbmeta for boot. This actually means that my phone is effectively locked but with a user supplied key (Yellow as google puts it).

My phone is now a pixel 3 (I had to say goodbye to the old Nexus One thanks to the US 3G turn off) and I do have a slightly broken Pixel 3 I play with for experimental patches, which is where I was trying to install Android-12.

Signing Seems to be the Problem

Just to verify my phone could actually boot a stock LineageOS (it could) I had to unlock it and this led to the discovery that, once unlocked, it would also boot my custom rom, so whatever was failing in early boot seemed to be connected with the device being locked.

I also discovered an interesting bug in the recovery rom fastboot: If you’re booting locked with your own keys, it will still let you perform all the usually forbidden fastboot commands (the one I was using was set_active). It turns out to be because of a bug in AOSP which treats yellow devices as unlocked in fastboot. Somewhat handy for debugging, but not so hot for security …

And so to Debugging Early Boot

The big problem with Android is there’s no way to get the console messages for early boot. Even if you enable adb early, it doesn’t get started until quite far in to the boot animation (which was way after the crash I was tripping over). However, android does have a pstore (previously ramoops) driver that can give you access to the previously crashed boot’s kernel messages (early init, fortunately, mostly logs to the kernel message log).

Forcing init to crash on failure

Ordinarily an init failure prints a message and reboots (to the bootloader), which doesn’t excite pstore into saving the kernel message log. Fortunately there is a boot option (androidboot.init_fatal_panic) which can be set in the boot options (or kernel command line for a pixel-3 which can only boot the 4.9 kernel). If you build your own android, it’s fairly easy to add things to the android commandline (which is in boot.img) because all you need to do is extract BOOT/cmdline from the intermediate zip file you sign, add any boot options you need and place it back in the zip file (before you sign it).
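The edit to the extracted BOOT/cmdline file can be scripted roughly as follows. This is a sketch under assumptions: add_boot_option is a hypothetical helper, it treats the file as a single line of space separated options, and the =true value in the usage comment is my assumption about how the option is spelled rather than something stated above.

```shell
#!/bin/sh
# add_boot_option is a hypothetical helper: append one option to a
# single-line cmdline file unless it is already there.
# usage: add_boot_option FILE OPTION
add_boot_option() {
    file=$1
    opt=$2
    current=$(cat "$file")
    case " $current " in
        *" $opt "*) return 0 ;;    # already present, nothing to do
    esac
    if [ -z "$current" ]; then
        printf '%s\n' "$opt" > "$file"
    else
        printf '%s %s\n' "$current" "$opt" > "$file"
    fi
}

# usage, after extracting BOOT/cmdline from the intermediate zip
# (the =true value is an assumption):
#   add_boot_option BOOT/cmdline androidboot.init_fatal_panic=true
```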

Unfortunately, this expedient didn’t work (no console logs appear in pstore). I did check that init was correctly panic’ing on failure by inducing an init failure in recovery mode and observing the panic (recovery mode allows you to run adb). But this induced panic also didn’t show up in pstore, meaning there’s actually some problem with pstore and early panics.

Security is the problem (as usual)

The actual problem turned out to be security (as usual): The pixel-3 does encrypted boot panic logs. The way this seems to work (at least in my reading of the google additional pstore patches) is that the bootloader itself encrypts the pstore ram area with a key on the /data partition, which means it only becomes visible after the device is unlocked. Unfortunately, if you trigger a panic before the device is unlocked (by echoing ‘c’ to /proc/sysrq-trigger) the panic message is lost, so pstore itself is useless for debugging early boot. There seems to be some communication of the keys by the vendor proprietary ramoops binary making it very difficult to figure out how it’s being done.

Why the early panic message is lost is a bit mysterious, but unfortunately pstore on the pixel-3 has several proprietary components around the encrypted message handling that make it hard to debug. I suspect if you don’t set up the pstore encryption keys, the bootloader erases the pstore ram area instead of encrypting it, but I can’t prove that.

Although it might be possible to fix the pstore drivers to preserve the ramoops from before device unlock, the participation of the proprietary bootloader in preserving the memory doesn’t make that look like a promising avenue to explore.

Anatomy of the Pixel-3 Boot Sequence

The Pixel-3 device boots through recovery. What this means is that the initial ramdisk (from boot.img) init is what boots both the recovery and normal boot paths. The only difference is that for recovery (and fastboot), the device stays in the ramdisk and for normal boot it mounts the /system partition and pivots to it. What makes this happen or not is the boot flag androidboot.force_normal_boot=1 which is added by the bootloader. Pretty much all the binary content and init rc files in the ramdisk are for recovery and its allied menus.

The boot paths are pretty radically different, because the normal boot first pivots to a first stage before going on to a second but, in the manner of containers, it might be possible to boot recovery first, start a dmesg logger and then re-exec init through the normal path.

Forcing Re-Exec

The idea is to signal init to re-exec itself for the normal path. Of course, there have to be a few changes to do this: An item has to be added to the recovery menu to signal init and init itself has to be modified to do the re-exec on the signal (note you can’t just kick off an init with a new command line because init must be pid 1 for booting). Once this is done, there are problems with selinux (it won’t actually allow init to re-exec) and some mount moves. The selinux problem is fixable by switching it from enforcing to permissive (boot option androidboot.selinux=permissive) and the mount moves (which are forbidden if you’re running binaries from the mount points being moved) can instead become bind mounts. The whole patch becomes 31 insertions across 7 files in android_system_core.

The signal I chose was SIGUSR1, which isn’t usually used by anything in the bootloader and the addition of a menu item to recovery to send this signal to init was also another trivial patch. So finally we have a system from which I can start adb to trace the kernel log (adb shell dmesg -w) and then signal to init to re-exec. Surprisingly this worked and produced as the last message fragment:

[ 190.966881] init: [libfs_mgr]Created logical partition system_a on device /dev/block/dm-0
[ 190.967697] init: [libfs_mgr]Created logical partition vendor_a on device /dev/block/dm-1
[ 190.968367] init: [libfs_mgr]Created logical partition product_a on device /dev/block/dm-2
[ 190.969024] init: [libfs_mgr]Created logical partition system_ext_a on device /dev/block/dm-3
[ 190.969067] init: DSU not detected, proceeding with normal boot
[ 190.982957] init: [libfs_avb]Invalid hash size:
[ 190.982967] init: [libfs_avb]Failed to verify vbmeta digest
[ 190.982972] init: [libfs_avb]vbmeta digest error isn't allowed
[ 190.982980] init: Failed to open AvbHandle: No such file or directory
[ 190.982987] init: Failed to setup verity for '/system': No such file or directory
[ 190.982993] init: Failed to mount /system: No such file or directory
[ 190.983030] init: Failed to mount required partitions early …
[ 190.983483] init: InitFatalReboot: signal 6
[ 190.984849] init: #00 pc 0000000000123b38 /system/bin/init
[ 190.984857] init: #01 pc 00000000000bc9a8 /system/bin/init
[ 190.984864] init: #02 pc 000000000001595c /system/lib64/libbase.so
[ 190.984869] init: #03 pc 0000000000014f8c /system/lib64/libbase.so
[ 190.984874] init: #04 pc 00000000000e6984 /system/bin/init
[ 190.984878] init: #05 pc 00000000000aa144 /system/bin/init
[ 190.984883] init: #06 pc 00000000000487dc /system/lib64/libc.so
[ 190.984889] init: Reboot ending, jumping to kernel

Which indicates exactly where the problem is.

Fixing the problem

Once the messages are identified, the problem turns out to be in system/core ec10d3cf6 “libfs_avb: verifying vbmeta digest early”, which is inherited from AOSP and which even says in its commit message “the device will not boot if: 1. The image is signed with FLAGS_VERIFICATION_DISABLED is set 2. The device state is locked”, which is basically my boot state, so thanks for that one google. Reverting this commit can be done cleanly and now the signed image boots without a problem.

I note that I could also simply add hashtree verification to my boot, but LineageOS is based on the eng target, which has FLAGS_VERIFICATION_DISABLED built into the main build Makefile. It might be possible to change it, but not easily I’m guessing … although I might try fixing it this way at some point, since it would make my phones much more secure.

Conclusion

Debugging android early boot is still a terribly hard problem. Probably someone with more patience for disassembling proprietary binaries could take apart pixel-3 vendor ramoops and figure out if it’s possible to get a pstore oops log out of early boot (which would be the easiest way to debug problems). But failing that the simple hack to re-exec init worked enough to show me where the problem was (of course, if init had continued longer it would likely have run into other issues caused by the way I hacked it).

Converting Engines to OpenSSL-3 Providers

Engines in OpenSSL have a long history of providing new algorithms (Russian GOST hash/signature etc) but they can also be used to interface external crypto tokens (pkcs#11) or even key managers like my own TPM engine. I’ve actually been using my TPM2 engine for nearly a decade so that I no longer have to have any unprotected private keys anywhere on my laptops (including for ssh). The purpose of this post is to look at the differences between Providers and Engines and give advice on the minimum necessary Provider implementation to give back all the Engine functionality. So this post is aimed at Engine developers who wish to convert to Providers rather than giving user advice for either.

TPMs and Engines

TPM2 actually has a remarkable number of algorithms: hashing, symmetric encryption, asymmetric signatures, key derivation, etc. However, most TPMs are connected to the host over very slow busses (usually serial), which means that no-one in their right mind would use a TPM for bulk data operations (like hashing or symmetric encryption) since it will take orders of magnitude longer than if the native CPU did it. Thus from an Engine point of view, the TPM is really only good for guarding private asymmetric keys and doing sign or decrypt operations on them, which are the only capabilities the TPM engine has.

Hashes and Signatures

Although I said above we don’t use the TPM for doing hashes, the TPM2_Sign() routines insist on knowing which hash they’re signing. For ECDSA signatures, this is irrelevant, since the hash type plays no part in the signature (it’s always truncated to key length and converted to a bignum) but for RSA the ASN.1 form of the hash description is part of the toBeSigned data. The problem now is that early TPM2’s only had two hash algorithms (sha1 and sha256) and the engine wanted to be able to use larger hash sizes. The solution was actually easy: lie about the hash size for ECDSA, so always give the hash that’s the width of the key (sha256 for NIST P-256 and sha384 for NIST P-384) and left truncate the passed in hash if larger or left zero pad if smaller.
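As a rough illustration of the truncate-or-pad trick, here is a minimal sketch. The function name fit_digest is hypothetical (it is not taken from the engine source), and it assumes that “left truncate” means keeping the leading bytes (as ECDSA itself does) and that “left zero pad” means prepending zero bytes so the bignum value is unchanged:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: fit a digest to the byte width the key expects
 * (32 for NIST P-256, 48 for NIST P-384).
 * Assumption: a longer digest keeps only its leading bytes; a shorter
 * one gets zero bytes prepended.  out must have room for width bytes.
 */
static void fit_digest(const unsigned char *in, size_t inlen,
		       unsigned char *out, size_t width)
{
	if (inlen >= width) {
		/* too long (or exact): keep the leading width bytes */
		memcpy(out, in, width);
	} else {
		/* too short: zero fill the front, digest goes at the end */
		size_t pad = width - inlen;

		memset(out, 0, pad);
		memcpy(out + pad, in, inlen);
	}
}
```

With a P-256 key, a sha384 digest would be cut to its first 32 bytes and a sha1 digest would gain 12 leading zero bytes before being handed to the TPM as if it were sha256.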

For RSA, the problem is more acute, since TPM2_Sign() actually takes a raw digest and adds the hash description but the engine code sends down the fully described hash which merely needs to be padded if PKCS1 (PSS data is fully padded when sent down) and encrypted with the private key. The solution to this taken years ago was not to bother with TPM2_Sign() at all for RSA keys but instead to do a Decrypt operation¹. This also means that TPM RSA engine keys are marked as decryption keys, not signing keys.
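To make the PKCS1 case concrete, the padding step that has to happen before the Decrypt operation can be sketched as below. This is the standard PKCS#1 v1.5 signature (block type 1) layout; the function name pkcs1_type1_pad is hypothetical and the sketch is not lifted from the engine code:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: build the PKCS#1 v1.5 signature block
 *   0x00 || 0x01 || 0xFF ... 0xFF || 0x00 || T
 * where T is the already fully described hash (DigestInfo) and emlen
 * is the RSA modulus size in bytes.  At least 8 bytes of 0xFF are
 * required.  Returns 1 on success, 0 if T does not fit.
 */
static int pkcs1_type1_pad(const unsigned char *t, size_t tlen,
			   unsigned char *em, size_t emlen)
{
	size_t pslen;

	if (emlen < 11 || tlen > emlen - 11)
		return 0;
	pslen = emlen - tlen - 3;
	em[0] = 0x00;
	em[1] = 0x01;
	memset(em + 2, 0xff, pslen);
	em[2 + pslen] = 0x00;
	memcpy(em + 3 + pslen, t, tlen);
	return 1;
}
```

The resulting block is exactly modulus sized, so it can be passed to a raw private key operation without any further padding.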

The Engine Itself

Given that the TPM is really only guarding the private keys, it only makes sense to substitute engine functions for the private key operations. Although the TPM can do public key operations, the core OpenSSL routines do them much faster and no information is leaked about the private key by doing them through OpenSSL, so Engine keys were constructed from standard OpenSSL keys by substituting a couple of private key methods from the underlying key types. One thing Engines were really bad at was passing additional parameters at key creation time and doing key wrapping. The result is that most Engines already have a separate tool to create engine keys (create_tpm2_key for the TPM2 engine) because complex arguments are needed for TPM specific things like key policy.

TPM keys are really both public and private keys combined and the public part of the key can be accessed without a password (unlike OpenSSL keys) or even access to the TPM that created the key. However, the engine code doesn’t usually know when only the public part of the key will be required and password prompting is done in OpenSSL at key loading (the TPM doesn’t need a password until key use), so usually after a TPM key is created, the public key is also separately derived using a pkey operation and used as a normal public key.

The final, and most problematic Engine feature, is key loading. Engine keys must be loaded using a special API (ENGINE_load_private_key). OpenSSL built in applications require you to specify the key type (-keyform option) but most well written OpenSSL applications simply try loading the PEM key first, then the DER key then the Engine key (since they all have different APIs), but frequently the Engine key is forgotten leading to the application having to be patched if you want to use them with any engine.

Converting Engines to Providers

The provider API has several pieces which apply to asymmetric key handling: Store, Encode/Decode, Key Management, Signing and Decryption (plus many more if you provide hashes or symmetric algorithms). One thing to remember about the store API is that if you only have file based keys, you should use the generic file store instead. Implementing your own store is only necessary if you also have a URI based input (like PKCS#11). In fact the TPM Engine has a URI for persistent keys, so the TPM store implementation will be dealt with later.

Provider Basics

If a provider is specified on the OpenSSL command line, it will become the sole provider of every algorithm. That means that providers like the TPM2 one, which only fill in a subset of functions cannot operate on their own and must always be used with another provider (usually the default one). After initialization (see below) all provider actions are governed by algorithm tables. One of the key questions for any provider is what to do about algorithm names and properties. Because the TPM2 provider relies on external providers for other algorithms, it must use consistent key names (so “EC” for Elliptic curve and “RSA” for RSA), even though it has only a single key type. There are also elements of the provider key management, like the way Elliptic Curve keys change name to “ECDSA” for signing and “ECDH” for derivation, which is driven by the key management query operation function. As far as I can tell, this provides no benefit and merely serves to add complexity to the provider, so my provider doesn’t implement these functions and uses the same key names throughout.

The most mysterious string of all is the algorithm property one. The manual gives very little clue as to what should be in it besides “provider=<provider name>”. Empirically it seems to have input, output and structure elements, which are primarily used by encoders and decoders: input can be either der or pem and structure must be the same as the OSSL_OBJECT_PARAM_DATA_STRUCTURE string produced by the der decoder (although you are free to choose any name for this). output is even more varied and the best current list is provided by the source; however the only encoder the TPM2 provider actually provides is the text one.

One of the really nice things about providers is that when OpenSSL is presented with a key to load, every provider will be tried (usually in the order they’re specified on the command line) to decode and load the key. This completely fixes the problem with missing ENGINE_load_private_key() functions in applications because now all applications can use any provider key. This benefit alone is enough to outweigh all the problems of doing the actual conversion to a provider.

Replacing Engine Controls

Engine controls were key/value pairs passed into engines. The TPM2 engine has two: “PIN” for the parent authority and “NVPREFIX” for the prefix which identifies a non-volatile key. Although these can be passed in with the ENGINE_ctrl() functions, they were mostly set in the configuration file. This latter mechanism can be replaced with the provider base callback core_get_params(). Most engine controls actually set global variables and with the provider, they could be placed into the provider context. However, for code sharing it’s easier simply to keep the current globals mechanism.

Initialization and Contexts

Every provider has to have an OSSL_provider_init() routine which fills in a dispatch table and allocates a core context, which is passed in to every other context routine. For a provider, there’s really only one instance, so storing variables in the provider context is really no different (except error handling and actually getting destructors) from using static variables and since the engine used static variables, that’s what we’ll stick with. However, pretty much every routine will need an allocated library context, so it’s easiest to allocate at provider init time and pass it through as the provider context. The dispatch routine must contain a query_operation function, and probably needs a teardown function if you need to use a destructor, but nothing else.

All provider function groups require a newctx() and freectx() call. This is not optional because the current OpenSSL code calls them without checking so they cannot be NULL. Thus for function groups (like encoders and key management) where new contexts aren’t really required it makes sense to use pass through context functions that simply pass through the provider context for newctx() and do nothing for freectx().
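A minimal sketch of such pass through functions might look like the following. The names match the ones used in the dispatch tables in this post, but the bodies are an assumption about the simplest thing that satisfies the newctx/freectx calling convention (a void * provider context in, a void * context out):

```c
#include <stddef.h>

/* Hand the provider context straight back as the operation context */
static void *tpm2_passthrough_newctx(void *provctx)
{
	return provctx;
}

/* Nothing was allocated in newctx, so there is nothing to free */
static void tpm2_passthrough_freectx(void *ctx)
{
	(void)ctx;
}
```

Because newctx returns the provider context itself, every other function in the group receives the provider context (and therefore the library context) as its first argument for free.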

The man page implies it is necessary to pick a load of functions from the in argument, but it seems unnecessary for those which the OpenSSL library already provides. I assume it’s something to do with a provider not requiring OpenSSL symbols, but it’s impossible to implement a provider today without relying on other OpenSSL functions than those which can be picked out of the in argument.

Decoders

Decoders are used to convert a read file from PEM to DER (this is essentially the same conversion for every provider, so it is strange you have to do this rather than it being done in the core routines) and then DER to an internal key structure. The remaining decoders take DER in and output a labelled key structure (which is used as a component of the EVP_PKEY). If you do both RSA and EC keys, you need one for each key type and, unfortunately, they must be provided and may not cross decode (the RSA decoder must reject EC keys and vice versa). This is actually required so the OpenSSL core can tell what type of key it has but is a royal pain for things like the TPM where the key DER is identical regardless of key type:

const OSSL_ALGORITHM decoders[] = {
	{ "DER", "provider=tpm2,input=pem", decode_pem_fns },
	{ "RSA", "provider=tpm2,input=der,structure=TPM2", decode_rsa_fns },
	{ "EC", "provider=tpm2,input=der,structure=TPM2", decode_ec_fns },
	{ NULL, NULL, NULL }
};

The decode_pem_fns can be cut and pasted from any provider with the sole exception that you probably have a different PEM guard string that you need to check for.

Then a sample decoder function set looks like:

static const OSSL_DISPATCH decode_rsa_fns[] = {
	{ OSSL_FUNC_DECODER_NEWCTX, (void (*)(void))tpm2_passthrough_newctx },
	{ OSSL_FUNC_DECODER_FREECTX, (void (*)(void))tpm2_passthrough_freectx },
	{ OSSL_FUNC_DECODER_DECODE, (void (*)(void))tpm2_rsa_decode },
	{ 0, NULL }
};

The main job of the DECODER_DECODE function is to take the DER form of the key and convert it to an internal PKEY and send that PKEY up by reference so it can be consumed by a key management load.

Encoders

By and large, engines all come with creation tools for key files, which means that while you could now use the encoder routines to create key files, it’s probably better off to stick with what you have (especially for things like the TPM that can have complex policy statements attached to keys), so you can omit providing any encoder functions at all. The only possible exception is if you want the keys pretty printing, you might consider a text output encoder:

const OSSL_ALGORITHM encoders[] = {
	{ "RSA", "provider=tpm2,output=text", encode_text_fns },
	{ "EC", "provider=tpm2,output=text", encode_text_fns },
	{ NULL, NULL, NULL }
};

Which largely follows the format for decoders:

static const OSSL_DISPATCH encode_text_fns[] = {
	{ OSSL_FUNC_ENCODER_NEWCTX, (void (*)(void))tpm2_passthrough_newctx },
	{ OSSL_FUNC_ENCODER_FREECTX, (void (*)(void))tpm2_passthrough_freectx },
	{ OSSL_FUNC_ENCODER_ENCODE, (void (*)(void))tpm2_encode_text },
	{ 0, NULL }
};

Note: there are many more encode/decode function types you could supply, but the above are the essential ones.

Key Management

Nothing in the key management functions requires the underlying key object to be reference counted since it belongs to an already reference counted EVP_PKEY structure in the OpenSSL generic routines. However, the signature operations can’t be implemented without context duplication and the signature context must contain a reference to the provider key so, depending on how the engine implements keys, duplicating via reference might be easier than duplicating via copy. The minimum functionality to implement is LOAD, FREE and HAS. If you are doing Elliptic Curve derive or reference counting your engine keys, you will also need NEW. You also have to provide both GET_PARAMS and GETTABLE_PARAMS (many key management functions have to implement pairs like this) for at least the BITS, SECURITY_BITS and SIZE properties².

You must also implement the EXPORT (and EXPORT_TYPES, which must be provided but has no callers) so that you can convert your engine key to an external public key. Note the EXPORT function must fail if asked to export the private key otherwise the default provider will try to do the private key operations via the exported key as well.

If you need to do Elliptic Curve key derivation you must also implement IMPORT (and IMPORT_TYPES) because the creation of the peer key (even though it’s a public one) will necessarily go through your provider key management functions.

The HAS function can be problematic because OpenSSL doesn’t assume the interchangeability of public and private keys, even if it is true of the engine. Thus the engine must remember in the decode routines what key selector was used (public, private or both) and make sure to condition HAS on that value.
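The two selection checks described above (EXPORT refusing private material and HAS honouring the selector remembered at decode time) can be sketched as follows. Everything here is hypothetical: the struct, the function names and the SEL_* flags are stand-ins, and in a real provider the flags would be the OSSL_KEYMGMT_SELECT_* values and the functions would have the full key management signatures:

```c
#include <stddef.h>

/* Stand-ins for the key management selection bits (assumed names) */
#define SEL_PRIVATE	0x01
#define SEL_PUBLIC	0x02

/* Hypothetical provider key: remembers what the decoder was asked for */
struct demo_key {
	int decoded_selection;
};

/* HAS: succeed only if every requested key part was decoded */
static int demo_key_has(const struct demo_key *key, int selection)
{
	int wanted = selection & (SEL_PRIVATE | SEL_PUBLIC);

	if (key == NULL)
		return 0;
	return (wanted & ~key->decoded_selection) == 0;
}

/* EXPORT gate: refuse any request that includes the private key */
static int demo_key_export_ok(int selection)
{
	return (selection & SEL_PRIVATE) == 0;
}
```

A key decoded with only the public selector then answers HAS for the public part but not the private one, and an export request that includes the private part fails before any parameters are built.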

Signatures

This is one of the most confusing areas for simple signing devices (which don’t do hashing) because you’d assume you can implement NEWCTX, FREECTX, SIGN_INIT and SIGN and be done. Unfortunately, in spite of the fact that all the DIGEST_SIGN_… functions can be implemented in terms of the previous functions and generic hashing, they aren’t, so all providers are required to duplicate hashing and signing functions including constructing the binary ASN.1 for the certificate signature function (via GET_CTX_PARAMS and its pair GETTABLE_CTX_PARAMS). Another issue a sign only token will get into is padding: OpenSSL supports a variety of padding schemes (for RSA) but is deprecating their export, so if your token doesn’t do an expected form of padding, you’ll need to implement that in your provider as well. Recalling that the TPM2 provider uses RSA Decryption for signatures means that the TPM2 provider implementation is entirely responsible for padding all signatures. In order to try to come up with a common solution, I added an opensslmissing directory to my provider under the MIT licence that anyone is free to incorporate into their provider if they end up having the same digest and padding problems I did.

Decryption and Derivation

The final thing a private key provider needs to do is decryption. This is a very different operation between Elliptic Curve and RSA keys, so you need a different operation for each (OSSL_OP_ASYM_CIPHER for RSA and OSSL_OP_KEYEXCH for EC). Each ends up being a slightly special snowflake: RSA because it may need OAEP padding (which the TPM does) but with the most usual hash being md5, which the TPM doesn’t do (so OAEP padding with arbitrary mask and hash function is also in opensslmissing); and EC because it requires derivation from another public key. The problem with this latter operation is that because of the way OpenSSL works, the public key must be imported into the provider before it can be used, so you must provide NEW, IMPORT and IMPORT_TYPES routines for key management for this to happen.

Store

The store functions only need to be used if you have to load keys that aren’t file based (for file based keys the default provider file store will load them). For the TPM there are a set of NV Keys with 0x81 MSO prefix that aren’t file based. We load these in the engine with //nvkey:<hex> as the designator (and the //nvkey: prefix is overridable in the config file). To get this to work in the Provider is slightly problematic because the scheme (the //nvkey: prefix) must be specified as the provider algorithm_name which is usually a constant in a static array. This means that the stores actually can’t be static and must have the configuration defined name poked into them before the store is used, but this is relatively easy to arrange in the OSSL_provider_init() function. Once this is done, it’s relatively easy to create a store. The only really problematic function is the STORE_EOF one, which is designed around files but means you have to keep an eof indicator in the context and update it to be 1 once the load function has completed.
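The eof bookkeeping can be sketched in a few lines. The struct and function names here are hypothetical and the signatures are simplified: the real load function also takes object and passphrase callbacks, which are left out to keep the focus on the indicator:

```c
#include <stddef.h>

/* Hypothetical store context: one NV key per URI, so one load only */
struct demo_store_ctx {
	int eof;	/* 0 until the single key has been loaded */
};

/* Simplified load: a real one would pass the key to the object callback */
static int demo_store_load(struct demo_store_ctx *ctx)
{
	if (ctx->eof)
		return 0;	/* nothing left to load */
	/* ... look up the NV key and pass it up here ... */
	ctx->eof = 1;
	return 1;
}

/* The EOF function simply reports the indicator kept in the context */
static int demo_store_eof(struct demo_store_ctx *ctx)
{
	return ctx->eof;
}
```

Since a non-file store typically yields exactly one object, the indicator flips after the first successful load and the core stops asking for more.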

The Provider Recursion Problem

This doesn’t seem to be discussed anywhere else, but it can become a huge issue if your provider depends on another library which also uses OpenSSL. The TPM2 provider depends on either the Intel or IBM TSS libraries and both of those use OpenSSL for cryptographic operations around TPM transport security since both of them use ECDH to derive a seed for session encryption and HMAC. The problem is that ordinarily the providers are called in the order they’re listed, so you always have to specify -provider default -provider tpm2 to make up for the missing public key operations in the TPM2 provider. However, the OpenSSL core operates a cache for the provider operations it has previously found and searches the cache first before doing any other lookups, so if the EC key management routines are cached (as they are if you input a TPM format key) and the default ones aren’t (because inputting TPM format keys requires no public key operations), the next attempt to generate an ephemeral EC key for the ECDH security derivation will find the TPM2 provider first. So say you are doing a signature which requires HMAC security to guard against interposer tampering. The use of ECDH in the HMAC seed derivation will then call back into the provider to do an ECDH operation which also requires session security and will thus call back again into the provider ad infinitum (or at least until stack overflow). The only way to break out of this infinite recursion is to try to prime the cache with the default provider as well as the TPM2 provider, so the tss library functions can find the default provider first. The (absolutely dirty) hack I have to do this is inside the pkey decode function as

	if (alg == TPM_ALG_ECC) {
		EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new_id(EVP_PKEY_EC, NULL);
		EVP_PKEY_CTX_free(ctx);
	}

Which currently works to break the recursion loop. However it is an unreliable hack because internally the OpenSSL hash bucket implementation orders the method cache by provider address and since the TPM2 provider is dynamically loaded it has a higher address than the OpenSSL default one. However, this will not survive security techniques like Address Space Layout Randomization.

Conclusions

Hopefully I’ve given a rapid (and possibly useful) overview of converting an engine to a provider which will give some pointers about provider conversion to all the engine token implementations out there. Please feel free to repurpose my opensslmissing routines under the MIT licence without any obligations to get them back upstream (although I would be interested in hearing about bugs and feature enhancements). In the end, it was only 1152 lines of C to implement the TPM2 provider (additive on top of the common shared code base with the existing Engine) and 681 lines in opensslmissing, showing firstly that there is still a need for OpenSSL itself to do the missing routines as a provider export and secondly that it really takes a fairly small amount of provider code to wrap an existing engine implementation provided you’re discriminating about what functions you actually provide. As a final remark I should note that the openssl_tpm2_engine has a fairly extensive test suite, all of which now passes with the provider implementation as well.

Using SIP to Replace Mobile and Land Lines

If you read more than a few articles in my blog you’ve probably figured out that I’m pretty much a public cloud Luddite: I run my own cloud (including my own email server) and don’t really have much of my data in any public cloud. I still have public cloud logins: everyone wants to share documents with Google nowadays, but Google regards people who don’t use its services “properly” with extreme prejudice and I get my account flagged with a security alert quite often when I try to log in.

However, this isn’t about my public cloud phobia, it’s about the evolution of a single one of my services: a cloud based PBX. It will probably come as no surprise that the PBX I run is Asterisk on Linux but it may be a surprise that I’ve been running it since the early days (since 1999 to be exact). This is the story of why.

I should also add that the motivation for this article is that I’m unable to get a Discord account: Discord apparently has a verification system that requires a phone number and explicitly excludes any VoIP system, which is all I have nowadays. This got me to thinking that my choices must be pretty unusual if they’re so pejoratively excluded by a company whose mission is to “Create Space for Everyone to find Belonging”. I’m sure the suspicion that this is because Discord the company also offers VoIP services and doesn’t like the competition is unworthy.

Early Days

I’ve pretty much worked remotely in the US all my career. In the 90s this meant having three phone lines (These were actually physical lines into the house): one for the family, one for work and one for the modem. When DSL finally became a thing and we were running a business, the modem was replaced by a fax machine. The minor annoyance was knowing which line was occupied but if line 1 is the house and line 2 the office, it’s not hard. The big change was unbundling. This meant initially the call costs to the UK through the line provider skyrocketed and US out of state rates followed. The way around this was to use unbundled providers via dial-around (a prefix number), but finding the cheapest was hard and the rates changed almost monthly. I needed a system that could add the current dial-around prefix for the relevant provider automatically. The solution: asterisk running on a server in the basement with two digium FX cards for the POTS lines (fax facility now being handled by asterisk) and Aastra 9113i SIP phones wired over the house ethernet with PoE injectors. Some fun jiggery pokery with asterisk busy lamp feature allowed the lights on the SIP phone to indicate busy lines and extensions.conf could be programmed to keep the correct dial-around prefix. For a bonus, asterisk can be programmed to do call screening, so now if the phone system doesn’t recognize your number you get told we don’t accept solicitation calls and to hang up now, otherwise press 0 to ring the house phone … and we’ve had peaceful dinner times ever after. It was also somewhat useful to have each phone in the house on its own PBX extension so people could call from the living room to my office without having to yell.

Enter SIP Trunking

While dial-arounds worked successfully for a few years, they always ended with problems (usually signalled by a massive phone bill) and a new dial-around was needed. However by 2007 several companies were offering SIP trunking over the internet. The one I chose (Localphone, a UK based company) was actually a successful ring back provider before moving into SIP. They offered a pay as you go service with phone termination in whatever country you were calling. The UK and US rates were really good, so suddenly the phone bills went down and as a bonus they gave me a free UK incoming number (called a DID – Direct Inward Dialing) which family and friends in the UK could call us on at local UK rates. Pretty much every call apart from local ones was now being routed over the internet, although most incoming calls, apart from those from the UK, were still over the POTS lines.

The beginning of Mobile (For Me)

I was never really a big consumer of mobile phones, but that all changed in 2009 when google presented all kernel developers with a Nexus One. Of course, they didn’t give us SIM cards to go with it, so my initial experiments were all over wifi. I soon had CyanogenMod installed and found a SIP client called Sipdroid. This allowed me to install my Nexus One as a SIP extension on the house network. SIP calls over 2G data were not very usable (the bandwidth was too low), but implementing multiple codecs and speex support got it to at least work (and it actually made me an android developer … scratching my own itch again). The bandwidth problems on 2G evaporated on 3G and SIP became really usable (although I didn’t have a mobile “plan”, I did use pay-as-you-go SIMs while travelling). It already struck me that all you really needed the mobile network for was data and then all calls could simply travel to a SIP provider. When LTE came along it seemed to be confirming this view because IP became the main communication layer.

I suppose I should add that I used the Nexus One long beyond its design life, even updating its protocol stack so it kept working. I did this partly because it annoyed certain people to see me with an old phone (I have a set of friends who were very amused by this and kept me supplied with a stock of Nexus One phones in case my old one broke) but mostly because of inertia and liking small phones.

SIP Becomes My Only Phone Service

In 2012, thanks to a work assignment, we relocated from the US to London. Since these moves take a while, I relocated the in-house PBX machine to a dedicated server in Los Angeles (my nascent private cloud), ditched the POTS connections and used the UK incoming number as our primary line that could be delivered to us while we were in temporary accommodation as well as when we were in our final residence in London. This did have the somewhat inefficient result that when you called from the downstairs living room to the upstairs office, the call was routed over an 8,000 mile round trip from London to Los Angeles and back, but thanks to internet latency improvements, you couldn’t really tell. The other problem was that the area code I’d chosen back in 2007 was in Whitby, some 200 Miles north of London but fortunately this didn’t seem to be much of an issue except for London Pizza delivery places who steadfastly refused to believe we lived locally.

When the time came in 2013 to move back to Seattle in the USA, the adjustment was simply made by purchasing a 206 area code DID, plugging it into the asterisk system and continuing to use a fully VoIP system based in Los Angeles. Although I got my incoming UK number for free, being an early service consumer, renting DIDs now costs around $1 per month depending on your provider.

SIP and the Home Office

I’ve worked remotely all my career (even when in London). However, I’ve usually worked for a company with a physical office setup and that means a phone system. Most corporate PBX’s use SIP under the covers or offer a SIP connector. So, by dint of finding the PBX administrator I’ve usually managed to get a SIP extension that will simply plug into my asterisk PBX. Using correct dial plan routing (and a prefix for outbound calling), the office number usually routes to my mobile and desk phone, meaning I can make and receive calls from my office number wherever in the world I happen to be. For those who want to try this at home, the trick is to find the phone system administrator; if you just ask the IT department, chances are you’ll simply get a blanket “no” because they don’t understand it might be easy to do and definitely don’t want to find out.

Evolution to Fully SIP (Data Only) Mobile

Although I said above that I maintained a range of in-country Mobile SIMs, this became less true as the difficulty in running in-country SIMs increased (most started to insist you add cash or use them fairly regularly). When COVID hit in 2020, and I had no ability to travel, my list of in-country SIMs was reduced to one from 3 UK largely because they allowed you to keep your number provided you maintained a balance (and they had a nice internet roaming agreement which meant you paid UK data rates in a nice range of countries). The big problem giving up a mobile number was no text messaging when travelling (SMS). For years I’ve been running an xmpp server, but the subset of my friends who have xmpp accounts has always been under 10% so it wasn’t very practical (actually, this is somewhat untrue because I wrote an xmpp to google chat bridge but the interface became very impedance mismatched as Google moved to rich media).

The major events that got me to move away from xmpp and the Nexus One were the shutdown of the 3G network in the US and the viability of the Matrix federated chat service (the Matrix android client relied on too many modern APIs ever to be backported to the version of android that ran on the Nexus One). Of the available LTE phones, I chose the Pixel-3 as the smallest and most open one with the best price/performance (and rapidly became acquainted with the fact that only some of them can actually be rooted) and LineageOS 17.1 (Android 10). The integration of SIP with the Dialer is great (I can now use SIP on the car’s bluetooth, yay!) but I rapidly ran into severe bugs in the Google SIP implementation (which hasn’t been updated for years). I managed to find and fix all the bugs (or at least those that affected me most, repositories here; all beginning with android_ and having the jejb-10 branch) but that does now mean I’m stuck on Android 10 since Google ripped SIP out in Android 12.

For messaging I adopted matrix (Apart from the Plumbers Matrix problem, I haven’t really written about it since matrix on debian testing just works out of the box) and set up bridges to Signal, Google Chat, Slack and WhatsApp (The WhatsApp one requires you be running WhatsApp on your phone, but I run mine on an Android VM in my cloud) all using the 3 UK Sim number where they require a mobile number confirmation. The final thing I did was to get a universal roaming data SIM and put it in my phone, meaning I now rely on matrix for messaging and SIP for voice when I travel because the data SIM has no working mobile number at all (either for voice or SMS). In many ways, this is no hardship: I never really had a permanent SMS number when travelling because of the use of in-country SIMs, so no-one has a number for me they rely on for SMS.

Conclusion and Problems

Although I implied above I can’t receive SMS, that’s not quite true: one of my VOIP numbers does accept SMS inbound and is able to send outbound, the problem is that it doesn’t come over the SIP MESSAGE protocol, but instead goes to a web page in the provider backend, making it inconvenient to use and meaning I have to know the message is coming (although I do use it for things like Delta Boarding passes, which only send the location of the web page to receive pkpasses over SMS). However, this isn’t usually a problem because most people I know have moved on from SMS to rich messaging over one of the protocols I have (and if one came along with a new protocol, well I can install a bridge for that).

In terms of SIP over an IP substrate giving rise to unbundled services, I could claim to be half right, since most of the modern phone like services have a SIP signalling core. However, the unbundling never really came about: the silo provider just moved from landline to mobile (or a mobile resale service like Google Fi). Indeed, today, if you give anyone your US phone number they invariably assume it is a mobile (and then wonder why you don’t reply to their SMS messages). This mobile assumption problem can be worked around by emphasizing “it’s a landline” every time you give out your VOIP number, but people don’t always retain the information.

So what about the future? I definitely still like the way my phone system works … having a single number for the house which any household member can answer from anywhere and side numbers for travelling really suits me, and I have the technical skills to maintain it indefinitely (provided the SIP trunking providers still exist), but I can see the day coming where the Discord intolerance of non-siloed numbers is going to spread and most silos will require non-VoIP phone numbers with the same prejudice, thus locking people who don’t comply out in much the same way as it’s happening with email now; however, hopefully that day for VoIP is somewhat further off.

Owning Your Own Copyrights in Open Source

Why Should you Own your Own Source code?

In the early days of open source, everything was a hobby project and everyone owned their own contributions. Owning your own contribution was a sort of mark of franchise in the project. Of course, there were some projects, notably the FSF ones, which didn’t believe in distributed ownership and insisted you contribute ownership of your copyrights to them so they could look after the project for you. Obviously, since I’m a Linux Kernel developer and with the Linux Kernel being a huge distributed copyright project, it’s easy to see which side of the argument I fall on.

The main rights you give up if you don’t own the code you create are the right to re-licence and the right to enforce. It probably hadn’t occurred to you that if you actually find a licence violation in a project you contribute to for your employer, you’ll have no standing to demand that the problem get addressed. In fact, any enforcement on the code would have to be done by the proper owner: your employer. Plus your employer can control the ultimate destination of that ownership, including selling your code to a copyright troll if they so wished … while you may trust your employer now you work for them, do you trust them to do the right thing for all time, especially since they may be bought out by EvilCorp on down the road?

The relicensing problem can also be thorny: as a strong open source contributor you’ve likely been on the receiving end of requests to relicense (“I really like the code in your project X and would like to incorporate it in my open source project Y, but there’s a licence compatibility problem, would you dual license it?”) and thought nothing about saying “yes”. However, if your employer owns the code, you were likely lying when you said “yes” because you have no relicensing rights and you must ask your employer for permission to do the relicensing.

All the above points up the dangers in the current ecosystem. Project contributors often behave like they own the code but if they don’t they can be leaving a legal minefield in their wakes. The way to fix this is to own your own code … or at least understand the limitations of your rights if you don’t.

Open Source in Your Own Time

It’s a mistake to think that just because you work on something in your own time it isn’t actually owned by your employer. Historically, at least in the US, employment agreements contain incredibly broad provisions for invention ownership which basically try to claim anything you invent at any hour of the day or night that might be even vaguely related to your employment. Not unnaturally this caused huge volumes of litigation around startups where former employees successfully develop innovations their prior employer declined to pursue (at least until it started making money). This has led to a slew of state based legal safe harbour protections for employee inventions. Most of them, like the Illinois Statute I first used, have similar wording:

A provision in an employment agreement which provides that an employee shall assign or offer to assign any of the employee’s rights in an invention to the employer does not apply to an invention for which no equipment, supplies, facilities, or trade secret information of the employer was used and which was developed entirely on the employee’s own time … is … void and unenforceable.
765 ILCS 1060/2

In fact most states now require the wording to appear in the employment contract, so you likely don’t have to look up the statute to figure out what to do. The biggest requirements are that it be on your own time and you not be using any employer equipment, so the most important thing is to make sure you have your own laptop or computer. If you follow the requirements to the letter, you should be safe enough in owning your own time open source code. However, if you really want a guarantee you need to take extra precautions.

Own Time Open Source Carve Outs in Employment agreements

When you join a company, one of the things you’ll sign is a prior invention disclosure form, usually as an appendix to the invention assignment agreement as part of your employment contract. Here’s an example one from the SEC database (ironically for a Chinese subsidiary). Look particularly at section 2(a) “Inventions Retained and Licensed”. It’s basically pure CYA for the company, and most people leave Exhibit A blank, but you shouldn’t do that. What you should do is list all your current and future (by doing sweeping guesswork) own open source projects. The most useful clause in 2(a) says “I agree that I will not incorporate any Prior Inventions into any products …” so you and your employer have now agreed that all the listed projects are outside the scope of your employment agreement.

As far as I can tell, no-one really looks at Exhibit A at all, so I’ve been really general and put things like “The Linux Kernel”, “Open Source UEFI software” and “Open Source cryptography such as gnupg, openssl and gnutls” and never been challenged on it.

One legitimate question, which will probably happen if your carve outs are very broad, is what happens if your employer specifically asks you to work on a project you’ve declared in Exhibit A? Ideally you could use this as an opportunity to negotiate an addendum to your contract covering your ownership of open source. However, if you don’t want to rock the boat, you can simply do nothing and rely on the fact that the agreement has something to say about this. The sample section 2(a) above goes on to give your employer a non-exclusive licence, which you could take as agreement to your continued ownership of the copyrights in the code, even though your employer is now instructing you (and paying you) to work on it. However, the say nothing approach has never been tested in court and may be vulnerable to challenge, so a safer course is to send your manager an email pointing out the issue and proposing to follow the licence in the employment contract. If they do nothing, thinking the matter settled, as most managers do, then you have legal cover for continuing to own your own copyrights. You can make it as vague as you like, so using the above sample agreement, something like “You’ve asked me to work on Project X which was listed in Exhibit A of my employment agreement. To move forward, I’m happy to licence all future works on this project to you under the terms of section 2(a)”. It looks innocuous, but it’s actually a statement that your company doesn’t get copyright ownership because the actual wording in section 2(a) says the company gets a non-exclusive licence if you incorporate any works listed in Exhibit A. Remember to save the email somewhere safe (and any reply which is additional proof it was seen) just in case.

Owning Open Source Produced on Company Time

The first thing to note is that if your employer pays for you to work on open source, absent any side agreement, the code that you produce will be owned by your employer. This isn’t some US specific thing, this is a general principle of employment the world over (they pay you, so they own it). So even if you work in Europe, your employer will still own your open source copyrights if they pay you to work on the project, moral rights arguments notwithstanding. The only way to change this is to get some sort of explicit or implicit (if you want to go the carve out route above) agreement about the ownership.

Although I’ve negotiated both joint and exclusive ownership of open source via employment agreements, the actual agreements are still the property of the relevant corporations and thus, unfortunately, while I can describe some of the elements, I can’t publish the text (employment agreements are the crown jewels the HR dragons guard).

How to Negotiate

Most employers (or at least their lawyers) will refuse point blank to change the wording of employment agreements. However, what you want can be a side agreement and usually doesn’t require rewording the employment agreement at all. All you need is the understanding that the side agreement will get executed. One big problem can be that most negotiations over employment agreements occur with people from HR, which is a department with the least understanding of open source, so you don’t want to be negotiating the side agreement with them, you want to talk to the person that is hiring you. You also need to present your request as reasonable, so find out if anyone inside your prospective new company has done something similar. Often they have, and they’ll likely be someone in open source you’ve at least heard of so you can approach them and ask for details. “But you gave a copyright ownership side agreement to X” is often a great way to advance your cause. Don’t be afraid to ask and argue politely but firmly … hiring talented developers is very competitive nowadays so they have (or at least the manager who wants to hire you has) a vested interest in keeping you happy.

Consider Joint Ownership

Joint ownership is a specific legal term meaning the rights in a copyright are shared by the joint owners. Effectively this sharing means that either party may enforce without consulting the other and either party may license the work without consulting the other (but here they must share any profits from the licence equally among joint owners).

Joint ownership is often a good solution because it gives you the right to relicence and the right to enforce, while also giving your employer a share in what they paid to produce. Joint ownership is often far easier to sell to corporations than one or other of you having exclusive ownership because it gives them all the rights they would have had anyway. The only slight concern you may have down the road is it does give them the right to relicence or sell on their ownership, say to an open core business or to an enforcement troll. However, the good news is that as joint owner you now have a right to a half share of any profit they (and the new owner) make out of such a rights transfer, which can potentially act as a deterrent to the transaction if you remind them of this requirement.

Open Source as a Contractor

In some ways this is the best relationship. There are no work for hire assumptions about companies you contract for owning your free time, so doing other open source projects is easy. However, a contractor is bound by whatever contract you sign, so you need someone with legal training to help you make sure it is actually equitable. You can’t get around this legal requirement: the protections that exist for employees don’t exist for contractors, so if you sign a contract saying in exchange for a certain sum company X owns the entirety of your output, you will be bound by it. So remember: read the contract and negotiate the terms.

Copyright Ownership as a Contractor

Surprisingly, in a relationship where you’re contracted to get something upstream, it’s often in the client’s best interest to have the contractor own the copyrights in Open Source. It means the contractor is responsible for all the nitty gritty of pushing patches and dealing with contribution agreements and the client simply gets the end product: the thing they wanted upstream. I’ve found this a surprisingly easy sell to most legal departments. Even if the client does want some sort of ownership of the code, you can offer joint ownership as the easy route to you taking on all the hassle and them getting the benefits of ownership.

Trade Secrets

As a contractor, you’ll likely be forced to sign an NDA never to reveal client secrets. This is pretty usual, but the pitfall in open source, particularly if you’re doing a driver for a device whose programming manual is under NDA, is that you are going to be revealing them contrary to the NDA. You need this handled in an equitable fashion in the contract to avoid unpleasant problems long after the job is done. The simplest phrase you need is something like “Client understands that open source is developed in public and authorizes that all information necessary to producing X under this contract be disclosed to the public”.

Patents

Patents can be a huge minefield with contract open source, because as a contractor who owns the copyrights and negotiates the contribution agreements, you have no authority to bind your client’s patents. You really don’t want to find yourself being used as a conduit for a patent ambush on open source (where a client contracts with you to put code into a project which reads on a patent they hold and then turns around and patent trolls the ecosystem) so you need contract language binding the client patents at least in the work you’re doing for them. Something simple like “Client grants a perpetual and irrevocable licence, consistent with the terms of the open source licence for X, to all contributions made by contractor to X that read on patents client holds now or may in future acquire”. This latter is pretty narrow, so you could start out by trying to get a patent licence for the entirety of project X and negotiate down from there.

Conclusions

Owning your own copyrights in open source is possible provided you’re careful. The strategies outlined above are based on my own experiences (all in the US) as a contract employee from 1995-2008 and thereafter as a regular employee, but they are not the only ones you could pursue, so ask around to see what others have done as well. The main problem with all the strategies above is that they work well when you’re negotiating your employment. If you’re already working at some corporation they’re unlikely to be helpful to you unless you really have a simple own time open source project. Oh, and just remember that while the snippets I quoted above for the contract case may actually have been in contracts I signed, this isn’t legal advice and you should have a lawyer advise you how best to incorporate the various points raised.

Papering Over our TPM 2.0 TSS Divisions

For years I’ve been hoping that the Trusted Computing Group (TCG) based IBM and Intel TSS (TCG Software Stack) would simply integrate with one another into a single package. The rationale is pretty simple: the Intel TSS is already quite a large collection of libraries so adding one more (the IBM TSS has a single library) wouldn’t be too much of a burden. Both TSSs are based on TCG specifications, except that the IBM TSS is based on the TPM 2.0 Library Specification and the Intel TSS is based on the TPM Software Stack (also, not at all confusingly, abbreviated TSS). There’s actually very little overlap between these specifications so co-existence seems very reasonable. Before we get into the stories of these two stacks and what they do, I should confess my biases: while I’ve worked with the TCG over the years, I’ve always harboured the view that the complete lack of adoption of TPM 2.0’s predecessor (TPM 1.2) was because of the hugely complicated nature of the TCG mandated software stack which was implemented in Linux by trousers. It is my firm belief that the complexity of the API led to the lack of uptake, even though I made several efforts over the years to make use of it.

My primary interest in the TPM has been as a secure laptop keystore (since I already paid for a TPM, I didn’t see the need to fork out again for one of the new security dongles; plus the TPM is infinitely scalable in the number of keys, unlike most dongles). The key to making the TPM usable in this form is integration with existing Cryptographic systems (via plugins if they do them). Since openssl has an engine plugin, I’ve already produced an openssl TPM2 engine, patches for gnupg and engine integration patches for openvpn (upstream in 2.5) and openssh as well as a PKCS11 exporter (to make file based engine keys exportable as PKCS11 tokens). Note a lot of the patches aren’t strictly TPM patches, they’re actually making openssl engines work in places they previously didn’t. However, the one thing most of the patches that actually touch the TPM have in common is that they have to pick one or other of the available TSSs to operate with. Before describing the TSS agnostic solution, let’s look at why these two TSSs exist and what the difference is between them and why you might choose one over the other.

Schizophrenia at the TCG

As I said in the introduction, both TSSs are based on TCG specifications. These standards aren’t ambiguous: they lay out in excruciating detail what the header files are called and what the prototypes and structures have to be. Both TSS implementations are the way they are because they wouldn’t be following the standards if they deviated even slightly. The problem is the standards don’t agree with each other in meaningful ways. For instance the TPM Library standards define every structure in terms of the fundamental unit of TPM data: the TPM2B structure, which defines a 16 bit big endian length followed by a data unit of that length. The TPM Library standards (in Part 4 section 9.10.6) lay out that every TPM2B_X structure shall be a union of a ‘b’ element which is a TPM2B and a ‘t’ element which is the actual structure. However the TPM Software Stack specification eliminates the plain TPM2B so every TPM2B_X structure in the latter specification is not a union, it is simply the ‘t’ form of the structure. This means that although TPM2B_X structures in each specification are byte for byte the same, they are definitionally different when written as C code and can’t be assigned to each other … oops. The TPM Library standard lays out additional structures for an elaborate calling convention for the TPM2_Command interfaces which are completely different from the ESYS_Command interfaces in the TPM Software Stack.
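To make the “byte for byte the same” point concrete, here is a minimal sketch of the TPM2B wire layout described above: a 16 bit big endian length followed by that many bytes of data. The function names are hypothetical and this is not code from either TSS; it only illustrates that, whichever C definition is used (the union or the plain ‘t’ form), the bytes that travel to the TPM are the same.

```python
import struct
from typing import Tuple


def marshal_tpm2b(payload: bytes) -> bytes:
    """Encode payload as a TPM2B: 16 bit big endian length, then the data."""
    if len(payload) > 0xFFFF:
        raise ValueError("payload too long for a 16 bit length field")
    return struct.pack(">H", len(payload)) + payload


def unmarshal_tpm2b(buffer: bytes) -> Tuple[bytes, bytes]:
    """Decode one TPM2B from the front of buffer; return (payload, remainder)."""
    if len(buffer) < 2:
        raise ValueError("buffer too short for the length field")
    (size,) = struct.unpack_from(">H", buffer)
    end = 2 + size
    if len(buffer) < end:
        raise ValueError("buffer shorter than the declared length")
    return buffer[2:end], buffer[end:]
```

The incompatibility the two specifications create is purely at the C type level, which is why it can be hidden with macros rather than needing any data conversion.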

The reason it’s all done this way? well the specifications were built by completely different committees for what the committees saw as separate use cases, so they didn’t see a need to reconcile the differences. As long as the definitions were byte for byte compatible, everything would work out correctly on the wire. The problem was the TPM Library specification was released nearly a decade ahead of the TPM Software Stack specification, so the first TSS created had to follow the former because the latter didn’t exist.

Sessions, HMAC and Encryption

One of the perennial problems of a TPM is that integrity and security of the information going over the wire is the responsibility of the user. However, the encryption and integrity computations involved, particularly the key derivations, are incredibly involved (even though well documented in the TPM Library specification), so naturally everyone would like the TSS to do this. The problem the TPM Software Stack had is that all the way up to its ESAPI specification, the security and integrity computations were still the responsibility of the user, so it didn’t begin to be useful until ESAPI was finalized a couple of years ago.

The Resource Manager Problem

TPM 2.0 was designed to be far leaner in terms of resources than TPM 1.2, which meant there was a very small limit to the number of sessions and volatile objects it could contain at any one time. This necessitated the use of a “resource manager” to control access, otherwise applications would get unexpected out of resource errors. The Intel TSS has its own resource manager. However, the Linux Kernel itself incorporated a resource manager in the TPM device in 4.12 and the IBM TSS avoids the need for its own resource manager by using this, and will, therefore, not work correctly on earlier kernel versions.

Inside the IBM TSS

Even though the IBM TSS is based on a solid and easily comprehensible and detailed specification, that specification itself suffers from a couple of defects. The first is that it assumes you’re submitting to a physical TPM, so the specification has no functional (library based) submission API for TPM commands, and the IBM TSS had to invent an API it called TSS_Execute() which is a way of sending TPM commands directly to the physical TPM over the kernel’s device interfaces. Secondly, the standard contains no routing interfaces (telling it what destination the TPM is on: should it open the /dev/tpmrm0 device or send the commands to the TPM over an IP socket), so this is controlled in the IBM TSS by several environment variables: TPM_INTERFACE_TYPE, which can be either “dev” or “socsim” for either a physical device or a network socket, with the endpoints being controlled by TPM_DEVICE for the “dev” type (which specifies which device to use, defaulting to /dev/tpmrm0) or by TPM_SERVER_NAME and TPM_PLATFORM_PORT for “socsim”.
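The routing decision can be sketched as a small lookup over the environment. This is only an illustration of the selection logic described above, not the IBM TSS implementation: the function name and the returned tuple shape are hypothetical, the port variable is spelled TPM_PLATFORM_PORT, and the sketch deliberately refuses to guess when TPM_INTERFACE_TYPE is unset (the real library has its own default, which isn’t reproduced here).

```python
from typing import Mapping, Optional, Tuple


def resolve_tpm_endpoint(
    env: Mapping[str, str],
) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (interface type, device or server name, port) from the environment."""
    kind = env.get("TPM_INTERFACE_TYPE")
    if kind == "dev":
        # Physical device: TPM_DEVICE selects it, defaulting to /dev/tpmrm0.
        return ("dev", env.get("TPM_DEVICE", "/dev/tpmrm0"), None)
    if kind == "socsim":
        # Network socket: host and port are passed through unchanged; no
        # fallback values are invented here if they are unset.
        return ("socsim", env.get("TPM_SERVER_NAME"), env.get("TPM_PLATFORM_PORT"))
    raise ValueError("TPM_INTERFACE_TYPE must be 'dev' or 'socsim' in this sketch")
```

Passing `os.environ` as the mapping gives the behaviour for the current process, while passing a plain dict makes the logic easy to exercise without touching the real environment.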

The invented TSS_Execute() API also does all the encryption and HMAC parts necessary for secure and integrity verified communication with the TPM, so it acts as a fully functional TSS. The main drawback of the IBM TSS is that it stores essential information about the sessions and handles in files which will, by default, be dropped into the local directory. Most users of the IBM TSS have to set TPM_DATA_DIR to be a specially created directory under /tmp to avoid leaving messy artifacts in users home directories.
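A minimal sketch of that workaround, assuming the calling program sets the variable for itself and for any TSS using children it spawns: create a fresh private directory in the system temporary location and point TPM_DATA_DIR at it. The function name and the directory prefix are hypothetical.

```python
import os
import tempfile


def use_private_tpm_data_dir() -> str:
    """Create a private temporary directory and export it as TPM_DATA_DIR."""
    # mkdtemp creates the directory readable and writable only by the
    # current user, so session state isn't exposed to other users.
    data_dir = tempfile.mkdtemp(prefix="tss-data-", dir=tempfile.gettempdir())
    os.environ["TPM_DATA_DIR"] = data_dir
    return data_dir
```

The caller is responsible for removing the directory (for example with `shutil.rmtree`) once the TPM sessions it holds are no longer needed.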

Inside the Intel TSS

The TPM Software Stack consists of a large number of different specifications, including the resource manager (which is now unnecessary for kernels above 4.12) and the TCTI which specifies the routing information for the TPM. It turns out that even in the Intel TSS, environment variables are the most convenient form to specify this information but, unfortunately, the name of the environment variable has been left up to each use case instead of being standardised in the library meaning you’ll have to consult the man page to figure out what it is. The next set of standards: SAPI and ESAPI define functional interfaces to the TPM with one submission API for each command and additionally a corresponding ..._Async()/..._Finish() pair for asynchronous programming. The only real difference between SAPI and ESAPI is that the latter also does the necessary session cryptography for security and integrity, so it’s pretty much the only usable interface for TPM commands. Unfortunately, the ESAPI interface, as constructed by the TCG, has several cases of premature abstraction the worst of which is a separate abstraction for the TPM handle interface which lives only as long as the lifetime of the connection object and which necessitates multiple conversions to and from internal handle objects if your session or object lives longer than the connection (which can be the case).

One final wrinkle is that in the handle abstraction, ESAPI has no API for retrieving the real TPM handle. I’d always wondered why the Intel TSS tpm2 tools always saved the objects they create to a context instead of simply returning the handle to them, but this is the reason: without the ability to transform an internal handle to an external one, you either save the context or let the object die when the connection terminates. This problem is one forced by the ESAPI standard, but eventually it became enough of a problem that the Intel TSS introduced its own additional API to remedy it.

The other major difference between the Intel and IBM TSSs is memory handling for returned results: The IBM TSS requires pre-allocated structures whereas the Intel TSS insists on allocation on return. It looks like the Intel TSS should be able to tell if the return pointer is allocated or NULL, but right at the moment it always allocates and overwrites the pointer.

Constructing a unifying Interface for both the IBM and Intel TSSs

In essence the process for converting something that runs with the IBM TSS to being TSS Agnostic is a fairly simple three step process which I’ll illustrate by reference to the openssl tpm2 engine which has already been converted:

1. Hide the structural differences by inserting a set of macros: VAL() and VAL_2B() which hide most of the TCG induced structure schizophrenia.
2. Convert the API call structure to be functional instead of via a single TSS_Execute() call. This is quite involved so I did it by adding tpm2_Function() wrappers for each specific invocation.
3. Introduce the correct premature abstraction for internal and external representation of handles. This was the nastiest step for me because handles are stored in long lived engine structures, and the internal and external representations are both forms of uint32_t even in ESAPI (meaning the compiler won’t complain if you assign one to the other) so it was incredibly painful to get this conversion correct.

Once this is done, the remaining step was to introduce a header which did the impedance matching between the Intel and IBM TSSs and an autoconf macro to detect which TSS is installed and the resulting configure and compile just works. The resulting code will now build and run under either TSS. I should point out that the Intel TSS is missing several helper routines, but these are added into the intel-tss.h header file by copying them from the original IBM TSS. Finally an autoconf check is added to look for the missing internal to external handle transform, and everything is ready to go.

It does seem like it would be easier to port an existing Intel TSS application to the IBM TSS, since points 2 and 3 will already be sorted out. However, all the major TSS library using applications are IBM TSS based, so I haven’t actually been able to verify this.

Remaining Problems and Anomalies

The biggest remaining issue was the test scripts. The openssl TPM2 engine has 27 of them all told, all designed to check the engine function by invoking it via openssl when connected to a software TPM. These scripts are all highly dependent on the IBM TSS command line binaries and the Intel TSS versions seem to be very unstable in terms of argument structure making it pretty much impossible to convert, so I elected finally to have the tests run only if the IBM TSS CLI is installed. The next problem was that the Intel TSS version of the engine didn’t actually pass all the tests. However this was quickly narrowed down to a bug in the Intel TSS when using bound sessions on the NULL seed.

The sole remaining issue is a curious performance anomaly. When running time make check with the IBM TSS, the result is:

real 0m6.100s
user 0m2.827s
sys 0m0.822s

and the same command with the Intel TSS (running one fewer test and skipping the NULL seed) is:

real	0m10.948s
user	0m6.822s
sys	0m0.859s

Showing that the Intel TSS is nearly twice as slow as the IBM one with most of the time differential being user time. Since the tests use a software TPM which can perform the cryptographic operations at the speed of the main CPU, this is showing some type of issue with the command transmission system of the Intel TSS, likely having to do with the fact that most applications use synchronous TPM operations (the engine certainly does) but in the Intel TSS, the synchronous operations are implemented as the corresponding asynchronous pair. Regardless of the root cause, this is unlikely to be a problem with real world TPM crypto where the time taken for any operation will be dominated by the slowness of the physical TPM.
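The arithmetic behind that reading of the two timings can be checked directly. This sketch uses only the numbers quoted above; the variable names are arbitrary.

```python
# Timings from the two "time make check" runs quoted above, in seconds.
ibm = {"real": 6.100, "user": 2.827, "sys": 0.822}
intel = {"real": 10.948, "user": 6.822, "sys": 0.859}

# Wall clock slowdown of the Intel TSS run relative to the IBM TSS run.
slowdown = intel["real"] / ibm["real"]

# Extra time spent in each category by the Intel TSS run.
extra = {key: intel[key] - ibm[key] for key in ibm}

# Fraction of the extra wall clock time that is accounted for by user time.
user_share = extra["user"] / extra["real"]

print(f"slowdown: {slowdown:.2f}x")
print(f"extra user time: {extra['user']:.3f}s of {extra['real']:.3f}s extra real time")
print(f"user share of the difference: {user_share:.0%}")
```

The slowdown comes out at roughly 1.8 times, with a little over four fifths of the extra wall clock time being user time and almost none of it system time.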

Conclusion

The TSS agnostic scheme adopted by the openssl TPM2 engine should be easily adaptable for all the other non-engine TPM code bases, and thus should pave the way for users no longer having to choose between applications which only support the Intel or IBM TSSs, so they can instead install whichever one is best supported on their distribution. The next steps are to investigate adapting this infrastructure to the existing gnupg patches (done and upstream) and also see if it can be used to solve the gnutls conundrum over supporting TPM based keys.

Deploying Encrypted Images for Confidential Computing

In the previous post I looked at how you build an encrypted image that can maintain its confidentiality inside AMD SEV or Intel TDX. In this post I’ll discuss how you actually bring up a confidential VM from an encrypted image while preserving secrecy. However, first a warning: This post represents the state of the art and includes patches that are certainly not deployed in distributions and may not even be upstream, so if you want to follow along at home you’ll need to patch things like qemu, grub and OVMF. I should also add that, although I’m trying to make everything generic to confidential environments, this post is based on AMD SEV, which is the only confidential encrypted¹ environment currently shipping.

The Basics of a Confidential Computing VM

At their base, current confidential computing environments are about using encrypted memory to run the virtual machine and guarding the encryption key so that the owner of the host system (the cloud service provider) can’t get access to it. Both SEV and TDX have the encryption technology inside the main memory controller meaning the L1 cache isn’t encrypted (still vulnerable to cache side channels) and DMA to devices must also be done via unencrypted memory. This latter also means that both the BIOS and the Operating System of the guest VM must be enlightened to understand which pages to encrypt and which not to. For this reason, all confidential VM systems use OVMF² to boot because this contains the necessary enlightening. To a guest, the VM encryption looks identical to full memory encryption on a physical system, so as long as you have a kernel which supports Intel or AMD full memory encryption, it should boot.

Each confidential computing system has a security element which sits between the encrypted VM and the host. In SEV this is an aarch64 processor called the Platform Security Processor (PSP) and in TDX it is an SGX enclave running Intel proprietary code. The job of the PSP is to bootstrap the VM, including encrypting the initial OVMF and inserting the encrypted pages. The security element also includes a validation certificate, which incorporates a Diffie-Hellman (DH) key. Once the guest owner obtains and validates the DH key it can use it to construct a one time ECDH encrypted bundle that can be passed to the security element on bring up. This bundle includes an encryption key which can be used to encrypt secrets for the security element and a validation key which can be used to verify measurements from the security element.

The way QEMU boots a Q35 machine is to set up all the configuration (including a disk device attached to the VM Image), load up the OVMF into rom memory and start the system running. OVMF pulls in the QEMU configuration and constructs the necessary ACPI configuration tables before executing grub and the kernel from the attached storage device. In a confidential VM, the first task is to establish a Guest Owner (the person whose encrypted VM it is) which is usually different from the Host Owner (the person running or controlling the Physical System). Ownership is established by transferring an encrypted bundle to the Secure Element before the VM is constructed.

The next step is for the VMM (QEMU in this case) to ask the secure element to provision the OVMF Firmware. Since the initial OVMF is untrusted, the Guest Owner should ask the Secure Element for an attestation of the memory contents before the VM is started. Since all paths lead through the Host Owner, who is also untrusted, the attestation contains a random nonce to prevent replay and is HMAC’d with a Guest Supplied key from the Launch Bundle. Once the Guest Owner is happy with the VM state, it supplies the Wrapped Key to the secure element (along with the nonce to prevent replay) and the Secure Element unwraps the key and provisions it to the VM where the Guest OS can use it for disc encryption. Finally, the enlightened guest reads the encrypted disk to unencrypted memory using DMA but uses the disk encryptor to decrypt it to encrypted memory, so the contents of the Encrypted VM Image are never visible to the Host Owner.
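The attestation check on the Guest Owner side can be sketched as an HMAC comparison. This is a simplified illustration under stated assumptions, not the SEV launch measurement format: the message layout (nonce followed by the measurement), the choice of SHA-256 and the function names are all hypothetical, and the real Secure Element defines its own field layout.

```python
import hashlib
import hmac


def attest(key: bytes, nonce: bytes, measurement: bytes) -> bytes:
    """Model the Secure Element: HMAC the nonce and memory measurement."""
    return hmac.new(key, nonce + measurement, hashlib.sha256).digest()


def verify_attestation(
    key: bytes, nonce: bytes, expected_measurement: bytes, tag: bytes
) -> bool:
    """Guest Owner check: recompute the HMAC over the expected firmware
    measurement and compare it in constant time with the reported tag."""
    expected = attest(key, nonce, expected_measurement)
    return hmac.compare_digest(expected, tag)
```

Because the key comes from the Launch Bundle and never reaches the Host Owner, a tag that verifies tells the Guest Owner both that the measurement is the expected one and that it was produced for this nonce rather than replayed from an earlier launch.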

The Gaps in the System

The most obvious gap is that EFI booting systems don’t go straight from the OVMF firmware to the OS, they have to go via an EFI bootloader (grub, usually) which must be an efi binary on an unencrypted vFAT partition. The second gap is that grub must be modified to pick the disk encryption key out of wherever the Secure Element has stashed it. The third is that the key is currently stashed in VM memory before OVMF starts, so OVMF must know not to use or corrupt the memory. A fourth problem is that the current recommended way of booting OVMF has a flash drive for persistent variable storage which is under the control of the host owner and which isn’t part of the initial measurement.

Plugging The Gaps: OVMF

To deal with the problems in reverse order: the variable issue can be solved simply by not having a persistent variable store, since any mutable configuration information could be used to subvert the boot and leak the secret. This is achieved by stripping all the mutable variable handling out of OVMF. Solving key stashing simply means getting OVMF to set aside a page for a secret area and having QEMU recognise where it is for the secret injection. It turns out AMD were already working on a QEMU configuration table at a known location by the Reset Vector in OVMF, so the secret area is added as one of these entries. Once this is done, QEMU can retrieve the injection location from the OVMF binary so it doesn’t have to be specified in the QEMU Machine Protocol (QMP) command. Finally OVMF can protect the secret and package it up as an EFI configuration table for later collection by the bootloader.

The final OVMF change (which is in the same patch set) is to pull grub inside a Firmware Volume and execute it directly. This certainly isn’t the only possible solution to the problem (adding secure boot or an encrypted filesystem were other possibilities) but it is the simplest solution that gives a verifiable component that can be invariant across arbitrary encrypted boots (so the same OVMF can be used to execute any encrypted VM securely). This latter is important because traditionally OVMF is supplied by the host owner rather than being part of the VM image supplied by the guest owner. The grub script that runs from the combined volume must still be trusted to either decrypt the root or reboot to avoid leaking the key. Although the host owner still supplies the combined OVMF, the measurement assures the guest owner of its correctness, which is why having a fairly invariant component is a good idea … so the guest owner doesn’t have potentially thousands of different measurements for approved firmware.

Plugging the Gaps: QEMU

The modifications to QEMU are fairly simple: it just needs to scan the OVMF file to determine the location for the injected secret and inject it correctly using a QMP command. Since secret injection is already upstream, this is a simple patch set that finds the location and makes it optional in the QMP command.

Plugging the Gaps: Grub

Grub today only allows for the manual input of the cryptodisk password. However, in the cloud we can’t do it this way because there’s no guarantee of a secure tty channel to the VM. The solution, therefore, is to modify grub so that the cryptodisk can use secrets from a provider, in addition to the manual input. We then add a provider that can read the efi configuration tables and extract the secret table if it exists. The current incarnation of the proposed patch set is here and it allows cryptodisk to extract a secret from an efisecret provider. Note this isn’t quite the same as the form expected by the upstream OVMF patch in its grub.cfg because now the provider has to be named on the cryptomount command line thus

cryptomount -s efisecret

but in all other aspects, Grub/grub.cfg works. I also discovered several other deviations from the initial grub.cfg (like Fedora uses /boot/grub2 instead of /boot/grub like everyone else) so the current incarnation of grub.cfg is here. I’ll update it as it changes.

Putting it All Together

Once you have applied all the above patches and built your version of OVMF with grub inside, you’re ready to do a confidential computing encrypted boot. However, you still need to verify the measurement and inject the encrypted secret. As I said before, this isn’t easy because, due to replay defeat requirements, the secret bundle must be constructed on the fly for each VM boot. From this point on I’m going to be using only AMD SEV as the example because the Intel hardware doesn’t yet exist and AMD kindly gave IBM research a box to play with (anyone with a new EPYC 7xx1 or 7xx2 based workstation can likely play along at home, but check here). The first thing you need to do is construct a launch bundle. AMD has a tool called sev-tool to do this for you, and its first input is the platform Diffie Hellman certificate (pdh.cert). The tool will extract this for you

sevtool --pdh_cert_export

Or it can be given to you by the cloud service provider (in this latter case you’ll want to verify the provenance using sevtool --validate_cert_chain, which contacts the AMD site to verify all the details). Once you have a trusted pdh.cert, you can use this to generate your own guest owner DH cert (godh.cert) which should be used only one time to give a semblance of ECDHE. godh.cert is used with pdh.cert to derive an encryption key for the launch bundle. You can generate this with

sevtool --generate_launch_blob <policy>

The gory details of policy are in the SEV manual chapter 3, but most guests use 1 which means no debugging. This command will generate the godh.cert, the launch_blob.bin and a tmp_tk.bin file which you must save and keep secure because it contains the Transport Encryption and Integrity Keys (TEK and TIK) which will be used to encrypt the secret. Figuring out the qemu command line options needed to launch and pause a SEV guest is a bit of a palaver, so here is mine. You’ll likely need to change things, like the QMP port and the location of your OVMF build and the launch secret.
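
To make the policy value passed to sevtool a little less opaque, here is a minimal sketch that decodes its low bits. The bit names are as I recall them from the guest policy table in the SEV API manual, so treat the table (and the hypothetical helper name decode_policy) as assumptions and check the manual before relying on them.

```python
# Low bits of the SEV guest policy, as recalled from the SEV API manual.
# ASSUMPTION: verify the names and positions against the manual.
POLICY_BITS = {
    0: "NODBG",   # debugging of the guest is disallowed
    1: "NOKS",    # sharing keys with other guests is disallowed
    2: "ES",      # SEV-ES is required
    3: "NOSEND",  # sending the guest to another platform is disallowed
    4: "DOMAIN",  # the guest must not leave the domain
    5: "SEV",     # the guest must not move to a non-SEV platform
}


def decode_policy(policy):
    """Return the names of the policy bits set in the given integer."""
    return [name for bit, name in sorted(POLICY_BITS.items())
            if policy & (1 << bit)]


if __name__ == "__main__":
    # Policy 1 is the "no debugging" value mentioned in the text.
    print(decode_policy(1))
```

With this reading, the policy of 1 used above sets only the no-debug bit.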

Finally you need to get the launch measure from QMP, verify it against the sha256sum of OVMF.fd and create the secret bundle with the correct GUID headers. Since this is really fiddly to do with sevtool, I wrote this python script³ to do it all (note it requires qmp.py from the qemu git repository). You execute it as

sevsecret.py --passwd <disk passwd> --tiktek-file <location of tmp_tk.bin> --ovmf-hash <hash> --socket <qmp socket>

And it will verify the launch measure and encrypt the secret for the VM if the measure is correct and start the VM. If you got everything correct the VM will simply boot up without asking for a password (if you inject the wrong secret, it will still ask). And there you have it: you’ve booted up a confidential VM from an encrypted image file. If you’re like me, you’ll also want to fire up gdb on the qemu process just to show that the entire memory of the VM is encrypted …
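
The measurement check is the heart of this step, so here is a rough sketch of what it involves. This is not the sevsecret.py script itself: the field layout (a 0x04 tag, the firmware API major, minor and build numbers, the 32 bit little endian policy, the SHA-256 of the launched firmware and the 16 byte nonce, all HMAC-SHA256'd with the TIK) is my recollection of the LAUNCH_MEASURE description in the SEV API manual, and the function names are hypothetical.

```python
import base64
import hashlib
import hmac


def expected_measure(tik, ovmf_hash, policy, api_major, api_minor, build, mnonce):
    """Recompute the launch measurement the guest owner expects to see.

    ASSUMPTION: the message layout follows the SEV API manual as recalled.
    """
    msg = (bytes([0x04, api_major, api_minor, build])
           + policy.to_bytes(4, "little")
           + ovmf_hash      # sha256 digest of the OVMF.fd that was launched
           + mnonce)        # nonce chosen by the security element
    return hmac.new(tik, msg, hashlib.sha256).digest()


def verify_launch_measure(measure_b64, tik, ovmf_hash, policy,
                          api_major, api_minor, build):
    """Check a base64 blob assumed to hold a 32 byte measure plus 16 byte nonce."""
    blob = base64.b64decode(measure_b64)
    measure, mnonce = blob[:32], blob[32:48]
    want = expected_measure(tik, ovmf_hash, policy,
                            api_major, api_minor, build, mnonce)
    # Constant time comparison, as for any MAC check.
    return hmac.compare_digest(measure, want)
```

Only if this check passes should the disk password be encrypted and injected; a mismatch means the firmware the host launched is not the one you hashed.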

Conclusions and Caveats

The above script should allow you to boot an encrypted VM anywhere: locally or in the cloud, provided you can access the QMP port (most clouds use libvirt which introduces yet another layer of pain). The biggest drawback, if you refer to the diagram, is the yellow box: you must trust the secure element, which in both Intel and AMD is proprietary⁴, in order to get confidential computing to work. Although there is hope that in future the secure element could be fully open source, it isn’t today.

The next annoyance is that launching a confidential VM is high touch requiring collaboration from both the guest owner and the host owner (due to the anti-replay nonce). For a single launch, this is a minor annoyance but for an autoscaling (launch VMs as needed) platform it becomes a major headache. The solution seems to be to have some Hardware Security Module (HSM), like the cloud uses today to store encryption keys securely, and have it understand how to measure and launch encrypted VMs on behalf of the guest owner.

The final conclusion to remember is that confidentiality is not security: your VM is as exploitable inside a confidential encrypted VM as it was outside. In many ways confidentiality and security are opposites, in that security in part requires reducing the trusted code and confidentiality requires pulling as much as possible inside. Confidential VMs do have an answer to the Cloud trust problem since the enterprise can now deploy VMs without fear of tampering by the cloud provider, but those VMs are as insecure in the cloud as they were in the Enterprise Data Centre. All of this argues that Confidential Computing, while an important milestone, is only one step on the journey to cloud security.

Patch Status

The OVMF patches are upstream (including modifications requested by Intel for TDX). The QEMU and grub patch sets are still on the lists.

Building Encrypted Images for Confidential Computing

With both Intel and AMD announcing confidential computing features to run encrypted virtual machines, IBM research has been looking into a new format for encrypted VM images. The first question is why a new format, after all qcow2 only recently deprecated its old encrypted image format in favour of luks. The problem is that in confidential computing, the guest VM runs inside the secure envelope but the host hypervisor (including the QEMU process) is untrusted and thus runs outside the secure envelope and, unfortunately, even for the new luks format, the encryption of the image is handled by QEMU and so the encryption key would be outside the secure envelope. Thus, a new format is needed to keep the encryption key (and, indeed, the encryption mechanism) within the guest VM itself. Fortunately, encrypted boot of Linux systems has been around for a while, and this can be used as a practical template for constructing a fully confidential encrypted image format and maintaining that confidentiality within a hostile cloud environment. In this article, I’ll explore the state of the art in encrypted boot, constructing EFI encrypted boot images, and finally, in the follow on article, look at deploying an encrypted image into a confidential environment and maintaining key secrecy in the cloud.

Encrypted Boot State of the Art

Luks and the cryptsetup toolkit have been around for a while and recently (in 2018), the luks format was updated to version 2. However, actually booting a linux kernel from an encrypted partition has always been a bit of a systems problem, primarily because the bootloader (grub) must decrypt the partition to actually load the kernel. Fortunately, grub can do this, but unfortunately the current grub in most distributions (2.04) can only read the version 1 luks format. Secondly, the user must type the decryption passphrase into grub (so it can pull the kernel and initial ramdisk out of the encrypted partition to boot them), but grub currently has no mechanism to pass it on to the initial ramdisk for mounting root, meaning that either the user has to type their passphrase twice (annoying) or the initial ramdisk itself has to contain a file with the disk passphrase. This latter is the most commonly used approach and only has minor security implications when the system is in motion (the ramdisk and the key file must be root read only) and the password is protected at rest by the fact that the initial ramdisk is also on the encrypted volume. Even more annoying is the fact that there is no distribution standard way of creating the initial ramdisk. Debian (and Ubuntu) have the most comprehensive documentation on how to do this, so the next section will look at the much less well documented systemd/dracut mechanism.

Encrypted Boot for Systemd/Dracut

Part of the problem here seems to be less than stellar systems co-ordination between the two components. Additionally, the way systemd supports passphraseless encrypted volumes has been evolving for a while but changed again in v246 to mirror the Debian method. Since cloud images are usually pretty up to date, I’ll describe this new way. Each encrypted volume is referred to by UUID (which will be the UUID of the containing partition returned by blkid). To get dracut to boot from an encrypted partition, you must pass in

rd.luks.uuid=<UUID>

but you must also have a key file named

/etc/cryptsetup-keys.d/luks-<UUID>.key

And, since dracut hasn’t yet caught up with this, you usually need a cryptodisk.conf file in /etc/dracut.conf.d/ which contains

install_items+=" /etc/cryptsetup-keys.d/* "

Grub and EFI Booting Encrypted Images

Traditionally grub is actually installed into the disk master boot record, but for EFI boot that changed and the disk (or VM image) must have an EFI System partition which is where the grub.efi binary is installed. Part of the job of the grub.efi binary is to find the root partition and source the /boot/grub¹/grub.cfg. When you install grub on an EFI partition a search for the root by UUID is actually embedded into the grub binary. Another problem is likely that your distribution customizes the location of grub and updates the boot variables to tell the system where it is. However, a cloud image can’t rely on the boot variables and must be installed in the default location (\EFI\BOOT\bootx64.efi). This default location can be achieved by adding the --removable flag to grub-install.

For encrypted boot, this becomes harder because the grub in the EFI partition must set up the cryptographic location by UUID. However, if you add

GRUB_ENABLE_CRYPTODISK=y

to /etc/default/grub, it will do the necessary in grub-install and grub-mkconfig. Note that on Fedora, where every other GRUB_ENABLE parameter is true/false, this must still be ‘y’ because grub-install looks for =y, not =true.

Putting it all together: Encrypted VM Images

Start by extracting the root of an existing VM image to a tar file. Make sure it has all the tools you will need, like cryptsetup and grub-efi. Create a two partition raw image file (I usually like 4GB) with a small efi partition (p1) and an encrypted root (p2), ready to be loopback mounted:

truncate -s 4GB disk.img
parted disk.img mklabel gpt
parted disk.img mkpart primary 1Mib 100Mib
parted disk.img mkpart primary 100Mib 100%
parted disk.img set 1 esp on
parted disk.img set 1 boot on

Now set up the efi and cryptosystem (I use ext4, but it’s not required). Note at this time luks will require a password. Use a simple one and change it later. Also note that most encrypted boot documents advise filling the encrypted partition with random numbers. I don’t do this because the additional security afforded is small compared with the advantage of converting the raw image to a smaller qcow2 one.

losetup -P -f disk.img          # assuming here it uses loop0
l=($(losetup -l|grep disk.img)) # verify with losetup -l
mkfs.vfat ${l}p1
blkid ${l}p1       # remember the EFI partition UUID
cryptsetup --type luks1 luksFormat ${l}p2 # choose temp password
blkid ${l}p2       # remember this as <UUID> you'll need it later 
cryptsetup luksOpen ${l}p2 cr_root
mkfs.ext4 /dev/mapper/cr_root
mount /dev/mapper/cr_root /mnt
tar -C /mnt -xpf <vm root tar file>
for m in run sys proc dev; do mount --bind /$m /mnt/$m; done
chroot /mnt

Create or modify /etc/fstab to have root as /dev/mapper/cr_root and the EFI partition by label under /boot/efi. Now set up grub for encrypted boot²

echo "GRUB_ENABLE_CRYPTODISK=y" >> /etc/default/grub
mount /boot/efi
grub-install --removable --target=x86_64-efi
grub-mkconfig -o /boot/grub/grub.cfg

For Debian, you’ll need to add an /etc/crypttab entry for the encrypted disk:

cr_root UUID=<UUID> none luks

And then re-create the initial ramdisk. For dracut systems, you’ll have to modify /etc/default/grub so the GRUB_CMDLINE_LINUX has a rd.luks.uuid=<UUID> entry. If this is a selinux based distribution, you may also have to trigger a relabel.

Now would also be a good time to make sure you have a root password you know or to install /root/.ssh/authorized_keys. You should unmount all the binds and /mnt and try EFI booting the image. You’ll still have to type the password a couple of times, but once the image boots you’re operating inside the encrypted envelope. All that remains is to create a fast boot high entropy low iteration password and replace the existing one with it and set the initial ramdisk to use it. This example assumes your image is mounted as SCSI disk sda, but it may be a virtual disk or some other device.

dd if=/dev/urandom bs=1 count=33|base64 -w 0 > /etc/cryptsetup-keys.d/luks-<UUID>.key
chmod 600 /etc/cryptsetup-keys.d/luks-<UUID>.key
cryptsetup --key-slot 1 luksAddKey /dev/sda2 # permanent recovery key
cryptsetup --key-slot 0 luksRemoveKey /dev/sda2 # remove temporary
cryptsetup --key-slot 0 --iter-time 1 luksAddKey /dev/sda2 /etc/cryptsetup-keys.d/luks-<UUID>.key
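
For anyone who prefers to see the key generation step spelled out, this is a small Python sketch of what the dd and base64 pipeline in the first command produces. The helper name make_luks_key is hypothetical. Because 33 is a multiple of 3, the base64 text is exactly 44 characters with no ‘=’ padding, and b64encode never appends a newline.

```python
import base64
import os


def make_luks_key(nbytes=33):
    """Return nbytes of kernel randomness as single-line base64 text."""
    return base64.b64encode(os.urandom(nbytes))


if __name__ == "__main__":
    key = make_luks_key()
    # 33 random bytes encode to 44 printable characters.
    print(len(key))
```

If you write the result to the key file yourself, remember it still needs the same restrictive permissions as in the chmod line above.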

Note the “-w 0” is necessary to prevent the password from having a trailing newline which will make it difficult to use. For mkinitramfs systems, you’ll now need to modify the /etc/crypttab entry

cr_root UUID=<UUID> /etc/cryptsetup-keys.d/luks-<UUID>.key luks

For dracut you need the key install hook in /etc/dracut.conf.d as described above and for Debian you need the keyfile pattern:

echo "KEYFILE_PATTERN=\"/etc/cryptsetup-keys.d/*\"" >>/etc/cryptsetup-initramfs/conf-hook

You now rebuild the initial ramdisk and you should now be able to boot the cryptosystem using either the high entropy password or your rescue one and it should only prompt in grub and shouldn’t prompt again. This image file is now ready to be used for confidential computing.

Creating a Home IPv6 Network

One of the recent experiences of Linux Plumbers Conference convinced me that if you want to be part of a true open source WebRTC based peer to peer audio/video interaction, you need an internet address that’s not behind a NAT. In reality, the protocol still works as long as you can contact a STUN server to tell you what your external address is and possibly a TURN server to proxy the packets if both endpoints are NATed, but all this seeking external servers takes time, as those of you who complained about the echo test found. The solution to all this is to connect over IPv6 which has an address space large enough to support every device on the planet having its own address. All modern Linux distributions support IPv6 out of the box so the chances are you’ve actually accidentally used it without ever noticing, which is one of the beauties of IPv6 autoconfiguration (it’s supposed to just work).

However, I recently moved, and so swapped my fibre internet connection for cable, but the cable did come with an IPv6 address, so this is my story of getting it all to work. If you don’t really care about the protocol basics, you can skip down to the how. This guide is also focussed on a “dual stack” configuration (one that has both IPv6 and IPv4 addresses). Pure IPv6 configurations are possible, but because some parts of the internet are still IPv4 only, they’re not complete unless you set up an IPv4 encapsulating bridge.

The Basics of IPv6

IPv6 has been a mature protocol for a long time now, so I erroneously assumed there’d be a load of good HOWTOs about it. However, after reading 20 different descriptions of how the IPv6 128 bit address space works and not much else, I gave up in despair and read the RFCs instead. I’ll assume you’ve read at least one of these HOWTOS, so I don’t have to go into IPv6 address prefixes, suffixes, interface IDs or subnets so I’ll begin where most of the HOWTOs end.

How does IPv6 Just Work?

In IPv4 there’s a protocol called dynamic host configuration protocol (DHCP) so as long as you can find a DHCP server you can get all the information you need to connect (local address, router, DNS server, time server, etc). However, this service has to be set up by someone and IPv6 is designed to configure a network without it.

The first assumption IPv6 StateLess Address AutoConfiguration (SLAAC) makes is that it’s on a /64 subnet (so every subnet in IPv6 contains 2³², or about 4 × 10⁹, times as many addresses as the entire IPv4 internet). This means that, since most real subnets contain <100 systems, they can simply choose a random address and be very unlikely to clash with the existing systems. In fact, there are three current ways of choosing an address in the /64:

EUI-64 (RFC 4291) based on the MAC address which is basically the MAC with one bit flipped and ff:fe placed in the middle.
Stable Private (RFC 7217) which are generated from a hash based on a static key, interface, prefix and a counter (the counter is incremented if there is a clash). These are preferred to the EUI-64 ones which give away any configuration associated with the MAC address (such as what type of network card you have)
Privacy Extension Addresses (RFC 4941) which are very similar to stable private addresses except they change over time using the IPv6 address deprecation mechanism and are for client systems who want to preserve anonymity.
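
As a concrete illustration of the first method in the list above, here is a minimal sketch of the EUI-64 construction using Python’s ipaddress module. The function name and the example MAC are my own choices for illustration.

```python
import ipaddress


def eui64_address(prefix, mac):
    """Build the EUI-64 style address for a colon separated MAC in a /64."""
    net = ipaddress.IPv6Network(prefix)
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit
    # Place ff:fe between the two halves of the MAC.
    iid = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(
        int(net.network_address) | int.from_bytes(iid, "big"))


if __name__ == "__main__":
    print(eui64_address("fe80::/64", "52:54:00:12:34:56"))
    # prints fe80::5054:ff:fe12:3456
```

The card vendor prefix survives in the address more or less intact, which is exactly the information leak the other two methods avoid.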

The next problem in Linux is who configures the interface? The Kernel IPv6 stack is actually designed to do it, and will unless told not to, but most of the modern network controllers (like NetworkManager) are control freaks and turn off the kernel’s auto configuration so they can do it themselves. They also default to stable private addressing using a static secret maintained in the filesystem (/var/lib/NetworkManager/secret_key).

The next thing to understand about IPv6 addresses is that they are divided into scopes, the most important being link local (unrouteable) addresses which conventionally always have the prefix fe80::/64. The link local address is configured first using one of the above methods and then used to probe the network.

Multicast and Neighbour Discovery

Unlike IPv4, IPv6 has no broadcast capability so all discovery is done by multicast. Nodes coming up on the network subscribe to particular multicast addresses, via special packets intercepted by the switch, and won’t receive any multicast to which they’re not subscribed. Conventionally, all link local multicast addresses have the prefix ff02::/64 (for other types of multicast address see RFC 4291). All nodes subscribe to the “all nodes” multicast address ff02::1 and also must subscribe to their own solicited node multicast address at ff02::1:ffXX:XXXX where the last 24 bits correspond to the lowest 24 bits of the node’s IPv6 address. This latter is to avoid the disruption that used to occur in IPv4 from ARP broadcasts because now you can target a specific subset of nodes for address resolution.
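
The solicited node address is easy to compute by hand, but a short sketch makes the bit manipulation explicit. The function name is hypothetical; the rule itself is the one described above.

```python
import ipaddress

# Fixed upper 104 bits of every solicited node multicast address.
SOLICITED_NODE_BASE = int(ipaddress.IPv6Address("ff02::1:ff00:0"))


def solicited_node(addr):
    """Return the solicited node multicast address for a unicast address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    return ipaddress.IPv6Address(SOLICITED_NODE_BASE | low24)


if __name__ == "__main__":
    print(solicited_node("fe80::5054:ff:fe12:3456"))
    # prints ff02::1:ff12:3456
```

Since only 24 bits are kept, a few nodes may share a group, but that is still far fewer than the whole link hearing every query.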

The IPv6 address resolution protocol is called Neighbour Discovery. It is described in RFC 4861 and its use with SLAAC is described in RFC 4862. Resolution is done by sending a Neighbour Solicitation (NS) multicast to the solicited node multicast address of the node you want to discover, containing the full IPv6 address you want to know; a node with the matching address replies with its link layer (MAC) address in a Neighbour Advertisement (NA) packet.

Once a node has chosen its link local address, it first sends out a NS packet to its chosen address to see if anyone replies and if no-one does it assumes it is OK to keep it otherwise it follows the collision avoidance protocol associated with its particular form of address. Once it has found a unique address, the node configures this link local address and looks for a router. Note that if an IPv6 network isn’t present, discovery stops here, which is why most network interfaces always show a link local IPv6 address.

Router Discovery

Once the node has its own unique link local address, it uses it to send out Router Solicitation (RS) packets to the “all routers” multicast address ff02::2. Every router on the network responds with a Router Advertisement (RA) packet which describes (among other things) the router lifetime, the network MTU, a set of one or more prefixes the router is responsible for, the router’s link layer address and a set of option flags including the M (Managed) and O (Other Configuration) flags and possibly a set of DNS servers.

Each advertised prefix contains the prefix and prefix length, a set of flags including the A (autonomous configuration) and L (on-link) flags and a set of lifetimes. The prefixes with the L flag set tell you what global prefixes the local network uses (there may be more than one) and whether you are allowed to do SLAAC on the global prefix (if the A flag is clear, you must ask the router for an address using DHCPv6). If the router has a non zero lifetime, you may assume it is a default router for the subnet.

Now that the node has discovered one or more routers it may configure its own global address (note that every IPv6 routeable node has at least two addresses: a link local and a global). How it does this depends on the router and prefix flags.

Global Address Configuration

The first thing a node needs to know is whether to use SLAAC for the global address or DHCPv6. This is entirely determined by the A flag of any prefix in the RA packet that has the L flag set. If A is set, then the node may use SLAAC and if A is clear then the node must use DHCPv6 to obtain an address. If A is set and also the M (Managed) flag then the node may use either SLAAC or DHCPv6 (or both) to obtain an address and if the M flag is clear, but the O (Other Config) flag is present then the node must use SLAAC but may use DHCPv6 to obtain other information about the network (usually DNS).
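
The flag rules above are easier to follow as a decision function. This is only a sketch of the rules as I have stated them, not of any real network manager; the function name and the return shape are made up, and the case where A is set but both M and O are clear is my inference (plain SLAAC with no DHCPv6 at all).

```python
def global_address_plan(a_flag, m_flag, o_flag):
    """Return (allowed address methods, use DHCPv6 for other config only)."""
    if not a_flag:
        # No autonomous configuration: the address must come from DHCPv6.
        return ({"dhcpv6"}, False)
    if m_flag:
        # Managed: SLAAC, DHCPv6 or both may supply an address.
        return ({"slaac", "dhcpv6"}, False)
    # SLAAC supplies the address; DHCPv6 may still supply DNS and the like.
    return ({"slaac"}, bool(o_flag))


if __name__ == "__main__":
    for flags in [(False, True, False), (True, True, False),
                  (True, False, True), (True, False, False)]:
        print(flags, global_address_plan(*flags))
```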

Once the node has a global address it now needs a default route. It forms the default route list from the RA packets that have a non-zero router Lifetime. All of these are configured as default routes to their link local address with the RA specified hop count. Finally, the node may add specific prefix routes from RA packets with zero router Lifetimes but prefixes without the L flag.

DHCPv6 is a fairly complex configuration protocol (see RFC 8415) but it cannot specify either prefix length (meaning all obtained addresses are configured as /128) or routes (these must be obtained from RA packets). This leads to a subtlety of outbound address selection in that the most specific is always preferred, so if you configure both by SLAAC and DHCPv6, the SLAAC address will be added as /64 and the DHCPv6 address as /128 meaning your outbound IP address will always be the DHCPv6 one (although if an external entity knows your SLAAC address, they will still be able to reach you on it).

The How: Configuring your own Home Router

One of the things you’d think from the above is that IPv6 always auto configures and, while it is true that if you simply plug your laptop into the ethernet port of a cable modem it will just automatically configure, most people have a more complex home setup involving a router, which needs some special coaxing before it will work. That means you need to obtain additional features from your ISP using special DHCPv6 requests.

This section is written from my own point of view: I have a rather complex IPv4 network which has a completely open but bandwidth limited (to untrusted clients) wifi network, and several protected internal networks (one for my lab, one for my phones and one for the household video cameras), so I need at least 4 subnets to give every device in my home an IPv6 address. I also use OpenWRT as my router distribution, so all the IPv6 configuration information is highly specific to it (although it should be noted that things like NetworkManager can also do all of this if you’re prepared to dig in the documentation).

Prefix Delegation

Since DHCPv6 only hands out a /128 address, this isn’t sufficient because it’s the IP address of the router itself. In order to become a router, you must request delegation of part of the IPv6 address space via the Identity Association for Prefix Delegation (IA_PD) option of DHCPv6. Once this is done the router IP address will be assumed by the ISP to be the route for all of the delegated prefixes. The subtlety here is that if you want more than one subnet, you have to ask for it specifically (the client must specify the exact prefix length it’s looking for) and since it’s a prefix length, and your default subnet should be /64, if you request a prefix length of 64 you only have one subnet. If you request 63 you have 2 and so on. The problem is how do you know how many subnets the ISP is willing to give you? Unfortunately there’s no way of finding this out from the protocol (I had to do an internet search to discover my ISP, Comcast, was willing to delegate a prefix length of 60, meaning 16 subnets). If searching doesn’t tell you how much your ISP is willing to delegate, you could try starting at 48 and working your way to 64 in increments of 1 to see what the largest delegation you can get away with is (there have been reports of ISPs locking you at your first delegated prefix length, so don’t start at 64). The final subtlety is that the prefix you’re delegated may not be the same prefix as the address your router obtained (my current Comcast configuration has my router at 2001:558:600a:… but my delegated prefix is 2601:600:8280:66d0::/60). Note you can run odhcp6c manually with the -P option if you have to probe your ISP to find out what size of prefix you can get.
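
The arithmetic relating the requested prefix length to the number of /64 subnets can be checked with a few lines of Python. This sketch just uses the standard ipaddress module on my delegated prefix; the function name is hypothetical.

```python
import ipaddress


def delegated_subnets(prefix):
    """List the /64 subnets contained in a delegated prefix."""
    net = ipaddress.IPv6Network(prefix)
    return list(net.subnets(new_prefix=64))


if __name__ == "__main__":
    subnets = delegated_subnets("2601:600:8280:66d0::/60")
    # A /60 leaves 64 - 60 = 4 bits of subnet id, so 2**4 = 16 subnets.
    print(len(subnets), subnets[0], subnets[-1])
```

Each bit you shave off the requested prefix length doubles the count, which is why a /48 would give you 65536 subnets if your ISP were feeling generous.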

Configuring the Router for Prefix Delegation

In OpenWRT terms, the router WAN DHCP(v6) configuration is controlled by /etc/config/network. You’ll already have a WAN interface (likely called ‘wan’) for DHCPv4, so you simply add an additional ‘wan6’ interface to get an additional IPv6 address and become dual stack. In my configuration this looks like

config interface 'wan6'
        option ifname '@wan'
        option proto 'dhcpv6'
        option reqprefix 60

The slight oddity is the ifname: @wan simply tells the config to use the same ifname as the ‘wan’ interface. Naming it this way is essential if your wan is a bridge, but it’s good practice anyway. The other option ‘reqprefix’ tells DHCPv6 to request a /60 prefix delegation.

Handing Out Delegated Prefixes

This turns out to be remarkably simple. Firstly you have to assign a delegated prefix to each of your other interfaces on the router, but you can do this without adding a new OpenWRT interface for each of them. My internal IPv4 network has all static addresses, so you add three directives to each of the interfaces:

config interface 'lan'
        ... interface designation (bridge for me)
        option proto 'static'
        ... ipv4 addresses
        option ip6assign '64'
        option ip6hint '1'
        option ip6ifaceid '::ff'

ip6assign ‘N’ means you are a /N network (so this is always /64 for me) and ip6hint ‘N’ means use N as your subnet id and ip6ifaceid ‘S’ means use S as the IPv6 suffix (This defaults to ::1 so if you’re OK with that, omit this directive). So given I have a 2601:600:8280:66d0::/60 prefix, the global address of this interface will be 2601:600:8280:66d1::ff. Now the acid test, if you got this right, this global address should be pingable from anywhere on the IPv6 internet (if it isn’t, it’s likely a firewall issue, see below).
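
To double check the address you expect before reaching for ping, the combination of the three directives can be sketched as below. This mirrors my description of the options rather than the OpenWRT implementation, the function name is made up, and the hint is taken as a plain integer here.

```python
import ipaddress


def assigned_address(delegated, assign_len, hint, ifaceid="::1"):
    """Combine a delegated prefix, subnet id and suffix into one address."""
    net = ipaddress.IPv6Network(delegated)
    subnet_bits = assign_len - net.prefixlen
    if subnet_bits < 0 or not 0 <= hint < (1 << subnet_bits):
        raise ValueError("hint does not fit inside the delegated prefix")
    # Put the subnet id just above the host part of the address.
    base = int(net.network_address) + (hint << (128 - assign_len))
    return ipaddress.IPv6Address(base | int(ipaddress.IPv6Address(ifaceid)))


if __name__ == "__main__":
    print(assigned_address("2601:600:8280:66d0::/60", 64, 1, "::ff"))
    # prints 2601:600:8280:66d1::ff
```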

Advertising as a Router

Simply being delegated a prefix on a local router interface is insufficient. Now you need to get your router to respond to Router Solicitations on ff02::2 and optionally do DHCPv6. Unfortunately, OpenWRT has two mechanisms for doing this, usually both installed: odhcpd and dnsmasq. What I found was that none of my directives in /etc/config/dhcp would take effect until I disabled odhcpd completely

/etc/init.d/odhcpd stop
/etc/init.d/odhcpd disable

and since I use dnsmasq extensively elsewhere (split DNS for internal/external networks), that suited me fine. I’ll describe firstly what options you need in dnsmasq and secondly how you can achieve this using entries in the OpenWRT /etc/config/dhcp file (I find this useful because it’s always wise to check what OpenWRT has put in the /var/etc/dnsmasq.conf file).

The first dnsmasq option you need is ‘enable-ra’ which is a global parameter instructing dnsmasq to handle router advertisements. The next parameter you need is the per-interface ‘ra-param’ which specifies the global router advertisement parameters and must appear once for every interface you want to advertise on. Finally the ‘dhcp-range’ option allows more detailed configuration of the type of RA flags and optional DHCPv6.

SLAAC or DHCPv6 (or both)

In many ways this is a matter of personal choice. If you allow SLAAC, hosts which want to use privacy extension addresses (like Android phones) can do so, which is a good thing. If you also allow DHCPv6 address selection you will have a list of addresses assigned to hosts and dnsmasq will do DNS resolution for them (although it can do DNS for SLAAC addresses provided it gets told about them). A special tag ‘constructor’ exists for the ‘dhcp-range’ option which tells it to construct the supplied address (for either RA or DHCPv6) from the IPv6 global prefix of the specified interface, which is how you pass out your delegated prefix addresses. The modes for ‘dhcp-range’ are ‘ra-only’ to disallow DHCPv6 entirely, ‘slaac’ to allow DHCPv6 address selection and ‘ra-stateless’ to disallow DHCPv6 address selection but allow other DHCPv6 configuration information.

Based on trial and error (and finally examining the scripting in /etc/init.d/dnsmasq) the OpenWRT options required to achieve the above dnsmasq options are

config dhcp lan
        option interface lan
        option start 100
        option limit 150
        option leasetime 1h
        option dhcpv6 'server'
        option ra_management '1'
        option ra 'server'

with ‘ra_management’ as the key option with ‘0’ meaning SLAAC with DHCPv6 options, ‘1’ meaning SLAAC with full DHCPv6, ‘2’ meaning DHCPv6 only and ‘3’ meaning SLAAC only. Another OpenWRT oddity is that there doesn’t seem to be a way of setting the lease range: it always defaults to either static only or ::1000 to ::ffff.
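To make the ‘constructor’ behaviour and the default lease range concrete, here is a rough sketch of what combining an interface prefix with the ::1000 to ::ffff suffixes produces. The helper name construct_range is invented for this illustration, and the /64 used is the lan subnet from the earlier example; dnsmasq does this internally, so the code only mirrors the arithmetic.

```python
import ipaddress


def construct_range(interface_prefix, start_suffix, end_suffix):
    """Join an interface's global prefix with the suffixes of a lease range.

    Hypothetical helper: it imitates the effect of the dnsmasq
    'constructor' tag rather than reproducing dnsmasq's own code.
    """
    net = ipaddress.IPv6Network(interface_prefix)
    base = int(net.network_address)
    host_mask = int(net.hostmask)

    def join(suffix):
        # Keep only the host bits of the suffix, then add the prefix bits.
        return ipaddress.IPv6Address(base | (int(ipaddress.IPv6Address(suffix)) & host_mask))

    return join(start_suffix), join(end_suffix)


start, end = construct_range("2601:600:8280:66d1::/64", "::1000", "::ffff")
print(start)  # 2601:600:8280:66d1::1000
print(end)    # 2601:600:8280:66d1::ffff
```

If the delegated prefix changes, only the first argument changes; the suffixes stay the same, which is the point of constructing the range from the interface.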

Firewall Configuration

One of the things that trips people up is the fact that Linux has two completely separate firewalls: one for IPv4 and one for IPv6. If you’ve ever written any custom rules for them, the chances are you did it in the OpenWRT /etc/firewall.user file and you used the iptables command, which means you only added the rules to the IPv4 firewall. To add the same rule for IPv6 you need to duplicate it using the ip6tables command. Another significant problem, if you’re using connection tracking for port knock detection like I am, is that Linux connection tracking has difficulty with IPv6 multicast, so packets that go out to a multicast address but come back as unicast (as most of the discovery protocols do) get the wrong conntrack state. To fix this, I eventually had to have an INPUT rule just accepting all ICMPv6 and DHCPv6 (udp ports 546 [client] and 547 [server]). Another firewall consideration is that, now everyone has their own IP address, there’s no need to NAT (OpenWRT can be persuaded to take care of this automatically, but if you’re duplicating the IPv4 rules manually, don’t duplicate the NAT rules). The final one is likely more applicable to me: my wifi interface is designed to be an extension of the local internet and all machines connecting to it are expected to be able to protect themselves since they’ll migrate to such hostile environments as airport wifi, thus I do complete exposure of wifi connected devices to the general internet for all ports, including port probes. For my internal devices, I have a RELATED,ESTABLISHED rule to make sure they’re not probed since they’re not designed to migrate off the internal network.

Now the problems with OpenWRT: since you want NAT on IPv4 but not on IPv6, you have to have two separate wan zones for them. If you try to combine them (as I first did), then OpenWRT will add an IPv6 --ctstate INVALID rule which will prevent Neighbour Discovery from working because of the conntrack issues with IPv6 multicast. So my wan zones are (well, this is a lie because my firewall is now hand crafted, but this is what I checked worked before I put the hand crafted firewall in place):

config zone
        option name 'wan'
        option network 'wan'
        option masq '1'
        ...

config zone
        option name 'wan6'
        option network 'wan6'
        ...

And the forwarding rules for the lan zone (fully accessible) are

config forwarding
        option src 'lan'
        option dest 'wan'

config forwarding
        option src 'lan'
        option dest 'wan6'

config forwarding
        option src 'wan6'
        option dest 'lan'

Putting it Together: Getting the Clients IPv6 Connected

Now that you have your router configured, everything should just work. If it did, your laptop wifi interface should now have a global IPv6 address

ip -6 address show dev wlan0

If that comes back empty, you need to enable IPv6 on your distribution. If it has only a link local (fe80:: prefix) address, IPv6 is enabled but your router isn’t advertising (suspect firewall issues with discovery packets or dnsmasq misconfiguration). If you see a global address, you’re done. Now you should be able to go to https://testv6.com and secure a 10/10 score.
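The three outcomes above can be told apart mechanically. The following is a minimal sketch that classifies the text printed by ip -6 address show; the function name classify_ipv6 and the sample output are invented for illustration, and "global" is taken to mean any address inside 2000::/3.

```python
import ipaddress
import re

GLOBAL_UNICAST = ipaddress.IPv6Network("2000::/3")


def classify_ipv6(ip_output):
    """Classify the output of `ip -6 address show dev <iface>`.

    Returns "none", "no-global" or "global" (hypothetical labels).
    """
    found = re.findall(r"inet6\s+([0-9a-fA-F:]+)/\d+", ip_output)
    addrs = [ipaddress.IPv6Address(a) for a in found]
    if not addrs:
        return "none"          # IPv6 is not enabled on this interface
    if any(a in GLOBAL_UNICAST for a in addrs):
        return "global"        # the router advertisement was received
    return "no-global"         # e.g. only an fe80:: link-local address


# Made-up sample output, not captured from a real machine.
SAMPLE = """\
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
    inet6 2601:600:8280:66d1::1234/64 scope global dynamic
    inet6 fe80::1234:5678:9abc:def0/64 scope link
"""

print(classify_ipv6(SAMPLE))  # global
```

A "no-global" result corresponds to the case where the router isn’t advertising, so that is the point at which to look at the firewall and dnsmasq configuration.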

The final piece of the puzzle is preferring your new IPv6 connection when DNS offers a choice of IPv4 or IPv6 addresses. All modern Linux clients should prefer IPv6 when available if connected to a dual stack network, so try … if you ping, say, www.google.com and see an IPv6 address you’re done. If not, you need to get into the murky world of IPv6 address labelling (RFC 6724) and gai.conf.
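Another way to see which family a client prefers, without reading ping output, is to ask the resolver directly. This is a sketch assuming a glibc based Linux system, where getaddrinfo returns results already sorted by the address selection rules; the function name preferred_family is made up for illustration.

```python
import socket


def preferred_family(addrinfo):
    """Name the address family of the first getaddrinfo() result.

    Hypothetical helper; returns None when the list is empty.
    """
    if not addrinfo:
        return None
    family = addrinfo[0][0]
    names = {socket.AF_INET6: "IPv6", socket.AF_INET: "IPv4"}
    return names.get(family, str(family))


def check_preference(host, port=443):
    """Resolve host (needs working DNS) and report the preferred family."""
    results = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return preferred_family(results), results[0][4][0]
```

Calling check_preference("www.google.com") on a dual stack client should return "IPv6" together with an IPv6 address if the preference is working; if it returns "IPv4", that is the cue to look at gai.conf.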

Conclusion

Adding IPv6 to an existing IPv4 setup is currently not a simple plug in and go operation. However, provided you understand a handful of differences between the two protocols, it’s not an insurmountable problem either. I have also glossed over many of the problems you might encounter with your ISP. Some people have reported that their ISPs only hand out one IPv6 address with no prefix delegation, in which case I think finding a new ISP would be wisest. Others report that the ISP will only delegate one /64 prefix, so your choice here is either to run only one subnet (likely sufficient for a lot of home networks), or to subnet at greater than /64 and forbid SLAAC, which is definitely not a recommended configuration. However, provided your ISP is reasonable, this blog post should at least help get you started.