Category Archives: Embedded

Debugging Android Early Boot Failures

Back in my blog post about Securing the Google SIP Stack, I did say I’d look at re-enabling SIP in Android-12, so with a view to doing that I tried building and booting LineageOS 19.1, but it crashed really early in the boot sequence (after the boot splash but before the boot animation started). It turns out that information on debugging the android early boot sequence is a bit scarce, so I thought I should write a post about how I did it just in case it helps someone else who’s struggling with a similar early boot problem.

How I usually Build and Boot Android

My builds are standard LineageOS with my patches to fix SIP and not much else. However, I do replace the debug keys with my signing keys and I also have an AVB key installed in the phone’s third party keyslot with which I sign the vbmeta for boot. This actually means that my phone is effectively locked but with a user supplied key (Yellow as google puts it).

My phone is now a pixel 3 (I had to say goodbye to the old Nexus One thanks to the US 3G turn off) and I do have a slightly broken Pixel 3 I play with for experimental patches, which is where I was trying to install Android-12.

Signing Seems to be the Problem

Just to verify my phone could actually boot a stock LineageOS (it could) I had to unlock it and this lead to the discovery that once unlocked, it would also boot my custom rom as well, so whatever was failing in early boot seemed to be connected with the device being locked.

I also discovered an interesting bug in the recovery rom fastboot: If you’re booting locked with your own keys, it will still let you perform all the usually forbidden fastboot commands (the one I was using was set_active). It turns out to be because of a bug in AOSP which treats yellow devices as unlocked in fastboot. Somewhat handy for debugging, but not so hot for security …

And so to Debugging Early Boot

The big problem with Android is there’s no way to get the console messages for early boot. Even if you enable adb early, it doesn’t get started until quite far in to the boot animation (which was way after the crash I was tripping over). However, android does have a pstore (previously ramoops) driver that can give you access to the previously crashed boot’s kernel messages (early init, fortunately, mostly logs to the kernel message log).

Forcing init to crash on failure

Ordinarily an init failure prints a message and reboots (to the bootloader), which doesn’t excite pstore into saving the kernel message log. fortunately there is a boot option (androidboot.init_fatal_panic) which can be set in the boot options (or kernel command line for a pixel-3 which can only boot the 4.9 kernel). If you build your own android, it’s fairly easy to add things to the android commandline (which is in boot.img) because all you need to do is extract BOOT/cmdline from the intermediate zip file you sign add any boot options you need and place it back in the zip file (before you sign it).

Unfortunately, this expedient didn’t work (no console logs appear in pstore). I did check that init was correctly panic’ing on failure by inducing an init failure in recovery mode and observing the panic (recovery mode allows you to run adb). But this induced panic also didn’t show up in pstore, meaning there’s actually some problem with pstore and early panics.

Security is the problem (as usual)

The actual problem turned out to be security (as usual): The pixel-3 does encrypted boot panic logs. The way this seems to work (at least in my reading of the google additional pstore patches) is that the bootloader itself encrypts the pstore ram area with a key on the /data partition, which means it only becomes visible after the device is unlocked. Unfortunately, if you trigger a panic before the device is unlocked (by echoing ‘c’ to /proc/sysrq-trigger) the panic message is lost, so pstore itself is useless for debugging early boot. There seems to be some communication of the keys by the vendor proprietary ramoops binary making it very difficult to figure out how it’s being done.

Why the early panic message is lost is a bit mysterious, but unfortunately pstore on the pixel-3 has several proprietary components around the encrypted message handling that make it hard to debug. I suspect if you don’t set up the pstore encryption keys, the bootloader erases the pstore ram area instead of encrypting it, but I can’t prove that.

Although it might be possible to fix the pstore drivers to preserve the ramoops from before device unlock, the participation of the proprietary bootloader in preserving the memory doesn’t make that look like a promising avenue to explore.

Anatomy of the Pixel-3 Boot Sequence

The Pixel-3 device boots through recovery. What this means is that the initial ramdisk (from boot.img) init is what boots both the recovery and normal boot paths. The only difference is that for recovery (and fastboot), the device stays in the ramdisk and for normal boot it mounts the /system partition and pivots to it. What makes this happen or not is the boot flag androidboot.force_normal_boot=1 which is added by the bootloader. Pretty much all the binary content and init rc files in the ramdisk are for recovery and its allied menus.

Since the boot paths are pretty radically different, because the normal boot first pivots to a first stage before going on to a second, but in the manner of containers, it might be possible to boot recovery first, start a dmesg logger and then re-exec init through the normal path

Forcing Re-Exec

The idea is to signal init to re-exec itself for the normal path. Of course, there have to be a few changes to do this: An item has to be added to the recovery menu to signal init and init itself has to be modified to do the re-exec on the signal (note you can’t just kick off an init with a new command line because init must be pid 1 for booting). Once this is done, there are problems with selinux (it won’t actually allow init to re-exec) and some mount moves. The selinux problem is fixable by switching it from enforcing to permissive (boot option androidboot.selinux=permissive) and the mount moves (which are forbidden if you’re running binaries from the mount points being moved) can instead become bind mounts. The whole patch becomes 31 insertions across 7 files in android_system_core.

The signal I chose was SIGUSR1, which isn’t usually used by anything in the bootloader and the addition of a menu item to recovery to send this signal to init was also another trivial patch. So finally we have a system from which I can start adb to trace the kernel log (adb shell dmesg -w) and then signal to init to re-exec. Surprisingly this worked and produced as the last message fragment:

[ 190.966881] init: [libfs_mgr]Created logical partition system_a on device /dev/block/dm-0
[ 190.967697] init: [libfs_mgr]Created logical partition vendor_a on device /dev/block/dm-1
[ 190.968367] init: [libfs_mgr]Created logical partition product_a on device /dev/block/dm-2
[ 190.969024] init: [libfs_mgr]Created logical partition system_ext_a on device /dev/block/dm-3
[ 190.969067] init: DSU not detected, proceeding with normal boot
[ 190.982957] init: [libfs_avb]Invalid hash size:
[ 190.982967] init: [libfs_avb]Failed to verify vbmeta digest
[ 190.982972] init: [libfs_avb]vbmeta digest error isn't allowed
[ 190.982980] init: Failed to open AvbHandle: No such file or directory
[ 190.982987] init: Failed to setup verity for '/system': No such file or directory
[ 190.982993] init: Failed to mount /system: No such file or directory
[ 190.983030] init: Failed to mount required partitions early …
[ 190.983483] init: InitFatalReboot: signal 6
[ 190.984849] init: #00 pc 0000000000123b38 /system/bin/init
[ 190.984857] init: #01 pc 00000000000bc9a8 /system/bin/init
[ 190.984864] init: #02 pc 000000000001595c /system/lib64/libbase.so
[ 190.984869] init: #03 pc 0000000000014f8c /system/lib64/libbase.so
[ 190.984874] init: #04 pc 00000000000e6984 /system/bin/init
[ 190.984878] init: #05 pc 00000000000aa144 /system/bin/init
[ 190.984883] init: #06 pc 00000000000487dc /system/lib64/libc.so
[ 190.984889] init: Reboot ending, jumping to kernel

Which indicates exactly where the problem is.

Fixing the problem

Once the messages are identified, the problem turns out to be in system/core ec10d3cf6 “libfs_avb: verifying vbmeta digest early”, which is inherited from AOSP and which even says in in it’s commit message “the device will not boot if: 1. The image is signed with FLAGS_VERIFICATION_DISABLED is set 2. The device state is locked” which is basically my boot state, so thanks for that one google. Reverting this commit can be done cleanly and now the signed image boots without a problem.

I note that I could also simply add hashtree verification to my boot, but LineageOS is based on the eng target, which has FLAGS_VERIFICATION_DISABLED built into the main build Makefile. It might be possible to change it, but not easily I’m guessing … although I might try fixing it this way at some point, since it would make my phones much more secure.

Conclusion

Debugging android early boot is still a terribly hard problem. Probably someone with more patience for disassembling proprietary binaries could take apart pixel-3 vendor ramoops and figure out if it’s possible to get a pstore oops log out of early boot (which would be the easiest way to debug problems). But failing that the simple hack to re-exec init worked enough to show me where the problem was (of course, if init had continued longer it would likely have run into other issues caused by the way I hacked it).

Anatomy of the UEFI Boot Sequence on the Intel Galileo

3 Replies

The Basics

UEFI boot officially has three phases (SEC, PEI and DXE). However, the DXE phase is divided into DXEBoot and DXERuntime (the former is eliminated after the call to ExitBootSerivices()). The jobs of each phase are

SEC (SECurity phase). This contains all the CPU initialisation code from the cold boot entry point on. It’s job is to set the system up far enough to find, validate, install and run the PEI.
PEI (Pre-Efi Initialization phase). This configures the entire platform and then loads and boots the DXE.
DXE (Driver eXecution Environment). This is where the UEFI system loads drivers for configured devices, if necessary; mounts drives and finds and executes the boot code. After control is transferred to the boot OS, the DXERuntime stays resident to handle any OS to UEFI calls.

How it works on Quark

This all sounds very simple (and very like the way an OS like Linux boots up). However, there’s a very crucial difference: The platform really is completely unconfigured when SEC begins. In particular it won’t have any main memory, so you begin in a read only environment until you can configure some memory. C code can’t begin executing until you’ve at least found enough writable memory for a stack, so the SEC begins in hand crafted assembly until it can set up a stack.

On all x86 processors (including the Quark), power on begins execution in 16 bit mode at the ResetVector (0xfffffff0). As a helping hand, the default power on bus routing has the top 128KB of memory mapped into the top of SPI flash (read only, of course) via a PCI routing in the Legacy Bridge, meaning that the reset vector executes directly from the SPI Flash (this is actually very slow: SPI means Serial Peripheral Interface, so every byte of SPI flash has to be read serially into the instruction cache before it can be executed).

The hand crafted assembly clears the cache, transitions to Flat32 bit execution mode and sets up the necessary x86 descriptor tables. It turns out that memory configuration on the Quark SoC is fairly involved and complex so, in order to spare the programmer from having to do this all in assembly, there’s a small (512kB) static ram chip that can be easily configured, so the last assembly job of the SEC is to configure the eSRAM (to a fixed address at 2GB), set the top as the stack, load the PEI into the base (by reconfiguring the SPI flash mapping to map the entire 8MB flash to the top of memory and then copying the firmware volume containing the PEI) and begin executing.

QuarkPlatform Build Oddities

Usually the PEI code is located by the standard Flash Volume code of UEFI and the build time PCDs (Platform Configuration Database entries) which use the values in the Flash Definition File to build the firmware. However, the current Quark Platform package has a different style because it rips apart and rebuilds the flash volumes, so instead of using PCDs, it uses something it calls Master Flash Headers (MFHs) which are home grown for Quark. These are a fixed area of the flash that can be read as a database giving the new volume layout (essentially duplicating what the PCDs would normally have done). Additionally the Quark adds a non-standard signature header occupying 1k to each flash volume which serves two purposes: For the SECURE_LD case, it actually validates the volume, but for the three items in the firmware that don’t have flash headers (the kernel, the initrd and the grub config) it serves to give the lengths of each.

Laying out Flash Rom

This is a really big deal for most embedded systems because the amount of flash available is really limited. The Galileo board is nice because it supplies 8MB of flash … which is huge in embedded terms. All flash is divided into Flash Volumes¹. If you look at OVMF for instance, it builds its flash as four volumes: Three for the three SEC, PEI and DXE phases and one for the EFI variables. In EdkII, flash files are built by the flash definition file (the one with a .fdf ending). Usually some part of the flash is compressed and has to be inflated into memory (in OVMF this is PEI and DXE) and some are designed to be execute in place (usually SEC). If you look at the Galileo layout, you see that it has a big SEC phase section (called BOOTROM_OVERRIDE) designed for the top 128kb of the flash , the usual variable area and then five additional sections, two for PEI and DXE and three recovery ones. (and, of course, an additional payload section for the OS that boots from flash).

Embedded Recovery Sections

For embedded devices (and even normal computers) recovery in the face of flash failure (whether from component issues or misupdate of the flash) is really important, so the Galileo follows a two stage fallback process. The first stage is to detect a critical error signalled by the platform sticky bit, or recovery strap in the SEC and boot up to the fixed phase recovery which tries to locate a recovery capsule on the USB media². The other recovery is a simple copy of the PEI image for fallback in case the primary PEI image fails (by now you’ll have realised there are three separate but pretty much identical copies of PEI in the flash rom). One of the first fixes that can be made to the Quark build is to consolidate all of these into a single build description.

Putting it all together: Implementing a compressed PEI Phase

One of the first things I discovered when trying to update the UEFI version to something more modern is that the size of the PEI phase overflows the allowed size of the firmware volume. This means either redo the flash layout or compress the PEI image. I chose the latter and this is the story of how it went.

The first problem is that debug prints don’t work in the SEC phase, which is where the changes are going to have to be. This is a big stumbling block because without debugging, you never know where anything went wrong. It turns out that UEFI nicely supports this via a special DebugLib that outputs to the serial console, but that the Galileo firmware build has this disabled by this line:

[LibraryClasses.IA32.SEC]
...
 DebugLib|MdePkg/Library/BaseDebugLibNull/BaseDebugLibNull.inf

The BaseDebugLibNull does pretty much what you expect: throws away all Debug messages. When this is changed to something that outputs messages, the size of the PEI image explodes again, mainly because Stage1 has all the SEC phase code in it. The fix here is only to enable debugging inside the QuarkResetVector SEC phase code. You can do this in the .dsc file with

 QuarkPlatformPkg/Cpu/Sec/ResetVector/QuarkResetVector.inf {
   <LibraryClasses>
     DebugLib|MdePkg/Library/BaseDebugLibSerialPort/BaseDebugLibSerialPort.inf
 }

And now debugging works in the SEC phase!

It turns out that a compressed PEI is possible but somewhat more involved than I imagined so that will be the subject of the next blog post. For now I’ll round out with other oddities I discovered along the way

Quark Platform SEC and PEI Oddities

On the current quark build, the SEC phase is designed to be installed into the bootrom from 0xfffe 0000 to 0xffff ffff. This contains a special copy of the reset vector (In theory it contains the PEI key validation for SECURE_LD, but in practise the verifiers are hard coded to return success). The next oddity is that the stage1 image, which should really just be the PEI core actually contains another boot from scratch SEC phase, except this one is based on the standard IA32 reset vector code plus a magic QuarkSecLib and then the PEI code. This causes the stage1 bring up to be different as well, because usually, the SEC code locates the PEI core in stage1 and loads, relocates and executes it starting from the entry point PeiCore(). However, quark doesn’t do this at all. It relies on the Firmware Volume generator populating the first ZeroVector (an area occupying the first 16 bytes of the Firmware Volume Header) with a magic entry (located in the ResetVector via the magic string ‘SPI Entry Point ‘ with the trailing space). The SEC code indirects through the ZeroVector to this code and effectively re-initialises the stack and begins executing the new SEC code, which then locates the internal copy of the PEI core and jumps to it.

Secure Boot on the Intel Galileo

An Aside about vga over serial consoles

If, like me, you’re interested in the secure boot aspect, you’re going to be spending a lot of time interacting with the VGA console over the serial line and you want it to look the part. The default VGA PC console is very much stuck in an 80s timewarp. As a result, the world has moved on and it hasn’t leading to console menus that look like this

This image is directly from the Intel build docs and actually, this is Windows, which is also behind the times³; if you fire this up from an xterm, the menu looks even worse. To get VGA looking nice from an xterm, first you need to use a vga font (these have the line drawing characters in the upper 128 bytes), then you need to turn off UTF-8 (otherwise some of the upper 128 character get seen as UTF-8 encodings), turn off the C1 control characters and set a keyboard mapping that sends what UEFI is expecting as F7. This is what I end up with

xterm -fn vga -geometry 80x25 +u8 -kt sco -k8 -xrm "*backarrowKeyIsErase: false"

And don’t expect the -fn vga to work out of the box on your distro either … vga fonts went out with the ark, so you’ll probably have to install it manually. With all of this done, the result is

And all the function keys work.

Back to our regularly scheduled programming: Secure Boot

The Quark build environment comes with a SECURE_LD mode, which, at first sight, does look like secure boot. However, what it builds is a cryptographic verification system for the firmware payloads (and disables the shell for good measure). There is also an undocumented SECURE_BOOT build define instead; unfortunately this doesn’t even build (it craps out trying to add a firmware menu: the Quark is embedded and embedded don’t have firmware menus). Once this is fixed, the image will build but it won’t boot properly. The first thing to discover is that ASSERT() fails silently on this platform. Why? Well if you turn on noisy assert, which prints the file and line of the failure, you’ll find that the addition of all these strings make the firmware volumes too big to fit into their assigned spaces … Bummer. However printing nothing is incredibly useless for debugging, so just add a simple print to let you know when ASSERT() failure is hit. It turns out there are a bunch of wrong asserts, including one put in deliberately by intel to prevent secure boot from working because they haven’t tested it. Once you get rid of these, it boots and mostly works.

Mostly because there’s one problem: authenticating by enrolled binary hashes mostly fails. Why is this? Well, it turns out to be a weakness in the Authenticode spec which UEFI secure boot relies on. The spec is unclear on padding and alignment requirements (among a lot of other things; after all, why write a clear spec when there’s only going to be one implementation … ?). In the old days, when we used crcs, padding didn’t really matter because additional zeros didn’t affect the checksum but in these days of cryptographic hashes, it does matter. The problem is that the signature block appended to the EFI binary has an eight byte alignment. If you add zeros at the end to achieve this, those zeros become part of the hash. That means you have to hash IA32 binaries as if those padded zeros existed otherwise the hash is different for signed and unsigned binaries. X86-64 binaries seem to be mostly aligned, so this isn’t noticeable for them. This problem is currently inherent in edk2, so it needs patching manually in the edk2 tree to get this working.

With this commit, the Quark build system finally builds a working secure boot image. All you need to do is download the ia32 efitools (version 1.5.2 or newer — turns out I also didn’t compute hashes correctly) to an sd card and you’re ready to play.

Adventures in Embedded UEFI with Intel Galileo

4 Replies

At one of the Intel Technology Days conferences a while ago, Intel gave us a gift of a Galileo board, which is based on the Quark SoC, just before the general announcement. The promise of the Quark SoC was that it would be a fully open (down to the firmware) embedded system based on UEFI. When the board first came out, though, the UEFI code was missing (promised for later), so I put it on a shelf and forgot about it. Recently, the UEFI Security Subteam has been considering issues that impinge on embedded architectures (mostly arm) so having an actual working embedded development board could prove useful. This is the first part of the story of trying to turn the Galileo into an embedded reference platform for UEFI.

The first problem with getting the Galileo working is that if you want to talk to the UEFI part it’s done over a serial interface, with a 3.5″ jack connection. However, a quick trip to amazon solved that one. Equipped with the serial interface, it’s now possible to start running UEFI binaries. Just using the default firmware (with no secure boot) I began testing the efitools binaries. Unfortunately, they were building the size of the secure variables (my startup.nsh script does an append write to db) and eventually the thing hit an assert failure on entering the UEFI handoff. This led to the discovery that the recovery straps on the board didn’t work, there’s no way to clear the variable NVRAM and the only way to get control back was to use an external firmware flash tool. So it’s off for an unexpected trip to uncharted territory unless I want the board to stay bricked.

The flash tool Intel recommends is the Dediprog SF 100. These are a bit expensive (around US$350) and there’s no US supplier, meaning you have to order from abroad, wait ages for it to be delivered and deal with US customs on import, something I’ve done before and have no wish to repeat. So, casting about for a better solution, I came up with the Bus Pirate. It’s a fully open hardware component (including a google code repository for the firmware and the schematics) plus it’s only US$35 (that’s 10x cheaper than the dediprog), available from Amazon and works well with Linux (for full disclosure, Amazon was actually sold out when I bought mine, so I got it here instead).

The Bus Pirate comes as a bare circuit board (no case or cables), so you have to buy everything you might need to use it extra. I knew I’d need a ribbon cable with SPI plugs (the Galileo has an SPI connector for the dediprog), so I ordered one with the card. The first surprise when the card arrived was that the USB connector is actually Mini B not the now standard Micro connector. I’ve not seen one of those since I had an Android G1, but, after looking in vain for my old android one, Staples still has the cables. The next problem is that, being open hardware, there are multiple manufacturers. They all supply a nice multi coloured ribbon cable, but there are two possible orientations and both are produced. Turns out I have a sparkfun cable which is the opposite way around from the colour values in the firmware (which is why the first attempt to talk to the chip didn’t work). The Galileo has diode isolators so the SPI flash chip can be powered up and operated independently by the Bus Pirate; accounting for cable orientation, and disconnecting the Galileo from all other external power, this now works. Finally, there’s a nice Linux project, flashrom, which allows you to flash all manner of chips and it has a programmer mode for the Bus Pirate. Great, except that the default USB serial speed is 115200 and at that baud rate, it takes about ten minutes just to read an 8MB SPI flash (flashrom will read, program and then verify, giving you about 25 mins each time you redo the firmware). Speeding this up is easy: there’s an unapplied patch to increase the baud rate to 2Mbit and I wrote some code to flash only designated areas of the chip (note to self: send this upstream). The result is available on the OpenSUSE build service. The outcome is that I’m now able to build and reprogram the firmware in around a minute.

By now this is two weeks, a couple of hacks to a tool I didn’t know I’d need and around US$60 after I began the project, but at least I’m now an embedded programmer and have the scars to prove it. Next up is getting Secure Boot actually working ….

James Bottomley's random Pages

A collection of information