Author Archives: jejb

Fixing our Self Defeating Licence Compatibility Problems in Open Source

Much angst (and discussion ink) is wasted in open source over whether pulling in code from one project with a different licence into another is allowable based on the compatibility of the two licences. I call this problem self defeating because it creates sequestered islands of incompatibly licensed but otherwise fully open source code that can never ever meet in combination. Everyone from the most permissive open source person to the most ardent free software one would agree this is a problem that should be solved, but most of the islands would only agree to it being solved on their terms. Practically, we have got around this problem by judicious use of dual licensing but that requires permission from the copyright holders, which can sometimes be hard to achieve; so dual licensing is more a band aid than a solution.

In this blog post, I’m going to walk you through the reasons behind cone the most intractable compatibility disputes in open source: Apache-2 vs GPLv2. However, before we get there, I’m first going to walk through several legal issues in general contract and licensing law and then get on to the law and politics of open source licensing.

The Law of Contracts and Licences

Contracts and Licences come from very similar branches of the law and concepts that apply to one often apply to the other. For this legal tour we’ll begin with materiality in contracts followed by licences then look at repairable and irreparable legal harms and finally the conditions necessary to take court action.

Materiality in Contracts

This is actually a well studied and taught bit of the law. The essence is that every contract has a “heart” or core set of clauses which really represent what the parties want from each other and often has a set of peripheral clauses which don’t really affect the “heart” of the contract if they’re not fulfilled. Not fulfilling the latter are said to cause non-material breaches of the contract (i.e. breaches which don’t terminate the contract if they happen, although a party may still have an additional legal claim for the breach if it caused some sort of harm). A classic illustration, often used in law schools, is a contract for electrical the electrical wiring of a house that specifies yellow insulation. The contractor can’t find yellow, so wires the house with blue insulation. The contract doesn’t suffer a material breach because the wires are in the wall (where no-one can see) and there’s no safety issue with the colour and the heart of the contract was about wiring the house not about wire colour.

Materiality in Licensing

This is actually much less often discussed, but it’s still believed that licences are subject to the same materiality constraints as contracts and for this reason, licences often contain “materiality clauses” to describe what the licensor considers to be material to it. So for the licensing example, consider a publisher wishing to publish a book written by a famous author known as the “Red Writer”. A licence to publish for per copy royalties of 25% of the purchase price of the book is agreed but the author inserts a clause specifying by exact pantone number the red that must be the predominant colour of the binding (it’s why they’re known as the “Red Writer”) and also throws in a termination of copyright licence for breaches clause. The publisher does the first batch of 10,000 copies, but only after they’ve been produced discovers that the red is actually one pantone shade lighter than that specified in the licence. Since the cost of destroying the batch and reprinting is huge, the publisher offers the copies for sale knowing they’re out of spec. Some time later the “Red Writer” comes to know of the problem, decides the licence is breached and therefore terminated, so the publisher owes statutory damages (yes, they’ve registered their copyright) per copy on 10,000 books (about $300 million maximum), would the author win?

The answer of course is that no court is going to award the author $300 million. Most courts would take the view that the heart of the contract was about money and if the author got their royalties per book, there was no material breach and the licence continues in force for the publisher. The “Red Writer” may have a separate tort claim for reputational damage if any was caused by the mis-colouring of the book, but that’s it.

Open Source Enforcement and Harm

Looking at the examples above, you can see that most commercial applications of the law eventually boil down to money: you go to court alleging a harm, the court must agree and then assess the monetary compensation for the harm which becomes damages. Long ago in community open source, we agreed that money could never compensate for a continuing licence violation because if it could we’d have set a price for buying yourself out of the terms of the licence (and some Silicon Valley Rich Companies would actually be willing to pay it, since it became the dual licence business model of companies like MySQL)1. The principle that mostly applies in open source enforcement actions is that the harm is to the open source ecosystem and is caused by non-compliance with the licence. Since such harm can only be repaired by compliance that’s the essence of the demand. Most enforcement cases have been about egregious breaches: lack of any source code rather than deficiencies in the offer to provide source code, so there’s actually very little in court records with regard to materiality of licence breaches.

One final thing to note about enforcement cases is there must always be an allegation of material harm to someone or something because you can’t go into court and argue on abstract legal principles (as we seem to like to do in various community mailing lists), you must show actual consequences as well. In addition to consequences, you must propose a viable remedy for the harm that a court could impose. As I said above in open source cases it’s often about harms to the open source ecosystem caused by licence breaches, which is often accepted unchallenged by the defence because the case is about something obviously harmful to open source, like failure to provide source code (and the remedy is correspondingly give us the source code). However, when considering about the examples below it’s instructive to think about how an allegation of harm around a combination of incompatible open source licences would play out. Since the source code is available, there would be much more argument over what the actual harm to the ecosystem, if any, was and even if some theoretical harm could be demonstrated, what would the remedy be?

Applying this to Apache-2 vs GPLv2

The divide between the Apache Software Foundation (ASF) and the Free Software Foundation (FSF) is old and partly rooted in politics. For proof of this notice the FSF says that the two licences (GPLv2 and Apache-2) are legally incompatible and in response the ASF says no-one should use any GPL licences anyway. The purpose of this section is to guide you through the technicalities of the incompatibility and then apply the materiality lessons from above to see if they actually matter.

Why GPLv2 is Incompatible with Apache-2

The argument is that Apache-2 contains two incompatible clauses: the patent termination clause (section 3) which says that if you launch an action against anyone alleging the licensed code infringes your patent then all your rights to patents in the code under the Apache-2 licence terminate; and the Indemnity clause (Section 9) which says that if you want to offer an a warranty you must indemnify every contributor against any liability that warranty might incur. By contrast, GPLv2 contains an implied patent licence (Section 7) and a No Warranty clause (Section 11). Licence scholars mostly agree that the patent and indemnity terms in GPLv2 are weaker than those in Apache-2.

The incompatibility now occurs because GPLv2 says in Section 2 that the entire work after the combination must be shipped under GPLv2, which is possible: Apache is mostly permissive except for the stronger patent and indemnity clauses. However, it is arguable that without keeping those stronger clauses on the Apache-2 code, you’ve violated the Apache-2 licence and the GPLv2 no additional restrictions clause (Section 6) prevents you from keeping the stronger licensing and indemnity clauses even on the Apache-2 portions of the code. Thus Apache-2 and GPLv2 are incompatible.

Materiality and Incompatibility

It should be obvious from the above that it’s hard to make a materiality argument for dropping the stronger apache2 provisions because someone, somewhere might one day get into a situation where they would have helped. However, we can look at the materiality of the no additional restrictions clause in GPLv2. The FSF has always taken the absolutist position on this, which is why they think practically every other licence is GPLv2 incompatible: when you dig at least one clause in every other open source licence can be regarded as an additional restriction. We also can’t take the view that the whole clause is not material: there are obviously some restrictions (like you must pay me for every additional distribution of the code) that would destroy the open source nature of the licence. This is the whole point of the no additional restrictions clause: to prevent the downstream addition of clauses incompatible with the free software goal of the licence.

I mentioned in the section on Materiality in Licences that some licences have materiality clauses that try to describe what’s important to the licensor. It turns out that GPLv2 actually does have a materiality clause: the preamble. We all tend to skip the preamble when analysing the licence, but there’s no denying it’s 7 paragraphs of justification for why the licence looks like it does and what its goals are.

So, to take the easiest analysis first, does the additional indemnity Apache-2 requires represent a material additional restriction. The preamble actually says “for each author’s protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors’ reputations.” Even on a plain reading an additional strengthening of that by providing an indemnity to the original authors has to be consistent with the purpose as described, so the indemnity clause can’t be regarded as a material additional restriction (a restriction which would harm the aims of the licence) when read in combination with the preamble.

Now the patent termination clause. The preamble has this to say about patents “Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone’s free use or not licensed at all.” So giving licensees the ability to terminate the patent rights for patent aggressors would appear to be an additional method of fulfilling the last sentence. And, again, the patent termination clause seems to be consistent with the licence purpose and thus must also not be a material additional restriction.

Thus the final conclusion is that while the patent and indemnity clauses of Apache-2 do represent additional restrictions, they’re not material additional restrictions according to the purpose of the licence as outlined by its materiality clause and thus the combination is permitted. This doesn’t mean the combination is free of consequences: the added code still carries the additional restrictions and you must call that out to the downstream via some mechanism like licensing tags, but it can be done.

Proving It

The only way to prove the above argument is to win in court on it. However, here lies the another good reason why combining Apache-2 and GPLv2 is allowed: there’s no real way to demonstrate harm to anything (either the copyright holder who agreed to GPLv2 or the Community) and without a theory of actual Harm, no-one would have standing to get to court to test the argument. This may look like a catch-22, but it’s another solid reason why, even in the absence of the materiality arguments, this would ultimately be allowed (if you can’t prevent it, it must be allowable, right …).

Community Problems with the Materiality Approach

The biggest worry about the loosening of the “no additional restrictions” clause of the GPL is opening the door to further abuse of the licence by unscrupulous actors. While I agree that this should be a concern, I think it is adequately addressed by rooting the materiality of the licence in the preamble or in provable harm to the open source community. There is also the flip side of this: licences are first and foremost meant to serve the needs of their development community rather than become inflexible implements for a group of enforcers, so even if there were some putative additional abuse in this approach, I suspect it would be outweighed by the licence compatibility benefit to the development communities in general.

Conclusion

The first thing to note is that Open Source incompatible licence combination isn’t as easy as simply combining the code under a single licence: You have to preserve the essential elements of both licences in the code which is combined (although not necessarily the whole project), so for an Apache-2/GPLv2 combination, you’ll need a note on the files saying they follow the stronger Apache patent termination and indemnity even if they’re otherwise GPLv2. However, as long as you’re careful the combination works for either of two reasons: because the Apache-2 restrictions aren’t material additional restrictions under the GPLv2 preamble or because no-one was actually harmed in the making of the combination (or both).

One can see from the above that similar arguments can be applied to various other supposedly incompatible licence combinations (exercise for the reader: try it with BSD-4-Clause and GPLv2). One final point that should be made is that licences and contracts are also all about what was in the minds of the parties, so for open source licences on community code, the norms and practices of the community matter in addition to what the licence actually says and what courts have made of it. In the final analysis, if the community norm of, say, a GPLv2 project is to accept Apache-2 code allowing for the stronger patent and indemnity clauses, then that will become the understood basis for interpreting the GPLv2 licence in that community.

For completeness, I should point out I’ve used the no harm no foul reasoning before when arguing that CDDL and GPLv2 are compatible.

Converting Engines to OpenSSL-3 Providers

Engines in OpenSSL have a long history of providing new algorithms (Russian GOST hash/signature etc) but they can also be used to interface external crypto tokens (pkcs#11) or even key managers like my own TPM engine. I’ve actually been using my TPM2 engine for nearly a decade so that I no longer have to have an unprotected private keys anywhere on my laptops (including for ssh). The purpose of this post is to look at the differences between Providers and Engines and give advice on the minimum necessary Provider implementation to give back all the Engine functionality. So this post is aimed at Engine developers who wish to convert to Providers rather than giving user advice for either.

TPMs and Engines

TPM2 actually has a remarkable number of algorithms: hashing, symmetric encryption, asymmetric signatures, key derivation, etc. However, most TPMs are connected to the host over very slow busses (usually serial), which means that no-one in their right mind would use a TPM for bulk data operations (like hashing or symmetric encryption) since it will take orders of magnitude longer than if the native CPU did it. Thus from an Engine point of view, the TPM is really only good for guarding private asymmetric keys and doing sign or decrypt operations on them, which are the only capabilities the TPM engine has.

Hashes and Signatures

Although I said above we don’t use the TPM for doing hashes, the TPM2_Sign() routines insist on knowing which hash they’re signing. For ECDSA signatures, this is irrelevant, since the hash type plays no part in the signature (it’s always truncated to key length and converted to a bignum) but for RSA the ASN.1 form of the hash description is part of the toBeSigned data. The problem now is that early TPM2’s only had two hash algorithms (sha1 and sha256) and the engine wanted to be able to use larger hash sizes. The solution was actually easy: lie about the hash size for ECDSA, so always give the hash that’s the width of the key (sha256 for NIST P-256 and sha384 for NIST P-384) and left truncate the passed in hash if larger or left zero pad if smaller.

For RSA, the problem is more acute, since TPM2_Sign() actually takes a raw digest and adds the hash description but the engine code sends down the fully described hash which merely needs to be padded if PKCS1 (PSS data is fully padded when sent down) and encrypted with the private key. The solution to this taken years ago was not to bother with TPM2_Sign() at all for RSA keys but instead to do a Decrypt operation1. This also means that TPM RSA engine keys are marked as decryption keys, not signing keys.

The Engine Itself

Given that the TPM is really only guarding the private keys, it only makes sense to substitute engine functions for the private key operations. Although the TPM can do public key operations, the core OpenSSL routines do them much faster and no information is leaked about the private key by doing them through OpenSSL, so Engine keys were constructed from standard OpenSSL keys by substituting a couple of private key methods from the underlying key types. One thing Engines were really bad at was passing additional parameters at key creation time and doing key wrapping. The result is that most Engines already have a separate tool to create engine keys (create_tpm2_key for the TPM2 engine) because complex arguments are needed for TPM specific things like key policy.

TPM keys are really both public and private keys combined and the public part of the key can be accessed without a password (unlike OpenSSL keys) or even access to the TPM that created the key. However, the engine code doesn’t usually know when only the public part of the key will be required and password prompting is done in OpenSSL at key loading (the TPM doesn’t need a password until key use), so usually after a TPM key is created, the public key is also separately derived using a pkey operation and used as a normal public key.

The final, and most problematic Engine feature, is key loading. Engine keys must be loaded using a special API (ENGINE_load_private_key). OpenSSL built in applications require you to specify the key type (-keyform option) but most well written OpenSSL applications simply try loading the PEM key first, then the DER key then the Engine key (since they all have different APIs), but frequently the Engine key is forgotten leading to the application having to be patched if you want to use them with any engine.

Converting Engines to Providers

The provider API has several pieces which apply to asymmetric key handling: Store, Encode/Decode, Key Management, Signing and Decryption (plus many more if you provide hashes or symmetric algorithms). One thing to remember about the store API is that if you only have file based keys, you should use the generic file store instead. Implementing your own store is only necessary if you also have a URI based input (like PKCS#11). In fact the TPM Engine has a URI for persistent keys, so the TPM store implementation will be dealt with later.

Provider Basics

If a provider is specified on the OpenSSL command line, it will become the sole provider of every algorithm. That means that providers like the TPM2 one, which only fill in a subset of functions cannot operate on their own and must always be used with another provider (usually the default one). After initialization (see below) all provider actions are governed by algorithm tables. One of the key questions for any provider is what to do about algorithm names and properties. Because the TPM2 provider relies on external providers for other algorithms, it must use consistent key names (so “EC” for Elliptic curve and “RSA” for RSA), even though it has only a single key type. There are also elements of the provider key managements, like the way Elliptic Curve keys change name to “ECDSA” for signing and “ECDH” for derivation, which is driven by the key management query operation function. As far as I can tell, this provides no benefit and merely serves to add complexity to the provider, so my provider doesn’t implement these functions and uses the same key names throughout.

The most mysterious string of all is the algorithm property one. The manual gives very little clue as to what should be in it besides “provider=<provider name>”. Empirically it seems to have input, output and structure elements, which are primarily used by encoders and decoders: input can be either der or pem and structure must be the same as the OSSL_OBJECT_PARAM_DATA_STRUCTURE string produced by the der decoder (although you are free to choose any name for this). output is even more varied and the best current list is provided by the source; however the only encoder the TPM2 provider actually provides is the text one.

One of the really nice things about providers is that when OpenSSL is presented with a key to load, every provider will be tried (usually in the order they’re specified on the command line) to decode and load the key. This completely fixes the problem with missing ENGINE_load_private_key() functions is applications because now all applications can use any provider key. This benefit alone is enough to outweigh all the problems of doing the actual conversion to a provider.

Replacing Engine Controls

Engine controls were key/value pairs passed into engines. The TPM2 engine has two: “PIN” for the parent authority and “NVPREFIX” for the prefix which identifies a non-volatile key. Although these can be passed in with the ENGINE_ctrl() functions, they were mostly set in the configuration file. This latter mechanism can be replaced with the provider base callback core_get_params(). Most engine controls actually set global variables and with the provider, they could be placed into the provider context. However, for code sharing it’s easier simply to keep the current globals mechanism.

Initialization and Contexts

Every provider has to have an OSSL_provider_init() routine which fills in a dispatch table and allocates a core context, which is passed in to every other context routine. For a provider, there’s really only one instance, so storing variables in the provider context is really no different (except error handling and actually getting destructors) from using static variables and since the engine used static variables, that’s what we’ll stick with. However, pretty much every routine will need an allocated library context, so it’s easiest to allocate at provider init time and pass it through as the provider context. The dispatch routine must contain a query_operation function, and probably needs a teardown function if you need to use a destructor, but nothing else.

All provider function groups require a newctx() and freectx() call. This is not optional because the current OpenSSL code calls them without checking so they cannot be NULL. Thus for function groups (like encoders and key management) where new contexts aren’t really required it makes sense to use pass through context functions that simply pass through the provider context for newctx() and do nothing for freectx().

The man page implies it is necessary to pick a load of functions from the in argument, but it seems unnecessary for those which the OpenSSL library already provides. I assume it’s something to do with a provider not requiring OpenSSL symbols, but it’s impossible to implement a provider today without relying on other OpenSSL functions than those which can be picked out of the in argument.

Decoders

Decoders are used to convert a read file from PEM to DER (this is essentially the same conversion for every provider, so it is strange you have to do this rather than it being done in the core routines) and then DER to an internal key structure. The remaining decoders take DER in and output a labelled key structure (which is used as a component of the EVP_PKEY), if you do both RSA and EC keys, you need one for each key type and, unfortunately, they must be provided and may not cross decode (the RSA decoder must reject EC keys and vice versa). This is actually required so the OpenSSL core can tell what type of key it has but is a royal pain for things like the TPM where the key DER is identical regardless of key type:

const OSSL_ALGORITHM decoders[] = {
	{ "DER", "provider=tpm2,input=pem", decode_pem_fns },
	{ "RSA", "provider=tpm2,input=der,structure=TPM2", decode_rsa_fns },
	{ "EC", "provider=tpm2,input=der,structure=TPM2", decode_ec_fns },
	{ NULL, NULL, NULL }
};

The decode_pem_fns can be cut and pasted from any provider with the sole exception that you probably have a different PEM guard string that you need to check for.

Then a sample decoder function set looks like:

static const OSSL_DISPATCH decode_rsa_fns[] = {
	{ OSSL_FUNC_DECODER_NEWCTX, (void (*)(void))tpm2_passthrough_newctx },
	{ OSSL_FUNC_DECODER_FREECTX, (void (*)(void))tpm2_passthrough_freectx },
	{ OSSL_FUNC_DECODER_DECODE, (void (*)(void))tpm2_rsa_decode },
	{ 0, NULL }
};

The main job of the DECODER_DECODE function is to take the DER form of the key and convert it to an internal PKEY and send that PKEY up by reference so it can be consumed by a key management load.

Encoders

By and large, engines all come with creation tools for key files, which means that while you could now use the encoder routines to create key files, it’s probably better off to stick with what you have (especially for things like the TPM that can have complex policy statements attached to keys), so you can omit providing any encoder functions at all. The only possible exception is if you want the keys pretty printing, you might consider a text output encoder:

const OSSL_ALGORITHM encoders[] = {
	{ "RSA", "provider=tpm2,output=text", encode_text_fns },
	{ "EC", "provider=tpm2,output=text", encode_text_fns },
	{ NULL, NULL, NULL }
};

Which largely follows the format for decoders:

static const OSSL_DISPATCH encode_text_fns[] = {
	{ OSSL_FUNC_ENCODER_NEWCTX, (void (*)(void))tpm2_passthrough_newctx },
	{ OSSL_FUNC_ENCODER_FREECTX, (void (*)(void))tpm2_passthrough_freectx },
	{ OSSL_FUNC_ENCODER_ENCODE, (void (*)(void))tpm2_encode_text },
	{ 0, NULL }
};

Note: there are many more encode/decode function types you could supply, but the above are the essential ones.

Key Management

Nothing in the key management functions requires the underlying key object to be reference counted since it belongs to an already reference counted EVP_PKEY structure in the OpenSSL generic routines. However, the signature operations can’t be implemented without context duplication and the signature context must contain a reference to the provider key so, depending on how the engine implements keys, duplicating via reference might be easier than duplicating via copy. The minimum functionality to implement is LOAD, FREE and HAS. If you are doing Elliptic Curve derive or reference counting your engine keys, you will also need NEW. You also have to provide both GET_PARAMS and GETTABLE_PARAMS (many key management functions have to implement pairs like this) for at least the BITS, SECURITY_BITS and SIZE properties)2.

You must also implement the EXPORT (and EXPORT_TYPES, which must be provided but has no callers) so that you can convert your engine key to an external public key. Note the EXPORT function must fail if asked to export the private key otherwise the default provider will try to do the private key operations via the exported key as well.

If you need to do Elliptic Curve key derivation you must also implement IMPORT (and IMPORT_TYPES) because the creation of the peer key (even though it’s a public one) will necessarily go through your provider key managment functions.

The HAS function can be problematic because OpenSSL doesn’t assume the interchangeability of public and private keys, even if it is true of the engine. Thus the engine must remember in the decode routines what key selector was used (public, private or both) and make sure to condition HAS on that value.

Signatures

This is one of the most confusing areas for simple signing devices (which don’t do hashing) because you’d assume you can implement NEWCTX, FREECTX, SIGN_INIT and SIGN and be done. Unfortunately, in spite of the fact that all the DIGEST_SIGN_… functions can be implemented in terms of the previous functions and generic hashing, they aren’t, so all providers are required to duplicate hashing and signing functions including constructing the binary ASN.1 for the certificate signature function (via GET_CTX_PARAMS and its pair GETTABLE_CTX_PARAMS). Another issue a sign only token will get into is padding: OpenSSL supports a variety of padding schemes (for RSA) but is deprecating their export, so if your token doesn’t do an expected form of padding, you’ll need to implement that in your provider as well. Recalling that the TPM2 provider uses RSA Decryption for signatures means that the TPM2 provider implementation is entirely responsible for padding all signatures. In order to try to come up with a common solution, I added an opensslmissing directory to my provider under the MIT licence that anyone is free to incorporate into their provider if they end up having the same digest and padding problems I did.

Decryption and Derivation

The final thing a private key provider needs to do is decryption. This is a very different operation between Elliptic Curve and RSA keys, so you need two different operations for each (OSSL_OP_ASYM_CIPHER for RSA and OSSL_OP_KEYEXCH for EC). Each ends up being a slightly special snowflake: RSA because it may need OAEP padding (which the TPM does) but with the most usual cipher being md5 (so OAEP padding with arbitrary mask and hash function is also in opensslmissing), which the TPM doesn’t do. and EC because it requires derivation from another public key. The problem with this latter operation is that because of the way OpenSSL works, the public key must be imported into the provider before it can be used, so you must provide NEW, IMPORT and IMPORT_TYPES routines for key management for this to happen.

Store

The store functions only need to be used if you have to load keys that aren’t file based (for file based keys the default provider file store will load them). For the TPM there are a set of NV Keys with 0x81 MSO prefix that aren’t file based. We load these in the engine with //nvkey:<hex> as the designator (and the //nvkey: prefix is overridable in the config file). To get this to work in the Provider is slightly problematic because the scheme (the //nvkey: prefix) must be specified as the provider algorithm_name which is usually a constant in a static array. This means that the stores actually can’t be static and must have the configuration defined name poked into it before the store is used, but this is relatively easy to arrange this in the OSSL_provider_init() function. Once this is done, it’s relatively easy to create a store. The only really problematic function is the STORE_EOF one, which is designed around files but means you have to keep an eof indicator in the context and update it to be 1 once the load function has complete.

The Provider Recursion Problem

This doesn’t seem to be discussed anywhere else, but it can become a huge issue if your provider depends on another library which also uses OpenSSL. The TPM2 provider depends on either the Intel or IBM TSS libraries and both of those use OpenSSL for cryptographic operations around TPM transport security since both of them use ECDH to derive a seed for session encryption and HMAC. The problem is that ordinarily the providers are called in the order they’re listed, so you always have to specify –provider default –provider tpm2 to make up for the missing public key operations in the TPM2 provider. However, the OpenSSL core operates a cache for the provider operations it has previously found and searches the cache first before doing any other lookups, so if the EC key management routines are cached (as they are if you input a TPM format key) and the default ones aren’t (because inputting TPM format keys requires no public key operations), the next attempt to generate an ephemeral EC key for the ECDH security derivation will find the TPM2 provider first. So say you are doing a signature which requires HMAC security to guard against interposer tampering. The use of ECDH in the HMAC seed derivation will then call back into the provider to do an ECDH operation which also requires session security and will thus call back again into the provider ad infinitum (or at least until stack overflow). The only way to break out of this infinite recursion is to try to prime the cache with the default provider as well as the TPM2 provider, so the tss library functions can find the default provider first. The (absolutely dirty) hack I have to do this is inside the pkey decode function as

	if (alg == TPM_ALG_ECC) {
		EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new_id(EVP_PKEY_EC, NULL);
		EVP_PKEY_CTX_free(ctx);
	}

Which currently works to break the recursion loop. However it is an unreliable hack because internally the OpenSSL hash bucket implementation orders the method cache by provider address and since the TPM2 provider is dynamically loaded it has a higher address than the OpenSSL default one. However, this will not survive security techniques like Address Space Layout Randomization.

Conclusions

Hopefully I’ve given a rapid (and possibly useful) overview of converting an engine to a provider which will give some pointers about provider conversion to all the engine token implementations out there. Please feel free to repurpose my opensslmissing routines under the MIT licence without any obligations to get them back upstream (although I would be interested in hearing about bugs and feature enhancements). In the end, it was only 1152 lines of C to implement the TPM2 provider (additive on top of the common shared code base with the existing Engine) and 681 lines in opensslmissing, showing firstly that there is still an need for OpenSSL itself to do the missing routines as a provider export and secondly that it really takes a fairly small amount of provider code to wrapper an existing engine implementation provided you’re discriminating about what functions you actually provide. As a final remark I should note that the openssl_tpm2_engine has a fairly extensive test suite which all now pass with the provider implementation as well.

Using SIP to Replace Mobile and Land Lines

If you read more than a few articles in my blog you’ve probably figured out that I’m pretty much a public cloud Luddite: I run my own cloud (including my own email server) and don’t really have much of my data in any public cloud. I still have public cloud logins: everyone wants to share documents with Google nowadays, but Google regards people who don’t use its services “properly” with extreme prejudice and I get my account flagged with a security alert quite often when I try to log in.

However, this isn’t about my public cloud phobia, it’s about the evolution of a single one of my services: a cloud based PBX. It will probably come as no surprise that the PBX I run is Asterisk on Linux but it may be a surprise that I’ve been running it since the early days (since 1999 to be exact). This is the story of why.

I should also add that the motivation for this article is that I’m unable to get a discord account: discord apparently has a verification system that requires a phone number and explicitly excludes any VOIP system, which is all I have nowadays. This got me to thinking that my choices must be pretty unusual if they’re so pejoratively excluded by a company whose mission is to “Create Space for Everyone to find Belonging”. I’m sure the suspicion that this is because Discord the company also offers VoIP services and doesn’t like the competition is unworthy.

Early Days

I’ve pretty much worked remotely in the US all my career. In the 90s this meant having three phone lines (These were actually physical lines into the house): one for the family, one for work and one for the modem. When DSL finally became a thing and we were running a business, the modem was replaced by a fax machine. The minor annoyance was knowing which line was occupied but if line 1 is the house and line 2 the office, it’s not hard. The big change was unbundling. This meant initially the call costs to the UK through the line provider skyrocketed and US out of state rates followed. The way around this was to use unbundled providers via dial-around (a prefix number), but finding the cheapest was hard and the rates changed almost monthly. I needed a system that could add the current dial-around prefix for the relevant provider automatically. The solution: asterisk running on a server in the basement with two digium FX cards for the POTS lines (fax facility now being handled by asterisk) and Aastra 9113i SIP phones wired over the house ethernet with PoE injectors. Some fun jiggery pokery with asterisk busy lamp feature allowed the lights on the SIP phone to indicate busy lines and extensions.conf could be programmed to keep the correct dial-around prefix. For a bonus, asterisk can be programmed to do call screening, so now if the phone system doesn’t recognize your number you get told we don’t accept solicitation calls and to hang up now, otherwise press 0 to ring the house phone … and we’ve had peaceful dinner times ever after. It was also somewhat useful to have each phone in the house on its own PBX extension so people could call from the living room to my office without having to yell.

Enter SIP Trunking

While dial-arounds worked successfully for a few years, they always ended with problems (usually signalled by a massive phone bill) and a new dial-around was needed. However by 2007 several companies were offering SIP trunking over the internet. The one I chose (Localphone, a UK based company) was actually a successful ring back provider before moving into SIP. They offered a pay as you go service with phone termination in whatever country you were calling. The UK and US rates were really good, so suddenly the phone bills went down and as a bonus they gave me a free UK incoming number (called a DID – Direct Inward Dialing) which family and friends in the UK could call us on at local UK rates. Pretty much every call apart from local ones was now being routed over the internet, although most incoming calls, apart for those from the UK, were still over the POTS lines.

The beginning of Mobile (For Me)

I was never really a big consumer of mobile phones, but that all changed in 2009 when google presented all kernel developers with a Nexus One. Of course, they didn’t give us SIM cards to go with it, so my initial experiments were all over wifi. I soon had CyanogenMod installed and found a SIP client called Sipdroid. This allowed me to install my Nexus One as a SIP extension on the house network. SIP calls over 2G data were not very usable (the bandwidth was too low), but implementing multiple codecs and speex support got it to at least work (and is actually made me an android developer … scratching my own itch again). The bandwidth problems on 2G evaporated on 3G and SIP became really usable (although I didn’t have a mobile “plan”, I did use pay-as-you-go SIMs while travelling). It already struck me that all you really needed the mobile network for was data and then all calls could simply travel to a SIP provider. When LTE came along it seemed to be confirming this view because IP became the main communication layer.

I suppose I should add that I used the Nexus One long beyond its design life, even updating its protocol stack so it kept working. I did this partly because it annoyed certain people to see me with an old phone (I have a set of friends who were very amused by this and kept me supplied with a stock of Nexus One phones in case my old one broke) but mostly because of inertia and liking small phones.

SIP Becomes My Only Phone Service

In 2012, thanks to a work assignment, we relocated from the US to London. Since these moves take a while, I relocated the in-house PBX machine to a dedicated server in Los Angeles (my nascent private cloud), ditched the POTS connections and used the UK incoming number as our primary line that could be delivered to us while we were in temporary accommodation as well as when we were in our final residence in London. This did have the somewhat inefficient result that when you called from the downstairs living room to the upstairs office, the call was routed over an 8,000 mile round trip from London to Los Angeles and back, but thanks to internet latency improvements, you couldn’t really tell. The other problem was that the area code I’d chosen back in 2007 was in Whitby, some 200 Miles north of London but fortunately this didn’t seem to be much of an issue except for London Pizza delivery places who steadfastly refused to believe we lived locally.

When the time came in 2013 to move back to Seattle in the USA, the adjustment was simply made by purchasing a 206 area code DID and plugging it into the asterisk system and continued using a fully VoIP system based in Los Angeles. Although I got my incoming UK number for free, being an early service consumer, renting DIDs now costs around $1 per month depending on your provider.

SIP and the Home Office

I’ve worked remotely all my career (even when in London). However, I’ve usually worked for a company with a physical office setup and that means a phone system. Most corporate PBX’s use SIP under the covers or offer a SIP connector. So, by dint of finding the PBX administrator I’ve usually managed to get a SIP extension that will simply plug into my asterisk PBX. Using correct dial plan routing (and a prefix for outbound calling), the office number usually routes to my mobile and desk phone, meaning I can make and receive calls from my office number wherever in the world I happen to be. For those who want to try this at home, the trick is to find the phone system administrator; if you just ask the IT department, chances are you’ll simply get a blanket “no” because they don’t understand it might be easy to do and definitely don’t want to find out.

Evolution to Fully SIP (Data Only) Mobile

Although I said above that I maintained a range of in-country Mobile SIMs, this became less true as the difficulty in running in-country SIMs increased (most started to insist you add cash or use them fairly regularly). When COVID hit in 2020, and I had no ability to travel, my list of in-country SIMs was reduced to one from 3 UK largely because they allowed you to keep your number provided you maintained a balance (and they had a nice internet roaming agreement which meant you paid UK data rates in a nice range of countries). The big problem giving up a mobile number was no text messaging when travelling (SMS). For years I’ve been running a xmpp server, but the subset of my friends who have xmpp accounts has always been under 10% so it wasn’t very practical (actually, this is somewhat untrue because I wrote an xmpp to google chat bridge but the interface became very impedance mismatched as Google moved to rich media).

The major events that got me to move away from xmpp and the Nexus One were the shutdown of the 3G network in the US and the viability of the Matrix federated chat service (the Matrix android client relied on too many modern APIs ever to be backported to the version of android that ran on the Nexus One). Of the available LTE phones, I chose the Pixel-3 as the smallest and most open one with the best price/performance (and rapidly became acquainted with the fact that only some of them can actually be rooted) and LineageOS 17.1 (Android 10). The integration of SIP with the Dialer is great (I can now use SIP on the car’s bluetooth, yay!) but I rapidly ran into severe bugs in the Google SIP implementation (which hasn’t been updated for years). I managed to find and fix all the bugs (or at least those that affected me most, repositories here; all beginning with android_ and having the jejb-10 branch) but that does now mean I’m stuck on Android 10 since Google ripped SIP out in Android 12.

For messaging I adopted matrix (Apart from the Plumbers Matrix problem, I haven’t really written about it since matrix on debian testing just works out of the box) and set up bridges to Signal, Google Chat, Slack and WhatsApp (The WhatsApp one requires you be running WhatsApp on your phone, but I run mine on an Android VM in my cloud) all using the 3 UK Sim number where they require a mobile number confirmation. The final thing I did was to get a universal roaming data SIM and put it in my phone, meaning I now rely on matrix for messaging and SIP for voice when I travel because the data SIM has no working mobile number at all (either for voice or SMS). In many ways, this is no hardship: I never really had a permanent SMS number when travelling because of the use of in-country SIMs, so no-one has a number for me they rely on for SMS.

Conclusion and Problems

Although I implied above I can’t receive SMS, that’s not quite true: one of my VOIP numbers does accept SMS inbound and is able to send outbound, the problem is that it doesn’t come over the SIP MESSAGE protocol, but instead goes to a web page in the provider backend, making it inconvenient to use and meaning I have to know the message is coming (although I do use it for things like Delta Boarding passes, which only send the location of the web page to receive pkpasses over SMS). However, this isn’t usually a problem because most people I know have moved on from SMS to rich messaging over one of the protocols I have (and if one came along with a new protocol, well I can install a bridge for that).

In terms of SIP over an IP substrate giving rise to unbundled services, I could claim to be half right, since most of the modern phone like services have a SIP signalling core. However, the unbundling never really came about: the silo provider just moved from landline to mobile (or a mobile resale service like Google Fi). Indeed, today, if you give anyone your US phone number they invariably assume it is a mobile (and then wonder why you don’t reply to their SMS messages). This mobile assumption problem can be worked around by emphasizing “it’s a landline” every time you give out your VOIP number, but people don’t always retain the information.

So what about the future? I definitely still like the way my phone system works … having a single number for the house which any household member can answer from anywhere and side numbers for travelling really suits me, and I have the technical skills to maintain it indefinitely (provided the SIP trunking providers sill exist), but I can see the day coming where the Discord intolerance of non-siloed numbers is going to spread and most silos will require non-VOIP phone numbers with the same prejudice, thus locking people who don’t comply out in much the same way as it’s happening with email now; however, hopefully that day for VoIP is somewhat further off.

Paying Maintainers isn’t a Magic Bullet

Over the last few years it’s become popular to suggest that open source maintainers should be paid. There are a couple of stated motivations for this, one being that it would improve the security and reliability of the ecosystem (pioneered by several companies like Tidelift) and the others contending that it would be a solution to the maintainer burnout and finally that it would solve the open source free rider problem. The purpose of this blog is to examine each of these in turn to seeing if paying maintainers actually would solve the problem (or, for some, does the problem even exist in the first place).

Free Riders

The free rider problem is simply expressed: There’s a class of corporations which consume open source, for free, as the foundation of their profits but don’t give back enough of their allegedly ill gotten gains. In fact, a version of this problem is as old as time: the “workers” don’t get paid enough (or at all) by the “bosses”; greedy people disproportionately exploit the free but limited resources of the planet. Open Source is uniquely vulnerable to this problem because of the free (as in beer) nature of the software: people who don’t have to pay for something often don’t. Part of the problem also comes from the general philosophy of open source which tries to explain that it’s free (as in freedom) which matters not free (as in beer) and everyone, both producers and consumers should care about the former. In fact, in economic terms, the biggest impact open source has had on industry is from the free (as in beer) effect.

Open Source as a Destroyer of Market Value

Open Source is often portrayed as a “disrupter” of the market, but it’s not often appreciated that a huge part of that disruption is value destruction. Consider one of the older Open Source systems: Linux. As an operating system (when coupled with GNU or other user space software) it competed in the early days with proprietary UNIX. However, it’s impossible to maintain your margin competing against free and the net result was that one by one the existing players were forced out of the market or refocussed on other offerings and now, other than for historical or niche markets, there’s really no proprietary UNIX maker left … essentially the value contained within the OS market was destroyed. This value destruction effect was exploited brilliantly by Google with Android: to enter and disrupt an existing lucrative smart phone market, created and owned by Apple, with a free OS based on Open Source successfully created a load of undercutting handset manufacturers eager to be cheaper than Apple who went on to carve out an 80% market share. Here, the value isn’t completely destroyed, but it has significantly reduced (smart phones going from a huge margin business to a medium to low margin one).

All of this value destruction is achieved by the free (as in beer) effect of open source: the innovator who uses it doesn’t have to pay the full economic cost for developing everything from scratch, they just have to pay the innovation expense of adapting it (such adaptation being made far easier by access to the source code). This effect is also the reason why Microsoft and other companies railed about Open Source being a cancer on intellectual property: because it is1. However, this view is also the product of rigid and incorrect thinking: by destroying value in existing markets, open source presents far more varied and unique opportunities in newly created ones. The cardinal economic benefit of value destruction is that it lowers the barrier to entry (as Google demonstrated with Android) thus opening the market up to new and varied competition (or turning monopoly markets into competitive ones).

Envy isn’t a Good Look

If you follow the above, you’ll see the supposed “free rider” problem is simply a natural consequence of open source being free as in beer (someone is creating a market out of the thing you offered for free precisely because they didn’t have to pay for it): it’s not a problem to be solved, it’s a consequence to be embraced and exploited (if you’re clever enough). Not all of us possess the business acumen to exploit market opportunities like this, but if you don’t, envying those who do definitely won’t cause your quality of life to improve.

The bottom line is that having a mechanism to pay maintainers isn’t going to do anything about this supposed “free rider” problem because the companies that exploit open source and don’t give back have no motivation to use it.

Maintainer Burnout

This has become a hot topic over recent years with many blog posts and support groups devoted to it. From my observation it seems to matter what kind of maintainer you are: If you only have hobby projects you maintain on an as time becomes available basis, it seems the chances of burn out isn’t high. On the other hand, if you’re effectively a full time Maintainer, burn out becomes a distinct possibility. I should point out I’m the former not the latter type of maintainer, so this is observation not experience, but it does seem to me that burn out at any job (not just that of a Maintainer) seems to happen when delivery expectations exceed your ability to deliver and you start to get depressed about the ever increasing backlog and vocal complaints. In industry when someone starts to burn out, the usual way of rectifying it is either lighten the load or provide assistance. I have noticed that full time Maintainers are remarkably reluctant to give up projects (presumably because each one is part of their core value), so helping with tooling to decrease the load is about the only possible intervention here.

As an aside about tooling, from parallels with Industry, although tools correctly used can provide useful assistance, there are sure fire ways to increase the possibility of burn out with inappropriate demands:

It does strike me that some of our much venerated open source systems, like github, have some of these same management anti-patterns, like encouraging Maintainers to chase repository stars to show value, having a daily reminder of outstanding PRs and Issues, showing everyone who visits your home page your contribution records for every project over the last year.

To get back to the main point, again by parallel with Industry, paying people more doesn’t decrease industrial burn out; it may produce a temporary feel good high, but the backlog pile eventually overcomes this. If someone is already working at full stretch at something they want to do giving them more money isn’t going to make them stretch further. For hobby maintainers like me, even if you could find a way to pay me that my current employer wouldn’t object to, I’m already devoting as much time as I can spare to my Maintainer projects, so I’m unlikely to find more (although I’m not going to refuse free money …).

Security and Reliability

Everyone wants Maintainers to code securely and reliably and also to respond to bug reports within a fixed SLA. Obviously usual open source Maintainers are already trying to code securely and reliably and aren’t going to do the SLA thing because they don’t have to (as the licence says “NO WARRANTY …”), so paying them won’t improve the former and if they’re already devoting all the time they can to Maintenance, it won’t achieve the latter either. So how could Security and Reliability be improved? All a maintainer can really do is keep current with current coding techniques (when was the last time someone offered a free course to Maintainers to help with this?). Suggesting to a project that if they truly believed in security they’d rewrite it in Rust from tens of thousands of lines of C really is annoying and unhelpful.

One of the key ways to keep software secure and reliable is to run checkers over the code and do extensive unit and integration testing. The great thing about this is that it can be done as a side project from the main Maintenance task provided someone triages and fixes the issues generated. This latter is key; simply dumping issue reports on an overloaded maintainer makes the overload problem worse and adds to a pile of things they might never get around to. So if you are thinking of providing checker or tester resources, please also think how any generated issues might get resolved without generating more work for a Maintainer.

Business Models around Security and Reliability

A pretty old business model for code checking and testing is to have a distribution do it. The good ones tend to send patches upstream and their business model is to sell the software (or at least a support licence to it) which gives the recipients a SLA as well. So what’s the problem? Mainly the economics of this tried and trusted model. Firstly what you want supported must be shipped by a distribution, which means it must have a big enough audience for a distribution to consider it a fairly essential item. Secondly you end up paying a per instance use cost that’s an average of everything the distribution ships. The main killer is this per instance cost, particularly if you are a hyperscaler, so it’s no wonder there’s a lot of pressure to shift the cost from being per instance to per project.

As I said above, Maintainers often really need more help than more money. One good way to start would potentially be to add testing and checking (including bug fixing and upstreaming) services to a project. This would necessarily involve liaising with the maintainer (and could involve an honorarium) but the object should be to be assistive (and thus scalably added) to what the Maintainer is already doing and prevent the service becoming a Maintainer time sink.

Professional Maintainers

Most of the above analysis assumed Maintainers are giving all the time they have available to the project. However, in the case where a Maintainer is doing the project in their spare time or is an Employee of a Company and being paid partly to work on the project and partly on other things, paying them to become a full time Maintainer (thus leaving their current employment) has the potential to add the hours spent on “other things” to the time spent on the project and would thus be a net win. However, you have to also remember that turning from employee to independent contractor also comes with costs in terms of red tape (like health care, tax filings, accounting and so on), which can become a significant time sink, so the net gain in hours to the project might not be as many as one would think. In an ideal world, entities paying maintainers would also consider this problem and offer services to offload the burden (although none currently seem to consider this). Additionally, turning from part time to full time can increase the problem of burn out, particularly if you spend increasing portions of your newly acquired time worrying about admin issues or other problems associated with running your own consulting business.

Conclusions

The obvious conclusion from the above analysis is that paying maintainers mostly doesn’t achieve it’s stated goals. However, you have to remember that this is looking at the problem thorough the prism of claimed end results. One thing paying maintainers definitely does do is increase the mechanisms by which maintainers themselves make a living (which is a kind of essential existential precursor). Before paying maintainers became a thing, the only real way of making a living as a maintainer was reputation monetization (corporations paid you to have a maintainer on staff or because being a maintainer demonstrated a skill set they needed in other aspects of their business) but now a Maintainer also has the option to turn Professional. Increasing the net ways of rewarding Maintainership therefore should be a net benefit to attracting people into all ecosystems.

In general, I think that paying maintainers is a good thing, but should be the beginning of the search for ways of remunerating Open Source contributors, not the end.

Linux Plumbers Conference Matrix and BBB integration

The recently completed Linux Plumbers Conference (LPC) 2021 used the Big Blue Button (BBB) project again as its audio/video online conferencing platform and Matrix for IM and chat. Why we chose BBB has been discussed previously. However this year we replaced RocketChat with Matrix to achieve federation, allowing non-registered conference attendees to join the chat. Also, based on feedback from our attendees, we endeavored to replace the BBB chat window with a Matrix one so anyone could see and participate in one contemporaneous chat stream within BBB and beyond. This enabled chat to be available before, during and after each session.

One thing that emerged from our initial disaster with Matrix on the first day is that we failed to learn from the experiences of other open source conferences (i.e. FOSDEM, which used Matrix and ran into the same problems). So, an object of this post is to document for posterity what we did and how to repeat it.

Integrating Matrix Chat into BBB

Most of this integration was done by Guy Lunardi.

It turns out that Chat is fairly deeply embedded into BBB, so replacing the existing chat module is hard. Fortunately, BBB also contains an embedded etherpad which is simply produced via an iFrame redirection. So what we did is to disable the BBB chat panel and replace it with a new iFrame based component that opened an embedded Matrix chat client. The client we chose was riot-embedded, which is a relatively recent project but seemed to work reasonably well. The final problem was to pass through user credentials. Up until three days before the conference, we had been happy with the embedded Matrix client simply creating a one-time numbered guest account every time it was opened, but we worried about this being a security risk and so implemented pass through login credentials at the last minute (life’s no fun unless you live dangerously).

Our custom front end for BBB (lpcfe) was created last year by Jon Corbet. It uses a fairly simple email/registration confirmation code for username/password via LDAP. The lpcfe front end Jon created is here git://git.lwn.net/lpcfe.git; it manages the whole of the conference log in process and presents the current and future sessions (with join buttons) according to the timezone of the browser viewing it.

The credentials are passed through directly using extra parameters to BBB (see commit fc3976e “Pass email and regcode through to BBB”). We eventually passed these through using a GET request. Obviously if we were using a secret password, this would be a problem, but since the password was a registration code handed out by a third party, it’s acceptable. I imagine if anyone wishes to take this work forward, add native Matrix device/session support in riot-embedded would be better.

The main change to get this working in riot-embedded is here, and the supporting patch to BBB is here.

Note that the Matrix room ID used by the client was added as an extra parameter to the flat text file that drives the conference track layout of lpcfe. All Matrix rooms were created as public (and published) so anyone going to our :lpc.events matrix domain could see and join them.

Setting up Matrix for the Conference

We used the matrix-synapse server and did a standard python venv pip install on Ubuntu of the latest tag. We created around 30+ public rooms: one for each Microconference and track of the conference and some admin and hallway rooms. We used LDAP to feed the authentication portion of lpcfe/Matrix, but we had a problem using email addresses since the standard matrix user name cannot have an ‘@’ symbol in it. Eventually we opted to transform everyone’s email to a matrix compatible form simply by replacing the ‘@’ with a ‘.’, which is why everyone in our conference appeared with ridiculously long matrix user names like @jejb.ibm.com:lpc.events

This ‘@’ to ‘.’ transformation was a huge source of problems due to the unwillingness of engineers to read instructions, so if we do this over again, we’ll do the transformation silently in the login javascript of our Matrix web client. (we did this in riot-embedded but ran out of time to do it in Element web as well).

Because we used LDAP, the actual matrix account for each user was created the first time they log into our server, so we chose at this point to use auto-join to add everyone to the 30+ LPC Matrix rooms we’d already created. This turned out to be a huge problem.

Testing our Matrix and BBB integration

We tried to organize a “Town Hall” event where we invited lots of people to test out the infrastructure we’d be using for the conference. Because we wanted this to be open, we couldn’t use the pre-registration/LDAP authentication infrastructure so Jon quickly implemented a guest mode (and we didn’t auto join anyone to any Matrix rooms other than the townhall chat).

In the end we got about 220 users to test during which time the Matrix and BBB infrastructure behaved quite well. Based on this test, we chose a 2 vCPU Linode VM for our Matrix server.

What happened on the Day

Come the Monday of the conference, the first problem we ran into was procrastination: the conference registered about 1,000 attendees, of whom, about 500 tried to log on about 5 minutes prior to the first session. Since accounts were created and rooms joined upon the first login, this is clearly a huge thundering herd problem of our own making … oops. The Matrix server itself shot up to 100% CPU on the python synapse process and simply stayed there, adding new users at a rate of about one every 30 seconds. All the chat tabs froze because logins were taking ages as well. The first thing we did was to scale the server up to a 16 CPU bare metal system, but that didn’t help because synapse is single threaded … all we got was the matrix synapse python process running at 100% one one of the CPUs, still taking 30 seconds per first log in.

Fixing the First Day problems

The first thing we realized is we had to multi-thread the synapse server. This is well known but the issue is also quite well hidden deep in the Matrix documents. It also happens that the Matrix documents are slightly incomplete. The first scaling attempt we tried: simply adding 16 generic worker apps to scale across all our physical CPUs failed because the Matrix server stopped federating and then the database crashed with “FATAL: remaining connection slots are reserved for non-replication superuser connections”.

Fixing the connection problem (alter system set max_connections = 1000;) triggered a shared memory too small issue which was eventually fixed by bumping the shared buffer segment to 8GB (alter system set shared_buffers=1024000;). I suspect these parameters were way too large, but the Linode we were on had 32GB of main memory, so fine tuning in this emergency didn’t seem a good use of time.

Fixing the worker problem was way more complex. The way Matrix works, you have to use a haproxy to redirect incoming connections to individual workers and you have to ensure that the same worker always services the same transaction (which you achieve by hashing on IP address). We got a lot of advice from FOSDEM on this aspect, but in the end, instead of using an external haproxy, we went for the built in forward proxy load balancing in nginx. The federation problem seems to be that Matrix simply doesn’t work without a federation sender. In the end, we created 15 generic workers and one each of media server, frontend server and federation sender.

Our configuration files are

once you have all the units enabled in systemd, you can then simply do systemctl start/stop matrix-synapse.target

Finally, to fix the thundering herd problem (for people who hadn’t already logged in), we ran through the entire spreadsheet of email/confirmation numbers doing an automatic login using the user management API on the server itself. At this point we had about half the accounts auto created, so this script created the rest.

emaillist=lpc2021-all-attendees.txt
IFS='   '

while read first last confirmation email; do
    bbblogin=${email/+*@/@}
    matrixlogin=${bbblogin/@/.}
    curl -XPOST -d '{"type":"m.login.password", "user":"'${matrixlogin}'", "password":"'${confirmation}'"}' "http://localhost:8008/_matrix/client/r0/login"
    sleep 1
done < ${emaillist}

The lpc2021-all-attendees.txt is a tab separated text file used to drive the mass mailings to plumbers attendees, but we adapted it to log everyone in to the matrix server.

Conclusion

With the above modifications, the matrix server on a Dedicated 32GB (16 cores) Linode ran smoothly for the rest of the conference. The peak load got to 17 and the peak total CPU usage never got above 70%. Finally, the peak memory usage was around 16GB including cache (so the server was a bit over provisioned).

In the end, 878 of the 944 registered attendees logged into our BBB servers at one time or another and we got a further 100 external matrix users (who may or may not also have had a conference account).

The Community Corrosive Effects of CLAs

As one of the kernel DCO advocates, I’ve written many times about using the DCO instead of a CLA for copyright and patent contributions under open source licences. In spite of my obvious biases, I’ll try to give a factual overview of the cases for the DCO and CLA system. First, it should be noted that both the DCO and any CLA are types of Contribution Agreements (a set of terms by which contributors are agreeing to be bound). It should also be acknowledged that the DCO is a far more recent invention than CLAs. The DCO was first pioneered by the Linux kernel in 2004 (having been designed by Diane Peters, then of OSDL) and was subsequently adopted by a broad range of open source projects. However, in legal terms, the DCO is much less well understood than a standard CLA type agreement between the contributor and some entity, which is largely the reason you find a number of lawyers still advocating for the use of CLAs in various open source projects: because they’d like to stick with something that has more miles on it, or because they’re invested in the older model of community, largely pioneered by Apache. The biggest problem today is that the operation of most CLAs is asymmetrical: they take from the contributor more rights than the open source code actually needs, so lets begin with a summary of each type of Contribution Agreement.

DCO

The DCO is a legal representation by the contributor to everyone who might ever use the code. It requires no second party on the other side to counter sign it or act as the receiving entity, so it exactly mirrors the inbound=outbound licensing model first coined by Richard Fontana. The DCO explicitly grants to all downstream recipients only the exact rights the Open Source licence requires (and nothing more). In this sense it is fully symmetrical: the rights granted by the contributor are the same as the rights received by the downstream (i.e. inbound=outbound). Every contributor under the DCO retains their own copyright (or their company does if the contribution is a work for hire). The main alleged disadvantage of the DCO is that it encourages distributed ownership and makes it very hard to change the licence of the project because each contributor has only granted the rights necessary for the current licence, so if the new one requires more or different rights, all the current contributors have to re-grant those new or different rights (which can be a huge number of people for large long running projects). Since the DCO is a representation to everyone and requires no receiving entity, the project collecting the code doesn’t require any formal legal entity, like a foundation, to operate and thus the DCO gives rise to a truly lightweight structure for any project. The other big advantage of the DCO is that all of the representations are tracked by the Signed-off-by: tag on the commit, which goes in the git repository of the project code, so anyone with a clone of the repository has complete access to information about who changed what and where their DCO signoff is.

CLA

All current Open Source CLAs are structured as agreements between the contributor and a second party. Most often, the second party is a Foundation or a Corporation, making them quite heavy weight in terms of setup, admin and overhead. Every current CLA that I know about takes more rights from the contributor than the open source licence actually requires. For instance the Apache Individual CLA grants the right to copy, derive and sublicence to the Apache foundation who then relicence the contribution to the project usually under the Apache 2.0 licence. This is a classic asymmetric grant because the Apache foundation receives far more rights in the contribution than it grants to the downstream recipients. The FSF CLA is even more extreme because they require assignment of the copyright (so they will own the code and you, the author, will have no further right or interest in it except possibly for minimal moral rights to be named the author). Apart from the asymmetric grant, which places the receiving entity in a privileged position in the ecosystem, the other problem with CLAs is that they’re legal agreements, so they require a lawyer to prepare them, a mechanism to ensure people sign them and a mechanism to keep all the signatures … sometimes this can be in filing cabinets if paper instead of electronic copies are used. This repository of agreements then isn’t available to anyone except the tracking entity, meaning that if someone needs to know if John Doe signed a CLA, they have to reach out and ask. In some cases the actual filing cabinets got lost as projects changed offices, so some CLA based projects don’t actually have complete records of all their CLAs.

CLAs Catalyse Community Corrosion

The main driver of community corrosion is the temptation to abuse a position of power (this temptation becomes irresistable over time because, as Baron Acton put it, “all power corrupts”). Since CLAs by their nature force a power imbalance between the contributor and the receiving entity, they act as focal points for this corrosion. Communities are very sensitive to what they see as their work being misused, so the fastest way to lose community trust is to abuse the power the CLA gave you to go against the community itself. There are numerous examples of this in the Corporate World, the most topical one today being the Elastic change from Apache 2.0 to SSPL to better monetize the code the community contributed freely to. One might think the solution to this is never to sign a CLA if the holder of the power imbalance is a corporation … i.e. only do it if the other entity is a not for profit foundation. But ask yourself, how much do you trust the people running the foundation and do its bylaws guarantee your rights in the code? Relicensing for commercial gain isn’t the only way the community could be abused, so how sure are you of the power you’re handing to a foundation which, after all, is an entity governed by some type of board, all of whom likely have political agendas, won’t be abused? To see some examples of foundations not being in tune with their community, one only has to look at the FSF and Richard Stallman. Based on all of this I conclude, like Drew DeVault, that you should never sign a CLA under any circumstances.

The bottom line is that if you do sign a CLA some decision will happen at some point that you don’t agree with but which you already gave away the power to block because of the rights imbalance inherent in the CLA you signed. Inevitably this decision will cause you to feel betrayed because your views are being ignored and as a contributor you feel you should be heard, so you’ll sour on the project. This is the community corrosion catalyst buried deep inside all CLAs.

One final thing to note is that it is possible to craft a CLA that only takes the rights it needs, in the same way the DCO does, it’s just that no project I know has ever done this. However, even if this experiment were attempted, you still need a recipient entity, plus all the infrastructure to do signing and track the signed agreements, so you’d still be better off using a lightweight DCO process.

Conclusion: For Community Small is Beautiful

The way to avoid the community corrosion problem is to do everything minimally: use a DCO to take only the rights the downstream requires and to avoid all the heavyweight recipient, signing and tracking infrastructure. Don’t set up a foundation unless you absolutely need an entity, say to handle cash, and if you must set one up, never give it any control over the project (like appointing a change control or architecture control board for instance) everything you set up should be as small as possible and clearly serve the project and its community. Above all, don’t use a CLA because it will cause a rights imbalance that corrodes your community and it will require a large amount of overhead to run.

Owning Your Own Copyrights in Open Source

This article covers several aspects: owning the copyrights you develop outside of your employed time and the more thorny aspect of owning the copyrights in open source projects you work on for your employer. It will also take a look at the middle ground of being a contract entity doing paid work on open source. This article follows the historical sweep of my journey through this field and so some aspects may be outdated and all are within the bounds of the US legal system and it’s most certainly not complete, just a description of what I did and what I learned.

Why Should you Own your Own Source code?

In the early days of open source, everything was a hobby project and everyone owned their own contributions. Owning your own contribution was a sort of mark of franchise in the project. Of course, there were some projects, notably the FSF ones, which didn’t believe in distributed ownership and insisted you contribute ownership of your copyrights to them so they could look after the project for you. Obviously, since I’m a Linux Kernel developer and with the Linux Kernel being a huge distributed copyright project, it’s easy to see which side of the argument I fall.

The main rights you give up if you don’t own the code you create are the right to re-licence and the right to enforce. It probably hadn’t occurred to you that if you actually find a licence violation in a project you contribute to for your employer, you’ll have no standing to demand that the problem get addressed. In fact, any enforcement on the code would have to be done by the proper owner: your employer. Plus your employer can control the ultimate destination of that ownership, including selling your code to a copyright troll if they so wished … while you may trust your employer now you work for them, do you trust them to do the right thing for all time, especially since they may be bought out by EvilCorp on down the road?

The relicensing problem can also be thorny: as a strong open source contributor you’ve likely been on the receiving end of requests to relicense (“I really like the code in your project X and would like to incorporate it in my open source project Y, but there’s a licence compatibility problem, would you dual license it?”) and thought nothing about saying “yes”. However, if your employer owns the code, you were likely lying when you said “yes” because you have no relicensing rights and you must ask your employer for permission to do the relicensing.

All the above points up the dangers in the current ecosystem. Project contributors often behave like they own the code but if they don’t they can be leaving a legal minefield in their wakes. The way to fix this is to own your own code … or at least understand the limitations of your rights if you don’t.

Open Source in Your Own Time

It’s a mistake to think that just because you work on something in your own time it isn’t actually owned by your employer. Historically, at least in the US, employment agreements contain incredibly broad provisions for invention ownership which basically try to claim anything you invent at any hour of the day or night that might be even vaguely related to your employment. Not unnaturally this caused huge volumes of litigation around startups where former employees successfully develop innovations their prior employer declined to pursue (at least until it started making money). This has lead to a slew of state based legal safe harbour protections for employee inventions. Most of them, like the Illinois Statute I first used, have similar wording

A provision in an employment agreement which provides that an employee shall assign or offer to assign any of the employee’s rights in an invention to the employer does not apply to an invention for which no equipment, supplies, facilities, or trade secret information of the employer was used and which was developed entirely on the employee’s own time … is … void and unenforceable.

765 ILCS 1060/2

In fact most states now require the wording to appear in the employment contract, so you likely don’t have to look up the statute to figure out what to do. The biggest requirements are that it be on your own time and you not be using any employer equipment, so the most important thing is to make sure you have your own laptop or computer. If you follow the requirements to the letter, you should be safe enough in owning your own time open source code. However, if you really want a guarantee you need to take extra precautions.

Own Time Open Source Carve Outs in Employment agreements

When you join a company, one of the things you’ll sign is a prior invention disclosure form, usually as an appendix to the invention assignment agreement as part of your employment contract. Here’s an example one from the SEC database (ironically for a Chinese subsidiary). Look particularly at section 2(a) “Inventions Retained and Licensed”. It’s basically pure CYA for the company, and most people leave Exhibit A blank, but you shouldn’t do that. What you should do is list all your current and future (by doing sweeping guesswork) own open source projects. The most useful clause in 2(a) says “I agree that I will not incorporate any Prior Inventions into any products …” so you and your employer have now agreed that all the listed projects are outside the scope of your employment agreement.

As far as I can tell, no-one really looks at Exhibit A at all, so I’ve been really general and put things like “The Linux Kernel” and “Open Source UEFI software” “Open Source cryptography such as gnupg, openssl and gnutls” and never been challenged on it.

One legitimate question, which will probably happen if your carve outs are very broad, is what happens if your employer specifically asks you to work on a project you’ve declared in Exhibit A? Ideally you could use this as an opportunity to negotiate an addendum to your contract covering your ownership of open source. However, if you don’t want to rock the boat, you can simply do nothing and rely on the fact that the agreement has something to say about this. The sample section 2(a) above goes on to give your employer a non-exclusive licence, which you could take as agreement to your continued ownership of the copyrights in the code, even through your employer is now instructing you (and paying you) to work on it. However, the say nothing approach has never been tested in court and may be vulnerable to challenge, so a safer course is to send your manager an email pointing out the issue and proposing to follow the licence in the employment contract. If they do nothing, thinking the matter settled, as most managers do, then you have legal cover for continuing to own your own copyrights. You can make it as vague as you like, so using the above sample agreement, something like “You’ve asked me to work on Project X which was listed in Exhibit A of my employment agreement. To move forward, I’m happy to licence all future works on this project to you under the terms of section 2(a)”. It looks innocuous, but it’s actually a statement that your company doesn’t get copyright ownership because of the actual wording in section 2(a) says the company gets a non-exclusive licence if you incorporate any works listed in Exhibit A. Remember to save the email somewhere safe (and any reply which is additional proof it was seen) just in case.

Owning Open Source Produced on Company Time

The first thing to note is that if your employer pays for you to work on open source, absent any side agreement, the code that you produce will be owned by your employer. This isn’t some US specific thing, this is a general principle of employment the world over (they pay you, so they own it). So even if you work in Europe, your employer will still own your open source copyrights if they pay you to work on the project, moral rights arguments notwithstanding. The only way to change this is to get some sort of explicit or implicit (if you want to go the carve out route above) agreement about the ownership.

Although I’ve negotiated both joint and exclusive ownership of open source via employment agreements, the actual agreements are still the property of the relevant corporations and thus, unfortunately, while I can describe some of the elements, I can’t publish the text (employment agreements are the crown jewels the HR dragons guard).

How to Negotiate

Most employers (or at least their lawyers) will refuse point blank to change the wording of employment agreements. However, what you want can be a side agreement and usually doesn’t require rewording the employment agreement at all. All you need is the understanding that the side agreement will get executed. One big problem can be that most negotiations over employment agreements occur with people from HR, which is a department with the least understanding of open source, so you don’t want to be negotiating the side agreement with them, you want to talk to the person that is hiring you. You also need to present your request as reasonable, so find out if anyone inside your prospective new company has done something similar. Often they have, and they’ll likely be someone in open source you’ve at least heard of so you can approach them and ask for details. “But you gave a copyright ownership side agreement to X” is often a great way to advance your cause. Don’t be afraid to ask and argue politely but firmly … hiring talented developers is very competitive nowadays so they have (or at least the manager who wants to hire you has) a vested interest in keeping you happy.

Consider Joint Ownership

Joint ownership is a specific legal term meaning the rights in a copyright are shared by the joint owners. Effectively this sharing means that either party may enforce without consulting the other and either party may license the work without consulting the other (but here they must share any profits from the licence equally among joint owners).

Joint ownership is often a good solution because it gives you the right to relicence and the right to enforce, while also giving your employer a share in what they paid to produce. Joint ownership is often far easier to sell to corporations than one or other of you having exclusive ownership because it gives them all the rights they would have had anyway. The only slight concern you may have down the road is it does give them the right to relicence or sell on their ownership, say to an open core business or to an enforcement troll. However, the good news is that as joint owner you now have a right to a half share of any profit they (and the new owner) make out of such a rights transfer, which can potentially act as a deterrent to the transaction if you remind them of this requirement.

Open Source as a Contractor

In some ways this is the best relationship. There are no work for hire assumptions about companies you contract for owning your free time, so doing other open source projects is easy. However, a contractor is bound by whatever contract you sign, so you need someone with legal training to help you make sure it is actually equitable. You can’t get around this legal requirement: the protections that exist for employees don’t exist for contractors, so if you sign a contract saying in exchange for a certain sum company X owns the entirety of your output, you will be bound by it. So remember: read the contract and negotiate the terms.

Copyright Ownership as a Contractor

Surprisingly, in a relationship where you’re contracted to get something upstream, it’s often in the client’s best interest to have the contractor own the copyrights in Open Source. It means the contractor is responsible for all the nitty gritty of pushing patches and dealing with contribution agreements and the client simply gets the end product: the thing they wanted upstream. I’ve found this a surprisingly easy sell to most legal departments. Even if the client does want some sort of ownership of the code, you can offer joint ownership as the easy route to you taking on all the hassle and them getting the benefits of ownership.

Trade Secrets

As a contractor, you’ll likely be forced to sign an NDA never to reveal client secrets. This is pretty usual, but the pitfall in open source, particularly if you’re doing a driver for a device whose programming manual is under NDA, is that you are going to be revealing them contrary to the NDA. You need this handled in an equitable fashion in the contract to avoid unpleasant problems long after the job is done. The simplest phrase you need is something like “Client understands that open source is developed in public and authorizes that all information necessary to producing X under this contract be disclosed to the public”.

Patents

Patents can be a huge minefield with contract open source, because as a contractor who owns the copyrights and negotiates the contribution agreements, you have no authority to bind your client’s patents. You really don’t want to find yourself being used as a conduit for a patent ambush on open source (where a client contracts with you to put code into a project which reads on a patent they hold and then turns around and patent trolls the ecosystem) so you need contract language binding the client patents at least in the work you’re doing for them. Something simple like “Client grants a perpetual and irrevocable licence, consistent with the terms of the open source licence for X, to all contributions made by contractor to X that read on patents client holds now or may in future acquire”. This latter is pretty narrow, so you could start out by trying to get a patent licence for the entirety of project X and negotiate down from there.

Conclusions

Owning your own copyrights in open source is possible provided you’re careful. The strategies outlined above are based on my own experiences (all in the US) as a contract employee from 1995-2008 there after as a regular employee but are not the only ones you could pursue, so ask around to see what others have done as well. The main problem with all the strategies above is that they work well when you’re negotiating your employment. If you’re already working at some corporation they’re unlikely to be helpful to you unless you really have a simple own time open source project. Oh, and just remember that while the snippets I quoted above for the contract case may actually have been in contracts I signed, this isn’t legal advice and you should have a lawyer advise you how best to incorporate the various points raised.

Papering Over our TPM 2.0 TSS Divisions

For years I’ve been hoping that the Trusted Computing Group (TCG) based IBM and Intel TSS (TCG Software Stack) would simply integrate with one another into a single package. The rationale is pretty simple: the Intel TSS is already quite a large collection of libraries so adding one more (the IBM TSS has a single library) wouldn’t be too much of a burden. Both TSSs are based on TCG specifications, except that the IBM TSS is based on the TPM 2.0 Library Specification and the Intel TSS is based on the TPM Software Stack (also, not at all confusingly, abbreviated TSS). There’s actually very little overlap between these specifications so co-existence seems very reasonable. Before we get into the stories of these two stacks and what they do, I should confess my biases: while I’ve worked with the TCG over the years, I’ve always harboured the view that the complete lack of adoption of TPM 2.0’s predecessor (TPM 1.2) was because of the hugely complicated nature of the TCG mandated software stack which was implemented in Linux by trousers. It is my firm belief that the complexity of the API lead to the lack of uptake, even though I made several efforts over the years to make use of it.

My primary interest in the TPM has been as a secure laptop keystore (since I already paid for a TPM, I didn’t see the need to fork out again for one of the new security dongles; plus the TPM is infinitely scalable in the number of keys, unlike most dongles). The key to making the TPM usable in this form is integration with existing Cryptographic systems (via plugins if they do them). Since openssl has an engine plugin, I’ve already produced an openssl TPM2 engine, patches for gnupg and engine integration patches for openvpn (upstream in 2.5) and openssh as well as a PKC11 exporter (to make file based engine keys exportable as PKCS11 tokens). Note a lot of the patches aren’t strictly TPM patches, they’re actually making openssl engines work in places they previously didn’t. However, the one thing most of the patches that actually touch the TPM have in common is that they have to pick one or other of the available TSSs to operate with. Before describing the TSS agnostic solution, lets look at why these two TSSs exist and what the difference is between them and why you might choose one over the other.

Schizophrenia at the TCG

As I said in the introduction, both TSSs are based on TCG specifications. These standards aren’t ambiguous: they lay out in excruciating detail what the header files are called and what the prototypes and structures have to be. Both TSS implementations are the way they are because they wouldn’t be following the standards if they deviated even slightly. The problem is the standards don’t agree with each other in meaningful ways. For instance the TPM Library standards define every structure in terms of the fundamental unit of TPM data: the TPM2B structure, which defines a 16 bit big endian length followed by a data unit of that length. The TPM Library standards (in Part 4 section 9.10.6) lay out that every TPM2B_X structure shall be a union of a ‘b’ element which is a TPM2B and a ‘t’ element which is the actual structure. However the TPM Software Stack specification eliminates the plain TPM2B so every TPM2B_X structure in the latter specification are not unions, they are simply the ‘t’ form of the structure. This means that although TPM2B_X structures in each specification are byte for byte the same, they are definitionally different when written as C code and can’t be assigned to each other … oops. The TPM Library standard lays out additional structures for an elaborate calling convention for the TPM2_Command interfaces which are completely different from the ESYS_Command interfaces in the TPM Software Stack.

The reason it’s all done this way? well the specifications were built by completely different committees for what the committees saw as separate use cases, so they didn’t see a need to reconcile the differences. As long as the definitions were byte for byte compatible, everything would work out correctly on the wire. The problem was the TPM Library specification was released nearly a decade ahead of the TPM Software Stack specification, so the first TSS created had to follow the former because the latter didn’t exist.

Sessions, HMAC and Encryption

One of the perennial problems of a TPM is that integrity and security of the information going over the wire is the responsibility of the user. However, the encryption and integrity computations involved, particularly the key derivations, are incredibly involved (even though well documented in the TPM Library specification, so naturally everyone would like the TSS to do this. The problem the TPM Secure Stack had is that all the way up to its ESAPI specification, the security and integrity computations were still the responsibility of the user, so it didn’t begin to be useful until ESAPI was finalized a couple of years ago.

The Resource Manager Problem

TPM 2.0 was designed to be far leaner in terms of resources than TPM 1.2, which meant there was a very small limit to the number of sessions and volatile objects it could contain at any one time. This necessitated the use of a “resource manager” to control access otherwise applications would get unexpected out of resource errors. The Intel TSS has its own resource manager. However, the Linux Kernel itself incorporated a resource manager in the TPM device in 4.12 and the IBM TSS avoids the need for its own resource manager by using this, and will, therefore not work correctly on earlier kernel versions.

Inside the IBM TSS

Even though the IBM TSS is based on a solid and easily comprehensible and detailed specification, that specification itself suffers from a couple of defects. The first being it assumes you’re submitting to a physical TPM, so the specification has no functional (library based) submission API for TPM commands, so the IBM TSS had to invent API it called TSS_Execute() which is a way of sending TPM commands directly to the physical TPM over the kernel’s device interfaces. Secondly, the standard contains no routing interfaces (telling it what destination the TPM is on: should it open the /dev/tpmrm0 device or send the commands to the TPM over an IP socket), so this is controlled in the IBM TSS by several environment variables (TPM_INTERFACE_TYPE, which can be either “dev” or “socsim” for either a physical device or a network socket. The endpoints being controlled by TPM_DEVICE for “dev” type, which specifies which device to use, defaulting to /dev/tpmrm0 or TPM_SERVER_NAME and TPM_PLAFORM_PORT for “socsim”).

The invented TSS_Execute() API also does all the encryption and HMAC parts necessary for secure and integrity verified communication with the TPM, so it acts as a fully functional TSS. The main drawback of the IBM TSS is that it stores essential information about the sessions and handles in files which will, by default, be dropped into the local directory. Most users of the IBM TSS have to set TPM_DATA_DIR to be a specially created directory under /tmp to avoid leaving messy artifacts in users home directories.

Inside the Intel TSS

The TPM Software Stack consists of a large number of different specifications, including the resource manager (which is now unnecessary for kernels above 4.12) the TCTI which specifies the routing information for the TPM. It turns out that even in the Intel TSS, environment variables are the most convenient form to specify this information but, unfortunately, the name of the environment variable has been left up to each use case instead of being standardised in the library meaning you’ll have to consult the man page to figure out what it is. The next set of standards: SAPI and ESAPI define functional interfaces to the TPM with one submission API for each command and additionally a corresponding ..._Async()/..._Finish() pair for asynchronous programming. The only real difference between SAPI and ESAPI is that the latter also does the necessary session cryptography for security and integrity, so it’s pretty much the only usable interface for TPM commands. Unfortunately, the ESAPI interface, as constructed by the TCG, has several cases of premature abstraction the worst of which is a separate abstraction for the TPM handle interface which lives only as long as the lifetime of the connection object and which necessitates multiple conversions to and from internal handle objects if your session or object lives longer than the connection (which can be the case).

There is one final wrinkle is that in the handle abstraction, ESAPI has no API for retrieving the real TPM handle. I’d always wondered why the Intel TSS tpm2 tools always saved the objects they create to a context instead of simply returning the handle to them, but this is the reason: without the ability to transform an internal handle to an external one, you either save the context or let the object die when the connection terminates. This problem is one forced by the ESAPI standard, but eventually it became enough of a problem that the Intel TSS introduced its own additional API to remedy.

The other major difference between the Intel and IBM TSSs is memory handling for returned results: The IBM TSS requires pre-allocated structures whereas the Intel TSS insists on allocation on return. It looks like the Intel TSS should be able to tell if the return pointer is allocated or NULL, but right at the moment it always allocates and overwrites the pointer.

Constructing a unifying Interface for both the IBM and Intel TSSs

In essence the process for converting something that runs with the IBM TSS to being TSS Agnostic is a fairly simple three step process which I’ll illustrate by reference to the openssl tpm2 engine which has already been converted:

  1. Hide the structural differences by inserting a set of macros: VAL() and VAL_2B() which hide most of the TCG induced structure schizophrenia.
  2. Convert the API call structure to be functional instead of via a single TSS_Execute() call. This is quite involved so I did it by adding tpm2_Function() wrappers for each specific invocation.
  3. Introduce the correct premature abstraction for internal and external representation of handles. This was the nastiest step for me because handles are stored in long lived engine structures, and the internal and external representations are both forms of uint32_t even in ESAPI (meaning the compiler won’t complain if you assign one to the other) so it was incredibly painful to get this conversion correct.

Once this is done, the remaining step was to introduce a header which did the impedance matching between the Intel and IBM TSSs and an autoconf macro to detect which TSS is installed and the resulting configure and compile just works. The resulting code will now build and run under either TSS. I should point out that the Intel TSS is missing several helper routines, but these are added into the intel-tss.h header file by copying the from the original IBM TSS. Finally an autoconf check is added to look for the missing internal to external handle transform, and everything is ready to go.

It does seem like it would be easier to port an existing Intel TSS application to the IBM TSS, since points 2 and 3 will already be sorted out. However, all the major TSS library using applications are IBM TSS based, so I haven’t actually been able to verify this.

Remaining Problems and Anomalies

The biggest remaining issue was the test scripts. The openssl TPM2 engine has 27 of them all told, all designed to check the engine function by invoking it via openssl when connected to a software TPM. These scripts are all highly dependent on the IBM TSS command line binaries and the Intel TSS versions seem to be very unstable in terms of argument structure making it pretty much impossible to convert, so I elected finally to have the tests run only if the IBM TSS CLI is installed. The next problem was that the Intel TSS version of the engine didn’t actually pass all the tests. However this was quickly narrowed down to a bug in the Intel TSS when using bound sessions on the NULL seed.

The sole remaining issue is a curious performance anomaly. When running time make check with the IBM TSS, the result is:

real 0m6.100s
user 0m2.827s
sys 0m0.822s

and the same command with the Intel TSS (running one fewer test and skipping the NULL seed) is:

real	0m10.948s
user	0m6.822s
sys	0m0.859s

Showing that the Intel TSS is nearly twice as slow as the IBM one with most of the time differential being user time. Since the tests use a software TPM which can perform the cryptographic operations at the speed of the main CPU, this is showing some type of issue with the command transmission system of the Intel TSS, likely having to do with the fact that most applications use synchronous TPM operations (the engine certainly does) but in the Intel TSS, the synchronous operations are implemented as the corresponding asynchronous pair. Regardless of the root cause, this is unlikely to be a problem with real world TPM crypto where the time taken for any operation will be dominated by the slowness of the physical TPM.

Conclusion

The TSS agnostic scheme adopted by the openssl TPM2 engine should be easily adaptable for all the other non-engine TPM code bases, and thus should pave the way for users not having to choose between applications which only support the Intel or IBM TSSs and can choose to install the best supported one on their distribution. The next steps are to investigate adapting this infrastructure to the existing gnupg patches (done and upstream) and also see if it can be used to solve the gnutls conundrum over supporting TPM based keys.

Deploying Encrypted Images for Confidential Computing

In the previous post I looked at how you build an encrypted image that can maintain its confidentiality inside AMD SEV or Intel TDX. In this post I’ll discuss how you actually bring up a confidential VM from an encrypted image while preserving secrecy. However, first a warning: This post represents the state of the art and includes patches that are certainly not deployed in distributions and may not even be upstream, so if you want to follow along at home you’ll need to patch things like qemu, grub and OVMF. I should also add that, although I’m trying to make everything generic to confidential environments, this post is based on AMD SEV, which is the only confidential encrypted1 environment currently shipping.

The Basics of a Confidential Computing VM

At its base, current confidential computing environments are about using encrypted memory to run the virtual machine and guarding the encryption key so that the owner of the host system (the cloud service provider) can’t get access to it. Both SEV and TDX have the encryption technology inside the main memory controller meaning the L1 cache isn’t encrypted (still vulnerable to cache side channels) and DMA to devices must also be done via unencryped memory. This latter also means that both the BIOS and the Operating System of the guest VM must be enlightened to understand which pages to encrypted and which must not. For this reason, all confidential VM systems use OVMF2 to boot because this contains the necessary enlightening. To a guest, the VM encryption looks identical to full memory encryption on a physical system, so as long as you have a kernel which supports Intel or AMD full memory encryption, it should boot.

Each confidential computing system has a security element which sits between the encrypted VM and the host. In SEV this is an aarch64 processor called the Platform Security Processor (PSP) and in TDX it is an SGX enclave running Intel proprietary code. The job of the PSP is to bootstrap the VM, including encrypting the initial OVMF and inserting the encrypted pages. The security element also includes a validation certificate, which incorporates a Diffie-Hellman (DH) key. Once the guest owner obtains and validates the DH key it can use it to construct a one time ECDH encrypted bundle that can be passed to the security element on bring up. This bundle includes an encryption key which can be used to encrypt secrets for the security element and a validation key which can be used to verify measurements from the security element.

The way QEMU boots a Q35 machine is to set up all the configuration (including a disk device attached to the VM Image) load up the OVMF into rom memory and start the system running. OVMF pulls in the QEMU configuration and constructs the necessary ACPI configuration tables before executing grub and the kernel from the attached storage device. In a confidential VM, the first task is to establish a Guest Owner (the person whose encrypted VM it is) which is usually different from the Host Owner (the person running or controlling the Physical System). Ownership is established by transferring an encrypted bundle to the Secure Element before the VM is constructed.

The next step is for the VMM (QEMU in this case) to ask the secure element to provision the OVMF Firmware. Since the initial OVMF is untrusted, the Guest Owner should ask the Secure Element for an attestation of the memory contents before the VM is started. Since all paths lead through the Host Owner, who is also untrusted, the attestation contains a random nonce to prevent replay and is HMAC’d with a Guest Supplied key from the Launch Bundle. Once the Guest Owner is happy with the VM state, it supplies the Wrapped Key to the secure element (along with the nonce to prevent replay) and the Secure Element unwraps the key and provisions it to the VM where the Guest OS can use it for disc encryption. Finally, the enlightened guest reads the encrypted disk to unencrypted memory using DMA but uses the disk encryptor to decrypt it to encrypted memory, so the contents of the Encrypted VM Image are never visible to the Host Owner.

The Gaps in the System

The most obvious gap is that EFI booting systems don’t go straight from the OVMF firmware to the OS, they have to go via an EFI bootloader (grub, usually) which must be an efi binary on an unencrypted vFAT partition. The second gap is that grub must be modified to pick the disk encryption key out of wherever the Secure Element has stashed it. The third is that the key is currently stashed in VM memory before OVMF starts, so OVMF must know not to use or corrupt the memory. A fourth problem is that the current recommended way of booting OVMF has a flash drive for persistent variable storage which is under the control of the host owner and which isn’t part of the initial measurement.

Plugging The Gaps: OVMF

To deal with the problems in reverse order: the variable issue can be solved simply by not having a persistent variable store, since any mutable configuration information could be used to subvert the boot and leak the secret. This is achieved by stripping all the mutable variable handling out of OVMF. Solving key stashing simply means getting OVMF to set aside a page for a secret area and having QEMU recognise where it is for the secret injection. It turns out AMD were already working on a QEMU configuration table at a known location by the Reset Vector in OVMF, so the secret area is added as one of these entries. Once this is done, QEMU can retrieve the injection location from the OVMF binary so it doesn’t have to be specified in the QEMU Machine Protocol (QMP) command. Finally OVMF can protect the secret and package it up as an EFI configuration table for later collection by the bootloader.

The final OVMF change (which is in the same patch set) is to pull grub inside a Firmware Volume and execute it directly. This certainly isn’t the only possible solution to the problem (adding secure boot or an encrypted filesystem were other possibilities) but it is the simplest solution that gives a verifiable component that can be invariant across arbitrary encrypted boots (so the same OVMF can be used to execute any encrypted VM securely). This latter is important because traditionally OVMF is supplied by the host owner rather than being part of the VM image supplied by the guest owner. The grub script that runs from the combined volume must still be trusted to either decrypt the root or reboot to avoid leaking the key. Although the host owner still supplies the combined OVMF, the measurement assures the guest owner of its correctness, which is why having a fairly invariant component is a good idea … so the guest owner doesn’t have potentially thousands of different measurements for approved firmware.

Plugging the Gaps: QEMU

The modifications to QEMU are fairly simple, it just needs to scan the OVMF file to determine the location for the injected secret and inject it correctly using a QMP command.. Since secret injection is already upstream, this is a simple find and make the location optional patch set.

Plugging the Gaps: Grub

Grub today only allows for the manual input of the cryptodisk password. However, in the cloud we can’t do it this way because there’s no guarantee of a secure tty channel to the VM. The solution, therefore, is to modify grub so that the cryptodisk can use secrets from a provider, in addition to the manual input. We then add a provider that can read the efi configuration tables and extract the secret table if it exists. The current incarnation of the proposed patch set is here and it allows cryptodisk to extract a secret from an efisecret provider. Note this isn’t quite the same as the form expected by the upstream OVMF patch in its grub.cfg because now the provider has to be named on the cryptodisk command line thus

cryptodisk -s efisecret

but in all other aspects, Grub/grub.cfg works. I also discovered several other deviations from the initial grub.cfg (like Fedora uses /boot/grub2 instead of /boot/grub like everyone else) so the current incarnation of grub.cfg is here. I’ll update it as it changes.

Putting it All Together

Once you have applied all the above patches and built your version of OVMF with grub inside, you’re ready to do a confidential computing encrypted boot. However, you still need to verify the measurement and inject the encrypted secret. As I said before, this isn’t easy because, due to replay defeat requirements, the secret bundle must be constructed on the fly for each VM boot. From this point on I’m going to be using only AMD SEV as the example because the Intel hardware doesn’t yet exist and AMD kindly gave IBM research a box to play with (Anyone with a new EPYC 7xx1 or 7xx2 based workstation can likely play along at home, but check here). The first thing you need to do is construct a launch bundle. AMD has a tool called sev-tool to do this for you and the first thing you need to do is obtain the platform Diffie Hellman certificate (pdh.cert). The tool will extract this for you

sevtool --pdh_cert_export

Or it can be given to you by the cloud service provider (in this latter case you’ll want to verify the provenance using sevtool –validate_cert_chain, which contacts the AMD site to verify all the details). Once you have a trusted pdh.cert, you can use this to generate your own guest owner DH cert (godh.cert) which should be used only one time to give a semblance of ECDHE. godh.cert is used with pdh.cert to derive an encryption key for the launch bundle. You can generate this with

sevtool --generate_launch_blob <policy>

The gory details of policy are in the SEV manual chapter 3, but most guests use 1 which means no debugging. This command will generate the godh.cert, the launch_blob.bin and a tmp_tk.bin file which you must save and keep secure because it contains the Transport Encryption and Integrity Keys (TEK and TIK) which will be used to encrypt the secret. Figuring out the qemu command line options needed to launch and pause a SEV guest is a bit of a palaver, so here is mine. You’ll likely need to change things, like the QMP port and the location of your OVMF build and the launch secret.

Finally you need to get the launch measure from QMP, verify it against the sha256sum of OVMF.fd and create the secret bundle with the correct GUID headers. Since this is really fiddly to do with sevtool, I wrote this python script3 to do it all (note it requires qmp.py from the qemu git repository). You execute it as

sevsecret.py --passwd <disk passwd> --tiktek-file <location of tmp_tk.bin> --ovmf-hash <hash> --socket <qmp socket>

And it will verify the launch measure and encrypt the secret for the VM if the measure is correct and start the VM. If you got everything correct the VM will simply boot up without asking for a password (if you inject the wrong secret, it will still ask). And there you have it: you’ve booted up a confidential VM from an encrypted image file. If you’re like me, you’ll also want to fire up gdb on the qemu process just to show that the entire memory of the VM is encrypted …

Conclusions and Caveats

The above script should allow you to boot an encrypted VM anywhere: locally or in the cloud, provided you can access the QMP port (most clouds use libvirt which introduces yet another additional layering pain). The biggest drawback, if you refer to the diagram, is the yellow box: you must trust the secret element, which in both Intel and AMD is proprietary4, in order to get confidential computing to work. Although there is hope that in future the secret element could be fully open source, it isn’t today.

The next annoyance is that launching a confidential VM is high touch requiring collaboration from both the guest owner and the host owner (due to the anti-replay nonce). For a single launch, this is a minor annoyance but for an autoscaling (launch VMs as needed) platform it becomes a major headache. The solution seems to be to have some Hardware Security Module (HSM), like the cloud uses today to store encryption keys securely, and have it understand how to measure and launch encrypted VMs on behalf of the guest owner.

The final conclusion to remember is that confidentiality is not security: your VM is as exploitable inside a confidential encrypted VM as it was outside. In many ways confidentiality and security are opposites, in that security in part requires reducing the trusted code and confidentiality requires pulling as much as possible inside. Confidential VMs do have an answer to the Cloud trust problem since the enterprise can now deploy VMs without fear of tampering by the cloud provider, but those VMs are as insecure in the cloud as they were in the Enterprise Data Centre. All of this argues that Confidential Computing, while an important milestone, is only one step on the journey to cloud security.

Patch Status

The OVMF patches are upstream (including modifications requested by Intel for TDX). The QEMU and grub patch sets are still on the lists.

Building Encrypted Images for Confidential Computing

With both Intel and AMD announcing confidential computing features to run encrypted virtual machines, IBM research has been looking into a new format for encrypted VM images. The first question is why a new format, after all qcow2 only recently deprecated its old encrypted image format in favour of luks. The problem is that in confidential computing, the guest VM runs inside the secure envelope but the host hypervisor (including the QEMU process) is untrusted and thus runs outside the secure envelope and, unfortunately, even for the new luks format, the encryption of the image is handled by QEMU and so the encryption key would be outside the secure envelope. Thus, a new format is needed to keep the encryption key (and, indeed, the encryption mechanism) within the guest VM itself. Fortunately, encrypted boot of Linux systems has been around for a while, and this can be used as a practical template for constructing a fully confidential encrypted image format and maintaining that confidentiality within a hostile cloud environment. In this article, I’ll explore the state of the art in encrypted boot, constructing EFI encrypted boot images, and finally, in the follow on article, look at deploying an encrypted image into a confidential environment and maintaining key secrecy in the cloud.

Encrypted Boot State of the Art

Luks and the cryptsetup toolkit have been around for a while and recently (in 2018), the luks format was updated to version 2. However, actually booting a linux kernel from an encrypted partition has always been a bit of a systems problem, primarily because the bootloader (grub) must decrypt the partition to actually load the kernel. Fortunately, grub can do this, but unfortunately the current grub in most distributions (2.04) can only read the version 1 luks format. Secondly, the user must type the decryption passphrase into grub (so it can pull the kernel and initial ramdisk out of the encrypted partition to boot them), but grub currently has no mechanism to pass it on to the initial ramdisk for mounting root, meaning that either the user has to type their passphrase twice (annoying) or the initial ramdisk itself has to contain a file with the disk passphrase. This latter is the most commonly used approach and only has minor security implications when the system is in motion (the ramdisk and the key file must be root read only) and the password is protected at rest by the fact that the initial ramdisk is also on the encrypted volume. Even more annoying is the fact that there is no distribution standard way of creating the initial ramdisk. Debian (and Ubuntu) have the most comprehensive documentation on how to do this, so the next section will look at the much less well documented systemd/dracut mechanism.

Encrypted Boot for Systemd/Dracut

Part of the problem here seems to be less that stellar systems co-ordination between the two components. Additionally, the way systemd supports passphraseless encrypted volumes has been evolving for a while but changed again in v246 to mirror the Debian method. Since cloud images are usually pretty up to date, I’ll describe this new way. Each encrypted volume is referred to by UUID (which will be the UUID of the containing partition returned by blkid). To get dracut to boot from an encrypted partition, you must pass in

rd.luks.uuid=<UUID>

but you must also have a key file named

/etc/cryptsetup-keys.d/luks-<UUID>.key

And, since dracut hasn’t yet caught up with this, you usually need a cryptodisk.conf file in /etc/dracut.conf.d/ which contains

install_items+=" /etc/cryptsetup-keys.d/* "

Grub and EFI Booting Encrypted Images

Traditionally grub is actually installed into the disk master boot record, but for EFI boot that changed and the disk (or VM image) must have an EFI System partition which is where the grub.efi binary is installed. Part of the job of the grub.efi binary is to find the root partition and source the /boot/grub1/grub.cfg. When you install grub on an EFI partition a search for the root by UUID is actually embedded into the grub binary. Another problem is likely that your distribution customizes the location of grub and updates the boot variables to tell the system where it is. However, a cloud image can’t rely on the boot variables and must be installed in the default location (\EFI\BOOT\bootx64.efi). This default location can be achieved by adding the –removable flag to grub-install.

For encrypted boot, this becomes harder because the grub in the EFI partition must set up the cryptographic location by UUID. However, if you add

GRUB_ENABLE_CRYPTODISK=y

To /etc/default/grub it will do the necessary in grub-install and grub-mkconfig. Note that on Fedora, where every other GRUB_ENABLE parameter is true/false, this must be ‘y’, unfortunately grub-install will look for =y not =true.

Putting it all together: Encrypted VM Images

Start by extracting the root of an existing VM image to a tar file. Make sure it has all the tools you will need, like cryptodisk and grub-efi. Create a two partition raw image file and loopback mount it (I usually like 4GB) with a small efi partition (p1) and an encrypted root (p2):

truncate -s 4GB disk.img
parted disk.img mklabel gpt
parted disk.img mkpart primary 1Mib 100Mib
parted disk.img mkpart primary 100Mib 100%
parted disk.img set 1 esp on
parted disk.img set 1 boot on

Now setup the efi and cryptosystem (I use ext4, but it’s not required). Note at this time luks will require a password. Use a simple one and change it later. Also note that most encrypted boot documents advise filling the encrypted partition with random numbers. I don’t do this because the additional security afforded is small compared with the advantage of converting the raw image to a smaller qcow2 one.

losetup -P -f disk.img          # assuming here it uses loop0
l=($(losetup -l|grep disk.img)) # verify with losetup -l
mkfs.vfat ${l}p1
blkid ${l}p1       # remember the EFI partition UUID
cryptsetup --type luks1 luksFormat ${l}p2 # choose temp password
blkid ${l}p2       # remember this as <UUID> you'll need it later 
cryptsetup luksOpen ${l}p2 cr_root
mkfs.ext4 /dev/mapper/cr_root
mount /dev/mapper/cr_root /mnt
tar -C /mnt -xpf <vm root tar file>
for m in run sys proc dev; do mount --bind /$m /mnt/$m; done
chroot /mnt

Create or modify /etc/fstab to have root as /dev/disk/cr_root and the EFI partition by label under /boot/efi. Now set up grub for encrypted boot2

echo "GRUB_ENABLE_CRYPTODISK=y" >> /etc/default/grub
mount /boot/efi
grub-install --removable --target=x86_64-efi
grub-mkconfig -o /boot/grub/grub.cfg

For Debian, you’ll need to add an /etc/crypttab entry for the encrypted disk:

cr_root UUID=<uuid> luks none

And then re-create the initial ramdisk. For dracut systems, you’ll have to modify /etc/default/grub so the GRUB_CMDLINE_LINUX has a rd.luks.uuid=<UUID> entry. If this is a selinux based distribution, you may also have to trigger a relabel.

Now would also be a good time to make sure you have a root password you know or to install /root/.ssh/authorized_keys. You should unmount all the binds and /mnt and try EFI booting the image. You’ll still have to type the password a couple of times, but once the image boots you’re operating inside the encrypted envelope. All that remains is to create a fast boot high entropy low iteration password and replace the existing one with it and set the initial ramdisk to use it. This example assumes your image is mounted as SCSI disk sda, but it may be a virtual disk or some other device.

dd if=/dev/urandom bs=1 count=33|base64 -w 0 > /etc/cryptsetup-keys.d/luks-<UUID>.key
chmod 600 /etc/cryptsetup-keys.d/luks-<UUID>.key
cryptsetup --key-slot 1 luksAddKey /dev/sda2 # permanent recovery key
cryptsetup --key-slot 0 luksRemoveKey /dev/sda2 # remove temporary
cryptsetup --key-slot 0 --iter-time 1 luksAddKey /dev/sda2 /etc/cryptsetup-keys.d/luks-<UUID>.key

Note the “-w 0” is necessary to prevent the password from having a trailing newline which will make it difficult to use. For mkinitramfs systems, you’ll now need to modify the /etc/crypttab entry

cr_root UUID=<UUID> /etc/cryptsetup-keys.d/luks-<UUID>.key luks

For dracut you need the key install hook in /etc/dracut.conf.d as described above and for Debian you need the keyfile pattern:

echo "KEYFILE_PATTERN=\"/etc/cryptsetup-keys.d/*\"" >>/etc/cryptsetup-initramfs/conf-hook

You now rebuild the initial ramdisk and you should now be able to boot the cryptosystem using either the high entropy password or your rescue one and it should only prompt in grub and shouldn’t prompt again. This image file is now ready to be used for confidential computing.