Category Archives: Uncategorized

Converting Engines to OpenSSL-3 Providers

Engines in OpenSSL have a long history of providing new algorithms (Russian GOST hash/signature etc) but they can also be used to interface external crypto tokens (pkcs#11) or even key managers like my own TPM engine. I’ve actually been using my TPM2 engine for nearly a decade so that I no longer have to have an unprotected private keys anywhere on my laptops (including for ssh). The purpose of this post is to look at the differences between Providers and Engines and give advice on the minimum necessary Provider implementation to give back all the Engine functionality. So this post is aimed at Engine developers who wish to convert to Providers rather than giving user advice for either.

TPMs and Engines

TPM2 actually has a remarkable number of algorithms: hashing, symmetric encryption, asymmetric signatures, key derivation, etc. However, most TPMs are connected to the host over very slow busses (usually serial), which means that no-one in their right mind would use a TPM for bulk data operations (like hashing or symmetric encryption) since it will take orders of magnitude longer than if the native CPU did it. Thus from an Engine point of view, the TPM is really only good for guarding private asymmetric keys and doing sign or decrypt operations on them, which are the only capabilities the TPM engine has.

Hashes and Signatures

Although I said above we don’t use the TPM for doing hashes, the TPM2_Sign() routines insist on knowing which hash they’re signing. For ECDSA signatures, this is irrelevant, since the hash type plays no part in the signature (it’s always truncated to key length and converted to a bignum) but for RSA the ASN.1 form of the hash description is part of the toBeSigned data. The problem now is that early TPM2’s only had two hash algorithms (sha1 and sha256) and the engine wanted to be able to use larger hash sizes. The solution was actually easy: lie about the hash size for ECDSA, so always give the hash that’s the width of the key (sha256 for NIST P-256 and sha384 for NIST P-384) and left truncate the passed in hash if larger or left zero pad if smaller.

For RSA, the problem is more acute, since TPM2_Sign() actually takes a raw digest and adds the hash description but the engine code sends down the fully described hash which merely needs to be padded if PKCS1 (PSS data is fully padded when sent down) and encrypted with the private key. The solution to this taken years ago was not to bother with TPM2_Sign() at all for RSA keys but instead to do a Decrypt operation¹. This also means that TPM RSA engine keys are marked as decryption keys, not signing keys.

The Engine Itself

Given that the TPM is really only guarding the private keys, it only makes sense to substitute engine functions for the private key operations. Although the TPM can do public key operations, the core OpenSSL routines do them much faster and no information is leaked about the private key by doing them through OpenSSL, so Engine keys were constructed from standard OpenSSL keys by substituting a couple of private key methods from the underlying key types. One thing Engines were really bad at was passing additional parameters at key creation time and doing key wrapping. The result is that most Engines already have a separate tool to create engine keys (create_tpm2_key for the TPM2 engine) because complex arguments are needed for TPM specific things like key policy.

TPM keys are really both public and private keys combined and the public part of the key can be accessed without a password (unlike OpenSSL keys) or even access to the TPM that created the key. However, the engine code doesn’t usually know when only the public part of the key will be required and password prompting is done in OpenSSL at key loading (the TPM doesn’t need a password until key use), so usually after a TPM key is created, the public key is also separately derived using a pkey operation and used as a normal public key.

The final, and most problematic Engine feature, is key loading. Engine keys must be loaded using a special API (ENGINE_load_private_key). OpenSSL built in applications require you to specify the key type (-keyform option) but most well written OpenSSL applications simply try loading the PEM key first, then the DER key then the Engine key (since they all have different APIs), but frequently the Engine key is forgotten leading to the application having to be patched if you want to use them with any engine.

Converting Engines to Providers

The provider API has several pieces which apply to asymmetric key handling: Store, Encode/Decode, Key Management, Signing and Decryption (plus many more if you provide hashes or symmetric algorithms). One thing to remember about the store API is that if you only have file based keys, you should use the generic file store instead. Implementing your own store is only necessary if you also have a URI based input (like PKCS#11). In fact the TPM Engine has a URI for persistent keys, so the TPM store implementation will be dealt with later.

Provider Basics

If a provider is specified on the OpenSSL command line, it will become the sole provider of every algorithm. That means that providers like the TPM2 one, which only fill in a subset of functions cannot operate on their own and must always be used with another provider (usually the default one). After initialization (see below) all provider actions are governed by algorithm tables. One of the key questions for any provider is what to do about algorithm names and properties. Because the TPM2 provider relies on external providers for other algorithms, it must use consistent key names (so “EC” for Elliptic curve and “RSA” for RSA), even though it has only a single key type. There are also elements of the provider key managements, like the way Elliptic Curve keys change name to “ECDSA” for signing and “ECDH” for derivation, which is driven by the key management query operation function. As far as I can tell, this provides no benefit and merely serves to add complexity to the provider, so my provider doesn’t implement these functions and uses the same key names throughout.

The most mysterious string of all is the algorithm property one. The manual gives very little clue as to what should be in it besides “provider=<provider name>”. Empirically it seems to have input, output and structure elements, which are primarily used by encoders and decoders: input can be either der or pem and structure must be the same as the OSSL_OBJECT_PARAM_DATA_STRUCTURE string produced by the der decoder (although you are free to choose any name for this). output is even more varied and the best current list is provided by the source; however the only encoder the TPM2 provider actually provides is the text one.

One of the really nice things about providers is that when OpenSSL is presented with a key to load, every provider will be tried (usually in the order they’re specified on the command line) to decode and load the key. This completely fixes the problem with missing ENGINE_load_private_key() functions is applications because now all applications can use any provider key. This benefit alone is enough to outweigh all the problems of doing the actual conversion to a provider.

Replacing Engine Controls

Engine controls were key/value pairs passed into engines. The TPM2 engine has two: “PIN” for the parent authority and “NVPREFIX” for the prefix which identifies a non-volatile key. Although these can be passed in with the ENGINE_ctrl() functions, they were mostly set in the configuration file. This latter mechanism can be replaced with the provider base callback core_get_params(). Most engine controls actually set global variables and with the provider, they could be placed into the provider context. However, for code sharing it’s easier simply to keep the current globals mechanism.

Initialization and Contexts

Every provider has to have an OSSL_provider_init() routine which fills in a dispatch table and allocates a core context, which is passed in to every other context routine. For a provider, there’s really only one instance, so storing variables in the provider context is really no different (except error handling and actually getting destructors) from using static variables and since the engine used static variables, that’s what we’ll stick with. However, pretty much every routine will need an allocated library context, so it’s easiest to allocate at provider init time and pass it through as the provider context. The dispatch routine must contain a query_operation function, and probably needs a teardown function if you need to use a destructor, but nothing else.

All provider function groups require a newctx() and freectx() call. This is not optional because the current OpenSSL code calls them without checking so they cannot be NULL. Thus for function groups (like encoders and key management) where new contexts aren’t really required it makes sense to use pass through context functions that simply pass through the provider context for newctx() and do nothing for freectx().

The man page implies it is necessary to pick a load of functions from the in argument, but it seems unnecessary for those which the OpenSSL library already provides. I assume it’s something to do with a provider not requiring OpenSSL symbols, but it’s impossible to implement a provider today without relying on other OpenSSL functions than those which can be picked out of the in argument.

Decoders

Decoders are used to convert a read file from PEM to DER (this is essentially the same conversion for every provider, so it is strange you have to do this rather than it being done in the core routines) and then DER to an internal key structure. The remaining decoders take DER in and output a labelled key structure (which is used as a component of the EVP_PKEY), if you do both RSA and EC keys, you need one for each key type and, unfortunately, they must be provided and may not cross decode (the RSA decoder must reject EC keys and vice versa). This is actually required so the OpenSSL core can tell what type of key it has but is a royal pain for things like the TPM where the key DER is identical regardless of key type:

const OSSL_ALGORITHM decoders[] = {
	{ "DER", "provider=tpm2,input=pem", decode_pem_fns },
	{ "RSA", "provider=tpm2,input=der,structure=TPM2", decode_rsa_fns },
	{ "EC", "provider=tpm2,input=der,structure=TPM2", decode_ec_fns },
	{ NULL, NULL, NULL }
};

The decode_pem_fns can be cut and pasted from any provider with the sole exception that you probably have a different PEM guard string that you need to check for.

Then a sample decoder function set looks like:

static const OSSL_DISPATCH decode_rsa_fns[] = {
	{ OSSL_FUNC_DECODER_NEWCTX, (void (*)(void))tpm2_passthrough_newctx },
	{ OSSL_FUNC_DECODER_FREECTX, (void (*)(void))tpm2_passthrough_freectx },
	{ OSSL_FUNC_DECODER_DECODE, (void (*)(void))tpm2_rsa_decode },
	{ 0, NULL }
};

The main job of the DECODER_DECODE function is to take the DER form of the key and convert it to an internal PKEY and send that PKEY up by reference so it can be consumed by a key management load.

Encoders

By and large, engines all come with creation tools for key files, which means that while you could now use the encoder routines to create key files, it’s probably better off to stick with what you have (especially for things like the TPM that can have complex policy statements attached to keys), so you can omit providing any encoder functions at all. The only possible exception is if you want the keys pretty printing, you might consider a text output encoder:

const OSSL_ALGORITHM encoders[] = {
	{ "RSA", "provider=tpm2,output=text", encode_text_fns },
	{ "EC", "provider=tpm2,output=text", encode_text_fns },
	{ NULL, NULL, NULL }
};

Which largely follows the format for decoders:

static const OSSL_DISPATCH encode_text_fns[] = {
	{ OSSL_FUNC_ENCODER_NEWCTX, (void (*)(void))tpm2_passthrough_newctx },
	{ OSSL_FUNC_ENCODER_FREECTX, (void (*)(void))tpm2_passthrough_freectx },
	{ OSSL_FUNC_ENCODER_ENCODE, (void (*)(void))tpm2_encode_text },
	{ 0, NULL }
};

Note: there are many more encode/decode function types you could supply, but the above are the essential ones.

Key Management

Nothing in the key management functions requires the underlying key object to be reference counted since it belongs to an already reference counted EVP_PKEY structure in the OpenSSL generic routines. However, the signature operations can’t be implemented without context duplication and the signature context must contain a reference to the provider key so, depending on how the engine implements keys, duplicating via reference might be easier than duplicating via copy. The minimum functionality to implement is LOAD, FREE and HAS. If you are doing Elliptic Curve derive or reference counting your engine keys, you will also need NEW. You also have to provide both GET_PARAMS and GETTABLE_PARAMS (many key management functions have to implement pairs like this) for at least the BITS, SECURITY_BITS and SIZE properties)².

You must also implement the EXPORT (and EXPORT_TYPES, which must be provided but has no callers) so that you can convert your engine key to an external public key. Note the EXPORT function must fail if asked to export the private key otherwise the default provider will try to do the private key operations via the exported key as well.

If you need to do Elliptic Curve key derivation you must also implement IMPORT (and IMPORT_TYPES) because the creation of the peer key (even though it’s a public one) will necessarily go through your provider key managment functions.

The HAS function can be problematic because OpenSSL doesn’t assume the interchangeability of public and private keys, even if it is true of the engine. Thus the engine must remember in the decode routines what key selector was used (public, private or both) and make sure to condition HAS on that value.

Signatures

This is one of the most confusing areas for simple signing devices (which don’t do hashing) because you’d assume you can implement NEWCTX, FREECTX, SIGN_INIT and SIGN and be done. Unfortunately, in spite of the fact that all the DIGEST_SIGN_… functions can be implemented in terms of the previous functions and generic hashing, they aren’t, so all providers are required to duplicate hashing and signing functions including constructing the binary ASN.1 for the certificate signature function (via GET_CTX_PARAMS and its pair GETTABLE_CTX_PARAMS). Another issue a sign only token will get into is padding: OpenSSL supports a variety of padding schemes (for RSA) but is deprecating their export, so if your token doesn’t do an expected form of padding, you’ll need to implement that in your provider as well. Recalling that the TPM2 provider uses RSA Decryption for signatures means that the TPM2 provider implementation is entirely responsible for padding all signatures. In order to try to come up with a common solution, I added an opensslmissing directory to my provider under the MIT licence that anyone is free to incorporate into their provider if they end up having the same digest and padding problems I did.

Decryption and Derivation

The final thing a private key provider needs to do is decryption. This is a very different operation between Elliptic Curve and RSA keys, so you need two different operations for each (OSSL_OP_ASYM_CIPHER for RSA and OSSL_OP_KEYEXCH for EC). Each ends up being a slightly special snowflake: RSA because it may need OAEP padding (which the TPM does) but with the most usual cipher being md5 (so OAEP padding with arbitrary mask and hash function is also in opensslmissing), which the TPM doesn’t do. and EC because it requires derivation from another public key. The problem with this latter operation is that because of the way OpenSSL works, the public key must be imported into the provider before it can be used, so you must provide NEW, IMPORT and IMPORT_TYPES routines for key management for this to happen.

Store

The store functions only need to be used if you have to load keys that aren’t file based (for file based keys the default provider file store will load them). For the TPM there are a set of NV Keys with 0x81 MSO prefix that aren’t file based. We load these in the engine with //nvkey:<hex> as the designator (and the //nvkey: prefix is overridable in the config file). To get this to work in the Provider is slightly problematic because the scheme (the //nvkey: prefix) must be specified as the provider algorithm_name which is usually a constant in a static array. This means that the stores actually can’t be static and must have the configuration defined name poked into it before the store is used, but this is relatively easy to arrange this in the OSSL_provider_init() function. Once this is done, it’s relatively easy to create a store. The only really problematic function is the STORE_EOF one, which is designed around files but means you have to keep an eof indicator in the context and update it to be 1 once the load function has complete.

The Provider Recursion Problem

This doesn’t seem to be discussed anywhere else, but it can become a huge issue if your provider depends on another library which also uses OpenSSL. The TPM2 provider depends on either the Intel or IBM TSS libraries and both of those use OpenSSL for cryptographic operations around TPM transport security since both of them use ECDH to derive a seed for session encryption and HMAC. The problem is that ordinarily the providers are called in the order they’re listed, so you always have to specify –provider default –provider tpm2 to make up for the missing public key operations in the TPM2 provider. However, the OpenSSL core operates a cache for the provider operations it has previously found and searches the cache first before doing any other lookups, so if the EC key management routines are cached (as they are if you input a TPM format key) and the default ones aren’t (because inputting TPM format keys requires no public key operations), the next attempt to generate an ephemeral EC key for the ECDH security derivation will find the TPM2 provider first. So say you are doing a signature which requires HMAC security to guard against interposer tampering. The use of ECDH in the HMAC seed derivation will then call back into the provider to do an ECDH operation which also requires session security and will thus call back again into the provider ad infinitum (or at least until stack overflow). The only way to break out of this infinite recursion is to try to prime the cache with the default provider as well as the TPM2 provider, so the tss library functions can find the default provider first. The (absolutely dirty) hack I have to do this is inside the pkey decode function as

	if (alg == TPM_ALG_ECC) {
		EVP_PKEY_CTX *ctx = EVP_PKEY_CTX_new_id(EVP_PKEY_EC, NULL);
		EVP_PKEY_CTX_free(ctx);
	}

Which currently works to break the recursion loop. However it is an unreliable hack because internally the OpenSSL hash bucket implementation orders the method cache by provider address and since the TPM2 provider is dynamically loaded it has a higher address than the OpenSSL default one. However, this will not survive security techniques like Address Space Layout Randomization.

Conclusions

Hopefully I’ve given a rapid (and possibly useful) overview of converting an engine to a provider which will give some pointers about provider conversion to all the engine token implementations out there. Please feel free to repurpose my opensslmissing routines under the MIT licence without any obligations to get them back upstream (although I would be interested in hearing about bugs and feature enhancements). In the end, it was only 1152 lines of C to implement the TPM2 provider (additive on top of the common shared code base with the existing Engine) and 681 lines in opensslmissing, showing firstly that there is still an need for OpenSSL itself to do the missing routines as a provider export and secondly that it really takes a fairly small amount of provider code to wrapper an existing engine implementation provided you’re discriminating about what functions you actually provide. As a final remark I should note that the openssl_tpm2_engine has a fairly extensive test suite which all now pass with the provider implementation as well.

Paying Maintainers isn’t a Magic Bullet

4 Replies

Over the last few years it’s become popular to suggest that open source maintainers should be paid. There are a couple of stated motivations for this, one being that it would improve the security and reliability of the ecosystem (pioneered by several companies like Tidelift) and the others contending that it would be a solution to the maintainer burnout and finally that it would solve the open source free rider problem. The purpose of this blog is to examine each of these in turn to seeing if paying maintainers actually would solve the problem (or, for some, does the problem even exist in the first place).

Free Riders

The free rider problem is simply expressed: There’s a class of corporations which consume open source, for free, as the foundation of their profits but don’t give back enough of their allegedly ill gotten gains. In fact, a version of this problem is as old as time: the “workers” don’t get paid enough (or at all) by the “bosses”; greedy people disproportionately exploit the free but limited resources of the planet. Open Source is uniquely vulnerable to this problem because of the free (as in beer) nature of the software: people who don’t have to pay for something often don’t. Part of the problem also comes from the general philosophy of open source which tries to explain that it’s free (as in freedom) which matters not free (as in beer) and everyone, both producers and consumers should care about the former. In fact, in economic terms, the biggest impact open source has had on industry is from the free (as in beer) effect.

Open Source as a Destroyer of Market Value

Open Source is often portrayed as a “disrupter” of the market, but it’s not often appreciated that a huge part of that disruption is value destruction. Consider one of the older Open Source systems: Linux. As an operating system (when coupled with GNU or other user space software) it competed in the early days with proprietary UNIX. However, it’s impossible to maintain your margin competing against free and the net result was that one by one the existing players were forced out of the market or refocussed on other offerings and now, other than for historical or niche markets, there’s really no proprietary UNIX maker left … essentially the value contained within the OS market was destroyed. This value destruction effect was exploited brilliantly by Google with Android: to enter and disrupt an existing lucrative smart phone market, created and owned by Apple, with a free OS based on Open Source successfully created a load of undercutting handset manufacturers eager to be cheaper than Apple who went on to carve out an 80% market share. Here, the value isn’t completely destroyed, but it has significantly reduced (smart phones going from a huge margin business to a medium to low margin one).

All of this value destruction is achieved by the free (as in beer) effect of open source: the innovator who uses it doesn’t have to pay the full economic cost for developing everything from scratch, they just have to pay the innovation expense of adapting it (such adaptation being made far easier by access to the source code). This effect is also the reason why Microsoft and other companies railed about Open Source being a cancer on intellectual property: because it is³. However, this view is also the product of rigid and incorrect thinking: by destroying value in existing markets, open source presents far more varied and unique opportunities in newly created ones. The cardinal economic benefit of value destruction is that it lowers the barrier to entry (as Google demonstrated with Android) thus opening the market up to new and varied competition (or turning monopoly markets into competitive ones).

Envy isn’t a Good Look

If you follow the above, you’ll see the supposed “free rider” problem is simply a natural consequence of open source being free as in beer (someone is creating a market out of the thing you offered for free precisely because they didn’t have to pay for it): it’s not a problem to be solved, it’s a consequence to be embraced and exploited (if you’re clever enough). Not all of us possess the business acumen to exploit market opportunities like this, but if you don’t, envying those who do definitely won’t cause your quality of life to improve.

The bottom line is that having a mechanism to pay maintainers isn’t going to do anything about this supposed “free rider” problem because the companies that exploit open source and don’t give back have no motivation to use it.

Maintainer Burnout

This has become a hot topic over recent years with many blog posts and support groups devoted to it. From my observation it seems to matter what kind of maintainer you are: If you only have hobby projects you maintain on an as time becomes available basis, it seems the chances of burn out isn’t high. On the other hand, if you’re effectively a full time Maintainer, burn out becomes a distinct possibility. I should point out I’m the former not the latter type of maintainer, so this is observation not experience, but it does seem to me that burn out at any job (not just that of a Maintainer) seems to happen when delivery expectations exceed your ability to deliver and you start to get depressed about the ever increasing backlog and vocal complaints. In industry when someone starts to burn out, the usual way of rectifying it is either lighten the load or provide assistance. I have noticed that full time Maintainers are remarkably reluctant to give up projects (presumably because each one is part of their core value), so helping with tooling to decrease the load is about the only possible intervention here.

As an aside about tooling, from parallels with Industry, although tools correctly used can provide useful assistance, there are sure fire ways to increase the possibility of burn out with inappropriate demands:

It does strike me that some of our much venerated open source systems, like github, have some of these same management anti-patterns, like encouraging Maintainers to chase repository stars to show value, having a daily reminder of outstanding PRs and Issues, showing everyone who visits your home page your contribution records for every project over the last year.

To get back to the main point, again by parallel with Industry, paying people more doesn’t decrease industrial burn out; it may produce a temporary feel good high, but the backlog pile eventually overcomes this. If someone is already working at full stretch at something they want to do giving them more money isn’t going to make them stretch further. For hobby maintainers like me, even if you could find a way to pay me that my current employer wouldn’t object to, I’m already devoting as much time as I can spare to my Maintainer projects, so I’m unlikely to find more (although I’m not going to refuse free money …).

Security and Reliability

Everyone wants Maintainers to code securely and reliably and also to respond to bug reports within a fixed SLA. Obviously usual open source Maintainers are already trying to code securely and reliably and aren’t going to do the SLA thing because they don’t have to (as the licence says “NO WARRANTY …”), so paying them won’t improve the former and if they’re already devoting all the time they can to Maintenance, it won’t achieve the latter either. So how could Security and Reliability be improved? All a maintainer can really do is keep current with current coding techniques (when was the last time someone offered a free course to Maintainers to help with this?). Suggesting to a project that if they truly believed in security they’d rewrite it in Rust from tens of thousands of lines of C really is annoying and unhelpful.

One of the key ways to keep software secure and reliable is to run checkers over the code and do extensive unit and integration testing. The great thing about this is that it can be done as a side project from the main Maintenance task provided someone triages and fixes the issues generated. This latter is key; simply dumping issue reports on an overloaded maintainer makes the overload problem worse and adds to a pile of things they might never get around to. So if you are thinking of providing checker or tester resources, please also think how any generated issues might get resolved without generating more work for a Maintainer.

Business Models around Security and Reliability

A pretty old business model for code checking and testing is to have a distribution do it. The good ones tend to send patches upstream and their business model is to sell the software (or at least a support licence to it) which gives the recipients a SLA as well. So what’s the problem? Mainly the economics of this tried and trusted model. Firstly what you want supported must be shipped by a distribution, which means it must have a big enough audience for a distribution to consider it a fairly essential item. Secondly you end up paying a per instance use cost that’s an average of everything the distribution ships. The main killer is this per instance cost, particularly if you are a hyperscaler, so it’s no wonder there’s a lot of pressure to shift the cost from being per instance to per project.

As I said above, Maintainers often really need more help than more money. One good way to start would potentially be to add testing and checking (including bug fixing and upstreaming) services to a project. This would necessarily involve liaising with the maintainer (and could involve an honorarium) but the object should be to be assistive (and thus scalably added) to what the Maintainer is already doing and prevent the service becoming a Maintainer time sink.

Professional Maintainers

Most of the above analysis assumed Maintainers are giving all the time they have available to the project. However, in the case where a Maintainer is doing the project in their spare time or is an Employee of a Company and being paid partly to work on the project and partly on other things, paying them to become a full time Maintainer (thus leaving their current employment) has the potential to add the hours spent on “other things” to the time spent on the project and would thus be a net win. However, you have to also remember that turning from employee to independent contractor also comes with costs in terms of red tape (like health care, tax filings, accounting and so on), which can become a significant time sink, so the net gain in hours to the project might not be as many as one would think. In an ideal world, entities paying maintainers would also consider this problem and offer services to offload the burden (although none currently seem to consider this). Additionally, turning from part time to full time can increase the problem of burn out, particularly if you spend increasing portions of your newly acquired time worrying about admin issues or other problems associated with running your own consulting business.

Conclusions

The obvious conclusion from the above analysis is that paying maintainers mostly doesn’t achieve it’s stated goals. However, you have to remember that this is looking at the problem thorough the prism of claimed end results. One thing paying maintainers definitely does do is increase the mechanisms by which maintainers themselves make a living (which is a kind of essential existential precursor). Before paying maintainers became a thing, the only real way of making a living as a maintainer was reputation monetization (corporations paid you to have a maintainer on staff or because being a maintainer demonstrated a skill set they needed in other aspects of their business) but now a Maintainer also has the option to turn Professional. Increasing the net ways of rewarding Maintainership therefore should be a net benefit to attracting people into all ecosystems.

In general, I think that paying maintainers is a good thing, but should be the beginning of the search for ways of remunerating Open Source contributors, not the end.

Linux Plumbers Conference Matrix and BBB integration

49 Replies

The recently completed Linux Plumbers Conference (LPC) 2021 used the Big Blue Button (BBB) project again as its audio/video online conferencing platform and Matrix for IM and chat. Why we chose BBB has been discussed previously. However this year we replaced RocketChat with Matrix to achieve federation, allowing non-registered conference attendees to join the chat. Also, based on feedback from our attendees, we endeavored to replace the BBB chat window with a Matrix one so anyone could see and participate in one contemporaneous chat stream within BBB and beyond. This enabled chat to be available before, during and after each session.

One thing that emerged from our initial disaster with Matrix on the first day is that we failed to learn from the experiences of other open source conferences (i.e. FOSDEM, which used Matrix and ran into the same problems). So, an object of this post is to document for posterity what we did and how to repeat it.

Integrating Matrix Chat into BBB

Most of this integration was done by Guy Lunardi.

It turns out that Chat is fairly deeply embedded into BBB, so replacing the existing chat module is hard. Fortunately, BBB also contains an embedded etherpad which is simply produced via an iFrame redirection. So what we did is to disable the BBB chat panel and replace it with a new iFrame based component that opened an embedded Matrix chat client. The client we chose was riot-embedded, which is a relatively recent project but seemed to work reasonably well. The final problem was to pass through user credentials. Up until three days before the conference, we had been happy with the embedded Matrix client simply creating a one-time numbered guest account every time it was opened, but we worried about this being a security risk and so implemented pass through login credentials at the last minute (life’s no fun unless you live dangerously).

Our custom front end for BBB (lpcfe) was created last year by Jon Corbet. It uses a fairly simple email/registration confirmation code for username/password via LDAP. The lpcfe front end Jon created is here git://git.lwn.net/lpcfe.git; it manages the whole of the conference log in process and presents the current and future sessions (with join buttons) according to the timezone of the browser viewing it.

The credentials are passed through directly using extra parameters to BBB (see commit fc3976e “Pass email and regcode through to BBB”). We eventually passed these through using a GET request. Obviously if we were using a secret password, this would be a problem, but since the password was a registration code handed out by a third party, it’s acceptable. I imagine if anyone wishes to take this work forward, add native Matrix device/session support in riot-embedded would be better.

The main change to get this working in riot-embedded is here, and the supporting patch to BBB is here.

Note that the Matrix room ID used by the client was added as an extra parameter to the flat text file that drives the conference track layout of lpcfe. All Matrix rooms were created as public (and published) so anyone going to our :lpc.events matrix domain could see and join them.

Setting up Matrix for the Conference

We used the matrix-synapse server and did a standard python venv pip install on Ubuntu of the latest tag. We created around 30+ public rooms: one for each Microconference and track of the conference and some admin and hallway rooms. We used LDAP to feed the authentication portion of lpcfe/Matrix, but we had a problem using email addresses since the standard matrix user name cannot have an ‘@’ symbol in it. Eventually we opted to transform everyone’s email to a matrix compatible form simply by replacing the ‘@’ with a ‘.’, which is why everyone in our conference appeared with ridiculously long matrix user names like @jejb.ibm.com:lpc.events

This ‘@’ to ‘.’ transformation was a huge source of problems due to the unwillingness of engineers to read instructions, so if we do this over again, we’ll do the transformation silently in the login javascript of our Matrix web client. (we did this in riot-embedded but ran out of time to do it in Element web as well).

Because we used LDAP, the actual matrix account for each user was created the first time they log into our server, so we chose at this point to use auto-join to add everyone to the 30+ LPC Matrix rooms we’d already created. This turned out to be a huge problem.

Testing our Matrix and BBB integration

We tried to organize a “Town Hall” event where we invited lots of people to test out the infrastructure we’d be using for the conference. Because we wanted this to be open, we couldn’t use the pre-registration/LDAP authentication infrastructure so Jon quickly implemented a guest mode (and we didn’t auto join anyone to any Matrix rooms other than the townhall chat).

In the end we got about 220 users to test during which time the Matrix and BBB infrastructure behaved quite well. Based on this test, we chose a 2 vCPU Linode VM for our Matrix server.

What happened on the Day

Come the Monday of the conference, the first problem we ran into was procrastination: the conference registered about 1,000 attendees, of whom, about 500 tried to log on about 5 minutes prior to the first session. Since accounts were created and rooms joined upon the first login, this is clearly a huge thundering herd problem of our own making … oops. The Matrix server itself shot up to 100% CPU on the python synapse process and simply stayed there, adding new users at a rate of about one every 30 seconds. All the chat tabs froze because logins were taking ages as well. The first thing we did was to scale the server up to a 16 CPU bare metal system, but that didn’t help because synapse is single threaded … all we got was the matrix synapse python process running at 100% one one of the CPUs, still taking 30 seconds per first log in.

Fixing the First Day problems

The first thing we realized is we had to multi-thread the synapse server. This is well known but the issue is also quite well hidden deep in the Matrix documents. It also happens that the Matrix documents are slightly incomplete. The first scaling attempt we tried: simply adding 16 generic worker apps to scale across all our physical CPUs failed because the Matrix server stopped federating and then the database crashed with “FATAL: remaining connection slots are reserved for non-replication superuser connections”.

Fixing the connection problem (alter system set max_connections = 1000;) triggered a shared memory too small issue which was eventually fixed by bumping the shared buffer segment to 8GB (alter system set shared_buffers=1024000;). I suspect these parameters were way too large, but the Linode we were on had 32GB of main memory, so fine tuning in this emergency didn’t seem a good use of time.

Fixing the worker problem was way more complex. The way Matrix works, you have to use a haproxy to redirect incoming connections to individual workers and you have to ensure that the same worker always services the same transaction (which you achieve by hashing on IP address). We got a lot of advice from FOSDEM on this aspect, but in the end, instead of using an external haproxy, we went for the built in forward proxy load balancing in nginx. The federation problem seems to be that Matrix simply doesn’t work without a federation sender. In the end, we created 15 generic workers and one each of media server, frontend server and federation sender.

Our configuration files are

once you have all the units enabled in systemd, you can then simply do systemctl start/stop matrix-synapse.target

Finally, to fix the thundering herd problem (for people who hadn’t already logged in), we ran through the entire spreadsheet of email/confirmation numbers doing an automatic login using the user management API on the server itself. At this point we had about half the accounts auto created, so this script created the rest.

emaillist=lpc2021-all-attendees.txt
IFS='   '

while read first last confirmation email; do
    bbblogin=${email/+*@/@}
    matrixlogin=${bbblogin/@/.}
    curl -XPOST -d '{"type":"m.login.password", "user":"'${matrixlogin}'", "password":"'${confirmation}'"}' "http://localhost:8008/_matrix/client/r0/login"
    sleep 1
done < ${emaillist}

The lpc2021-all-attendees.txt is a tab separated text file used to drive the mass mailings to plumbers attendees, but we adapted it to log everyone in to the matrix server.

Conclusion

With the above modifications, the matrix server on a Dedicated 32GB (16 cores) Linode ran smoothly for the rest of the conference. The peak load got to 17 and the peak total CPU usage never got above 70%. Finally, the peak memory usage was around 16GB including cache (so the server was a bit over provisioned).

In the end, 878 of the 944 registered attendees logged into our BBB servers at one time or another and we got a further 100 external matrix users (who may or may not also have had a conference account).

Retro Engineering: Updating a Nexus One for the modern world

4 Replies

A few of you who’ve met me know that my current Android phone is an ancient Nexus One. I like it partly because of the small form factor, partly because I’ve re-engineered pieces of the CyanogneMod OS it runs to suit me and can’t be bothered to keep upporting to newer versions and partly because it annoys a lot of people in the Open Source Community who believe everyone should always be using the latest greatest everything. Actually, the last reason is why, although the Nexus One I currently run is the original google gave me way back in 2010, various people have donated a stack of them to me just in case I might need a replacement.

However, the principle problem with running one of these ancient beasts is that they cannot, due to various flash sizing problems, run anything later than Android 2.3.7 (or CyanogenMod 7.1.0) and since the OpenSSL in that is ancient, it won’t run any TLS protocol beyond 1.0 so with the rush to move to encryption and secure the web, more and more websites are disallowing the old (and, lets admit it, buggy) TLS 1.0 protocol, meaning more and more of the web is steadily going dark to my mobile browser. It’s reached the point where simply to get a boarding card, I have to download the web page from my desktop and transfer it manually to the phone. This started as an annoyance, but it’s becoming a major headache as the last of the websites I still use for mobile service go dark to me. So the task I set myself is to fix this by adding the newer protocols to my phone … I’m an open source developer, I have the source code, it should be easy, right …?

First Problem, the source code and Build Environment

Ten years ago, I did build CyanogenMod from scratch and install it on my phone, what could be so hard about reviving the build environment today. Firstly there was finding it, but github still has a copy and the AOSP project it links to still keeps old versions, so simply doing a

curl https://dl-ssl.google.com/dl/googlesource/git-repo/repo > ~/bin/repo
repo init -u -u git://github.com/CyanogenMod/android.git -b gingerbread --repo-url=git://github.com/android/tools_repo.git
repo sync

Actually worked (of course it took days of googling to remember these basic commands). However the “brunch passion” command to actually build it crashed and burned somewhat spectacularly. Apparently the build environment has moved on in the last decade.

The first problem is that most of the prebuilt x86 binaries are 32 bit. This means you have to build the host for 32 bit, and that involves quite a quest on an x86_64 system to make sure you have all the 32 bit build precursors. The next problem is that java 1.6.0 is required, but fortunately openSUSE build service still has it. Finally, the big problem is a load of c++ compile issues which turn out to be due to the fact that the c++ standard has moved on over the years and gcc-7 tries the latest one. Fortunately this can be fixed with

export HOST_GLOBAL_CPPFLAGS=-std=gnu++98

And the build works. If you’re only building the OpenSSL support infrastructure, you don’t need to build the entire thing, but figuring out the pieces you do need is hard, so building everything is a good way to finesse the dependency problem.

Figuring Out how to Upgrade OpenSSL

Unfortunately, this is Android, so you can’t simply drop a new OpenSSL library into the system and have it work. Firstly, the version of OpenSSL that Android builds with (at least for 2.3.7) is heavily modified, so even an android build of vanilla OpenSSL won’t work because it doesn’t have the necessary patches. Secondly, OpenSSL is very prone to ABI breaks, so if you start with 0.9.8, for instance, you’re never going to be able to support TLS 1.2. Fortunately, Android 2.3.7 has OpenSSL 1.0.0a so it is in the 1.0.0 ABI and versions of openssl for that ABI do support later versions of TLS (but only in version 1.0.1 and beyond). The solution actually is to look at external/openssl and simply update it to the latest version the project has (for CyanogenMod this is cm-10.1.2 which is openssl 1.0.1c … still rather ancient but at least supporting TLS 1.2).

cd external/openssl
git checkout cm-10.1.2
mm

And it builds and even works when installed on the phone … great. Except that nothing can use the later ciphers because the java provider (JSSE) also needs updating to support them. Updating the JSSE provider is a bit of a pain but you can do it in two patches:

Once this is done and installed you can browse most websites. There are still some exceptions because of websites that have caught the “can’t use sha1 in any form” bug, but these are comparatively minor. The two patches apply to libcore and once you have them, you can rebuild and install it.

Safely Installing the updated files

Installing new files in android can be a bit of a pain. The ideal way would be to build the entire rom and reflash, but that’s a huge pain, so the simple way is simply to open the /system partition and dump the files in. Opening the system partition is easy, just do

adb shell
# mount -o remount,rw /system

Uploading the required files is more difficult primarily because you want to make sure you can recover if there’s a mistake. I do this by transferring the files to <file>.new:

adb push out/target/product/passion/system/lib/libcrypto.so /system/lib/libcrypto.so.new
adb push out/target/product/passion/system/lib/libssl.so /system/lib/libssl.so.new
adb push out/target/product/passion/system/framework/core.jar /system/framework/core.jar.new

Now move everything into place and reboot

adb shell
# mv /system/lib/libcrypto.so /system/lib/libcrypto.so.old && mv /system/lib/libcrtypto.so.new /system/lib/libcrypto.so
# mv /system/lib/libssl.so /system/lib/libssl.so.old && mv /system/lib/libssl.so.new /system/lib/libssl.so
# mv /system/framework/core.jar /system/framework/core.jar.old && mv /system/framework/core.jar.new /system/framework/core.jar

If the reboot fails, use adb to recover

adb shell
# mount /system
# mv /system/lib/libcrypto.so.old /system/lib/libcrypto.so
...

Conclusions

That’s it. Following the steps above, my Nexus One can now browse useful internet sites like my Airline and the New York times. The only website I’m still having trouble with is the Wall Street Journal because they disabled all ciphers depending on sha1

The Mythical Economic Model of Open Source

1 Reply

It has become fashionable today to study open source through the lens of economic benefits to developers and sometimes draw rather alarming conclusions. It has also become fashionable to assume a business model tie and then berate the open source community, or their licences, for lack of leadership when the business model fails. The purpose of this article is to explain, in the first part, the fallacy of assuming any economic tie in open source at all and, in the second part, go on to explain how economics in open source is situational and give an overview of some of the more successful models.

Open Source is a Creative Intellectual Endeavour

All the creative endeavours of humanity, like art, science or even writing code, are often viewed as activities that produce societal benefit. Logically, therefore, the people who engage in them are seen as benefactors of society, but assuming people engage in these endeavours purely to benefit society is mostly wrong. People engage in creative endeavours because it satisfies some deep need within themselves to exercise creativity and solve problems often with little regard to the societal benefit. The other problem is that the more directed and regimented a creative endeavour is, the less productive its output becomes. Essentially to be truly creative, the individual has to be free to pursue their own ideas. The conundrum for society therefore is how do you harness this creativity for societal good if you can’t direct it without stifling the very creativity you want to harness? Obviously society has evolved many models that answer this (universities, benefactors, art incubation programmes, museums, galleries and the like) with particular inducements like funding, collaboration, infrastructure and so on.

Why Open Source development is better than Proprietary

Simply put, the Open Source model, involving huge freedoms to developers to decide direction and great opportunities for collaboration stimulates the intellectual creativity of those developers to a far greater extent than when you have a regimented project plan and a specific task within it. The most creatively deadening job for any engineer is to find themselves strictly bound within the confines of a project plan for everything. This, by the way, is why simply allowing a percentage of paid time for participating in Open Source seems to enhance input to proprietary projects: the liberated creativity has a knock on effect even in regimented development. However, obviously, the goal for any Corporation dependent on code development should be to go beyond the knock on effect and actually employ open source methodologies everywhere high creativity is needed.

What is Open Source?

Open Source has it’s origin in code sharing models, permissive from BSD and reciprocal from GNU. However, one of its great values is the reasons why people do open source aren’t the same reasons why the framework was created in the first place. Today Open Source is a framework which stimulates creativity among developers and helps them create communities, provides economic benefits to corportations (provided they understand how to harness them) and produces a great societal good in general in terms of published reusable code.

Economics and Open Source

As I said earlier, the framework of Open Source has no tie to economics, in the same way things like artistic endeavour don’t. It is possible for a great artist to make money (as Picasso did), but it’s equally possible for a great artist to live all their lives in penury (as van Gough did). The demonstration of the analogy is that trying to measure the greatness of the art by the income of the artist is completely wrong and shortsighted. Developing the ability to exploit your art for commercial gain is an additional skill an artist can develop (or not, as they choose) it’s also an ability they could fail in and in all cases it bears no relation to the societal good their art produces. In precisely the same way, finding an economic model that allows you to exploit open source (either individually or commercially) is firstly a matter of choice (if you have other reasons for doing Open Source, there’s no need to bother) and secondly not a guarantee of success because not all models succeed. Perhaps the easiest way to appreciate this is through the lens of personal history.

Why I got into Open Source

As a physics PhD student, I’d always been interested in how operating systems functioned, but thanks to the BSD lawsuit and being in the UK I had no access to the actual source code. When Linux came along as a distribution in 1992, it was a revelation: not only could I read the source code but I could have a fully functional UNIX like system at home instead of having to queue for time to write up my thesis in TeX on the limited number of department terminals.

After completing my PhD I was offered a job looking after computer systems in the department and my first success was shaving a factor of ten off the computing budget by buying cheap pentium systems running Linux instead of proprietary UNIX workstations. This success was nearly derailed by an NFS bug in Linux but finding and fixing the bug (and getting it upstream into the 1.0.2 kernel) cemented the budget savings and proved to the department that we could handle this new technology for a fraction of the cost of the old. It also confirmed my desire to poke around in the Operating System which I continued to do, even as I moved to America to work on Proprietary software.

In 2000 I got my first Open Source break when the product I’d been working on got sold to a silicon valley startup, SteelEye, whose business plan was to bring High Availability to Linux. As the only person on the team with an Open Source track record, I became first the Architect and later CTO of the company, with my first job being to make the somewhat eccentric Linux SCSI subsystem work for the shared SCSI clusters LifeKeeper then used. Getting SCSI working lead to fund interactions with the Linux community, an Invitation to present on fixing SCSI to the Kernel Summit in 2002 and the maintainership of SCSI in 2003. From that point, working on upstream open source became a fixture of my Job requirements but progressing through Novell, Parallels and now IBM it also became a quality sought by employers.

I have definitely made some money consulting on Open Source, but it’s been dwarfed by my salary which does get a boost from my being an Open Source developer with an external track record.

The Primary Contributor Economic Models

Looking at the active contributors to Open Source, the primary model is that either your job description includes working on designated open source projects so you’re paid to contribute as your day job
or you were hired because of what you’ve already done in open source and contributing more is a tolerated use of your employer’s time, a third, and by far smaller group is people who work full-time on Open Source but fund themselves either by shared contributions like patreon or tidelift or by actively consulting on their projects. However, these models cover existing contributors and they’re not really a route to becoming a contributor because employers like certainty so they’re unlikely to hire someone with no track record to work on open source, and are probably not going to tolerate use of their time for developing random open source projects. This means that the route to becoming a contributor, like the route to becoming an artist, is to begin in your own time.

Users versus Developers

Open Source, by its nature, is built by developers for developers. This means that although the primary consumers of open source are end users, they get pretty much no say in how the project evolves. This lack of user involvement has been lamented over the years, especially in projects like the Linux Desktop, but no real community solution has ever been found. The bottom line is that users often don’t know what they want and even if they do they can’t put it in technical terms, meaning that all user driven product development involves extensive and expensive product research which is far beyond any open source project. However, this type of product research is well within the ability of most corporations, who can also afford to hire developers to provide input and influence into Open Source projects.

Business Model One: Reflecting the Needs of Users

In many ways, this has become the primary business model of open source. The theory is simple: develop a traditional customer focussed business strategy and execute it by connecting the gathered opinions of customers to the open source project in exchange for revenue for subscription, support or even early shipped product. The business value to the end user is simple: it’s the business value of the product tuned to their needs and the fact that they wouldn’t be prepared to develop the skills to interact with the open source developer community themselves. This business model starts to break down if the end users acquire developer sophistication, as happens with Red Hat and Enterprise users. However, this can still be combatted by making sure its economically unfeasible for a single end user to match the breadth of the offering (the entire distribution). In this case, the ability of the end user to become involved in individual open source projects which matter to them is actually a better and cheaper way of doing product research and feeds back into the synergy of this business model.

This business model entirely breaks down when, as in the case of the cloud service provider, the end user becomes big enough and technically sophisticated enough to run their own distributions and sees doing this as a necessary adjunct to their service business. This means that you can no-longer escape the technical sophistication of the end user by pursuing a breadth of offerings strategy.

Business Model Two: Drive Innovation and Standardization

Although venture capitalists (VCs) pay lip service to the idea of constant innovation, this isn’t actually what they do as a business model: they tend to take an innovation and then monetize it. The problem is this model doesn’t work for open source: retaining control of an open source project requires a constant stream of innovation within the source tree itself. Single innovations get attention but unless they’re followed up with another innovation, they tend to give the impression your source tree is stagnating, encouraging forks. However, the most useful property of open source is that by sharing a project and encouraging contributions, you can obtain a constant stream of innovation from a well managed community. Once you have a constant stream of innovation to show, forking the project becomes much harder, even for a cloud service provider with hundreds of developers, because they must show they can match the innovation stream in the public tree. Add to that Standardization which in open source simply means getting your project adopted for use by multiple consumers (say two different clouds, or a range of industry). Further, if the project is largely run by a single entity and properly managed, seeing the incoming innovations allows you to recruit the best innovators, thus giving you direct ownership of most of the innovation stream. In the early days, you make money simply by offering user connection services as in Business Model One, but the ultimate goal is likely acquisition for the talent possesed, which is a standard VC exit strategy.

All of this points to the hypothesis that the current VC model is wrong. Instead of investing in people with the ideas, you should be investing in people who can attract and lead others with ideas

Other Business Models

Although the models listed above have proven successful over time, they’re by no means the only possible ones. As the space of potential business models gets explored, it could turn out they’re not even the best ones, meaning the potential innovation a savvy business executive might bring to open source is newer and better business models.

Conclusions

Business models are optional extras with open source and just because you have a successful open source project does not mean you’ll have an equally successful business model unless you put sufficient thought into constructing and maintaining it. Thus a successful open source start up requires three elements: A sound business model, or someone who can evolve one, a solid community leader and manager and someone with technical ability in the problem space.

If you like working in Open Source as a contributor, you don’t necessarily have to have a business model at all and you can often simply rely on recognition leading to opportunities that provide sufficient remuneration.

Although there are several well known business models for exploiting open source, there’s no reason you can’t create your own different one but remember: a successful open source project in no way guarantees a successful business model.

efitools arm 32 bit build fixed

1 Reply

It turned out there was a bug in gnu-efi, in the linker script. I’ve added a local fix for this to the OBS UEFI project and filed a bug with gnu-efi on sourceforge. The net result is that the arm 32 bit binaries are now passing most of their tests (the remaining problems may be due to faults in the UEFI environment, still investigating). I should also have the OVMF image for arm (the ArmVirt package) building for 32 bit arm shortly.

I’ve released version 1.6.1 of efitools to make sure a new version gets rebuilt.