On the Salient Misunderstandings of Assembly Theory.

October 25th 2022

TL;DR Uthamacumaran et al. contrasted Assembly Theory with measures common in Algorithmic Information Theory (AIT), but their theoretical and empirical claims do not undermine previously published work on Assembly Theory or its application to life detection. The reasons for this are 1) Assembly Indices are not intended to be optimal measures of compression, and 2) Assembly Theory deals with the problem of randomness by focusing on copy number, a quantity that has not been considered in AIT.

Assembly Theory is a novel theoretical framework designed to understand how specific objects are constructed in combinatorial spaces, and by doing so it captures key physical processes underlying living systems. Assembly Theory is actively being developed by multiple teams at different institutions and may have applications in different domains, but this post will focus on its application to life detection. Assembly Theory provides a measure of complexity, known as the Assembly Index, that can be used to understand the minimum number of steps required to construct an object. When applied to molecular systems we have shown that 1) this measure of complexity can be observed directly in experimental and natural systems, and 2) this measure can be used as an unambiguous biosignature to experimentally distinguish living and non-living samples (Marshall et al. 2021) [1]. The methods of Assembly Theory have also been used to identify new drug candidates [2].

Recently a preprint manuscript [3] was posted on arXiv which claims to contradict our findings, and specifically casts doubt on whether Assembly Indices can be used to distinguish living and non-living systems. This preprint by Uthamacumaran et al. reveals multiple conceptual misunderstandings of Assembly Theory and its application to life detection. Here I will attempt to clarify these misunderstandings.

The easiest place to start is by reviewing the abstract of Uthamacumaran et al., which is copied in full below:

A recently introduced approach termed “Assembly Theory”, featuring a computable index based on basic principles of statistical compression has been claimed to be a novel and superior approach to classifying and distinguishing living from non-living systems and the complexity of molecular biosignatures. Here, we demonstrate that the assembly pathway method underlying this index is a suboptimal restricted version of Huffman’s encoding (Shannon-Fano type), widely adopted in computer science in the 1950s, that is comparable (or inferior) to other popular statistical and computable compression schemes. We show how simple modular instructions can mislead the assembly index, leading to failure to capture subtleties beyond trivial statistical properties that are not realistic in biological systems. We present cases whose low complexities can arbitrarily diverge from the random-like appearance to which the assembly pathway method would assign arbitrarily high statistical significance, and show that it fails in simple cases (synthetic or natural). Our theoretical and empirical results imply that the assembly index, whose computable nature we show is not an advantage, does not offer any substantial advantage over existing concepts and methods computable or uncomputable. Alternatives are discussed.

The first two sentences reveal the misunderstanding. While the computation of the Assembly Index may involve searching for repeated subunits in a manner that is similar to compression algorithms, the goal of Assembly Theory is not to find an optimal compression of molecules, but to bound the number of unit operations required to construct a molecule using operations that are physically realizable (e.g., forming chemical bonds). In certain limits there may be an equivalence between optimal compression and bounding the number of unit operations required to construct a physical artifact in the real world. But Assembly Theory is not an attempt to reinvent concepts from computer science. Its goal is to develop a new understanding of evolved matter that accounts for causal history in terms of what physical operations are possible, and that has a clear interpretation when measured in the laboratory.

To make this more specific, consider the fact that molecular assembly theory treats bonds as the elementary objects from which molecules are constructed. Thus, the shortest path is determined by finding the shortest recursively assembled pathway built up through bond-forming joining operations. This may seem like a small detail, but it is critical for distinguishing Assembly Theory from complexity measures based on concepts from computer science, because it determines the physical interpretation of the Molecular Assembly index (MA). As a simple example of where this could go wrong, one could build an “assembly space” based on combining atoms instead of bonds, and that space would have a different graphical representation and a different minimal path. But this representation does not have a physical interpretation, because physical systems do not build molecules by combining atoms with no bonds – molecules are made by making bonds (i.e., the entire basis of the discipline of chemical synthesis!). Assembly spaces for molecules must be constructed from bonds, because bond formation is what the physical environment must cause in order to assemble molecules.

Uthamacumaran et al. show via analogy how calculating the assembly index of a sequence of characters (a string) can be understood as a limited form of compression using Huffman encoding. In Assembly Theory, intermediate objects made along the shortest pathway are always members of the same combinatorial space as the original object. They are not abstractions which exist outside the space itself. Put simply, the assembly index of a molecule (or other object) is not a computationally optimal compression of the object. The assembly index defines a bound on the steps, within the combinatorial space of molecules, which can be implemented to construct the object. Assembly Theory is not interested in the space of all (un)computable functions (which requires defining Universal Turing Machines and other mathematical abstractions which can only exist because of living systems).
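
To make the string analogy concrete, here is a minimal brute-force sketch of an assembly index for strings: the minimum number of joining operations needed to build a target, where every intermediate object built along the pathway stays in the pool and can be reused for free. This is my own toy implementation for illustration only; the published method operates on molecular bond graphs, not character strings, and uses a far more efficient search.

```python
def assembly_index(target: str) -> int:
    """Minimum number of joining operations needed to build `target` from
    single characters, where every intermediate object built along the way
    stays in the pool and can be reused at no extra cost.  Exhaustive
    search with pruning: exponential, so only for very short strings."""
    best = [len(target)]   # loose upper bound: one join per character
    seen = {}              # fewest steps at which each pool was reached

    def search(pool: frozenset, steps: int) -> None:
        if steps >= best[0]:
            return                          # cannot beat best known pathway
        if target in pool:
            best[0] = steps                 # found a shorter pathway
            return
        if seen.get(pool, float("inf")) <= steps:
            return                          # reached this pool more cheaply before
        seen[pool] = steps
        # Join any two objects already in the pool; keep only products that
        # are substrings of the target (anything else cannot contribute).
        products = {a + b for a in pool for b in pool
                    if a + b in target and a + b not in pool}
        for p in sorted(products, key=len, reverse=True):  # try big reuse first
            search(pool | {p}, steps + 1)

    search(frozenset(target), 0)
    return best[0]

print(assembly_index("aaaaaaaa"))  # 3: a -> aa -> aaaa -> aaaaaaaa
print(assembly_index("banana"))   # 4: reusing "an" (via "anan") shortens the path
```

Note how the pathway is built entirely out of objects that live in the same space as the target, by the same joining operation, which is the point made above.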

Uthamacumaran et al. framed their work within the context of Algorithmic Information Theory (AIT), which is a very different perspective from that of the authors of Assembly Theory. This difference in perspective may be the underlying cause of the confusion. Before moving on to the remaining claims in Uthamacumaran et al., I think it is useful to contextualize the viewpoint of AIT and identify the core assumptions that can lead scientists to radically different conclusions.

Algorithmic Information Theory is based on the idea that the information contained in an object (such as a string, or a molecule) should be no greater than the information required to algorithmically generate that object by running a program that produces the object as its output. In AIT, the Algorithmic Complexity or Kolmogorov Complexity of an object is the length of the shortest program that outputs that object. Accordingly, AIT emphasizes the importance of optimal compression, because if you can compress an object you know that the program to produce that object can be no bigger than the compressed version (plus the size of the algorithm to decompress it). But this immediately introduces a problem: using statistical compression alone, it is impossible to distinguish between maximum algorithmic information and complete randomness.

Consider a simple algorithm to generate a random string of 0s and 1s: flip a (fair) coin and write down 1 if it shows Heads, otherwise write down 0, and repeat until the string is the desired length. Using this algorithm, you could easily generate large strings. But we know they don’t have any reliable statistical structure (because they’re random), which means that a compression algorithm based on statistical structure will not be very effective and will likely return a poor compression of the object, one that might even be much bigger than the generating algorithm itself. So, using compression alone, we cannot distinguish between complete randomness and high algorithmic information. This problem with randomness is critical to understanding AIT and the complexity measures related to it, which are designed to solve this problem (and others like it).
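
To see this concretely, here is a minimal sketch (my own illustration; zlib simply stands in for “a compression algorithm based on statistical structure,” and the string lengths are arbitrary) comparing the compressed size of a coin-flip string with a perfectly periodic string of the same length:

```python
import random
import zlib

random.seed(0)  # fixed seed so the example is reproducible

# The coin-flip algorithm: a short program whose output has no statistical
# structure for a compressor to exploit.
random_bits = "".join(random.choice("01") for _ in range(100_000))
periodic_bits = "01" * 50_000  # same length, maximal statistical structure

for label, s in [("random", random_bits), ("periodic", periodic_bits)]:
    compressed = zlib.compress(s.encode(), 9)
    print(f"{label:>8}: {len(s):,} chars -> {len(compressed):,} bytes compressed")

# The periodic string collapses to a few hundred bytes, while the random
# string cannot be pushed much below one bit per coin flip (~12.5 kB),
# even though the program that generated it is only a few lines long.
```

Note, incidentally, that because this sketch seeds a pseudorandom generator, its “random” string is in fact algorithmically simple; the gap between what statistical compression can see and what a short program can generate is exactly the problem described above.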

In AIT, to find the amount of information contained in an object, we don’t just need to search over all the possible (statistical) compressions of the object; we need to search over all the possible programs which could generate the object as an output, which will include our short algorithm for generating random strings. If searching over the space of possible programs sounds impossible, that’s because it is in almost all cases. While Algorithmic Complexity cannot be computed exactly, there are many approximation methods that have been shown to be effective for different use cases. Practitioners of AIT consider the uncomputability of Algorithmic Complexity to be an advantage of the measure, because it ensures that objects which can be described easily using any abstraction correspond to low algorithmic information. This perceived advantage of uncomputable complexity measures is key to understanding the perspective of Uthamacumaran et al. and why that perspective fails to apply to Assembly Theory.

A core conceptual difference between Assembly Theory and AIT is the way randomness is handled, which is itself a consequence of the different goals of the theories. Assembly Theory is concerned with understanding how things are constructed in physical systems, not with the possibilities of Universal Turing Machines in an abstracted sense. Accordingly, Assembly Theory distinguishes statistical randomness from complex phenomena by focusing on copy number. If we ran the algorithm above to generate a single string of length 100, we’d find that the Assembly Index of that individual string could be quite high. But if we ran the same algorithm again, we’d get a completely different string, and chances are that if we ran it 100 times we’d get 100 unique strings. The fact that each string is unique means that each has a copy number of 1, and Assembly Theory makes no claims about objects with a copy number of 1, since they might occur by chance alone. But identical copies of molecules with sufficiently high MA cannot occur in the absence of evolved systems which cause their repeated formation. This is a solution to the problem of randomness that doesn’t require us to search over an abstract infinite space of programs, and hence does not require an uncomputable measure. In fact, this solution emerges from considering physical systems, such as abiotic chemical systems, where copy numbers can span the range from 10^1 to 10^23. Life detection based on Molecular Assembly was demonstrated using a mass spectrometer, which requires hundreds to millions of identical copies of a molecule for detection.
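
A minimal simulation of this argument (the string length and run count are arbitrary choices of mine):

```python
import random
from collections import Counter

def coin_flip_string(n: int) -> str:
    """The coin-flip algorithm from above: n fair flips as a bit string."""
    return "".join(random.choice("01") for _ in range(n))

# Run the generator 100 times and tally how often each exact string recurs.
copy_numbers = Counter(coin_flip_string(100) for _ in range(100))

# With 2**100 possible strings, duplicates are astronomically unlikely:
# every observed string has copy number 1, so Assembly Theory makes no
# claim about any of them. Many identical copies of a high-MA object are
# what demand an explanation.
print(max(copy_numbers.values()))  # almost surely prints 1
```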

Hopefully, with this conceptual distinction clarified, the confusion in the remaining claims of the preprint is apparent. For example, returning to the abstract, the authors claim to “… show how simple modular instructions can mislead the assembly index, leading to failure to capture subtleties beyond trivial statistical properties that are pervasive in biological systems.” But what would it mean to mislead the assembly index? In this case, “simple modular instructions” are the specification of an algorithm which would lead to the production of high-MA compounds. But modular instructions themselves are signatures of a computational process which is only possible because living systems exist. In fact, there are many short algorithms to produce high-MA compounds, some of which have been deployed to search chemical space for new drug candidates. Perhaps the easiest one is: “produce a molecule which contains one of each of the elements in the periodic table,” which would produce a compound (let’s call it Marshallarane [4]) with an exceptionally high MA. But providing a description of a molecule is not the same as causing the physical production of that molecule! Easily describing a graph that represents a molecule is not the same as easily synthesizing the real molecule in chemistry or biology.
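
For illustration only, here is how short such a “program” can be. Marshallarane is of course the tongue-in-cheek hypothetical from above, and the truncated element list is mine:

```python
# A deliberately short program that *describes* a molecule containing one of
# each element. Printing this description does nothing to cause the physical
# synthesis of the molecule itself.
ELEMENTS = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"]  # ...and so on
print("".join(ELEMENTS))  # a compact description of an object with very high MA
```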

Turning to the last significant claim in the abstract: “We present cases whose low complexities can arbitrarily diverge from the random-like appearance to which the assembly pathway method would assign arbitrarily high statistical significance, and show that it fails in simple cases (synthetic or natural).” It is not clear what they mean by “it fails in simple cases (synthetic or natural),” since the authors have not shown any real molecules which mislead MA. The proof that underlies their claim is based in AIT. The authors’ stated goal is to demonstrate that they can identify a program that contains knowledge of Assembly Theory and which can generate a high-MA molecule, and that they can do so with a relatively small program. It is a formalized version of the algorithm for Marshallarane above. The authors provide only a mathematical proof that such a program exists, not an example. The existence of such an algorithm is not an issue for Assembly Theory or its application to life detection, because if you set out to find life and you only find a programmable chemical robot, you’ve still found life.

Finally, Uthamacumaran et al. present some empirical work comparing Molecular Assembly Indices to other compression algorithms. This analysis is based on the ~100 molecules used in (Marshall et al. 2021) and made available in the supplemental information. They combined this data with mass spectrometry data from a publicly available dataset and used the classification scheme from Figure 2 of (Marshall et al. 2021) to argue, by comparing only the compression of the different data, that MA does not distinguish between living and non-living systems. The claims made in the preprint based on this analysis are the result of factual misunderstandings of the work in (Marshall et al. 2021). The labels from Figure 2 of (Marshall et al. 2021) are Natural Product, Industrial Compound, Metabolite, and Pharmaceutical. Those data are derived from biological or technological processes, and the labeling used was irrelevant to the life detection claims made in (Marshall et al. 2021). The key test for agnostic life detection claims was the use of mass spectrometry to distinguish known abiotic samples (such as the Murchison meteorite or Miller-Urey mixtures) from known biological samples (such as yeast, E. coli, and seawater), shown in Figure 4 of that manuscript. The challenge for life detection is identifying suitable abiotic controls on Earth, which is heavily contaminated by living systems. None of the analysis by Uthamacumaran et al. speaks to this process or to the experimental analysis done in (Marshall et al. 2021), because they have not provided an abiotic control that can produce high-MA compounds.

A recurring argument in the preprint is that other measures of statistical compression correlate with MA and therefore could have been used in place of MA. But this ignores how MA is derived from the structure of the compounds, which is itself directly related to the results of experiments. MA is grounded in the physics of how molecules are built in reality, not in the abstract concept of Turing Machines. This connection between the theoretical construction of Assembly Indices and the experimental measurements is essential for the predictions of the theory. The fact that forms of statistical compression correlate with and can be analogized to Assembly Indices is interesting, but it does not constitute a limitation of Assembly Theory.

In summary, the preprint by Uthamacumaran et al. compared Assembly Theory to AIT and demonstrated that there are simple algorithms to describe (but not physically synthesize) high-MA compounds. They have not demonstrated how those algorithms would manifest in the absence of a Turing machine, how those algorithms could result in chemical synthesis, or what the implications of their claims might be for life detection. The calculations they have performed do not have any bearing on the life detection claims made in (Marshall et al. 2021) or on the other peer-reviewed claims of Assembly Theory. Despite the alternative complexity measures discussed, there are no other published agnostic methods for distinguishing the molecular artifacts of living and non-living systems that have been tested experimentally.