A Text to ELF/DWARF Compiler

This proposal describes improvements to the design of Elftoolchain’s text to ELF/DWARF translator.

Last Significant Update: 2024-07-10

Status: Draft

Comments to:

When writing tests we often need custom ELF and DWARF content, including ‘malformed’ content for testing error handling. The ability to precisely specify ELF and DWARF content using some kind of human-readable notation is hence very useful for writing tests for Elftoolchain’s libraries and utilities. Human-readable ‘source’ descriptions of ELF/DWARF content can also be meaningfully placed under revision control, aiding project comprehension.

Current Status

We currently have a Python script named elfc, part of the Libelf test suite, that converts a YAML-based description of an ELF object into the corresponding ELF file.

The YAML format is not well suited to this task; I used it only because I did not want to design a new notation for describing binary content. The elfc tool also does not handle DWARF.

Some of our test suites (for libdwarf, nm, etc.) use ELF and DWARF content in ‘real world’ object files for their tests. There are however a couple of drawbacks when using ‘real’ data:

The use of such object files (binary blobs) makes it hard to exhaustively test code by varying object parameters such as endianness and native word size.
Tests which use such binary blobs are not ‘minimal’; we cannot be sure that the test isn’t being influenced by some unrelated ELF feature that happens to be present in the blob.
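
As an illustration of the first drawback, a programmatic description can be generated for every combination of ELF class and byte order, which a fixed binary blob cannot offer. The sketch below uses only Python's standard struct module; the helper names make_ident and pack_half are hypothetical and are not part of elfc.

```python
import itertools
import struct

# Constants from the ELF specification.
ELFCLASS32, ELFCLASS64 = 1, 2
ELFDATA2LSB, ELFDATA2MSB = 1, 2
EV_CURRENT = 1

def make_ident(elf_class, elf_data):
    """Return the 16-byte e_ident array (hypothetical helper)."""
    # 4-byte magic, then EI_CLASS, EI_DATA, EI_VERSION, followed by
    # 9 zero bytes (EI_OSABI, EI_ABIVERSION and EI_PAD left as zero).
    return struct.pack("=4sBBB9x", b"\x7fELF", elf_class, elf_data, EV_CURRENT)

def pack_half(value, elf_data):
    """Encode a 16-bit ELF 'Half' value in the requested byte order."""
    fmt = "<H" if elf_data == ELFDATA2LSB else ">H"
    return struct.pack(fmt, value)

# One e_ident per (class, byte order) combination.
variants = {
    (c, d): make_ident(c, d)
    for c, d in itertools.product((ELFCLASS32, ELFCLASS64),
                                  (ELFDATA2LSB, ELFDATA2MSB))
}
```

A test could then loop over variants and exercise the same check against each class and byte order.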

Requirements

The notation should support the notion of abstraction, so that common patterns in ELF and DWARF content can be abstracted out and made reusable as libraries.
The notation should support iteration, so that repeated content fragments can be expressed succinctly.
The notation should support DWARF data types as well as ELF.
The notation should be easy to extend to support custom data sections and data types, if needed.
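
The abstraction and iteration requirements could look roughly like this if the notation were hosted in Python. This is only a sketch: SectionDesc, progbits and numbered_sections are hypothetical names invented for illustration and do not exist in elfc.

```python
from dataclasses import dataclass

SHT_PROGBITS = 1  # Section type value from the ELF specification.

@dataclass
class SectionDesc:
    """Hypothetical description of a single ELF section."""
    name: str
    sh_type: int
    data: bytes = b""

def progbits(name, data):
    """Abstraction: a reusable constructor for a common section pattern."""
    return SectionDesc(name=name, sh_type=SHT_PROGBITS, data=data)

def numbered_sections(prefix, count, fill=b"\x00"):
    """Iteration: describe 'count' similar sections in one expression."""
    return [progbits(f"{prefix}{i}", fill * (i + 1)) for i in range(count)]

# Three sections named .test0, .test1, .test2 with 1, 2 and 3 bytes of data.
sections = numbered_sections(".test", 3)
```

Helpers such as progbits could be collected into a library and shared between test suites.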

Design Notes

Embedded Domain Specific Language

Instead of ‘interpreting’ YAML-format text as elfc currently does, we could construct an abstract representation of ELF and DWARF information within a general purpose programming language such as Python. Once this abstract representation has been constructed, it can be serialized and written out to a file.

import elfc # Types and functions describing ELF.
import sys

def make_elf(ehdr, phdr_table, sections, shdr_table=None):
    """Return an AST node containing the specified ELF header
    and PHDR table, and the section content specified by
    the argument 'sections'.

    Computes the section header table from the content of
    'sections' if argument 'shdr_table' is None.
    """

    elf = elfc.ELF()
    # Prepare the content of the 'elf' AST node
    elf.add_header(ehdr)
    elf.add_phdr_table(phdr_table)
    # [...]

    return elf

def main():
    """Prepare an ELF object."""
    elf = make_elf(...)
    # Open in binary mode; this assumes serialize() returns bytes.
    with open(sys.argv[1], "wb") as f:
        f.write(elf.serialize())

With this approach we can reuse the language-level facilities provided by the programming language — data structures, support for abstraction, etc.

Layers

Layers would allow us to describe variants of ELF and DWARF content succinctly.

The diagram below shows an ELF file constructed using a “base layer” over which two “sparse” layers have been placed. In this diagram, the content in "Sparse layer 2" overrides that in "Sparse layer 1" and the "Base layer". The content in "Sparse layer 1" in turn overrides the content of the "Base layer".

These layers would be ‘sparse’, in that they would only cover a subset of an ELF object or DWARF section.
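
One possible way to model layering is as a recursive merge of nested dictionaries, where each sparse layer names only the fields it changes. The sketch below is an assumption about how this might work, not a description of an existing elfc feature; the function apply_layers and the keys used ("ehdr", "ei_class", "ei_data", "e_type") are hypothetical.

```python
def apply_layers(base, *layers):
    """Return a new description with each sparse layer applied in order.

    Later layers override earlier ones. Nested dicts are merged
    recursively, so a layer only needs to name the fields it changes.
    The 'base' description is not modified.
    """
    result = dict(base)
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key] = apply_layers(result[key], value)
            else:
                result[key] = value
    return result

base = {"ehdr": {"ei_class": "ELFCLASS64",
                 "ei_data": "ELFDATA2LSB",
                 "e_type": "ET_REL"}}
sparse_layer_1 = {"ehdr": {"ei_data": "ELFDATA2MSB", "e_type": "ET_EXEC"}}
sparse_layer_2 = {"ehdr": {"e_type": "ET_DYN"}}

merged = apply_layers(base, sparse_layer_1, sparse_layer_2)
```

Here merged keeps ei_class from the base layer, takes ei_data from the first sparse layer, and takes e_type from the second sparse layer, which matches the override order described above.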

Resources

YAML: A data serialization language.
Pyelftools: A Python library for reading and analyzing ELF and DWARF content.