Working with binary data in Clojure

fctorial

25 May 2021


Although C is a very low-level language, serializing and deserializing binary data in it is dead simple, because C is weakly typed. In fact, it started out as a simple non-optimizing frontend for assembly, and its constructs have straightforward analogues in assembly.

So if you have a handle to a memory location in your C program, you can just cast it to a pointer to a struct (a construct defining the structure of a region of memory), and start using the fields of that struct:

#include <stdint.h>

// struct definition
typedef struct {
    int32_t x;
    int32_t y;
    int32_t z;
} Point;

// get the mean z coordinate of the first `n` points in the data blob pointed to by `ptr`
float avg_z(void* ptr, int32_t n) {
    Point* ps = (Point*) ptr;
    int32_t sum = 0;
    for (int32_t i = 0; i < n; i++) {
        sum += ps[i].z;
    }
    return ((float) sum) / n;
}

parse_struct is a Clojure library that lets you deserialize and serialize binary data using an API as straightforward as the one you would use in C. This page serves as a guide to the library.

A simple example

I will start by translating the above C program to parse_struct.

Defining the format of your data is the first thing you should do when parsing some data using parse_struct. The definition of the Point type in the C example will look like this:

(ns fctorial.demo
  (:require [parse_struct.common_types :as ct]
            [parse_struct.core :refer :all]))

(def Point_t {:type       :struct
              :definition [[:x ct/i32]
                           [:y ct/i32]
                           [:z ct/i32]]})

(defn Point_Array_t [n]
  {:type    :array
   :element Point_t
   :len     n})

parse_struct.common_types contains all the fundamental data types: 1-, 2-, 4- and 8-byte signed and unsigned integers, 4- and 8-byte floats (each in little-endian and big-endian variants), and padding. You can combine them using :struct and :array definitions to form more complex data types.

You perform the parsing operation using the deserialize function in parse_struct.core:

(defn avg_z [ptr n]
  (let [points (deserialize (Point_Array_t n)
                            ptr)]
    (/ (reduce (fn [res nxt]
                 (+ res (nxt :z)))
               0
               points)
       n)))

The first argument to deserialize is a type definition. The second argument is a sequence of bytes. The performance of deserialize depends on the kind of byte sequence it is given: byte arrays perform best and seqs perform worst.
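To make the idea concrete, here is a minimal sketch of what deserializing the Point array amounts to, written against java.nio.ByteBuffer instead of parse_struct. The names point-size, read-points and avg-z are hypothetical, and the sketch assumes little-endian 32-bit fields; it is not how parse_struct itself is implemented.

```clojure
(import '(java.nio ByteBuffer ByteOrder))

;; Assumption: a Point is three 4-byte little-endian integers, 12 bytes in total.
(def point-size 12)

(defn read-points
  "Reads the first n points from a byte array into a vector of maps."
  [^bytes bs n]
  (let [^ByteBuffer buf (.order (ByteBuffer/wrap bs) ByteOrder/LITTLE_ENDIAN)]
    (vec (for [i (range n)
               :let [base (* point-size i)]]
           {:x (.getInt buf (int base))
            :y (.getInt buf (int (+ base 4)))
            :z (.getInt buf (int (+ base 8)))}))))

(defn avg-z
  "Mean z coordinate of the first n points in the byte array."
  [bs n]
  (/ (reduce + (map :z (read-points bs n))) n))

;; Two points, (1 2 3) and (4 5 9), laid out by hand in little-endian order.
(def demo-bytes
  (byte-array [1 0 0 0, 2 0 0 0, 3 0 0 0,
               4 0 0 0, 5 0 0 0, 9 0 0 0]))

(avg-z demo-bytes 2) ; => 6
```

Hand-written offsets like these are exactly what a declarative type definition spares you from maintaining.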

parse_struct also comes with a class, ROVec, a Clojure-friendly sequence type that performs as fast as a byte array.

A real world parsing example

Let's now write a program that extracts the list of symbols from an ELF file. I will target only the ELF64 little-endian format, but writing a program that targets all the formats is not too difficult.

The complete code can be found in the master branch of above linked repo (fctorial.demo namespace).

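This is the path we'll follow to find the symbols list: the ELF header tells us where the section headers are, the section headers tell us where the symbol table and its linked symbol names section are, and each symbol table entry points to its name inside the names section.

We will start by defining the type aliases used by the ELF64 specification: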

(ns fctorial.demo
  (:require [parse_struct.core :refer :all]
            [parse_struct.common_types :refer :all]
            [clojure.pprint :refer [pprint]]
            [fctorial.utils :refer :all]
            [fctorial.data :refer [obj]]
            )
  (:import (clojure.lang ROVec MMap)))

(def ElfAddr u64)
(def ElfHalf u16)
(def ElfOff u64)
(def ElfWord u32)
(def ElfXword u64)

fctorial.data/obj is a ROVec containing a simple object file (compiled with gcc -c t.c -o data/t.o).

Next, we read the ELF identification bytes at the start of the file and verify that it is an ELF64 little-endian file:

(def magic_t {:type       :struct
              :definition [[:ident {:type  :string
                                    :bytes 4}]
                           [:class (assoc i8 :adapter {1 :32 2 :64})]
                           [:data (assoc i8 :adapter {1 :LE 2 :BE})]
                           [:version i8]]})

(def magic (deserialize magic_t obj))

(assert (= (magic :class) :64))
(assert (= (magic :data) :LE))
(assert (= (magic :ident) "\u007FELF"))

Here we see the :adapter feature of parse_struct in action. Each type is a Clojure map that can optionally have an entry named :adapter. Its value must be a function, which is applied to the parsed value; the result is returned in place of the original value. Here the adapters are maps (Clojure maps are functions of their keys) that turn integers into keywords, which are easier to work with.
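The adapter mechanism can be pictured with a few lines of plain Clojure. This is only a sketch of the idea: apply-adapter is a hypothetical helper, not a function exported by parse_struct.

```clojure
;; Hypothetical helper: post-process a raw parsed value with the
;; optional :adapter entry of a type definition.
(defn apply-adapter [type-def raw]
  (if-let [f (:adapter type-def)]
    (f raw)
    raw))

;; A map used as an adapter looks the raw value up as a key.
(def class-t {:adapter {1 :32 2 :64}})

(apply-adapter class-t 2)  ; => :64
(apply-adapter class-t 7)  ; => nil, because 7 is not a key of the map
(apply-adapter {} 2)       ; => 2, no adapter means the value is untouched
```

Note that a map adapter silently yields nil for unexpected values, so the assertions on :class and :data also guard against malformed input.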

Now we parse the rest of the ELF header:

(def elf_header_t {:type       :struct
                   :definition [(padding 24)
                                [:shoff ElfOff] ; section header offset
                                (padding 10)
                                [:shentsize ElfHalf] ; section header entry size
                                [:shnum ElfHalf] ; section headers count
                                [:shstrndx (assoc ElfHalf :adapter int)]]})
(def elf_header (deserialize elf_header_t
                             (ROVec. obj 16)))

We are only interested in the section info, so we skip the rest of the data using the padding function from parse_struct.common_types. We are also using the ROVec. constructor to slice the original blob at byte number 16. The ROVec class has constructor overloads that can be used like the subvec function from the Clojure standard library to slice and dice the blob.
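The padding sizes follow from the ELF64 header layout: the identification field occupies bytes 0 to 15, the section header offset sits at byte 40 (16 + 24), and the three 2-byte section fields sit at bytes 58, 60 and 62. As a cross-check, the sketch below reads the same fields with java.nio.ByteBuffer and absolute offsets. The names section-info and demo-header are hypothetical, and the header is synthetic rather than taken from a real file.

```clojure
(import '(java.nio ByteBuffer ByteOrder))

(defn section-info
  "Reads the section-related fields of a little-endian ELF64 header."
  [^bytes header]
  (let [^ByteBuffer buf (.order (ByteBuffer/wrap header) ByteOrder/LITTLE_ENDIAN)
        ;; getShort is signed, so mask to get the unsigned 16-bit value
        u16 (fn [offset] (bit-and (long (.getShort buf (int offset))) 0xFFFF))]
    {:shoff     (.getLong buf 40)
     :shentsize (u16 58)
     :shnum     (u16 60)
     :shstrndx  (u16 62)}))

;; A synthetic 64-byte header with only the fields of interest filled in.
(def demo-header
  (let [^ByteBuffer buf (.order (ByteBuffer/allocate 64) ByteOrder/LITTLE_ENDIAN)]
    (.putLong buf 40 1000)
    (.putShort buf 58 (short 64))
    (.putShort buf 60 (short 13))
    (.putShort buf 62 (short 12))
    (.array buf)))

(section-info demo-header)
;; => {:shoff 1000, :shentsize 64, :shnum 13, :shstrndx 12}
```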

Let's do a sanity check on the data we've extracted. The ELF format does not require it, but in this object file the section headers sit at the very tail, so the following should hold:

(assert (= (+ (elf_header :shoff)
              (* (elf_header :shentsize)
                 (elf_header :shnum)))
           (count obj)))

Now we know where the section headers are. Let's parse them:

(def sec_header_t {:type       :struct
                   :definition [(padding 4)
                                [:type (assoc ElfWord :adapter #(get [:SectionType/NULL
                                                                      :SectionType/PROGBITS
                                                                      :SectionType/SYMTAB
                                                                      :SectionType/STRTAB
                                                                      :SectionType/RELA
                                                                      :SectionType/HASH
                                                                      :SectionType/DYNAMIC
                                                                      :SectionType/NOTE
                                                                      :SectionType/NOBITS
                                                                      :SectionType/REL
                                                                      :SectionType/SHLIB
                                                                      :SectionType/DYNSYM]
                                                                     %))]
                                (padding 16)
                                [:offset ElfOff]
                                [:size ElfXword]
                                [:link ElfWord]
                                (padding 20)]})
(def secs (deserialize {:type    :array
                        :len     (elf_header :shnum)
                        :element sec_header_t
                        :adapter vec}
                       (ROVec. obj (elf_header :shoff))))
(def symtab_header (first (filter #(= (% :type) :SectionType/SYMTAB) secs)))
(def symnames_header (secs (symtab_header :link))) ; The link field of a symbol table is the index of the symbol names section

(def symnames (deserialize {:type  :string
                            :bytes (symnames_header :size)}
                           (ROVec. obj (symnames_header :offset))))

Deserializing an array gives back a lazy seq. Adding an :adapter of vec turns it into an eager, indexable vector.
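The difference matters because secs is later called as a function with an index. A small illustration in plain Clojure, with hypothetical names:

```clojure
;; A lazy seq, standing in for the result of deserializing an array.
(def lazy-result (map #(* % %) (range 5)))

;; What an adapter of vec would produce from it.
(def eager-result (vec lazy-result))

(nth lazy-result 3)  ; => 9, but walks the seq from the start
(eager-result 3)     ; => 9, vectors are functions of their indices
(ifn? lazy-result)   ; => false, so (lazy-result 3) would throw
```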

The symbol names section is a blob of ASCII strings, each terminated by a null byte. Each symbol table entry contains an index into this blob that points to the start of its name. We parse the whole section with the :string type, which works because Java strings can contain any character, including the null character.
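Looking up a name in such a blob is a matter of finding the next null character. The sketch below uses a hand-made string table; name-at and demo-strtab are hypothetical names.

```clojure
(defn name-at
  "Returns the null-terminated string that starts at idx in strtab."
  [^String strtab idx]
  (let [end (.indexOf strtab (int 0) (int idx))]
    ;; Fall back to the end of the blob if no terminator is found.
    (subs strtab idx (if (neg? end) (count strtab) end))))

;; Layout: \0 at index 0, "t.c" at 1, \0 at 4, "main" at 5, \0 at 9.
(def demo-strtab "\u0000t.c\u0000main\u0000")

(name-at demo-strtab 1) ; => "t.c"
(name-at demo-strtab 5) ; => "main"
(name-at demo-strtab 0) ; => "", index 0 is the empty name
```

The empty name at index 0 is why so many entries in a symbol listing have a blank :name.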

We can now parse the symbol table:

(def sym_t {:type       :struct
            :definition [[:name (assoc ElfWord :adapter (fn [idx]
                                                          (.substring symnames
                                                                      idx
                                                                      (.indexOf symnames 0 idx))))]
                         (padding 2)
                         [:shndx ElfHalf]
                         [:value ElfAddr]
                         [:size ElfXword]]})

(def symbols (deserialize {:type :array
                           :len (/ (symtab_header :size)
                                   (type-size sym_t))
                           :element sym_t}
                          (ROVec. obj (symtab_header :offset))))

Once again, we are using an adapter to attach the symbols to their names. The function type-size is also introduced. It takes a definition and returns the net size of that definition in bytes.
The result (symbols) will look something like this:

({:name "", :shndx 0, :value 0, :size 0}
 {:name "t.c", :shndx 65521N, :value 0, :size 0}
 {:name "", :shndx 1, :value 0, :size 0}
 {:name "", :shndx 3, :value 0, :size 0}
 {:name "", :shndx 4, :value 0, :size 0}
 {:name "", :shndx 5, :value 0, :size 0}
 {:name "count.1913", :shndx 4, :value 0, :size 4}
 {:name "f", :shndx 1, :value 65, :size 7}
 {:name "", :shndx 6, :value 0, :size 0}
 {:name "", :shndx 8, :value 0, :size 0}
 {:name "", :shndx 9, :value 0, :size 0}
 {:name "", :shndx 11, :value 0, :size 0}
 {:name "", :shndx 13, :value 0, :size 0}
 {:name "", :shndx 15, :value 0, :size 0}
 {:name "", :shndx 16, :value 0, :size 0}
 {:name "", :shndx 14, :value 0, :size 0}
 {:name "x", :shndx 5, :value 0, :size 4}
 {:name "y", :shndx 3, :value 0, :size 4}
 {:name "z", :shndx 5, :value 8, :size 8}
 {:name "eho", :shndx 1, :value 0, :size 26}
 {:name "rot", :shndx 1, :value 26, :size 23}
 {:name "_GLOBAL_OFFSET_TABLE_", :shndx 0, :value 0, :size 0}
 {:name "main", :shndx 1, :value 49, :size 16}
 {:name "missing", :shndx 0, :value 0, :size 0})
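The array length used for the symbol table is the section size divided by the size of one entry. A size computation of this kind can be sketched as a recursive walk over the type definition. The type maps below are simplified and hypothetical (the real parse_struct definitions may be shaped differently), and size-of is not the library's type-size.

```clojure
;; Hypothetical, simplified type maps: primitives and padding carry :bytes.
(defn size-of
  "Total size in bytes of a (simplified) type definition."
  [t]
  (case (:type t)
    :int     (:bytes t)
    :padding (:bytes t)
    :array   (* (:len t) (size-of (:element t)))
    :struct  (reduce + (map (fn [[_ field-t]] (size-of field-t))
                            (:definition t)))))

(def u16-t {:type :int :bytes 2})
(def u32-t {:type :int :bytes 4})
(def u64-t {:type :int :bytes 8})

;; Same field widths as sym_t: 4 + 2 + 2 + 8 + 8 bytes.
(def sym-t {:type       :struct
            :definition [[:name  u32-t]
                         [:pad   {:type :padding :bytes 2}]
                         [:shndx u16-t]
                         [:value u64-t]
                         [:size  u64-t]]})

(size-of sym-t)            ; => 24
(quot 576 (size-of sym-t)) ; => 24 entries in a hypothetical 576-byte table
```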

Serialization

parse_struct can also be used for generating binary data. The API is quite similar to deserialization. The function is parse_struct.core/serialize, and it takes two arguments: a type definition and a Clojure value that conforms to that definition:

(def spec {:type    :array
           :len     20
           :element i32be})

(def data1 (range 20))

(def bs (serialize spec data1))

(def data2 (deserialize spec bs))

(assert (= data1 data2))
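
For comparison, here is a minimal sketch of the same round trip done by hand with java.nio.ByteBuffer, whose default byte order is big-endian. The names ints->bytes and bytes->ints are hypothetical, and this is not how parse_struct implements serialize.

```clojure
(import '(java.nio ByteBuffer))

(defn ints->bytes
  "Encodes a collection of 32-bit integers as big-endian bytes."
  [xs]
  (let [^ByteBuffer buf (ByteBuffer/allocate (* 4 (count xs)))]
    (doseq [x xs]
      (.putInt buf (int x)))
    (.array buf)))

(defn bytes->ints
  "Decodes big-endian bytes back into a vector of 32-bit integers."
  [^bytes bs]
  (let [^ByteBuffer buf (ByteBuffer/wrap bs)]
    (vec (repeatedly (quot (alength bs) 4) #(.getInt buf)))))

(vec (ints->bytes [1]))                  ; => [0 0 0 1]
(= (range 20)
   (bytes->ints (ints->bytes (range 20)))) ; => true
```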