Binary Parsing (follow-up)
Dec. 28th, 2019 01:19 am
While writing the previous rant, it occurred to me that many binary formats are not, in fact, context-free.
It is often the case that the structure you are trying to read depends on elements encountered earlier in the stream. This means that no DSL invented for binary parsing can escape managing state if it is to be a full-blown parser that produces a usable model, object or otherwise. And that state management can be pretty damn complicated.
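To make the state dependency concrete, here is a minimal sketch in Python. The format is entirely made up for illustration: a one-byte header whose lowest bit decides how wide every later length field is. The name `parse_records` and the layout are assumptions, not any real format.

```python
import struct


def parse_records(data: bytes) -> list:
    """Parse a hypothetical stream: a 1-byte header, then length-prefixed records.

    Bit 0 of the header selects the width of every length field that follows
    (0 = 16-bit, 1 = 32-bit), so no record can be decoded without it.
    """
    wide = bool(data[0] & 0x01)  # state captured from earlier in the stream
    fmt, width = (">I", 4) if wide else (">H", 2)

    pos, records = 1, []
    while pos < len(data):
        (length,) = struct.unpack_from(fmt, data, pos)
        pos += width
        records.append(data[pos:pos + length])
        pos += length
    return records
```

Even in this toy case the grammar of a record is not fixed: it is chosen at run time by a value read earlier, which is exactly the state a declarative description would have to carry around.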
One glorious example would be MPEG-TS (MPEG Transport Stream). That format has at least three layers of binary syntax. The lower layers have to be parsed first, then selected pieces of their output have to be fed into the next layer's parser, and so on, ad nauseam.
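The layering can be sketched roughly as follows, in Python. The header layout used here (188-byte packets, sync byte 0x47, 13-bit PID, adaptation field control bits) is the standard transport packet syntax; the function names are made up, and the second function assumes the chosen PID carries PES data, where the payload_unit_start flag marks the first packet of a new unit. Continuity counters and error flags are ignored to keep it short.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def ts_payloads(stream: bytes, wanted_pid: int):
    """Layer 1: yield (unit_start, payload) for every packet on wanted_pid."""
    for off in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        pkt = stream[off:off + TS_PACKET_SIZE]
        if pkt[0] != SYNC_BYTE:
            raise ValueError(f"lost sync at offset {off}")
        unit_start = bool(pkt[1] & 0x40)           # payload_unit_start_indicator
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]      # 13-bit packet identifier
        afc = (pkt[3] >> 4) & 0x03                 # adaptation_field_control
        if pid != wanted_pid or not (afc & 0x01):  # other stream, or no payload
            continue
        start = 4
        if afc & 0x02:                             # adaptation field comes first
            start += 1 + pkt[4]                    # length byte plus the field
        yield unit_start, pkt[start:]


def reassemble_units(stream: bytes, wanted_pid: int) -> list:
    """Layer 2 input: glue payloads into units for the next parser (PES assumed)."""
    units, current = [], None
    for unit_start, payload in ts_payloads(stream, wanted_pid):
        if unit_start:
            if current is not None:
                units.append(bytes(current))
            current = bytearray()
        if current is not None:                    # drop data before the first start
            current += payload
    if current is not None:
        units.append(bytes(current))
    return units
```

Each element returned by `reassemble_units` would then go to a separate parser for the next layer, which in turn hands its payload to a codec-specific one. Note that the second function is nothing but state management.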
Another example would be the messier varieties of ISOBMFF like Apple QuickTime or Sony camera output. With those, you literally have to check flags encountered elsewhere in the file to parse a given structure correctly. Long story short, any DSL powerful enough to describe a complete binary parser that spews out an object model amounts to inventing a whole programming language, control flow and all.
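Even well-behaved ISOBMFF shows a milder form of this. The sketch below, in Python, is not the QuickTime or Sony quirk itself, only the standard in-box variant of the same idea: the box size field can switch the header to a 64-bit size, and the version byte of an `mvhd` box decides whether its time fields are 32 or 64 bits wide. The function names are made up for illustration.

```python
import struct


def read_box_header(buf: bytes, pos: int):
    """Return (box_type, payload_start, box_end) for the box starting at pos."""
    size, box_type = struct.unpack_from(">I4s", buf, pos)
    header = 8
    if size == 1:        # escape value: a 64-bit size follows the type
        (size,) = struct.unpack_from(">Q", buf, pos + 8)
        header = 16
    elif size == 0:      # escape value: the box runs to the end of the data
        size = len(buf) - pos
    return box_type, pos + header, pos + size


def parse_mvhd_timing(buf: bytes, pos: int) -> dict:
    """Read the timing fields of an mvhd box; their width depends on version."""
    box_type, start, _end = read_box_header(buf, pos)
    if box_type != b"mvhd":
        raise ValueError(f"expected mvhd, got {box_type!r}")
    version = buf[start]  # full box: 1 byte version, then 3 bytes of flags
    fmt = ">QQIQ" if version == 1 else ">IIII"
    creation, modification, timescale, duration = struct.unpack_from(
        fmt, buf, start + 4
    )
    return {
        "version": version,
        "creation_time": creation,
        "modification_time": modification,
        "timescale": timescale,
        "duration": duration,
    }
```

Two branches in a dozen lines, and that is the tidy part of the format. Once the deciding value lives in a different box, the parser has to keep it somewhere and look it up later.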
I have struggled with the issue for quite some time, ever since I first ran into binary formats while working at MainConcept. So far I haven't found a silver bullet. I can only say that the best bet so far is code generation: it saves the effort of writing most of the boilerplate while still leaving reasonable leeway for the crazier cases.
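One possible shape for that approach, as a rough sketch in Python rather than a description of any particular tool: a generator emits the fixed-layout reading code from a field table, and the generated class exposes a hook where the hand-written, state-dependent part goes. Everything here (`generate_reader`, `read_extra`, the `ChunkHeader` layout) is hypothetical.

```python
import struct


def generate_reader(class_name: str, fields: list) -> str:
    """Emit Python source for a reader class from (name, struct_code) pairs."""
    fmt = ">" + "".join(code for _, code in fields)
    lines = [
        "import struct",
        "",
        f"class {class_name}:",
        f"    _FMT = struct.Struct({fmt!r})",
        "",
        "    @classmethod",
        "    def read(cls, buf, pos=0):",
        "        obj = cls()",
        "        values = cls._FMT.unpack_from(buf, pos)",
    ]
    for i, (name, _) in enumerate(fields):
        lines.append(f"        obj.{name} = values[{i}]")
    lines += [
        "        end = obj.read_extra(buf, pos + cls._FMT.size)",
        "        return obj, end",
        "",
        "    def read_extra(self, buf, pos):",
        "        # Hook for hand-written parsing; the default reads nothing.",
        "        return pos",
    ]
    return "\n".join(lines) + "\n"


# Generated part: the boring fixed-layout fields.
source = generate_reader("ChunkHeader", [("kind", "B"), ("length", "H")])
namespace = {}
exec(source, namespace)


# Hand-written part: the body size depends on a field read by generated code.
class ChunkHeader(namespace["ChunkHeader"]):
    def read_extra(self, buf, pos):
        self.body = buf[pos:pos + self.length]
        return pos + self.length


chunk, end = ChunkHeader.read(b"\x07\x00\x03abcXYZ")
print(chunk.kind, chunk.length, chunk.body, end)
```

The generator stays dumb on purpose. Anything that needs control flow lives in ordinary code in the subclass, so there is no need to grow the description format into a language of its own.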