September 29, 2015

Building an HTML5 control surface, part 2: networking

I started the project with a few critical unknowns - things that must be possible in order to make it actually worth anything:
  • full API access
  • low network latency
  • touch precision, responsiveness and actual practicability for the task
  • fast UI updates

In part 1 I showed how to access Live's API and, more importantly, how to make it practical and meaningful. I've seen all kinds of APIs and I can tell that the ability to interact and experiment is more important than a dump of method signatures.

In this second part I'll focus on networking. Thanks to existing software such as TouchDAW and LK, I already knew that acceptable latency was achievable. But I had no idea whether I could achieve it in a browser.

Here is a simplified diagram showing the communication between components. Promises are a programming concept, but they are critical to handling all the complexity caused by asynchronous communication, so I present them here as an internal communication bus.

+-----------------+
|   Browser       |
|                 |
|   Widgets       |
|     ^           |
|     |           |
|     | Promises  |
|     | + Updates |      +---------------+            +-----------------+
|     v           |      |               | ZeroMQ     |                 |
| Live Proxy <------------>   Server   <---------------> Ableton Live   |
|                 | WebRTC               |            |                 |
+-----------------+      +---------------+            +-----------------+

Reality is a bit more complex, but the parts not shown here are mostly workarounds.

Message format and messaging pattern

Format of a request:
|       Envelope       |                          Message                           |
| clientID | messageID | Object reference                      | method | parameter |
|   xxxx   |   yyyyy   | tracks.0.devices.1.parameters.2.value | set    | 100       |
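As a sketch, a request could be assembled like this browser-side. The field names and the plain-object encoding here are illustrative assumptions, not the actual wire format; note that the clientID is only added server-side, so it is not part of the browser's message.

```javascript
// Illustrative sketch of the request format described above.
// Field names are assumptions for illustration.
let nextMessageId = 0;

function buildRequest(objectPath, method, parameter) {
    return {
        messageID: nextMessageId++,  // lets us match the reply later
        path: objectPath,            // e.g. "tracks.0.devices.1.parameters.2.value"
        method: method,              // e.g. "set" or "get"
        parameter: parameter         // e.g. 100
    };
}
```

Each request gets a fresh messageID, which is the key property the rest of this article relies on.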

The simplest design would be to send one request at a time, wait for the response, then send the next request.

+   Request 1    +
| +------------> |
|   Reply 1      |
| <------------+ |
|   Request 2    |
| +------------> |
|   Reply 2      |
| <------------+ |
|      ...       |
+                +

But this approach is far too slow and leaves resources underused: at any given time, only one of the client, the network or the server is busy. We want to use all the throughput we can get.

Instead, if we have several requests to make at the same time, we just push them all through the pipe, then we get all the responses back.

+   Request 1    +
| +------------> |
|   Request 2    |
| +------------> |
|   Request 3    |
| +------------> |
|   Reply 1      |
| <------------+ |
|   Reply 2      |
| <------------+ |
|   Request 4    |
| +------------> |
|   Reply 3      |
| <------------+ |
|   Request 5    |
| +------------> |
|   Reply 4      |
| <------------+ |
|   Reply 5      |
+ <------------+ +

This way we use all the bandwidth available for requests, and we use the CPU as much as we can. At startup, several dozen API calls are necessary to set up the widgets: initial values for knobs (aka encoders), clip length and notes, etc.
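The difference between the two approaches can be simulated with plain promises; the fixed 10ms delay below stands in for the network round trip (an assumption, not a measurement), and counters track how many requests are in flight at once.

```javascript
// Counters tracking how many simulated requests are in flight.
let inFlight = 0;
let maxInFlight = 0;

// Simulated request: the 10ms timeout stands in for the round trip.
function fakeRequest(id) {
    inFlight++;
    maxInFlight = Math.max(maxInFlight, inFlight);
    return new Promise((resolve) =>
        setTimeout(() => { inFlight--; resolve("reply " + id); }, 10));
}

// One at a time: wait for each reply before sending the next request.
async function sequential(n) {
    for (let i = 0; i < n; i++) {
        await fakeRequest(i);
    }
}

// Pipelined: push all the requests through, then collect the replies.
async function pipelined(n) {
    await Promise.all(Array.from({length: n}, (_, i) => fakeRequest(i)));
}
```

Running `sequential(5)` never has more than one request in flight and takes five round trips; `pipelined(5)` keeps all five in flight and takes roughly one round trip.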

Since all the layers involved are supposed to guarantee order and reliability, we could in principle just use a queue to match each reply to its request. But somehow I managed to lose replies after a while. I have no idea where it happens, but the fact is that if you don't mark your messages because you expect a perfect sequence, then once the sequence is lost the failure is catastrophic: you have no choice but to restart all components to a clean state, so nobody keeps a polluted queue. And it's indeed a mess to troubleshoot. It turns out that adding a messageID to match requests and replies adds no significant overhead and is easier to program and manage. I lost a few hours refactoring that part because I wanted to save a few bytes per message, which was ridiculous.

So here is the sequence of events:
  • browser-side
    • a widget needs to call the API, to get its initial value or to set a new one
    • the widget calls the "API proxy" with the message that has to be sent to the API
    • the proxy adds a messageID and sends the message to the server
  • server-side
    • add a clientID to the request, to know which browser made it
    • call the API with that request
  • Live API
    • process the request, generate the answer
    • send the reply with the original envelope (that is, clientID and messageID)
  • server-side
    • read the reply's clientID, send the reply to the corresponding browser
  • browser-side
    • the API proxy receives the message and checks its messageID
    • the corresponding callback is called with the reply content
      (it's actually promises, more on that below)

All of this can happen in parallel with multiple messages.
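Browser-side, that matching step boils down to a map of pending promises keyed by messageID. Here is a minimal, self-contained sketch; the class and method names are illustrative, not the actual source, and the real proxy also handles listeners and timeouts.

```javascript
// Minimal sketch of an API proxy matching replies to requests by messageID.
// "send" would go through the WebRTC data channel in the real application;
// here it is injectable so the sketch stays self-contained.
class ApiProxy {
    constructor(send) {
        this.send = send;          // function that actually transmits the message
        this.pending = new Map();  // messageID -> {resolve, reject}
        this.nextId = 0;
    }

    // send a request, return a promise resolved when the reply comes back
    request(message) {
        const messageID = this.nextId++;
        return new Promise((resolve, reject) => {
            this.pending.set(messageID, {resolve, reject});
            this.send({messageID, ...message});
        });
    }

    // called for each reply coming back from the server, in any order
    onReply(reply) {
        const entry = this.pending.get(reply.messageID);
        if (!entry) return;  // late or duplicate reply: just drop it
        this.pending.delete(reply.messageID);
        entry.resolve(reply.result);
    }
}
```

Because each reply carries its messageID, replies can arrive in any order and each promise still resolves with the right result.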

I suggest always using messageIDs, even if you expect a strict order; it makes troubleshooting much easier. Most problems happen under high load, and you don't want to manually count dozens of messages and replies in the logs. Grepping for a unique ID is much simpler.


WebRTC

WebRTC is normally meant for peer-to-peer communication between browsers, allowing realtime video, voice and chat, and the only browsers that implement it are Chrome and Firefox. It's even more restricted on mobile devices: Apple only allows its own rendering engine, and as a consequence no browser on iOS supports WebRTC. That means I'm limited to Android and desktops.

So why not just use websockets, which are so easy to set up and widely supported? That's indeed what I used in my early experiments. Unfortunately you can't disable Nagle's algorithm: small messages are grouped together, which regularly causes a latency of a few tenths of a second. That's totally unacceptable, as I want to be able to play notes without any perceptible delay between the touch and the sound coming out of Ableton Live. WebRTC, on the other hand, can run on top of UDP, and so far I have not heard any network-induced delay.

So I ended up using WebRTC's data channel, which can be used for message-oriented communication, perfect for this purpose. I tried streaming the sound as well, but there is way too much latency; I'm not sure it can be achieved. Maybe streaming it in a more classical way from the server is a better option.

Setting up a WebRTC connection is not straightforward, but fortunately great examples can be found out there, like this blog post: http://blog.printf.net/articles/2014/07/01/serverless-webrtc-continued/ . So, during the lifetime of the web application, HTTP is only used to load the usual assets (HTML, CSS and JavaScript); then another HTTP call is made to set up the WebRTC channel (offer/answer in WebRTC terminology). After that, everything else goes through WebRTC.

Server-side, each WebRTC connection is associated with an ID that is used in the envelope.

That said, as the node bindings for WebRTC are not stable, I'll explain my workaround at the end of the article.


ZeroMQ

ZeroMQ is ridiculously simple to use, but you only grasp its essence once you need it in a project. It was a bit tricky to build for Python 2.5, but totally worth it (details here: http://www.djcrontab.com/2015/09/building-html5-control-surface-for.html#comment-2274750354 ).

There is no "connection refused" with ZeroMQ; it's just a peer that is not there yet. You can indeed detect the situation and handle it, but in a separate event handler. This forces clean, focused, event-oriented code: setup, data processing and exception handling belong in different places.

Also, while it implements explicit high-level roles such as publisher/subscriber or push/pull, the notion of bind and connect is independent from these roles. For instance, several publishers can connect to a single subscriber. Apart from initialization, none of your code has to care about it.

It's plenty of details like these that make it easy to experiment until you find the right implementation. I'd like to see an API similar to ZeroMQ on top of WebRTC data channels; it would make distributed applications involving a browser much more consistent.

ZeroMQ comes with three basic messaging patterns, on top of which more complex schemes can be built:
  • Request/reply is the simplest, as it requires only one socket on each side, but the throughput is limited, as explained above.

    Instead we need two pairs of sockets: one to pipeline requests and another one to pipeline replies.
  • Pub/sub is an option, but its default behavior is to drop packets once the send buffer is full. Pub/sub also suffers from the "slow joiner" problem; overall, that pattern is more convenient for one-way event broadcasting that doesn't involve state changes.
  • Push/pull blocks the call when the send buffer is full, and doesn't suffer from the slow joiner problem. Therefore, push/pull seems the best option for pipelining requests and replies.

Python side (Ableton Live API)

While asyncio has been around for a while, it seems that the most common way to do IO in Python, including with ZeroMQ, is to just use threads and let IO methods block as needed. This is the way it's done in the official documentation. I'm principally a Python developer and I find this approach idiomatic. An alternative implementation, aiozmq, allows using coroutines. It really feels strange and unnatural to me; I'm not sure if it's a question of habit, or if it doesn't fit the overall syntax. I'd prefer callbacks or some other inversion of control. I didn't try it, but those "yield" calls in the middle of a method feel error-prone, bringing more complexity, and may lead to forgetting to free resources. Or maybe it doesn't follow the "explicit is better than implicit" principle. It's only a feeling.

So, threading and blocking calls are fine, but here they lead to a problem: we can't rely on blocking and threading inside an Ableton Live control surface script. I said in the previous article that I did JSON-to-MIDI serialization, but that was a temporary solution: serialization is too slow, I have to escape the end-of-sysex byte (247), and I had to split big messages because of a (totally reasonable) buffer limitation in node-midi. Overall it's not a viable solution.

Here is the trick I use instead. I have two pairs of push/pull sockets, one for requests, one for replies. When I need to make a request to Live's API:
— send an empty sysex message ([240, 247]) from the server
— then immediately send the request via the server's push socket
— the sysex message wakes up the control surface script in its midi handler method
— read one message from the control surface's pull socket
— when the request is processed, the control surface script sends the response from its push socket
— from the server, get the reply from the pull socket

Which gives, in simplified form:

import zmq

# ControlSurface is provided by Live's MIDI remote script framework
from _Framework.ControlSurface import ControlSurface

class NodeControlSurface(ControlSurface):
    def __init__(self, c_instance):
        ControlSurface.__init__(self, c_instance)
        self.context = zmq.Context.instance()
        # receiver socket
        self.pull_socket = self.context.socket(zmq.PULL)
        self.pull_socket.RCVTIMEO = 5  # never block longer than 5ms
        self.pull_socket.bind("tcp://127.0.0.1:8000")  # address is illustrative
        # reply socket
        self.push_socket = self.context.socket(zmq.PUSH)
        self.push_socket.bind("tcp://127.0.0.1:8001")  # address is illustrative

    def receive_midi(self, midi_bytes):
        # this method is called as soon as a sysex message is received
        # (not called for any other midi message since no mapping has been set up)
        # receive the request; this blocks at worst 5ms
        clientId, req = self.pull_socket.recv_multipart()
        # process the request, then send the reply with the original envelope
        # (process_request stands in for the actual API dispatch)
        self.push_socket.send_multipart([clientId, self.process_request(req)])

So instead of an explicit event loop blocking on a pull socket, it's just an event handler. It processes one request at a time, that's it. It feels less cluttered than a loop.

JavaScript side (web application)

I didn't mention it yet, but the web application uses aurelia.io . Since the app is very specific, the framework isn't used extensively, but there is one important point: Aurelia enforces the use of ES6 (check the list of ES6 features, it's amazing). It let me discover what the latest trend in JavaScript looks like, and it's really, really neat. Python still has strong points thanks to its extensive standard library, but when it comes to the language itself, I'm close to calling JavaScript my favourite language. It's not about purity, it's about solving today's problems; it was once the language everyone was forced to use browser-side, but today it's more than that, it has real practical value.

Coroutines?

There are some attempts at implementing coroutines in JavaScript as well, but I don't like them either. Event handlers are just fine and, more importantly, match what's really happening; even when it comes to handling state, closures are just fine. Indeed, some examples show that coroutines can remove some clutter, but I feel the value of the removed clutter doesn't balance the added obscurity in real situations. Learning to think in terms of state and events provides more value than using coroutines, in my opinion.


Promises

Promises are absolutely critical for such an application to even exist. It's not about performance, and it's not enabling anything internal to the program: humans have hard limits too. Consider complexity as proportional to the number of arbitrary items you have to hold in your mind to understand something, in order to troubleshoot or modify a program. One trick is to make those items no longer arbitrary, so that they are connected somehow. See this great TED talk about memory: Feats of memory anyone can do. You must organize knowledge to tell a story, present logical relationships, meet expectations from previous experiences, be consistent and meaningful.

One of the most complex parts of the application is the clip editor. Its initialization depends on plenty of parameters which must be fetched remotely: does a clip exist in the given slot? If not, create it. Then fetch how long it is and where it starts, so I can request the notes to display. Also, create listeners for play status, play position and note changes, but not before I've ensured the clip exists. Some parts must be strictly ordered, some parts can run in parallel; some parts can fail.

Imagine this on ten levels. Add checks, branches, temporary state variables.

This would be an unmanageable spaghetti of callbacks, with no hope of properly pipelining parallelizable calls to make it fast. One could get creative and come up with an original solution, but you don't have to: promises to the rescue.

This is the initialization code for the clipslot object (think of something similar to an active record):

this.ready = Promise.reduce([
    () => this.get("has_clip"),  // this.get automatically initializes this.has_clip at reply
    () => {
        if (!this.has_clip) {
            return this.call("create_clip", 4); // create a clip of 4 beats
        }
    },
    () => this.get("clip"), // creates a local reference to the remote "clip" object
    () => Promise.settle([ // wait for all promises to either succeed or fail, in any order
        this.clip.set("looping", true),
        this.clip.listen("playing_status", () => this.get("is_playing")),
        this.clip.listen("notes", () => this.updateDisplay()) // notes have changed
    ]),
    () => {
        // we now have all the references we need, so now we can listen for changes
        Object.observe(this.clip, (changes) => {
            var attributes = new Set(changes.map((x) => x.name));
            if (attributes.has("loop_end") || attributes.has("loop_start")) {
                this.updateDisplay(); // update display on change
            }
        });
        this.updateDisplay(); // first update of the display
    }
], (_, c) => c(_), null /* "reduce" needs an initial value and an accumulator function */).catch((error) => {
    throw error;
});

It's still quite involved, but it combines ordered initialization when needed, parallelization when possible, and error tolerance where applicable, all of this staying as flat as possible instead of a christmas tree.

  • First, notice it's all arrow functions, new in ES6. Not only does this clutter the source code less, but the "this" variable is the same as in the enclosing scope. Perfect for callbacks.
  • this.get, this.set and this.call all actually pass a request to the API proxy and immediately return a promise. On reply, "get" updates the matching attribute of "this" and resolves the promise.
  • Promise.reduce executes items in sequence. It expects an array of functions which either return an immediate value or a promise, in which case the sequence is blocked until it is resolved. A promise can resolve to another promise, which is really another way of expressing a sequence. Notice it's all functions; if promises were expressed directly they would execute immediately, but in our case their execution depends on the completion of the previous steps in the sequence.
  • "listen" calls a remote function that calls us back on change. Notice that an event differs from a promise: a promise is resolved once, while an event fires as often as needed. In our specific implementation "listen" can fail if we are already listening, which may happen due to the order of initialization with other components.
  • "settle" just waits for the completion of an array of promises, whether they fail or succeed.
  • "observe" is a recent addition to JavaScript, and is a lightweight way to be notified of changes on a local object. It's called once per event-loop tick, which is why it receives an array of changes.

Not visible here but important as well: all promises are created with a 30s timeout. That feature made me use bluebird, recommended by Aurelia's documentation, instead of Q. This way all delayed actions can be considered failed after a timeout, no matter the origin, whether it's an ajax call or another communication protocol; it even works as a temporary band-aid for a failure you can't spot in a series of callbacks. No matter what, you'll never get a promise stuck forever.
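For reference, the two mechanisms this relies on, sequencing an array of functions and guarding any promise with a timeout, can be sketched with plain promises. Bluebird provides both out of the box (Promise.reduce and .timeout); this re-implementation is only to show the mechanics.

```javascript
// Run an array of functions in sequence: each one is called only after the
// value (or promise) returned by the previous one has settled successfully,
// and receives that previous result as its argument.
function sequence(steps) {
    return steps.reduce((prev, step) => prev.then(step), Promise.resolve());
}

// Wrap any promise so it can never stay pending forever; bluebird's
// .timeout(ms) does the same thing, with cancellation support.
function withTimeout(promise, ms, label) {
    let timer;
    const guard = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error("timeout: " + label)), ms);
    });
    return Promise.race([promise, guard]).finally(() => clearTimeout(timer));
}
```

With this, `sequence([() => 1, (x) => x + 1])` resolves to 2, and `withTimeout(somePromise, 30000, "api call")` rejects after 30s instead of hanging.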

WebRTC Workaround

I couldn't get wrtc to run for more than 10 minutes without crashing; I went as far as isolating it inside a subprocess, so I'm sure it's the one causing the segmentation faults. Strangely enough, it's the only binding I could find for an interpreted language. I haven't taken the time so far to get a clean, isolated core dump and either open a ticket or try to fix it myself; I knew this would add weeks of delay to my progress. So I went for an option with a 100% chance of working for now: run the WebRTC part of the server inside a browser. I get CPU peaks from time to time, but never a crash.

Chrome's native messaging allows extensions to run a local program. The best wrapper I could find for node is https://github.com/jdiamond/chrome-native-messaging.git . But it does complicate the infrastructure again, as Chrome cannot directly talk to an arbitrary external program, even through an extension: it must be the one executing it. So this extension comes in two parts:
  • The frontend of the extension is a background page managing the WebRTC connections. I extracted the logic from my original node.js attempt; it didn't require much rework.
  • The extension launches a native application, which is really just a node script. This native part is needed to talk to the server via ZeroMQ, while it talks to the extension via standard input/output streams.
Setting up the WebRTC connection then follows this workflow:
  • an offer/answer exchange must be done in order to set up the WebRTC connection; this is done via HTTP on the main server
  • the server forwards the offer to the native part of the extension via ZeroMQ
  • the native application then talks to the background page via native messaging
  • once the WebRTC connection is set up, the extension routes messages to Ableton if they are API commands, or to the server if they are MIDI
It's a bit messy for my taste, especially because there is only one communication channel between the native application and the background page, and the background page is the only one able to talk to the browser via WebRTC. This means the stream between the extension and the native application carries messages of mixed nature: the offer/answer exchange, MIDI to be relayed to the server, commands to send to Ableton, replies coming from Ableton, or listener notifications coming from Ableton. Each message has a specific signature so it can be routed properly.
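That routing can be sketched as a single dispatch on a per-message signature; the "kind" field and the destination names below are illustrative assumptions, not the actual protocol.

```javascript
// Sketch of routing messages of mixed nature carried on one channel,
// based on a per-message signature. The "kind" values and destination
// names are assumptions for illustration.
function route(message) {
    switch (message.kind) {
        case "offer":          // WebRTC offer/answer exchange
        case "answer":
            return "background-page";
        case "midi":           // MIDI to be relayed
            return "server";
        case "api-request":    // commands for Ableton Live
            return "ableton";
        case "api-reply":      // replies and listener notifications
        case "listener":
            return "browser";
        default:
            throw new Error("unroutable message: " + message.kind);
    }
}
```

Keeping the dispatch in one place makes it obvious which message kinds exist, and an unknown signature fails loudly instead of silently polluting a queue.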

Fortunately, the only purpose of this extension is to route messages. It also adds the clientID to requests, so it's almost stateless: the only state is the connections themselves. Messages and replies contain all the state needed for routing.

I'd like to see that part merged back into the server once the WebRTC instability is fixed; it's a workaround, not a proper architecture.


Due to browser limitations, we can't have a web application talking directly to Ableton. Browser-side, communication protocols are limited; inside Ableton Live, control surface scripts can't rely on threads. This is what justifies the mix of ZeroMQ and WebRTC presented here. If you intend to build a web application with the lowest network latency you can get, I hope this helped.
