Techniques for faster pixel-level drawing?

Is there any non-obvious way to push multiple, arbitrary pixel values into a bitmap/display that's faster than making setColor and drawPixel calls on each pixel?

My watchface wants to fill a relatively large area of the screen with pixel values that it computes on the fly. Specifically, I have some source grayscale pixel data which I'm scaling, clipping, rotating, and dithering to generate a different image every few minutes.

I've done a lot of work to make my pixel-transforming code reasonably quick, such that now much of the execution time of each frame is just the calls to `Dc.setColor` and `Dc.drawPixel` that actually push pixels into an offscreen bitmap. I've verified that in the Simulator's Profiler view, and also by commenting out those calls and running on the device.

Right now I can generate and draw ~100 pixels on each update before my watchface uses too much time and gets killed. But it seems like there are enough cycles to generate many more pixels than that, if I could get them into a Bitmap or the display more efficiently.

For example:

  • Somehow access the raw memory used by an offscreen bitmap, say as an `Array<Number>`?
  • Put some pixel values in an array and copy (blit) them all at once to a Bitmap?
  • Use a smaller palette, fitting more pixels into each Number, and set multiple pixels at once using a single Number value?
  • Construct a bit mask and use it to fill multiple pixels with the same color in one call?
  • Somehow reduce the overhead of repeatedly mapping the same few Color values to 4-bit raw pixel values that I imagine is eating time?
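To make the palette-packing idea concrete, here's a sketch of what I have in mind (a hypothetical scheme; `pack`/`unpackAt` are names I made up, and as far as I can tell no Toybox API accepts a packed word like this today). With a 4-color palette at 2 bits per pixel, one 32-bit Number could carry 16 pixels:

```monkeyc
import Toybox.Lang;

// Hypothetical 2-bit packing: 16 palette indices (0..3) per Number.
function pack(pixels as Array<Number>) as Number {
    var word = 0;
    for (var i = 0; i < 16; i++) {
        word |= (pixels[i] & 0x3) << (i * 2);
    }
    return word;
}

// Recover the palette index of pixel i from a packed word.
function unpackAt(word as Number, i as Number) as Number {
    return (word >> (i * 2)) & 0x3;
}
```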

Lower-level pixel operations like this are often available on performance-constrained platforms, but I can't find anything useful in the docs.

Some things that I've tried that aren't faster:

  • Offscreen Bitmap vs drawing to the display (same speed)
  • Use a palette with just WHITE, LT_GRAY, DK_GRAY, BLACK, and TRANSPARENT (same)
  • Write all pixels of each color in a single pass to avoid most of the setColor calls (the overhead of managing bit vectors, etc. is more than the savings)
  • Batch runs of similar values and draw using drawLine (doesn't really apply to my dithered pixels)

For the time being, I'm splitting the rendering across multiple update cycles, which is pretty tricky to manage, and isn't always responsive enough when the device decides not to call onUpdate when I have drawing to finish.

I realize I may be pushing the platform in a way it wasn't really intended to be pushed, but I really like the results I'm getting, and if such a technique existed, it would unlock lots of interesting possibilities (Doom, anyone?).

  • Another thought about those slow `StringUtil.utf8ArrayToString` calls.

    `utf8ArrayToString` cannot know the necessary string length until it has scanned the bytes; it may have to combine multiple bytes into a single character.  On the other hand, something like `StringUtil.convertEncodedString` converting to a hex string knows how long the resulting string will be up front: twice the byte array's length.
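    For illustration, the hex conversion would look something like this (a sketch; check the exact option values against the StringUtil docs):

    ```monkeyc
    import Toybox.Lang;
    import Toybox.StringUtil;

    // Hex output length is knowable up front (two chars per byte),
    // unlike UTF-8 decoding, which must scan the bytes first.
    function toHexString(bytes as ByteArray) as String {
        return StringUtil.convertEncodedString(bytes, {
            :fromRepresentation => StringUtil.REPRESENTATION_BYTE_ARRAY,
            :toRepresentation => StringUtil.REPRESENTATION_STRING_HEX
        }) as String;
    }
    ```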

    `StringUtil.charArrayToString()` is theoretically fast if it's essentially a memory blit of all the character values from the array's contiguous block of memory into a string's contiguous block of memory.  But it probably does other checks to validate that the array contains only chars.

  • I ran a test covering the screen in 6x6 blocks using a custom font.  One byte in a ByteArray corresponds to one character, i.e. one 6x6 block.  A bit different from what we've been discussing here.

    But I was surprised that my vivoactive 4s is letting each watchface update take a whopping 275ms.  Can't imagine what that does to the battery, and I figured it would be killed before that.  Makes sense if the device is counting opcodes, not time, because I'm making a single draw call that eats up most of the time, so I'm keeping my opcode count down.

  • Is your pixel data stored in an Array or a ByteArray?

    I'm generating pixels one bit at a time, effectively. At the moment I generate 10 of them in a Number, use that to index a table of one-character strings, and thereby avoid actually constructing any new Strings in the rendering loop. As far as I can tell, any of the methods of building String instances are slow compared to multiple drawText calls.
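    A sketch of that lookup-table approach (the details are hypothetical; `GLYPH_BASE` and the glyph mapping depend entirely on how the custom font is laid out):

    ```monkeyc
    import Toybox.Lang;

    // Hypothetical: 2^10 one-character strings, built once at startup,
    // so the render loop never constructs a String.
    const GLYPH_BASE = 0x100;  // assumed first code point of the pattern glyphs
    var glyphs as Array<String> = new Array<String>[1024];

    function initGlyphs() as Void {
        for (var i = 0; i < 1024; i++) {
            glyphs[i] = (GLYPH_BASE + i).toChar().toString();
        }
    }
    // In the render loop: dc.drawText(x, y, customFont, glyphs[bits], ...)
    ```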

  • In general, custom fonts are slower to draw than native fonts.  You can see this when using custom fonts, onPartialUpdate, low power mode and watchface diagnostics.

  • I'm seeing that converting a ByteArray (not an array of numbers) into a string is quite fast.  The time is spent in the draw call.  This might be because the ByteArray is already contiguous bytes in memory, and it does not need to typecheck the values nor convert them from signed ints into unsigned bytes.

    I also create the options dictionary only once.  Constructing an inline dictionary for every call is probably slow.

    Time permitting, I'll update my test to directly compare against the speed of multiple draw calls w/lookup.

    My code is on GitHub so it should be fast enough for anyone else to clone it and run on their device.  github.com/.../garmin-sandbox
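    A minimal sketch of the hoisted-options pattern (the option values are assumptions; adjust to your encoding):

    ```monkeyc
    import Toybox.Lang;
    import Toybox.StringUtil;

    // Built once, not per call: reusing one dictionary avoids an
    // allocation on every conversion.
    const BYTES_TO_STRING_OPTS = {
        :fromRepresentation => StringUtil.REPRESENTATION_BYTE_ARRAY,
        :toRepresentation => StringUtil.REPRESENTATION_STRING_PLAIN_TEXT,
        :encoding => StringUtil.CHAR_ENCODING_UTF8
    };

    function rowToString(row as ByteArray) as String {
        return StringUtil.convertEncodedString(row, BYTES_TO_STRING_OPTS) as String;
    }
    ```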

  • is any of your code available

    I went ahead and made the repo public:

    https://github.com/mossprescott/moonface/tree/font-rendering

    For the record, in case anyone decides to look at this code in any detail... There are lots of things I could do to improve performance in other areas of this code (most notably, avoid repeatedly calculating the position of the sun at every hour of the day, and avoid re-drawing the relatively static background on every update).

    Nevertheless, actually plotting moon pixels is ~25% of the total time (per the profiler, running in the simulator), with another ~25% for transforming/clipping/dithering. My frustration comes from the fact that the rest of this code is doing a lot of math and other "hard" stuff, while the pixel plotting would be entirely trivial if I had a lower-level API (i.e. blit).

  • Your original post says you can only do about 100 pixels before the watchdog triggers; but the experiments below suggest you can do ~10000 dc.drawPoint calls without triggering the watchdog (or presumably about ~5000 setColor/drawPoint pairs). So I'm thinking that it's not the drawPoint/setColor that's really causing the issues.

    So I tried generating a byte array as the source data, then dithering it into a BufferedBitmap, and then using drawBitmap2 to rotate and tint it onto the screen. I found that I could do a 49x49 block (2401 pixels) without triggering the watchdog. 50x50 does trigger it (and yes, I'm sure that by tweaking the code a bit I could get to 50x50; the point is it's comfortably doing 20x more than reported in the original post).

    Here's the code (I started by using the "New Project" command, and created a "complex" data field called Blitter):

    import Toybox.Activity;
    import Toybox.Graphics;
    import Toybox.Lang;
    import Toybox.Math;
    import Toybox.WatchUi;
    
    const BlockSize = 49;
    
    class BlitterView extends WatchUi.DataField {
      private var image as ByteArray = new [BlockSize * BlockSize]b;
      private var errors as Array<Float> = new Array<Float>[BlockSize + 1];
      private var angle as Float = 0.0;
      private var color as Number = Graphics.COLOR_RED;
    
      function initialize() {
        DataField.initialize();
        for (var x = 0, p = 0; x < BlockSize; x++) {
          for (var y = 0; y < BlockSize; y++, p++) {
            image[p] =
              ((x * x + y * y) * 255 + 128) /
              ((BlockSize - 1) * (BlockSize - 1) * 2);
          }
        }
      }
    
      function onLayout(dc as Dc) as Void {}
    
      function compute(info as Activity.Info) as Void {
        var time = info.timerTime;
        if (time == null) {
          time = 0;
        }
        angle = ((((time / 500) % 360) * Math.PI) / 180).toFloat();
        var c = (time / 1000) % 6 + 1;
        // cycle through the 6 combinations of the R/G/B channel bits
        color = (c & 1) * 0xff + ((c & 2) >> 1) * 0xff00 + ((c & 4) >> 2) * 0xff0000;
      }
    
      // Display the value you computed here. This will be called
      // once a second when the data field is visible.
      function onUpdate(dc as Dc) as Void {
        var bitmap = Graphics.createBufferedBitmap({
          :width => BlockSize,
          :height => BlockSize,
        });
        var bdc = bitmap.get().getDc();
        for (var x = 0; x <= BlockSize; x++) {
          errors[x] = 0.0;
        }
        for (var y = 0, p = 0; y < BlockSize; y++) {
          var enext = errors[0];
          for (var x = 0; x < BlockSize; p++) {
            var pixel = image[p] + enext;
            var quantized = ((pixel * 3 + 128) / 255).toNumber() * (255 / 3);
            quantized = quantized < 0 ? 0 : quantized > 255 ? 255 : quantized;
            bdc.setColor(quantized * 0x10101, 0);
            bdc.drawPoint(x, y);
    
            var error = (pixel - quantized) / 8;
            errors[x] += error * 3;
            x++;
            enext = errors[x] + error * 4;
            errors[x] = error;
          }
        }
        var transform = new Graphics.AffineTransform();
        transform.rotate(angle);
        transform.translate(-bitmap.getWidth() / 2.0, -bitmap.getHeight() / 2.0);
        dc.drawBitmap2(dc.getWidth()/2, dc.getHeight()/2, bitmap, {
          :tintColor => color,
          :transform => transform,
        });
      }
    }
    

    I realize that 49x49 is likely way too small; but the point is that you can create a BufferedBitmap, dither into it in several 2000+ pixel chunks (via e.g. Timer.Timer), and then draw the bitmap in your onUpdate once it's ready.

    I'll accept that the dithering isn't as good as it would be if we rotated the data first, and then dithered; but it doesn't look too bad to me. It's also possible that I'm still missing some important aspect of what you're trying to do, making this approach useless for your purposes.
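    The chunking idea could be sketched like this (a hypothetical class; `ROWS_PER_CHUNK` would need tuning to stay under the watchdog, `ditherRow` stands in for one row of the dithering loop above, and note that timers aren't available to watch faces in low-power mode):

    ```monkeyc
    import Toybox.Lang;
    import Toybox.Timer;
    import Toybox.WatchUi;

    const ROWS_PER_CHUNK = 40;  // ~2000 pixels per tick, assumed safe

    class ChunkedDitherer {
        private var timer as Timer.Timer = new Timer.Timer();
        private var nextRow as Number = 0;

        function start() as Void {
            nextRow = 0;
            timer.start(method(:onTick), 50, true);  // fire every 50 ms
        }

        function onTick() as Void {
            var last = nextRow + ROWS_PER_CHUNK;
            if (last > BlockSize) { last = BlockSize; }
            for (; nextRow < last; nextRow++) {
                ditherRow(nextRow);
            }
            if (nextRow >= BlockSize) {
                timer.stop();
                WatchUi.requestUpdate();  // bitmap ready; draw it in onUpdate
            }
        }

        function ditherRow(y as Number) as Void {
            // one row of the dithering loop from the example above
        }
    }
    ```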

  • experiments below suggest you can do ~10000 dc.drawPoint calls without triggering the watchdog (or presumably about ~5000 setColor/drawPoint pairs).

    I tried a similar experiment and confirmed I can make roughly 5000 setColor/drawPoint calls without triggering the watchdog, in the context of everything else that's going on in the app. In raw terms, that's probably just about adequate for my needs (roughly π*30^2 ~= 3,000), so I investigated further...

    Going back to a working state and checking some of my comparative timing, it looks like those calls account for about 15-20% of all the time I spend on rendering these images, not something like 50% as I previously thought. I think I got that impression by taking the profiler's timings on faith.

  • One more update: I tried another approach where I generated a PNG with slices of (colored) pixels, loaded that as a Bitmap, and drew each slice with a single drawOffsetBitmap call. Which seemed clever, except that whatever I try, those drawBitmap calls are much slower than a moderate number of drawPoint or drawText calls.

    The upshot is that although all this discussion and exploration has been fun and educational, it's looking like optimizing the raw plotting of pixels isn't going to make the impact that I hoped. I'll give it some more thought, but this is all making me think it might be more fun and effective to pursue a different approach that doesn't involve dithering on the device. Of course, all of this discussion has given me lots of ideas about where to start.

    Thanks everyone for your suggestions and insights. It has been fun getting into the weeds on this issue, and I'm learning a lot!