Saturday, November 9, 2013

Some Advice When Starting a New Job

I think this advice would be helpful working almost anywhere:

- Make some friends
- Earn some respect
- Ask some questions
- Don't piss off your company's customers

Saturday, October 26, 2013

QtCreator's Python Debug Visualizers

Peter Lohrmann wrote QtCreator debug visualizers in Python for some key classes used by the Linux OpenGL debugger project we've been working on together. He recently blogged about the details here.

So far we've got visualizers for our dynamic_string and vector classes. (Like many/most game devs, we use our own custom containers to minimize our reliance on the C++ runtime and "standard" libraries, but that's another story.) Before, to visualize the contents of vectors in QtCreator, I've had to muck around in the mud with the watch window and type in the object's name, followed by the pointer and the # of elements to view. Our dynamic_string class uses the small string optimization (not the super optimized version that Thatcher describes here, just something basic to get the job done). So it's been a huge pain to visualize strings, or basically anything in the watch/locals window.

The below pic shows the new debug visualizers in action on a vector of vectors containing dynamic_strings.  Holy shit, it just works!

I'm not a big fan of Python, but this is valuable and cool enough to make it worth my while to learn it.

Here's the code. Almost all of this is Peter's work; I've just tweaked the vector dumper to fix some things. I'm a total Python newbie, so it's possible I screwed something up here, but this is already working much better than I expected. It's amazing how something simple like this on Linux can make me so happy.

You can find a bunch of QtCreator's debug visualizer code here: ~/qtcreator-2.8.0/share/qtcreator/dumper

In my ~/.gdbinit file:


And here's my /home/richg/dev/raddebugger/src/crnlib/ file:


# This file contains debug dumpers / helpers / visualizers so that certain crnlib
# classes can be more easily inspected by gdb and QtCreator.

def qdump__crnlib__dynamic_string(d, value):
    dyn = value["m_dyn"]
    small = value["m_small"]
    length = value["m_len"]
    small_flag = small["m_flag"]
    # The characters live in the inline buffer when the small-string flag is
    # set, otherwise in the heap allocation.
    buf = dyn["m_pStr"]
    if small_flag == 1:
        buf = small["m_buf"]
    p = buf.cast(lookupType("unsigned char").pointer())
    strPrefix = "[%d] " % int(length)
    text = "'" + p.string(length=int(length)) + "'"
    d.putValue(strPrefix + text)
    # Show both representations, marking the inactive one as ignored.
    with Children(d):
        d.putSubItem("m_len", length)
        with SubItem(d, "m_small"):
            d.putValue(text if small_flag == 1 else "<ignored>")
            with Children(d):
                d.putSubItem("m_flag", small_flag)
                with SubItem(d, "m_buf"):
                    d.putValue(text if small_flag == 1 else "<ignored>")
        with SubItem(d, "m_dyn"):
            d.putValue("<ignored>" if small_flag == 1 else text)
            with Children(d):
                with SubItem(d, "m_buf_size"):
                    d.putValue("<ignored>" if small_flag == 1 else dyn["m_buf_size"])
                with SubItem(d, "m_pStr"):
                    d.putValue("<ignored>" if small_flag == 1 else text)

def qdump__crnlib__vector(d, value):
    # Convert to plain ints so min() and range() below behave predictably.
    size = int(value["m_size"])
    capacity = int(value["m_capacity"])
    data = value["m_p"]
    maxDisplayItems = 100
    innerType = d.templateArgument(value.type, 0)
    p = gdb.Value(data.cast(innerType.pointer()))
    d.putValue('Size: {} Capacity: {} Data: {}'.format(size, capacity, data))
    # Cap the number of children so huge vectors don't stall the debugger.
    numDisplayItems = min(maxDisplayItems, size)
    if d.isExpanded():
        # addrStep is the element size in bytes (gdb.Type.sizeof).
        with Children(d, size, maxNumChild=numDisplayItems, childType=innerType, addrBase=p, addrStep=innerType.sizeof):
            for i in range(0, numDisplayItems):
                d.putSubItem(i, p.dereference())
                p += 1

Saturday, October 19, 2013

A Shout-Out to QtCreator 2.8.x on Linux

So this is a little post about C/C++ IDEs, which, apart from the browser, are the key pieces of software I live in most of the day. I know a lot of Windows-centric developers who swear by Visual Studio, and up until recently I was one of the VS faithful. I'm going to try to sell you on trying something else, especially if you develop on Linux or OSX, though it's available for Windows too.

I think I've finally found a reasonable cross platform VS alternative for C/C++ development that doesn't require shelling out hundreds (or thousands) of dollars every time MS tweaks (or totally screws up) the UI or adds some compiler options. I've been using QtCreator full-time now for 6 months and I think it's awesome. I would buy it in a heartbeat, but it's a free download and it's even open source.

A bit of the background behind my need for a VS alternative: For more than a decade I've been using Visual Studio (since VC5 I think), and various other IDEs from Borland/Watcom/MS before that. When I started working on a new Linux OpenGL debugger (about 6 months ago) all the Linux devs around me were using text editors, cgdb, etc. There was no way in hell I was going back to only a text editor (even the goodness that is Sublime) for editing, gdb cmd line for debugging, and another command line for compiling, etc. It's been a long time since my DOS development days and I'm just too old to do that again on the PC. (On embedded platforms I can tolerate crappy or no IDEs, but not on a full-blown modern desktop!) So I began an exhaustive and somewhat desperate search for a real Linux IDE with a useful debugger that doesn't suck.

I experimented with a bunch of packages (such as CodeBlocks, CodeLite, Eclipse, KDevelop, etc.) and even some stand-alone debuggers (like ddd, cgdb) on some of my open source projects and settled on the amazing QtCreator 2.8.x. It's a full-blown C/C++ IDE with surprisingly few rough edges. It's got all the usual stuff you would expect: editor, project manager (with optional support for things like cmake), integrated source control, an Intellisense-equivalent that just works and doesn't randomly slow the IDE to a crawl like in VS, C/C++ refactoring, and nice gdb/lldb frontends that don't require you to know anything about obscure gdb commands. I've been using it to compile with either clang v3.3 (using Mike Sartain's instructions that make it trivial to switch between clang and gcc) or gcc v4.6. The whole product is super polished, and I find myself happier using it than VS and its fleet of unreliable (but pretty much necessary on real projects) 3rd party plugins like Visual Assist, Incredibuild, etc. that make the whole thing a buggy and unstable mess.

QtCreator's name can be misleading. It's not just for Qt stuff, although it's obviously designed to be great for Qt dev/debugging too. I use it to debug command line and OpenGL apps, either starting them from within QtCreator or attaching to the process remotely. It's got built-in support for Mercurial (hg), Git, Perforce, SVN, etc. although I've only used its hg and p4 support.

Visual Studio since 2012 has apparently gone almost completely batshit, so I've been delaying upgrading for as long as possible even before my VS divorce. I was hoping the saner and more tasteful hands at MS would rein in the "modern app" idiocy and fix things, but I've lost hope. Although with Ballmer (who's obviously been completely out of touch) finally being put out to pasture, maybe they can turn the ship around.

Here are a few more screenshots of QtCreator Linux in action. I'm using KDE Plasma desktop installed under Ubuntu v12.04 x64. (If you've just installed Ubuntu for the first time and have no Linux desktop preferences yet, do yourself a favor and just go reinstall with Kubuntu.) If you want to try it out, be sure to download the version from Qt's website (not the Ubuntu software center, which was really outdated the last time I checked). Also check if your distro requires disabling ptrace hardening before you debug anything. I also had to change the default terminal used for running/debugging apps to something else, so under Tools->Environment->Terminal: "/usr/bin/xterm -sl 1999999 -fg white -bg black -geometry 200x60 -e"

We've also just added custom debug visualizers for our most important container classes to QtCreator, but I've not had a chance to play with this stuff yet.

Source control configuration:


More debugging:

Sunday, October 13, 2013

The big miniz zip64 merge

Currently merging miniz v1.15 into my in-progress zip64 branch which is being used/tested in the Linux OpenGL debugger engine project I've been working on. The left pane is v1.15, right is the new version with zip64.

Fun times. The new version still needs to be C-ified in a few places. I'm actually liking the purity of C for some strange reason; it's amazing how fast it compiles vs. C++. I'm so used to glacially slow compiles that when I used TCC (Tiny C Compiler) again it appeared to complete so fast that I thought it had silently crashed.

miniz.c: Finally added zip64 support, other fixes/improvements incoming

I finally needed Zip64 support for the Linux OpenGL debugger I've been working on at work. GL traces and state snapshots can be huge (~4-10GB of data for 3-4 minute game runs is not uncommon), and can consist of tons of binary "blob" files for VB's/IB's/shaders/etc. I looked at some other C archive libraries (libzip, minizip, etc.) but they were either ugly/huge messes with a zillion C/H files, or they didn't fully abstract their file I/O, or they didn't support in-memory archives for both reading/writing, or their licenses sucked, or they weren't thread safe (!), or I just didn't trust them, etc.

So screw it, I'll bite the bullet and do this myself. It's certainly possible I missed a really good library out there. I prefer C for this kind of stuff because most C++ libs I find in the wild use features I can't live with for various reasons, such as C++ exceptions, stl, heavy use of heap memory, Boost, or have tons of other lib dependencies, etc.

The original ancient zip file format was OK and kinda elegant for what it was, and the code to parse and write the original headers was nice and easy. But once you add zip64 it becomes an ugly mess full of conditionals, and copying zip header/archive data from one zip to another can be a big pain because you can't just blindly copy the zip64 extended data fields from the source to destination zip. (You've got to kill the old one from the extended data block and add a new one, etc.) Zip64 is now "done", and I've been running a bunch of automated testing on the new code paths, but I'm worried I've bitten off more than I can chew given the very limited time I have to work on this feature for the debugger.
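To make the "kill the old one and add a new one" step concrete, here is a minimal Python sketch (not miniz's actual C code). It assumes the standard ZIP extra data layout of 2-byte header ID, 2-byte size, then payload, with header ID 0x0001 for the Zip64 extended information record, and it follows the central directory convention of storing only the values that overflow their 32-bit slots. The function names are made up for illustration, and the disk-number field is left out.

```python
import struct

ZIP64_EXTRA_ID = 0x0001  # header ID of the Zip64 extended information record


def split_extra_fields(extra):
    """Yield (header_id, data) for each record in a ZIP extra data block."""
    pos = 0
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        pos += 4
        if pos + size > len(extra):
            raise ValueError("truncated extra field record")
        yield header_id, extra[pos:pos + size]
        pos += size


def replace_zip64_extra(extra, uncomp_size, comp_size, local_header_ofs):
    """Drop any old Zip64 record and append one that fits the destination zip."""
    out = bytearray()
    for header_id, data in split_extra_fields(extra):
        if header_id != ZIP64_EXTRA_ID:
            # Every other record is copied through untouched.
            out += struct.pack("<HH", header_id, len(data)) + data
    # Only values that overflow their 32-bit header slot go into the new record,
    # in the fixed order: uncompressed size, compressed size, header offset.
    payload = b"".join(
        struct.pack("<Q", v)
        for v in (uncomp_size, comp_size, local_header_ofs)
        if v >= 0xFFFFFFFF
    )
    if payload:
        out += struct.pack("<HH", ZIP64_EXTRA_ID, len(payload)) + payload
    return bytes(out)
```

The old record can't be reused because its contents depend on where the file sits in the source archive; a file that needed a 64-bit offset there may need none in the destination, or the other way around.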

I've also renamed a lot of the zip "reader" APIs so they can be used in both reading and writing mode. There's no reason why you can't locate files in the central directory, or get file stats while writing, for example, because the entire zip central directory is kept in memory during writing.

I've added full error codes to all zip archive handling functions. You can get the last error, clear the last error, peek at the last error, etc. I went through the entire thing and made sure the errors are set appropriately, so now you can get more info than just MZ_FALSE when something goes wrong. This change alone took several hours.
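As a rough illustration of the get/clear/peek pattern, here is a small Python sketch. The class name, the error codes, and the exact semantics (for example, that "get" resets the slot) are assumptions made for the example, not miniz's real API.

```python
class ArchiveErrorState:
    """Hypothetical 'last error' slot attached to an archive object."""

    NO_ERROR = 0
    FILE_NOT_FOUND = 1  # illustrative codes, not miniz's real values
    BAD_HEADER = 2

    def __init__(self):
        self._last_error = self.NO_ERROR

    def set_last_error(self, code):
        """Record a failure; returns False so callers can return it directly."""
        self._last_error = code
        return False

    def peek_last_error(self):
        """Read the error without resetting it."""
        return self._last_error

    def clear_last_error(self):
        """Reset the slot to NO_ERROR."""
        self._last_error = self.NO_ERROR

    def get_last_error(self):
        """Read the error and reset the slot (assumed behaviour)."""
        code = self._last_error
        self._last_error = self.NO_ERROR
        return code
```

A failing function would then end with something like `return state.set_last_error(ArchiveErrorState.BAD_HEADER)`, so the caller still sees a plain false result but can ask for the reason afterwards.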

In zip64 mode I only support a total of UINT_MAX (2^32-1) files in the central dir, and central dirs are limited to a total of UINT_MAX bytes. These are huge increases from the previous limits, so this should be fine. I'm not writing a hard disk backup utility here after all, so I'm not going to support archives that big right now.

Bugwise, the only major bug in the current public release (miniz.c v1.14) that really worries me involves the MZ_ZIP_FLAG_DO_NOT_SORT_CENTRAL_DIRECTORY flag (Issue #11 on the Google Code issue tracker). I doubt anybody really used this flag directly, so that part doesn't concern me, but a few internal APIs used it to speed up loading a little and they could fail. It's bad enough that I'm going to patch v1.14 to fix this tonight.

Also, it's definitely time to split up miniz.c into at least two source files. One for Deflate/Inflate/zlib API emulation/crc32/etc. The other file will be for ZIP archive handling and be (of course) optional.

I'll also try to merge in all the fixes/improvements people have either placed on github, on miniz's Google Code bug tracker, or have sent to me privately, if time permits.

Let me know if you are dying to try this version out and I can send you a private copy for testing.

Friday, August 23, 2013

LZHAM2 notes

I've been thinking in the background about LZHAM2 again. (LZHAM is a lossless codec I released a while back on Google Code, which I'm now calling LZHAM1.) For LZHAM2 I'm aiming for much higher and more consistent decompression throughput, and more control over the various decompression rate vs. ratio tradeoffs that are fixed in stone right now. Here are all of my notes, put here mostly so I don't forget them. (I first sent them to Michael Crogan who's interested in this stuff but realized they were probably more useful here.)

LZHAM2 notes:

- LZHAM1's decompressor can be greatly slowed down by having to rebuild Huffman decode tables too often, especially on nearly incompressible files (because all the updates are spread around to way too many tables, so the decompressor gets stuck in a ditch constantly updating tables). The codec needs to be way smarter about when tables are updated.

Here's the big hammer approach to this: Support offloading Huffman table construction onto a pool of 1 or more worker threads. This is kinda tricky because the compressor must delay using updated Huffman tables for a while, because of the newly introduced latency of when the decompressor can switch to the new table. Determining how much latency to actually use will be an interesting problem (maybe make it adjustable/overridable by the user).

Worst case scenario, if the main thread needs to switch to an updated table that's not available yet, it can skip waiting and immediately compute the table itself (obviously wasting some CPU, but who cares because most apps rarely if ever use all available CPU's anyway).

Consider sending a tiny signalling message to the decompressor that indicates when the table must be updated.

Pretty much every modern mobile/desktop/etc. platform supports multiple HW threads, so LZHAM2 should be able to get a bunch of Huffman table updates for "free" if I can make this work well enough.
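Here is a minimal Python sketch of the scheduling idea only, under some loud assumptions: the "table" is reduced to a map of Huffman code lengths, the latency is counted in coded symbols, and the class and function names are invented for the example. The point it tries to show is that the table switch happens at a fixed symbol count and is built from a frequency snapshot, so the result is identical whether the worker finished in time or the main thread had to rebuild the table itself.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def huffman_code_lengths(freqs):
    """Return {symbol: code length} for the symbols with non-zero frequency."""
    heap = [(f, i, (s,)) for i, (s, f) in enumerate(sorted(freqs.items())) if f > 0]
    lengths = {entry[2][0]: 0 for entry in heap}
    if len(heap) == 1:
        lengths[heap[0][2][0]] = 1
        return lengths
    heapq.heapify(heap)
    tie = len(freqs)  # unique tie-breaker so tuples never compare symbol lists
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)
        f2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # every merge pushes these symbols one level deeper
        heapq.heappush(heap, (f1 + f2, tie, syms1 + syms2))
        tie += 1
    return lengths


class DelayedTable:
    """One adaptive table whose rebuild runs on a worker thread."""

    def __init__(self, pool, initial_freqs, latency):
        self.pool = pool
        self.latency = latency  # symbols coded with the old table after a request
        self.freqs = dict(initial_freqs)
        self.lengths = huffman_code_lengths(self.freqs)
        self.pending = None  # (future, frequency snapshot) while a rebuild is in flight
        self.countdown = 0

    def request_update(self):
        snapshot = dict(self.freqs)
        self.pending = (self.pool.submit(huffman_code_lengths, snapshot), snapshot)
        self.countdown = self.latency

    def on_symbol(self, sym):
        """Call once per coded symbol; switches tables when the latency expires."""
        self.freqs[sym] = self.freqs.get(sym, 0) + 1
        if self.pending is None:
            return
        self.countdown -= 1
        if self.countdown > 0:
            return
        future, snapshot = self.pending
        if future.done():
            self.lengths = future.result()
        else:
            # Worker is late: don't stall, rebuild on this thread instead.
            future.cancel()
            self.lengths = huffman_code_lengths(snapshot)
        self.pending = None
```

The compressor would have to run the same countdown so both sides switch tables at exactly the same symbol, which is where the "how much latency" tuning question comes from.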

- SIMD-ify Huffman and/or arithmetic decompression. I'm on the fence about this, but the symbol decompression rate improvements I've heard from others experimenting in this domain are remarkable.

- Really try hard to optimize the Huffman decode table generator. SIMD it, whatever, it's way more important than I thought.

- LZHAM1 uses a simple approach to trigger the resetting of the Huffman table update rate: when the overall compression ratio has a big drop in the last few blocks it can reset the update rates of all the tables. There's a bunch of lag in the current implementation (all the way up at the block level) because the compressor's design is limited to a single pass (streaming) approach, and I didn't want to go back and re-code a whole block during a reset. Try an alternative that uses either more buffering or multiple passes.

- LZHAM1 uses too many Huffman tables, and is sloppy about its higher order contexts (the order-1 contexts are typically just the upper X bits of the previous byte, etc.) There's got to be a smarter way of dealing with this other than just lopping off low order bits. I went too far on using tons of tables and more contexts in order to match LZMA's ratio on large files. The codec needs to be more configurable so it can use fewer contexts for faster decompression.
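For reference, "the upper X bits of the previous byte" boils down to a shift. The sketch below (hypothetical function names, plain frequency dicts standing in for Huffman tables) shows how the number of context bits directly sets the number of tables the decompressor has to keep up to date.

```python
def order1_context(prev_byte, context_bits):
    """Pick a table index from the top context_bits bits of the previous byte."""
    return prev_byte >> (8 - context_bits)


def count_by_context(data, context_bits):
    """Gather per-context symbol frequencies; one table per context value."""
    tables = [dict() for _ in range(1 << context_bits)]
    prev = 0  # assume a zero byte precedes the stream
    for b in bytearray(data):
        table = tables[order1_context(prev, context_bits)]
        table[b] = table.get(b, 0) + 1
        prev = b
    return tables
```

With 3 context bits there are 8 tables per symbol type; dropping to 1 bit leaves 2, which is the kind of knob a more configurable codec could expose.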

- Do a thorough analysis on a wide range of file types and (importantly) file sizes. I just didn't spend much time concentrating on LZHAM1's small file size performance because I thought large solid files would be the more important real-world use case.

- I did a high quality integration of LZHAM directly into 7zip (both the command line tool and the 7z Windows archiver app) for testing, which helped me shake out a few of the remaining higher level API bugs. I didn't release this publicly, but I did release the API fixes that came from this work. This was a super useful thing to do.

- Charles Bloom made several concise suggestions on how to improve LZHAM on his blog when he compared the codec vs. several others. Some of these suggestions are in the reply section, I need to save them.

- Finally get LZHAM's compressor into Telemetry and figure out how to better thread it. The current approach is really basic and just forks & joins on every block.

- Cloud compression is very interesting from an optimization perspective. I've been limiting myself to 1 machine with X threads and plain streaming compression (with minimal buffering) only. These are some important axes to explore. I used ~200 machines (with 4-12 compile threads on each box) to compile and optimize Portal 2's vertex/pixel shaders; imagine the parsing levels and compression options you can try out on hundreds of machines.

- Switch to cmake and Linux as my primary dev platform. I no longer use Windows as my primary dev platform. Linux is clearly the path forward and Windows is now that thing I port to.

Various things I tried that didn't make it into LZHAM1:

- In the early days I spent quite a bit of time experimenting with Deflate or LZX-style static Huffman tables vs. the dynamic stuff used in LZHAM1. I delta coded the code lengths vs. the previous block's code lengths into the output stream (I think I first saw Bloom do this in his early LZ research codecs). At the time I found the practical constraints on the # of Huffman tables, the # of symbols, etc. this placed on the design seemed too restricting. I hit a wall and couldn't compete against LZMA this way. I think there's still plenty of room to optimize the dynamic table rebuild approach which is why I keep pushing it.
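A minimal sketch of delta coding a table's code lengths against the previous block's lengths. The modular wrap and the 15-bit maximum length are assumptions for the example; the deltas would normally then be entropy coded, which is not shown.

```python
def delta_encode_lengths(prev_lengths, cur_lengths, max_len=15):
    """Code each length as its difference from the previous block, mod (max_len + 1)."""
    modulus = max_len + 1
    return [(c - p) % modulus for p, c in zip(prev_lengths, cur_lengths)]


def delta_decode_lengths(prev_lengths, deltas, max_len=15):
    """Invert delta_encode_lengths using the same previous-block lengths."""
    modulus = max_len + 1
    return [(p + d) % modulus for p, d in zip(prev_lengths, deltas)]
```

When the statistics change slowly between blocks most deltas are zero, so the table costs very little to send.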

- Match finding using suffix arrays and longest common prefix (LCP) tables: I got it working using the best algorithms I could find back around '07 and then again in '10, but my implementation had perf/memory scaling issues with larger dictionaries. Adding new data into the dictionary (and "sliding out" old data) was extremely expensive because the tables had to be rebuilt. LZMA's matching algorithm was easier to implement and a known thing so I went with that.
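For readers who haven't seen the technique, here is a small self-contained Python sketch of match finding with a suffix array and an LCP table. It is a toy: the suffix array is built with a naive sort, the LCP table uses Kasai's algorithm, and nothing here reflects the implementation I actually tried. It also shows why a sliding dictionary hurts: both tables describe one fixed buffer.

```python
def suffix_array(data):
    """Naive construction by sorting suffixes; fine for a demo, too slow for a codec."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def lcp_array(data, sa):
    """Kasai's algorithm: lcp[k] is the common prefix length of suffixes sa[k-1] and sa[k]."""
    n = len(data)
    rank = [0] * n
    for k, pos in enumerate(sa):
        rank[pos] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and data[i + h] == data[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_previous_match(data, sa, lcp, pos):
    """Longest match for data[pos:] that starts before pos, as (length, source)."""
    rank = sa.index(pos)  # O(n) lookup, kept simple for the demo
    best = (0, -1)
    # Walk outward in both directions; the running minimum of the LCP values
    # is the match length against each neighbour.
    for step in (-1, 1):
        k, cur = rank, len(data)
        while True:
            lcp_idx = k if step == -1 else k + 1
            if lcp_idx <= 0 or lcp_idx >= len(sa):
                break
            cur = min(cur, lcp[lcp_idx])
            if cur <= best[0]:
                break  # lengths only shrink from here on
            k += step
            if sa[k] < pos:
                best = (cur, sa[k])
                break
    return best
```

For the buffer `b"banana"`, the match at position 3 comes back as length 3 from position 1 (the overlapping "ana").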

- I have a working branch of LZHAM that uses ROLZ (reduced offset LZ). It has a nicely improved ratio, but the sheer complexity of this beast (not to mention the lower decompression throughput due to updating the ROLZ tables) was just too much for me to handle as a side project so I put the whole experiment on ice.

- Early versions of LZHAM1's parser supported favoring matches that it thought would likely be in the decompressor's L2 cache. (It actually had a whole data structure that modeled a basic L2 cache that was used to bias the symbol prices.) This seemed like an important optimization for console CPU's, but I never measured any real benefit on the PC so I removed it and moved on.


- I keep wondering why Google continues to invest in Deflate with Zopfli, etc. when it's clearly ancient stuff (Deflate was introduced 20 years ago). A new open codec that strikes the right balance somewhere in the spectrum between Deflate/LZX/LZMA/LZHAM/etc. would be super useful to a lot of people, and they have the clout and talent to do it. They should have enough data points from existing codecs and internal experience due to Zopfli to have confidence in building a new codec.

An effort like this would make a huge impact across the entire web stack. The gain would be relatively massive compared to the tiny improvements Zopfli's been able to achieve (~5% for 100x increase in cost means it's time to move on). 

If the new codec is made zlib API compatible (like I do in LZHAM and miniz), which is easy, then dropping it into existing codebases would be fairly straightforward.

- Someone needs to write a universal preprocessing/long range match library that properly supports streaming and is trivial to add in front of other codecs. I've been treating preprocessing as a totally separate axis vs. compression, assuming somebody would eventually solve this problem.

It could support various executable formats (dwarf, exe, etc.), xml, json, html, jpeg, mp3, wav, png, raw images, deflate, etc. All the best archivers already do this and the research has been done, but AFAIK it's not available as a single robust library.

- The decompressor can be viewed as a virtual CPU with a very limited but tightly compressed instruction set. I've been wondering what tricks from the CPU world could be effectively applied to LZ. I wonder if there are more useful instructions other than "here's a literal" or "copy X bytes from the dictionary using this offset".

With good parsing it's easy to add more node types to the parse graph. Right now I'm adding only literals (which are coded in various ways depending on previous state), and various matches and truncated versions of these matches.
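The "virtual CPU" view can be sketched in a few lines of Python. The two-instruction set and the tuple encoding below are made up for illustration; a real decompressor reads these instructions from an entropy-coded bitstream rather than a list.

```python
def run_lz_program(ops):
    """Execute a list of LZ 'instructions' and return the decoded bytes."""
    out = bytearray()
    for op in ops:
        if op[0] == "lit":
            # "Here's a literal": append one byte.
            out.append(op[1])
        elif op[0] == "copy":
            # "Copy X bytes from the dictionary using this offset".
            _, length, distance = op
            if not 0 < distance <= len(out):
                raise ValueError("copy distance reaches outside the dictionary")
            for _ in range(length):
                out.append(out[-distance])  # byte-wise, so overlapping copies work
        else:
            raise ValueError("unknown instruction: %r" % (op[0],))
    return bytes(out)
```

For example, `run_lz_program([("lit", 97), ("lit", 98), ("copy", 4, 2)])` produces `b"ababab"`. Any new instruction type would be another branch here and another node type in the compressor's parse graph.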

- There are some deep/fundamental inefficiencies in the entire class of LZMA/LZHAM/etc. style algorithms. Bloom has covered this topic well on his blog, and I also realized this while working on LZHAM. For example, when a match ends, the decompressor has some knowledge from the dictionary about what the next character(s) are likely *not* to be. This knowledge can be used to exclude some dictionary strings in future matches. (However, the parser sometimes purposely truncates matches so it's possible for a match's follower byte to actually match the input but not be used.) There's code space inefficiency all over the place that seems like a big opportunity, but exploiting it seems hard to do efficiently.
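A tiny Python sketch of the "what the next byte is likely not to be" observation, assuming the match was extended as far as it could go. The function names are hypothetical and no actual coding is done; it only computes the byte that a smarter model could exclude or down-weight.

```python
def match_length(data, pos, source):
    """Length of the longest match at pos when copying from source (source < pos)."""
    n = 0
    while pos + n < len(data) and data[pos + n] == data[source + n]:
        n += 1
    return n


def follower_after_match(data, pos, source):
    """Return (next_pos, excluded_byte) for a maximal match at pos copied from source.

    excluded_byte is the byte the dictionary holds right after the match source.
    For a maximal match it cannot equal data[next_pos]; it is None at end of input.
    """
    length = match_length(data, pos, source)
    next_pos = pos + length
    if next_pos >= len(data):
        return next_pos, None
    return next_pos, data[source + length]
```

As noted above, this guarantee disappears whenever the parser truncates a match on purpose, so a real coder could only treat the follower byte as unlikely, not impossible.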