There is much lamenting out there about the state of diagramming software architecture.
I've found myself quite frustrated by what George Fairbanks calls the model-code gap (in Just Enough Software
Architecture). I don't want to play around in a graphics program doing repetitive and inefficient dragging,
resizing, etc. -- and then never updating it again, because it's too much trouble. I want a model of my code, without
all the misery. I want new people on a project to more easily grok the overall idea of what's going on. I want it to be
regularly useful in discussions. And I want everyone working on the code to be able to participate in the modeling.
I think I like Simon Brown's C4 model, but it sort of dismisses the code-level
view of things. He calls it level 4 and says you should generate it, not build it yourself. But what do you use to
generate that? Suddenly you're in the land of language-specific tooling that doesn't scale beyond one or maybe a few
languages. I don't want to be confined to one language. I want one tool I can always use. I can stick it in my personal
toolkit and be a better software engineer, no matter what direction my next project takes. It's an investment that
scales.
One big problem with code visualization is that code is hard to introspect. It's hard enough if you focus on one
language, but supporting an arbitrary number of different languages is a huge effort that even large organizations
struggle with.
But what do all languages have in common? Directories, files, and line numbers. Maybe we can work at that level. No, it
won't reveal details of classes, functions, and so on, but in a way that's a good thing: those tend to be very
fine-grained and create an overwhelming diagram that's impossible to love.
Goals:
- The diagram source can live as text along with the code, for version control and freshness. An output image can
be generated in commit hooks or CI, and (hand wave) diffed visually.
- The diagram should be as convenient as possible to maintain, and useful enough to justify the maintenance work.
This is critical for success, because no one wants another pointless burden in their workflow.
- The basic level of granularity will be that of files. Files can be logically organized into groups, and those
files/groups can be organized into layers. These can correspond to the vertical/horizontal partitioning of your
project's logic. Files can be referenced in terms of glob patterns (e.g. src/foo/bar_*.go).
- The layout should be controllable in a straightforward way (through text, of course). This is necessary because
the tool won't know about code dependencies, but it's also a benefit because we wind up with a sensible map that
resembles our mental model, and isn't some insane maze like so many generated diagrams are.
- The layout should be stable, so that the diagram can serve as a consistent map of the code, without
unpredictable layout shakeups when one little thing changes.
- The diagram should exploit visualization techniques to make the code more tangible. First and foremost is
scaling each file based on the actual amount of code in it. A file with 5 lines is simply not worth the same
real estate (either mental or graphical) as a file with 1000 lines.
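
To make the file-level idea a bit more concrete, here is a minimal sketch of how a group's glob patterns could be resolved to files and weighted by line count. Nothing here is an existing tool: `count_lines`, `resolve_group`, and `area_shares` are hypothetical names, and treating area as proportional to line count is just one possible reading of "scaling each file based on the actual amount of code in it".

```python
from pathlib import Path
from typing import Dict, List


def count_lines(path: Path) -> int:
    """Count lines by reading raw bytes, so no language or encoding knowledge is needed."""
    with path.open("rb") as f:
        return sum(1 for _ in f)


def resolve_group(root: Path, patterns: List[str]) -> Dict[str, int]:
    """Map every file matched by the group's glob patterns to its line count.

    Keys are paths relative to `root`, using forward slashes.
    """
    sizes: Dict[str, int] = {}
    for pattern in patterns:
        # sorted() keeps the order deterministic, which helps layout stability
        for path in sorted(root.glob(pattern)):
            if path.is_file():
                sizes[path.relative_to(root).as_posix()] = count_lines(path)
    return sizes


def area_shares(sizes: Dict[str, int]) -> Dict[str, float]:
    """Assumed scaling rule: each file's share of the group's area is its share of the lines."""
    total = sum(sizes.values())
    if total == 0:
        return {name: 0.0 for name in sizes}
    return {name: lines / total for name, lines in sizes.items()}
```

With a group defined as `["src/foo/bar_*.go"]`, `area_shares(resolve_group(Path("."), ["src/foo/bar_*.go"]))` would give each matching file a fraction of the group's real estate. A real layout would probably want a minimum size so tiny files stay visible, but that is a design decision this sketch leaves open.
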
Longer-term vision and bonus ideas:
- Visualizing code entropy by comparing each file's compressed size to its uncompressed size (e.g. variable coloring
of each file) and comparing similarity between files.
- Visualizing code volatility by leveraging git history.
- Connecting the diagram to profiling data to visualize performance problems. Maybe piggyback on existing
cross-language tooling like flamegraphs, dtrace, etc.
- Additionally, profiling data can help us construct the interdependencies in the code, since the tool is
blind to code semantics and normally doesn't know how the logic flows.
- Integrate into larger C4 models as the neglected 4th level.
- What other kinds of analysis can we do on files of text, without speaking their programming language?
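
The entropy and similarity ideas can be approximated without parsing anything. The sketch below uses Python's standard `zlib` module; the function names are hypothetical, and normalized compression distance (NCD) is a technique I am swapping in as one plausible way to "compare similarity between files", not something this project has settled on.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by raw size; lower values mean more repetitive content."""
    if not data:
        return 0.0
    return len(zlib.compress(data, 9)) / len(data)


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they have little in common.
    """
    ca = len(zlib.compress(a, 9))
    cb = len(zlib.compress(b, 9))
    cab = len(zlib.compress(a + b, 9))
    return (cab - min(ca, cb)) / max(ca, cb)
```

The ratio could drive a color scale per file, and pairwise NCD could hint at copy-pasted files. One caveat: DEFLATE only looks back 32 KiB, so NCD with `zlib` becomes unreliable for files larger than that, and a different compressor would be needed there.
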
Limitations:
- You must maintain the mapping between your source code layout and the diagram. But this is better than
maintaining a completely disconnected diagram.
- The tool is dumb about code: It doesn't understand dependencies, so you must arrange the structure yourself. So
there will be quite a bit of editorializing. But by enforcing matches to actual files on disk, we can make
sure the diagram doesn't completely drift away from reality.
- Integrating with profiler output will only work if it contains filenames and line numbers, which it often does
not.
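
As a rough illustration of "enforcing matches to actual files on disk", a check along these lines could run in CI. It is only a sketch under assumptions: `check_mapping` is a hypothetical helper, and the default `"**/*"` universe would also pick up things like `.git` contents, so a real version would need some ignore rules.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def check_mapping(
    root: Path, patterns: Iterable[str], universe: str = "**/*"
) -> Tuple[List[str], List[str]]:
    """Compare the diagram's glob patterns against the files that actually exist.

    Returns (dead_patterns, unmapped_files):
      dead_patterns  - patterns that match no file (the diagram drifted from the code)
      unmapped_files - files under `root` that no pattern covers (the code outgrew the diagram)
    """
    matched = set()
    dead: List[str] = []
    for pattern in patterns:
        hits = {p for p in root.glob(pattern) if p.is_file()}
        if not hits:
            dead.append(pattern)
        matched |= hits

    all_files = {p for p in root.glob(universe) if p.is_file()}
    unmapped = sorted(p.relative_to(root).as_posix() for p in all_files - matched)
    return dead, unmapped
```

Failing the build on dead patterns and merely warning on unmapped files seems like a reasonable starting policy, since the first means the diagram is wrong and the second only means it is incomplete.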