Introduction
I always spend too much time setting up a new project and thinking about how to structure it. I decided to summarize my experience, enhance it with a little research, and write down my thoughts on the topic, so I can come back to them myself or reference them in a discussion.
Also, it can help me unify the structure of my projects, and maybe neighbouring ones, so they are easier to navigate and work on. In the rest of the post I use plural pronouns, because I think this post can end up as documentation or a wiki page.
You have probably already experienced the inconveniences of a "naturally grown" project structure (especially in someone else's repo you need to work on for some reason): there is no clear separation between different types of source code, and auxiliary, dev, build, primary, test, and environment setup code is all mixed up. Side effects and warnings fire at random just because the wrong file ended up on the load path. The mental model and the ability to reason about the code suffer as well.
In this article we discuss one particular directory structure that scales with project complexity, and the rationale behind it. The ideas apply whether you are setting up a fresh project or refactoring an existing one.
We expect that a code base may contain multiple programming languages. For the demonstration and examples the primary language is Guile Scheme, but most of the ideas are similar and translate easily to other languages.
We also assume that the source code is stored in text files. Yeah, there are languages where the code is stored on other media (e.g. IPLD/IPFS, databases), and that is a really cool thing, but here we focus on plain old files :'( BTW, the Scheme standard doesn't specify where and how Scheme code must be stored; it's an implementation-specific detail, so it's a cool, promising area for R&D.
The approach we propose should work in most cases. If it doesn't, we would be curious to hear about your use case.
By the end of this article you should understand why paths like these are so long, and how that length pays for itself and makes life easier:
- src/scheme-common/markdown/parser/nodes.scm
- tests/guile/markdown/parser/core-test.scm
We explain it in three steps: first, how we organize and name modules,
then why an intermediate "language" directory is useful, and finally
why grouping everything under src/, tests/, and friends makes so
much sense.
Modules
This whole section lays the foundation for a good directory structure. It is more about coding practices: naming files and modules, and recapping concepts like load paths.
Module-Path Correspondence
In many languages (Java, Clojure, Guile Scheme, etc.), there is a mechanism called load paths or search paths. In such cases, the module name should correspond to the file path and vice versa. This allows the language to find and load modules by name whenever they are required, without knowing the absolute path to the file in advance.
The idea is simple: if you have a module (my-project markdown parser),
it should be located in the parser.scm file in the ./my-project/markdown/
directory, relative to the root of a load path. When this module is
imported in the program code, the language looks through all the load
path directories, and when it finds ./my-project/markdown/parser.scm,
it tries to load the module from it.
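The lookup described above can be sketched in a few lines of shell. This is only an illustration of the convention, not how Guile implements it; module_to_path and find_module are hypothetical helper names, and the .scm extension is assumed.

```shell
# Hypothetical helper: turn a module name such as "my-project markdown parser"
# into the relative file path that the convention predicts.
module_to_path() {
    # Replace the spaces between module elements with "/" and add the extension.
    printf '%s.scm\n' "$(printf '%s' "$1" | tr ' ' '/')"
}

# Hypothetical helper: walk a colon-separated load path and print the first
# existing file that matches the module. The body runs in a subshell, so the
# IFS and globbing changes do not leak into the caller.
find_module() (
    rel=$(module_to_path "$1")
    IFS=:
    set -f                      # no glob expansion while splitting on ":"
    for dir in $2; do
        if [ -f "$dir/$rel" ]; then
            printf '%s\n' "$dir/$rel"
            exit 0
        fi
    done
    exit 1
)

module_to_path "my-project markdown parser"   # prints my-project/markdown/parser.scm

# Demonstration on a scratch tree: the module lives in the second directory.
demo=$(mktemp -d)
mkdir -p "$demo/src/my-project/markdown"
: > "$demo/src/my-project/markdown/parser.scm"
find_module "my-project markdown parser" "$demo/lib:$demo/src"
```

The first directory that contains a matching file wins, which is why the order of entries in a load path matters.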
We will talk about how to set up load paths in more detail in the section about grouping directories; for now, just keep in mind that a load path is a list of directories where the compiler or interpreter looks for code.
The reason this convention matters so much is predictability. The compiler knows where to find a module. An IDE, a new contributor, or a random passer-by can easily guess where to look for a particular module.
Namespace (Module) Naming Rules
In the first subsection we said that the file path SHOULD correspond to the module name, not that it MUST. That was intentional. In Guile Scheme you can define a module in any file; moreover, a module can be defined across multiple files, and a file can have no module definition at all.
While all of this is possible, we recommend not doing so unless it's really needed for some reason, and keeping the module-path correspondence from the previous section.
Now, the question: where does the (json) module come from? Your own sources,
the guile-json library, or somewhere else? Quite unclear. For
(my-project ffi json) it's much clearer that it is very likely JSON
bindings for a C library, implemented by my-project. Naming a module
with at least 3 elements is a good idea for multiple reasons:
clarity, a minimal chance of name clashes, and names that are still
not too long.
The first element is usually a project or domain name. In the Java world
they use reverse domain name notation (org.apache.kafka,
org.apache.spark); it is easy to sort, but we find it hard for humans
to read. So we suggest using either a project name (spark ...) or
a domain, (kafka-apache-org ffi json), or (kafka.apache.org ffi json) if your language supports dots in module elements.
The second element is usually up to you; group things whatever way makes sense for the project. We won't give any recommendations here.
The third one is usually straightforward. The question you will probably face is: should it be singular or plural? The short answer is: singular. Usually a module represents some concept or domain, not the entities of the domain themselves. In rare cases a module is a collection or registry of items (like package definitions), and then plural may make sense. If you are in doubt, use singular.
We have figured out the naming; one more thing is left to address: a small anti-pattern in the source code inside modules.
Never Top-level Side-effectful Expressions
Back in the day, it was common to use source files as scripts; you
have probably seen shebangs like #!/usr/bin/python or #!/bin/env guile
at the beginning of a file, or even a direct invocation of a file
like guile code.scm. While it makes it easy to run code from the CLI
and can be a handy trick for a one-shot throwaway utility, it has a
few serious downsides.
Imagine your colleague writes serializer-test.scm and executes it like
guile serializer-test.scm. It creates some stub markdown files, prints
progress to stdout, and saves the test results to a tests.log file.
Now they put this file somewhere alongside the rest of the project's
source code, so anybody can benefit from the project having tests.
Sounds good, right?
Now imagine how surprised you will be when, after pulling fresh
sources, you compile the project and get a lot of .md and .log
files thrown around the source tree out of nowhere (or something even
worse than that). It happens for two reasons:
- The serializer-test.scm file contains top-level expressions with side effects.
- The file is located on the load path of the compiler and thus gets loaded automatically when the compiler is invoked.
Auto-firing side effects are why we don't put side-effectful forms at the top level of our source files (and we recommend you do the same). Moreover, the files on the load path can be loaded in arbitrary order and potentially multiple times, so it's a good idea to expose only constants and function definitions in them. If somebody wants to fire a particular side effect or call a function, they can do so explicitly:
guile -L ./our/load/path -c '((@ (my-project markdown parser) dirty-parse) "./path/to/test.md")'
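The hazard is not specific to Scheme. As a rough analogy in shell, where sourcing a file stands in for loading a module, the sketch below compares two hypothetical files: noisy.sh has a side effect at top level, quiet.sh only defines a function. The file names and the DEMO_OUT variable are made up for the illustration.

```shell
# Work in a scratch directory so the demonstration leaves no junk behind.
demo_dir=$(mktemp -d)
DEMO_OUT=$demo_dir

# noisy.sh: a top-level side effect fires as soon as the file is loaded.
cat > "$demo_dir/noisy.sh" <<'EOF'
: > "$DEMO_OUT/noisy.log"
EOF

# quiet.sh: only a definition; nothing happens until run_tests is called.
cat > "$demo_dir/quiet.sh" <<'EOF'
run_tests() { : > "$DEMO_OUT/quiet.log"; }
EOF

. "$demo_dir/noisy.sh"   # noisy.log appears just because the file was loaded
. "$demo_dir/quiet.sh"   # nothing is written yet

[ -f "$demo_dir/noisy.log" ] && echo "noisy.sh wrote a file on load"
[ -f "$demo_dir/quiet.log" ] || echo "quiet.sh wrote nothing on load"

run_tests                # the side effect now happens on explicit request
[ -f "$demo_dir/quiet.log" ] && echo "quiet.sh wrote a file when asked"
```

The second style keeps loading cheap and predictable: whoever loads the file decides if and when anything observable happens.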
Also, you probably noticed that in the second point the code for testing got mixed into the load paths of the library itself, which can cause other problems. For example, library source code can accidentally import some helper functions defined in test modules, and those functions can be unsafe (as they are supposed to be executed only during development and testing, and were not carefully audited for security). We will explain how to solve this in the Top-level Project Directories section; for now, let's talk about multiple languages and subprojects.
Language Subdirectories (Not Really)
Don't skip this section even for mono-language projects.
In the modern world, it's very likely that you can't stay within the boundaries of one language. Let's imagine we are implementing a markdown parser using tree-sitter C bindings, and we use Guile Hoot (Scheme on WebAssembly) for a web frontend renderer and previewer.
It's already 3 languages with different compilers, module machinery (or lack of it), and other things. It would be good to store the source code for them in different subdirectories to reduce the mess and to make it clear which tool uses which source code directories and where to look for the sources of a particular part of the system. So we can split the primary source code of the project into a few units:
- guile:: for Guile Scheme source code (e.g. parsers, web server).
- hoot:: for Hoot, a Scheme dialect compiled to wasm (e.g. web frontend).
- scheme-common:: records, data models used by both frontend and backend.
- c:: native bindings, high-performance implementation, etc.
- python:: legacy Python markdown parser for benchmarking against.
We can already see that scheme-common is not exactly tied to a
particular language; it's more of a logical grouping. Language can be
a grouping criterion, but not necessarily. It just happened to be the
first one that came to our minds.
We could have backend, frontend and shared, or c-parser,
scm-parser, scm-viewer and scm-common. Name it the way that makes
sense for the project, but please keep this intermediate level of
directories even if there is only one at the moment. You never know when
you will need to introduce a new language or split the existing system
into a few subsystems.
Now, if we follow this convention, navigation becomes easy. For
native bindings we look for tree-sitter-helpers.c in the c/ directory.
If we look for frontend code, it's in hoot, and code shared between
frontend and backend is in scheme-common. Also, constructing load
paths or pointing the compiler to the sources becomes simpler:
GUILE_LOAD_PATH="./guile:./scheme-common" guile -c '((@ (my-project web) server))'
HOOT_LOAD_PATH="./hoot:./scheme-common" guile -c '((@ (my-project hoot) generate-wasm-binary))'
We know that no unintended backend code leaks into the frontend build. Also, we know that our primary source doesn't rely on any shady util function from a test module, but we know that thanks to the idea from the next section :)
Top-level Project Directories
How do we ensure that no test or dev code ends up in the release? We put them in separate directories(!), so it's hard to mix them up unintentionally. We propose 4 top-level source code directories:
- env/:: code for setting up dependencies and development environments.
- src/:: the primary source code.
- tests/:: tests, I guess.
- dev/:: drafts and snippets useful for development, but not going into the final build.
Even in a small personal project you will very likely need all of
them. You need to set up the dev environment (libraries, compilers,
etc.), and it's better to persist it, at least as env/setup.sh, but
better as something like:
- env/guix/my-project/channels.scm:: Guix channels for the exact Guix revision and package repositories.
- env/dev/my-project/packages.scm:: package definitions and collections used for development/testing.
- env/release/my-project/packages.scm:: packages needed for release, to make sure no dev/test/debugging functions leak into the final build.
You can see that we follow a grouping pattern similar to the one described in the previous section. It's a good balance between verbosity and clarity.
For src/ we do the same: apply the intermediate grouping level from
the previous section. The primary source splits into subsystems or
languages, and each subsystem lives in its own subdirectory:
- src/guile/:: Guile Scheme backend code.
- src/hoot/:: Hoot frontend compiled to WebAssembly.
- src/scheme-common/:: shared data models and records.
- src/c/:: native bindings.
Constructing release load paths is now simple and explicit:
GUILE_LOAD_PATH="./src/guile:./src/scheme-common"
Apply the same grouping to tests/, mirroring the structure of
src/. A test for src/guile/my-project/markdown/parser.scm lives
at tests/guile/my-project/markdown/parser-test.scm. The mapping is
mechanical: swap the top-level directory and append -test to the
filename. Finding the test for any module takes zero thinking and is
easy to implement on the IDE side.
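To show how mechanical the mapping is, here is a minimal sketch in shell. test_path_for is a hypothetical helper name, and the sketch assumes paths are given relative to the project root and end in .scm.

```shell
# Hypothetical helper: map a source file under src/ to its test under tests/.
# The body runs in a subshell, so the helper variable does not leak.
test_path_for() (
    case $1 in
        src/*.scm) ;;
        *) echo "not a Scheme source under src/: $1" >&2; exit 1 ;;
    esac
    rest=${1#src/}                                # drop the top-level directory
    printf 'tests/%s-test.scm\n' "${rest%.scm}"   # re-root and append -test
)

test_path_for src/guile/my-project/markdown/parser.scm
# prints tests/guile/my-project/markdown/parser-test.scm
```

An editor command for jumping between a module and its test needs nothing more than this string rewrite and its inverse.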
The dev/ directory is for anything useful during development that has
no place in the final build: REPL session snippets, profiling scripts,
one-off data migration helpers, experimental ideas. It follows the
same grouping convention:
- dev/guile/my-project/drafts/bench.scm:: benchmarks.
- dev/guile/my-project/drafts/scratch.scm:: throwaway experiments.
The final dev load paths will look like:
GUILE_LOAD_PATH="./dev/guile:./tests/guile:./src/guile:./src/scheme-common"
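Because every kind of code has its own top-level directory, both load paths can be assembled from a list of directories, and a release path can be checked for stray dev or test entries. The following is a minimal sketch; join_load_path and is_release_safe are hypothetical helper names, and the check assumes entries are written relative to the project root with a leading ./ as in the examples above.

```shell
# Hypothetical helper: join directories with ":" to form a load path.
# "$*" joins the arguments with the first character of IFS; the subshell
# body keeps the IFS change local.
join_load_path() (
    IFS=:
    printf '%s\n' "$*"
)

# Hypothetical guard: fail if a load path mentions ./dev/ or ./tests/.
is_release_safe() {
    case ":$1:" in
        *:./dev/*|*:./tests/*) return 1 ;;
        *) return 0 ;;
    esac
}

release_path=$(join_load_path ./src/guile ./src/scheme-common)
dev_path=$(join_load_path ./dev/guile ./tests/guile "$release_path")

echo "release: $release_path"
echo "dev:     $dev_path"

is_release_safe "$release_path" && echo "release path is clean"
is_release_safe "$dev_path" || echo "dev path contains dev/test code, as expected"
```

A build script could run the guard before packaging, so a test directory added to the release path by mistake fails loudly instead of shipping.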
Beyond the four source code directories, there are three more that you will likely need:
- doc/:: architecture notes, design decisions, onboarding guides. Not inline code comments, but the higher-level texts that explain why the system is the way it is.
- target/:: build artifacts and generated output. Never committed, always in .gitignore (or similar). Having a dedicated name avoids the build/, out/, dist/, _build/ lottery.
- tmp/:: throwaway files that don't deserve a place even in dev/. Also gitignored. Better to have an explicit place than to let the junk and temporary files accumulate in the primary part of the repo.
Putting it all together, the full project tree looks like this:
my-project/
├── dev/
│   └── guile/
│       └── my-project/
│           └── markdown/
│               └── bench.scm
├── doc/
│   └── architecture.md
├── env/
│   ├── dev/
│   │   └── my-project/
│   │       └── packages.scm
│   ├── guix/
│   │   └── my-project/
│   │       └── channels.scm
│   └── release/
│       └── my-project/
│           └── packages.scm
├── src/
│   ├── c/
│   │   └── tree-sitter-helpers.c
│   ├── guile/
│   │   └── my-project/
│   │       └── markdown/
│   │           └── parser.scm
│   ├── hoot/
│   │   └── my-project/
│   │       └── markdown/
│   │           └── viewer.scm
│   └── scheme-common/
│       └── my-project/
│           └── markdown/
│               └── node.scm
├── target/
├── tests/
│   └── guile/
│       └── my-project/
│           └── markdown/
│               └── parser-test.scm
└── tmp/
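A skeleton like this is cheap to generate. The sketch below is one possible scaffold, not a prescribed tool: scaffold is a hypothetical function, it creates tests/ and dev/ counterparts for every unit it is given (more than the example tree shows), and it assumes git-style ignore rules for target/ and tmp/.

```shell
# Hypothetical scaffold: scaffold ROOT PROJECT UNIT...
# Creates the top-level directories with the intermediate grouping level.
# The subshell body keeps the helper variables local.
scaffold() (
    root=$1
    project=$2
    shift 2
    for unit in "$@"; do
        for top in src tests dev; do
            mkdir -p "$root/$top/$unit/$project"
        done
    done
    for group in guix dev release; do
        mkdir -p "$root/env/$group/$project"
    done
    mkdir -p "$root/doc" "$root/target" "$root/tmp"
    # Build artifacts and throwaway files stay out of version control.
    printf '%s\n' '/target/' '/tmp/' > "$root/.gitignore"
)

# Demonstration in a scratch directory.
demo_root=$(mktemp -d)
scaffold "$demo_root/my-project" my-project guile hoot scheme-common
find "$demo_root/my-project" -type d | sort
```

Units without a module hierarchy, like the c/ directory in the tree above, would need their own handling; the sketch treats every unit the same way for brevity.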
Slightly verbose, but clear, predictable, and it scales well. We are pretty happy with it, and we hope you will be as well. Or maybe not; either way, do NOT contact Andrew, he works on a fileless and directoryless, content-addressable future for our civilization.