You type docker run nginx.
Half a second later, a fully isolated web server is running on your machine.
No installation wizard. No reboot. No "please wait while we configure your environment." Just running.
It feels like magic. And for a long time, most developers just accepted that and moved on. The abstraction was good enough. Why dig deeper?
Here is why: once you see what is actually happening, containers stop being a black box. You stop guessing why things break. You start making better decisions about how you build and deploy software.
So let's pull back the curtain.
The 11-Year Foundation You're Standing On
When Docker launched in 2013, the core features that make containers work had been quietly built into the Linux kernel over the previous 11 years.
The first piece, mount namespaces, landed in the Linux kernel in 2002. The last major piece, user namespaces, was finally considered complete and stable in 2013. The exact same year Docker launched.
That timing wasn't a coincidence. Docker arrived at the precise moment the kernel foundation was actually ready.
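So what are these features? There are three of them, and each one solves a completely different problem. (One footnote on the timeline: the third, OverlayFS, only reached the mainline kernel in late 2014 with Linux 3.18. Early Docker relied on a similar out-of-tree filesystem called AUFS.)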
Feature 1 - Namespaces: The Art of Controlled Blindness
Imagine you're working in a huge open-plan office. You can see everyone, hear every conversation, and access every desk. Now imagine someone builds walls around you and gives you your own private room. It's the same building, on the same floor, but now you only see what's in your room.
That is a namespace.
A namespace is a kernel feature that gives a process a filtered view of the system. The process thinks it sees everything (its own processes, its own network, its own filesystem), but it is actually only seeing a curated slice of reality. The rest of the machine exists; it's just invisible.
Docker relies on six types of namespaces:
- PID namespace: The container has its own process IDs. Your Nginx process thinks it's PID 1, the first process in its world. On the host, it's actually something like PID 4821. Two completely different realities on the same machine.
- Network namespace: The container gets its own network interfaces and its own IP address. It cannot see your host's network stack or the traffic of other containers.
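- Mount namespace: The container gets its own set of mount points and its own filesystem root. When it looks at /, it sees the container's files, not your host machine's files.
- UTS namespace: The container can have its own hostname. Type hostname inside a container and, by default, it returns the short container ID, not your machine's name.
- IPC namespace: Isolates inter-process communication (shared memory segments, semaphores, and message queues). Containers can't accidentally share memory with processes outside their walls.
- User namespace: A process can be root inside a container but map to a regular, unprivileged user on the host. The container thinks it has full power. It doesn't. Docker only enables this one when you opt in (through the daemon's userns-remap setting). Without it, root in a container is root on the host, fenced in only by the container's other restrictions.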
Together, these six namespaces create the complete illusion of being alone on a machine.
Why does this matter? Without namespaces, one buggy app can see every other app's data, files, and network traffic. With namespaces, even if your app is compromised, the attacker starts out locked inside that container's narrow view of reality. They can't see your other containers. They can't see your host filesystem. Short of exploiting a kernel bug or a misconfiguration, they are trapped in their own private room.
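You can see these namespaces from the host without any Docker tooling. On Linux, every process has a /proc/<pid>/ns/ directory whose entries are symlinks like pid:[4026531836]. Two processes share a namespace exactly when those bracketed inode numbers match. Here is a minimal Python sketch. It assumes a Linux host with /proc mounted. The helper names (parse_ns_link, namespaces_of, shared_namespaces) are our own and are not part of any Docker API. The proc_root parameter exists only so the functions can be pointed at a test directory.

```python
import os
import re

# Every entry in /proc/<pid>/ns is a symlink whose target looks like "pid:[4026531836]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a namespace link target into (namespace type, inode number)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid: str = "self", proc_root: str = "/proc") -> dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace inode number (Linux only)."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    for entry in os.listdir(ns_dir):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        except (OSError, ValueError):
            continue  # skip entries we may not read or do not recognize
        result[entry] = inode
    return result


def shared_namespaces(pid_a: str, pid_b: str, proc_root: str = "/proc") -> set[str]:
    """Namespace entries in which two processes have the same view of the system."""
    a = namespaces_of(pid_a, proc_root)
    b = namespaces_of(pid_b, proc_root)
    return {entry for entry in a.keys() & b.keys() if a[entry] == b[entry]}
```

On a host with a running container, `docker inspect --format '{{.State.Pid}}' <container>` prints the container's host PID. If you then run shared_namespaces("1", that_pid) as root, you would expect pid, net, mnt, uts, and ipc to be missing from the result.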
Feature 2 - cgroups: The Landlord That Sets the Rules
Namespaces solve the visibility problem. But there's a second problem.
Even if a container can't see your other processes, it can still starve them.
Imagine that same office building. Your private room is nice, but if your neighbor cranks up the shared air conditioning to maximum and leaves it on forever, everyone in the building suffers. The isolation was real, but the resource impact wasn't.
cgroups (control groups) are the kernel's answer to this. They let you set hard limits on how much CPU, memory, and disk I/O a group of processes can consume, and on how many processes it can spawn.
Give a container 512MB of memory and 0.5 CPU cores, and that is all it gets. Period. If it tries to use more CPU, the kernel throttles it. If it needs more memory than the kernel can reclaim, the OOM killer terminates it. Other containers on the same host don't pay for its excess.
This is how cloud providers can run thousands of containers on the same physical server without them interfering with each other. Every tenant has their private room (namespaces) AND a strictly enforced resource quota (cgroups).
When you write this in your Docker Compose file:
```yaml
deploy:
  resources:
    limits:
      memory: 512m
      cpus: '0.5'
```
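You are just writing cgroup configuration in YAML. Docker passes these limits down to the container runtime, and they end up as values in files under /sys/fs/cgroup/ on your host (on a cgroup v2 system, memory.max and cpu.max in the container's cgroup directory). The kernel reads those values and mercilessly enforces them for every process in that container.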
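To make the translation concrete, here is a Python sketch of the arithmetic. It assumes a cgroup v2 host and Docker's default CPU period of 100,000 microseconds. On cgroup v2, memory.max takes a byte count and cpu.max takes a "quota period" pair: the group may run for quota microseconds of CPU time in every period. The parsing helper is our own simplification. It only handles the b/k/m/g suffixes with binary multipliers and is not Docker's actual code.

```python
from decimal import Decimal

# Binary multipliers, the way Docker interprets memory sizes such as "512m".
_UNITS = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

# Docker's default CFS period: 100,000 microseconds (100 ms).
DEFAULT_CPU_PERIOD_US = 100_000


def memory_max(limit: str) -> int:
    """Translate a Compose-style memory limit ("512m") into a memory.max byte count."""
    limit = limit.strip().lower()
    if limit and limit[-1] in _UNITS:
        number, unit = limit[:-1], limit[-1]
    else:
        number, unit = limit, "b"  # a bare number is already in bytes
    return int(Decimal(number) * _UNITS[unit])


def cpu_max(cpus: str, period_us: int = DEFAULT_CPU_PERIOD_US) -> str:
    """Translate a fractional CPU count ("0.5") into a cpu.max "quota period" line."""
    quota_us = int(Decimal(cpus) * period_us)
    return f"{quota_us} {period_us}"


print(memory_max("512m"))  # 536870912
print(cpu_max("0.5"))      # 50000 100000
```

In words, half a CPU means 50 ms of CPU time in every 100 ms window. Once a container uses up its share, the kernel throttles it until the next window starts.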
Feature 3 - OverlayFS: The Library That Never Makes Copies
This is the cleverest feature of all, and it's the reason 10 containers from the same image don't use 10x the disk space.
Think about a public library.
The library has one copy of a book. Hundreds of people read it. The library doesn't print a new copy for each reader; they all read the same one.
But what if a reader wants to scribble notes in the margins? The library gives that specific reader their own blank transparency sheet to lay over the pages they want to write on. Everyone else still reads the original. The original never changes.
That is OverlayFS (Overlay Filesystem) in a nutshell. It's a layered filesystem where:
- The image layers are read-only, shared by every container running from that image.
- Each container gets its own thin writable layer on top.
- When a container modifies a file, that file gets copied up into the writable layer first. The original is never touched.
This is called Copy-on-Write. You only pay the copy cost when you actually write. Reads need no copying at all; they are served straight from the shared layers.
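The lookup rules above are simple enough to model in a few lines. The Python sketch below is a toy, in-memory model, not the kernel's implementation: dicts stand in for directories, and OverlayView and WHITEOUT are our own names. Reads check the container's writable layer first and then the image layers from top to bottom. Writes only ever touch the container's own upper layer. A deletion leaves a "whiteout" marker that hides the lower file without modifying it, which is also how the real OverlayFS handles deletes.

```python
from types import MappingProxyType

WHITEOUT = object()  # marker meaning "deleted in this container"


class OverlayView:
    """Toy model of one container's merged view over shared read-only image layers."""

    def __init__(self, image_layers):
        # image_layers: dicts, lowest layer first. Wrapped read-only, never copied.
        self.lower = [MappingProxyType(layer) for layer in image_layers]
        self.upper = {}  # this container's private, writable layer

    def read(self, path):
        if path in self.upper:
            value = self.upper[path]
            if value is WHITEOUT:
                raise FileNotFoundError(path)
            return value
        for layer in reversed(self.lower):  # the topmost image layer wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Copy-on-write: new content lands in the upper layer only.
        self.upper[path] = data

    def append(self, path, data):
        # Modifying an existing file first "copies it up", then edits the copy.
        self.upper[path] = self.read(path) + data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the file is not visible
        self.upper[path] = WHITEOUT


base = {"/etc/nginx/nginx.conf": "worker_processes 1;"}
app = {"/usr/share/nginx/html/index.html": "<h1>hi</h1>"}
a, b = OverlayView([base, app]), OverlayView([base, app])  # two containers, one image
a.append("/etc/nginx/nginx.conf", " # tuned")
b.delete("/usr/share/nginx/html/index.html")
print(a.read("/etc/nginx/nginx.conf"))  # worker_processes 1; # tuned
print(b.read("/etc/nginx/nginx.conf"))  # worker_processes 1;
print(base["/etc/nginx/nginx.conf"])    # worker_processes 1;
```

Discarding a's upper dict is the toy equivalent of removing the container. The shared layers are untouched, which is the same reason a file written inside a container disappears when the container is removed.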
[Figure: Copy-On-Write (CoW) Mechanism, before a write and after process P modifies page 3]
So your 10 Nginx containers don't store 10 × 200MB on disk. They share one 200MB set of read-only layers, and each has a tiny writable layer on top that only holds whatever that specific container has written.
[Figure: OverlayFS Architecture, with the read-only image layers stored once on disk and shared by all containers]
This is also why containers start so fast. There is nothing to copy. The filesystem is already there, shared and ready. Docker just needs to create an empty writable layer and have the kernel mount it on top.
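Attaching that writable layer comes down to a single overlay mount. The kernel's overlay filesystem takes three directory options. The first is a lowerdir list of read-only layers, colon-separated with the topmost layer first. The second is an upperdir for writes. The third is a workdir the kernel uses internally, which must be an empty directory on the same filesystem as upperdir. The Python sketch below only builds the option string and the equivalent mount command line; actually running it needs root. The directory names are hypothetical placeholders, not Docker's real layout under /var/lib/docker.

```python
import shlex


def overlay_mount_options(lower_dirs, upper_dir, work_dir):
    """Build the -o option string for an overlay mount.

    lower_dirs: read-only image layers, bottom layer first (the order an image
    is built in). The kernel expects the opposite order, topmost first.
    """
    if not lower_dirs:
        raise ValueError("at least one lower (image) layer is required")
    lower = ":".join(reversed(lower_dirs))
    return f"lowerdir={lower},upperdir={upper_dir},workdir={work_dir}"


def overlay_mount_command(lower_dirs, upper_dir, work_dir, merged_dir):
    """The equivalent mount(8) invocation, as a shell-quoted string."""
    options = overlay_mount_options(lower_dirs, upper_dir, work_dir)
    return shlex.join(["mount", "-t", "overlay", "overlay", "-o", options, merged_dir])


# Hypothetical paths: two shared image layers plus one container's private dirs.
print(overlay_mount_command(
    ["/layers/base", "/layers/nginx"],
    "/containers/c1/upper",
    "/containers/c1/work",
    "/containers/c1/merged",
))
```

A second container from the same image would get its own upper, work, and merged directories but reuse the same lowerdir list. That reuse is why nothing has to be copied at startup.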
Think About It
You have a container running Nginx. Inside the container, you run:
```sh
echo "hello" > /etc/nginx/test.txt
```
You just wrote a file to what appears to be the Nginx config directory. Now you stop and delete that container. Then you spin up a brand new container from the exact same Nginx image.
Is that test.txt file there?
No. It is gone completely.
Because that file only existed in the container's writable layer, a thin overlay that gets thrown away when the container is removed. The shared read-only image layers underneath were never touched.
This is why containers are called ephemeral; they are designed to be disposable. Any data you want to survive must be stored outside the container's writable layer, in a Docker volume or a bind mount. The container itself is meant to be replaceable at any moment.
The Call Chain: From Your Keyboard to a Running Process
Now let's connect all three features to what actually happens when you type docker run nginx.
Most people think Docker does everything. It doesn't. Docker is more like a coordinator. It delegates the actual work down a chain of highly specialized tools:
[Figure: The Execution Chain, Docker CLI → dockerd → containerd → containerd-shim → runc]
Each layer has one job and hands off to the next. Here is what each one actually does:
Docker CLI: Just a client. When you type docker run, it sends an HTTP request to the Docker daemon over a Unix socket. The CLI does no container work itself.
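Because the CLI is just an HTTP client, anything that can speak HTTP over a Unix socket can talk to the daemon. Here is a minimal Python sketch that uses only the standard library. It assumes the daemon listens on the default /var/run/docker.sock and that your user is allowed to open that socket. GET /_ping is a real Docker Engine API endpoint that answers OK when the daemon is healthy. UnixHTTPConnection and ping_daemon are our own helpers, not part of http.client or any Docker SDK.

```python
import http.client
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """An HTTPConnection that talks over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str, timeout: float = 5.0):
        # "localhost" only fills in the Host header; no TCP connection is made.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def ping_daemon(socket_path: str = "/var/run/docker.sock") -> tuple[int, str]:
    """Send GET /_ping to the daemon and return (HTTP status, body)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", "/_ping")
        resp = conn.getresponse()
        return resp.status, resp.read().decode()
    finally:
        conn.close()
```

On a machine with a running daemon, calling ping_daemon() should return a 200 status and the body OK. That is the same kind of request the CLI makes before it sends heavier ones, such as creating a container.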
dockerd: The manager. It handles the big picture: is the image available locally? If not, pull it from the registry (Docker Hub by default). Then it hands the actual container creation over to containerd.
containerd: The lifecycle manager. It prepares what the container needs, an OCI bundle made of the root filesystem plus a config.json that describes the process, its namespaces, and its cgroup limits. Then it launches a shim that asks runc to actually create the container.
runc: This low-level binary asks the kernel for fresh namespaces through the clone() and unshare() system calls, passing one flag per namespace type (CLONE_NEWPID, CLONE_NEWNET, and so on). It sets up cgroups for resource limits. It switches the process into the OverlayFS root filesystem prepared for it. Then it executes your process inside that isolated environment.
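The namespace part of that job comes down to bit flags. The Linux kernel defines a CLONE_NEW* constant for each namespace type, and a runtime ORs together the ones it wants before calling clone() or unshare(). The Python sketch below only computes that mask from the kernel's published flag values. It does not create any namespaces, because that requires root. (On Linux, Python 3.12+ exposes os.unshare() and the same os.CLONE_NEW* constants if you want to experiment in a throwaway VM.) The mapping from short names to flags is our own illustration, not runc's code.

```python
# Flag values as defined by the Linux kernel (include/uapi/linux/sched.h).
CLONE_FLAGS = {
    "mnt":  0x00020000,  # CLONE_NEWNS (mount namespace)
    "uts":  0x04000000,  # CLONE_NEWUTS
    "ipc":  0x08000000,  # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid":  0x20000000,  # CLONE_NEWPID
    "net":  0x40000000,  # CLONE_NEWNET
}


def namespace_mask(namespaces):
    """OR together the clone flags for the requested namespace types."""
    mask = 0
    for name in namespaces:
        mask |= CLONE_FLAGS[name]
    return mask


# The five namespaces from the list above that Docker creates by default
# (user namespaces are opt-in, so "user" is left out here).
default_mask = namespace_mask(["mnt", "uts", "ipc", "pid", "net"])
print(hex(default_mask))  # 0x6c020000
```

A single integer describing "new everything" is why one system call can hand a process several fresh namespaces at once.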
And then runc does something that surprises almost everyone who learns about it.
runc exits.
It doesn't stick around to monitor the container. Its job is to create the environment and start the process. Once that's done, it's gone. If you look for runc processes on a machine with 20 running containers, you will find none.
So who keeps the containers alive?
The shim. That containerd-shim sitting between containerd and runc. There is one per container, silently running the whole time. The shim holds the container's stdio open, reports the exit code when the container finishes, and most importantly, keeps the container alive even if containerd or dockerd crashes and restarts.
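This is why your running containers can outlive a crashed Docker daemon. They aren't held by Docker. They are held by their shims. (Docker only relies on this when its live-restore option is enabled. By default, the daemon shuts its containers down when it terminates.)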
What This Means for You Practically
- "Why did my container crash with exit code 137?" 137 is 128 + 9, meaning the process was killed with SIGKILL. The usual culprit is the kernel's OOM killer: your container hit its cgroup memory limit. Check State.OOMKilled in docker inspect to confirm, and watch docker stats to see memory climbing toward the limit before it happens. (A docker kill, or a docker stop that times out, also ends in 137.)
- "Why do containers start so fast?" There is no OS to boot. The kernel is already running. OverlayFS just attaches a writable layer to shared read-only layers. Milliseconds, not minutes.
- "If the Docker daemon crashes, do my containers die?" Not if live-restore is enabled. The shims keep them alive; you just can't manage them until dockerd comes back.
- "Why is my file gone after I recreate the container?" A container's writable layer survives a restart but is deleted along with the container. Use volumes for anything you want to persist.
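The 137 arithmetic generalizes. When a container's main process is killed by signal N, Docker reports exit code 128 + N. A small Python helper makes the mapping explicit. It assumes a Unix-like system, because the signal module's names are platform-specific. It also assumes that codes above 128 mean a signal, which is a convention rather than a guarantee: a program can call exit(137) itself.

```python
import signal

_MAX_SIGNAL = 64  # Linux signal numbers run from 1 to 64


def describe_exit_code(code: int) -> str:
    """Interpret a container exit code using the 128 + signal-number convention."""
    if code == 0:
        return "exited normally"
    if 128 < code <= 128 + _MAX_SIGNAL:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:  # no named constant for this number (e.g. real-time signals)
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with error code {code}"


for code in (0, 1, 137, 143):
    print(code, describe_exit_code(code))
```

Because 137 alone doesn't prove an OOM kill, it's worth confirming with State.OOMKilled in docker inspect before raising the memory limit.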
The Three Pillars, One Sentence Each
After all of this, here is the clean summary you can carry in your head:
- Namespaces give each container its own private view of the world: its own processes, network, and filesystem.
- cgroups enforce hard limits on how much of the machine each container can consume.
- OverlayFS makes images efficient by sharing read-only layers and only copying files when they're actually written.
Docker didn't invent any of these. It built a beautiful, developer-friendly UX on top of them. And now you know what's underneath.
What's Next
In the next article, we are going deep into Docker images and Dockerfiles: how images are actually built layer by layer, why the order of your instructions matters for build speed, and the multi-stage build pattern that can take your image from 900MB down to 25MB.
If you've ever copied a Dockerfile from Stack Overflow without fully understanding why it was structured that way, that article is for you.
Which of these three (namespaces, cgroups, or OverlayFS) surprised you the most? And did you know runc exits immediately after starting a container? Drop your reaction in the comments.



