How to Design a File Storage System like Dropbox [2026]

Q: What is the difference between block storage and object storage?

Block storage (like AWS EBS) exposes a volume as fixed-size blocks, like a virtual hard disk. It supports random read/write access and is used for databases and VMs. Object storage (like S3) stores entire files as objects with metadata and a unique key. It is accessed via an HTTP API, is optimised for reading and writing whole objects rather than updating them in place, and scales to petabytes easily. For a Dropbox-style system, we use object storage (S3) for file chunks but a relational database for metadata. This is because file metadata (paths, versions, sharing) benefits from SQL joins, while the actual binary data benefits from S3's durability and straightforward CDN integration.

Q: How does chunked upload work and why is it necessary?

Chunked upload splits a file into fixed-size pieces (typically 4MB per chunk) and uploads each chunk independently. This is necessary for three reasons: (1) Large file resilience: if a 2GB upload fails at 90%, only the chunks that had not finished need to be re-uploaded, not the whole file. (2) Parallelism: multiple chunks can be uploaded simultaneously, maximizing bandwidth. (3) Deduplication: if the same 4MB chunk already exists in any user's files (e.g., a popular PDF), it only needs to be stored once. To make deduplication possible, the client first calls the API to check which chunks are missing before uploading.

Q: What is delta sync and how does it reduce bandwidth?

Delta sync only uploads the changed portions of a file. When you edit a large Word document, only the modified 4MB chunks are re-uploaded — not the entire 200MB file. The client computes a SHA256 hash of each chunk and compares with the server-side hashes stored for the previous version. Chunks with matching hashes are skipped. For typical office document edits, delta sync reduces upload bandwidth by 90–99%.

Q: How do you handle file conflicts when two users edit simultaneously?

Most file sync systems use last-write-wins (LWW) as the default: the most recently synced version overwrites the older one. To prevent data loss, Dropbox preserves the overwritten version as a "conflicted copy" with the hostname and timestamp in the filename (e.g., "report (John's conflicted copy 2026-06-13).docx"). True operational transform (like Google Docs) requires the file format to support granular change merging — not feasible for arbitrary binary files. For plain text, some systems use a 3-way merge (your changes + their changes + common ancestor).

Q: How do you implement file versioning without excessive storage?

Each file save creates a new version entry pointing to a set of chunk hashes. Because chunks are content-addressable (hash = content), unchanged chunks between versions share the same S3 object, so nothing is duplicated. Only changed chunks consume additional storage, and the unit of change is a whole chunk. For example, if you save a 100MB file 10 times with a 1MB change each time, and each change falls inside a single 4MB chunk, storage is ~136MB (100MB base + 9 × 4MB rewritten chunks), not 1,000MB. Implement a version retention policy: for example, keep all versions for 30 days, then keep only weekly snapshots, then monthly.

Q: How do you separate the metadata database from the file store?

The metadata DB (MySQL/PostgreSQL) stores: file paths, ownership, folder hierarchy, version history, chunk hash lists, sharing permissions, and sync state. The file store (S3) stores only the raw binary chunks keyed by their SHA256 hash. This separation is critical because: (1) metadata queries (list files, search by name, check permissions) need SQL joins that S3 cannot provide; (2) the metadata DB can be replicated and indexed for fast queries; (3) S3 scales file storage independently of the database.

1. Requirements Clarification

Functional Requirements

Upload, download, and delete files up to 50GB
Sync files across multiple devices automatically
File versioning — restore previous versions (up to 30 days)
Folder sharing and collaboration
Conflict detection and resolution
Offline support — queue operations and sync on reconnect

Non-Functional Requirements

Scale: 100 million users; average 10GB storage per user = 1 Exabyte total
Upload throughput: Resume interrupted uploads; parallel chunk uploads
Sync latency: Changes visible on other devices within 30 seconds
Durability: 99.999999999% (S3 11 nines) for stored files
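
The storage total above is simple arithmetic, and it can help to see it next to the number of chunks it implies. The following is a back-of-envelope sketch only: it assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and the 4MB chunk size used later in this guide, and the function name is made up for illustration.

```python
def estimate_capacity(users: int, gb_per_user: int, chunk_mb: int = 4) -> dict:
    """Back-of-envelope totals using decimal units (1 GB = 10**9 bytes)."""
    total_bytes = users * gb_per_user * 10**9
    chunk_bytes = chunk_mb * 10**6
    return {
        "total_exabytes": total_bytes / 10**18,
        # Upper bound: ignores deduplication and partially filled last chunks
        "chunk_count": total_bytes // chunk_bytes,
    }


estimate = estimate_capacity(users=100_000_000, gb_per_user=10)
print(estimate)
```

Under these assumptions the chunk count is in the hundreds of billions, which is one reason the chunk index is kept separate from the file metadata.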

2. High-Level Architecture

  Client (Desktop / Mobile)
       │
       ├─── Metadata API ──▶  Metadata Service ──▶ MySQL (files, versions, chunks)
       │                              │
       │                      Redis (sync state,
       │                       notifications)
       │
       ├─── Upload Chunks ──▶  Block Service ──▶  S3 (raw binary chunks)
       │                              │
       │                     Chunk DB (SHA256 → S3 key)
       │
       └─── Sync Notification ◀── WebSocket / SSE
                                 (notify other devices of changes)

3. Chunked Upload Design

Files are split into 4MB chunks on the client before upload. This enables resumable uploads, deduplication, and parallel transfer.

Upload Flow

Client splits file: Divide file into 4MB chunks; compute SHA256 hash for each chunk
Check which chunks exist: POST /api/chunks/check with list of SHA256 hashes. Server returns which hashes are already stored (dedup benefit).
Upload missing chunks only: For each missing chunk, PUT /api/chunks/{sha256} — upload the binary data
Commit the file: POST /api/files with file path + ordered list of chunk hashes. Server creates the file record and version entry.
Notify other devices: The server pushes "file X updated, fetch new chunk list" to the user's other devices via WebSocket.

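Client code: chunked upload

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB


def split_into_chunks(file_path, chunk_size=CHUNK_SIZE):
    """Yield successive fixed-size byte chunks of the file."""
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            yield chunk


def upload_file(api, file_path):
    # `api` is a placeholder HTTP client exposing post(path, body) and put(path, data).
    # All chunks are held in memory for brevity; stream them for multi-GB files.
    chunks = list(split_into_chunks(file_path))
    hashes = [hashlib.sha256(chunk).hexdigest() for chunk in chunks]

    # Step 1: Check which chunks the server already has
    missing = set(api.post('/chunks/check', {'hashes': hashes})['missing'])

    # Step 2: Upload only missing chunks (parallel), each distinct hash once
    to_upload = {h: c for h, c in zip(hashes, chunks) if h in missing}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(api.put, f'/chunks/{h}', c) for h, c in to_upload.items()]
        for future in futures:
            future.result()  # re-raise any upload error before committing

    # Step 3: Commit file
    api.post('/files', {
        'path': file_path,
        'chunk_hashes': hashes,  # ordered list
        'size': os.path.getsize(file_path),
        'modified_at': os.path.getmtime(file_path),
    })
```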

Content-Addressable Storage (CAS)

Storing chunks by their SHA256 hash (not a UUID or sequential ID) is called content-addressable storage. The benefit: if two users store identical chunks (e.g. a popular PDF, a common video intro), the chunk is physically stored only once in S3. Dropbox reported that CAS reduced their S3 storage by ~40% in 2011. The SHA256 hash is both the key and the integrity check — tampering with the content would change the hash.
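
To make the idea concrete, here is a minimal in-memory sketch of a content-addressable store. It is an illustration under stated assumptions, not how S3 or Dropbox is implemented: the `ChunkStore` class and its methods are hypothetical, and a dictionary stands in for the S3 bucket.

```python
import hashlib


class ChunkStore:
    """In-memory stand-in for the S3 bucket plus a reference counter (hypothetical)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}     # hash -> chunk bytes
        self._ref_counts: dict[str, int] = {}  # hash -> number of references

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self._blobs:
            self._blobs[key] = data            # first copy: store the bytes
        self._ref_counts[key] = self._ref_counts.get(key, 0) + 1
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        # The key doubles as an integrity check on read
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("chunk content does not match its hash")
        return data

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


store = ChunkStore()
key_a = store.put(b"same popular PDF bytes")
key_b = store.put(b"same popular PDF bytes")  # second user, identical content
```

Both calls return the same key, and the store holds the bytes once while counting two references.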

4. Delta Sync — Only Upload Changed Chunks

When an existing file is modified, only the changed chunks need to be re-uploaded. The client retrieves the current chunk hash list from the server, compares with the new chunk hashes, and uploads only the diff.

Scenario	File Size	Change Size	Without Delta Sync	With Delta Sync
Edit 1 line in Word doc	200MB	~4KB	Upload 200MB	Upload 1 chunk (4MB max)
Add slide to PowerPoint	50MB	~2MB	Upload 50MB	Upload 1-2 chunks (8MB max if the change straddles a chunk boundary)
Rename file	Any	0 bytes	Upload entire file	Upload 0 bytes (metadata only)
Video file re-encode	2GB	All bytes change	Upload 2GB	Upload 2GB (no savings)
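
The comparison step can be sketched as a set lookup over chunk hashes. This is a minimal illustration, assuming fixed-size chunking as described above; the 4-byte chunk size is chosen only to keep the example readable, and the function names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose so the example is readable; the guide uses 4MB


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Hash each fixed-size chunk of data, in order."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_upload(old_hashes: list[str], new_data: bytes,
                     chunk_size: int = CHUNK_SIZE) -> list[int]:
    """Return indexes of chunks in new_data that the previous version did not contain."""
    known = set(old_hashes)
    return [
        i for i, h in enumerate(chunk_hashes(new_data, chunk_size))
        if h not in known
    ]


old = b"AAAABBBBCCCCDDDD"
edited = b"AAAABxBBCCCCDDDD"     # in-place edit inside the second chunk
inserted = b"xAAAABBBBCCCCDDDD"  # one byte inserted at the front

print(chunks_to_upload(chunk_hashes(old), edited))
print(chunks_to_upload(chunk_hashes(old), inserted))
```

The in-place edit touches one chunk, while the one-byte insertion shifts every fixed-size boundary after it, so every chunk looks new. The savings in the table therefore assume edits that leave the surrounding bytes where they were.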

5. Metadata Database Schema

SQL: files, versions, chunks

```sql
CREATE TABLE files (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id     BIGINT UNSIGNED NOT NULL,
    parent_id   BIGINT UNSIGNED NULL,          -- NULL = root folder
    name        VARCHAR(255) NOT NULL,
    type        ENUM('file','folder') NOT NULL,
    is_deleted  TINYINT(1) NOT NULL DEFAULT 0,
    created_at  DATETIME NOT NULL,
    updated_at  DATETIME NOT NULL,
    PRIMARY KEY (id),
    KEY idx_parent (user_id, parent_id),
    KEY idx_path_search (user_id, name)
);

CREATE TABLE file_versions (
    id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    file_id     BIGINT UNSIGNED NOT NULL,
    version_num INT UNSIGNED NOT NULL,
    size_bytes  BIGINT UNSIGNED NOT NULL,
    created_by  BIGINT UNSIGNED NOT NULL,
    created_at  DATETIME NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY uk_file_version (file_id, version_num)
);

CREATE TABLE version_chunks (
    version_id  BIGINT UNSIGNED NOT NULL,
    chunk_index INT UNSIGNED NOT NULL,         -- ordering within file
    chunk_hash  CHAR(64) NOT NULL,             -- SHA256 hex
    PRIMARY KEY (version_id, chunk_index),
    KEY idx_chunk_hash (chunk_hash)            -- find which versions use a chunk
);

CREATE TABLE chunks (
    hash        CHAR(64) NOT NULL,             -- SHA256 hex
    s3_key      VARCHAR(200) NOT NULL,         -- S3 object key
    size_bytes  INT UNSIGNED NOT NULL,
    ref_count   INT UNSIGNED NOT NULL DEFAULT 1,  -- for GC
    created_at  DATETIME NOT NULL,
    PRIMARY KEY (hash)
);
```
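
One query this schema is built for is resolving a version into its ordered list of S3 keys. The sketch below runs that join against an in-memory SQLite database as a stand-in for MySQL, so the column types are simplified and only the two tables involved are created. The sample hashes and keys are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE version_chunks (
    version_id  INTEGER NOT NULL,
    chunk_index INTEGER NOT NULL,
    chunk_hash  TEXT    NOT NULL,
    PRIMARY KEY (version_id, chunk_index)
);
CREATE TABLE chunks (
    hash       TEXT PRIMARY KEY,
    s3_key     TEXT NOT NULL,
    size_bytes INTEGER NOT NULL
);
""")

# Made-up sample data: three stored chunks
conn.executemany("INSERT INTO chunks VALUES (?, ?, ?)", [
    ("aaa", "chunks/aaa", 4194304),
    ("bbb", "chunks/bbb", 4194304),
    ("ccc", "chunks/ccc", 1024),
])
# Version 1 = [aaa, bbb]; version 2 rewrites the second chunk and reuses the first
conn.executemany("INSERT INTO version_chunks VALUES (?, ?, ?)", [
    (1, 0, "aaa"), (1, 1, "bbb"),
    (2, 0, "aaa"), (2, 1, "ccc"),
])


def s3_keys_for_version(conn: sqlite3.Connection, version_id: int) -> list[str]:
    """Return the S3 keys needed to rebuild a version, in file order."""
    rows = conn.execute(
        """SELECT c.s3_key
           FROM version_chunks vc
           JOIN chunks c ON c.hash = vc.chunk_hash
           WHERE vc.version_id = ?
           ORDER BY vc.chunk_index""",
        (version_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(s3_keys_for_version(conn, 2))
```

Both versions point at the same first chunk, which is the sharing that keeps versioning cheap.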

6. Conflict Resolution

Conflicts occur when two devices modify the same file while one is offline. There are three common strategies:

Last-Write-Wins (LWW): The version with the most recent modified_at timestamp wins. Simple, but the losing edit is silently discarded, so it is usually paired with the next strategy.
Conflicted Copy: Both versions are preserved. The "loser" is renamed to filename (Device's conflicted copy YYYY-MM-DD).ext, and the user sees both files and resolves the conflict manually. This is the behavior Dropbox shows to users.
3-Way Merge: Only viable for plain text. Merge: (your version) + (their version) relative to (common ancestor). Git uses this for code files.
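
The conflicted-copy strategy can be sketched in two small functions. This is an illustration under assumptions rather than Dropbox's actual logic: it assumes each commit carries the version number the edit was based on (a hypothetical `base_version` field, not part of the upload flow described earlier), and both function names are made up.

```python
import os
from datetime import date


def classify_commit(head_version: int, base_version: int) -> str:
    """Return 'fast_forward' if the edit was made on top of the latest version, else 'conflict'."""
    return "fast_forward" if base_version == head_version else "conflict"


def conflicted_copy_name(filename: str, device: str, day: date) -> str:
    """Build the name under which the losing edit is preserved."""
    stem, ext = os.path.splitext(filename)
    return f"{stem} ({device}'s conflicted copy {day.isoformat()}){ext}"


# The server is at version 7; an offline laptop edited version 5
outcome = classify_commit(head_version=7, base_version=5)
copy_name = conflicted_copy_name("report.docx", "John", date(2026, 6, 13))
print(outcome, copy_name)
```

Comparing against the base version detects the conflict without relying on any clock; the date is used only in the file name.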

Watch Out — Clock Skew in Last-Write-Wins

Client clocks can be wrong by minutes or hours. If Device A has a clock 5 minutes ahead, its changes will always "win" LWW conflicts — even if Device B made more recent changes by wall-clock time. Solution: use server-assigned timestamps for version ordering, not client-reported modified_at times. The client's timestamp is stored for display purposes only.
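
A minimal sketch of server-assigned ordering follows. It swaps the server timestamp for a server-side sequence counter, which serves the same purpose as long as the server, not the client, assigns it. The `Version` and `VersionLog` names are hypothetical.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    seq: int                   # server-assigned, decides ordering
    device: str
    client_modified_at: float  # client-reported, kept for display only


class VersionLog:
    """In-memory stand-in for the version history of one file."""

    def __init__(self) -> None:
        self._next_seq = itertools.count(1)
        self.versions: list[Version] = []

    def commit(self, device: str, client_modified_at: float) -> Version:
        version = Version(next(self._next_seq), device, client_modified_at)
        self.versions.append(version)
        return version

    def latest(self) -> Version:
        return max(self.versions, key=lambda v: v.seq)


log = VersionLog()
log.commit("A", client_modified_at=1_000_300.0)  # Device A's clock runs 5 minutes fast
log.commit("B", client_modified_at=1_000_060.0)  # Device B commits later in real time
print(log.latest().device)
```

Ordering by `client_modified_at` would pick Device A here; ordering by the server-assigned sequence picks Device B, the commit that actually arrived last.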

7. File Versioning Strategy

Versioning stores the history of changes so users can restore previous versions. Because chunks are content-addressable, versioning is storage-efficient — unchanged chunks between versions are shared automatically.

Every save creates a new file_versions row, plus version_chunks rows that hold the full ordered chunk hash list
Chunk storage is shared: only changed chunks consume new S3 space
Retention policy: keep all versions for 30 days, then delete old versions (but keep chunks if still referenced by other versions)
Chunk GC (garbage collection): when a chunk's ref_count drops to 0, schedule deletion from S3 (run GC job nightly)
Soft-delete files: set is_deleted=1 and keep for 30 days in Trash before permanent deletion
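
The reference-counting step of chunk GC can be sketched as follows. It is a simplified illustration: a dictionary stands in for the ref_count column of the chunks table, the function name is hypothetical, and a real job would do the decrement and the S3 deletion transactionally.

```python
def release_version(chunk_hashes: list[str], ref_counts: dict[str, int]) -> list[str]:
    """Decrement ref counts for a deleted version; return hashes that became unreferenced."""
    garbage = []
    for chunk_hash in chunk_hashes:
        ref_counts[chunk_hash] -= 1
        if ref_counts[chunk_hash] == 0:
            garbage.append(chunk_hash)  # candidate for deletion by the nightly GC job
    return garbage


# Two versions share chunk "aaa": v1 = [aaa, bbb], v2 = [aaa, ccc]
ref_counts = {"aaa": 2, "bbb": 1, "ccc": 1}

# The retention policy expires v1
expired = release_version(["aaa", "bbb"], ref_counts)
print(expired)
```

Only the chunk unique to the expired version is scheduled for deletion; the shared chunk stays because the newer version still references it.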

8. Sync Notification Between Devices

When Alice saves a file on her laptop, her phone needs to download the updated version. This requires a real-time notification channel.

After the metadata service commits a new file version, it publishes a file_changed event to Redis Pub/Sub
Each device maintains a long-polling connection or WebSocket to the sync service
The sync service subscribes to Redis and pushes events to connected devices
The event contains: file_id, new version number, changed chunk hashes
The device downloads only the changed chunks and reconstructs the file
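
The last step, on the receiving device, can be sketched as below. It assumes the device has already fetched the new version's full ordered chunk list from the metadata service and keeps a local cache of chunks from earlier versions. `rebuild_file` and `fetch_chunk` are hypothetical names, and a dictionary stands in for the block service.

```python
from typing import Callable


def rebuild_file(ordered_hashes: list[str],
                 local_cache: dict[str, bytes],
                 fetch_chunk: Callable[[str], bytes]) -> tuple[bytes, list[str]]:
    """Rebuild a file from its chunk list, downloading only chunks not cached locally."""
    downloaded = []
    for chunk_hash in ordered_hashes:
        if chunk_hash not in local_cache:
            local_cache[chunk_hash] = fetch_chunk(chunk_hash)
            downloaded.append(chunk_hash)
    data = b"".join(local_cache[h] for h in ordered_hashes)
    return data, downloaded


# Stand-in for the block service: hash -> bytes (short fake hashes for readability)
remote = {"h1": b"AAAA", "h2": b"BBBB", "h3": b"bbbb"}
# The phone already holds the previous version [h1, h2]
cache = {"h1": b"AAAA", "h2": b"BBBB"}

data, downloaded = rebuild_file(["h1", "h3"], cache, remote.__getitem__)
print(downloaded, data)
```

Only the chunk that changed is fetched; the unchanged chunk comes from the local cache.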
