1. Requirements Clarification
Functional Requirements
- Upload, download, and delete files up to 50GB
- Sync files across multiple devices automatically
- File versioning — restore previous versions (up to 30 days)
- Folder sharing and collaboration
- Conflict detection and resolution
- Offline support — queue operations and sync on reconnect
Non-Functional Requirements
- Scale: 100 million users; average 10GB storage per user = 1 Exabyte total
- Upload throughput: Resume interrupted uploads; parallel chunk uploads
- Sync latency: Changes visible on other devices within 30 seconds
- Durability: 99.999999999% (S3 11 nines) for stored files
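The scale figure can be checked with a quick back-of-envelope calculation. The sketch below assumes decimal units (1 GB = 10^9 bytes) and borrows the 4MB chunk size used later in this guide (treated here as 4 × 10^6 bytes); it ignores deduplication and replication overhead.

```python
# Back-of-envelope capacity estimate, assuming decimal units.
USERS = 100_000_000        # 100 million users (from the requirements)
AVG_GB_PER_USER = 10       # average storage per user (from the requirements)

GB = 10**9
EB = 10**18
CHUNK_BYTES = 4 * 10**6    # assumption: 4MB chunk treated as 4 * 10^6 bytes

total_bytes = USERS * AVG_GB_PER_USER * GB
total_exabytes = total_bytes / EB
total_chunks = total_bytes // CHUNK_BYTES  # upper bound, before deduplication

print(f"logical storage: {total_exabytes:.1f} EB")
print(f"chunk records:   {total_chunks:,}")
```

The chunk count (on the order of hundreds of billions) is what the chunk index has to hold, which is why it is kept separate from the per-file metadata.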
2. High-Level Architecture
Client (Desktop / Mobile)
    │
    ├─── Metadata API ──▶ Metadata Service ──▶ MySQL (files, versions, chunks)
    │                            │
    │                     Redis (sync state,
    │                            notifications)
    │
    ├─── Upload Chunks ──▶ Block Service ──▶ S3 (raw binary chunks)
    │                            │
    │                     Chunk DB (SHA256 → S3 key)
    │
    └─── Sync Notification ◀── WebSocket / SSE
                               (notify other devices of changes)
3. Chunked Upload Design
Files are split into 4MB chunks on the client before upload. This enables resumable uploads, deduplication, and parallel transfer.
Upload Flow
- Client splits file: Divide file into 4MB chunks; compute SHA256 hash for each chunk
- Check which chunks exist: POST /api/chunks/check with list of SHA256 hashes. Server returns which hashes are already stored (dedup benefit).
- Upload missing chunks only: For each missing chunk, PUT /api/chunks/{sha256} to upload the binary data.
- Commit the file: POST /api/files with file path + ordered list of chunk hashes. Server creates the file record and version entry.
- Server notifies other devices via WebSocket: "file X updated, fetch new chunk list"
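The upload flow can be sketched end to end with an in-memory stand-in for the server. `FakeServer`, `upload_file`, and the method names are hypothetical; they only mirror the three endpoints listed above, and the demo uses a 4-byte chunk size so the behaviour is easy to follow.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4MB chunks, as described above


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split file bytes into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


class FakeServer:
    """Hypothetical in-memory stand-in for the chunk and file endpoints."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}           # sha256 -> chunk bytes
        self.files: dict[str, list[list[str]]] = {}  # path -> versions (hash lists)

    def check_chunks(self, hashes: list[str]) -> set[str]:
        # POST /api/chunks/check: report which hashes are already stored.
        return {h for h in hashes if h in self.chunks}

    def put_chunk(self, sha256: str, data: bytes) -> None:
        # PUT /api/chunks/{sha256}: verify the body against its claimed hash.
        if hashlib.sha256(data).hexdigest() != sha256:
            raise ValueError("chunk body does not match its hash")
        self.chunks[sha256] = data

    def commit_file(self, path: str, chunk_hashes: list[str]) -> int:
        # POST /api/files: record a new version as an ordered hash list.
        if any(h not in self.chunks for h in chunk_hashes):
            raise ValueError("commit references a chunk that was never uploaded")
        versions = self.files.setdefault(path, [])
        versions.append(list(chunk_hashes))
        return len(versions)  # version number, starting at 1


def upload_file(server: FakeServer, path: str, data: bytes,
                chunk_size: int = CHUNK_SIZE) -> tuple[int, int]:
    """Run the flow; return (version number, number of chunks actually sent)."""
    chunks = split_into_chunks(data, chunk_size)
    hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
    stored = server.check_chunks(hashes)
    uploaded = 0
    for digest, chunk in zip(hashes, chunks):
        if digest not in stored:
            server.put_chunk(digest, chunk)
            stored.add(digest)  # skip repeats of the same chunk within this file
            uploaded += 1
    return server.commit_file(path, hashes), uploaded


server = FakeServer()
first = upload_file(server, "/docs/notes.txt", b"aaaabbbbcccc", chunk_size=4)
second = upload_file(server, "/docs/notes.txt", b"aaaaXXXXcccc", chunk_size=4)
print(first, second)
```

The second upload sends only the one chunk that changed; the other two are found by the check call and skipped.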
Content-Addressable Storage (CAS)
Storing chunks by their SHA256 hash (not a UUID or sequential ID) is called content-addressable storage. The benefit: if two users store identical chunks (e.g. a popular PDF, a common video intro), the chunk is physically stored only once in S3. Dropbox reported that CAS reduced their S3 storage by ~40% in 2011. The SHA256 hash is both the key and the integrity check — tampering with the content would change the hash.
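A minimal sketch of a content-addressable store illustrates both properties: identical content maps to one stored object, and the key doubles as an integrity check. `ChunkStore` is a hypothetical name and a plain dict stands in for S3.

```python
import hashlib


class ChunkStore:
    """In-memory content-addressable store; a dict stands in for S3."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content always yields the same key, so it is stored once.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        # Re-hash on read: the key is also the integrity check.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("chunk failed integrity check")
        return data

    def __len__(self) -> int:
        return len(self._objects)


store = ChunkStore()
alice_key = store.put(b"bytes of a popular PDF")
bob_key = store.put(b"bytes of a popular PDF")  # a second user, same content
print(alice_key == bob_key, len(store))
```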
4. Delta Sync — Only Upload Changed Chunks
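When an existing file is modified, only the changed chunks need to be re-uploaded. The client retrieves the current chunk hash list from the server, compares it with the new chunk hashes, and uploads only the chunks whose hashes are new. The savings in the table assume the edit overwrites bytes in place. With fixed-size chunks, inserting or deleting bytes shifts every later chunk boundary, so all following chunks hash differently and must be re-uploaded (content-defined chunking is the usual remedy).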
| Scenario | File Size | Change Size | Without Delta Sync | With Delta Sync |
|---|---|---|---|---|
| Edit 1 line in Word doc | 200MB | ~4KB | Upload 200MB | Upload 1 chunk (4MB max) |
| Add slide to PowerPoint | 50MB | ~2MB | Upload 50MB | Upload 1 or 2 chunks (8MB max) |
| Rename file | Any | 0 bytes | Upload entire file | Upload 0 bytes (metadata only) |
| Video file re-encode | 2GB | All bytes change | Upload 2GB | Upload 2GB (no savings) |
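The comparison step amounts to a set difference over chunk hashes. The sketch below uses hypothetical helper names and a 4-byte chunk size for readability. It also shows a case the table does not cover: inserting a single byte at the front of a file shifts every fixed-size chunk boundary, so nothing matches.

```python
import hashlib


def chunk_hashes(data: bytes, chunk_size: int) -> list[str]:
    """SHA256 of each fixed-size chunk, in file order."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def chunks_to_upload(old_hashes: list[str], new_hashes: list[str]) -> list[str]:
    """Hashes in the new version that the previous version does not contain."""
    known = set(old_hashes)
    missing = []
    for digest in new_hashes:
        if digest not in known:
            missing.append(digest)
            known.add(digest)  # a repeated new chunk is uploaded only once
    return missing


old = chunk_hashes(b"AAAABBBBCCCCDDDD", 4)

in_place = chunk_hashes(b"AAAAbbbbCCCCDDDD", 4)   # overwrite inside one chunk
renamed = list(old)                               # rename: content unchanged
shifted = chunk_hashes(b"xAAAABBBBCCCCDDDD", 4)   # one byte inserted at the front

print(len(chunks_to_upload(old, in_place)),
      len(chunks_to_upload(old, renamed)),
      len(chunks_to_upload(old, shifted)))
```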
5. Metadata Database Schema
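The schema itself is not reproduced in this section. As a rough illustration only, the sketch below builds one possible layout from the names used elsewhere in this guide (files, file_versions, chunks, ref_count, is_deleted, modified_at). Every other table and column name is an assumption, and SQLite (via Python's sqlite3) stands in for MySQL so the example runs anywhere.

```python
import sqlite3

# Hypothetical schema sketch; only files, file_versions, chunks, ref_count,
# is_deleted and modified_at are names taken from the guide.
SCHEMA = """
CREATE TABLE files (
    file_id     INTEGER PRIMARY KEY,
    owner_id    INTEGER NOT NULL,
    path        TEXT    NOT NULL,
    is_deleted  INTEGER NOT NULL DEFAULT 0,          -- soft delete (Trash)
    UNIQUE (owner_id, path)
);
CREATE TABLE file_versions (
    file_id     INTEGER NOT NULL REFERENCES files(file_id),
    version     INTEGER NOT NULL,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,  -- server-assigned
    modified_at TEXT,                                -- client-reported, display only
    PRIMARY KEY (file_id, version)
);
CREATE TABLE chunks (
    sha256      TEXT PRIMARY KEY,                    -- content hash
    s3_key      TEXT NOT NULL,
    size_bytes  INTEGER NOT NULL,
    ref_count   INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE version_chunks (                        -- ordered hash list per version
    file_id     INTEGER NOT NULL,
    version     INTEGER NOT NULL,
    seq         INTEGER NOT NULL,                    -- chunk position in the file
    sha256      TEXT NOT NULL REFERENCES chunks(sha256),
    PRIMARY KEY (file_id, version, seq),
    FOREIGN KEY (file_id, version) REFERENCES file_versions(file_id, version)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

conn.execute("INSERT INTO files (file_id, owner_id, path) VALUES (1, 42, '/docs/report.docx')")
conn.execute("INSERT INTO file_versions (file_id, version) VALUES (1, 1)")
conn.executemany(
    "INSERT INTO chunks (sha256, s3_key, size_bytes, ref_count) VALUES (?, ?, ?, 1)",
    [("aaa", "chunks/aaa", 4194304), ("bbb", "chunks/bbb", 1024)],
)
conn.executemany(
    "INSERT INTO version_chunks (file_id, version, seq, sha256) VALUES (1, 1, ?, ?)",
    [(0, "aaa"), (1, "bbb")],
)

# Reassemble the ordered chunk list for version 1 of the file.
ordered = [row[0] for row in conn.execute(
    "SELECT sha256 FROM version_chunks WHERE file_id = 1 AND version = 1 ORDER BY seq")]
print(ordered)
```

Storing the hash list in a join table rather than inside the version row is a design choice of this sketch; it makes per-chunk reference counting a plain query at the cost of more rows.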
6. Conflict Resolution
Conflicts occur when two devices modify the same file while one is offline. There are three common strategies:
- Last-Write-Wins (LWW): The version with the most recent modified_at timestamp wins. Simple but loses data. Dropbox uses this as the base strategy.
- Conflicted Copy: Both versions are preserved. The "loser" is renamed to filename (Device's conflicted copy YYYY-MM-DD).ext. The user sees both and manually resolves. Dropbox's default behavior.
- 3-Way Merge: Only viable for plain text. Merge: (your version) + (their version) relative to (common ancestor). Git uses this for code files.
Watch Out — Clock Skew in Last-Write-Wins
Client clocks can be wrong by minutes or hours. If Device A has a clock 5 minutes ahead, its changes will always "win" LWW conflicts — even if Device B made more recent changes by wall-clock time. Solution: use server-assigned timestamps for version ordering, not client-reported modified_at times. The client's timestamp is stored for display purposes only.
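One way to keep client clocks out of the decision is to have each commit state which server version it was based on. The sketch below is an assumption-laden illustration: it uses a server-assigned version counter rather than a timestamp, and `MetadataServer` and `conflicted_copy_name` are hypothetical names. A commit based on a stale version is saved as a conflicted copy using the naming pattern described above.

```python
from datetime import date
from pathlib import PurePosixPath


def conflicted_copy_name(path: str, device: str, day: date) -> str:
    """Build 'filename (Device's conflicted copy YYYY-MM-DD).ext'."""
    p = PurePosixPath(path)
    new_name = f"{p.stem} ({device}'s conflicted copy {day.isoformat()}){p.suffix}"
    return str(p.with_name(new_name))


class MetadataServer:
    """Hypothetical server that orders versions itself, ignoring client clocks."""

    def __init__(self) -> None:
        self.head: dict[str, int] = {}  # path -> latest server-assigned version

    def commit(self, path: str, base_version: int, device: str,
               today: date) -> tuple[str, int]:
        current = self.head.get(path, 0)
        if base_version == current:
            # The client edited the latest version: accept as the next version.
            self.head[path] = current + 1
            return path, current + 1
        # The client edited a stale version: keep both by writing a copy.
        copy_path = conflicted_copy_name(path, device, today)
        self.head[copy_path] = 1
        return copy_path, 1


server = MetadataServer()
day = date(2024, 1, 15)
server.commit("/docs/report.docx", 0, "Laptop", day)             # version 1
laptop = server.commit("/docs/report.docx", 1, "Laptop", day)    # based on v1
phone = server.commit("/docs/report.docx", 1, "Phone", day)      # also based on v1
print(laptop)
print(phone)
```

Whichever commit reaches the server first wins the original path, regardless of what either device's clock says.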
7. File Versioning Strategy
Versioning stores the history of changes so users can restore previous versions. Because chunks are content-addressable, versioning is storage-efficient — unchanged chunks between versions are shared automatically.
- Every save creates a new file_versions row with the full chunk hash list
- Chunk storage is shared: only changed chunks consume new S3 space
- Retention policy: keep all versions for 30 days, then delete old versions (but keep chunks if still referenced by other versions)
- Chunk GC (garbage collection): when a chunk's ref_count drops to 0, schedule deletion from S3 (run GC job nightly)
- Soft-delete files: set is_deleted=1 and keep for 30 days in Trash before permanent deletion
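The interaction between version expiry and chunk garbage collection can be sketched with an in-memory reference counter. `VersionStore` and its methods are hypothetical; the point is only that expiring a version decrements counts, and a chunk becomes a deletion candidate once no version references it.

```python
from collections import Counter


class VersionStore:
    """Hypothetical in-memory model of version rows and chunk ref counts."""

    def __init__(self) -> None:
        self.versions: dict[tuple[str, int], list[str]] = {}
        self.ref_count: Counter = Counter()
        self.pending_gc: set[str] = set()

    def add_version(self, path: str, version: int, chunk_hashes: list[str]) -> None:
        self.versions[(path, version)] = list(chunk_hashes)
        for digest in set(chunk_hashes):      # count each chunk once per version
            self.ref_count[digest] += 1
            self.pending_gc.discard(digest)   # referenced again before GC ran

    def expire_version(self, path: str, version: int) -> None:
        for digest in set(self.versions.pop((path, version))):
            self.ref_count[digest] -= 1
            if self.ref_count[digest] == 0:
                del self.ref_count[digest]
                self.pending_gc.add(digest)   # schedule, do not delete inline

    def run_gc(self) -> set[str]:
        """Nightly job: return the chunk hashes to delete from object storage."""
        doomed, self.pending_gc = self.pending_gc, set()
        return doomed


store = VersionStore()
store.add_version("/docs/report.docx", 1, ["a", "b", "c"])
store.add_version("/docs/report.docx", 2, ["a", "B", "c"])  # only the middle chunk changed
store.expire_version("/docs/report.docx", 1)                # version 1 passes 30 days
deleted = store.run_gc()
print(deleted, dict(store.ref_count))
```

Only the chunk unique to the expired version is collected; the two shared chunks stay because version 2 still references them.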
8. Sync Notification Between Devices
When Alice saves a file on her laptop, her phone needs to download the updated version. This requires a real-time notification channel.
- After the metadata service commits a new file version, it publishes a file_changed event to Redis Pub/Sub
- Each device maintains a long-polling connection or WebSocket to the sync service
- The sync service subscribes to Redis and pushes events to connected devices
- The event contains: file_id, new version number, changed chunk hashes
- The device downloads only the changed chunks and reconstructs the file
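The fan-out described in these steps can be sketched without any network code. In the example below, `SyncService` is a hypothetical in-process stand-in for Redis Pub/Sub plus the WebSocket layer, and a plain list plays the role of each device's connection. The event fields follow the list above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileChanged:
    file_id: int
    version: int
    changed_chunks: tuple[str, ...]


class SyncService:
    """Hypothetical in-process stand-in for Redis Pub/Sub and WebSocket push."""

    def __init__(self) -> None:
        # user_id -> device_id -> that device's inbox of pending events
        self._inboxes: dict[str, dict[str, list[FileChanged]]] = defaultdict(dict)

    def connect(self, user_id: str, device_id: str) -> list[FileChanged]:
        inbox: list[FileChanged] = []
        self._inboxes[user_id][device_id] = inbox
        return inbox

    def publish(self, user_id: str, source_device: str, event: FileChanged) -> None:
        # Push to every connected device of the user except the one that saved.
        for device_id, inbox in self._inboxes[user_id].items():
            if device_id != source_device:
                inbox.append(event)


def chunks_to_fetch(local_chunks: set[str], event: FileChanged) -> list[str]:
    """Changed chunks the device does not already hold locally."""
    return [h for h in event.changed_chunks if h not in local_chunks]


sync = SyncService()
laptop_inbox = sync.connect("alice", "laptop")
phone_inbox = sync.connect("alice", "phone")

sync.publish("alice", "laptop", FileChanged(file_id=7, version=3, changed_chunks=("h2", "h5")))
to_fetch = chunks_to_fetch({"h1", "h2"}, phone_inbox[0])
print(len(laptop_inbox), len(phone_inbox), to_fetch)
```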
Frequently Asked Questions — File Storage System Design
Block storage (like AWS EBS) exposes a volume of fixed-size blocks that behaves like a virtual hard disk. It supports random read/write access and is used for databases and VMs. Object storage (like S3) stores entire files as objects with metadata and a unique key. It is accessed via HTTP API, optimised for sequential reads, and scales to petabytes easily. For a Dropbox-style system, we use object storage (S3) for file chunks but a relational database for metadata. This is because file metadata (paths, versions, sharing) benefits from SQL joins, while the actual binary data benefits from S3's durability and straightforward CDN integration.
Chunked upload splits a file into fixed-size pieces (typically 4MB per chunk) and uploads each chunk independently. This is necessary for three reasons: (1) Large file resilience: if a 2GB upload fails at 90%, only the chunks that had not finished uploading need to be sent, not the whole file. (2) Parallelism: multiple chunks can be uploaded simultaneously, maximizing bandwidth. (3) Deduplication: if the same 4MB chunk exists in any user's files (e.g., a popular PDF), it only needs to be stored once. The client first calls the API to check which chunks are missing before uploading.
Delta sync only uploads the changed portions of a file. When you edit a large Word document, only the modified 4MB chunks are re-uploaded — not the entire 200MB file. The client computes a SHA256 hash of each chunk and compares with the server-side hashes stored for the previous version. Chunks with matching hashes are skipped. For typical office document edits, delta sync reduces upload bandwidth by 90–99%.
Most file sync systems use last-write-wins (LWW) as the default: the most recently synced version overwrites the older one. To prevent data loss, Dropbox preserves the overwritten version as a "conflicted copy" with the hostname and timestamp in the filename (e.g., "report (John's conflicted copy 2026-06-13).docx"). True operational transform (like Google Docs) requires the file format to support granular change merging — not feasible for arbitrary binary files. For plain text, some systems use a 3-way merge (your changes + their changes + common ancestor).
Each file save creates a new version entry pointing to a set of chunk hashes. Because chunks are content-addressable (hash = content), unchanged chunks between versions share the same S3 object, with no duplication. Only changed chunks consume additional storage, and the unit of change is a whole chunk. For example, if you save a 100MB file 10 times and each save changes about 1MB that falls inside a single 4MB chunk, storage is roughly 136MB (100MB base + 9 × 4MB re-uploaded chunks), not 1,000MB. Implement a version retention policy: keep all versions for 30 days, as in the requirements above. If longer history is needed, it can be thinned to weekly and then monthly snapshots.
The metadata DB (MySQL/PostgreSQL) stores: file paths, ownership, folder hierarchy, version history, chunk hash lists, sharing permissions, and sync state. The file store (S3) stores only the raw binary chunks keyed by their SHA256 hash. This separation is critical because: (1) metadata queries (list files, search by name, check permissions) need SQL joins that S3 cannot provide; (2) the metadata DB can be replicated and indexed for fast queries; (3) S3 scales file storage independently of the database.