Data Systems
Basics
- The aim is to keep them: Reliable, Scalable and Maintainable
- Data Systems knowledge includes understanding caches, search indexes, stream/batch processing
Reliability
- Keeping data correctness, completeness across systems, performant and resilient.
- Fault vs Failure: Fault is a system failing to adhere to a design spec whereas failure means a system failure and requires failover backup procedures.
- It's better to measure the latency in percentiles.
Scalability
- For many backends, the response time is dependent on the slowest server response.
- Elastic vs manual scaling: one is automated.
The architecture for each large scale project is highly specific. It depends on factors such as:
- Volume of reads
- Volume of writes
- Type of data to stor
- Data complexity
- Response time
- Access patterns
Example: You could expect 100,000 req/s @ 1kB in size vs 3 req/s @ 2GB in size which equates to the same data throughput but very different requirements.
- Load parameters identify which requests are common vs which ones are rare.
Maintainability
Remember these three principles:
- Operability
- Simplicity (- think abstractions like SQL)
- Evolvability
Data Models
- Relational
- Document
- Graph (this is essentially a subtype of relational)
Pros of the Document Model
- Schema flexibility
- Better performance due to locality
- Apps are generally closer to the data structure
Storage + Retrieval
- Big difference in storage engines that are optimized for transactional vs analytical (think OLTP/OLAP)
- Indexes help find keys but generally slow down writes
- In-memory tress: examples red-black trees, AVL trees. We can make the storage engines to use these styles of trees.
- LSM Trees can be slow when looking up keys that don't exist
- Can be countered by "Bloom Filters" -> can tell if the key exists.
B-Trees
- Most common index used by databases.
- While log-structured indexes work in variable-size segments and written sequentially, B-Trees have fixed sized "blocks" or "pages" - this design corresponds to reflecting the underlying hardware.
- B-Tree variant "fractal trees" borrow log-structured ideas to reduce disk seeks.
B-Trees vs LSM-Trees
- B-Tree more mature
- LSM faster for writes
- B faster for reads
- B-Tree must write everything twice (write-ahead log and potential multi-writes to tree pages)
- Log structured indexes also rewrite data multiple times. This is known as "write amplification" and is a concern for SSDs.
- LSM trees compress files better
The downsides of LSM-Trees
- Compaction process can sometimes interfere with ongoing reads/writes.
Other notes
- Secondary index normally not unique (consider the use of
user_id
) - R-Trees normally used for optimised geospatial query efficiency.
- LSM-Trees themselves do not handle updating. They essentially can only append and remove the stale data. B-Trees can update in place.
OLAP vs OLTP
- A transaction needn't adhere to all principles of ACID. "Transaction" processing mean low latency read/writes.
- Commonly used with relational databases.
- Performance can be improved using vectorized processing (SIMD - single instruction multi data)
- Column compression can also make use of the L1 cache.
- Update-in-place like what B-Trees use are not possible with column compression.