The very first thing to do when faced with the problem: clarify the constraints and use cases.
Don't panic about correctness; write some numbers down!
Write the use cases on the board, e.g.:
(Basic features)
take a url => return a shorter one (shortening)
take a short => return the original (redirection)
(ask about additional features?)
customized url?
analytics?
automatic link expiration?
UI? API?
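The two basic operations above can be sketched with an in-memory store and base62-encoded auto-incrementing IDs. This is one common scheme, not the only one; the class and function names here are illustrative:

```python
import string

# 62-character alphabet: digits, then lowercase, then uppercase
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode(n: int) -> str:
    """Encode a numeric ID as a base62 string."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, r = divmod(n, 62)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

class Shortener:
    def __init__(self):
        self._next_id = 1
        self._by_code = {}  # short code -> original URL

    def shorten(self, url: str) -> str:
        """Take a URL, return a short code."""
        code = encode(self._next_id)
        self._next_id += 1
        self._by_code[code] = url
        return code

    def redirect(self, code: str) -> str:
        """Take a short code, return the original URL."""
        return self._by_code[code]
```

In a real system the counter and the mapping would live in a database, not in process memory.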
Write on board: the constraints, i.e. traffic and data.
Scale down to seconds! Estimate the traffic and data:
100M new urls per month
What can be cached? Assume 10% of traffic comes from shortening and 90% from redirection.
How long is a typical URL?
Always compute totals per month or per year first, then scale down to per-second rates!
Read-heavy? Write-heavy?
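The estimation above is just arithmetic; a sketch using the 100M-URLs/month figure and the 10%/90% traffic split from these notes (the per-record size is an assumption for illustration):

```python
# Back-of-the-envelope numbers for the URL shortener.
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.6M seconds

new_urls_per_month = 100_000_000
# 10% writes (shortening) / 90% reads (redirection) => 9 reads per write
writes_per_sec = new_urls_per_month / SECONDS_PER_MONTH  # ~39/s
reads_per_sec = writes_per_sec * 9                       # ~347/s

# Storage: assume ~500 bytes per record (long URL + short code + metadata)
bytes_per_record = 500
storage_per_year_gb = new_urls_per_month * 12 * bytes_per_record / 1e9  # ~600 GB

print(round(writes_per_sec), round(reads_per_sec), round(storage_per_year_gb))
```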
Step 2 - Abstract Design
Draw a sketch; ignore the details for now! Get the basic features working first!
Application service layer and data storage layer.
Step 3 - Understand Bottlenecks
Do we need a load balancer? Is the data so huge that we need to distribute the database?
Talk about the tradeoffs!!!
This step builds on steps 1 and 2: e.g. the data-per-second figure from step 1 can tell you whether the database will keep up, or whether scanning the whole database on every request would be too slow. Bottlenecks identified!!!
Step 4 - Actual Design of a Scalable System!
(next section)
Whenever the interviewer challenges your architectural choices, acknowledge that an idea is rarely perfect, and outline the advantages and disadvantages of your choice.
Adding a load balancer is not only for spreading traffic evenly, but also for absorbing spiky traffic!! Spin up extra servers during spikes and shut them down when traffic returns to normal.
Don't forget to mention that the architecture needs benchmarking, profiling, and load testing.
Fundamentals
(Scalability @Harvard)
(Scalability for Dummies)
(Database sharding)
And don't forget about monitoring:
sometimes there will be flash traffic when many people want to watch/read the same thing.
Examples
(highscalability blog)
Wrap-up
Practice
Scalability @Harvard
Vertical scaling
CPU cores, RAM, disks, etc.
Why is vertical scaling not a full solution? Real-world constraints...
Use multiple load balancers to avoid single node failure at the load balancer
(from HiredInTech) Adding a load balancer is not only for spreading traffic evenly, but also for absorbing spiky traffic!! Spin up extra servers during spikes and shut them down when traffic returns to normal. Also, when one server goes down, the load balancer routes traffic to the others (reliability).
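A minimal round-robin picker with a trivial health check illustrates both points (the server names and the `is_up` map are made up):

```python
from itertools import cycle

servers = ["web1", "web2", "web3"]
is_up = {"web1": True, "web2": False, "web3": True}  # web2 has failed

_rr = cycle(servers)

def pick_server() -> str:
    """Round-robin over servers, skipping any that are down."""
    for _ in range(len(servers)):
        s = next(_rr)
        if is_up[s]:
            return s
    raise RuntimeError("no healthy servers")
```

A real load balancer learns `is_up` from heartbeats/health checks rather than a static map.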
Caching
.html - caching whole pages as static .html files: redundancy, no templating; you must write <header> ... in every file, so site-wide changes mean lots of find-and-replace
MySQL query cache
memcached
a memory cache, which is itself a server
if the workload is read-heavy, this is very efficient!
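The memcached usage pattern boils down to cache-aside; a sketch where a plain dict stands in for memcached and `db_query` stands in for a real database call:

```python
cache = {}  # stands in for memcached

def db_query(key):
    # Pretend this is an expensive SQL query.
    return f"row-for-{key}"

def get(key):
    if key in cache:       # cache hit: no DB round-trip
        return cache[key]
    value = db_query(key)  # cache miss: hit the database...
    cache[key] = value     # ...and populate the cache for next time
    return value
```

With a 90%-read workload, most `get` calls take the hit path and never touch the database.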
DB replication
master-slave: good for read-heavy websites (reads go to the replicas)
master-master: either master can accept writes, and each can take over if the other fails
a load balancer can itself become a single point of failure
use heartbeats to detect liveness
DB partitioning
different multiple servers: mit.thefacebook.com, harvard.thefacebook.com
A-H to server 1, I-Z to server 2, etc.
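The alphabetical partitioning above is a one-liner (note that real name distributions make this scheme uneven, which is one reason hashing is usually preferred):

```python
def shard_for(name: str) -> int:
    """Route a user to a shard by the first letter of their name:
    A-H -> shard 1, I-Z -> shard 2."""
    first = name[0].upper()
    return 1 if first <= "H" else 2
```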
The lecture ends with a worked example; very good!
How do we sync between data centers?
Load balancing at the DNS level!
Scalability for Dummies link
part 1, replicas
part 2, database can become the bottleneck. Switch to NoSQL.
part 3, caching - in-memory caching like Redis, not file-based caching!
cache queries (but cached queries are hard to invalidate)
cache objects, which also makes asynchronous processing possible
part 4, asynchronism
async 1: cache dynamic content as pre-rendered HTML
async 2: pass computing-intensive tasks to a backend queue; the frontend periodically checks whether the job is done
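"async 2" can be sketched with a worker thread standing in for the backend fleet (the job IDs and status dict are illustrative):

```python
import queue
import threading
import uuid

jobs = {}             # job_id -> {"status": ..., "result": ...}
tasks = queue.Queue()

def worker():
    """Backend worker: pull tasks off the queue and record results."""
    while True:
        job_id, n = tasks.get()
        jobs[job_id]["result"] = n * n  # the "expensive" computation
        jobs[job_id]["status"] = "done"
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(n):
    """Frontend: enqueue a job and return an ID the client can poll with."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "pending", "result": None}
    tasks.put((job_id, n))
    return job_id

job = submit(12)
tasks.join()  # in reality the client polls jobs[job]["status"] instead
print(jobs[job]["status"], jobs[job]["result"])  # done 144
```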
Virtual nodes: a technique commonly used for sharding data in distributed systems. The scenario: there is a huge pile of data (maybe TBs or PBs) that must be partitioned by some field (the key) across a few dozen (or more) machines, while keeping the load balanced and the system easy to scale. The traditional approach is Hash(key) mod N; its biggest drawback is poor scalability: adding or removing a machine forces a full redistribution of the data, which is extremely expensive. Hence newer techniques were born. One is the DHT mentioned above, already adopted by many large systems; another is an improvement on "Hash(key) mod N": suppose the data must be spread across 20 machines. Instead of hash(key) mod 20, pick an N much larger than 20, say 20,000,000, and keep an extra table recording the range of bucket values each node stores, e.g. node1: 0~1,000,000; node2: 1,000,001~2,000,000; and so on. Then, when a new node is added, only part of the data on each existing node has to move to the new node, plus an update to the table.
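A sketch of that range-table scheme, using 1,000,000 buckets instead of 20,000,000 purely for illustration (node names and range boundaries are made up):

```python
import bisect
import hashlib

NUM_BUCKETS = 1_000_000

# Range table: sorted list of (last_bucket_in_range, node).
# Two nodes, each owning half the bucket space.
range_table = [(499_999, "node1"), (999_999, "node2")]

def bucket(key: str) -> int:
    """Hash a key into the large bucket space (stable across runs)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % NUM_BUCKETS

def node_for(key: str) -> str:
    """Look up which node owns this key's bucket."""
    b = bucket(key)
    ends = [end for end, _ in range_table]
    return range_table[bisect.bisect_left(ends, b)][1]

def add_node3():
    """Add node3 by splitting node2's range: only buckets
    750_000..999_999 move; everything on node1 stays put."""
    range_table[1] = (749_999, "node2")
    range_table.append((999_999, "node3"))
```

Compare with hash(key) mod 20: there, adding a 21st machine changes almost every key's placement; here, only the reassigned ranges move.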