Very first thing to do when faced with the problem: clarify the constraints and use case.
Don't panic about the correctness, write some numbers down!
Write on board: the use cases like
(Basic features)
take a url => return a shorter one (shortening)
take a short => return the original (redirection)
(ask some more augmented features?)
customized url?
analytics?
automatic link expiration?
UI? API?
Write on board, constraints. Traffic and Data
scale down to seconds! Estimate the traffic and data
100M new urls per month
What can be cached?? 10% from shortening and 90% from redirection
how long is url?
都是先算幾年或者幾個月有多少,然後scale到秒!
Ready heavy? Write heavy?
Step 2 - Abstract Design
draw sketch,不要管細節!先把基礎功能實作!
Application service layer 和 Data Storage Layer
Step 3 - Understand Bottlenecks
Need a load balancer? Data is so huge that need to distribute the database?
Talk about the tradeoff!!!
這一步要結合1和2,比如1中的data per sec,就可以用來估算資料庫是否有效?或者如果每次都要搜遍資料庫的話,會不會效率過低?Bottlenecks identified!!!
Step 4 - Actual Design of a Scalable System!
(next section)
whenever get challenged by interviewer about the architectural choices, acknowledge that rarely an idea is perfect, and outline the advantages and disadvantages of your choice
add load balancer is not only for amortize traffic, but also to deal with spiky traffic!! and shut them down when the traffic goes back to normal.
Don't forget to mention we need benchmark, profiling, and load test the architecture
Fundamentals
(Scalability @Harvard)
(Scalability for Dummies)
(Database sharding)
and don't forget about the monitor
sometimes there will be a flash traffic when many people want to watch/read the same thing
Examples
(highscalability blog)
Wrap-up
Practice
Scalability @Havard
Vertical scaling
CPU cores, RAM, disks, etc.
why not a full solution? Real-world constraints...
Use multiple load balancers to avoid single node failure at the load balancer
(from HiredInTech) add load balancer is not only for amortize traffic, but also to deal with spiky traffic!! and shut them down when the traffic goes back to normal. Also, when one server is down, use others (reliability)
Caching
.html - redundancy, no template, must write <header> ... lots of find and replace
MySQL query cache
memcached
memory cache, which is also a server
if read-heavy, then this is very efficient!
DB replication
master-slave, benefit for read-heavy websites
master-master,
load balancer can also generate a single-node failure
use heat beat to detect activity
DB partitioning
different multiple servers: mit.thefacebook.com, harvard.thefacebook.com
A-H to server1 I-Z to server 2, etc.
最後還有一段執行個體,很好~
Sync between data center?
load balancing on DNS level!
Scalability for Dummies link
part 1, replicas
part 2, database can become the bottleneck. Switch to NoSQL.
part 3, caching - in-memory caching like Redis, not file-based caching!
cache queries (but hard to remove)
cache objects, makes asynchronous processing possible
part 4, asynchronism
async 1: cache the dynamic content, pre-rendered HTML
async 2: (computing-intensive tasks) pass to backend, constantly check if the job is done
虛拟節點技術:該技術常用于分布式資料分片中。具體應用場景是:有一大坨資料(maybe TB級或者PB級),我們需按照某個字段(key)分片存儲到幾十(或者更多)台機器上,同時想盡量負載均衡且容易擴充。傳統的做法是:Hash(key) mod N,這種方法最大缺點是不容易擴充,即:增加或者減少機器均會導緻資料全部重分布,代價忒大。于是乎,新技術誕生了,其中一種是上面提到的DHT,現在已經被很多大型系統采用,還有一種是對“Hash(key) mod N”的改進:假設我們要将資料分不到20台機器上,傳統做法是hash(key) mod 20,而改進後,N取值要遠大于20,比如是20000000,然後我們采用額外一張表記錄每個節點存儲的key的模值,比如:node1:0~1000000;node2:1000001~2000000。。。。。。這樣,當添加一個新的節點時,隻需将每個節點上部分資料移動給新節點,同時修改一下這個表即可