1.Balls & Bins
- 定义:将m个balls 装入n 个不同的bins
- 球条件独立:m个球之间相互独立进入n个bins 互不影响,第i个球进入第j个盒子的概率为pij=1/n;p_{ij}=1/n;pij?=1/n;
- 盒子条件不独立但同分布:用Z={Z1,Z2,…,Zn}Z=\{Z_1,Z_2,\dots ,Z_n\}Z={ Z1?,Z2?,…,Zn?}表示每个盒子中装有的球数,当一个盒子装得多了,另外一个就少了,自然不独立的但同分布。
2.应用-生日迅论
将球盒模型应用到Birthday Paradox 这一问题上时,n个盒子代表N=365天,而班级的每个成员(m个球)对应的生日可以看作是球,由上图黄色标明,当成员大于盒子的根号N量级时,会以高概率出现双生缘
还有类似的应用:loading balancing 问题,球是多个任务,盒子是多个服务器,我们可以关注负载最大的服务器上的负载的量级
3几个重要结论
-
当m<nm< \sqrt{n}m<n? ,所有球分得很开,多数盒子为空
-
当m>θ(n)m>\theta (\sqrt{n})m>θ(n?),以高概率出现2个球在一个盒子中,即生日悖 论
形式化表示为:
当m?θ(n),?Zi≥2?Y=max?1≤i≤nZi,Y≥2且Pr(Y≥2)≥1??当 m\sim \theta (\sqrt{n}),\exists\quad Z_i\ge2 \\ \exists Y=\max_{1\leq i \leq n}Z_i,Y\ge2且Pr{(Y\ge 2)\ge1-\epsilon}当m?θ(n?),?Zi?≥2?Y=1≤i≤nmax?Zi?,Y≥2且Pr(Y≥2)≥1?? -
当m=n时,关注球数最多的盒子(负载最大的服务器上的负载数量)中球数以高概率服从ln n的数量级,形式化为:
当m=n,令Y=max?1≤i≤nZi,Y?θ(ln?nln?ln?n)withhighprobability当 m=n,\\ 令Y=\max_{1\leq i \leq n}Z_i,\\ Y\sim\theta(\frac{\ln n}{\ln \ln n}) \quad \quad with\quad high \quad probability当m=n,令Y=1≤i≤nmax?Zi?,Y?θ(lnlnnlnn?)withhighprobability -
当m=nln?nm=n\ln nm=nlnn可以说是满足大数定理,即每个盒子中的数量以高概率 为m/n=ln?nm/n=\ln nm/n=lnn,
这点证明参考Coupon Collector’s Problem高级算法设计
证明 当m=n时,关注球数最多的盒子(负载最大的服务器上的负载数量)中球数以高概率服从ln n的数量级,形式化为:
当m=n,令Y=max?1≤i≤nZi,Y?θ(ln?nln?ln?n)withhighprobability当 m=n,\\ 令Y=\max_{1\leq i \leq n}Z_i,\\ Y\sim\theta(\frac{\ln n}{\ln \ln n}) \quad \quad with\quad high \quad probability当m=n,令Y=1≤i≤nmax?Zi?,Y?θ(lnlnnlnn?)withhighprobability
即证Pr?(Y=cln?nln?ln?n)=1?O(1)\Pr(Y=c\frac{\ln n}{\ln \ln n})=1-O(1)Pr(Y=clnlnnlnn?)=1?O(1)
不妨使c=14c=\frac{1}{4}c=41?
即证Pr?(max?Zi≤4ln?nln?ln?n)=1?O(1)(1)\Pr(\max Z_i\leq 4\frac{\ln n}{\ln \ln n})=1-O(1)\quad \quad \quad (1)Pr(maxZi?≤4lnlnnlnn?)=1?O(1)(1)与Pr?(max?Zi≥14ln?nln?ln?n)=1?O(1)(2)\Pr(\max Z_i\geq \frac{1}{4}\frac{\ln n}{\ln \ln n})=1-O(1)\quad \quad \quad (2)Pr(maxZi?≥41?lnlnnlnn?)=1?O(1)(2)
3.1
证(1)等价于证(令t=ln?nln?ln?n令t=\frac{\ln n}{\ln \ln n}令t=lnlnnlnn?)
1?Pr?(max?Zi≤4t)=O(1)1-\Pr(\max Z_i\leq 4t)=O(1)1?Pr(maxZi?≤4t)=O(1)
等价于证
Pr?(max?Zi>4t)=O(1)那么需要求它的上界\Pr(\max Z_i>4t)=O(1)\quad \quad 那么需要求它的上界Pr(maxZi?>4t)=O(1)那么需要求它的上界
等价于证 Pr?(max?Zi>4t)=Pr?(Z1>4t?Z2>4t???Zn>4t)≤∑i=1nPr?(Zi>4t)不妨设第一个盒子球数最多max?Zi=Z1≤nPr?(Z1>4t)=O(1)\Pr(\max Z_i>4t)=\Pr(Z_1>4t\bigcup Z_2>4t\bigcup \dots \bigcup Z_n>4t)\\ \leq\sum_{i=1}^n \Pr(Z_i>4t) \quad 不妨设第一个盒子球数最多\max Z_i=Z_1\\ \leq n \Pr(Z_1>4t) =O(1) Pr(maxZi?>4t)=Pr(Z1?>4t?Z2?>4t???Zn?>4t)≤i=1∑n?Pr(Zi?>4t)不妨设第一个盒子球数最多maxZi?=Z1?≤nPr(Z1?>4t)=O(1)
3.1.1 先证Pr?(Z1>4t)=O(1)\Pr( Z_1>4t)=O(1)Pr(Z1?>4t)=O(1)即求其上界
物理意义为至少有4t+14t+14t+1个球落入第一个盒子,设球的序号为:(j1,j2,j3…,j4t+1)∈m=n(j_1,j_2,j_3\dots,j_{4t+1})\in m=n(j1?,j2?,j3?…,j4t+1?)∈m=n
Pr?(Z1>4t)=?j1…j4t+1∈nPr?(j1,j2...…j4t+1inbin1)≤∑j1…j4t+1∈nPr?(j1,j2...…j4t+1inbin1)=Cn4t+1(1n)4t+1其它的n?4t?1个球进入哪一个盒子并不关心\Pr( Z_1>4t)=\bigcup_{j_1\dots j_{4t+1}\in n} \Pr(j_1,j_2...\dots j_{4t+1} \quad in \quad bin1)\\ \leq \sum_{j_1\dots j_{4t+1}\in n} \Pr(j_1,j_2...\dots j_{4t+1} \quad in \quad bin1)\\= C_{n}^{4t+1}(\frac{1}{n})^{4t+1}\\ \quad \quad 其它的n-4t-1 个球进入哪一个盒子并不关心 Pr(Z1?>4t)=j1?…j4t+1?∈n??Pr(j1?,j2?...…j4t+1?inbin1)≤j1?…j4t+1?∈n∑?Pr(j1?,j2?...…j4t+1?inbin1)=Cn4t+1?(n1?)4t+1其它的n?4t?1个球进入哪一个盒子并不关心
利用公式(nm)m≤Cnm=n(n?1)(n?2)…(n?m+1)m(m?1)(m?2)…(1)≤(nem)m(?)(\frac{n}{m})^m\leq C_{n}^m=\frac{n(n-1)(n-2)\dots(n-m+1)}{m(m-1)(m-2)\dots(1)}\le(\frac{ne}{m})^m\quad (*)(mn?)m≤Cnm?=m(m?1)(m?2)…(1)n(n?1)(n?2)…(n?m+1)?≤(mne?)m(?)
3.1.1有
Pr?(Z1>4t)=Cn4t+1(1n)4t+1≤(1n)4t+1(ne4t+1)4t+1=(e4t+1)4t+1≈1t4t+1代入t=ln?nln?ln?n=(ln?ln?nlnn)4t+1≈(ln?nln?n)4t+1=(ln?n)?2t=e?2tln?ln?n=e?2ln?nln?ln?nln?ln?n=e?2ln?n=1n2\Pr( Z_1>4t)= C_{n}^{4t+1}(\frac{1}{n})^{4t+1}\\ \leq(\frac{1}{n})^{4t+1} (\frac{ne}{4t+1})^{4t+1} \\ = (\frac{e}{4t+1})^{4t+1}\\ \approx \frac{1}{t}^{4t+1} 代入t=\frac{\ln n}{\ln \ln n} \\ =(\frac{\ln \ln n}{\\ln n})^{4t+1}\\ \approx( \frac{\sqrt{\ln n}}{\ln n })^{4t+1}=(\ln n)^{-2t}=e^{-2t\ln \ln n}\\= e^{-2\frac{\ln n}{\ln \ln n}\ln \ln n}=e^{-2\ln n}=\frac{1}{n^2} Pr(Z1?>4t)=Cn4t+1?(n1?)4t+1≤(n1?)4t+1(4t+1ne?)4t+1=(4t+1e?)4t+1≈t1?4t+1代入t=lnlnnlnn?=(lnnlnlnn?)4t+1≈(lnnlnn??)4t+1=(lnn)?2t=e?2tlnlnn=e?2lnlnnlnn?lnlnn=e?2lnn=n21?
于是有
等价于证 Pr?(max?Zi>4t)≤nPr?(Z1>4t)=n1n2=1n=O(1)\Pr(\max Z_i>4t)\leq n \Pr(Z_1>4t) =n\frac{1}{n^2}=\frac{1}{n}=O(1) Pr(maxZi?>4t)≤nPr(Z1?>4t)=nn21?=n1?=O(1)
得证
3.2 证(2)
我们要 换一种思路(令t=ln?nln?ln?n令t=\frac{\ln n}{\ln \ln n}令t=lnlnnlnn?)
Pr?(max?Zi≥14ln?nln?ln?n)=1?O(1)(2)\Pr(\max Z_i\geq \frac{1}{4}\frac{\ln n}{\ln \ln n})=1-O(1)\quad \quad \quad (2)Pr(maxZi?≥41?lnlnnlnn?)=1?O(1)(2)
我们不写成如下的形式来证明,因为下式是在求上界,实际中我们应该求下界才是
Pr?(max?Zi>14t)=Pr?(Z1>14t?Z2>14t???Zn>14t)≤∑i=1nPr?(Zi>14t)不妨设第一个盒子球数最多max?Zi=Z1≤nPr?(Z1>14t)\Pr(\max Z_i>\frac{1}{4}t)=\Pr(Z_1>\frac{1}{4}t\bigcup Z_2>\frac{1}{4}t\bigcup \dots \bigcup Z_n>\frac{1}{4}t)\\ \leq\sum_{i=1}^n \Pr(Z_i>\frac{1}{4}t) \quad 不妨设第一个盒子球数最多\max Z_i=Z_1\\ \leq n \Pr(Z_1>\frac{1}{4}t) Pr(maxZi?>41?t)=Pr(Z1?>41?t?Z2?>41?t???Zn?>41?t)≤i=1∑n?Pr(Zi?>41?t)不妨设第一个盒子球数最多maxZi?=Z1?≤nPr(Z1?>41?t)
3.2.1 求Pr?(Z1>14t)\Pr(Z_1>\frac{1}{4}t)Pr(Z1?>41?t)的下界
Pr?(Z1>14t)=∑k=14nPr?(Z1=k)≥Pr?(Z1=14t+1)=Cn14t+1(1n)14t+1(1?1n)n?14t?1利用(1?1n)n?e?1和公式?≥Cn14t+1(1n)14t+1e?1≥(n14t+1)14t+1(1n)14t+1e?1=(114t+1)14t+1e?1≥(4ln?ln?nln?n+4ln?ln?n)14t+1e?1当n>e2时,有4ln?ln?n>2≥(2ln?n+4ln?ln?n)14t+1e?1再利用4ln?ln?n>2≥(1ln?n)14t+1e?1=(ln?n)?14(ln?nln?ln?n)?1e?1=e(?14(ln?nln?ln?n)?1)ln?ln?ne?1=e(?14(ln?n)?ln?lnn)e?1=n?141ln?ne?1=1n141ln?ne?1有n14?n13≥1n13\Pr(Z_1>\frac{1}{4}t)= \sum_{k=\frac{1}{4}}^n \Pr(Z_1=k)\\ \ge \Pr(Z_1=\frac{1}{4}t+1)=C_{n}^{\frac{1}{4}t+1}(\frac{1}{n})^{\frac{1}{4}t+1}(1-\frac{1}{n})^{n-\frac{1}{4}t-1} \\利用(1-\frac{1}{n})^{n}\sim e^{-1}和 公式*\\ \geq C_{n}^{\frac{1}{4}t+1}(\frac{1}{n})^{\frac{1}{4}t+1}e^{-1}\\ \geq(\frac{n}{\frac{1}{4}t+1})^{\frac{1}{4}t+1}(\frac{1}{n})^{\frac{1}{4}t+1}e^{-1}\\ =(\frac{1}{\frac{1}{4}t+1})^{\frac{1}{4}t+1}e^{-1}\\ \geq(\frac{4\ln \ln n}{\ln n+4\ln \ln n})^{\frac{1}{4}t+1}e^{-1} \quad 当n>e^2 时,有4\ln \ln n>2\\ \geq(\frac{2}{\ln n+4\ln \ln n})^{\frac{1}{4}t+1}e^{-1}\quad 再利用4\ln \ln n>2\\ \geq(\frac{1}{\ln n})^{\frac{1}{4}t+1}e^{-1}=(\ln n)^{-\frac{1}{4}(\frac{\ln n}{\ln \ln n})-1}e^{-1}\\ =e^{(-\frac{1}{4}(\frac{\ln n}{\ln \ln n})-1)\ln \ln n}e^{-1}\\ =e^{(-\frac{1}{4}(\ln n)-\ln ln n)}e^{-1}\\ ={n^{-\frac{1}{4}} \frac{1}{\ln n}}e^{-1}={ {\frac{1}{n^\frac{1}{4}}} \frac{1}{\ln n}}e^{-1} \quad 有n^{\frac{1}{4} }\ll n^{\frac{1}{3} }\\ \geq{\frac{1}{n^\frac{1}{3}}} Pr(Z1?>41?t)=k=41?∑n?Pr(Z1?=k)≥Pr(Z1?=41?t+1)=Cn41?t+1?(n1?)41?t+1(1?n1?)n?41?t?1利用(1?n1?)n?e?1和公式?≥Cn41?t+1?(n1?)41?t+1e?1≥(41?t+1n?)41?t+1(n1?)41?t+1e?1=(41?t+11?)41?t+1e?1≥(lnn+4lnlnn4lnlnn?)41?t+1e?1当n>e2时,有4lnlnn>2≥(lnn+4lnlnn2?)41?t+1e?1再利用4lnlnn>2≥(lnn1?)41?t+1e?1=(lnn)?41?(lnlnnlnn?)?1e?1=e(?41?(lnlnnlnn?)?1)lnlnne?1=e(?41?(lnn)?lnlnn)e?1=n?41?lnn1?e?1=n41?1?lnn1?e?1有n41??n31?≥n31?1?
3.2.2 求Pr?(Y≥14t)\Pr(Y\geq \frac{1}{4}t)Pr(Y≥41?t)的下界
即证 Pr?(max?Zi≥14ln?nln?ln?n)=1?O(1)(2)\Pr(\max Z_i\geq \frac{1}{4}\frac{\ln n}{\ln \ln n})=1-O(1)\quad \quad \quad (2)Pr(maxZi?≥41?lnlnnlnn?)=1?O(1)(2)
定义变量{X1,X2,…,Xn}\{X_1,X_2,\dots, X_n\}{
X1?,X2?,…,Xn?}
Xi={1ifZi>14t0othersX_i=\left\{ \begin{aligned} 1 & & \ if \quad Z_i>\frac {1}{4}t\\ 0 & & \ others\\ \end{aligned} \right. Xi?=????10?? ifZi?>41?t others?
令pi=Pr?(Xi=1)=Pr?(Zi>14t)E(Xi)=piVar(Xi)=pi(1?pi)≤1/4X=X1+X2+?+XnE(X)=E(X1+X2+?+Xn)=nE(Xi)=nPr?(Zi>14t)≥n23Var(X)=Var(X1+X2+?+Xn)≤nVar(Xi)=14n)令p_i=\Pr(X_i=1)=\Pr( Z_i>\frac {1}{4}t) \quad \\ E(X_i)=p_i\\ Var(X_i)=p_i(1-p_i)\leq1/4\\ X=X_1+X_2+\dots +X_n \\E(X)=E(X_1+X_2+\dots +X_n)=nE(X_i)=n\Pr( Z_i>\frac {1}{4}t)\ge n^{\frac{2}{3}}\\ Var(X)=Var(X_1+X_2+\dots +X_n)\leq nVar(X_i)=\frac {1}{4}n) 令pi?=Pr(Xi?=1)=Pr(Zi?>41?t)E(Xi?)=pi?Var(Xi?)=pi?(1?pi?)≤1/4X=X1?+X2?+?+Xn?E(X)=E(X1?+X2?+?+Xn?)=nE(Xi?)=nPr(Zi?>41?t)≥n32?Var(X)=Var(X1?+X2?+?+Xn?)≤nVar(Xi?)=41?n)
证方差时用到的公式是Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)当X,YX,YX,Y负相关时,Var(X+Y)≤Var(X)+Var(Y)Var(X+Y)\leq Var(X)+Var(Y)Var(X+Y)≤Var(X)+Var(Y),上面XiX_iXi?间满足负相关,当Z_i 大时,其它盒子装的球就会少。
证下式 Pr?(max?Zi>14t)=Pr?(Z1>14t?Z2>14t???Zn>14t)=Pr?(X1=1?X2=1???Xn=1)=1?Pr?(X1=0?X2=0???Xn=0)=1?Pr?(∑i=1nXi)=1?Pr?(X=0)=1?O(1)\Pr(\max Z_i>\frac{1}{4}t)=\Pr(Z_1>\frac{1}{4}t\bigcup Z_2>\frac{1}{4}t\bigcup \dots \bigcup Z_n>\frac{1}{4}t)\\ =\Pr(X_1=1\bigcup X_2=1\bigcup \dots \bigcup X_n=1)\\ =1-\Pr(X_1=0\bigcap X_2=0\bigcap \dots \bigcap X_n=0)\\ =1-\Pr(\sum_{i=1}^nX_i) =1-\Pr(X=0)=1-O(1) Pr(maxZi?>41?t)=Pr(Z1?>41?t?Z2?>41?t???Zn?>41?t)=Pr(X1?=1?X2?=1???Xn?=1)=1?Pr(X1?=0?X2?=0???Xn?=0)=1?Pr(i=1∑n?Xi?)=1?Pr(X=0)=1?O(1)
等价于证 Pr?(X=0)=O(1),即证其上界为小O(1)\Pr(X=0)=O(1),即证其上界为小O(1) Pr(X=0)=O(1),即证其上界为小O(1)
X=X1+X2+?+XnX=X_1+X_2+\dots +X_nX=X1?+X2?+?+Xn?
Pr?(X=0)=Pr?(X?E(X)=?E(X))≤Pr?(∣X?E(X)∣=∣E(X)∣)利用切比雪夫不等式≤Var(X)E2(X)(3)\Pr(X=0)=\Pr(X-E(X)=-E(X))\\ \leq \Pr(|X-E(X)|=|E(X)|)\quad 利用切比雪夫不等式\\ \leq \frac{Var(X)}{E^2(X)} \quad \quad \quad (3)Pr(X=0)=Pr(X?E(X)=?E(X))≤Pr(∣X?E(X)∣=∣E(X)∣)利用切比雪夫不等式≤E2(X)Var(X)?(3)
因此有
Pr?(X=0)≤Var(X)E2(X)=n/4n43=141n13?O(1)\Pr(X=0) \leq \frac{Var(X)}{E^2(X)} =\frac{n/4}{n^{\frac{4}{3}}}=\frac{1}{4} \frac{1}{n^{\frac{1}{3}}}\sim O(1)Pr(X=0)≤E2(X)Var(X)?=n34?n/4?=41?n31?1??O(1)
得证