Coupon collecting singletons with equal probability
In collecting $n$ coupons with each type being equally likely, what is the expected number of singletons, aka loners?
Related Problem 1
Flipping coins, what is the probability you get only one head on the quest to get both a head and a tail? One way to solve this is by looking at the four equally likely outcomes of the first two flips.
$$P(E) = \frac{1}{4}\bigg[P(E|HH) + P(E|HT)+ P(E|TH)+P(E|TT)\bigg] = \frac{1}{4}\bigg[0+1+1+P(E|TT)\bigg]$$Note that $P(E|TT) = 1$ because either you get a head on the third flip and win the game, or you keep getting tails until you finally get one head and win then. Really, the only way you lose is the outcome HH, which has probability $\frac{1}{4}$, so $P(E) = 1- \frac{1}{4} = \frac{3}{4}$.
An alternative way of solving this problem is to create a tracking array of size 2 which records new outcomes only, so 'HHTHT' becomes 'HT'. Given that the first element of this array is 'H', you win as long as you don't get another 'H' before you get a 'T'. Create a new tracking array of length 2 for what comes next; you need this new tracking array to be 'TH'. Since 'HT' and 'TH' are equally likely, given that the first element is 'H' you have a $\frac{1}{2}$ probability of the new tracking array being 'TH', and therefore of the original tracking array being 'HT(H)' and winning. Next, given that you start with a 'T', create a new tracking array of length two; since both 'HT' and 'TH' lead to victory, the conditional probability is 1. Therefore, $P(E) = \frac{1}{2}(\frac{1}{2} + 1) = \frac{1}{2}\cdot\frac{3}{2} = \frac{3}{4}$
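As a sanity check on the $\frac{3}{4}$, here is a minimal Python simulation of the coin game (the function name and trial count are my own choices):

```python
import random

def one_head_only():
    """Flip a fair coin until both H and T have appeared;
    return True if exactly one head was flipped along the way."""
    seen = set()
    heads = 0
    while seen != {"H", "T"}:
        flip = random.choice("HT")
        seen.add(flip)
        if flip == "H":
            heads += 1
    return heads == 1

random.seed(0)
trials = 100_000
p_hat = sum(one_head_only() for _ in range(trials)) / trials
print(p_hat)  # should land near 3/4
```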
Related Problem 2
Rolling a die, what is the probability that you get outcomes 2 through 6 without getting any 1s in the process? Equivalently, as you are rolling a die, what is the probability that once you get a 1, you get 2 through 6 without getting a 1 again? Call this event $E_1$.
The probability of this event is $\frac{1}{6}$. First create a tracking array of length $5+1$: five slots for the outcomes you still need to collect, plus one more for the 1 you already got. You add the 1 back in because you are concerned about duplicates. The tracking array records only new die outcomes that haven't occurred already, so for example '3325134436' translates to '325146'; as you can see, you don't care how many rolls it takes to collect all the outcomes. (For the number of rolls, see the generic coupon collecting problem.) This new array of length $5+1=6$ has $6!$ equally likely arrangements, and to avoid a duplicate 1 you need the 1 at the end, which happens in $5!$ of them. Therefore $P(E_1) = \frac{5!}{6!} = \frac{1}{6}$.
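A quick Python sketch of this event (names are mine) confirms the $\frac{1}{6}$:

```python
import random

def collect_2_to_6_before_any_1():
    """Roll a fair die; succeed if faces 2..6 all appear before the first 1."""
    need = {2, 3, 4, 5, 6}
    while need:
        roll = random.randint(1, 6)
        if roll == 1:
            return False
        need.discard(roll)
    return True

random.seed(1)
trials = 100_000
p_hat = sum(collect_2_to_6_before_any_1() for _ in range(trials)) / trials
print(p_hat)  # should land near 1/6
```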
Related Problem 3
Rolling a die, what is the probability that you get 1 only once in the quest to get all 6 outcomes? Let's call this event $E_2$. This is different from Related Problem 2 because there, once you got a 1, you still had a lot of work left (getting 2 through 6). In this problem, once you get a 1, you might already have the other numbers in your collection. Therefore $P(E_1) \le P(E_2)$. You can get $P(E_2)$ by conditioning on where the 1 lands in your tracking array; let $O_i$ be the event that 1 is the $i$th element. For example, if the first element of your tracking array is 1, then this reduces to Related Problem 2 and $P(E_2 | O_1) = P(E_1) = \frac{1}{6}$. If you get the 1 as the last part of your tracking array, then $P(E_2 | O_6) = 1$ since there are no more die rolls and you are assured that there is only one 1. Since each $O_i$ has equal probability $\frac{1}{6}$, $P(E_2) = \frac{1}{6} \bigg[\frac{1}{6}+P(E_2|O_2)+...+1\bigg]$.
In general, $P(E_2|O_i)$ can be determined by creating a new tracking array. You still have $n-i$ outcomes to collect, and you don't want to get 1 again in the rolls you have left, so you want this new tracking array to end in 1. There are $(n-i+1)!$ arrangements of the new tracking array, of which $(n-i)!$ have 1 at the end; therefore $P(E_2|O_i) = \frac{1}{n-i+1}$ and therefore
$$P(E_2) = \frac{1}{6} \bigg[\frac{1}{6}+\frac{1}{5}+...+\frac{1}{1}\bigg] = \frac{49}{120} \approx 0.4083$$I verified this result in Python.
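The verification can be done with a short simulation like the one below (the helper name is my own), compared against the exact $\frac{49}{120}$ computed with rational arithmetic:

```python
import random
from fractions import Fraction

def exactly_one_1():
    """Roll until all six faces have appeared; succeed if 1 showed up exactly once."""
    seen = set()
    ones = 0
    while len(seen) < 6:
        roll = random.randint(1, 6)
        seen.add(roll)
        if roll == 1:
            ones += 1
    return ones == 1

random.seed(2)
trials = 100_000
p_hat = sum(exactly_one_1() for _ in range(trials)) / trials

# (1/6) * (1/6 + 1/5 + ... + 1/1) = 49/120
exact = Fraction(1, 6) * sum(Fraction(1, k) for k in range(1, 7))
print(p_hat, float(exact))  # both near 0.4083
```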
Actual Problem now
Now the actual problem: in collecting $n$ coupons with each type being equally likely, what is the expected number of singletons, aka loners?
First create a tracking array of length $n$ with elements $T_1$ through $T_n$. Then create indicator variables $A_i$, one for each $T_i$, that equal one if the $i$th kind of coupon collected ends up being a singleton. Then $X = A_1 + ... + A_n$ is the total number of loners, and $E[X] = \sum_{i=1}^{n} E[A_i] = \sum_{i=1}^{n} P(A_i = 1)$, where the last equality holds because the $A_i$'s are Bernoulli.
To evaluate $P(A_i = 1)$, note that at the time you get type $i$, you have never gotten type $i$ before that pull. Therefore you only have to worry about not getting type $i$ again before you collect the remaining $n-i$ types. So create a new tracking array of length $n-i+1$, where you add the $i$th type back in because you care about duplicates. There are $(n-i+1)!$ arrangements of this new tracking array, of which $(n-i)!$ are good since the $i$th type must be at the end. Therefore $P(A_i = 1) = \frac{1}{n-i+1}$. Think about different examples for $i$ and this will make sense to you.
For example, if $i=n$ then $P(A_n = 1)=1$, which makes sense. If $i=1$ then $P(A_1=1) = \frac{1}{n}$, which is the same as the die outcome collecting problem.
Therefore $E[X] = \sum_{i=1}^{n} \frac{1}{n-i+1} = \sum_{i=1}^{n} \frac{1}{i}$, where the last equality follows from writing out the terms and reversing their order.
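A small simulation of the full coupon run (function name mine) agrees with $E[X] = \sum_{i=1}^{n}\frac{1}{i}$; for a six-sided die that harmonic sum is $\frac{49}{20} = 2.45$:

```python
import random
from fractions import Fraction

def count_singletons(n):
    """Collect coupons (each of n types equally likely) until a full set;
    return how many types were drawn exactly once."""
    counts = [0] * n
    while min(counts) == 0:
        counts[random.randrange(n)] += 1
    return sum(c == 1 for c in counts)

random.seed(3)
n, trials = 6, 50_000
mean_hat = sum(count_singletons(n) for _ in range(trials)) / trials

# exact answer: the harmonic number H_n
harmonic = sum(Fraction(1, k) for k in range(1, n + 1))
print(mean_hat, float(harmonic))  # both near 2.45
```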
(*) To calculate the variance we will use $Var(X) = E[X^2] - E[X]^2$, and to get there we will calculate $E[X^2]$ from the formula $E[X^2] = 2\sum_{i < j}P(A_iA_j)+E[X]$, which comes from ${X\choose2} = \sum_{i<j} A_iA_j$ together with $X^2 = X + 2{X\choose2}$.
We have to determine the probability that the $i$th and $j$th types of coupons collected are both singletons, i.e. calculate $P(A_iA_j)$ for $i < j$. We can calculate this by breaking it down into two stages: first, the $i$th type of coupon collected must remain a complete loner until the $j$th type is collected; second, neither the coupon collected at $T_i$ nor the one at $T_j$ can be seen again after the $j$th type is collected, until the remaining $n-j$ types are collected and we have a full set.
To this end we will condition on the event $S_{i,j}$, the event that the $i$th new type of coupon selected is picked only once before the $j$th type is collected. That is, there are no duplicates of whatever type ($T_i$) was collected at the $i$th spot of the tracking array before the $j$th new type appears. After the $j$th type, anything goes: for example, you can get duplicates of the types at $T_i$ and $T_j$ after the $j$th type is collected, i.e. remember $S_{i,j} \neq A_iA_j$. Therefore,
$$P(A_iA_j) = P(A_iA_j|S_{i,j})P(S_{i,j})+P(A_iA_j|S_{i,j}^c)P(S_{i,j}^c) = P(A_iA_j|S_{i,j})P(S_{i,j})$$For $P(S_{i,j})$, note that at the moment the $i$th type of coupon is collected, this coupon has necessarily not shown up at all from the starting pick until then. You then have $n-i$ more types to collect before the experiment ends, and the question is when the $i$th type of coupon will show up again. Create a new tracking array with $n-i+1$ types; the duplicate of the $i$th type can land in any of these spots with equal probability, from being the very next type after you just collected the $i$th coupon to coming after the very last new coupon of the $n-i+1$. The only thing that is not okay is for $T_i$ (whatever the $i$th new type of coupon is) to be duplicated before $T_j$, the $j$th new coupon, is selected. Therefore $P(S_{i,j}) = 1- \frac{j-i}{n-i+1}$: there are $n-i+1$ spots of equal probability, and the $j-i$ spots before the $j$th new type are not okay and would violate the event $S_{i,j}$.
To see this better, use examples like $n=12$, $i=4$, $j=8$. There are 3 regions: the first is guaranteed not to have duplicates of the $i$th type collected, the very last doesn't matter for $S_{i,j}$ since you can have duplicates of either the $i$th or $j$th newest type there, and only the middle region can be an issue. Also look at events like $n=12$, $i=11$, $j=12$, which has $P(S_{11,12}) = 1- \frac{1}{2} = \frac{1}{2}$. This makes sense because you just don't want to pick the second-to-last type collected any more before the final type is collected, and by symmetry this is one half. Also $n=12$, $i=1$, $j=2$, which has $P(S_{1,2}) = 1-\frac{1}{12} = \frac{11}{12}$: $P(S_{1,2})$ is just the probability that you don't get the very first type picked again right away, which means any of the other $n-1$ types out of $n$ is acceptable.
Now $P(A_iA_j|S_{i,j})$ is the probability that neither the $i$th nor the $j$th type of coupon collected shows up again while hunting the last $n-j$ coupons. Create a new tracking array of length $n-j+2$ (where the $+2$ comes from tacking on the $i$th and $j$th types of coupons collected along with the remaining ones you don't have yet) and note that there are $(n-j+2)!$ arrangements, of which $2(n-j)!$ are acceptable. The 2 comes from the fact that the $i$th and $j$th types are interchangeable as duplicates occurring after you get the last $n-j$ new types. Therefore,
$$P(A_iA_j|S_{i,j})P(S_{i,j}) =\bigg[1- \frac{j-i}{n-i+1}\bigg]\frac{2}{n-j+1}\frac{1}{n-j+2} = \frac{n-j+1}{n-i+1}\frac{2}{n-j+1}\frac{1}{n-j+2}$$ $$= \frac{2}{(n-i+1)(n-j+2)} = P(A_iA_j)$$Now you can get $Var(X)$ by plugging in; check (*) above. Note that the result of the plug-and-chug will still carry sigma notation.
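The plugging-in can be automated. The sketch below (function name mine) computes the exact $E[X]$ and $Var(X)$ with rational arithmetic, using $P(A_i=1) = \frac{1}{n-i+1}$ and $P(A_iA_j) = \frac{2}{(n-i+1)(n-j+2)}$:

```python
from fractions import Fraction

def singleton_moments(n):
    """Exact mean and variance of the number of singleton coupon types."""
    # E[X] = sum of P(A_i = 1) = 1/(n-i+1)
    e_x = sum(Fraction(1, n - i + 1) for i in range(1, n + 1))
    # sum over i < j of P(A_i A_j) = 2 / ((n-i+1)(n-j+2))
    cross = sum(
        Fraction(2, (n - i + 1) * (n - j + 2))
        for i in range(1, n + 1)
        for j in range(i + 1, n + 1)
    )
    e_x2 = 2 * cross + e_x          # E[X^2] = 2*sum_{i<j} P(A_i A_j) + E[X]
    return e_x, e_x2 - e_x ** 2     # (E[X], Var(X))

e, v = singleton_moments(6)
print(e, v)  # 49/20 and 5887/3600
```

For $n=2$ this gives $E[X]=\frac{3}{2}$ and $Var(X)=\frac{1}{4}$, which you can verify by hand: the last type collected is always a singleton, and the first type is a singleton with probability $\frac{1}{2}$.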