Question about the latent variable in EM algorithm

In mixture models, the expectation-maximization (EM) algorithm is a commonly used method to estimate the model parameters. Suppose that I have a bivariate mixture model with two mixture components, with mixture weights $\pi_1$ and $\pi_2$, respectively. EM introduces a variable $z$ which takes the value 1 if the point comes from the first mixture component, and 0 otherwise. These variables are assumed to be i.i.d. and distributed from a multinomial($\pi_1$, $\pi_2$) distribution. Is that correct? If yes, how can they take only the values 0 and 1 and at the same time have a multinomial distribution?
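For illustration, here is a minimal simulation sketch (assuming Gaussian components and made-up parameter values, just to fix ideas): one multinomial draw per observation gives a one-hot vector, which carries the same information as a 0/1 label.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-component bivariate mixture (illustrative values only).
pi = np.array([0.3, 0.7])            # mixture weights (pi_1, pi_2)
mu = np.array([[0.0, 0.0],           # mean of component 1
               [3.0, 3.0]])          # mean of component 2

n = 5
# One multinomial draw per observation: each z_i is a one-hot vector of length 2,
# i.e. exactly one entry equals 1 and the other equals 0.
z = rng.multinomial(1, pi, size=n)            # shape (n, 2)
labels = z.argmax(axis=1)                     # equivalent 0/1 component labels
x = np.array([rng.multivariate_normal(mu[k], np.eye(2)) for k in labels])

print(z)        # rows such as [1 0] or [0 1]
print(labels)   # the same membership information as 0/1 values
```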



Here is the paragraph from the source: here is the link



3.2. EM algorithm



In this section, we describe the EM algorithm (Dempster et al., 1977) to obtain the estimates for the parameters $\theta$ in a
mixture of $M$-component D-vine densities, given the data set and the number of components $M$. The determination of $M$
will be discussed later in Section 3.3.
Assume that $N$ observations, say $x_k = (x_{k,1}, \ldots, x_{k,N})$ where $k = 1, \ldots, d$, are drawn randomly from an $M$-component mixture.



Let us denote latent variables $z_n = (z_{n1}, \ldots, z_{nm}, \ldots, z_{nM})$, where $z_{nm} = 1$ if $x_n$ comes from the $m$-th component and
$z_{nm} = 0$ otherwise. Assume that $z_n$ is independent and identically distributed from a multinomial distribution, that is,
$z_n \sim \mathrm{Mult}(M, \pi = (\pi_1, \ldots, \pi_M))$.



Any help, please?










  • @Xi'an thank you so much for your answer, I added the link.
    – Maryam
    28 mins ago










  • @Xi'an Ok. Sorry for the typo.
    – Maryam
    26 mins ago
















maximum-likelihood expectation-maximization mixture latent-variable finite-mixture-model






2 Answers

There is a lot of confusion in the question, confusion that could be reduced by looking at a textbook on the topic, or even the original 1977 paper by Dempster, Laird and Rubin.



Here is an excerpt of our book, Introducing Monte Carlo Methods with R, followed by my answer:




Assume that we observe $X_1, \ldots, X_n$, jointly distributed from $g(\mathbf x|\theta)$ that satisfies
$$
g(\mathbf x|\theta)=\int_{\mathcal Z} f(\mathbf x, \mathbf z|\theta)\, \text{d}\mathbf z,
$$

and that we want to compute $\hat\theta = \arg\max L(\theta|\mathbf x)= \arg\max g(\mathbf x|\theta)$.
Since the augmented data is $\mathbf z$, where $(\mathbf X, \mathbf Z) \sim f(\mathbf x,\mathbf z| \theta)$,
the conditional distribution of the missing data $\mathbf Z$ given the observed data $\mathbf x$ is
$$
k(\mathbf z| \theta, \mathbf x) = f(\mathbf x, \mathbf z|\theta)\big/g(\mathbf x|\theta)\,.
$$

Taking the logarithm of this expression leads to the following relationship between the complete-data likelihood $L^c(\theta|\mathbf x, \mathbf z)$ and the observed-data likelihood $L(\theta|\mathbf x)$. For any value $\theta_0$,
$$
\log L(\theta|\mathbf x)= \mathbb{E}_{\theta_0}[\log L^c(\theta|\mathbf x,\mathbf Z)]
-\mathbb{E}_{\theta_0}[\log k(\mathbf Z| \theta, \mathbf x)],\qquad(1)
$$
where the expectation is with respect to $k(\mathbf z| \theta_0, \mathbf x)$. In the EM algorithm, while we aim at maximizing $\log L(\theta|\mathbf x)$, only the first term on the right side of (1) will be considered.



Denoting
$$
Q(\theta |\theta_0, \mathbf x) = \mathbb{E}_{\theta_0}[\log L^c(\theta|\mathbf x,\mathbf Z)],
$$

the EM algorithm indeed proceeds iteratively by maximizing $Q(\theta |\theta_0, \mathbf x)$ at each iteration and, if $\hat\theta_{(1)}$ is the value of $\theta$ maximizing $Q(\theta |\theta_0, \mathbf x)$, by replacing $\theta_0$ by the updated value $\hat\theta_{(1)}$. In this manner, a sequence of estimators $\{\hat\theta_{(j)}\}_j$ is obtained, where $\hat\theta_{(j)}$ is defined as the value of $\theta$ maximizing $Q(\theta |\hat\theta_{(j-1)}, \mathbf x)$; that is,
$$
Q(\hat\theta_{(j)} |\hat\theta_{(j-1)}, \mathbf x) = \max_\theta\, Q(\theta |\hat\theta_{(j-1)}, \mathbf x).
$$
This iterative scheme thus contains both an expectation step and a maximization step, giving the algorithm its name.



EM Algorithm
Pick a starting value $\hat\theta_{(0)}$ and set $m=0$.

Repeat

  1. Compute the E-step
     $$
     Q(\theta|\hat\theta_{(m)}, \mathbf x) = \mathbb{E}_{\hat\theta_{(m)}} [\log L^c(\theta|\mathbf x, \mathbf Z)]\,,
     $$
     where the expectation is with respect to $k(\mathbf z|\hat\theta_{(m)},\mathbf x)$.

  2. Maximize $Q(\theta|\hat\theta_{(m)}, \mathbf x)$ in $\theta$ and take the M-step
     $$
     \hat\theta_{(m+1)}=\arg\max_\theta \; Q(\theta|\hat\theta_{(m)}, \mathbf x),
     $$
     and set $m=m+1$

until a fixed point is reached; i.e., $\hat\theta_{(m+1)}=\hat\theta_{(m)}$.



For the normal mixture, using the missing-data structure exhibited previously leads to an objective function equal to
$$
Q(\theta^\prime|\theta,\mathbf{x}) = -\frac{1}{2}\,\sum_{i=1}^n
\mathbb{E}_\theta\left[\left. Z_i (x_i-\mu_1)^2 + (1-Z_i) (x_i-\mu_2)^2 \right| \mathbf{x} \right].
$$

Solving the M-step then provides the closed-form expressions
$$
\mu_1^\prime = \mathbb{E}_\theta\left[ \sum_{i=1}^n Z_i x_i \,\Big|\, \mathbf{x} \right]
\bigg/ \mathbb{E}_\theta\left[ \sum_{i=1}^n Z_i \,\Big|\, \mathbf{x} \right]
$$

and
$$
\mu_2^\prime = \mathbb{E}_\theta\left[ \sum_{i=1}^n (1-Z_i) x_i \,\Big|\, \mathbf{x} \right]
\bigg/ \mathbb{E}_\theta\left[ \sum_{i=1}^n (1-Z_i) \,\Big|\, \mathbf{x} \right].
$$

Since
$$
\mathbb{E}_\theta\left[Z_i|\mathbf{x} \right]=\frac{\varphi(x_i-\mu_1)}{\varphi(x_i-\mu_1)+3\,\varphi(x_i-\mu_2)}\,
$$
(the factor $3$ corresponds to mixture weights $1/4$ and $3/4$ in the book's example), the EM algorithm can easily be implemented in this setting.
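For concreteness, here is a minimal sketch of these updates in Python (my own illustration, not code from the book). It assumes the two-component normal mixture suggested by the factor $3$ above, i.e. $\tfrac14\,\mathcal N(\mu_1,1)+\tfrac34\,\mathcal N(\mu_2,1)$ with known weights and unit variances, so only the two means are updated.

```python
import numpy as np
from scipy.stats import norm

def em_two_means(x, mu1, mu2, n_iter=200, tol=1e-8):
    """EM for an assumed 0.25*N(mu1,1) + 0.75*N(mu2,1) mixture, unknown means only."""
    for _ in range(n_iter):
        # E-step: t_i = E_theta[Z_i | x] = phi(x_i - mu1) / (phi(x_i - mu1) + 3*phi(x_i - mu2)).
        phi1 = norm.pdf(x, loc=mu1)
        t = phi1 / (phi1 + 3.0 * norm.pdf(x, loc=mu2))
        # M-step: the closed-form updates quoted above.
        mu1_new = np.sum(t * x) / np.sum(t)
        mu2_new = np.sum((1.0 - t) * x) / np.sum(1.0 - t)
        done = abs(mu1_new - mu1) + abs(mu2_new - mu2) < tol
        mu1, mu2 = mu1_new, mu2_new
        if done:
            break
    return mu1, mu2

# Illustrative data simulated from the assumed model.
rng = np.random.default_rng(1)
comp1 = rng.random(500) < 0.25
x = np.where(comp1, rng.normal(0.0, 1.0, 500), rng.normal(2.5, 1.0, 500))
print(em_two_means(x, mu1=x.min(), mu2=x.max()))
```

Starting values matter in practice: the mixture likelihood is multimodal, and EM only converges to a local maximum.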




Whatever the mixture involved, the latent variables $Z_i$ are Multinomial $\mathcal{M}_M(1;\pi_1,\ldots,\pi_M)$, which means only one component of the vector $Z_i$ is equal to one and all of the $M-1$ others are zero. (Note the difference with the question's notation: the original notation $\mathcal{M}(M;\pi_1,\ldots,\pi_M)$ fails to indicate how many draws are taken, that is, what the sum of the components of $Z_i$ is.) When $k=2$ as in the above excerpt, $Z_i$ is an integer in $\{0,1\}$. There may be a confusion between a Multinomial distribution and the property of a distribution (like some mixtures) to be multimodal. The $Z_i$ do not have a multimodal distribution, taking only two values, even conditional on the $X_i$'s, while the $X_i$'s may, at least unconditionally.
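To see the notational point numerically, here is a small check (with an arbitrary $\pi$ chosen only for illustration): one multinomial draw always yields a one-hot vector, whereas $M$ draws yield counts that sum to $M$.

```python
import numpy as np

rng = np.random.default_rng(42)
pi = np.array([0.2, 0.5, 0.3])               # arbitrary weights, M = 3
M = len(pi)

z_one_draw = rng.multinomial(1, pi, size=4)  # M_M(1; pi): one-hot rows (the latent Z_i)
z_M_draws = rng.multinomial(M, pi, size=4)   # M(M; pi): counts, each row sums to M

print(z_one_draw, z_one_draw.sum(axis=1))    # each row sums to 1
print(z_M_draws, z_M_draws.sum(axis=1))      # each row sums to 3
```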






  • Yes, now I understand my problem. Thank you so much. I really learned something new.
    – Maryam
    19 mins ago

















If I correctly read between the lines, your question is about the difference between the distribution of $[z]$ (i.e., the prior distribution of the latent variable) and the distribution of $[z \mid y]$ (i.e., the posterior distribution of the latent variable given the data $y$).

Indeed, the prior is i.i.d. Bernoulli with probability $\pi$. However, the posterior is not i.i.d., because each subject will have his/her own probability of belonging to the first component (i.e., the component for which $z_i = 1$), depending on their data $y_i$. Hence, if you plot the posterior probabilities, the plot can be multimodal.
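Here is a tiny numerical sketch of that distinction (assuming a two-component Gaussian mixture with made-up parameters): the prior membership probability is the same $\pi$ for every subject, while the posterior probability varies with each $y_i$.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-component Gaussian mixture (values chosen only for illustration).
pi1, mu1, mu2, sd = 0.4, -1.0, 2.0, 1.0
y = np.array([-2.0, 0.0, 0.5, 3.0])

prior = np.full_like(y, pi1)                       # P(z_i = 1) = pi1 for every subject
num = pi1 * norm.pdf(y, loc=mu1, scale=sd)
posterior = num / (num + (1 - pi1) * norm.pdf(y, loc=mu2, scale=sd))  # P(z_i = 1 | y_i)

print(prior)      # [0.4 0.4 0.4 0.4]
print(posterior)  # differs across observations, driven by each y_i
```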






  • Thank you so much for your help. I have updated my question to make it clearer.
    – Maryam
    32 mins ago






  • Sorry, I have a typo in my question. Could you please have a look? I meant multinomial, not multimodal.
    – Maryam
    23 mins ago









