Latent Semantic Analysis (SVD)

In the Text Explorer platform, latent semantic analysis is centered around computing a partial singular value decomposition (SVD) of the document term matrix (DTM). This decomposition reduces the text data into a manageable number of dimensions for analysis. Depending on the centering and scaling option that you choose, latent semantic analysis is equivalent to a principal components analysis (PCA) of the correlation matrix, the covariance matrix, or the unscaled DTM.

The partial singular value decomposition approximates the DTM using three matrices: U, S, and V′. The relationship between these matrices is defined as follows:

DTM ≈ U * S * V′

Define nDoc as the number of documents (rows) in the DTM, nTerm as the number of terms (columns) in the DTM, and nVec as the specified number of singular vectors. Note that nVec must be less than or equal to min(nDoc, nTerm). It follows that U is an nDoc by nVec matrix whose columns are the left singular vectors of the DTM. S is an nVec by nVec diagonal matrix whose diagonal entries are the nVec largest singular values of the DTM. V′ is an nVec by nTerm matrix. The rows in V′ (or columns in V) are the right singular vectors.
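The dimensions above can be checked on a small example. The following is a minimal sketch in Python with NumPy, not the Text Explorer implementation: the toy DTM and the names dtm, n_vec, U, S, and Vt are assumptions made for illustration. It computes a full SVD and keeps the leading nVec singular triplets, which gives the same factors that a partial SVD would return.

```python
import numpy as np

# Hypothetical toy DTM: 4 documents (rows) by 5 terms (columns).
dtm = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

n_doc, n_term = dtm.shape
n_vec = 2  # must satisfy n_vec <= min(n_doc, n_term)

# Thin SVD of the whole matrix; singular values come back in descending order.
u_full, s_full, vt_full = np.linalg.svd(dtm, full_matrices=False)

# Keep only the leading n_vec singular triplets.
U = u_full[:, :n_vec]        # n_doc x n_vec, left singular vectors
S = np.diag(s_full[:n_vec])  # n_vec x n_vec, singular values on the diagonal
Vt = vt_full[:n_vec, :]      # n_vec x n_term, right singular vectors as rows

# Rank-n_vec approximation of the DTM.
approx = U @ S @ Vt

# Frobenius norm of the approximation error.
error = np.linalg.norm(dtm - approx)
```

The approximation error equals the square root of the sum of the squared singular values that were discarded, so a larger nVec always gives an approximation that is at least as close.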

The right singular vectors capture connections among different terms with similar meanings or topic areas. If three terms tend to appear in the same documents, the SVD is likely to produce a singular vector in V′ with large values for those three terms. The U singular vectors represent the documents projected into this new term space.
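The behavior described above can be illustrated with a small sketch in Python with NumPy. The toy DTM and all variable names are hypothetical, and the DTM is left uncentered and unscaled to keep the arithmetic simple. Three terms always appear together, and the leading right singular vector puts its largest loadings on exactly those terms. The sketch also shows what "projected" means for the documents: multiplying the DTM by V gives U scaled by the singular values.

```python
import numpy as np

# Hypothetical toy DTM: terms 0, 1, 2 co-occur in documents 0-2;
# terms 3, 4 co-occur in documents 3-4.
dtm = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

n_vec = 2
u_full, s_full, vt_full = np.linalg.svd(dtm, full_matrices=False)
U, s, Vt = u_full[:, :n_vec], s_full[:n_vec], vt_full[:n_vec, :]

# Terms with the largest absolute loadings on the first right singular vector.
# Absolute values are used because the sign of a singular vector is arbitrary.
top_terms = set(np.argsort(-np.abs(Vt[0]))[:3].tolist())

# Documents projected onto the right singular vectors.
doc_scores = dtm @ Vt.T  # n_doc x n_vec

# The same coordinates, obtained from U and the singular values.
doc_scores_from_u = U * s
```

In this example the first three documents receive identical coordinates, because they contain the same terms.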

Latent semantic analysis also captures indirect connections. If two words never appear together in the same document, but each tends to appear in documents with a third word, the SVD is able to capture some of that connection. Similarly, if two documents have no words in common but contain words that are connected in the dimension-reduced space, they map to similar vectors in the SVD output.
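A small sketch can make the indirect connection concrete. It is written in Python with NumPy on a hypothetical five-term vocabulary, using an uncentered, unscaled DTM. The terms "car" and "automobile" never share a document, but both appear with "engine". Their raw DTM columns are orthogonal, yet their vectors in the two-dimensional SVD space point in the same direction.

```python
import numpy as np

# Hypothetical vocabulary and DTM (rows are documents, columns are terms).
terms = ["car", "automobile", "engine", "apple", "banana"]
dtm = np.array([
    [1, 0, 1, 0, 0],  # car, engine
    [0, 1, 1, 0, 0],  # automobile, engine
    [0, 0, 0, 1, 1],  # apple, banana
    [0, 0, 0, 1, 1],  # apple, banana
], dtype=float)

def cosine(a, b):
    """Cosine similarity, defined as 0 when either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

n_vec = 2
_, s, vt = np.linalg.svd(dtm, full_matrices=False)

# One row per term: coordinates in the n_vec-dimensional space (rows of V * S).
term_vectors = vt[:n_vec, :].T * s[:n_vec]

car, auto, apple = terms.index("car"), terms.index("automobile"), terms.index("apple")

raw_sim = cosine(dtm[:, car], dtm[:, auto])                # no shared documents
lsa_sim = cosine(term_vectors[car], term_vectors[auto])    # linked through "engine"
unrelated_sim = cosine(term_vectors[car], term_vectors[apple])
```

Here raw_sim is 0 because the two terms never co-occur, lsa_sim is close to 1, and unrelated_sim stays close to 0 because "car" and "apple" have no shared context.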

The SVD transforms text data into a fixed-dimensional vector space, making it amenable to standard clustering, classification, and regression techniques. The Save options enable you to export this vector space to be analyzed in other JMP platforms.

The DTM, by default, is centered, scaled, and divided by nDoc minus 1 before the singular value decomposition is carried out. This analysis is equivalent to a PCA of the correlation matrix of the DTM.

You can also specify Centered or Uncentered in the Specifications window.

• If you specify Centered, the DTM is centered and divided by nDoc minus 1 before the singular value decomposition. This analysis is equivalent to a PCA of the covariance matrix of the DTM.

• If you specify Uncentered, the DTM is divided by nDoc before the singular value decomposition. This analysis is equivalent to a PCA of the unscaled DTM.
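The three PCA equivalences can be checked numerically. The following sketch in Python with NumPy uses a hypothetical toy DTM and helper names chosen for illustration. As an assumption made for the check, it divides by the square root of nDoc minus 1 (or of nDoc in the uncentered case) so that the squared singular values match the eigenvalues exactly. Dividing by a different positive constant leaves the singular vectors unchanged and only rescales the singular values.

```python
import numpy as np

# Hypothetical toy DTM: 6 documents by 3 terms, no constant columns.
dtm = np.array([
    [2, 0, 1],
    [1, 1, 0],
    [0, 3, 1],
    [4, 1, 2],
    [1, 0, 0],
    [0, 2, 3],
], dtype=float)
n_doc = dtm.shape[0]

def squared_singular_values(matrix, divisor):
    """Squared singular values of matrix / sqrt(divisor), in descending order."""
    return np.linalg.svd(matrix / np.sqrt(divisor), compute_uv=False) ** 2

def eigenvalues_desc(symmetric):
    """Eigenvalues of a symmetric matrix, in descending order."""
    return np.linalg.eigvalsh(symmetric)[::-1]

centered = dtm - dtm.mean(axis=0)
standardized = centered / dtm.std(axis=0, ddof=1)

# Centered and scaled: matches a PCA of the correlation matrix.
sv_scaled = squared_singular_values(standardized, n_doc - 1)
ev_corr = eigenvalues_desc(np.corrcoef(dtm, rowvar=False))

# Centered only: matches a PCA of the covariance matrix.
sv_centered = squared_singular_values(centered, n_doc - 1)
ev_cov = eigenvalues_desc(np.cov(dtm, rowvar=False))

# Uncentered: matches an eigendecomposition of the raw cross-product matrix.
sv_uncentered = squared_singular_values(dtm, n_doc)
ev_raw = eigenvalues_desc(dtm.T @ dtm / n_doc)
```

In each case the right singular vectors are the corresponding eigenvectors, up to sign.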

The SVD implementation takes advantage of the sparsity of the DTM even when the DTM is centered.
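Centering normally turns a sparse matrix into a dense one, because every zero entry becomes a nonzero value. The source does not say how Text Explorer avoids this. One common technique, shown here only as an assumption about how such an implementation might work, is to never form the centered matrix and instead apply the centering inside each matrix-vector product, which is all that an iterative partial SVD solver needs. The sketch below is in Python with NumPy; the toy DTM and the function names centered_matvec and centered_rmatvec are hypothetical, and a dense array stands in for what would be a sparse matrix in practice.

```python
import numpy as np

# Hypothetical toy DTM, mostly zeros. A real implementation would store it
# in a sparse format so that dtm @ v only touches the nonzero entries.
dtm = np.array([
    [1, 0, 0, 2, 0],
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 4],
], dtype=float)
n_doc, n_term = dtm.shape
col_means = dtm.mean(axis=0)

def centered_matvec(v):
    """Compute (X - 1 mu') v as X v - 1 (mu' v), without centering X."""
    return dtm @ v - np.ones(n_doc) * (col_means @ v)

def centered_rmatvec(w):
    """Compute (X - 1 mu')' w as X' w - mu (1' w), without centering X."""
    return dtm.T @ w - col_means * w.sum()

v = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
w = np.array([0.5, 1.0, -1.0, 2.0])

# The dense centered matrix is formed here only to verify the identities.
dense_centered = dtm - col_means
matvec_matches = np.allclose(centered_matvec(v), dense_centered @ v)
rmatvec_matches = np.allclose(centered_rmatvec(w), dense_centered.T @ w)
```

Each product costs one sparse multiplication plus a correction that involves only the vector of column means, so the cost scales with the number of nonzero entries rather than with nDoc times nTerm.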
