Graph-regularized concept factorization for multi-view document clustering

We propose a novel multi-view document clustering method based on graph-regularized concept factorization (MVCF). MVCF makes full use of multi-view features for a more comprehensive understanding of the data and adaptively learns a weight for each view. It also preserves the local geometric structure of the underlying manifolds for multi-view clustering. We derive an efficient optimization algorithm to solve the objective function of MVCF and prove its convergence using the auxiliary function method. Experiments carried out on three benchmark datasets demonstrate the effectiveness of MVCF in comparison with several state-of-the-art approaches in terms of accuracy, normalized mutual information and purity.


Introduction
Matrix factorization-based approaches have become popular in document clustering [1,2]. Nonnegative matrix factorization (NMF) [3] and concept factorization (CF) [1] have produced impressive results. CF strives to overcome the limitations of NMF while inheriting all its strengths. CF models each concept as a linear combination of the data points, and each data point as a linear combination of the concepts; the product of the two sets of linear coefficients is interpreted as an approximation of the original data points. The cluster label of each data point can be easily derived from the obtained linear coefficients. However, CF considers only the global Euclidean geometry and ignores the local manifold geometry. Cai et al. proposed LCCF [4], which uses a graph regularization term to capture the local geometry of the document sub-manifold. As data are often sparse, Liu et al. enforced a locality constraint on CF and proposed LCF [5] to achieve sparsity and locality simultaneously. Taking advantage of semi-supervised learning, Liu et al. incorporated label information into CF and proposed CCF [6], which ensures that data points sharing the same label are grouped together. However, CCF cannot handle data points with different labels, which should be assigned to different clusters to the maximum extent. To achieve this goal, He et al. proposed PCCF [7], which places must-link data points in the same cluster and cannot-link data points in different clusters.
Essentially, all CF methods mentioned above are developed to handle a single view (feature). However, in many real-world applications, data are often collected from diverse domains or obtained from different feature extractors [8][9][10]. For example, a document may be translated into multiple languages, a web page may be represented by its contents or by hyperlinks, and a user may participate in heterogeneous social networks. In these examples, each view alone would be insufficient for clustering without complementary information from the other views. Some methods have been proposed to deal with this situation. Bickel et al. proposed a Co-EM based framework [9] for multi-view clustering in mixture models. It computes the expected values of the hidden variables in one view and uses them in the M-step of the other views, and vice versa; this process is repeated until a suitable stopping criterion is met. The algorithm, however, often fails to converge. Relying on eigen-decomposition, spectral clustering methods can guarantee a global optimum and thus achieve better clustering performance. Kumar et al. proposed a multi-view spectral clustering method, CRSC [11], to ensure that corresponding data points in each view have the same cluster membership. Later, Xia et al. proposed RMSC [12], which explicitly handles possible noise in the multi-view input data and recovers a shared transition probability matrix via low-rank and sparse decomposition. However, the limitation of spectral clustering methods is that the negative values appearing in the eigen-factorization make the factorization hard to interpret, and the obtained eigenvectors have no direct relationship with the semantic structure of the dataset [13,14].
Recently, NMF-based multi-view clustering has received great attention due to its better semantic interpretation [13,14]. Liu et al. proposed Multi-NMF [15] to obtain a common consensus matrix that reflects the latent clustering structure shared by different views. However, Multi-NMF fails to preserve the local geometric structure of the data space. To tackle this problem, Zhang et al. proposed MMNMF [16]; however, the method requires a weight to be assigned to each view individually, and it is often nontrivial to decide these weights. A graph-regularized NMF-based multi-view clustering method [17] was also proposed, which extends Liu et al.'s algorithm [15] with a graph regularization. In this study we take advantage of CF and propose a novel method, called multi-view CF (MVCF). The overall approach and advantages of MVCF are as follows: 1. MVCF finds an intrinsic coefficient matrix for each view, and incorporates them with a multi-manifold regularizer to preserve the local geometric structure of the multi-view data space. 2. MVCF learns the weight of each view automatically. This saves the cost of setting weights individually and truly reflects the importance of each view. 3. A new updating rule is developed to efficiently and effectively solve the associated optimization problem, together with a proof of its convergence.
The outline of the rest of the paper is as follows. In Section 2, we briefly review CF and local invariance, and then propose MVCF. In Section 3, the optimization algorithm for solving the objective function of MVCF is derived, together with a proof of convergence. The experimental results on document datasets are discussed in Section 4. Finally, we draw conclusions and outline future work in Section 5.

Concept factorization
Given n data points X = [x_1, x_2, ..., x_n] ∈ R^{d×n}, each data point x_i is represented by a d-dimensional feature vector. NMF aims to find a d × k matrix U and an n × k matrix H whose product approximates the original matrix, X ≈ UH^T. Each column vector u_c of U can be regarded as a basis, and each data point x_i is approximated by a linear combination of these k bases, weighted by the entries of the data representation matrix H = [h_ic]: x_i ≈ Σ_{c=1}^{k} u_c h_ic. The speciality of NMF is that it requires all entries of the factor matrices to be non-negative. This brings two limitations. One is that the non-negativity requirement is not applicable to applications where the data involve negative numbers. The other is that it is not clear how to effectively perform NMF in a transformed data space, so the powerful kernel methods cannot be applied.
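The factorization X ≈ UH^T described above can be sketched with the classical Lee-Seung multiplicative updates; this is a minimal NumPy illustration, not the authors' implementation, and the function name and parameters are our own.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Lee-Seung multiplicative updates for X ≈ U @ H.T with U, H >= 0."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    U = rng.random((d, k))
    H = rng.random((n, k))
    for _ in range(n_iter):
        # Each update keeps the factors nonnegative and decreases ||X - U H^T||_F^2.
        U *= (X @ H) / (U @ H.T @ H + eps)
        H *= (X.T @ U) / (H @ U.T @ U + eps)
    return U, H

X = np.abs(np.random.default_rng(1).random((20, 30)))  # toy nonnegative data
U, H = nmf(X, k=5)
err = np.linalg.norm(X - U @ H.T) / np.linalg.norm(X)  # relative reconstruction error
```

The small constant `eps` only guards against division by zero; it does not change the fixed points of the updates.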
Concept factorization (CF) was proposed to address the above problems while inheriting NMF's strengths. CF models each basis (cluster center) u_c as a linear combination of the data points, u_c = Σ_{i=1}^{n} w_ic x_i with w_ic ≥ 0. Let W = [w_ic] ∈ R^{n×k}; CF tries to decompose the data matrix as

X ≈ XWH^T. (1)

The basic form of CF utilizes the Frobenius norm to quantify the approximation, and CF tries to optimize the following problem:

min_{W≥0,H≥0} ||X − XWH^T||_F^2. (2)

To make the solution of (2) unique [1], we require each basis to have unit norm, i.e., w_c^T K w_c = 1 with K = X^T X. (3) This requirement is met by normalizing W,

w_ic ← w_ic / sqrt(w_c^T K w_c), (4)

and H is adjusted accordingly so that WH^T does not change:

h_ic ← h_ic · sqrt(w_c^T K w_c). (5)
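Problem (2) is commonly solved with multiplicative updates that depend on the data only through K = X^T X, which is what makes CF kernelizable. The following NumPy sketch assumes nonnegative data (the mixed-sign case is treated later in the paper); the function name and parameters are ours.

```python
import numpy as np

def cf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Concept factorization X ≈ X W H^T via multiplicative updates on K = X^T X."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((n, k))
    H = rng.random((n, k))
    K = X.T @ X  # the data enter only through this Gram matrix
    for _ in range(n_iter):
        W *= (K @ H) / (K @ W @ (H.T @ H) + eps)
        H *= (K @ W) / (H @ (W.T @ K @ W) + eps)
    return W, H

X = np.abs(np.random.default_rng(1).random((20, 30)))  # toy nonnegative data
W, H = cf(X, k=5)
err = np.linalg.norm(X - X @ W @ H.T) / np.linalg.norm(X)
```

Replacing K with any kernel matrix gives the kernelized variant mentioned in the text.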

Local Invariance
If two data points x_i and x_j are close in the intrinsic geometry of the data distribution, the corresponding data representations h_i and h_j should also be close to each other [18][19][20][21]; this is called local invariance. Recent studies on spectral graph theory [22] and manifold learning theory [23] have demonstrated that the local geometric structure can be effectively modeled through a nearest neighbor graph on a scattering of data points. Consider a graph with n vertices, where each vertex corresponds to a document in the corpus. We define the edge weight matrix S as follows: if two data points x_i and x_j are close to each other, (S)_ij is set approximately to one, and otherwise to zero. (6) Then, the new low-dimensional representations H can be obtained by minimizing the following term [18]:

R = (1/2) Σ_{i,j} ||h_i − h_j||^2 (S)_ij = Tr(H^T L H), (7)

where L = D − S is the graph Laplacian [22], (D)_ii = Σ_j (S)_ij, and Tr(·) denotes the trace operator.
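The graph construction and the smoothness term Tr(H^T L H) can be sketched as follows; the binary p-nearest-neighbour weighting is one common choice for S (heat-kernel weights are another), and the helper names are our own.

```python
import numpy as np

def knn_graph(X, p=5):
    """Symmetric 0/1 p-nearest-neighbour weight matrix S over the columns of X."""
    n = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # pairwise squared distances
    S = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:p + 1]  # skip the point itself at index 0
        S[i, nbrs] = 1.0
    return np.maximum(S, S.T)  # symmetrize

def smoothness(H, S):
    """Tr(H^T L H) with L = D - S; equals 0.5 * sum_ij S_ij ||h_i - h_j||^2."""
    L = np.diag(S.sum(axis=1)) - S
    return np.trace(H.T @ L @ H)
```

Minimizing `smoothness(H, S)` penalizes representations that place graph neighbours far apart, which is exactly the local-invariance idea in (7).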

Objective Function
Let X_v ∈ R^{d_v×n} denote the features in the v-th view, and let W_v ∈ R^{n×k} and H_v ∈ R^{n×k} be the coefficient matrices in the v-th view, respectively. Given n_v types of heterogeneous features, v = 1, 2, ..., n_v, we integrate all these views by combining problems (2) and (7), and obtain the following objective function:

min Σ_{v=1}^{n_v} [ α_v ||X_v − X_v W_v H_v^T||_F^2 + Tr(H_v^T L_v H_v) ], s.t. W_v ≥ 0, H_v ≥ 0, (8)

where α_v is the weight of the v-th view, which represents the importance of that view and needs to be set separately.
Apparently, it is hard to specify the weights α v for (8) without prior knowledge.
However, if we add a regularization term to (8),

Σ_{v=1}^{n_v} α_v^2, (9)

the weight of each view can be learned adaptively, reflecting the importance of the corresponding view. Besides, the term in (9) helps avoid the degenerate situation where one view's weight is learned to be one while the weights of all other views are zero. Considering that α_v is the parameter controlling the trade-off between the two terms in (8) and need not be limited to the fixed range Σ_{v=1}^{n_v} α_v = 1, we extend the range to a constant λ:

Σ_{v=1}^{n_v} α_v = λ, α_v ≥ 0. (10)

Combining (8), (9) and (10), we propose the final objective function:

min Σ_{v=1}^{n_v} [ α_v ||X_v − X_v W_v H_v^T||_F^2 + Tr(H_v^T L_v H_v) ] + γ Σ_{v=1}^{n_v} α_v^2, s.t. W_v ≥ 0, H_v ≥ 0, Σ_{v=1}^{n_v} α_v = λ, α_v ≥ 0, (11)

where γ is the parameter of the last term.
In the following section, we describe a novel updating rule to obtain a local optimum of the objective function in (11). The rule guarantees that the objective function is non-increasing with each iteration.

Algorithm Derivation
We optimize (11) in two alternating steps. The first step is to fix α_v and update W_v and H_v for each view independently. Then, (11) becomes

O_1 = Σ_{v=1}^{n_v} [ α_v ||X_v − X_v W_v H_v^T||_F^2 + Tr(H_v^T L_v H_v) ]. (12)

Denote K_v = X_v^T X_v; (13) then O_1 can be rewritten as

O_1 = Σ_{v=1}^{n_v} [ α_v Tr( (I − W_v H_v^T)^T K_v (I − W_v H_v^T) ) + Tr(H_v^T L_v H_v) ], (14)

where I ∈ R^{n×n} is an identity matrix.
Introducing the Lagrange multipliers ψ_ic and φ_ic for the constraints w_ic ≥ 0 and h_ic ≥ 0 (the view index v is dropped for brevity), the partial derivatives of (14) with respect to W and H are

∂O_1/∂W = 2α(KWH^T H − KH) + Ψ, (15)
∂O_1/∂H = 2α(HW^T KW − KW) + 2(D − S)H + Φ. (16)

Following the Karush-Kuhn-Tucker conditions [24], ψ_ic w_ic = 0 and φ_ic h_ic = 0, we obtain the following equations:

(KWH^T H)_ic w_ic − (KH)_ic w_ic = 0, (17)
(αHW^T KW + DH)_ic h_ic − (αKW + SH)_ic h_ic = 0. (18)

For non-negative data matrices, the equations (17) and (18) lead to the updating rules:

w_ic ← w_ic (KH)_ic / (KWH^T H)_ic, (19)
h_ic ← h_ic (αKW + SH)_ic / (αHW^T KW + DH)_ic. (20)

The second step is to fix W_v, H_v, and update α_v. Then, (11) becomes:

O_2 = Σ_{v=1}^{n_v} α_v f_v + γ Σ_{v=1}^{n_v} α_v^2, s.t. Σ_{v=1}^{n_v} α_v = λ, α_v ≥ 0, (21)

where f_v = ||X_v − X_v W_v H_v^T||_F^2 denotes the reconstruction error of the v-th view. Denoting α = [α_1, α_2, ..., α_{n_v}]^T and f = [f_1, f_2, ..., f_{n_v}]^T, (21) can be written compactly as

O_2 = f^T α + γ α^T α, s.t. α^T 1 = λ, α ≥ 0. (22)

The Lagrangian function of (22) is

L(α, η, β) = f^T α + γ α^T α − η(α^T 1 − λ) − β^T α, (23)

where η and β are the Lagrangian multipliers.
According to the Karush-Kuhn-Tucker conditions [24], it can be verified that the optimal solution is

α_v = λ/n_v + (Σ_{s=1}^{n_v} f_s) / (2γ n_v) − f_v / (2γ). (24)

Hence, the weights α_v of all n_v views are learned through (24), and each α_v is determined by the corresponding f_v in its view: the smaller the reconstruction error of a view, the larger its learned weight.
The entire algorithm for solving problem (11) is summarized in Algorithm 1.
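The alternating scheme of Algorithm 1 can be sketched as follows. This is a minimal NumPy sketch under the reconstructed objective above, assuming nonnegative views: it applies the per-view updates (19) and (20), then the weight formula (24). The per-view neighbour matrices `Ss`, the clipping of negative weights to zero, and the α-weighted consensus representation `Hstar` are our assumptions for illustration, not the authors' code.

```python
import numpy as np

def mvcf(Xs, k, Ss, lam=4.5, gamma=5.0, n_iter=100, eps=1e-10, seed=0):
    """MVCF sketch: Xs is a list of d_v x n views, Ss a list of n x n weight matrices."""
    rng = np.random.default_rng(seed)
    nv, n = len(Xs), Xs[0].shape[1]
    Ws = [rng.random((n, k)) for _ in range(nv)]
    Hs = [rng.random((n, k)) for _ in range(nv)]
    alphas = np.full(nv, lam / nv)          # initialize each weight to lambda / n_v
    Ks = [X.T @ X for X in Xs]
    Ds = [np.diag(S.sum(axis=1)) for S in Ss]
    for _ in range(n_iter):
        for v in range(nv):                  # step 1: update W_v, H_v per view
            K, W, H, S, D, a = Ks[v], Ws[v], Hs[v], Ss[v], Ds[v], alphas[v]
            W *= (K @ H) / (K @ W @ (H.T @ H) + eps)                       # rule (19)
            H *= (a * (K @ W) + S @ H) / (a * (H @ (W.T @ K @ W)) + D @ H + eps)  # rule (20)
        f = np.array([np.linalg.norm(X - X @ W @ H.T) ** 2
                      for X, W, H in zip(Xs, Ws, Hs)])
        # step 2: closed-form weights (24), clipped to stay nonnegative
        alphas = np.maximum(lam / nv + f.sum() / (2 * gamma * nv) - f / (2 * gamma), 0.0)
    Hstar = sum(a * H for a, H in zip(alphas, Hs))  # weighted consensus representation
    return Hstar, alphas

rng = np.random.default_rng(3)
X1, X2 = rng.random((8, 15)), rng.random((6, 15))   # two toy views of 15 documents
S = np.ones((15, 15)) - np.eye(15)                  # trivial fully connected graph
Hstar, alphas = mvcf([X1, X2], k=3, Ss=[S, S])
```

In practice `Ss` would come from a p-nearest-neighbour graph per view, and k-means would then be run on `Hstar` as described in the experiments.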

[Algorithm 1: iterate the updates (19), (20) and (24) until convergence, then normalize W_v and H_v by (4) and (5).]

Theorem 1. The objective function in (11) is non-increasing under the alternating optimization of W_v, H_v and α_v.

Theorem 2. The objective function O_1 in (12) is non-increasing under the updating rules in (19) and (20). The objective function is invariant under these updates if and only if W_v and H_v are at a stationary point.
Notably, the updating rules of W and H in each view v do not depend on the other views in (12). So we use X, W, H and α to represent X_v, W_v, H_v and α_v for brevity in the rest of this section.
We use an auxiliary function, as used in the expectation maximization algorithm [25,26], to prove the convergence of (11). The definition of the auxiliary function is given by the following Definition 1.

Definition 1. G(w, w') is an auxiliary function of F(w) if the conditions G(w, w') ≥ F(w) and G(w, w) = F(w) are satisfied.
The auxiliary function is helpful because of the following Lemma 1.

Lemma 1. If G is an auxiliary function of F, then F is non-increasing under the update

w^{t+1} = arg min_w G(w, w^t), (25)

where t is the iteration index.

Proof. F(w^{t+1}) ≤ G(w^{t+1}, w^t) ≤ G(w^t, w^t) = F(w^t).
The equality F(w^{t+1}) = F(w^t) holds only if w^t is a local minimum of G(w, w^t). By iterating the updates in (25), the sequence of estimates converges to a local minimum w_min = arg min_w F(w). If we can show that our update rules exactly minimize proper auxiliary functions of the objective O_1 in (12), Theorem 2 follows from Lemma 1.
First, we prove the convergence of the update rule in (19). Given an element w_ic in W, we use F_ic to denote the part of O_1 that is relevant to w_ic only. Since the update is essentially element-wise, it is sufficient to show that each F_ic is non-increasing under the update step in (19). Let F'_ic denote the first-order derivative of O_1 with respect to w_ic. We define the auxiliary function G for F_ic as follows.
Lemma 2. The function

G(w, w_ic^t) = F_ic(w_ic^t) + F'_ic(w_ic^t)(w − w_ic^t) + [α(KWH^T H)_ic / w_ic^t] (w − w_ic^t)^2 (27)

is an auxiliary function for F_ic, which is a part of O_1 and relevant to w_ic only.
Proof. Obviously, G(w, w) = F_ic(w). According to the definition of the auxiliary function, we only need to prove that G(w, w_ic^t) ≥ F_ic(w). To do so, we compare (27) with the Taylor series expansion of F_ic(w):

F_ic(w) = F_ic(w_ic^t) + F'_ic(w_ic^t)(w − w_ic^t) + (1/2) F''_ic (w − w_ic^t)^2, (28)

where F''_ic is the second-order derivative of F_ic with respect to w_ic. It is not difficult to check that

F'_ic = 2α(KWH^T H − KH)_ic, (29)
F''_ic = 2α(K)_ii (H^T H)_cc. (30)

Substituting (29) and (30) into (28) and comparing with (27), G(w, w_ic^t) ≥ F_ic(w) is equivalent to

α(KWH^T H)_ic / w_ic^t ≥ α(K)_ii (H^T H)_cc. (31)

To prove the inequality above, we have

(KWH^T H)_ic = Σ_l (KW)_il (H^T H)_lc ≥ (KW)_ic (H^T H)_cc ≥ w_ic^t (K)_ii (H^T H)_cc. (32)-(34)

Thus, (31) holds and G(w, w_ic^t) ≥ F_ic(w). According to (25), the optimum w^{t+1} can be obtained by setting the first-order derivative of (27) with respect to w to zero, which yields exactly the update rule in (19). Then we define an auxiliary function for the update rule in (20). Similarly, for any element h_ic in H, let F_ic denote the part of O_1 that is relevant to h_ic only. The auxiliary function for the objective function with regard to the variable h_ic is defined as follows.

Lemma 3. The function

G(h, h_ic^t) = F_ic(h_ic^t) + F'_ic(h_ic^t)(h − h_ic^t) + [(αHW^T KW + DH)_ic / h_ic^t] (h − h_ic^t)^2 (35)

is an auxiliary function for F_ic, which is a part of O_1 and relevant to h_ic only.
Proof. It is equivalent to prove that

(αHW^T KW + DH)_ic / h_ic^t ≥ α(W^T KW)_cc + (L)_ii. (36)

To prove the above inequality, we have

(HW^T KW)_ic = Σ_l h_il (W^T KW)_lc ≥ h_ic^t (W^T KW)_cc, (37)

and

(DH)_ic ≥ (D)_ii h_ic^t ≥ (D − S)_ii h_ic^t = (L)_ii h_ic^t.

Since (35) is an auxiliary function, F_ic is non-increasing under this update rule according to Lemmas 1 and 3.
So far, we have proved Theorem 2. Since the objective function O_2 in (22) is a convex optimization problem, Theorem 1 is also proved.

Algorithm for Data with Negative Values
For data matrices that contain negative values, our multiplicative updating algorithm is based on the following Theorem 3, proposed by Sha et al. [27].
Theorem 3. Consider the nonnegative quadratic programming problem

min_{y≥0} f(y) = (1/2) y^T A y + b^T y, (38)

where A is an arbitrary n × n symmetric positive semi-definite matrix and b is an arbitrary n × 1 vector. The iterative solution is expressed in terms of the positive component A^+ and the negative component A^− of the matrix A in (38),

(A^+)_ij = A_ij if A_ij > 0, and 0 otherwise, (39)
(A^−)_ij = |A_ij| if A_ij < 0, and 0 otherwise. (40)

It is easy to find that A = A^+ − A^−. The solution y that minimizes (38) can be obtained by the following updating rule:

y_i ← y_i [−b_i + sqrt(b_i^2 + 4(A^+ y)_i (A^− y)_i)] / (2(A^+ y)_i). (41)

Proof. The function g(y_i, y_i^t) constructed in [27] (42) is an auxiliary function for f(y_i).
We can obtain g(y_i, y_i^t) ≥ f(y_i) according to [27]. Then, (42) is minimized by setting its derivative with respect to y_i to zero, leading to the updating rule in (41).
It can be seen from (13) that O_1 is a quadratic form in W (or H), so (41) can be applied to minimize the objective function O_1 once the corresponding A and b are identified. Fixing H, the linear part b of the quadratic form O_1(W) can be obtained by evaluating (29) at W = 0, and the quadratic part A of O_1(W) can be obtained from (30).
Substituting A and b into (41), we obtain the update rule of w_ic,

w_ic ← w_ic [−b_ic + sqrt(b_ic^2 + 4(Q^+)_ic (Q^−)_ic)] / (2(Q^+)_ic), (43)

where Q^+ = K^+ WH^T H, Q^− = K^− WH^T H, and K^+ and K^− denote the nonnegative matrices with elements

(K^+)_ij = (|K_ij| + K_ij)/2, (44)
(K^−)_ij = (|K_ij| − K_ij)/2. (45)

It is easy to verify that K = K^+ − K^−. Similarly, we obtain the updating rule of h_ic,

h_ic ← h_ic [−b_ic + sqrt(b_ic^2 + 4(P^+)_ic (P^−)_ic)] / (2(P^+)_ic), (46)

where P^+ = αHW^T K^+ W + DH and P^− = αHW^T K^− W + SH.
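The generic update (41) for the nonnegative quadratic program (38) can be sketched directly; this is our own NumPy illustration of the Sha et al.-style rule, not the authors' code, and the function name and iteration count are assumptions.

```python
import numpy as np

def nqp(A, b, n_iter=2000, seed=0):
    """Multiplicative updates for min_{y>=0} 0.5 * y^T A y + b^T y, as in rule (41)."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    Ap = np.maximum(A, 0.0)    # positive part A+
    An = np.maximum(-A, 0.0)   # negative part A-, so A = A+ - A-
    y = np.random.default_rng(seed).random(len(b)) + 0.1  # strictly positive start
    for _ in range(n_iter):
        a_p = Ap @ y
        a_n = An @ y
        y *= (-b + np.sqrt(b * b + 4 * a_p * a_n)) / (2 * a_p + 1e-12)
    return y

# Toy check: for A = [[2,-1],[-1,2]], b = [-1,-1] the unconstrained minimizer
# -A^{-1} b = [1, 1] is already nonnegative, so the rule should converge to it.
A = np.array([[2.0, -1.0], [-1.0, 2.0]])
b = np.array([-1.0, -1.0])
y = nqp(A, b)
```

Note that the iterates stay nonnegative automatically, since the multiplicative factor is always nonnegative.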

Datasets
In this paper, we test our method, MVCF, on three benchmark multi-view datasets.
3-Sources 1 is constructed from three well-known online news sources: BBC, Reuters, and Guardian. In total there are 948 news articles covering 416 distinct news stories from the period February to April 2009. Of these stories, 169 were reported in all three sources. Each story was manually annotated with one of the six topical labels: business, entertainment, health, politics, sport and technology.
Cora 2 contains 2708 documents over seven labels (neural networks, rule learning, reinforcement learning, probabilistic methods, theory, genetic algorithms, and case based). In this paper, two views, content and citations, are used. The documents are described by 1433 words in the content view and by 5429 links between them in the citation view.
Cornell [28] contains 195 documents over five labels (student, project, course, staff, faculty). In this paper, we again use two views, content and citations. The documents are described by 1703 words in the content view and by 569 links between them in the citation view.

Baseline algorithms
We compared MVCF with the following state-of-the-art methods: 1. CTSC [29]: a multi-view spectral clustering approach using the idea of co-training. Under the assumption that the true underlying clustering assigns a point to the same cluster irrespective of the view, CTSC learns the clustering result in one view and then uses the result to label the data in the other views so as to modify their graph structures (similarity matrices). 2. CRSC [11]: applies a centroid-based co-regularization scheme to multi-view spectral clustering. To uncover the data structure shared by different views, CRSC enforces the view-specific eigenvectors to look similar by regularizing them towards a common consensus, and then optimizes the individual clusterings as well as the consensus with a joint cost function. 3. MultiNMF [15]: searches for a factorization that gives compatible clustering solutions across multiple views, requiring the coefficient matrices learnt from the factorizations of different views to be regularized towards a common consensus. 4. RMKMC [30]: simultaneously performs clustering using each view of features and unifies the results based on their importance to the clustering task; the ℓ_{2,1}-norm is employed to improve robustness. 5. RMSC [12]: for each view, it constructs a corresponding transition probability matrix, which is then used to recover a shared low-rank transition probability matrix; the standard Markov chain method is then applied before clustering is conducted. 6. GRNMF [17]: a graph-regularized NMF-based multi-view clustering method that extends Liu et al.'s algorithm [15] with a graph regularization.

Evaluation metric
Three metrics, clustering accuracy (ACC), normalized mutual information (NMI) [31] and purity [32], are used to evaluate the performance in this work. For each metric, a higher value indicates better clustering quality. These measurements are widely used; they compare the obtained label of each sample with the label provided by the dataset.
ACC measures the percentage of correctly labeled documents and is defined as

ACC = ( Σ_{i=1}^{n} δ(l_i, map(r_i)) ) / n,

where n denotes the total number of documents, l_i denotes the ground-truth label, r_i denotes the obtained cluster label, δ(x, y) is the delta function that equals one if x = y and zero otherwise, and map(r_i) is the optimal mapping function that permutes the clustering labels to match the ground-truth labels. The best mapping can be found by using the Kuhn-Munkres algorithm [33].
NMI is used to measure the similarity between the cluster assignments and the pre-existing labeling of the classes. Let n_c be the number of objects in cluster m_c (1 ≤ c ≤ k) obtained by the clustering algorithm and ñ_s be the number of objects in class g_s (1 ≤ s ≤ k) of the ground-truth labels. NMI is defined as

NMI = ( Σ_{c,s} n_{c,s} log( n · n_{c,s} / (n_c ñ_s) ) ) / sqrt( ( Σ_c n_c log(n_c / n) ) ( Σ_s ñ_s log(ñ_s / n) ) ),

where n_{c,s} is the number of objects in the intersection of cluster m_c and class g_s. NMI varies from 0 for a completely wrong clustering to 1 for a perfect clustering.
Purity is given by

Purity = (1/n) Σ_{i=1}^{k} max_j | m_i ∩ g_j |,

where n is the number of data points belonging to the k clusters, m_i represents the i-th obtained cluster, and g_j implies the j-th ground-truth class.
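The three metrics can be sketched as follows. For the ACC label mapping, this illustration brute-forces all label permutations (exact but only practical for small k), whereas the paper uses the Kuhn-Munkres algorithm for the same optimal mapping; the function names are our own.

```python
import numpy as np
from itertools import permutations

def accuracy(true, pred):
    """ACC: best agreement over permutations of cluster labels (exact for small k)."""
    true, pred = np.asarray(true), np.asarray(pred)
    best = 0.0
    for perm in permutations(np.unique(true)):
        mapping = dict(zip(np.unique(pred), perm))
        best = max(best, np.mean([mapping[p] == t for p, t in zip(pred, true)]))
    return best

def nmi(true, pred):
    """Normalized mutual information between two labelings."""
    true, pred = np.asarray(true), np.asarray(pred)
    n = len(true)
    mi = 0.0
    for c in np.unique(pred):
        for s in np.unique(true):
            ncs = np.sum((pred == c) & (true == s))
            if ncs > 0:
                mi += ncs / n * np.log(n * ncs / (np.sum(pred == c) * np.sum(true == s)))
    h = lambda x: -sum(np.mean(x == v) * np.log(np.mean(x == v)) for v in np.unique(x))
    return mi / np.sqrt(h(pred) * h(true))

def purity(true, pred):
    """Fraction of points falling in the majority ground-truth class of their cluster."""
    true, pred = np.asarray(true), np.asarray(pred)
    return sum(np.bincount(true[pred == c]).max() for c in np.unique(pred)) / len(true)
```

A clustering that matches the ground truth up to a relabeling scores 1.0 on all three metrics, as expected.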

Experimental setup
The parameters of all compared methods are tuned to achieve their best results, according to the parameter settings in the original papers where the approaches were first proposed. For MVCF, the two parameters involved, γ and λ, are fixed at γ = 5 and λ = 4.5 for all datasets. We initialize W_v and H_v randomly within the range [0,1], and α_v is initialized to λ/n_v. Then we iterate until the objective function converges, obtaining the new data representations H_v. The convergence criterion applied is

( O_{t−1} − O_t ) / O_{t−1} < ε,

where O_t is the objective function value in the t-th iteration of each algorithm and ε is a small threshold. Finally, we obtain the optimal data representation by summing the products of the data representation matrix H_v and its weight over all views. k-means is applied to this optimal data representation for document clustering and is repeated 10 times; the average result in terms of the k-means cost function is noted. Finally, we compare the obtained clusters with the ground truth to compute ACC, NMI, and purity.

Comparisons of performance
The average performance on the three datasets is shown in Tables 1, 2 and 3. In each column of the tables, the best results are highlighted in boldface, and the second best in italic.
For all three datasets, the performance of the proposed MVCF is better than that of the other methods. Specifically, Table 1 shows that the ACC of MVCF is 4.57%, 7.16% and 7.18% higher than the second-best results on the three datasets, respectively. Table 2 shows that MVCF produces the highest NMI, outperforming the second-best results by a significant margin, especially on 3-Sources and Cornell. In addition, the corresponding purity is increased by 15.98%, 8.20% and 4.61% with MVCF, as shown in Table 3. Note that MVCF achieves the largest improvements on the 3-Sources dataset, which contains the most views. This is due to the fact that MVCF utilizes the multiple views efficiently according to the importance of each view, and that the results of CF/NMF-based clustering approaches have better semantic interpretation [1,2,13,14].

Study of convergence
To demonstrate the convergence of MVCF, Fig. 1 illustrates the convergence speed on all three datasets. In each sub-figure, the x-axis and the y-axis denote the iteration number and the corresponding objective function value, respectively. The value of the objective function decreases sharply within the first five iterations and then becomes steady, indicating that MVCF converges efficiently.

Conclusion and future work
In this paper, we proposed MVCF, a multi-view document clustering method. The method fully exploits multi-view feature information and reduces the data dimensionality to achieve better clustering performance. With MVCF, high-dimensional data points from different views are reduced to low-dimensional representations with a more locally consistent structure. Both the new representation matrices and the view weights are learned by MVCF. The clustering labels are obtained by running k-means on the low-dimensional data. We theoretically proved the convergence of our algorithm, which is in accordance with our experiments. The experimental results showed that MVCF achieves higher performance than the state-of-the-art methods in terms of clustering accuracy, normalized mutual information and clustering purity. In the future, we will bring sparsity regularization into MVCF to obtain a more accurate data representation matrix, with which better clustering performance can be expected.