Stochastic Perturbations on Low-Rank Hyperspectral Data for Image Classification

Abstract: Hyperspectral imagery (HSI) contains hundreds of narrow contiguous bands of spectral signals. These signals, which form spectral signatures, provide a wealth of information that can be used to characterize material substances. In recent years machine learning has been used extensively to classify HSI data. While many excellent HSI classifiers have been proposed and deployed, the focus has been more on the design of the algorithms. This paper presents a novel data preprocessing method (LRSP) to improve classification accuracy by applying stochastic perturbations to the low-rank constituent of the dataset. The proposed architecture is composed of a low-rank and sparse decomposition, a degradation function and a constrained least squares filter. Experimental results confirm that popular state-of-the-art HSI classifiers produce better classification results when supplied with LRSP-perturbed data rather than the original HSI data.


I. Introduction
Hyperspectral images (HSI) of an object or a scene are typically acquired by electro-optical remote sensors. This is achievable because the materials in the various objects in a scene inherently reflect, absorb and emit electromagnetic radiation. Modern hyperspectral sensors can capture a portion of the electromagnetic spectrum from the visible region (0.4–0.7 µm) through the near-infrared region (approximately 2.4 µm) in hundreds of narrow contiguous bands about 10 nm wide, calibrated to within 1 nm [1,2]. The radiation detected by these sensors over a sufficiently broad spectral band creates a distinct spectral signature that can be used to characterize and identify a certain kind of material. An example is shown in Figure 1. The image on the left shows an urban area in the Washington DC mall, composed of three of the 210 bands collected by the sensor. The plots on the right show the spectra of three data pixels as a function of wavelength (indexed by spectral band number), each corresponding to a different material (vegetation, roof and water).

The spatial and spectral information of an HSI dataset is typically organized as a data cube, similar to what is depicted in Figure 2. The face and the depth display the spatial coordinates and the spectral bands, respectively. Each band corresponds to a narrowband image of the surface covered by the field of view of the sensor [3]. Therefore, every image pixel is a vector pixel in which each element is a member of a spectral band. The vector provides a spectral signature characterizing the materials within the pixel. Applications of hyperspectral imagery span many disciplines, including but not limited to resource management, agriculture, mineral exploration and environmental monitoring [4][5][6][7].

In recent years machine learning has been used to classify and exploit HSI datasets. Many excellent HSI classifiers have been proposed and deployed, such as Class Feature Weighted Hyperspectral Image Classification (CFW-HSIC), Contextual Information Subspace Based Hyperspectral Image Classification (CIS-HSI), Semi-supervised HSI Classification using Self Training and Data Editing (RDE_self-training), Multiple Features and Nearest Regularized Subspace (MFNRS), Discriminative Marginalized Least Squares Regression (DMLSR) and Iterative Support Vector Machine (ISVM). CFW-HSIC [8] deals with class variability by computing the inter-class feature probabilities as class weights to generalize the commonly used overall accuracy from the features extracted from classes of interest. CIS-HSI [9] uses decision fusion at a super-pixel level to generate the classification result; a super-pixel is formed by exploiting the contextual information among the spatial features in the data. RDE_self-training [10] is a semi-supervised learning algorithm with data editing.
It revises the labels of mislabeled samples based on a nearest-neighbor voting rule. The training set is then enlarged to include some unlabeled samples selected from a pool of candidates drawn from an ordered probability ranking of the revised samples. MFNRS [11] proposes a residual fusion strategy with multiple features, including local binary patterns, Gabor features and the original spectral signatures, to represent the test pixel from different perspectives and enhance the classifier's discriminative ability. DMLSR [12] enhances the discriminative power by taking class separability and data reconstruction ability into account simultaneously. It employs the Fisher criterion and imposes data reconstruction constraints to avoid overfitting and to improve classification performance. ISVM [13] is an iterative version of SVM applied to HSI classification. It extracts spatial information iteratively via feedback loops: a Gaussian filter is used to obtain the spatial information of the classification map, which is then combined with the currently processed hyperspectral cube for the next iteration.
These approaches tend to focus on the design of the classifiers to improve performance. While performance improvement is also our motivation, we take a different approach: we present appropriately preprocessed data, rather than the original data, to the algorithm. In this paper we propose a novel data preprocessing method that improves classification accuracy by perturbing the statistical nature of the HSI data. The perturbations are achieved by adding stochasticity to the low-rank component of the data, hence the name Low-Rank Stochastic Perturbation (LRSP). The new dataset is then presented to an HSI classifier. We will show that off-the-shelf, state-of-the-art HSI classifiers produce higher classification accuracy when using LRSP-generated datasets rather than the original HSI datasets.

II. Proposed Method
Due to the nature of the dataset, many of the problems in HSI classification can be attributed to gradients that are trapped in local optima. The idea behind the LRSP method is to slightly perturb the convexity of the data by adding some stochasticity. The perturbation should be large enough to push the gradients out of a local minimum, as conceptually depicted in Figure 3, yet small enough that it does not change the underlying structure of the dataset. LRSP consists of three parts: LRS decomposition, a degradation function and CLS filtering. Figure 4 shows the design flow of the proposed method. The original HSI dataset is first decomposed into its low-rank and sparse components using the low-rank and sparse (LRS) decomposition method. The low-rank data matrix is the support matrix that represents the union of multiple subspaces, where each subspace holds most of the spatial-spectral energy of the associated endmember. The sparse data matrix contains the fine approximation, providing detailed spatial-spectral resolution for the endmember data. An image degradation function based on an atmospheric turbulence model is applied to the low-rank matrix; this function adds controlled randomness to the HSI data. A constrained least squares (CLS) filter is then used to recover the signal information. The filter performs a deconvolution that removes the corruption caused by the atmospheric turbulence while preserving the stochasticity that has been introduced into the data. The last step is to add the sparse data back to produce the final perturbed HSI dataset, which carries the appropriate amount of perturbation for classification accuracy improvement. We now briefly describe the mathematical formulation of each step.
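To make the flow in Figure 4 concrete, the sketch below strings the three stages together as a single Python function. It is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the helper names `rpca_inexact_alm`, `degrade` and `cls_filter` are hypothetical and are sketched in the subsections that follow, and the default `gamma` is an assumed value (the paper does not report its CLS parameter), while `k` and `noise_var` mirror the values explored in Section III.

```python
# Minimal sketch of the LRSP pipeline in Figure 4 (illustrative only).
# The three helpers are sketched in Sections II.A-II.C below.

def lrsp(X, spatial_shape, k=1e-5, noise_var=0.1, gamma=1e-3):
    """Perturb a 2D HSI data matrix X (pixels x bands) and return the result."""
    L, S = rpca_inexact_alm(X)                          # II.A: low-rank + sparse split
    L_deg = degrade(L, k, noise_var, spatial_shape)     # II.B: turbulence blur + noise
    L_rec = cls_filter(L_deg, k, gamma, spatial_shape)  # II.C: CLS deconvolution
    return L_rec + S                                    # add the sparse detail back
```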

A. Low-Rank and Sparse Decomposition
Since most signal information in high-dimensional data generally has low intrinsic dimensionality, it is desirable to model real-world data as a mixture of several low-rank subspaces [14]. Principal Component Analysis (PCA) can be cast as a constrained optimization problem in which a data matrix is assumed to be composed of a low-rank matrix and a small perturbation (error) matrix whose entries are i.i.d. Gaussian random variables. However, the result of PCA may be quite inaccurate if the entries in the perturbation matrix are arbitrarily large. To address this problem, an approach known as Robust Principal Component Analysis (RPCA) was developed [11]. The Lagrangian formulation is given by:

$\min_{L,S}\ \mathrm{rank}(L) + \lambda \|S\|_0 \quad \text{subject to} \quad D = L + S$    (1)

where $\lambda$ is a regularization parameter, $D \in \mathbb{R}^{m \times n}$ is the original high-dimensional data, $L \in \mathbb{R}^{m \times n}$ is the low-rank matrix, $S \in \mathbb{R}^{m \times n}$ is the sparse error matrix, and $\|S\|_0$ is the $\ell_0$-norm of matrix $S$.
Although this optimization problem is highly non-convex, it has been shown that the nuclear norm, which is the sum of the singular values, is an effective surrogate for the rank of a matrix, and the $\ell_1$-norm for the $\ell_0$-norm. Moreover, data drawn from multiple subspaces should be represented as a union of the subspaces $\bigcup_{i=1}^{k} \mathcal{S}_i = \{x : x \in \mathcal{S}_i,\ 1 \le i \le k\}$. In other words, $X$ is a set of $m$-dimensional vectors strictly drawn from a union of subspaces [12,13]. The objective of subspace recovery for a data matrix $X = X_0 + E_0$ is to recover the row space of $X_0$. It is still a convex optimization problem, given by:

$\min_{Z,E}\ \|Z\|_* + \lambda \|E\|_{2,1} \quad \text{subject to} \quad X = AZ + E$    (2)

where $A$ is a dictionary that linearly spans the data space, $\|Z\|_*$ is the nuclear norm of matrix $Z$ and $\|E\|_{2,1}$ is the $\ell_{2,1}$-norm of matrix $E$. The $\ell_{2,1}$-norm of a data matrix is the sum of the Euclidean norms of its column vectors. The minimizer $Z^*$ is the lowest-rank representation with respect to the dictionary $A$. Since $\mathrm{rank}(AZ^*) \le \mathrm{rank}(Z^*)$, $AZ^*$ also represents a low-rank recovery of the original data, which is equivalent to the matrix $L$ in (1).
Since off-the-shelf solvers are not sufficiently powerful to solve such a problem, a number of excellent algorithms have been developed over the years. The one used for the experiments in this paper is the Inexact Augmented Lagrange Multiplier (IALM) method [18]. This algorithm has been demonstrated to be quite stable and highly accurate, with fast convergence.
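The decomposition can be sketched with the standard inexact ALM iteration: alternate singular value thresholding for the low-rank part with entrywise shrinkage for the sparse part, while updating a Lagrange multiplier. The NumPy sketch below follows the usual form of the algorithm in [18]; the default $\lambda = 1/\sqrt{\max(m,n)}$ and the $\mu$-update schedule are common choices in the literature, not necessarily the paper's exact settings.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entrywise shrinkage, the proximal operator of the l1-norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def rpca_inexact_alm(D, lam=None, tol=1e-7, max_iter=500):
    """Split D into a low-rank L and a sparse S by solving the convex
    relaxation of (1) with the inexact ALM iteration of [18].

    lam defaults to 1/sqrt(max(m, n)); the mu schedule follows common
    practice and is not necessarily the paper's exact setting.
    """
    m, n = D.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))
    norm_D = np.linalg.norm(D, 'fro')
    two_norm = np.linalg.norm(D, 2)                  # largest singular value
    Y = D / max(two_norm, np.abs(D).max() / lam)     # dual variable init
    mu, rho = 1.25 / two_norm, 1.5
    L = np.zeros_like(D, dtype=float)
    S = np.zeros_like(D, dtype=float)
    for _ in range(max_iter):
        # L-step: singular value thresholding of D - S + Y/mu
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-step: shrinkage, the l1 surrogate of the l0 penalty in (1)
        S = soft_threshold(D - L + Y / mu, lam / mu)
        R = D - L - S                                # primal residual
        Y = Y + mu * R
        mu = rho * mu
        if np.linalg.norm(R, 'fro') / norm_D < tol:
            break
    return L, S
```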

B. Degradation Function
An image can fluctuate in an unpredictable manner, both in intensity and in position, as it propagates through turbulence in the atmosphere. The exact level of degradation varies randomly with time and field angle [19]. Degradation modeling has been used for many years to enable successful image restoration. The model frames the environmental conditions and physical characteristics of atmospheric turbulence as a transfer function $h(x, y)$, where $(x, y)$ are the 2D spatial coordinates [20].
Assuming the degradation function is linear and position invariant, the degraded image in the spatial domain is given by:

$g(x, y) = h(x, y) * f(x, y) + \eta(x, y)$    (3)

where $f(x, y)$ is the original image, $\eta(x, y)$ is Gaussian noise, and convolution is denoted by $*$.
Applying the spatial Fourier transform, (3) can be expressed in the frequency domain as:

$G(u, v) = H(u, v)F(u, v) + N(u, v)$    (4)

The form of the transfer function $H(u, v)$ is similar to that of a Gaussian lowpass filter and is given by:

$H(u, v) = e^{-k(u^2 + v^2)^{5/6}}$    (5)

where $k$ is a hyperparameter for the blurring effect associated with the nature of the turbulence.
To be compatible with the degradation function, the HSI dataset must be reshaped from a 3D data cube to a 2D data matrix. The original dataset is composed of $M \times N$ pixel vectors of dimension $B$, where $B$ is the number of spectral bands. The new data matrix has $P$ vectors of dimension $B$, where $P = M \times N$.
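A band-by-band sketch of this degradation step is given below. The frequency coordinates are taken as centered DFT indices, a textbook convention [20] that the paper does not spell out, so the effective scale of `k` here is an assumption; the defaults mirror the values used in Section III, and the `shape` default matches the synthetic cube.

```python
import numpy as np

def degrade(L, k=1e-5, noise_var=0.1, shape=(80, 40)):
    """Apply the turbulence transfer function (5) plus Gaussian noise, band by band.

    L: 2D matrix with M*N pixel rows and B band columns; shape = (M, N).
    """
    M, N = shape
    # Centered DFT index grids, reordered to FFT layout.
    u = np.fft.ifftshift(np.arange(M) - M // 2)[:, None]
    v = np.fft.ifftshift(np.arange(N) - N // 2)[None, :]
    H = np.exp(-k * (u**2 + v**2) ** (5.0 / 6.0))        # transfer function (5)
    rng = np.random.default_rng(0)
    out = np.empty_like(L, dtype=float)
    for b in range(L.shape[1]):
        f = L[:, b].reshape(M, N)
        g = np.real(np.fft.ifft2(np.fft.fft2(f) * H))    # blur, per (4)
        g += rng.normal(0.0, np.sqrt(noise_var), size=g.shape)  # additive noise
        out[:, b] = g.ravel()
    return out
```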

C. Constrained Least Squares Filter
The CLS algorithm does not require the power spectrum of the undegraded image to be known in advance. Using it as a filter restores the image while minimizing the sensitivity introduced by the degradation function. The algorithm can be expressed as a constrained optimization problem [15]:

$\min_{\hat{f}}\ \|\nabla^2 \hat{f}\|^2 \quad \text{subject to} \quad \|g - H\hat{f}\|^2 = \|\eta\|^2$    (6)
The constraint in (6) is essentially an algebraic manipulation of (4) in vector-matrix form, where $\hat{f}$ is the estimate of the undegraded image rather than the original image. The Laplacian of the image, $\nabla^2 f$, is added to alleviate the noise sensitivity problem that inherently exists in the transfer function $H$. The discrete Laplacian for two variables is defined as:

$\nabla^2 f(x, y) = f(x+1, y) + f(x-1, y) + f(x, y+1) + f(x, y-1) - 4f(x, y)$    (7)

The frequency domain solution to (6) is given by:

$\hat{F}(u, v) = \left[ \dfrac{H^*(u, v)}{|H(u, v)|^2 + \gamma\,|P(u, v)|^2} \right] G(u, v)$    (8)

where $\gamma$ is a parameter that must be adjusted to satisfy the constraint in (6) and $P(u, v)$ is the Fourier transform of the Laplacian operator in (7). In vector-matrix form, the solution can be expressed as $\hat{f} = (H^T H + \gamma\, C^T C)^{-1} H^T g$, where $C$ is the matrix form of the Laplacian operator. The best estimate of the undegraded image is obtained by simply taking the inverse Fourier transform of (8).
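A band-by-band sketch of the frequency-domain solution (8) follows. For simplicity, `gamma` is treated here as a fixed hyperparameter rather than iteratively adjusted to meet the constraint in (6) exactly; its default value is an assumption, as are the frequency convention and `shape` default carried over from the degradation sketch.

```python
import numpy as np

def cls_filter(G_mat, k=1e-5, gamma=1e-3, shape=(80, 40)):
    """Band-by-band CLS restoration via the frequency-domain solution (8)."""
    M, N = shape
    u = np.fft.ifftshift(np.arange(M) - M // 2)[:, None]
    v = np.fft.ifftshift(np.arange(N) - N // 2)[None, :]
    H = np.exp(-k * (u**2 + v**2) ** (5.0 / 6.0))         # same H as in (5)
    # P(u, v): DFT of the Laplacian kernel (7), zero-padded to M x N.
    # Only |P|^2 enters (8), so the kernel's placement in the pad is immaterial.
    p = np.zeros((M, N))
    p[:3, :3] = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
    P = np.fft.fft2(p)
    W = np.conj(H) / (np.abs(H) ** 2 + gamma * np.abs(P) ** 2)  # filter in (8)
    out = np.empty_like(G_mat, dtype=float)
    for b in range(G_mat.shape[1]):
        Fhat = W * np.fft.fft2(G_mat[:, b].reshape(M, N))
        out[:, b] = np.real(np.fft.ifft2(Fhat)).ravel()     # inverse FT of (8)
    return out
```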

III. Experiments
The datasets for our experiments consist of a synthetic dataset and two real-world HSI datasets (Salinas-A and Indian Pines) [21]. We employed two commonly used multi-class classifiers for HSI data: Subspace Projection Multinomial Logistic Regression (MLRsub) and Linear Discriminant Analysis (LDA). MLRsub [22] is a subspace projection-based classifier that represents hyperspectral data as a linear mixture model: the spectral response of a pixel is a linear combination of the spectral signatures of the endmembers, weighted by the corresponding abundance fractions. This approach works well when nonlinear class separability is prevalent and projection onto lower-dimensional subspaces is required to discover the manifold. LDA [23] is a multivariate generative classifier. It attempts to maximize the posterior probabilities governed by Bayes' rule, using the statistical properties of the data learned during training. This method is best used when linear decision boundaries can easily separate the classes in the input space.
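As a reference point for how the comparison might be run, the sketch below evaluates the LDA baseline with scikit-learn (the paper's experiments ran in MATLAB, and MLRsub has no standard Python implementation, so only LDA is shown). The 10% stratified train/test split is an assumption for illustration; the paper does not state its sampling protocol here. Calling this once on the original pixel matrix and once on the LRSP output would reproduce the kind of side-by-side comparison reported in Table 2.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def overall_accuracy_lda(X, y, train_fraction=0.1, seed=0):
    """Train LDA on labeled pixels and return overall accuracy.

    X: pixels x bands matrix (original or LRSP-perturbed); y: class labels.
    The stratified 10% training split is an illustrative assumption.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_fraction, stratify=y, random_state=seed)
    clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```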
All the experiments were conducted using MATLAB simulations running on a ThinkPad X1 Yoga Gen 4 with a 1.6 GHz i7 processor and 16 GB of RAM.

A. Datasets
The synthetic dataset has six endmembers. Six pixel vectors were taken from the Salinas-A scene at randomly chosen locations to form an 80 × 40 × 204 HSI cube, and small Gaussian random noise was added. The spectral signatures are shown in Figure 5.
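One plausible construction of this cube is sketched below. The six-region spatial layout, the noise standard deviation and the label map are assumptions; the paper states only that six randomly located Salinas-A pixel vectors were used and that small Gaussian noise was added.

```python
import numpy as np

def make_synthetic(salinas_a, n_classes=6, shape=(80, 40), noise_std=0.01, seed=0):
    """Build an 80 x 40 x 204 synthetic cube from six Salinas-A pixel vectors.

    salinas_a: the Salinas-A cube (rows x cols x 204). Each randomly drawn
    pixel vector is tiled into one of six horizontal regions (an assumed
    layout), then small Gaussian noise is added to the whole cube.
    """
    rng = np.random.default_rng(seed)
    M, N = shape
    B = salinas_a.shape[2]                       # 204 spectral bands
    cube = np.zeros((M, N, B))
    labels = np.zeros((M, N), dtype=int)
    rows = M // n_classes
    for c in range(n_classes):
        r = rng.integers(salinas_a.shape[0])     # random spatial location
        s = rng.integers(salinas_a.shape[1])
        start = c * rows
        end = M if c == n_classes - 1 else (c + 1) * rows
        cube[start:end, :, :] = salinas_a[r, s, :]   # broadcast the endmember
        labels[start:end, :] = c
    cube += rng.normal(0.0, noise_std, size=cube.shape)
    return cube, labels
```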

B. Results
The effect of adding randomness to the dataset is quantified in terms of a dissimilarity measure, computed from the Frobenius norm of the difference between the original and the perturbed datasets:

$d(X, \hat{X}) = \dfrac{\|X - \hat{X}\|_F}{\|X\|_F}$

where $X$ and $\hat{X}$ are the original and the perturbed datasets, respectively. The denominator is included as a normalizing factor.
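Assuming both datasets are held in the 2D pixels x bands layout, the measure is a one-liner in NumPy:

```python
import numpy as np

def dissimilarity(X, X_hat):
    """Normalized Frobenius-norm dissimilarity between original and perturbed data."""
    return np.linalg.norm(X - X_hat, 'fro') / np.linalg.norm(X, 'fro')
```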
As depicted in Table 1, the blurring effect $k$ has a much greater influence on the dissimilarity measure than the variance of the zero-mean Gaussian noise. This is expected, since (4) and (5) show that changing the value of $k$ perturbs the data non-linearly and exponentially, while altering the statistics of the Gaussian noise only produces an additive effect. However, the influence of the noise components should not be underestimated either: a larger variance means more outliers in the data. Although it may not significantly change the statistical characteristics of the dataset, it may cause larger classification errors.

Table 2 shows the overall classification accuracy results with the Gaussian noise variance fixed at 0.1. It is generally accepted that different classifiers may produce different results. Nevertheless, for a given classifier the results are consistent: if the right amount of stochasticity is added to the original dataset, the LRSP method always produces higher classification accuracy. The best results, shown in bold, are obtained with $k = 10^{-5}$. The largest improvement is for the Indian Pines dataset with the MLRsub classifier, from 0.5910 to 0.8850 (49.8%). There is hardly any improvement for Salinas-A with LDA (0.4%) because the accuracy is already very high on the original dataset. Further observation shows that the Salinas-A and Indian Pines datasets contain classes that are linearly separable whereas the synthetic dataset does not, which is why LDA performs better on the former and MLRsub on the latter.

Moving $k$ away from its optimal value degrades the accuracy, in some cases gradually (Figure 10a) and in others dramatically (Figures 9a and 10b). But in all cases, for all the datasets, the right value of $k$ ($k = 10^{-5}$) combined with small variances ($\sigma^2 < 1$) always produces better classification accuracy for the perturbed datasets than for the original datasets. These values define the operating limits of LRSP.

Classification maps for the two real-world HSI datasets are shown in Figures 11 and 12. It can easily be concluded by visual inspection that the maps in Figures 11b, 11d, 12b and 12d are cleaner than those in Figures 11a, 11c, 12a and 12c, respectively, since they are associated with higher overall classification accuracies. As listed in Tables 3, 4 and 5, the per-class classification accuracies for the perturbed datasets are higher than those for the original datasets, consistent with the overall accuracy results.

IV. Conclusion
In this paper we proposed a novel low-rank stochastic perturbation method to improve classification accuracy for hyperspectral images. Applying an atmospheric turbulence degradation function followed by constrained least squares filtering to the low-rank component of the hyperspectral data perturbs the statistical nature of the data in a way that is conducive to higher classification accuracy. The experimental results confirm the effectiveness of this method on synthetic and real-world datasets: popular state-of-the-art HSI classifiers produce better results when the LRSP method is used to preprocess the data. Choosing the appropriate hyperparameter value for the blurring effect and keeping the noise variances small can yield an improvement of up to 49.8%.