Entropy-weighted feature-fusion method for head-pose estimation
 Xiao-Meng Wang^{1} (corresponding author),
 Kang Liu^{1} and
 Xu Qian^{1}
DOI: 10.1186/s13640-016-0152-3
© The Author(s). 2016
Received: 2 March 2016
Accepted: 2 December 2016
Published: 9 December 2016
Abstract
This paper proposes a novel entropy-weighted Gabor-phase congruency (EWGP) feature descriptor for head-pose estimation on the basis of feature fusion. Gabor features are robust and invariant to differences in orientation and illuminance but are not sufficient to express the amplitude character of images. By contrast, phase congruency (PC) functions work well in amplitude expression. Both illuminance and amplitude vary over distinctive regions. Here, we employ entropy information to evaluate orientation and amplitude and to execute feature fusion. More specifically, entropy is used to represent the randomness and content of information. For the first time, we utilize entropy as weight information to fuse the Gabor and phase matrices in every region. The proposed EWGP feature matrix was verified on Pointing’04 and FacePix. The experimental results demonstrate that our method is superior to the state of the art in terms of MSE, MAE, and time cost.
Keywords
EWGP, Head-pose estimation, Entropy weighted, Gabor, Phase congruency, Feature fusion

1 Review
1.1 Introduction
Visual focus of attention (VFoA) is used to estimate at what or whom a person is looking and is highly correlated with head-pose estimation [1]. To study head-pose estimation, three-dimensional orientation parameters from human head images are explored. Head poses convey an abundance of information in natural interpersonal communication (NIC) and human-computer interaction (HCI) [2]; therefore, an increasing number of researchers are seeking more effective and robust methodologies to estimate head pose. Head poses also play a critical role in artificial intelligence (AI) applications and reveal considerable latent significance of personal intent. For example, people nod their heads to represent understanding in conversations and shake their heads to show dissent, confusion, or consideration. Head orientation with a specific finger-pointing direction generally indicates the place that a person wants to go. The combination of head pose and hand gestures is used to assess the target of an individual’s interest [3]. Mutual orientation indicates that people are involved in discussion. If a person shifts the head toward a specific direction, it is highly likely that there is an object of interest in this direction. Therefore, the study of VFoA as an indicator of conversation target in human-computer interaction and facial-expression recognition is increasingly of interest.
Analyzing head poses is a natural capability of humans but remains difficult for AI. Nevertheless, head-pose estimation has been researched for years, and the state of the art can contribute greatly to bridging the gap between humans and AI [4, 5]. Head-pose estimation is generally interpreted as the capability to infer head orientation relative to the observation camera. For example, head pose is exploited to determine the focus point on the screen based on the gaze direction [6]. The factors influencing head-pose estimation and their relationships have been introduced in detail, and the crucial significance of head pose was emphasized in [7]. These factors are mostly related to the surroundings, including camera calibration, head features, glasses, hair, beard, illuminance variations, and image transformations.
To address the shortcomings of existing methods, we concentrate on regional feature extraction based on entropy information. For the first time, we utilize an information-entropy model to assess randomness and content as feature metrics for a specific region. We then employ the more adaptive feature to represent the given region. In addition, the normalized entropy information is regarded as a weight metric to fuse the ultimate feature matrix. The experimental results demonstrate that our feature matrix is superior to the state of the art.
This paper is structured as follows: Section 1.2 provides an exhaustive overview of previous related work in head-pose estimation. Section 1.3 presents the proposed methodology step by step, including a skin model for face detection, Gabor features, PC features, and entropy-weighted Gabor phase congruency (EWGP). Section 1.4 describes the experiments using the Pointing’04 dataset. Finally, in Section 2, we present our conclusions and discuss future work.
1.2 Related work
In general, head-pose estimation approaches can be classified into two types: coarse level and fine level [5]. The former commonly employ algorithms to calculate a few discrete head orientations, such as left, right, and looking up. The latter generally utilize methodologies to compute the continuous pose at accurate angles. Here, we redefine the two levels: coarse-level approaches recognize head-orientation variations through discrete estimation and accurate computation, whereas fine-level approaches indicate the intentions or interests of the experimental subjects. The computational approaches of both levels can be divided into statistical and non-statistical types based on whether they depend on statistical methods.
1.2.1 Statistical approaches
The most classical statistical approach is to exploit classifiers or regression methods to recognize specific discrete head poses. Multi-classification tools such as the support vector machine (SVM) are utilized to estimate discrete head poses. SVM has been employed to locate the iris centers in approximately detected eye regions [10] and to distinguish frontal and look-up head-pose variations in a Carnegie Mellon University (CMU) face dataset [11]. Support vector regression (SVR) is an alternative version that is used for the continuous problem. The differences between SVM and SVR in head-pose estimation have been described in detail [12]. SVR performs well for either horizontal or vertical head-pose variations, whereas SVM performs better for vertical variations than for horizontal ones. If the search range is not extensive, the combination of SVM and SVR is a good option. In addition, whenever the number of classes changes, the SVMs must be retrained from scratch.
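To make the SVM/SVR division of labor concrete, the following hedged sketch (not the authors' code; scikit-learn and the random stand-in feature vectors are our own assumptions) trains an SVM on coarse, discrete pose classes and an SVR on the continuous yaw angle:

```python
# Hypothetical sketch: discrete pose classes via SVM vs. continuous
# angles via SVR. Feature vectors are random stand-ins for the image
# descriptors an actual pipeline would extract.
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 32))           # 90 samples, 32-D features
yaw = rng.uniform(-90, 90, size=90)     # continuous yaw angle in degrees
labels = np.digitize(yaw, [-30, 30])    # 3 coarse classes: left/front/right

clf = SVC(kernel="rbf").fit(X, labels)  # discrete head-pose classes
reg = SVR(kernel="rbf").fit(X, yaw)     # continuous yaw regression

pose_class = clf.predict(X[:1])         # one of {0, 1, 2}
pose_angle = reg.predict(X[:1])         # a yaw-angle estimate in degrees
```

Note that, exactly as the survey observes, changing the set of coarse classes here means refitting `clf` from scratch, while `reg` is untouched.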
Regression is another typical statistical method that is available for both discrete and continuous head-orientation angle estimation. Examples of regression approaches include the aforementioned SVR and multilayer perceptrons (MLP). Regression approaches are classified as linear or nonlinear based on the causal relationships between independent and dependent variables. An MLP can also be trained for fine head-pose estimation over a continuous pose range. In this configuration, the network has one output for each DOF, and the activation of each output is proportional to its corresponding orientation [13–15]. The high dimension of an image presents a challenge for some regression tools. More specifically, regression methods require long, sophisticated training and are sharply sensitive to head localization. In summary, dimension reduction via principal component analysis (PCA) [16], its nonlinear kernel version (KPCA) [17], or localized gradient-orientation histograms [18] is necessary during the above procedure.
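The dimension-reduction step mentioned above can be sketched as follows; this is an illustrative example (the 32×32-patch dimensions and component count are our assumptions, not values from the paper):

```python
# Hedged sketch: PCA reduces high-dimensional image features before
# they are fed to a regressor, mitigating the dimensionality problem.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
features = rng.normal(size=(200, 1024))   # e.g. 200 flattened 32x32 patches

pca = PCA(n_components=20)                # keep 20 principal components
reduced = pca.fit_transform(features)     # shape (200, 20)
```

The 20-dimensional vectors in `reduced` can then be passed to SVR or an MLP in place of the raw 1024-dimensional features.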
Instead of comparing images to a large set of discrete class labels or a series of regression values, the probe image can be measured by a detector array that is likewise trained on many images with supervised learning methods. Detector-array methods are well suited for both high- and low-resolution images. In addition, they are superior in subregional operations. Most importantly, these methods do not require separate head detection and localization. The drawbacks of these schemes are the necessary scale of training, the binary output of the detectors, and low accuracy; in practice, a maximum of 12 different detectors can be formed, which limits the pose-estimation resolution to fewer than 12 states [5].
High-dimensional image samples can lie on a low-dimensional manifold that is constrained to follow the pose variations. Manifold-embedding methodologies, including isometric feature mapping (Isomap) [19, 20], locally linear embedding (LLE) [21], and Laplacian eigenmaps (LE) [22], have shown promise for head-pose estimation by mapping high-dimensional data into a low-dimensional space. Such low-dimensional spaces can then be used for classification or regression. However, the limitation of typical PCA is not averted for nonlinear head-pose variations. Because unsupervised methods are used to build the embedding, these methods cannot incorporate the class labels during head-pose training. Most importantly, the aforementioned techniques cannot ensure that each class is expressed as a single label.
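A minimal manifold-embedding sketch, assuming scikit-learn and synthetic data (an S-curve lifted into 64 dimensions stands in for face images whose only varying factor is pose):

```python
# Hedged sketch: Isomap maps high-dimensional samples onto a
# low-dimensional manifold, as in manifold-based head-pose estimation.
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

rng = np.random.default_rng(2)
X3, _ = make_s_curve(n_samples=100, random_state=2)  # points on a 2-D manifold
X = X3 @ rng.normal(size=(3, 64))                    # lift into 64-D "image" space

embedding = Isomap(n_neighbors=8, n_components=2).fit_transform(X)
# each 64-D sample is mapped to a 2-D point; nearby points share pose
```

As the text notes, this embedding is learned without labels, so pose classes cannot be enforced during the mapping itself; they must be recovered by a downstream classifier or regressor.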
1.2.2 Nonstatistical approaches
Experimental results have revealed considerable differences between statistical methods and non-statistical measurements [23–26]. The former mainly focus on appearance-based measurements, whereas the latter usually consider geometric-relationship cues, such as the deviation of the nose from the midline and the deviation between the new head pose and the original state. In non-statistical methods, flexible models, geometric information, and motion trajectories are employed to estimate head pose.
Flexible models seek to fit non-rigid models to facial features and contribute to the exploration of facial structure in both discrete and continuous head orientations. Among flexible models, active shape models (ASM) [27, 28] and active appearance models (AAM) exhibit higher accuracy and robustness [29]. These approaches permit the direct prediction of head pose when an inherent 3D model constrains the fitting of 2D points. Combining the 3D model with 2D points enables direct head-pose computation using structure-from-motion algorithms. In summary, flexible models have great potential for both high accuracy and good robustness in head-pose estimation, but these qualities are strictly correlated with the positions of the extracted features and the image resolution. Additionally, geometric methods exploit relative feature positions to estimate head pose; however, their accuracy is highly dependent on feature-point extraction [30]. Importantly, the best localization accuracies of the presented approaches are still on the order of 1–2 pixels, and each pixel of error generally corresponds to an angle error of approximately 5°. Consequently, geometric measurements cannot serve as precise head-pose estimates in cases of limited feature-point detection. Motion-trajectory tracking methods that operate between subsequent video frames outperform the other aforementioned methods [27, 31, 32]. In previous work, we employed SIFT feature points and a bio-inspired compound-eye mechanism to explore object-tracking measurements with superior robustness and accuracy [33]. Tracking methods operate in a bottom-up manner, following low-level facial landmarks from frame to frame. Typically, the subject must maintain a frontal pose before the system starts, and the tracking system must be reinitialized whenever the object of interest is lost.
As a result, geometric approaches often rely on manual initialization or a camera view in which the subject’s neutral head pose is forward-looking and easily reinitialized with a frontal face detector [5]. Recently, a number of hybrid approaches have been proposed [33–38] that integrate the remarkable advances of the above statistical and non-statistical methods to provide the best accuracy and robustness in head-pose estimation.
Our proposed method is a hybrid approach: we estimate head poses on the coarse level and compute the orientation angles using machine classifiers and geometric information. Information entropy is a good indicator of information representation with respect to randomness and content. Histogram of gradients (HoG), Gabor, and phase congruency (PC) features are effective and commonly used in direction estimation. However, the dimensionality of these feature matrices is usually too high for image representation, and as image resolution has increased with advances in technology, the curse of dimensionality has become ever more frequent. This paper presents an entropy-weighted method that fuses Gabor and PC features and, for the first time, exploits entropy as a weight metric to reinforce randomness. Additionally, entropy plays an important role in dimension reduction and image annotation. The experimental results prove that our solution is effective in reducing the dimension and shows good accuracy and robustness to variations of head pose.
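Since the entropy of a region drives the weighting throughout this paper, a minimal sketch of Shannon entropy over a gray-level patch may help (the bin count and patch sizes are illustrative assumptions, not the paper's settings):

```python
# Hedged sketch: Shannon entropy of an image region as a measure of
# its randomness/information content, in bits.
import numpy as np

def region_entropy(patch, bins=32):
    """Shannon entropy of a gray-level patch with values in [0, 1]."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                          # ignore empty bins
    return float(-(p * np.log2(p)).sum())

flat = np.full((16, 16), 0.5)             # uniform patch: no randomness
noisy = np.random.default_rng(3).uniform(size=(16, 16))

# a flat patch has zero entropy; a textured patch carries more
# information and therefore receives a larger weight in the fusion
```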
1.3 Proposed methodology
1.3.1 Skincolor model
1.3.2 Gabor features
1.3.3 Phase congruency
A_i indicates the amplitude of the i-th Fourier component, φ_i(x) represents the local phase of the i-th component, and φ̄(x) is the weighted mean of all local phase angles at the objective location. Additionally, for each frequency w_i, A_i is the amplitude of the cosine wave, and φ_i(x) − φ̄(x) is the phase offset of that wave. The term T is related to the size of the image window, and we assign it a value of 1. It is important to note that phase-congruency features differ from one another when dealing with different head-orientation probe images. Consequently, it is necessary to distinguish which filter orientation is more effective in pose estimation. In our case, the Pointing’04 head-pose dataset was utilized to evaluate the phase-congruency features after face detection by the ellipse skin model. To this end, binary edge images were collected.
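The phase-congruency equation itself is elided in this extraction; as a hedged reconstruction, the classical noise-compensated form (due to Kovesi) that matches the terms A_i, φ_i(x), φ̄(x), and T defined here is:

```latex
PC(x) \;=\; \frac{\sum_i \big\lfloor A_i \cos\!\big(\varphi_i(x) - \bar{\varphi}(x)\big) - T \big\rfloor}{\sum_i A_i + \varepsilon}
```

where ⌊·⌋ denotes clamping negative values to zero, T is the noise threshold (set to 1 here), and ε is a small constant preventing division by zero. This reconstruction is our own and may differ in detail from the authors' elided equation.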
1.3.4 EWGP feature fusion
A larger value of D represents a closer relationship with the probe representation. We employ Jeffrey’s entropy as the weight to construct the new feature matrix. Meanwhile, dimension-reduction operations, such as PCA and SVD, are utilized to optimize the Gabor feature matrix and the PC feature matrix. In summary, the advantages of Gabor features and PC features are combined for the first time to estimate head pose in Eq. (9), where W is the normalized entropy weight for Gabor and PC in the (i, j)-th sub-region.
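The per-sub-region fusion step can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function names, the 4×4 grid, the plain Shannon entropy (in place of Jeffrey's entropy), and the convex-combination form of the weight W are all our assumptions.

```python
# Hypothetical sketch of entropy-weighted fusion: in each sub-region,
# Gabor and PC responses are combined with weights derived from the
# normalized entropies of the two feature maps.
import numpy as np

def shannon_entropy(patch, bins=32):
    hist, _ = np.histogram(patch, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def ewgp_fuse(gabor, pc, grid=(4, 4)):
    """Fuse Gabor and PC feature maps sub-region by sub-region."""
    h, w = gabor.shape
    gh, gw = h // grid[0], w // grid[1]
    fused = np.empty_like(gabor)
    for i in range(grid[0]):
        for j in range(grid[1]):
            sl = np.s_[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            eg = shannon_entropy(gabor[sl])
            ep = shannon_entropy(pc[sl])
            wg = eg / (eg + ep) if eg + ep > 0 else 0.5  # normalized weight W
            fused[sl] = wg * gabor[sl] + (1 - wg) * pc[sl]
    return fused

rng = np.random.default_rng(4)
g = rng.uniform(size=(32, 32))   # stand-in Gabor response map
p = rng.uniform(size=(32, 32))   # stand-in PC response map
fused = ewgp_fuse(g, p)
```

In each sub-region, the feature map with higher entropy (more information content) dominates the fused result, which is the intuition behind EWGP.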
1.4 Experiments and analysis of results
Estimation results on Pointing’04

Features   MSE (SVR)   CVA (SVM) (%)
HOG        3.13        85.88
Gabor      3.32        25.92
PC         3.24        86.88
EWGP       0.93        96.79
Comparison of head-pose estimation results on the Pointing’04 database

Method       MAE (yaw)   MAE (pitch)
EWGP-SVM     1.03°       1.00°
EWGP-SVR     1.12°       1.31°
MLDwJ        4.24°       2.69°
Kernel PLS   5.02°       3.54°
Kernel SVM   6.83°       5.91°
Kernel SVR   6.89°       6.59°
2 Conclusions
In this study, a novel entropy-weighted Gabor and phase-congruency (EWGP) feature matrix was built on the basis of feature fusion. We successfully applied EWGP to multi-classification for head-pose estimation in still imagery and a real-time video stream with homogeneous and heterogeneous data. Our experimental results demonstrated that the proposed EWGP method outperforms the state of the art when estimating head pose in terms of MSE, CVA, MAE, and time cost. Unfortunately, head pose only describes the direction in which a person is looking and does not provide information on the object of interest. Therefore, it is necessary to focus on additional information, such as visual saliency in head orientation, gaze direction, and hand gestures. In future work, we plan to expand head-pose estimation to include gaze estimation and obtain a better understanding of the object of an individual’s interest.
Abbreviations
 3D: Three-dimensional
 AI: Artificial intelligence
 DOF: Degree of freedom
 EWGP: Entropy-weighted Gabor and phase congruency
 HCI: Human-computer interaction
 Isomap: Isometric feature mapping
 KPCA: Kernel principal component analysis
 LE: Laplacian eigenmaps
 MLP: Multilayer perceptron
 NIC: Natural interpersonal communication
 PC: Phase congruency
 PCA: Principal component analysis
 SVM: Support vector machine
 SVR: Support vector regression
 VFoA: Visual focus of attention
Declarations
Acknowledgements
Thanks to the partners for providing their test results. Gratitude is also expressed to my mentor Xu Qian for his valuable and helpful suggestions.
Authors’ contributions
XMW proposed the main idea to fuse Gabor and phasecongruency features using information entropy and participated in carrying out the experiments using Pointing’04 datasets to verify the proposed method. KL performed some experiments to verify the relationship between the skincolor model and face extraction and participated in extracting face regions from the Pointing’04 and FacePix datasets. XQ helped with the statistical analysis and drafting of the manuscript. All authors have read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 1. SO Ba, JM Odobez, Multi-person visual focus of attention from head pose and meeting contextual cues. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1), 101–116 (2011)
 2. JJ Magee, M Betke, EyeKeys: a real-time vision interface based on gaze detection from a low-grade video camera. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops 2004, 1–8 (2004)
 3. M Pateraki, H Baltzakis, P Trahanias, Visual estimation of pointed targets for robot guidance via fusion of face pose and hand orientation. Computer Vision and Image Understanding 120, 1–13 (2014)
 4. H Huttunen et al., Computer vision for head pose estimation: review of a competition. Image Analysis, 19th Scandinavian Conference (SCIA 2015), 15–17 June 2015, pp. 65–75 (2015)
 5. E Murphy-Chutorian, MM Trivedi, Head pose estimation in computer vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(4), 607–626 (2009)
 6. DW Hansen, J Qiang, In the eye of the beholder: a survey of models for eyes and gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(3), 478–500 (2010)
 7. CG Healey, JT Enns, Attention and visual memory in visualization and computer graphics. IEEE Transactions on Visualization and Computer Graphics 18(7), 1170–1188 (2012)
 8. W Xiaomeng, L Kang, Q Xu, A survey on gaze estimation. 10th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), Proceedings, 2015, pp. 260–267
 9. MV Sireesha, PA Vijaya, K Chellamma, A survey on gaze estimation techniques. Lecture Notes in Electrical Engineering, vol. 258, 2013, pp. 353–361
 10. MH Nguyen, J Perez, FDL Torre, Facial feature detection with optimal pixel reduction SVM. 8th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2008), September 17–19, 2008
 11. T Sim, S Baker, M Bsat, The CMU pose, illumination, and expression (PIE) database. 5th IEEE International Conference on Automatic Face and Gesture Recognition (FGR 2002), May 20–21, 2002, pp. 53–58
 12. G Guodong et al., Head pose estimation: classification or regression? 19th International Conference on Pattern Recognition (ICPR 2008), 8–11 Dec. 2008, p. 4
 13. Y Xinguo et al., Head pose estimation in thermal images for human and robot interaction. 2nd International Conference on Industrial Mechatronics and Automation (ICIMA 2010), 30–31 May 2010, pp. 698–701
 14. E Seemann, K Nickel, R Stiefelhagen, Head pose estimation using stereo vision for human-robot interaction. Sixth IEEE International Conference on Automatic Face and Gesture Recognition (FGR 2004), May 17–19, 2004, pp. 626–631
 15. R Stiefelhagen, J Yang, A Waibel, Modeling focus of attention for meeting indexing based on multiple cues. IEEE Transactions on Neural Networks 13(4), 928–938 (2002)
 16. Y Li, S Gong, H Liddell, Support vector regression and classification based multi-view face detection and recognition. 4th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2000), March 28–30, 2000, pp. 300–305
 17. R Rosipal, M Girolami, L Trejo, Kernel PCA feature extraction of event-related potentials for human signal detection performance. Computing and Information Systems 7(1), 20–23 (2000)
 18. E Murphy-Chutorian, A Doshi, MM Trivedi, Head pose estimation for driver assistance systems: a robust algorithm and experimental evaluation. 10th International IEEE Conference on Intelligent Transportation Systems (ITSC 2007), September 30 – October 3, 2007, pp. 709–714
 19. B Raytchev, I Yoda, K Sakaue, Head pose estimation by nonlinear manifold learning. 17th International Conference on Pattern Recognition, 23–26 Aug. 2004, pp. 462–466
 20. VN Balasubramanian, S Krishna, S Panchanathan, Person-independent head pose estimation using biased manifold embedding. EURASIP J. Adv. Signal Process. 15, 1–15 (2008)
 21. ST Roweis, LK Saul, Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
 22. M Belkin, P Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6), 1373–1396 (2003)
 23. K Woo Won et al., Automatic head pose estimation from a single camera using projective geometry. 8th International Conference on Information, Communications & Signal Processing (ICICS 2011), 13–16 Dec. 2011, p. 5
 24. AV Puri, H Kannan, P Kalra, Coarse head pose estimation using image abstraction. 2012 Canadian Conference on Computer and Robot Vision, 28–30 May 2012, pp. 125–130
 25. S Sheikhi, JM Odobez, Combining dynamic head pose-gaze mapping with the robot conversational state for attention recognition in human-robot interactions. Pattern Recognition Letters 66, 81–90 (2015)
 26. S Sheikhi, JM Odobez, Investigating the midline effect for visual focus of attention recognition. 14th ACM International Conference on Multimodal Interaction (ICMI 2012), October 22–26, 2012, pp. 221–224
 27. J Min et al., Head pose estimation based on Active Shape Model and Relevant Vector Machine. 2012 IEEE International Conference on Systems, Man and Cybernetics (SMC 2012), 14–17 Oct. 2012, pp. 1035–1038
 28. Y Chen et al., A method of head pose estimation based on Active Shape Model and stereo vision. 33rd Chinese Control Conference (CCC), 28–30 July 2014, pp. 8277–8282
 29. N Mahmoudian Bidgoli, AA Raie, M Naraghi, Probabilistic principal component analysis for texture modelling of adaptive active appearance models and its application for head pose estimation. IET Computer Vision 9(1), 51–62 (2015)
 30. HR Wilson et al., Perception of head orientation. Vision Research 40(5), 459–472 (2000)
 31. G Zhibo et al., A fast algorithm face detection and head pose estimation for driver assistant system. 8th International Conference on Signal Processing, 16–20 Nov. 2006, p. 5
 32. G Garau et al., Investigating the use of visual focus of attention for audio-visual speaker diarisation. 17th ACM International Conference on Multimedia (MM’09), with Co-located Workshops and Symposiums, October 19–24, 2009, pp. 681–684
 33. XM Wang et al., Moving object detection based on bionic compound eye. 2014 International Conference on Materials Science and Computational Engineering (ICMSCE 2014), May 20–21, 2014, pp. 3563–3567
 34. P Yao, G Evans, A Calway, Using affine correspondence to estimate 3D facial pose. IEEE International Conference on Image Processing (ICIP), October 7–10, 2001, pp. 919–922
 35. MAA Dewan et al., Adaptive appearance model tracking for still-to-video face recognition. Pattern Recognition 49, 129–151 (2016)
 36. Sujono, AAS Gunawan, Face expression detection on Kinect using Active Appearance Model and fuzzy logic. Procedia Computer Science 59, 268–274 (2015)
 37. M Linna, J Kannala, E Rahtu, Online face recognition system based on local binary patterns and facial landmark tracking, 2015, pp. 403–414
 38. C Luo et al., Video based face tracking and animation. 8th International Conference on Image and Graphics (ICIG 2015), August 13–16, 2015, pp. 522–533
 39. F Bazyari, Y Tzimiropoulos, An active patch model for real world appearance reconstruction. Computer Vision – ECCV 2014 Workshops, 6–12 Sept. 2014, pp. 443–456
 40. K Otsuka, Multimodal conversation scene analysis for understanding people’s communicative behaviors in face-to-face meetings. Human Interface and the Management of Information: Interacting with Information, Symposium on Human Interface 2011, Held as Part of HCI International 2011, July 9–14, 2011, pp. 171–179
 41. C Garcia, G Tziritas, Face detection using quantized skin color regions merging and wavelet packet analysis. IEEE Transactions on Multimedia 1(3), 264–277 (1999)
 42. M Soriano et al., Adaptive skin color modeling using the skin locus for selecting training pixels. Pattern Recognition 36(3), 681–690 (2002)
 43. RL Hsu, M Abdel-Mottaleb, AK Jain, Face detection in color images. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(5), 696–706 (2002)
 44. G Xin, X Yu, Head pose estimation based on multivariate label distribution. 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 23–28 June 2014, pp. 1837–1842
 45. M Bingpeng et al., CovGa: a novel descriptor based on symmetry of regions for head pose estimation. Neurocomputing 143, 97–108 (2014)
 46. D Byung-Ok Han, HS Yang, Head pose estimation using image abstraction and local directional quaternary patterns for multi-class classification. Pattern Recognition Letters 45, 145–153 (2014)
 47. F Jacob, VK Asari, A two-layer framework for piecewise linear manifold-based head pose estimation. International Journal of Computer Vision 101, 270–287 (2013)