Speaker Recognition: Multispeaker


Determining the number of speakers in a multispeaker signal has long been a problem for high-quality audio systems, especially where the speech of an individual speaker must be separated from the mixture. Various theoretical approaches have been proposed in this regard, such as hypothesis tests with a subjective threshold that detect the number of sources whose mixed signals are collected by an array of passive sensors, using the covariance matrix of the observation vector.

For vocal source features, several researchers suggest using the wavelet transform rather than the Fourier spectrum to represent transient signals. The literature review closes with methodologies reported to improve speaker recognition performance, together with a critical analysis of them. The proposed speaker recognition methodologies build on previous studies of the topic, put forward new approaches and suggest ways to reduce the shortcomings experienced in various applications.

Literature Review from Previous Studies

In automatic speaker recognition, Campbell (1997) applies biometrics to network access-control applications. Campbell (1997) concentrates on the verification and identification tasks, that is, automatic speaker verification (ASV) and automatic speaker identification (ASI), in which a machine recognises a person from his or her voice. Verification decides whether a speaker is the person he or she claims to be, whereas identification decides which member of a known group the speaker is.

Speaker verification is routinely combined with encrypted smart cards that contain identification information, for example for supervised visitation, yet Campbell (1997) argues that text-dependent recognition cannot guarantee authentication. The weak point of automatic speaker recognition here is that neither the accuracy nor the content and duration of the speech is guaranteed. In support of this analysis, Campbell (1997) states “human and environmental factors contribute to accuracy errors, generally outside the scope of” (p.2). The analysis concludes that human errors such as misreading and misspeaking are likely to affect performance regardless of speaker quality (Campbell 1997, p.2).

Campbell (1997) proposes ASV and ASI as the most natural and economical methods for solving such problems. The fine distinctions between speaker classification tasks make it difficult both to compare text-independent approaches meaningfully and to perform text-independent tasks.

Further, Campbell (1997) proposes “text-independent approaches such as Reynold’s Gaussian Mixture Model and Gish’s segmental Gaussian model to be applied in test materials to deal with problems such as sounds and articulation rather than applied in training” (p.3). Lastly, the analysis does not cover comparisons between the binary-choice verification task and the multiple-choice identification task, which makes it difficult to propose future improvements (Campbell 1997, p.3).

Given the growing security problems around speaker verification, Campbell (1997) calls for extensive research on practical speaker-verification applications that reduce fraudulent transactions and crime. However, the errors of automatic speaker-verification systems, namely false acceptance of an invalid user (FA or Type I) and false rejection of a valid user (FR or Type II), are not covered extensively, and the limited speech collected for the speaker-discrimination criteria makes the theoretical measures used difficult to interpret (Campbell 1997).

Campbell (1997) describes the speech signal as a sequence of feature vectors and frames speaker recognition as a pattern recognition paradigm whose stages, namely feature extraction and selection, pattern matching, and classification, are convenient for designing system components. Though effective in segmentation and classification, Campbell (1997) points out that “the components are prone to false demarcation that may subsequently lead to suboptimal designs when interacting with in real-world systems” (p.9).

Campbell (1997) further describes the ROC straight-line model, stating that “the product of the probability of FA and the probability of FR is a constant for this hypothetical system” (p.6). The straight-line model is an idealisation: for a real system the product of the two error probabilities is not constant, so it need not equal the square of the equal error rate (EER) at every operating point (Campbell 1997). Another gap is evident in Campbell (1997, p.19), Fig 20, which shows a complete signal acquisition stage; this stage is unnecessary because the speech signal is already provided in digital form on the YOHO CD-ROM.
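The constant-product relation quoted from Campbell (1997) can be illustrated with a short sketch. It assumes only the hypothetical system in the quotation, where P(FA) multiplied by P(FR) equals a constant c; the function names and the example value of c are illustrative and do not come from the source.

```python
import math


def equal_error_rate(c):
    """EER of a hypothetical system whose errors satisfy P(FA) * P(FR) = c."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    # At the EER operating point P(FA) == P(FR) == EER, hence EER**2 == c.
    return math.sqrt(c)


def false_rejection_rate(c, p_fa):
    """P(FR) implied by the constant-product model at a chosen P(FA)."""
    if not 0.0 < p_fa <= 1.0:
        raise ValueError("p_fa must lie in (0, 1]")
    # A probability cannot exceed 1, so the model is clipped there.
    return min(1.0, c / p_fa)
```

With c = 1e-4, for instance, the model gives an EER of 1%, and lowering P(FA) to 0.1% raises P(FR) to about 10%. A real system's error curve generally departs from this straight line on log-log axes.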

In multimedia applications, audio segmentation based on multi-scale audio classification requires accurate segmentation, whereas most conventional algorithms rely on small-scale feature classification. Audio segmentation here is applied to both audio and video content analysis. Zhang & Zhou (2004) classify four scale levels (large-scale, small-scale, tiny-scale and huge-scale) and further split them into two categories, training and testing. In this way the researchers classify the audio accurately, which yields reliable segmentation results. On the other hand, Zhang & Zhou (2004) do not equip the system with a discriminator for cases with more audio classes, which could also serve as a backup for improving the segmentation algorithm in audio, video and index content analysis.

Lastly, in a simplified early auditory model with application in speech/music classification, Chu and Champagne (2006) use speech/music classification to evaluate classification performance, with a support vector machine (SVM) as the classifier. Chu and Champagne (2006) present useful information on audio and video classification and segmentation. Previous studies have used as many as 13 features for speech and music discrimination. To name just a few, Scheirer and Slaney (1997) include the zero-crossing rate (ZCR), spectral flux and 4 Hz modulation energy. For the classification and segmentation of audio, Zhang and Kuo (2001) propose an automatic approach for audiovisual data that uses the energy function, ZCR, spectral peak tracks and fundamental frequency.
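Two of the features named above, the zero-crossing rate and the short-time energy, are simple enough to sketch directly. The following is a minimal illustration using textbook definitions; the frame size, hop and function names are assumptions made for the example and are not taken from any of the cited studies.

```python
def frames(signal, size, hop):
    """Split a list of samples into frames of `size` samples, `hop` apart."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    if not frame:
        return 0.0
    return sum(x * x for x in frame) / len(frame)
```

Speech typically alternates between low-ZCR voiced frames and high-ZCR unvoiced frames, whereas music tends to vary less from frame to frame, which is why the variation of such features over time is often used as a discriminator.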

On the other hand, Lu and his colleagues (2002) employ a two-stage robust approach capable of classifying an audio stream into silence, speech, environmental sound and music. In a similar analysis, Panagiotakis and Tziritas (2005) propose an algorithm that uses the distribution of the mean signal amplitude and ZCR features for classification and segmentation. These algorithms were trained on clean test sequences, which has been argued to be ineffective because actual test sequences contain background noise at certain SNR levels. Another methodology, proposed by Wang and Shamma (1994), employs an early auditory model; it is reported to exhibit a self-normalisation effect, a property attributed to noise suppression.

According to Scheirer and Slaney (1997), the early auditory model has recently shown excellent performance in classification and segmentation. On the other hand, Chu and Champagne (2006) argue that the early auditory model is computationally expensive because of its nonlinear processing. To reduce computational complexity and improve the performance of the classification and segmentation methodologies mentioned here, Chu and Champagne (2006) therefore propose a simplified, linear version of the early auditory model that remains effective in noisy test cases (Chu and Champagne 2006, p.1,4).

In speaker recognition, Kumara and his colleagues (2007) address the issue of determining the number of speakers from multispeaker speech signals collected by a pair of spatially separated microphones. Kumara et al's (2007) view of sound quality is at odds with the aesthetic desired by various researchers on this particular topic.

Based on their analysis, Kumara et al (2007) argue that spatially separated microphones introduce a time delay in the arrival of the speech signal from a given speaker, because the microphones lie at different distances from that speaker. They estimate this delay by exploiting the fact that the instants of significant excitation of the vocal tract system remain unchanged in the direct components of the speech signals at the two microphones. Kumara et al (2007) point out that existing methods tend to underestimate the number of sources from multisensor data.

For high-quality processing of multispeaker data, Kumara and his colleagues (2007) argue that it is necessary to determine the number of speakers, and then to localise and track them, from the signals collected by a number of spatially distributed microphones. Determining the number of speakers has always been a challenge.

Kumara and his colleagues (2007) list the problems experienced as the difficulty of separating the speech of an individual speaker from a multispeaker signal and the difficulty of collecting multispeaker data in a practical environment, that is, a room with background noise and reverberation. As a solution, Kumara and his colleagues (2007) review a theoretical approach that detects the number of sources whose mixed signals are collected by an array of passive sensors from the eigenvalues of the covariance matrix of the observation vector. This approach uses a nested sequence of hypothesis tests in which the threshold level, i.e. for the likelihood ratio statistic, is decided by subjective judgement. The gap of this research is that the literature review introduces different methodologies without careful consideration of their impact.

For example, a nested sequence of hypothesis tests should not have been applied to estimate the number of sources, since Kumara and his colleagues (2007) themselves argue that this methodology relies on a subjective threshold. They further recommend the minimum description length (MDL) criterion for estimating the number of sources. To reaffirm their findings, Kumara et al (2007) state that “MDL can only be used as a test for determining the multiplicity of the smallest eigenvalues and therefore suitable for this kind of application” (p.1). In my opinion, the researchers should have examined the candidate methodologies more carefully by testing them.
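To make the eigenvalue-based MDL test concrete, the sketch below uses the widely cited Wax and Kailath form of the criterion. That choice is an assumption: the essay does not say which MDL formulation is meant, and the function name, arguments and example eigenvalues are illustrative only.

```python
import math


def mdl_num_sources(eigenvalues, n_snapshots):
    """Estimate the number of sources from covariance-matrix eigenvalues.

    For each candidate k, the p - k smallest eigenvalues are tested for
    equality (pure noise) through the ratio of their geometric and
    arithmetic means, and a complexity penalty grows with k.
    Eigenvalues must be strictly positive.
    """
    lam = sorted(eigenvalues, reverse=True)
    p = len(lam)
    best_k, best_score = 0, float("inf")
    for k in range(p):
        tail = lam[k:]                      # the p - k smallest eigenvalues
        m = p - k
        log_geo = sum(math.log(v) for v in tail) / m
        log_arith = math.log(sum(tail) / m)
        # The log-likelihood term is zero when all tail eigenvalues are equal.
        log_lik = n_snapshots * m * (log_geo - log_arith)
        penalty = 0.5 * k * (2 * p - k) * math.log(n_snapshots)
        score = -log_lik + penalty
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```

For five sensors with eigenvalues 10, 6, 1, 1, 1 and 100 snapshots, the three smallest eigenvalues are equal, so the criterion is minimised at two sources. No threshold has to be chosen by hand, which is the contrast with the nested hypothesis tests described above; the sensitivity to deviations from equal noise eigenvalues is the weakness the essay goes on to discuss.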

Kumara and his colleagues (2007) also argue that methods based on the multiplicity of the smallest eigenvalues are not robust and are therefore not suitable for tracking signals. They attribute this to deviations from the assumed model of the additive noise process. They add that solving the smallest-eigenvalue problem would require exploiting prior knowledge and a multidimensional numerical search. Kumara et al (2007) recommend providing a number of steering vectors and using robust estimators in situations where sensor noise levels are spatially inhomogeneous.

Kumara et al (2007) also argue that methods for estimating the number of sources, such as information-theoretic criteria, assume a model of the mixed signal vector and are typically evaluated on artificially generated mixtures. Real multispeaker signals show more variability because of noise and reverberation, and the direct sound is delayed and decays as the distance of the microphone from the speaker increases (Kumara et al 2007, p.5).

In a multispeaker, multimicrophone scenario in which the speakers are assumed to be stationary with respect to the microphones, Kumara et al (2007) point out that a fixed time delay exists in the arrival of the speech signal between every pair of microphones for a given speaker. Kumara et al (2007) therefore use the cross-correlation function to calculate the time delays in multispeaker signals.

Though useful for computing time delays, the cross-correlation function of the raw speech signals does not show unambiguous prominent peaks at the time delays. Kumara et al (2007) state the reason to be “the damped sinusoidal components in the speech signal due to resonances of the vocal tract, and because of the effects of reverberation and noise since speech signal exhibits relatively high signal-to-noise ratio (SNR) and high signal-to-reverberation ratio (SRR)” (p.2). To reduce these effects, Kumara et al (2007) suggest exploiting the characteristics of the excitation source of the vocal tract and pre-processing the multispeaker signals to emphasise the regions of high SNR and SRR.
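The basic delay estimate discussed above can be sketched as a search for the lag that maximises the cross-correlation of the two microphone signals. This is a plain time-domain illustration on idealised impulse trains, not the pre-processed method of Kumara et al (2007); the function names and the lag convention are assumptions made for the example.

```python
def cross_correlation(x, y, max_lag):
    """Return {lag: sum_n x[n] * y[n + lag]} for lags in [-max_lag, max_lag]."""
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for n, xn in enumerate(x):
            m = n + lag
            if 0 <= m < len(y):
                total += xn * y[m]
        result[lag] = total
    return result


def estimate_delay(x, y, max_lag):
    """Lag (in samples) at which y best matches x; positive means y lags x."""
    r = cross_correlation(x, y, max_lag)
    return max(r, key=r.get)
```

If the second microphone receives the same impulse train three samples later, the function returns a delay of 3. Dividing the lag by the sampling rate converts it to seconds. With real speech the peak is blurred by vocal tract resonances, noise and reverberation, which motivates the pre-processing described in the text.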

Methods of estimating time delays

During voiced speech, the vocal tract system is excited by a quasi-periodic sequence of impulse-like excitations. The excitation occurs at the glottal closure instant (GCI) in each pitch period, and the sequence of instants remains unchanged in the speech signals picked up by the different microphones. The sequences differ only by fixed delays corresponding to the relative distances of the microphones from the speakers. The vicinity of the instants of significant excitation shows a high SNR compared with other regions, because the impulse response of the vocal tract system is damped.

To highlight the high-SNR regions of the speech signal, Kumara et al (2007) recommend linear prediction (LP) analysis by the autocorrelation method; the resulting LP residual shows large amplitude fluctuations around the instants of significant excitation. The time delay can then be estimated from the cross-correlation function of the residual signals from the two microphones. Because these large amplitude fluctuations have random polarity around the GCIs, however, the cross-correlation function does not guarantee strong peaks (Kumara et al, 2007, p.2).

To determine the number of speakers from signals collected by spatially separated microphones, Kumara and his colleagues (2007) propose computing the cross-correlation function of the Hilbert envelopes (HE) of the linear prediction (LP) residuals of the multispeaker signals. However, this research fails to recognise that the Hilbert envelope contains a relatively large number of small positive values, which could result in spurious peaks in the cross-correlation function. In summary, the methodology equates the number of prominent peaks with the number of speakers, which is often not the case in real-world applications.

First, this methodology may not be practical, since not all speakers contribute voiced sounds equally in the segment used for computing the cross-correlation function. Secondly, spurious peaks in the cross-correlation function may not correspond to the delay due to any speaker. As stated earlier, the method relies entirely on the most prominent peaks of the cross-correlation function occurring at the delays due to the speakers (Kumara et al, 2007, p.2).
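The peak-counting step criticised above can be sketched as follows. The relative threshold that decides which peaks count as prominent is a hypothetical parameter introduced for illustration; the essay does not state how Kumara et al (2007) select prominent peaks.

```python
def prominent_peaks(values, threshold_ratio=0.5):
    """Indices of local maxima at least `threshold_ratio` times the global max.

    `values` is a cross-correlation sequence sampled at successive lags.
    The threshold is an illustrative assumption, not a value from the source.
    """
    if not values:
        return []
    top = max(values)
    if top <= 0:
        return []
    peaks = []
    for i in range(1, len(values) - 1):
        v = values[i]
        # Strictly above the left neighbour so a flat top is counted once.
        if v > values[i - 1] and v >= values[i + 1] and v >= threshold_ratio * top:
            peaks.append(i)
    return peaks
```

Counting `len(prominent_peaks(r))` gives the speaker estimate. The two weaknesses noted in the text map directly onto the threshold: a speaker who contributes little voiced speech falls below it and is missed, while a spurious peak above it is counted as an extra speaker.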

Experiments were conducted on multispeaker signals with three, four, five and six speakers. The data were collected simultaneously by two microphones about 1 m apart in a laboratory environment, with a frequency range of 0.5-3.5 kHz and a reverberation time of about 0.5 s. From these studies Kumara and his colleagues (2007) concluded that the locations of the peaks corresponded to the time delays due to the different speakers.

The method relies on the direct component of the signals dominating the reflected (reverberant) components, and it does not assume any specific distribution of the speakers relative to the microphone positions. To overcome the problem of weak signals, Kumara et al (2007) suggest using pairs drawn from several spatially distributed microphones and ensuring that the time delays are nearly constant. Finally, Kumara et al (2007) suggest tracking the time-varying delays to determine the number of speakers in a scenario where the speakers are moving.

Speaker recognition technology is predominantly based on statistical modelling of short-time features extracted from acoustic speech signals. Chan and his colleagues (2007) state that “speaker discrimination power uses vocal sources related features and conventional vocal tract features in speaker recognition technology” (p.1). Recognition performance depends chiefly on the effectiveness of the statistical modelling technique and on the discrimination power of the acoustic features. In a number of applications, the amount of training data for speaker modelling and of test data for recognition is limited.

Chan and his colleagues (2007) therefore suggest testing feature performance to ensure that the acoustic features discriminate well regardless of the amount of speech data being processed. This matters especially for speaker segmentation of telephone conversations, where speech segments are often short and statistical speaker modelling is less reliable (Chan et al, 2007, p.1).

Chan and his colleagues (2007) analyse newly proposed acoustic features and their speaker discrimination power under different training and testing conditions. They note that Mel-frequency cepstral coefficients (MFCCs) and linear predictive cepstral coefficients (LPCCs) are the most common acoustic features in automatic speech recognition (ASR). The primary goal of these features is to identify different speech sounds and provide pertinent cues for phonetic classification. Chan et al (2007) conclude, however, that cepstral features extracted from a spoken utterance are closely related to its linguistic content.

For vocal source features, the Fourier spectrum is regarded as inadequate for capturing the time-frequency properties of the pitch pulses in the residual signal. Chan et al (2007) instead suggest the wavelet transform, which is well known for representing transient signals efficiently. Their proposed WOCOR feature is extracted with a wavelet transform of the residual signal rather than the traditional Fourier transform. The second formula computes WOCOR features from a pitch-synchronous segment of the residual signal with different values of k, so that the signal is analysed at different time-frequency resolutions.

Telephone conversations use a frequency band of 300 to 3400 Hz. At one extreme, all the coefficients of a sub-band are combined into a single feature parameter, so that no temporal information is retained (Chan et al 2007). At the other extreme, a large M is used and each coefficient acts as an individual feature parameter. In that case many unnecessary temporal details are included, and the feature vector tends to be noisy and less discriminative. WOCOR features capture the spectro-temporal characteristics of the residual signal, which are very important for speaker characterisation and recognition. However, the research fails to draw out the point that speaker recognition performance does not improve significantly once M increases beyond 4 (Chan et al 2007, p.3).

MFCCs, as vocal tract features, have been widely used for speech recognition and speaker recognition. When MFCCs are extracted by the standard procedure on a short-time frame basis, WOCOR may not be as effective as MFCC in speaker recognition. Since the two characterise physiologically distinct components of speech production, WOCOR and MFCC contain complementary information for speaker discrimination (Chan et al 2007, p.4).

Speaker segmentation is the task of dividing an input speech signal into homogeneous segments, that is, of locating the speaker turning points as time instants. Chan et al (2007) classify the speech segments separated by the turning points, and note that speaker segmentation algorithms are commonly based on statistical modelling of MFCCs. Little knowledge about the speakers is available here, since speaker models cannot be established beforehand. Chan et al (2007) recommend that speaker models be built from preliminarily hypothesised segments in the speech signal being processed, keeping in mind that WOCOR can be more discriminative than MFCC even though MFCC has routinely been used in most speaker recognition applications.

Statistical models trained on MFCCs describe the speakers’ voice characteristics and, at the same time, model the variation of the speech content. A content mismatch between the test speech and the training speech is therefore evident. The study also uses WOCOR as a representative of vocal source related features and demonstrates its effectiveness in speaker recognition. On the other hand, Chan et al (2007) argue that WOCOR is more discriminative when the amount of training data is small, since the models are built on a limited amount of text-independent data, a characteristic this research fails to take into consideration (Chan et al 2007, p.5).

In the pre-processing stage for speaker diarization, the system attempts to assign temporal speech segments in a conversation to the appropriate speakers, and non-speech segments to non-speech. Literally, speaker diarization attempts to answer the question “who spoke when?” One of the major problems is that such systems are unable to handle the detection and separation of co-channel or overlapped speech. Guterman and Lapidot (2009) say of the existing algorithms that they have “high computational complexity that consume both time and frequency domain analysis of the audio data” (p.2).

Secondly, because a speaker diarization system assigns each temporal speech segment in a conversation to a single generating source, it cannot handle overlapped speech (OS) segments in which multiple speakers take part. Current state-of-the-art systems normally assign such a segment to one speaker in the conversation, thus generating unavoidable diarization errors (Guterman and Lapidot 2009, p.2).

Audio segmentation is applied to media such as TV series and movies. Such material consists of segments of various lengths, with a considerable portion of short ones. Yunfeng et al (2007) describe unsupervised audio segmentation as comprising what they state as a “segmentation stage to detect potential acoustic changes, and a refinement stage to refine candidate changes by a tri-model Bayesian Information Criterion (BIC)” (p.1). Audio segmentation, also known as acoustic change detection, is stated by Yunfeng et al (2007) as one that “detects partitions audio stream into homogenous segments by detecting of speaker identity, acoustic class or environmental condition” (p.2).

Here, Yunfeng et al (2007) describe the first category of audio segmentation as the model-based approach, which relies on models of acoustic classes. Yunfeng et al (2007) further add that “it’s important to have pre-knowledge of speakers and acoustic classes” (p.2), which is contrary to the application considered here. In this scenario segmentation is usually unsupervised, so the model-based approach does not suit many applications. The second category is the metric-based approach. Yunfeng et al (2007) state that the “metric-based approach determines changes by threshold on the basis of distance computation for the input audio stream. The distance measured comes from statistical modelling framework like the Kullback-Leibler distance, a generalised likelihood ratio” (p.4).

Audio segmentation can also combine the two: a metric-based method computes an initial segmentation, a clustering procedure obtains training data, and model-based segmentation follows. BIC operates as the threshold in a two-stage segmentation: first, the audio stream is segmented by a distance measure; secondly, the candidate changes are refined sequentially by BIC (false alarm compensation).
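The role of BIC as a data-driven threshold can be illustrated with the standard two-model delta-BIC test for a single candidate change point. This sketch is not the tri-model BIC of Yunfeng et al (2007), whose details are not given in the essay; it assumes one-dimensional features modelled by a single Gaussian, and the penalty weight is a tunable assumption.

```python
import math


def _variance(xs):
    """Maximum-likelihood variance, floored to avoid log(0)."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-12)


def delta_bic(left, right, penalty_weight=1.0):
    """Delta BIC for splitting a window of 1-D features into left and right.

    A positive value favours two separate Gaussians, i.e. an acoustic
    change at the boundary; a negative value favours a single Gaussian.
    """
    n1, n2 = len(left), len(right)
    n = n1 + n2
    gain = 0.5 * (
        n * math.log(_variance(left + right))
        - n1 * math.log(_variance(left))
        - n2 * math.log(_variance(right))
    )
    # A 1-D Gaussian has two free parameters (mean and variance).
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty
```

A candidate change proposed by the distance measure in the first stage would be kept only if `delta_bic` is positive for the features on either side of it, which is one way to read the false alarm compensation mentioned above.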

However, the evaluated audio normally consists of relatively long acoustic segments of >2s or 3s, together with short segments of 1-3s that are often neglected in this type of segmentation because they are difficult to detect. The unsupervised audio segmentation approach places its emphasis on detecting these short segments. They are frequent in practical media such as movies, TV series, interviews and phone conversations. Detecting short segments is therefore a challenge when audio segmentation is applied in real applications (Yunfeng et al 2007).

For the system framework, Yunfeng et al (2007) employ an approach consisting of five modules: pre-processing, feature extraction, segmentation, refinement and post-processing. The pre- and post-processing modules vary with the properties of the input audio data, which is down-sampled to 16 kHz. Segmentation performance is evaluated with the recall rate (RCL), precision (PRC) and F-measure.
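The three evaluation measures named above can be sketched for change-point detection as follows. The matching rule, in which a detected change counts as a hit when it lies within a tolerance of an unmatched reference change, and the default tolerance are assumptions made for illustration; the essay does not give the exact scoring rule used by Yunfeng et al (2007).

```python
def segmentation_scores(detected, reference, tolerance=0.5):
    """Return (recall, precision, f_measure) for change points in seconds."""
    used = set()
    hits = 0
    for ref in reference:
        for i, det in enumerate(detected):
            # Each detected change may be matched to one reference change only.
            if i not in used and abs(det - ref) <= tolerance:
                used.add(i)
                hits += 1
                break
    recall = hits / len(reference) if reference else 0.0
    precision = hits / len(detected) if detected else 0.0
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure
```

With detected changes at 1.0, 4.2 and 9.0 s against reference changes at 1.1, 4.0, 7.0 and 12.0 s, two changes are matched, giving a recall of 0.5, a precision of about 0.67 and an F-measure of about 0.57. Missed short segments lower the recall, and unmatched detections (false alarms) lower the precision.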

Here, a target change that corresponds to one of the two boundaries is handled with a rigorous method of computing changes within the false-alarm gap. From this analysis, the detection of acoustic changes, including silence, seems insensitive and inaccurate, especially for short segments. For example, Yunfeng et al (2007) report that recall rates for short segments are between 30.9% and 37.6%, whereas the overall statistics show that recall rates could be near 75% if short segments were successfully detected.

Secondly, the average mismatch (AMM), which reflects the accuracy of the computed segment boundaries, indicates that the approach does not have a very good resolution in locating segment boundaries. Aimed at processing real-world media such as broadcast news, which consists of segments of various durations, the study concludes that the tri-model BIC approach shows better segmentation performance and a higher resolution in segment boundary location, and that the refinement stage of the approach is shown by experiment to be efficient (DU).

On speech/music discrimination, Theodoros (2006) describes a “speech and sound discriminator for radio recordings at segmentation stage to be based on the detection of changes in the energy distribution of the audio signal” (p.1). He adds that “automatic discrimination of speakers works on speeches, musical genre classification and speaker speech recognition” (p.1). Theodoros (2006) concludes by stating that “the task of discrimination of speech and music is distributed in the Bayesian Networks adopted in order to combine k-Nearest Neighbor classifiers trained on individual features” (p.1).

This methodology has been successfully tested on real Internet broadcasts of BBC radio stations as a real-time speech/music discriminator for the automatic monitoring of radio channels, based on the energy contour and the zero-crossing rate (ZCR). The incremental approach divides the audio stream into non-overlapping segments with a segmentation algorithm. Segments normally grow one step at a time; when a segment's extension is halted because it has reached a predefined length, an abrupt transition from music to speech can result in poor segment boundaries (Theodoros 2006, p.1).

Dutta and Haubold (2009) begin their analysis of audio-based classification by stating that “human voice contains non-linguistic features that are indicative to various speaker demographics with characteristics of gender, ethnicity and nativity” (p.1). They further add that “helpful cues for audio/video classifications help in content search and retrieval” (p.1). In male/female classification with a linear-kernel support vector machine, automatic speech recognition (ASR) provides good content search cues, obtained by filtering speech segments.

In their experiments, Dutta and Haubold (2009) filter out short fixed-length audio sample windows using the absolute maximum amplitude A of a given speaker segment; this step was noted to be affected by audio quality and by each speaker's use of the microphone. It was also noted that, in feature selection, the sample used was excessive, resulting in overfitting.

This unrestricted sample selection makes it hard to provide comparable classification accuracy and is, in the long run, more computationally expensive. Secondly, in the demographic experiment on multi-class classification, where different classes of people were identified, the 600 samples for each group (African American, Hispanic, Caucasian and South East Asian) were too many to obtain an accurate analysis. In addition, the linear kernel did not provide effective classification accuracy. Accuracy in group classification was not determined, since native and non-native English speakers were not grouped according to their demographic class and the researchers applied different levels of features to the classification of speech (Dutta and Haubold 2009).

Plumpe et al (1999), cited in Hosseinzadeh and Krishnan (2007), list the spectral features used in combining vocal source and MFCC features for speaker recognition as “Spectral band energy (SBE), spectral bandwidth (SBW), spectral centroid (SC), spectral crest factor (SCF), Renyi entropy (RE), spectral flatness measure (SFM) and Shannon entropy (SE)” (p.1). Hosseinzadeh and Krishnan (2007) further add that “to evaluate their performance in terms of spectral features, experimental analysis using Mel frequency cepstral coefficients (MFCC) or LPCC using text-independent cohort Gaussian mixture model (GMM) speaker identification will be employed” (p.1). It is important to note that speaker identification is a biometric tool for securing resources that can be accessed via telephone or the internet (Hosseinzadeh and Krishnan 2007).

Combining vocal source and MFCC features enhances recognition performance by modelling the entire speech signal as a time-varying excitation passed through a short-time-varying filter. Speech signals are thereby modelled by linear convolution. MFCC and LPCC prove effective in speaker recognition, but they do not provide a complete description of the speaker’s speech system in terms of pitch, harmonic structure and spectral energy distribution (Hosseinzadeh and Krishnan 2007, p.4).

Secondly, the linear model used when calculating MFCC and LPCC is not entirely accurate, since the vocal source signal may be predictable for certain vocal tract configurations but not for others. Hosseinzadeh and Krishnan's (2007) experimental analysis shows that good recognition performance is achieved by GMM-based systems. The research also adds that MFCC is very effective in characterising the vocal tract configuration, though the analysis fails to describe the speaker’s speech system completely.

The proposed spectral features are expected to increase the identification accuracy of MFCC-based systems, since they provide some information about the vocal source. The SFM feature, which measures the flatness of the spectrum, is useful for discriminating between voiced and unvoiced components of speech, but in this particular analysis it did not perform well because the characteristic it quantifies was not well defined in speech signals.

Also, the tonality of a sub-band was difficult to define for the speech spectrum, since its energy is distributed across many frequencies. The SC, SCF and SBE features provide vocal source information, since they relate to harmonic structure, pitch frequency and spectral energy distribution, while the entropy features quantify the spectrum in terms of voiced and unvoiced speech. Experimental results showed that the proposed spectral features improved the performance of MFCC-based features, although the study fails to provide a comprehensive analysis to quantify the results (Hosseinzadeh and Krishnan 2007).
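Several of the spectral features named above have simple textbook definitions, sketched below for a single power spectrum. Hosseinzadeh and Krishnan (2007) compute their features per sub-band, and their exact definitions may differ; the function names and the whole-spectrum formulation are assumptions made for illustration.

```python
import math


def spectral_centroid(freqs, power):
    """Power-weighted mean frequency (SC)."""
    return sum(f * p for f, p in zip(freqs, power)) / sum(power)


def spectral_flatness(power):
    """Geometric mean over arithmetic mean (SFM): 1.0 for a flat spectrum."""
    floored = [max(p, 1e-12) for p in power]   # avoid log(0)
    log_geo = sum(math.log(p) for p in floored) / len(floored)
    return math.exp(log_geo) / (sum(floored) / len(floored))


def spectral_crest(power):
    """Peak over arithmetic mean (SCF): large for strongly tonal spectra."""
    return max(power) / (sum(power) / len(power))


def shannon_entropy(power):
    """Entropy in bits (SE) of the spectrum normalised to sum to one."""
    total = sum(power)
    probs = [p / total for p in power if p > 0]
    return -sum(q * math.log2(q) for q in probs)
```

A flat, noise-like spectrum gives a flatness of 1 and the maximum entropy, while a spectrum dominated by one harmonic gives a flatness near 0 and a high crest factor. This contrast is what makes such features candidates for separating voiced from unvoiced speech.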

In unsupervised audio classification, Huang and Hansen (2006) describe partitioning and labelling an input audio stream into speech, music, commercials or acoustic conditions as the main objective of audio segmentation and classification. Their analysis adds that unsupervised audio classification is important for applications such as effective large vocabulary continuous speech recognition (LVCSR), audio content analysis and understanding, audio transcription, audio clustering and indexing (Huang and Hansen 2006). The features used in feature-based methods, which are considered extended-time features and are represented in the time domain (ZCR, energy) or the frequency domain (subband power, low short-time energy ratio), are all described as unsuitable for training a statistical model, especially a GMM with diagonal covariance.
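The two time-domain features named above, zero-crossing rate (ZCR) and energy, can be sketched as follows. The frame length, hop size and test signals are hypothetical, and the definitions are the common per-frame ones rather than anything specific to the cited work.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames (assumes len(signal) >= frame_len)."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs in each frame whose signs differ."""
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

# Illustrative signals: one that flips sign every sample, and a constant one.
alternating = np.tile([1.0, -1.0], 50)  # 100 samples
constant = np.ones(100)

alt_frames = frame_signal(alternating, frame_len=20, hop=10)
const_frames = frame_signal(constant, frame_len=20, hop=10)

zcr_alt = zero_crossing_rate(alt_frames)
zcr_const = zero_crossing_rate(const_frames)
energy_alt = short_time_energy(alt_frames)
```

Extended-time features are typically statistics of such per-frame values collected over a longer window.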

The research should have opted for short-time features such as spectral-based MFCC and perceptual minimum variance distortionless response (PMVDR), which are de-correlated and highly independent across the feature vector and therefore suitable for training a statistical model (Hosseinzadeh and Krishnan 2007). That said, short-term features such as MFCCs, which encode phoneme-level information, are inappropriate for speech and non-speech classification.

Essentially, the features that are effective for audio/speaker recognition differ from those applied in automatic speech recognition (ASR) (Hosseinzadeh and Krishnan 2007). In this regard, Hosseinzadeh and Krishnan (2007) argue that “feature processing methods and modelling concepts successful for ASR may not be necessarily appropriate for segmentation” (p.3). The traditional MFCC features used for ASR may not be as effective for speaker segmentation. The research should consider alternative features such as the line spectral pair (LSP) features noted by Hosseinzadeh and Krishnan (2007), which were used in a multifeature set consisting of MFCC, LSP and pitch features to detect change points, with a Bayesian fusion model then applied to combine the segmentation results (Hosseinzadeh and Krishnan 2007, p.3).

The comprehensive study by Hosseinzadeh and Krishnan (2007) considered advances in unsupervised audio classification for LVCSR and speaker segmentation for multispeaker change detection, proposing two new extended-time features, VSF and VZCR, for audio classification and a novel classification algorithm, WGN. The study did, however, fail to show the robustness of VSF and VZCR and their efficiency in speech/nonspeech classification. The WGN classification algorithm combined a feature-based method with a model-based method, a combination that was not reliable enough to support a conclusive analysis. The statistical analysis did, however, show an improvement in frame accuracy from 93.4% to 96.9% over traditional WGN frameworks, and the system outperformed the baseline at all levels (Hosseinzadeh and Krishnan 2007, p.12).


The opinion of Kumara et al. (2007) on sound quality production is at odds with the desired aesthetic provided by various researchers on this particular topic. They argue that spatially separated microphones result in time delays in the arrival of speech signals from a given speaker, due to the relative spacing of the microphones from the significant excitation of the vocal tract system. Unsatisfied with this methodology, Kumara et al. (2007) propose a theoretical approach for detecting the number of sources.

In another analysis, Chan and his colleagues (2007) argue that speaker discrimination technology based on statistical modelling is likely to rely on short-time features extracted from acoustic speech signals. They further propose the wavelet transform as opposed to the traditional Fourier transform of the residual signal. Evidently, Chu and Champagne (2006) present useful information on audio and video classification and segmentation; however, greater attention to accuracy and content duration is needed in future studies. As a result, I conclude that researchers possess little knowledge about speaker classification and recognition, making it difficult to establish a speaker model first hand and to propose future improvements.

Reference List

Campbell, J.P., 1997. Speaker recognition: A tutorial. Proceedings of the IEEE, 85 (9), pp. 1-26.

Chan, N.W., & Zheng, N., 2007. Discrimination Power of Vocal Source and Vocal Tract Related Features for Speaker Segmentation. IEEE Transactions on Audio, Speech, And Language Processing, 15 (6), pp.1-9.

Chu, W., & Champagne, B., 2006. A Simplified Early Auditory Model with Application in Speech/Music Classification. IEEE CCECE/CCGEI, 1, p.1, 4.

Dutta, P., & Haubold, A., 2009. Audio-based classification of speaker characteristics. New York, NY: Columbia University.

Guterman, H., Ben-Harush, O., & Lapidot, I., 2009. Frame level entropy based overlapped speech detection as a pre-processing stage for speaker diarization: Pure segment selection as speaker diarization postprocessing. Electrical and Electronics Engineers in Israel, IEEE, pp. 1-6.

Hosseinzadeh, D., & Krishnan, S., 2007. Combining Vocal Source and MFCC Features for enhanced speaker recognition performance using GMMs. MMSP, 1, pp. 1-4.

Huang, R., & Hansen, H.L., 2006. Advances in Unsupervised Audio Classification and Segmentation for the Broadcast News and NGSW Corpora. IEEE Transactions on Audio, Speech, and Language Processing, 14 (3), pp. 1-13.

Lu, H., Zhang, J., & Jiang, H., 2002. Content analysis for audio classification and . IEEE Trans. Speech Audio Processing, 10 (7), pp. 504–516.

Panagiotakis, C., & Tziritas, G., 2005. A speech/music discriminator based on RMS and zero-crossings. IEEE Trans. Multimedia, 7, pp. 155–166.

Plumpe, M. D., Quatieri, T. F., & Reynolds, D. A., 1999. Modeling of the glottal flow derivative waveform with application to speaker identification. IEEE Transactions on Speech and Audio Processing, 7 (5), pp. 569–586.

Scheirer, E., & Slaney, M., 1997. Construction and evaluation of a robust multifeature speech/music discriminator. In Proc. ICASSP, 2, pp.1331-1334.

Swammy, K. P., Murty, K.S., & Yegnanarayana, B., 2007. Determining number of speakers from multispeaker speech signals using excitation source information. IEEE Signal Processing Letters, 14 (7), pp. 1-4.

Theodoros, G., Aggelos, P., & Theodoridis, S., 2006. A speech/music discriminator for radio recordings using Bayesian networks. ICASSP, 1, pp. 1-4.

Wang, K., & Shamma, S., 1994. Self-normalization and noise-robustness in early auditory representations. IEEE Trans. Speech Audio Processing, 2(3), pp. 421–435.

Yunfeng, D., Wei, H., Yonghong, Y., & Tao, W., 2007. Audio segmentation via tri-model bayesian information criterion. ICASSP, 205(1), pp. 1-4.

Zhang, Y., & Zhou, J., 2004. Audio segmentation based on multi-scale audio classification. ICASSP, 1, 1-4.

Zhan, T., & Kuo, J., 2001. Audio content analysis for online audiovisual data segmentation and classification. IEEE Trans. Speech Audio Processing, 9(4), pp. 441–457.
