
Procedure of Speaker Recognition


Speaker recognition is a procedure, carried out by a computer program, that validates the identity claimed by an individual by taking into account certain aspects of the human voice. The technical field of speaker recognition attempts to provide a practical and cost-effective means of recognizing an individual from the available audio data. It therefore qualifies as a biometric tool, and a very important one at that.

Speaker recognition and other biometric methods attempt to recognize humans accurately by taking into account physical or behavioral qualities that are intrinsic to them. Speaker recognition is termed a behaviometric, which means that it attempts to recognize humans by one of their intrinsic behavioral qualities. The behavioral quality in question is the human voice, in particular the acoustic aspects of speech that vary between individuals. According to Dutta and Haubold (2009), the human voice conveys speech and is useful in providing gender, nativity, ethnicity and other demographics about a speaker (422).

Additionally, it possesses other non-linguistic features that are unique to a given speaker (422). Biometric methods are built on statistical concepts, which means that the field of speaker recognition is directly and fundamentally related to the field of statistics. The speaker recognition technology in use today is predominantly based on statistical modeling (Chan et al, 2007, p.1884). The statistical model is formed from short-time features that are extracted from acoustic speech signals. The use of speaker recognition dates back as far as four decades.


As with any scientific field, various terms are used in speaker recognition. Enrolment is the term for the first time a person uses a speaker recognition system or any other biometric technology. With respect to speaker recognition, enrolment involves the acquisition and storage of an individual’s relevant voice data. To maintain high levels of robustness in speaker recognition systems, it is fundamentally important to ensure that the storage and retrieval mechanism is restricted to authorized use. The failure to enroll rate (FER) in speaker recognition measures how often attempts at creating a voice template from a given speaker are unsuccessful.

The false accept rate (FAR), also known as the false match rate (FMR), is a performance metric employed in speaker recognition as well as in other biometric technologies. It is a measure of the probability that a particular error occurs. The error in question occurs when a speaker recognition system incorrectly matches a voice sample to the wrong template in the audio repository (database).

The false reject rate (FRR), also referred to as the false non-match rate (FNMR), is another performance metric used in speaker recognition and in biometrics generally. Like the FAR, it is a measure of the probability that a particular error occurs. Here, however, the error occurs when the speaker recognition system fails to detect a match between the voice sample and the correct template in the audio repository (database).
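The two error rates can be estimated empirically from a set of scored trials. The following is a minimal sketch, assuming that the system produces a similarity score for each comparison and accepts a trial when the score reaches a threshold. The function name, the scores and the threshold are all invented for illustration and do not come from any of the cited works.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (FAR, FRR) for a given similarity threshold.

    A trial is accepted when its score is >= threshold.
    genuine_scores: scores of samples compared with their own template.
    impostor_scores: scores of samples compared with someone else's template.
    """
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    far = false_accepts / len(impostor_scores)
    frr = false_rejects / len(genuine_scores)
    return far, frr


# Hypothetical similarity scores, invented purely for illustration.
genuine = [0.91, 0.84, 0.77, 0.58, 0.95]
impostor = [0.12, 0.35, 0.61, 0.28, 0.40]

far, frr = error_rates(genuine, impostor, threshold=0.6)
print(f"FAR={far:.2f} FRR={frr:.2f}")
```

Raising the threshold lowers the FAR at the expense of the FRR, and lowering it does the opposite, which is why the two rates are normally reported together.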

Automatic speaker recognizers (ASRs) is the term given to applications that perform speaker recognition. Automatic speaker recognition (ASR) is the process through which a person is recognized from a spoken phrase with the aid of an ASR machine (Campbell, 1997, p.1437). Machines that are used for speaker recognition purposes are referred to as automatic speaker recognition (ASR) machines.

Countries known to be using speaker recognition systems include the United States, United Kingdom, Iraq, Germany, Italy, Brazil, Israel, India, Canada and Australia. Speaker recognition systems operate in two phases: enrolment is the first and verification is the second. During enrolment, a voiceprint (also known as a template or model) is developed from a speaker’s voice. The development involves recording the voice and extracting certain features from it. During verification, a voiceprint formed during enrolment is compared against a speech sample (also known as an utterance) to determine whether the two match. Speaker recognition systems are classified into two categories: those that are text-dependent and those that are text-independent.

Text-dependent speaker recognition systems require that the text used for enrolment and verification be the same. In text-independent speaker recognition systems, the text need not be the same. Thus, these systems do not require cooperation from the user. In fact, if the system is being applied in a forensic context, the enrolment might happen without the consent of the speaker. Speaker recognition systems are designed and developed to operate in two modes, verification and identification. The similarity between the two modes is that a comparison involving a stored template and a voice sample is carried out.

The templates are stored in a database, so the comparison is carried out by computer. The difference between the two modes lies in the nature of the comparison. In the verification mode the comparison is one to one, whereas in the identification mode it is one to many. Speaker recognition systems and other biometric methods are assessed against seven parameters. The first parameter is universality, which means that the targeted trait (e.g. voice) should be possessed or exhibited by the relevant individuals. The second parameter is uniqueness, which indicates how well the method in use (e.g. speaker recognition) separates individuals.

The third parameter is permanence, which indicates how well the method (e.g. speaker recognition) holds up against changes over time such as aging. The fourth parameter is collectability, which concerns how easily the trait can be acquired and measured. The fifth parameter is performance, which touches on three aspects of the technology applied (e.g. speaker recognition). The three facets used in determining performance are robustness, speed and accuracy.

The sixth parameter is acceptability, a measure of how willing people are to accept the suggested technology (e.g. speaker recognition). The seventh parameter, circumvention, addresses how easily a biometric technology (e.g. speaker recognition) can be fooled by a substitute for the genuine trait. Two factors that are considered when determining a recognizer’s performance are the discrimination power associated with the acoustic features and the effectiveness of the statistical modeling techniques. Overlapped speech (OS) degrades the performance of automatic speaker recognition systems. Conversations over the telephone or during a meeting contain large amounts of overlapped speech.

The following is a block diagram representing a basic speaker recognition system:

Fig. 1: A block representation of a speaker recognition system.

The first block is known as a sensor and it forms the interface between the speaker recognition system and humans. For the sensor to be adequate, all the necessary audio data has to be collected successfully. The second block is known as a pre-processor. The work of the pre-processor is mainly to remove undesired features from the captured audio, and in doing so it improves the input to the rest of the system.

The pre-processor can be perceived as a normalization stage, and one of the undesired inclusions it targets is background noise. The third block is known as a feature extractor and it is used to extract the relevant features needed in developing an audio template. The fourth block is known as a template generator and it is used to create the appropriate template.

A template is described as a synthesis of the features extracted in the third block. From the template generator the template can proceed to a repository, where it is stored until it is needed. This happens only when an enrolment is ongoing. If no enrolment is taking place, the template proceeds directly to a matcher. At the matcher, the template is tested to determine whether a match is found between it and an audio sample obtained from a speaker. The result of the match can form the basis of an application, e.g. one for gaining entrance to restricted areas.
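The flow through the blocks can be sketched in a few lines of code. This is only an illustration under strong simplifying assumptions: the "voice" is a synthetic tone, the features are per-frame log energy and zero-crossing rate, the template is their average, and the matcher is a Euclidean distance with an arbitrary threshold. Practical systems use richer features and statistical models, and every function name here is hypothetical.

```python
import math


def preprocess(samples):
    """Pre-processor: remove the DC offset and scale to unit peak."""
    mean = sum(samples) / len(samples)
    centred = [s - mean for s in samples]
    peak = max(abs(s) for s in centred) or 1.0
    return [s / peak for s in centred]


def extract_features(samples, frame_len=80):
    """Feature extractor: (log energy, zero-crossing rate) per frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
        features.append((math.log(energy + 1e-12), crossings / (frame_len - 1)))
    return features


def make_template(features):
    """Template generator: average the per-frame feature vectors."""
    n = len(features)
    return tuple(sum(f[i] for f in features) / n for i in range(2))


def matches(template, stored, threshold=0.1):
    """Matcher: accept when the two templates are close enough."""
    return math.dist(template, stored) <= threshold


def tone(freq_hz, n=800, rate=8000):
    """Sensor stand-in: a synthetic tone instead of captured speech."""
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]


def voiceprint(samples):
    return make_template(extract_features(preprocess(samples)))


repository = {"alice": voiceprint(tone(200))}  # enrolment: store the template
same_speaker = matches(voiceprint(tone(200)), repository["alice"])
other_speaker = matches(voiceprint(tone(1200)), repository["alice"])
print(same_speaker, other_speaker)
```

During enrolment the template goes into the repository; otherwise it is passed straight to the matcher, mirroring the two paths described above.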

The field of speaker recognition has attracted a lot of research in recent times aimed at improving the practice. The research has seen various proposals being put forward that present new models (approaches) or that suggest improvements on existing ones. One of the reasons for the increased research activities is the need to develop more flexible, practical, accurate and robust systems. Another reason is to increase the performance of the systems.

Two critical areas in the field that have attracted research are audio classification and segmentation. Audio segmentation, according to Zhang and Zhou (2004), is one of the most important processes in multimedia applications (IV-349). Through audio segmentation, an audio stream is broken down into parts that are homogeneous with respect to speaker identity, acoustic class and environmental conditions.

The research activities have centered on reviewing the algorithms used for carrying out these processes and modifying them to achieve certain objectives. One objective is to achieve a classification algorithm that is robust even in noisy environments or backgrounds. Another is to develop a segmentation algorithm for multimedia applications that is more accurate than existing ones. In addition, it is desirable that in these multimedia applications the segmentation can be done over a network, that is, on-line. The fourth objective is to automate the process of speaker recognition by dealing with unsupervised audio segmentation and classification. The fifth objective is to formulate an approach to audio segmentation in the context of practical media. A proposal has been put forward by Chu and Champagne (2006) to aid in achieving the first objective (p.775).

Another proposal has been presented by Zhang and Zhou (2004) aimed at achieving the second and third objectives (IV-349). A third proposal has been presented in the work of Huang and Hansen (2006) that is aimed at achieving the fourth objective (p.907). A fourth proposal has been presented by Du et al that aims at achieving the fifth objective (Du et al, 2007, p.I-205).

In the first proposal, Chu and Champagne (2006) state that, to achieve robustness in noisy environments or backgrounds, their proposed model possesses a self-normalization mechanism (p.775). Their suggestion, known as an auditory model, is a simplified and improved version of an earlier model. Chu and Champagne’s model is a self-normalized FFT-based model that has been applied and tested in speech/music/noise classification.

Shortcomings addressed in this new model are nonlinear processing and high computational requirements that are dominant in the earlier model. Thus, the auditory model proposed by Chu and Champagne is 99% linear and has significantly reduced computational requirements. The proposed model can be described as a three-stage processing sequence. In this sequence, an acoustic signal undergoes a transformation to become an auditory spectrum. The spectrum is the model’s internal neural representation. The modification targets four of the original processing steps namely pre-emphasis, nonlinear compression, half-wave rectification and temporal integration.

Minimization of computing requirements is achieved through the application of Parseval’s theorem, which enables the simplified model to be implemented in the frequency domain. The operation and performance of the proposed model are assessed in a test that uses a support vector machine (SVM) as the classifier. The results of the test indicate that, indeed, the proposed model is more robust in noisy environments than earlier models (Chu and Champagne, 2006, p.775). Additionally, the results suggest that, despite the reduced computational complexity, the performance of the FFT-based spectrum is almost the same as that of the original auditory spectrum (p.775).
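Parseval’s theorem states that the energy of a signal is the same whether it is computed in the time domain or in the frequency domain, which is what makes it possible to move an energy computation from one domain to the other. The short check below illustrates only that identity for the discrete Fourier transform; it is not an implementation of Chu and Champagne’s model, and the test signal is invented.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform, adequate for a short demo."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


# An arbitrary test signal: two sinusoids sampled at 32 points.
signal = [math.sin(2 * math.pi * 3 * t / 32) + 0.5 * math.cos(2 * math.pi * 5 * t / 32)
          for t in range(32)]

# Parseval for the DFT: sum |x[t]|^2 == (1/N) * sum |X[k]|^2
time_energy = sum(v * v for v in signal)
freq_energy = sum(abs(c) ** 2 for c in dft(signal)) / len(signal)
print(time_energy, freq_energy)
```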

In the second proposal, Zhang and Zhou (2004) suggest a two-step methodology aimed at achieving accurate as well as on-line segmentation. Results obtained from experiments reveal that classification of large-scale audio is simpler than classification of small-scale audio. It is this fact that has propelled Zhang and Zhou to develop an extensive framework that increases robustness in audio segmentation. The first step of the methodology is termed rough segmentation while the second is referred to as subtle segmentation. The first step (rough segmentation) involves classification on a large scale.

This step is taken to ensure the integrality of the content segments. It is crucial in achieving homogeneity within each content segment, which is the main aim of the segmentation procedure, because it is in this step that the system ensures that consecutive audio from one source is not partitioned into different pieces. The second step (subtle segmentation) is a locating exercise aimed at finding segment points.

These segment points correspond to boundary regions, which are the output of the first step. Results obtained from experiments also reveal that it is possible to achieve a desirable balance between the false alarm rate and the missing rate. The balance is desirable only when these two rates are kept at low levels (Zhang & Zhou, IV-349). Earlier algorithms that have attempted to deal with the problem of accurate and on-line segmentation have exhibited two shortcomings. The first is that they are designed to handle classification of features at small-scale levels. The second is that they result in high false alarm rates.

The problem that Huang and Hansen (2006) tackle in the third proposal is that of automating the process of speech recognition and spoken document retrieval in cases that involve unsupervised audio classification and segmentation (p.907). They suggest a new algorithm for audio classification to be used in automatic speech recognition (ASR) procedures. Weighted GMM networks form the key feature of this new algorithm. The algorithm includes two variance values: the first is VSF and the second is VZCR. The first is the variance of the spectrum flux whereas the second is the variance of the zero-crossing rate.

VSF and VZCR are, additionally, extended-time features that are crucial to the performance of the algorithm. The two values serve as the criterion for a pre-classification procedure for the audio and additionally attach weights to the output probabilities of the GMM networks. For the segmentation process in automatic speech recognition (ASR) procedures, Huang and Hansen (2006) propose a compound segmentation algorithm (p.907). As the name suggests, the algorithm comprises multiple features. A 2-mean distance metric and a smoothed zero-crossing rate account for two of the 19 features proposed.

A perceptual minimum variance distortionless response (PMVDR) and a false alarm compensation procedure account for another two features of the algorithm. Filterbank log energy coefficients (FBLC) account for 14 of the 19 features proposed. The 14 FBLCs are evaluated in 14 noisy environments, where they are used to determine the best overall robust features with respect to these conditions. Segmentation of short turns, those lasting 5 seconds or less, can be enhanced, and in such cases the 2-mean distance metric can be applied. The false alarm compensation procedure has been found to boost efficiency in a cost-effective manner.

A comparison involving Huang and Hansen’s proposed classification algorithm against a GMM network baseline algorithm for classification reveals a 50% improvement in performance. Similarly, a comparison involving Huang and Hansen’s proposed compound segmentation algorithm against a baseline Mel-frequency cepstral coefficients (MFCC) and traditional Bayesian information criterion (BIC) algorithm reveals a 23%-10% improvement in all aspects (Huang and Hansen, 2006, p. 907).
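The two extended-time features used by the classification algorithm, the variance of the spectrum flux (VSF) and the variance of the zero-crossing rate (VZCR), can be illustrated with a small sketch. It assumes common textbook definitions: the zero-crossing rate is the fraction of adjacent sample pairs that change sign, and the spectrum flux is the squared difference between the magnitude spectra of consecutive frames. The exact definitions, frame sizes and window lengths used by Huang and Hansen are not given here, so these choices and all names are assumptions.

```python
import cmath
import math
import statistics


def split_frames(samples, frame_len):
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def zero_crossing_rate(frame):
    changes = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return changes / (len(frame) - 1)


def magnitude_spectrum(frame):
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def spectral_flux(prev_mag, mag):
    return sum((m - p) ** 2 for p, m in zip(prev_mag, mag))


def variance_features(samples, frame_len=32):
    """Return (VSF, VZCR) over all frames of the given signal."""
    frames = split_frames(samples, frame_len)
    zcr = [zero_crossing_rate(f) for f in frames]
    mags = [magnitude_spectrum(f) for f in frames]
    flux = [spectral_flux(a, b) for a, b in zip(mags, mags[1:])]
    return statistics.pvariance(flux), statistics.pvariance(zcr)


def tone(cycles_per_frame, n_frames, frame_len=32):
    """Synthetic test tone; the phase offset keeps samples away from zero."""
    return [math.sin(2 * math.pi * cycles_per_frame * t / frame_len + 0.5)
            for t in range(frame_len * n_frames)]


steady = tone(2, 8)                  # one unchanging sound
changing = tone(2, 4) + tone(8, 4)   # the sound changes halfway through

vsf_steady, vzcr_steady = variance_features(steady)
vsf_changing, vzcr_changing = variance_features(changing)
print(vsf_steady, vzcr_steady, vsf_changing, vzcr_changing)
```

Both variances stay near zero for the unchanging tone and grow when the character of the audio changes within the analysis window, which is the kind of behaviour a pre-classification step can exploit.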

The fourth proposal, the work of Du et al (2007), presents audio segmentation as a problem in practical media such as TV series and movies (p.I-205). TV series, movies and other forms of practical media exhibit audio segments of varying lengths. Short audio segments, those that are 5 seconds long or less, are easily noticeable since they outnumber all the others. Du et al. (2007) have formulated an approach to unsupervised audio segmentation to be used in all forms of practical media.

Included in this approach is a segmentation-stage during which potential acoustic changes are detected. A refinement-stage is also included during which the detected acoustic changes are refined by a tri-model Bayesian Information Criterion (BIC). Results from experiments suggest that the approach possesses a high capability for detecting short segments (Du et al, I-205). Additionally, the results suggest that the tri-model BIC is effective in improving the overall segmentation performance (Du et al, I-205).
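The idea behind BIC-based change detection can be shown with the standard two-model form of the criterion, in which a window of features is modelled either by one Gaussian or by two Gaussians split at a candidate point. The sketch below uses a one-dimensional feature and deterministic synthetic data. It is not the tri-model variant proposed by Du et al, whose details are not reproduced here, and the penalty weight is an assumed tuning parameter.

```python
import math
import statistics


def delta_bic(window, split, penalty_weight=1.0):
    """Delta-BIC for splitting a 1-D feature sequence at index `split`.

    Positive values favour two separate Gaussians, i.e. an acoustic change.
    """
    left, right = window[:split], window[split:]
    n, n1, n2 = len(window), len(left), len(right)
    gain = 0.5 * (n * math.log(statistics.pvariance(window))
                  - n1 * math.log(statistics.pvariance(left))
                  - n2 * math.log(statistics.pvariance(right)))
    # A second Gaussian adds a mean and a variance: 2 extra parameters in 1-D.
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty


# Deterministic synthetic features: the second half of `changed` has a shifted mean.
base = [math.sin(1.7 * t) for t in range(100)]
uniform = [math.sin(1.7 * t) for t in range(200)]
changed = base + [5.0 + v for v in base]

print(delta_bic(changed, 100) > 0, delta_bic(uniform, 100) > 0)
```

In a segmentation-stage the candidate split would be moved along the window and the point with the largest positive value kept as a potential change, which a refinement-stage could then confirm or discard.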

Researchers have not confined themselves to improving audio segmentation and classification but have sought to better other areas too. One of these areas is that of speaker discrimination, where a proposal has been put forward by Chan et al that introduces a new procedure for undertaking the process (Chan et al, 2007, p.1884). Another area is that of speaker diarization, where a proposal by Ben-Harush et al introduces an improved speaker diarization system (Ben-Harush et al, 2009, p.1).

The proposal put forward by Chan et al (2007) is rooted in an analytic study of the speaker discrimination power of two types of vocal features (p.1884). The features targeted relate either to the vocal source or to the conventional vocal tract. The analysis draws a comparison between the two. The first type of features, those related to the vocal source, are known as wavelet octave coefficients of residues (WOCOR).

These features have to be extracted from the audio signal. In order to perform the extraction, linear predictive (LP) residual signals have to be derived. This is because the LP residual signals are compatible with the pitch-synchronous wavelet transform that performs the actual extraction. WOCOR are discriminative even when the quantity of training data is limited, and they are less sensitive to spoken content.

These two merits make them appropriate for the task of speaker segmentation in telephone conversations (Chan et al, 1884). Such a task requires that statistical speaker models be established from short segments of speech, as they occur in such conversations. Additionally, experiments have shown that the use of WOCOR causes a noticeable decrease in the errors that occur during the segmentation process (Chan et al, 2007, p.1884).
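The LP residual mentioned above is what remains of a signal after a linear predictor has removed the part that can be predicted from the preceding samples, so it mostly reflects the excitation (vocal source) rather than the vocal tract. The sketch below derives it with the autocorrelation method and the Levinson-Durbin recursion on a synthetic signal: an impulse train passed through a known second-order filter. The pitch-synchronous wavelet transform that produces WOCOR is not shown, and all names and values are illustrative assumptions.

```python
def autocorrelation(x, max_lag):
    return [sum(x[t] * x[t - lag] for t in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def lpc(x, order):
    """Levinson-Durbin recursion.

    Returns coefficients a[1..order] such that x[t] is approximated
    by sum(a[k] * x[t - k] for k in 1..order).
    """
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= 1.0 - k * k
    return a[1:]


def lp_residual(x, coeffs):
    """Prediction error: the part of x[t] not explained by past samples."""
    order = len(coeffs)
    return [x[t] - sum(coeffs[k] * x[t - 1 - k] for k in range(order))
            for t in range(order, len(x))]


# Synthetic "speech": impulses every 80 samples through a resonant filter.
signal = []
for t in range(400):
    excitation = 1.0 if t % 80 == 0 else 0.0
    prev1 = signal[-1] if t >= 1 else 0.0
    prev2 = signal[-2] if t >= 2 else 0.0
    signal.append(1.3 * prev1 - 0.8 * prev2 + excitation)

coeffs = lpc(signal, order=2)
residual = lp_residual(signal, coeffs)
signal_energy = sum(v * v for v in signal)
residual_energy = sum(v * v for v in residual)
print(coeffs, residual_energy / signal_energy)
```

The estimated coefficients come out close to the filter that generated the signal, and the residual is close to the original impulse train, which is the source-like signal that a later transform would analyse.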

According to Ben-Harush et al (2009), the problem that speaker diarization systems seek to solve is captured in the question “who spoke and when did the speaking take place?” (p.1). Speaker diarization systems are functions that map temporal speech segments in a conversation to the appropriate speaker (Ben-Harush, 2009, p.1). Background noise and other non-speech segments are mapped into the set of non-speech elements. An inherent shortcoming of most of the diarization systems in use today is that they are unable to handle speech that is overlapped or co-channeled. To this end, algorithms have been developed in recent times seeking to address this challenge.

However, most of these require particular conditions in order to perform and entail high computational complexity. They also require that the audio data be analyzed in both the time and frequency domains. Ben-Harush et al. (2009) have proposed a methodology that uses frame-based entropy analysis, Gaussian Mixture Modeling (GMM) and well-known classification algorithms to counter this challenge (p.1). To perform overlapped speech detection, the methodology suggests an algorithm that is centered on a single feature. This feature is an entropy analysis of the audio data in the time domain.

To identify overlapped speech segments, the methodology uses the combined power of Gaussian Mixture Modeling (GMM) and well-known classification algorithms. An advantage of the proposed methodology is that it removes the need for a fixed threshold for any given conversation or recording. The methodology proposed by Ben-Harush et al is reported to detect 60.0% of frames containing overlapped speech (p.1). This value is achieved when the segmentation is at baseline level (p.1), and it is achieved while the false alarm rate is maintained at 5% (p.1).
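A frame-level entropy feature in the time domain can be sketched as follows. The example assumes that entropy is estimated from a histogram of the sample amplitudes in each frame, with amplitudes normalized to the range -1 to 1 and an arbitrary choice of 16 bins. The estimator actually used by Ben-Harush et al may differ, and the GMM and classification stages are not shown.

```python
import math
from collections import Counter


def frame_entropy(frame, bins=16):
    """Shannon entropy (in bits) of the amplitude histogram of one frame.

    Amplitudes are assumed to lie in [-1, 1].
    """
    counts = Counter(min(int((s + 1.0) / 2.0 * bins), bins - 1) for s in frame)
    total = len(frame)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Synthetic frames, invented for illustration.
wide = [math.sin(2 * math.pi * t / 64) for t in range(64)]   # uses the full range
narrow = [0.02 * s for s in wide]                            # barely moves
flat = [0.5] * 64                                            # a single amplitude

print(frame_entropy(wide), frame_entropy(narrow), frame_entropy(flat))
```

Computing this value for every frame yields a single feature stream, which a statistical model could then use to label frames without a fixed, hand-set threshold.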

Campbell has conducted research aimed at improving the process of automatic speaker recognition. Automatic speaker recognition (ASR) systems are designed and developed to operate in two modes depending on the nature of the problem to be solved. The first mode is known as automatic speaker identification (ASI) while the second is known as automatic speaker verification (ASV). In ASV procedures, the person’s claimed identity is authenticated by the ASR machine using the person’s voice. In ASI procedures, unlike ASV ones, there is no claimed identity, so it is up to the ASR machine to determine the identity of the individual and the group to which the person belongs. Known sources of error in ASV procedures are shown in the table below.

Tab.1: Sources of verification errors.

Misspoken or misread prompted phrases
Stress, duress and other extreme emotional states
Multipath, noise and any other poor or inconsistent room acoustics
The use of different microphones for verification and enrolment, or any other cause of channel mismatch
Sicknesses, especially those that alter the vocal tract
Time-varying microphone placement
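The difference between the two modes described before the table, verification (one to one) and identification (one to many), can be made concrete with a short sketch. It assumes that each enrolled speaker is represented by a fixed-length feature vector and that similarity is measured with the cosine of the angle between vectors. The vectors, names and threshold are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(sample, claimed_id, templates, threshold=0.9):
    """ASV: one-to-one comparison with the claimed speaker's template."""
    return cosine(sample, templates[claimed_id]) >= threshold


def identify(sample, templates):
    """ASI: one-to-many search for the best-matching enrolled speaker."""
    return max(templates, key=lambda speaker: cosine(sample, templates[speaker]))


# Hypothetical voiceprints, invented purely for illustration.
templates = {"ana": [0.9, 0.1, 0.3], "ben": [0.2, 0.8, 0.5]}
sample = [0.85, 0.15, 0.35]

print(verify(sample, "ana", templates), verify(sample, "ben", templates))
print(identify(sample, templates))
```

Verification needs a claimed identity and a threshold, whereas identification needs neither but must search the whole repository.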

According to Campbell, a new automatic speaker recognition system is available and the recognizer is known to perform with 98.9% correct identification (p.1437). Signal acquisition is the first basic building block of the recognizer. Feature extraction and selection is the second, pattern matching is the third, and a decision criterion is the fourth.

Finally, research work by Hosseinzadeh and Krishnan has been aimed at improving the field of speaker recognition. According to Hosseinzadeh and Krishnan (2007), the concept of speaker recognition possesses seven spectral features. The seven features are used for quantification, which is important in speaker recognition because vocal source information and the vocal tract function complement each other. The vocal tract function is determined using either the MFCC or the LPCC. MFCC stands for Mel-frequency cepstral coefficients and LPCC stands for linear prediction cepstral coefficients.

Very important in an experiment done to analyze the performance of these features is the use of a speaker identification system (SIS). A text-independent cohort Gaussian mixture model is the speaker identification method used in the experiment. The results reveal that these features achieve an identification accuracy of 99.33%. This accuracy level is achieved only when these features are combined with MFCC-based features and when undistorted speech is used.


Considering the level of research that has taken place in recent times in the field of speaker recognition, and considering that more needs to be done, it will be useful for researchers in the field to apply the concept of knowledge integration. Knowledge integration makes it possible to join various ideas into a single structure that is logical and coherent. By achieving knowledge integration, an individual or organization is able, first, to make use of available knowledge to formulate solutions to the problems or challenges that they face during growth. Secondly, knowledge integration helps to expose underlying assumptions and inconsistencies by reconciling conflicting ideas.

Thirdly, knowledge integration helps an individual or organization to identify areas of incoherence, uncertainty and disagreement; it does this by synthesizing different perspectives. Finally, by weaving different ideas together, knowledge integration achieves a whole that is greater than the sum of its parts. Thus, researchers are able to create a recognizer that is more practical and cost-effective.

The proposals put forward by the researchers should be assimilated to everyday life since they have been tried and found to be significant improvements. The results obtained from the experiments done serve as strong evidence that these methodologies do improve the practice of speaker recognition.

It is also important for the researchers to develop a culture of continuous quality improvement (CQI). CQI refers to the formal approach applied in analyzing performance as well as improving it. Two commonly used CQI procedures are the Plan-Do-Check-Act (PDCA) system and the Failure Mode and Effect Analysis (FMEA). By applying either of these, the researchers can be assured of developing products that meet the set requirements of performance.


The field of speaker recognition is useful in today’s world considering the challenges people face in terms of security. Thus, the efforts that go into improving it should be noticed and rewarded. Governments and other institutions should fund research and development in this field without hesitation. People, on the other hand, should embrace the technology as it enhances their safety and the safety of their possessions.


Ben-Harush, O., Guterman, H. & Lapidot, I. (2009) Frame level entropy based overlapped speech detection as a pre-processing stage for speaker diarization, pp. 1-6. Israel: Jabotinsky.

Campbell, J. P. (1997) Speaker recognition: a tutorial, 85(9), pp. 1437-1462.

Chan, W. N., Zheng, N. & Lee, T. (2007) Discrimination power of vocal source and vocal tract related features for speaker segmentation, 15(6), pp. 1884-1892.

Chu, W. & Champagne, B. (2006) A simplified early auditory model with application in speech/music classification. Mc University, pp. 775 – 778.

Du, Y., Hu, W., Yan, Y., Wang, T. & Zhang, Y. (2007) Audio segmentation via tri-model bayesian information criterion, pp. I-205 – I-208. China: Intel China research.

Dutta, P. & Haubold, A. (2009) Audio-based classification of speaker characteristics, pp. 422 – 425. New York; Columbia University.

Giannakopoulos, T., Pikrakis, A. & Theodoridis, S. (2006) A speech/music discriminator for radio recordings using Bayesian networks, pp. V-809 – V-812. Greece: University of Athens.

Huang, R. & Hansen, J. H. L. (2006) Advances in unsupervised audio classification and segmentation for the broadcast news and ngsw corpora, 14(3), pp. 907-919.

Krishnan, S. & Hosseinzadeh, D. (2007) Combining vocal source and mfcc features for enhanced speaker recognition performance using gmms. Canada: Ryerson University.

Swamy, R., Murti, S. K. & Yegnanarayana, B. (2007) Determining number of speakers from multispeaker speech signals using excitation source information, 14(7), pp. 481-484.

Zhang, Y. & Zhou, J. (2004) Audio segmentation based on multi-scale audio classification, pp. IV-349 – IV-352. China: Tsinghua University.
