Channel: Recent Discussions - Hemoroizi Forum

The MFCC feature extraction process (Mermelstein, 1976)

The MFCC feature extraction process is as follows (Mermelstein, 1976):

(i) The b-th syllable of $s_b$ is sliced into C shorter excerpts called frames, $f_c \in \mathbb{R}^{l_f}$ with c = 1, …, C, each of length $l_f$. Typically, the spectral content is not present in the complete segment, but only during a certain time window. Thus, inaccuracies in the original segmentation are corrected. The frame length $l_f$ is a fixed parameter, but the number of frames depends on the length of the syllable, $l_s$.

(ii) The Fourier transform, for d pre-defined frequencies, is taken for each of these excerpts in order to calculate the power spectrum. Consequently, the frequency bands of interest in the frame are identified.

(iii) The power spectrum is mapped to the Mel-frequency scale (Eq. (3)). In the Mel scale the frequency bands are not equally spaced, which more closely approximates the response of the animal auditory system (e.g., some individuals are unable to discern the difference between two closely spaced frequencies).

$$v_{mel} = \frac{1000}{\log 2} \cdot \log\left(1 + \frac{v_{freq}}{1000}\right) \qquad (3)$$

where $v_{mel} \in \mathbb{R}^d$ is the vector of the original frequencies $v_{freq} \in \mathbb{R}^d$ mapped to the Mel frequency scale.

(iv) A Mel-spaced filter bank of z filters (an algorithm parameter) is applied along the modified power spectrum in order to identify the energy present in each frequency region. For the methodology proposed in this paper, the selected filters are triangular and half-overlapping, with center frequencies uniformly distributed along the Mel frequency scale.

(v) The log of the energy of each filter is obtained. Sound intensity is not perceived on a linear scale by the auditory system of the studied species, so this nonlinearity should be taken into account.

(vi) The discrete cosine transform (DCT) of the log energies is taken. Filter-bank energies are quite correlated with each other because the filters of the filter bank all overlap. The DCT decorrelates the energies.

(vii) Only the lower 12 DCT values are kept. This is because increasing the accuracy of the parametric representation by adding parameters (12 or more) increases complexity and eventually does not lead to better results due to stability issues. The larger the number of parameters in a model, the larger the training sequence needed (Mermelstein, 1976).

(viii) The resulting n features (in this case 12 scalar numbers) are called Mel Frequency Cepstral Coefficients, $m \in \mathbb{R}^n$ with n = 12, and they are calculated for every c-th frame excerpted from the b-th syllable of $s_b$. MFCCs can be understood as a modification of the conventional cepstrum that adapts the signal processing to the vocal specificities of the studied species (anurans): it emphasizes the frequency bands where their vocal apparatus works. The feature extraction is illustrated in Fig. 4.

Fig. 4. Feature extraction: MFCC estimation. Each frame of the syllable is frequency-transformed, processed through a Mel-spaced filter bank and then decorrelated using a discrete cosine transform in order to obtain the Mel Frequency Cepstral Coefficients.

(ix) Finally, the mean value of the MFCCs of all C frames is calculated, obtaining one vector $\bar{m} \in \mathbb{R}^n$ per syllable. It is then normalized (Eq. (4)) and used as input for the classification stage.

$$\hat{m}_j = \frac{\bar{m}_j - \bar{m}_{min}}{\bar{m}_{max} - \bar{m}_{min}} \qquad (4)$$

where $\bar{m}_{min}$ and $\bar{m}_{max}$ are the minimum and maximum values of $\bar{m}$ respectively, $\hat{m} \in \mathbb{R}^n$ is the normalized vector $\bar{m}$, and $\hat{m}_j$ is the value of the j-th MFCC in $\hat{m}$.
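Steps (i) to (ix) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation, and it makes several assumptions the text does not state: frames are non-overlapping and unwindowed, the d frequencies are the FFT bin frequencies of a frame, the defaults `l_f=256` and `z=24` are arbitrary, the "lower 12" DCT values are taken to be indices 0 to 11 of an unnormalized DCT-II, and a small epsilon guards the logarithm. All function names are hypothetical.

```python
import numpy as np

def hz_to_mel(v_freq):
    """Eq. (3): v_mel = (1000 / log 2) * log(1 + v_freq / 1000)."""
    return 1000.0 / np.log(2.0) * np.log1p(np.asarray(v_freq, dtype=float) / 1000.0)

def mel_to_hz(v_mel):
    """Inverse of Eq. (3), used only to place the filter edges in Hz."""
    return 1000.0 * np.expm1(np.asarray(v_mel, dtype=float) * np.log(2.0) / 1000.0)

def mel_filter_bank(z, l_f, sample_rate):
    """(iv) z triangular, half-overlapping filters with Mel-uniform centers."""
    freqs = np.fft.rfftfreq(l_f, d=1.0 / sample_rate)  # the d bin frequencies
    # z + 2 points uniform on the Mel scale: each filter spans three of them.
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), z + 2))
    bank = np.zeros((z, freqs.size))
    for i in range(z):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def syllable_features(syllable, sample_rate, l_f=256, z=24, n=12):
    """Return the normalized mean MFCC vector of one syllable (Eq. (4))."""
    syllable = np.asarray(syllable, dtype=float)
    # (i) Slice into C frames of length l_f (assumption: no overlap, no window).
    C = syllable.size // l_f
    if C == 0:
        raise ValueError("syllable is shorter than one frame")
    frames = syllable[: C * l_f].reshape(C, l_f)
    # (ii) Power spectrum of every frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # (iii)-(iv) Energy collected by each Mel-spaced filter.
    energies = power @ mel_filter_bank(z, l_f, sample_rate).T
    # (v) Log energies; the epsilon (an assumption) avoids log(0).
    log_energies = np.log(energies + 1e-12)
    # (vi)-(vii) DCT-II of the log energies, keeping only the first n values.
    k = np.arange(n)[:, None]
    i = np.arange(z)[None, :]
    dct_basis = np.cos(np.pi * k * (i + 0.5) / z)
    mfcc = log_energies @ dct_basis.T  # shape (C, n): one row per frame
    # (ix) Mean over the C frames, then min-max normalization, Eq. (4).
    m_bar = mfcc.mean(axis=0)
    span = m_bar.max() - m_bar.min()
    if span == 0.0:
        return np.zeros_like(m_bar)
    return (m_bar - m_bar.min()) / span
```

Calling `syllable_features(x, sample_rate)` on a one-dimensional array `x` holding a segmented syllable returns a vector of length `n` whose entries lie in [0, 1], because Eq. (4) rescales each syllable by its own minimum and maximum coefficient.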
