Abstract
vlrMemos is an app to record voice or audio memos and to measure the quality of the voice.
The app will calculate and display, in real time, acoustic parameters such as the LTAS (Long-Term Average Spectrum) and the HPR (High-Frequency Power Ratio).
When playing recordings, the application will also calculate and display other parameters such as the Jitter, the Shimmer and the HNR (Harmonic to Noise Ratio).
Advanced parameters such as the CPP (Cepstral Peak Prominence) and the CPPS (Smoothed Cepstral Peak Prominence), which are reliable measures of dysphonia, will also be calculated and displayed. Finally, the spectral parameters measuring the slowdown of the spectrum, the loss of complexity of the signal or the similarity between several channels will be calculated and displayed.
During sound rendering, before decompression, personalized FIR (Finite Impulse Response) filters, generated from normalized and non-normalized audiograms, will be applied to the channels in the Fourier domain (fast convolutions), for highly optimized and tailored audio outputs.
The app will be available for computers, tablets, smartphones, smartwatches and connected objects.
It can use an audio codec (audio compression and decompression method). This very fast and high-quality audio codec is based on FFT (Fast Fourier Transform) and can be accelerated with the GPU support (Graphics Processing Unit), for a very low battery consumption.
This codec is quasi-lossless in energy: the energy of an uncompressed frame is almost the same as the energy of the compressed frame.
This codec can provide the audio in 3D. During the sound renderings, before the decompressions, generic or personalized HRTF (Head-Related Transfer Function) filters are applied to the channels in the Fourier domain (very fast operations), for high quality 3D positional audio outputs.
The app will be compatible with the body sounds, the physiological signals and the variability data.
Using this app, one can:
- Detect anomalies in the voice.
- Monitor the effectiveness of a treatment of the voice.
- Monitor the progress during a training of the voice or during a speech therapy.
- Record and analyze the heartbeat sounds and the lung sounds.
- Record and analyze the physiological signals.
- Optionally, perform the sonification of the physiological signals and the variability data.
- Optionally, send the average values of some parameters in the form of codes of intensity and / or color (light notifications) to connected bulbs or to bridges of connected bulbs.
- Optionally, display the curves of values for the selected acoustic parameters.
- Optionally, compute and display the heart rate variability (HRV).
Description
Several acoustic parameters exist for measuring the quality of the voice.
One can mention:
- The LTAS: Long-Term Average Spectrum. To measure the quality of the voice. It provides an objective measure of this quality, whose evaluation usually depends on auditory perception.
- The HPR: High-Frequency Power Ratio. To allow the detection of breathy voices. It compares the proportion of the acoustic energy in the high frequencies to the proportion of energy in the low frequencies.
- The Jitter: To measure the irregularities in the durations. It measures the short-term disruption of the fundamental frequency of the sound signal.
- The Shimmer: To measure the irregularities in the amplitudes. It measures the short-term disruption of the amplitude of the sound signal.
- The HNR: Harmonic to Noise Ratio. To measure the harmonicity of the signal. It measures the level of noise in the sound signal.
- The CPP (Cepstral Peak Prominence) and the CPPS (Smoothed Cepstral Peak Prominence): To estimate the severity of dysphonia. These parameters are good predictors and reliable measures of dysphonia.
- The Spectral Entropy (Shannon and Renyi): To measure the degree of complexity of a signal. A noisy signal will tend towards 1 whereas a pure tone will tend towards 0.
- The Spectral Rolloff Point: To measure the frequency below which 95% of the magnitude distribution lies.
- The Spectral Centroid: To measure the center of mass of the spectrum.
- The MCCC: Multichannel Cross Correlation Coefficient. To measure the degree of similarity between several channels.
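As an illustration of the spectral parameters listed above, here is a minimal sketch in Python. It is not the vlrMemos implementation: the function names are hypothetical, a naive DFT stands in for a real FFT, only the Shannon entropy is shown (normalized by the logarithm of the number of bins so that it lies between 0 and 1), and the HPR band edges are parameters that would be set, for example, to the ranges given in the Examples of Results section.

```python
import cmath
import math

def magnitude_spectrum(frame, sample_rate):
    """Naive O(n^2) DFT; returns (frequencies, magnitudes) for bins 0..n/2."""
    n = len(frame)
    freqs, mags = [], []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        freqs.append(k * sample_rate / n)
        mags.append(abs(acc))
    return freqs, mags

def spectral_centroid(freqs, mags):
    """Magnitude-weighted mean frequency (center of mass of the spectrum)."""
    return sum(f * m for f, m in zip(freqs, mags)) / sum(mags)

def spectral_rolloff(freqs, mags, fraction=0.95):
    """Lowest frequency below which `fraction` of the summed magnitude lies."""
    threshold = fraction * sum(mags)
    running = 0.0
    for f, m in zip(freqs, mags):
        running += m
        if running >= threshold:
            return f
    return freqs[-1]

def spectral_entropy(mags):
    """Shannon entropy of the normalized power spectrum, scaled to [0, 1]."""
    power = [m * m for m in mags]
    total = sum(power)
    probs = [p / total for p in power if p > 0.0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(mags))

def hpr_db(freqs, mags, split_hz, high_limit_hz):
    """Energy in [split_hz, high_limit_hz] relative to energy below split_hz, in dB."""
    low = sum(m * m for f, m in zip(freqs, mags) if f < split_hz)
    high = sum(m * m for f, m in zip(freqs, mags)
               if split_hz <= f <= high_limit_hz)
    return 10.0 * math.log10(high / low)
```

With this sketch, a pure tone gives an entropy close to 0 and a single impulse (flat spectrum) gives an entropy close to 1, which matches the behaviour described for the Spectral Entropy.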
Optionally, the following parameters will be calculated and displayed:
- The COG: Center of Gravity. To measure the musical timbre or the brightness of the sound signal.
- The Average Absolute Deviation: To measure the deviation of the magnitudes or the fundamental frequencies around the mean.
- The Standard Deviation: To measure the dispersion of the magnitudes or the fundamental frequencies around the mean.
- The Skewness: To measure the asymmetry of the distribution of the magnitudes or the fundamental frequencies.
- The Kurtosis: To measure the flattening of the distribution of the magnitudes or the fundamental frequencies.
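The four statistical descriptors above can be sketched as follows. This is only an illustration: the function name is hypothetical, and the population (divide by n) forms of the moments are an assumption, since the text does not say which estimator the app uses.

```python
import math

def distribution_stats(values):
    """Population statistics of a list of magnitudes or fundamental frequencies."""
    n = len(values)
    mean = sum(values) / n
    # Average absolute deviation around the mean.
    aad = sum(abs(v - mean) for v in values) / n
    # Standard deviation (population form, an assumption).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Third and fourth standardized moments.
    skewness = sum((v - mean) ** 3 for v in values) / n / std ** 3
    kurtosis = sum((v - mean) ** 4 for v in values) / n / std ** 4
    return {"mean": mean, "aad": aad, "std": std,
            "skewness": skewness, "kurtosis": kurtosis}
```

The kurtosis returned here is the raw fourth standardized moment (3 for a normal distribution), not the excess kurtosis.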
vlrMemos is an app that will make it possible to record voice or audio memos and to display these parameters.
The LTAS and the HPR will be displayed in real time and during the playback of voice recordings.
The Jitter, the Shimmer, the HNR, the CPP, the CPPS and other parameters will be displayed during playback.
The values displayed in real time will be calculated from the uncompressed samples.
The values displayed during playback will be calculated from the compressed or uncompressed samples, depending on the save option.
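To make the definitions of the Jitter and the Shimmer more concrete, here is a minimal sketch of the common "local" variants: the mean absolute difference between consecutive cycles, divided by the mean value. The function names are hypothetical, the choice of the local variant is an assumption, and the glottal cycle durations and peak amplitudes are assumed to have been extracted beforehand by a pitch tracker that is not shown.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to the mean value."""
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value

def jitter_percent(periods_s):
    """Local jitter (%) from a list of consecutive cycle durations in seconds."""
    return 100.0 * local_perturbation(periods_s)

def shimmer_percent(amplitudes):
    """Local shimmer (%) from a list of consecutive cycle peak amplitudes."""
    return 100.0 * local_perturbation(amplitudes)
```

A perfectly periodic signal with constant amplitude gives 0% for both measures; the values grow as the cycle-to-cycle irregularities increase.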
Data will be saved in memory or on disk in the classic WAVE format (compressed or uncompressed). By default, the VLC HQ 48 codec and the compressed WAVE format will be offered.
Optionally, the W64 format (Sony Pictures Digital Wave 64) will be offered, with compressed or uncompressed samples. This format supports files larger than 4 GB.
VLC HQ 48 Codec
Very fast and high quality audio codec, using FFT.
The recordings will be in compressed WAVE format. They directly contain the coded values of the frequencies (positions), magnitudes and phases.
Because the codec works in the frequency domain with FFT, there is no need to perform an FFT during playback to recalculate the acoustic parameters when the compressed WAVE format is used. With the uncompressed WAVE format, the FFT must be computed again.
Because the codec uses FFT, it can be accelerated with GPU programming.
In addition, it uses the greatest points and the most energetic bands. The bands are encoded independently of each other, so they can be encoded in parallel and can also use GPU acceleration.
GPU programming will make very low battery consumption possible.
More information about this codec is available at the following addresses:
- Algorithms
- VLB
It should be noted that the current version of the codec is quasi-lossless in energy: the energy of an uncompressed frame is almost the same as the energy of the compressed frame.
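One possible way to check this property is sketched below. It is not the codec's own test: the function names are hypothetical and a naive DFT stands in for the FFT. It relies on Parseval's theorem, which says that the energy of a frame can be computed either from its samples or from the magnitudes of its spectrum, so the comparison can be made directly on stored magnitudes.

```python
import cmath
import math

def dft(frame):
    """Naive O(n^2) discrete Fourier transform of a real frame."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame))
            for k in range(n)]

def energy_time(frame):
    """Energy of a frame computed from its samples."""
    return sum(x * x for x in frame)

def energy_freq(spectrum):
    """Same energy computed from the full spectrum (Parseval's theorem)."""
    return sum(abs(c) ** 2 for c in spectrum) / len(spectrum)

def energy_ratio_db(reference_energy, decoded_energy):
    """Energy of the decoded frame relative to the reference, in dB."""
    return 10.0 * math.log10(decoded_energy / reference_energy)
```

A ratio close to 0 dB between the uncompressed frame and the decoded frame would correspond to the quasi-lossless behaviour described above.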
There is no psychoacoustic model: all points can be taken into account.
There is no concept of similar frames, a concept that is useful for communications.
It should also be noted that compression requires less memory, limits the amount of data to transfer and saves storage space.
Without compression, with one channel, 16 bits per sample and a 48 kHz sampling rate, one second of voice occupies 0.768 Mbit (megabits), 30 seconds occupy 23.040 Mbit, one minute occupies 46.080 Mbit and 5 minutes occupy 230.400 Mbit.
With compression by the VLC HQ 48 codec at 64000 bps, and without additional lossless compression, one second of voice occupies 0.064 Mbit (megabits), 30 seconds occupy 1.92 Mbit, one minute occupies 3.84 Mbit and 5 minutes occupy 19.2 Mbit.
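The figures above follow from simple arithmetic, which the short sketch below reproduces. The function name is only illustrative.

```python
def megabits(bits_per_second, seconds):
    """Size in megabits (10^6 bits) of a stream at a constant bit rate."""
    return bits_per_second * seconds / 1_000_000

PCM_BPS = 1 * 16 * 48_000   # one channel, 16 bits per sample, 48 kHz
CODEC_BPS = 64_000          # VLC HQ 48 bit rate quoted in the text

if __name__ == "__main__":
    for seconds in (1, 30, 60, 300):
        print(seconds, "s:",
              megabits(PCM_BPS, seconds), "Mbit uncompressed,",
              megabits(CODEC_BPS, seconds), "Mbit compressed")
```

At these settings the uncompressed stream is 12 times larger than the compressed one (768000 / 64000).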
This codec will optionally support multichannel audio.
VLC HQ 16 Codec
To take into account the sounds of the body (very low frequencies) and very long recording durations, a lower sampling rate (16 kHz or less instead of 48 kHz) will be used.
The VLC HQ 16 codec will also optionally support multichannel, for the transmission of data such as the ECG (ElectroCardioGram).
Data such as the EEG (ElectroEncephaloGram) and the EMG (ElectroMyoGram) will be supported. The ABP (Arterial Blood Pressure) waveform data and the plethysmographic waveform data (from pulse oximetry) will be supported too. Lastly, the blood glucose waveform data will be supported.
The multichannel mode will be compatible with the USB 2.0 Audio Interface.
If useful, an additional lossless compression will be used.
The number of frames per second is about 31.25 for the audio. It will be around 0.5 to 2.0 frames per second for the physiological signals.
More information about the inclusion of the ECG data and the use of this codec for telemonitoring is available at the following address:
- Telemonitoring
VLC 3D 48 and VLC HQ 3D 48 Codecs
These codecs will be compatible with the 3D positional audio.
The customizable HRTF (Head-Related Transfer Function) filters will be applied to outputs in mono, stereo or multichannel.
The custom HRTF filters are useful not only for the 3D audio effects, but also as hearing aids for the hard of hearing.
An interesting property, found in no audio codec that is not based on FFT, should be noted: since the compressed frames are directly in the Fourier domain, no FFT is needed in order to apply the HRTF filters.
- More Information
Custom FIR Filters
Custom FIR (Finite Impulse Response) filters can be loaded for all the codecs and all the sampling rates.
This is useful for personalized audio output and hearing corrections. The filters are generated from text files containing the relative sensitivity of each ear at different frequencies (such as audiogram data).
The length of the FIR filters may be up to 1536 samples for a single channel with a sampling frequency of 48 kHz. The FIR filters are applied in the Fourier domain (fast convolutions).
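The idea of a fast convolution can be sketched as follows: the signal block and the filter taps are zero-padded to a common length, transformed, multiplied bin by bin, and transformed back. This is a minimal illustration, not the vlrMemos code: the function names are hypothetical, a naive DFT replaces the FFT for brevity, and a real implementation would process successive frames with a block scheme such as overlap-add.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(n^2) DFT or inverse DFT of a list of real or complex values."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * i / n)
                  for i, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out

def fast_convolve(signal, taps):
    """Linear convolution computed as a product in the Fourier domain."""
    n = len(signal) + len(taps) - 1          # length that avoids circular wrap-around
    sig_spec = dft(list(signal) + [0.0] * (n - len(signal)))
    tap_spec = dft(list(taps) + [0.0] * (n - len(taps)))
    product = [a * b for a, b in zip(sig_spec, tap_spec)]
    return [c.real for c in dft(product, inverse=True)]

def direct_convolve(signal, taps):
    """Reference time-domain convolution, used here only to check the result."""
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for j, t in enumerate(taps):
            out[i + j] += s * t
    return out
```

When the frames are already stored in the Fourier domain, the forward transform of the signal can be skipped, which is the saving the text refers to.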
Heart Rate Variability (HRV)
Optionally, we will compute and display the heart rate variability (HRV) from heartbeat sounds, from ECG signals or from variability data. More information on the HRV is available at the following address:
- Heart Rate Variability
Variability Data
We are interested in data such as the changes in the heart rate as a function of time or the changes in the systolic blood pressure as a function of time. These data are used to calculate the heart rate variability or the blood pressure variability. There are typically 60 to 100 samples per minute, therefore 5 minutes of data occupy a buffer of 300 to 500 samples. Frames containing 1024 samples will be issued. Other types of data can be considered.
The input data consist of lines in the text CSV format (time,data). If there are N channels, the lines will be in the form:
- (time1,data1,time2,data2,...,timeN,dataN).
The (minimum) sampling rate after interpolation will be:
- sampling rate = (total samples / total time).
The (minimum) number of frames per second will be:
- frames per second = (sampling rate / 1024).
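The two formulas above can be sketched as follows, together with a simple linear interpolation onto a uniform time grid. The function names are hypothetical, and linear interpolation is an assumption: the text does not say which interpolation method is used.

```python
def sampling_rate(total_samples, total_time_s):
    """Minimum sampling rate in Hz: total samples divided by total time."""
    return total_samples / total_time_s

def frames_per_second(rate_hz, frame_size=1024):
    """Minimum number of frames per second for frames of `frame_size` samples."""
    return rate_hz / frame_size

def parse_csv(text):
    """Parse single-channel 'time,data' lines into (time, value) pairs."""
    rows = []
    for line in text.strip().splitlines():
        t, v = line.split(",")[:2]
        rows.append((float(t), float(v)))
    return rows

def resample_uniform(rows, rate_hz):
    """Linearly interpolate (time, value) pairs onto a uniform grid at rate_hz."""
    times = [t for t, _ in rows]
    values = [v for _, v in rows]
    count = int((times[-1] - times[0]) * rate_hz) + 1
    out, j = [], 0
    for i in range(count):
        t = times[0] + i / rate_hz
        # Advance to the input segment [times[j], times[j + 1]] that contains t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        frac = (t - times[j]) / (times[j + 1] - times[j])
        out.append(values[j] + frac * (values[j + 1] - values[j]))
    return out
```

For example, 400 heartbeat intervals collected over 300 seconds give a sampling rate of about 1.33 Hz, so a 1024-sample frame covers about 768 seconds of data.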
Very low or very high sampling rates are not a problem for our codecs.
The recordings will be compressed or uncompressed WAVE files, depending on the save option.
Instead of displaying the values of the LTAS or of the HPR, we will display the spectral energy for low frequencies (LF), the spectral energy for the high frequencies (HF) and the LF/HF ratio.
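A minimal sketch of the LF/HF computation is given below. The band limits (0.04 to 0.15 Hz for LF and 0.15 to 0.40 Hz for HF) are the ones conventionally used in HRV analysis and are an assumption here, since the text does not state them; the function name is hypothetical and a naive DFT replaces the FFT.

```python
import cmath
import math

def band_powers(series, sample_rate, lf=(0.04, 0.15), hf=(0.15, 0.40)):
    """Return (LF power, HF power, LF/HF ratio) of a uniformly sampled series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]     # remove the DC component
    lf_power = 0.0
    hf_power = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        power = abs(coeff) ** 2
        if lf[0] <= freq < lf[1]:
            lf_power += power
        elif hf[0] <= freq <= hf[1]:
            hf_power += power
    return lf_power, hf_power, lf_power / hf_power
```

The series is assumed to have been resampled onto a uniform time grid beforehand, and to contain some energy in the HF band (otherwise the ratio is undefined).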
Sonification
Sonification concerns the physiological signals and the variability data. During recording or playback, by default, these signals or data produce no sound; only the parameter values for a channel or for the average of the channels are displayed.
Optionally, a sound will be generated per channel (multichannel will be possible) using a good-quality sonification algorithm: sonification by spectral mapping (Spectral Mapping Sonification).
Spectral mapping sonification makes it possible to monitor all the frequencies or a specific band of frequencies.
Recent studies have shown, for example, that the difference between a normal heart rate and an abnormal heart rate can be heard thanks to the sonification of ECG signals.
More information on data sonification with vlrMemos is available at the following address:
- Data Sonification with vlrMemos
Send and Share
There are no immediate plans for sending and sharing features for the files created by vlrMemos. It will be possible to use messaging apps that can send files (WhatsApp, Skype, ...).
It will be possible to play WAVE files (uncompressed or compressed with vlrMemos) located in readable directories.
During playback with vlrMemos, custom FIR and HRTF filters can be used. With the VLC codecs, these filters can be used more efficiently, because there is no need to move to the Fourier domain first.
Crowdfunding
We will consider the following operating systems:
- Windows.
- Android and Android Wear.
- iOS (iPhone, iPad) and watchOS (Apple Watch).
- Windows Phone.
Optionally, all the systems supported by PJSIP may be considered.
The programming of the GPU will be done via Compute Shaders (available from OpenGL 4.3 and OpenGL ES 3.1).
Optionally, there will be support for the MARE (Asynchronous Multicore Runtime Environment) SDK, dedicated to the Android operating system and the Qualcomm Snapdragon SoCs (Adreno GPU), and for the Metal SDK, dedicated to the iOS operating system.
More information on the GPU support with mobile devices is available at the following addresses:
- Wearables
- MARE SDK
- Metal SDK
For the rewards, see the crowdfunding website.
Unlike vlrPhone, which is completely Open Source, vlrMemos will include some proprietary parts, chiefly the graphical interface.
PJSIP, all the VLC and VLR codecs as well as other libraries are Open Source. Open Source libraries will be statically or dynamically linked to the proprietary modules. The source code of all the Open Source libraries will be public.
Usefulness
The calculated and displayed acoustic parameters will make it possible to measure the quality of the voice in order to:
- detect anomalies in the voice;
- monitor the effectiveness of a treatment of the voice;
- monitor the progress during a training of the voice or during a speech therapy.
One can also use them for the purpose of classification and research.
Data below the thresholds that can be considered pathological will be shown in red.
Optionally, this app will also display the curves of values for the selected acoustic parameters.
The app will be compatible with non-speech body sounds in real situations.
It will be possible to record and analyze heartbeat sounds and lung sounds.
The app will be compatible with long recordings.
Our codecs are based on FFT and directly store the values of frequencies, magnitudes and phases. In addition to data compression, the spectral density of the signals is available when playing recordings, without computing an FFT again.
The spectral density indicates the power of each frequency component of the signal. It can be used to analyze a variety of physiological signals directly.
A change of voice is called dysphonia, and a voice reduced to a whisper is called aphonia.
Diseases resulting in voice disorders can be treated or delayed more effectively if they are detected earlier.
Examples include:
- Acute laryngitis.
- Vocal misuse or abuse and dysfunctional dysphonia.
- Benign lesions of the vocal cords.
- Laryngeal papillomatosis.
- Chronic laryngitis.
- Laryngeal paralysis.
- Spasmodic dysphonia.
- Cancer of the vocal cords (throat cancer).
- Dysarthria, a weakness or paralysis of the vocal cords caused by damage to the nerves and/or to the brain, for example in cases of multiple sclerosis, stroke or Parkinson's disease.
- Aphasia, a loss of speech or a disorder of spoken and/or written language, for example in cases of Alzheimer's disease.
For some professions (such as speakers, coaches, teachers, presenters and singers), voice quality is fundamental.
For smokers, the detection of a persistent abnormality of the voice can enable the early detection of a serious illness such as lung cancer.
The value of the heart rate variability (HRV) has been demonstrated in the analysis of the recovery of athletes. The HRV is an excellent indicator of general health and a predictor of hypertension. A decrease in the spectral energy is a sign of a risk of cardiac events.
The measurement of dysphonia allows the detection and the effective monitoring of Parkinson's disease.
Alzheimer's disease is characterized by:
- The slowdown of the EEG (electroencephalogram), that is to say, a rise in the power of the magnitudes in the lower frequencies.
- The loss of complexity of the EEG signals.
- The loss of synchrony of the EEG signals.
From the EEG signals, the spectral parameters and the multichannel cross correlation coefficient make it possible to quantify these anomalies.
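As an illustration of the multichannel cross correlation coefficient, the sketch below uses the usual definition in which the squared MCCC equals one minus the determinant of the normalized correlation matrix of the channels. The function names are hypothetical, and computing it at zero lag over a whole frame is an assumption: the text does not say how the app windows or aligns the channels.

```python
import math

def correlation_matrix(channels):
    """Normalized (Pearson) correlation matrix of equal-length channels."""
    n = len(channels[0])
    centered = []
    for ch in channels:
        mean = sum(ch) / n
        centered.append([x - mean for x in ch])
    norms = [math.sqrt(sum(x * x for x in ch)) for ch in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1.0
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0                      # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det

def mccc(channels):
    """MCCC in [0, 1]: 1 for linearly dependent channels, 0 for uncorrelated ones."""
    squared = 1.0 - determinant(correlation_matrix(channels))
    return math.sqrt(max(0.0, min(1.0, squared)))
```

A loss of synchrony between EEG channels would show up as a decrease of this coefficient towards 0.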
More complex synchrony measures exist, such as:
- The phase synchrony.
- The Magnitude Squared Coherence (MSC).
- The Granger causality and the derived measures, including the Directed Transfer Function (DTF). These measures use specific areas of the frequency domain. The records can be used directly in the frequency domain, especially in a monitoring context.
Finally, it is worth pointing out that the power spectral analysis of the EEG signals is the most common tool used in sleep research.
App Interfaces
Examples of Results
Version = V7 (quasi-lossless in energy).
LTAS = Long-Term Average Spectrum.
HPR = High-Frequency Power Ratio, with:
(1) Low-Frequency range: 0 - 500 Hz, High-Frequency range: 500 - 4000 Hz.
(2) Low-Frequency range: 0 - 6000 Hz, High-Frequency range: 6000 - 20000 Hz.
Calculations with the PRAAT free software.
For the CPPS (Smoothed Cepstral Peak Prominence), PRAAT default values are used, with the quefrency averaging window equal to 0.032 second.
Original voices: 48 kHz sampling rate. Compressed voices: after compression and decompression by the VLC HQ 48 codec at 96000 bps.
| Parameter | Original Male Voice | Compressed Male Voice | Original Female Voice | Compressed Female Voice |
| LTAS (dB) | 70.064 | 70.047 | 79.958 | 79.939 |
| HPR (1) (dB) | -6.412 | -6.402 | -5.428 | -5.432 |
| HPR (2) (dB) | -28.849 | -29.787 | -27.527 | -28.133 |
| CPPS (dB) | 55.913 | 57.006 | 55.129 | 55.057 |
| Quefrency (Hz) | 331 | 331 | 331 | 331 |
Subjects
Voice Memos.
Audio Memos, Sound Memos.
Voice Quality Measure.
Voice Training.
Acoustic Parameters.
Long-Term Average Spectrum (LTAS).
High-Frequency Power Ratio (HPR).
Jitter.
Shimmer.
Harmonic to Noise Ratio (HNR).
Cepstral Peak Prominence (CPP).
Smoothed Cepstral Peak Prominence (CPPS).
Spectral Entropy.
Spectral Rolloff Point.
Spectral Centroid.
Multichannel Cross Correlation Coefficient (MCCC).
Center of Gravity (COG).
Average Absolute Deviation, Standard Deviation.
Skewness, Kurtosis.
Heart Rate Variability (HRV).
Heart Sounds or Heartbeat Sounds.
Lung Sounds, Breath Sounds or Respiratory Sounds.
Quantified Self, Self Training.
Spectral Mapping Sonification.
3D Audio, 3D Positional Audio.
HRTF (Head-Related Transfer Function).
Graphics Processing Unit (GPU).
Our codecs are based on FFT and can be accelerated with the GPU support