E-Hafiz: Intelligent System to Help Muslims in Recitation and Memorization of Quran

Abstract: Recitation of the Holy book of Muslims, the Holy Quran, is a religious duty and hence is done with utmost care so that no mistakes are made while reading it. These mistakes include wrong utterance of words, misreading, and punctuation and pronunciation errors. Believers of Islam are spread all over the world, so accents can also differ. To avoid this, Tajweed rules are applied to ensure that utterance follows a fixed set of rules, so that there is no variance in the recitation of the Holy book between reciters. Going further, people are encouraged to memorize the whole book; a person who does so is called a Hafiz. Knowing the whole book by heart, down to every word and Tajweed rule, a Hafiz can guide other learners by listening to their recitation and correcting it. But the availability of a Hafiz can be a problem where Islam is not the dominant religion, and the competency and level of expertise of a Hafiz are of critical importance. To get around this problem we have designed and developed the E-Hafiz system. It is based on the idea that Tajweed rules can be used to train learners to recite the Quran. To achieve this we use the Mel-Frequency Cepstral Coefficient (MFCC) technique: the features of recorded voices are extracted using MFCC and compared with experts' voices stored in a database. Any mismatch at the word level is pointed out, and the user is asked to correct it.

[Aslam Muhammad, Zia ul Qayyum, Waqar Mirza M. Saad Tanveer, Martinez-Enriquez A.M., Afraz Z. Syed. E-Hafiz: Intelligent System to Help Muslims in Recitation and Memorization of Quran. Life Science Journal. 2012;9(1):534-541] (ISSN: 1097-8135).

Keywords: The Holy Quran, Islam, Muslims, Tajweed, Hafiz, Voice recognition, feature extraction, MFCC.

1. Introduction

Islam is the second largest religion by world population, yet its Holy book is in Arabic and only 3.12% of the world's population speaks Arabic (https://www.cia.gov). As of 2009, Islam had 1.57 billion adherents in 200 countries (http://pewforum.org). Such diverse geographical distribution means that most of the Muslim population does not have Arabic as a primary language. Despite the resulting difficulties in understanding and reading Arabic, the religion binds Muslims strongly to the language, since their Holy book is written in it. The Holy Quran also gives insight into science, engineering, social sciences, law, and management (Abdul Rashid Sheikh, 2000), which encourages people to read and understand it for both religious and scientific purposes.

The Holy Quran was compiled as a book nearly 1400 years ago. Since then Arabic, like all other languages, has evolved and undergone changes, so reading its text is not the same as reading the Arabic we know today. To overcome this barrier, rules were introduced to ensure correct reading of the Holy book. These rules have come to be known as Tajweed rules (H. Tabbal et al., 2006; M.S. Bashir et al.). Recitation of the Holy Quran under Tajweed rules is an art, and reciters follow the rules to make their recitation attractive (H. Tabbal et al., 2006). Someone who wants to learn Tajweed must consult an expert Hafiz at a learning institute. Learning is done manually: the student and the Hafiz sit face to face, the learner recites, and the expert points out and corrects any mistakes.
Presently, institutes and websites exist that teach recitation by this manual method. The main problem is the accessibility and availability of an expert Hafiz. In countries where Islam is not the dominant religion, a Hafiz can be hard to come by. In addition, a Hafiz's level of expertise is a serious issue: being human, a Hafiz can also make mistakes while listening, so not every Hafiz can be a tutor. The goal of developing the E-Hafiz system is to facilitate learning and memorizing the Holy Quran, to minimize errors and mistakes of all kinds, and to systematize the recitation process. Using this system, any reciter can practice recitation skills at any place and at any time; the presence of an expert Hafiz is not needed. The system also helps Hafizes prepare recitations for the five daily prayers and the Taraweeh prayer in the month of Ramadan.

Many audio-enabled applications on the market offer the Holy Quran as audio streams. One of the most popular and commonly used is the Quran Auto Reciter (QAR) (http://www.searchtruth.com/download.php). QAR provides a user interface where users select the verse they intend to listen to; the verse can be played, stopped, and paused, and the text is highlighted as the words are played. QAR can help in learning the Quran, improving recitation, and picking up some basics. But it cannot judge the user's accuracy and performance: there is no way for QAR to indicate or detect the mistakes a user makes. If the software could detect and correct mistakes, learning and expertise could improve substantially. The learning process through QAR is thus one-sided: a user can listen to a desired Surah many times to improve his recitation ability, but he cannot learn about any mistakes he makes during his own recitation. To know whether he recites the Holy Quran correctly, he must still consult a Hafiz or Quran expert, so reliance on another person remains. Hence this system fulfills the required objective only in a limited way.

(Hassan Tabbal et al., 2006) introduced an automated delimiter that can extract verses from an audio file using the open source Sphinx framework and speech recognition techniques. The processing involves two models, acoustic and language, which take part, together with the feature vectors, in generating the search space of HMM nodes. Acoustic Model: a set of phoneme symbols was used to train the HMM states corresponding to the acoustic model. To generate the acoustic model for the application, roughly one hour of audio recitations of Surah Al-Ikhlass (The Holy Quran) by different reciters, along with the corresponding dictionary mapping, was fed to the sphinxTrain application, which generated an acoustic representation of each word. Language Model: it was not easy to choose a language model for the Holy Quran, given the high precision and accuracy required for recognition. For this system, a language model based on the Java Speech Grammar Format (JSGF) specification was chosen, as it is compatible with the Sphinx framework and fulfills all the requirements of the system. The JSGF rules used are similar to those used in conversational systems and were generated so as to reflect the structure of the Surah.
The Sphinx framework provides the core recognition process using the appropriate language and acoustic models. For this purpose, the framework is configured through an XML-based configuration file that specifies the feature extraction algorithms and all other aspects needed by a speech recognition system. The system design is divided into two sub-phases.

Data Preparation: a frame of 10 ms and a threshold of 10 dB were selected for the speech segment extractor. To make the recognition ratio more accurate, a 2-stage pre-emphasis filter with factor values 0.92 and 0.97 is used. A raised cosine window with 512-point FFT analysis is used, and a Mel filter bank followed by a Discrete Cosine Transformation extracts the MFCC features. MFCC transforms the frequency from a linear scale to a non-linear one, which this system achieves with a set of 30 triangular Mel filters. Finally, to reduce the distortion effects produced by the microphone, the Cepstral Mean Normalization (CMN) operation is performed: CMN subtracts the mean vector from each feature vector.
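As a concrete illustration of the CMN step just mentioned, here is a minimal sketch in Python/numpy (our language choice for illustration; the array shapes are hypothetical). Subtracting the per-utterance mean vector removes channel effects, such as microphone coloring, that are constant across the recording and appear as an additive offset in the cepstral domain.

```python
import numpy as np

def cepstral_mean_normalization(mfcc: np.ndarray) -> np.ndarray:
    """Subtract the per-utterance mean vector from every frame.

    mfcc: array of shape (num_frames, num_coefficients).
    Channel effects constant across the utterance show up as an
    additive offset in the cepstral domain; subtracting the mean
    vector removes them.
    """
    mean_vector = mfcc.mean(axis=0)   # one mean per coefficient
    return mfcc - mean_vector         # broadcast over all frames

# Hypothetical usage: 100 frames of 12 MFCC coefficients
features = np.random.randn(100, 12)
normalized = cepstral_mean_normalization(features)
print(normalized.mean(axis=0))        # ~0 for every coefficient
```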
System Settings: the front-end output is fed to the Sphinx core recognizer, which uses HMMs as its recognition tool. To translate the recognizer's results into common Arabic, a hash map is used. The decoder uses breadth-first search combined with beam search to match against the words obtained from the recognizer; when the words match the stored words, the audio verse corresponding to the obtained combination is extracted.

As we have seen, this application can help users search for a required verse in audio files, but unfortunately it is not useful for users who want to learn recitation: it serves those who already know how to recite well, not those who do not know how to read the Holy Quran. Fortunately, the application helped our research considerably, since it also uses MFCC feature extraction in its implementation; studying it helped us establish our ideas and solve problems we faced.

(Zaidi Razzak et al., 2008) wrote a review paper presenting the techniques used in Quranic Arabic verse recitation recognition and comparing their advantages and disadvantages. The objective of that research was to find the most effective and efficient technique of Quranic Arabic verse recitation recognition for a system supporting the j-QAF learning process. According to the paper, the recitation recognition process is commonly divided into pre-processing, feature extraction, training and testing, feature classification, and pattern recognition.

Preprocessing: the information is organized to simplify the recognition task. Three steps are performed: End Point Detection specifies the start and end points of recorded words; Smoothing reduces noise in the speech; Channel Normalization is used to train a recognizer with recorded speech, since recognition depends on speech recorded from different microphones.

Feature Extraction: to differentiate words, unique, discriminative, and computationally efficient features are extracted from the speech signal. Four techniques are treated: (a) Linear Predictive Coding (LPC), which is not considered a good method, since LPC degrades high- and low-order cepstral coefficients into noise when the coefficients are transformed into cepstral coefficients; (b) Perceptual Linear Prediction (PLP), which is better than LPC, since in PLP the spectral features remain smooth within the frequency band and the spectral scale is the non-linear Bark scale; (c) Mel-Frequency Cepstral Coefficient (MFCC), based on the Mel frequency scale of the human ear; MFCC is considered the best technique because the behavior of the acoustic system remains unchanged when the frequency is transferred from a linear to a non-linear scale; (d) Spectrographic Analysis, used for Arabic phoneme identification: Arabic phonemes are identified in spectrograms by their distinct bands.

Training and Testing: a speech sample is enrolled in the system database after constructing a model from the features extracted from the speech. Testing determines the similarity between the score of a newly spoken word and the speech stored in the database. Three training and testing methods are discussed: (a) Hidden Markov Model (HMM), in which each word is trained independently to obtain the best likelihood parameters, with several utterances used to train each model; (b) Artificial Neural Network (ANN), a mathematical model that recognizes speech much as a person does, by visualizing, analyzing, and characterizing the speech in terms of its acoustic features; compared with HMM, ANN is less well equipped to solve such problems; (c) Vector Quantization (VQ), which uses a set of fixed prototype vectors called a codebook, each vector in the codebook being a codeword. Quantization is performed by matching the input vector against each codeword using a distortion measure.
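To make the VQ matching step concrete, here is a small hedged sketch in Python/numpy; the codebook size and dimensionality are made up for illustration, and squared Euclidean distance stands in for the distortion measure, which the review does not specify.

```python
import numpy as np

def quantize(vector: np.ndarray, codebook: np.ndarray) -> tuple[int, float]:
    """Match one feature vector against every codeword.

    codebook: shape (num_codewords, dim). The distortion measure here
    is squared Euclidean distance; the best codeword is the nearest one.
    """
    distortions = np.sum((codebook - vector) ** 2, axis=1)
    best = int(np.argmin(distortions))
    return best, float(distortions[best])

# Hypothetical 4-codeword codebook in a 12-dimensional MFCC space
codebook = np.random.randn(4, 12)
index, distortion = quantize(np.random.randn(12), codebook)
print(f"nearest codeword: {index}, distortion: {distortion:.3f}")
```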
Feature Classification and Pattern Recognition: this stage classifies the object of interest into classes; the goal is to map patterns and classes onto individual words. Three methods serve this purpose: HMM, VQ, and ANN. The process is also referred to as feature matching. The authors recommend MFCC as the best approach for feature extraction, and HMM or VQ for training and testing: HMM when Arabic language recognition is to be performed, VQ for English.

Among the many methods discussed by (Zaidi Razzak et al., 2008) for extracting features from speech, the suggested system prescribes rules and regulations to be followed during recitation. Again, the novice learner is ignored: the system is useful for people who already know the correct pronunciation and the rules of the Holy Quran, but it is not suitable for non-Arabic speakers. What is needed is a system that helps naïve learners recite the Holy book while also being effective for expert users learning Tajweed rules, pointing out mistakes made during recitation; these are the tasks achieved by our E-Hafiz system.

We designed, implemented, and tested the E-Hafiz application, which helps learners like an expert Hafiz does. Speech signals are obtained when a person speaks into a microphone. By means of the Mel-Frequency Cepstral Coefficient (MFCC) transformation (Noor Jamaliah Ibrahim et al., 2008), voice features are extracted from the pre-emphasized signal for further processing. The MFCC technique produces remarkable results because it emulates the behavior of the auditory system: the linear frequency scale is transformed into a non-linear one. MFCC is implemented using the MATLAB framework.

The extracted features are used to form a model of the speech using Vector Quantization (VQ) and are stored in the database, which also contains a large number of speech vectors obtained from different Quranic verses passed through the same process. A speech vector is essentially an array of MFCC features. When a user utters a verse, it is compared with the stored verses; words that do not match any registered one are considered mistakes and pointed out to the user.

2. Material and Methods

Voice content matching is the process of comparing the voice content of a speaker with the relevant voice contents stored in the system's database and making a decision on the basis of this comparison. At an abstract level, a content matching system has two phases: a training phase and a testing phase. During the training phase the system is trained with the experts' voices; during the testing phase a user records his voice, which is matched against the experts' voices to generate results.

If we analyze the teaching methods of existing organizations and institutes, we find that most Islamic institutes follow the manual method of teaching recitation. In the manual process, teacher and student sit in front of each other and the student recites; whenever the student makes a mistake, the teacher points it out and corrects it, like a real-time system. The developed system does not work in this real-time mode, but it helps the user learn his mistakes after he completes his recitation. The reason for developing this non-real-time (offline) system is that, at this initial stage, some issues cannot be resolved during the user's recitation, such as removing silence from the voice and filtering the signal.

Figure 1 shows the core architecture of the E-Hafiz system. MFCC feature extraction is used to obtain the feature vectors of specific verses read by experts, and these are stored in the system's database. To test the performance of the system, voice samples of the same verses read by ordinary persons are taken, and their features are extracted using the same MFCC technique. These vectors are then compared with the vectors stored in the database to detect mistakes, if any exist. The current system detects mistakes at the word level, since the back-end database is built at the word level; a word extraction module extracts the words from the recorded audio stream. Whenever a mistake is found, the system gives the user the option to listen to the verse again, as a Hafiz does in real life. The user listens to the verse again to comprehend the right pronunciation and then reads it again until he passes the minimal criteria. These criteria are more flexible, especially regarding Tajweed rules, for beginners, and stricter for experts. The E-Hafiz dataset comprises 10 experts and covers the first 5 Surahs of the Holy Quran.

Figure 1. Architectural model of E-Hafiz.

The principal phases in the E-Hafiz architecture are data preparation, feature extraction, and modeling, storing, and comparison.

Data Preparation: the raw data (the input speech signal) contains silences and noise, which are filtered out to avoid errors that could occur during processing and disturb the accuracy of the results.
Data preparation is subdivided into three steps.

Silence Trimming: the first verse of Surah Al-Fatiha (Surah 1, verse 1) (The Holy Quran) is uttered as input audio; the recorded audio signal of this verse is shown in figure 2. Recordings very often contain long or short silence gaps at the start and end, so a module trims these gaps, giving a content-rich stream for processing. To remove silence, the short-term energy method (Mark Greenwood et al., 1999) is used: the energy of the speech signal can be calculated at any instant of time, so the energy of each frame is computed and all frames whose energy is close to zero are removed.

Figure 2. Audio signal of the 1st verse of Surah "Al-Fatiha"

Word Extraction: the next module separates the words in the audio stream on the basis of a silence threshold, enabling the system to store every uttered word separately in the database. This is done so that errors can be matched at the word level: the comparison is performed not on the whole audio stream but on the individual words that were uttered.

Figure 3. Audio signal of the 1st verse of Surah "Al-Fatiha" after extracting words

Word detection is also performed with the short-term energy method (Mark Greenwood et al., 1999). The energy of each frame is calculated; whenever a run of 16 frames with near-zero energy is found, it is considered the end of a word, and all subsequent zero-energy frames are removed. Moving forward, when a set of 16 frames with energy greater than the threshold is found, it marks the start of the next word, and all subsequent frames are included until zero-energy frames are found again. This process continues to the end of the speech, finally yielding all the words in the utterance (see Figure 3). Once the words are extracted, each of them undergoes the following steps.

Pre-Emphasis: after extracting all available words, the next step of data preparation is to pre-emphasize each sub-signal (word), raising the magnitude of the higher frequencies relative to the lower ones in order to improve the signal-to-noise ratio. Echoes lying in the signal are also eliminated, which is why this step is also known as a noise-canceling filter. Pre-emphasis is performed by applying a first-order Finite Impulse Response (FIR) filter to the digitized signal:

$H(z) = 1 - \alpha z^{-1}$  (1)

where α is the pre-emphasis parameter, with a value close to 1 (0.935 in our case), which amplifies the high-frequency spectrum by more than 18 dB.
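A hedged sketch of these three data-preparation steps in Python/numpy follows (the paper's own implementation was in MATLAB). The frame length and the near-zero energy cutoff are illustrative assumptions; only the 16-frame word-boundary rule and the pre-emphasis factor α = 0.935 come from the text.

```python
import numpy as np

def frame_energies(signal: np.ndarray, frame_len: int) -> np.ndarray:
    """Short-term energy: sum of squared samples per frame."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames ** 2, axis=1)

def trim_silence(signal: np.ndarray, frame_len: int = 256,
                 eps: float = 1e-4) -> np.ndarray:
    """Drop leading/trailing frames whose energy is near zero."""
    energies = frame_energies(signal, frame_len)
    active = np.where(energies > eps)[0]
    if len(active) == 0:
        return signal[:0]
    return signal[active[0] * frame_len:(active[-1] + 1) * frame_len]

def split_words(signal: np.ndarray, frame_len: int = 256,
                eps: float = 1e-4, gap_frames: int = 16) -> list:
    """Cut the stream at runs of >= 16 near-zero-energy frames,
    mirroring the paper's word-boundary rule."""
    energies = frame_energies(signal, frame_len)
    words, start, silent = [], None, 0
    for i, e in enumerate(energies):
        if e > eps:
            if start is None:
                start = i
            silent = 0
        elif start is not None:
            silent += 1
            if silent >= gap_frames:   # word boundary found
                words.append(signal[start * frame_len:
                                    (i - silent + 1) * frame_len])
                start, silent = None, 0
    if start is not None:
        words.append(signal[start * frame_len:])
    return words

def pre_emphasize(signal: np.ndarray, alpha: float = 0.935) -> np.ndarray:
    """First-order FIR filter H(z) = 1 - alpha * z^-1, equation (1)."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```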
Feature Extraction: voice features are extracted by means of the Mel-Frequency Cepstral Coefficient (MFCC) transformation (Noor Jamaliah Ibrahim et al., 2008) for further analysis. MFCC is one of the best feature extraction techniques and produces remarkable results in voice content matching systems because it simulates the behavior of the human ear, using the Mel frequency scale. The technique is applied to each extracted word of the speech, one after another, to obtain its codebook. MFCC feature extraction consists of seven major components.

Pre-Processing: the data is prepared and made ready to operate on. Since the necessary initial processing was already performed during the data preparation phase described above, no further operation is needed here.

Framing: each word's signal is segmented into 23 ms frames to convert the non-stationary signal into quasi-stationary form. The frames are blocked with overlap: every frame contains the last 11.5 ms of the previous frame's data. Overlapping reduces the chance of losing information lying at the ends of frames, which could be destroyed when the speech is segmented.

Windowing: each frame is multiplied by a Hamming window. Windowing shrinks the signal values toward zero at the boundaries of each frame and hence reduces discontinuities. The Hamming window is given by:

$w(n) = 0.54 - 0.46\cos(2\pi n/(N-1))$  (2)

where N is the total number of samples in each frame and n ranges from 0 to N-1. Windowing is performed by multiplying each sample of a frame by the corresponding element of the Hamming window.

Discrete Fourier Transformation: to transfer each windowed frame from the time domain to the frequency domain, the Discrete Fourier Transformation is applied using the FFT algorithm. The windowed signal from the previous step is the input to the DFT, and the output is a set of complex numbers, one per frequency band (0 to N-1), giving the magnitude and phase of that frequency component in the original signal. The DFT is given by:

$Y_2(n) = \sum_{k=0}^{N-1} Y_1(k)\, e^{-j 2\pi nk/N}$  (3)

where k = 0, 1, 2, ..., N-1 and $Y_2(n)$ is the Fourier transform of $Y_1(k)$.

Mel Filter Bank: low frequencies in a speech signal often carry more useful and important information than higher ones, so to emphasize the low-frequency components the Mel scale is applied. The Mel value for a frequency f in Hz is calculated as:

$\mathrm{Mel}(f) = 2595 \log_{10}(1 + f/700)$  (4)

Logarithm: taking the logarithm turns the multiplicative effects in the magnitude of the Fourier transform into additive ones. The natural log of the Mel-filtered speech segments is taken (the MATLAB command log), which also reduces the magnitude of the Mel filter bank values.

Inverse Discrete Fourier Transformation: the IDFT converts the speech signal back from the frequency domain to the time domain; applied to the logged Mel energies it takes the form of a cosine transform:

$c(n) = \sum_{k=0}^{N-1} x(k) \cos(\pi n (k + 0.5)/N), \quad n = 0, 1, \ldots, L-1$  (5)

where x(k) is the logged value of each Mel-filtered speech segment from the previous step, and L is the required number of Mel cepstral coefficients taken from the N filter taps of each frame; in our case L is 12.
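Putting the seven components together, a hedged numpy reconstruction of the word-level MFCC pipeline is sketched below. The 23 ms frames, 11.5 ms overlap, Hamming window, Mel formula, and L = 12 follow the text above; the sample rate, the 512-point FFT, and the 30 triangular filters are assumptions carried over from the Sphinx configuration described earlier, since the paper does not restate them here (the authors' own implementation was in MATLAB).

```python
import numpy as np

def hz_to_mel(f):
    # Mel(f) = 2595 * log10(1 + f / 700), equation (4)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters equally spaced on the Mel scale."""
    mel_points = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                 # rising edge
            bank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling edge
            bank[i - 1, k] = (right - k) / max(right - center, 1)
    return bank

def mfcc(word: np.ndarray, sample_rate: int = 16000,
         n_filters: int = 30, n_coeffs: int = 12) -> np.ndarray:
    frame_len = int(0.023 * sample_rate)              # 23 ms frames
    hop = frame_len // 2                              # 11.5 ms overlap
    window = np.hamming(frame_len)                    # equation (2)
    fbank = mel_filterbank(n_filters, 512, sample_rate)
    n = np.arange(n_filters)
    coeffs = []
    for start in range(0, len(word) - frame_len + 1, hop):
        frame = word[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame, n=512)) ** 2   # eq. (3), FFT
        log_mel = np.log(fbank @ spectrum + 1e-10)          # logarithm step
        # Final step, equation (5): cosine transform of the logged
        # Mel energies, keeping the first L = 12 coefficients.
        dct = np.array([np.sum(log_mel * np.cos(np.pi * j * (n + 0.5)
                                                / n_filters))
                        for j in range(n_coeffs)])
        coeffs.append(dct)
    return np.array(coeffs)                           # shape: (frames, 12)
```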
Modeling, Storing and Comparison of Codebooks: the output of the second phase is used to generate a model of the speech (a feature codebook), which is stored in the database. The feature vector is an array of MFCC features. Since the speech has a large number of frames, each with 12 feature values, it is not practical to use all these feature vectors to form a codebook. The number of feature vectors is therefore reduced to a set of highly representative vectors using the Vector Quantization (VQ) technique (R. M. Gray et al., 1984). Vector Quantization is a data compression technique in which probability density functions are modeled by the distribution of prototype vectors: a large set of similar vectors is grouped together, and each group is represented by its centroid, obtained through a clustering algorithm. In this system the LBG algorithm is used to implement VQ. LBG works by clustering similar vectors and finding a single representative value (centroid) for each cluster.

This centroid value is also called a code vector, and the collection of all code vectors corresponding to a specific voice is called a codebook. To generate the codebook, the mean of all feature vectors is calculated first. The mean of a set of K vectors is:

$M = \frac{1}{K}\sum_{i=1}^{K} x_i$  (6)

This single mean value represents the whole data. The mean is then split into two means using a very small positive number ε (epsilon):

$M_1 = M(1 + \epsilon)$  (7)

$M_2 = M(1 - \epsilon)$  (8)

Figure 4. Process of VQ codebook generation

Using these two means, two clusters of features are created: Cluster 1 contains all feature vectors with values close to $M_1$, and Cluster 2 all those close to $M_2$. To assign vectors to clusters, the distance between each feature vector and the mean values is calculated with the Euclidean distance formula. These two clusters are then further divided into four clusters by the same method, and the iterative process continues until 32 clusters are created, as shown in figure 4. Each cluster has a mean value, called the code vector of that cluster, and the set of code vectors over all clusters is the codebook.

For testing the system, an interface lets the user select the Surah and verse he or she intends to recite, as shown in figure 5(a). The system also offers a choice of expertise level, which determines how many experts' voices the user's voice is compared with; for a beginner, the voice is compared with all 10 experts' voices.

Figure 5(a). Main screen of E-Hafiz application

When the user utters a verse, it is compared with the stored verses read by a number of different experts, the number depending on the difficulty level selected. The comparison is performed at the word level; any word that does not match a registered one is considered a mistake and pointed out to the user. The whole process is as follows (a code sketch of the codebook generation and the distance comparison in steps 6 and 7 is given after the list):

Figure 5(b). Result screen of E-Hafiz application

1. Get the utterance of a selected verse recited by the user.
2. Extract the words from the voice sample taken.
3. Extract the features of each word using the MFCC technique discussed above.
4. Generate the codebook of each word, forming an array that represents the whole verse.
5. From the system's database, extract the codebook arrays of the same verse recited by the experts.
6. For each word of an expert's codebook array and the user's codebook array, compute their averages and calculate the distance between them.
7. Compare the resulting distance with a threshold value (dependent on the user's skill level; 2.6 for beginners). If the distance is less than the threshold, the word is considered matched; otherwise it is a mismatch.
8. If all the words of one expert's array match the user's array, and the number of words in both arrays and their sequence are the same, one match is counted.
9. For beginners, at least 3 matches must be found; otherwise the utterance is considered wrong and pointed out to the user.
10. In the case of a wrong utterance, the word with the most mismatches is considered the wrongly uttered word and is highlighted in the result screen, as shown in figure 5(b). The result screen gives the user the option to listen to the verse in an expert's voice, to find where he misread the word, and then to try again. The user repeats this process until he recites the whole verse correctly.
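The following hedged Python/numpy sketch covers the LBG codebook generation of equations (6)-(8) and the average-distance comparison of steps 6 and 7 above. The codebook size of 32 and the beginner threshold of 2.6 come from the text; ε, the refinement loop count, and the sample shapes are illustrative choices of ours.

```python
import numpy as np

def lbg_codebook(vectors: np.ndarray, size: int = 32,
                 epsilon: float = 0.01, iters: int = 10) -> np.ndarray:
    """LBG: split the mean with (1 +/- epsilon), equations (7)-(8),
    then refine by nearest-centroid clustering until 32 code vectors."""
    codebook = vectors.mean(axis=0, keepdims=True)       # equation (6)
    while len(codebook) < size:
        codebook = np.vstack([codebook * (1 + epsilon),  # M1
                              codebook * (1 - epsilon)]) # M2
        for _ in range(iters):                           # refinement passes
            dists = np.linalg.norm(vectors[:, None, :] - codebook[None],
                                   axis=2)               # Euclidean distance
            nearest = dists.argmin(axis=1)
            for j in range(len(codebook)):
                members = vectors[nearest == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)   # new centroid
    return codebook

def words_match(user_book: np.ndarray, expert_book: np.ndarray,
                threshold: float = 2.6) -> bool:
    """Steps 6-7: compare the averages of the two codebooks."""
    distance = np.linalg.norm(user_book.mean(axis=0)
                              - expert_book.mean(axis=0))
    return distance < threshold

# Hypothetical usage: 200 frames of 12 MFCC coefficients per word
user = lbg_codebook(np.random.randn(200, 12))
expert = lbg_codebook(np.random.randn(200, 12))
print(words_match(user, expert))
```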
3. Results

For experimentation, three groups of reciters were chosen: men, women, and children. Each reciter was asked to read specific verses of the Holy Quran, and the recitations were tested against the experts' recitations. All experiments were performed in the presence of an expert Hafiz, since an expert Hafiz knows best whether a user recites correctly or not. When a candidate recited a verse, the expert listened to the utterance alongside E-Hafiz; the expert's judgment was taken as the true result for that utterance. The results generated by E-Hafiz were compared with the expert's results, which gives the accuracy rate of E-Hafiz, i.e. how many of the results generated by E-Hafiz are correct.

Figure 6. The accuracy rate of E-Hafiz against each candidate

The accuracy rate of E-Hafiz for each candidate is calculated with the following formula:

$\text{Accuracy Rate} = \frac{\text{No. of correct identifications by E-Hafiz}}{\text{Total no. of verses}} \times 100$  (9)

The accuracy rate of each candidate is shown in figure 6, in which the x-axis holds the candidate IDs and the y-axis the results. The global mean for each type of candidate is calculated by dividing the sum of all results of that type by the number of candidates in the group, as shown in table 1. These results inspire and encourage us to improve this system further and to enhance its performance well beyond the current level.

Table 1. Accuracy evaluation of E-Hafiz

Type of Reciters    Number of Reciters    Accuracy Rate
Men                 10                    92%
Children            10                    90%
Women               10                    86%

4. Discussions

The system solves the substantial problem of arranging a Hafiz for learning the Quran, and of learning in fear of mistakes where a Hafiz cannot be arranged. Users of the system can find their mistakes and enhance their recitation skills. After meticulous testing of the system on many verses and test subjects, the results obtained encourage us to develop the system further. With the addition of the word extraction feature, E-Hafiz not only handles whole verses but can also identify mistakes at the word level. Our next endeavor is to make the system capable of identifying the notes that are uttered, taking recognition down to the letter level, where even a mispronounced letter can be identified; furthermore, work is underway on model assistance that guides the user in pronouncing a word according to the rules of phonetics.