A Hybrid Accurate Alignment method for large Persian-English corpus construction based on statistical analysis and Lexicon/Persian Word net

Mohammad Bagher Dastgheib, Seyed Mostafa Fakhrahmad, Mansour Zolghadri Jahromi


A bilingual corpus is an important knowledge source and an essential requirement for many natural language processing (NLP) applications that involve two languages. For some languages, such as Persian, the lack of such resources is especially acute. Several applications, including statistical and example-based machine translation, need bilingual corpora in which large amounts of text from two languages have been aligned at the sentence or phrase level. To meet this requirement, this paper proposes an accurate hybrid sentence alignment method for constructing an English-Persian parallel corpus. As the first step, the proposed method uses statistical length-based analysis to filter candidates. Punctuation marks are used as a guiding feature to reduce complexity and increase accuracy. Finally, the method uses lexical knowledge to produce the final output. In the lexical analysis phase, a bilingual dictionary and a Persian semantic net (FarsNet) are used to calculate the extended semantic similarity. Experiments showed that expanding words to their synonyms through extended semantic similarity improves the accuracy of sentence alignment. The proposed matching scheme also uses a semantic-load-based approach, which treats the verb as the pivot and main part of a sentence, to increase accuracy. The experimental results were promising, and the generated parallel corpus can serve as an effective knowledge source for researchers working on the Persian language.
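
The first and last stages described above (length-based candidate filtering, then dictionary matching with synonym expansion) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy dictionary, the synonym groups standing in for FarsNet, the transliterated Persian words, the function names, and the length-ratio thresholds are all assumptions made for illustration, and the punctuation and verb-pivot features are omitted.

```python
import re

# Toy resources, invented for illustration only. The paper relies on a real
# bilingual dictionary and on FarsNet; neither is reproduced here.
TOY_DICTIONARY = {  # English word -> transliterated Persian translations
    "book": {"ketab"},
    "read": {"khandan"},
    "boy": {"pesar"},
}
TOY_SYNSETS = [  # hypothetical synonym groups standing in for FarsNet synsets
    {"khandan", "motalee"},
]


def tokens(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def length_ok(src, tgt, low=0.5, high=2.0):
    """Length-based filter: keep pairs with a plausible character-length ratio.

    The thresholds are assumed values, not taken from the paper.
    """
    if not src or not tgt:
        return False
    return low <= len(tgt) / len(src) <= high


def expand(words):
    """Add every synonym of the given words (synonym expansion)."""
    expanded = set(words)
    for synset in TOY_SYNSETS:
        if expanded & synset:
            expanded |= synset
    return expanded


def lexical_score(src, tgt):
    """Fraction of source tokens whose translation, or a synonym of it,
    appears in the target sentence."""
    src_tokens = tokens(src)
    tgt_tokens = set(tokens(tgt))
    if not src_tokens:
        return 0.0
    hits = sum(
        1
        for word in src_tokens
        if expand(TOY_DICTIONARY.get(word, set())) & tgt_tokens
    )
    return hits / len(src_tokens)


def best_match(src, candidates):
    """Filter candidates by length, then pick the best lexical match."""
    filtered = [c for c in candidates if length_ok(src, c)]
    if not filtered:
        return None
    return max(filtered, key=lambda c: lexical_score(src, c))
```

In this toy setup, the English word "read" maps only to "khandan" in the dictionary, so a Persian sentence that uses the synonym "motalee" is credited for that word only because of the expansion step. This mirrors, in a very simplified way, the effect of synonym expansion reported in the abstract.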


E-ISSN: 2008-8310

ISSN: 2008-8302