An Innovative Method for Building an Online Urdu Corpus
(اردو کارپس)
1. Summary
This proposal introduces an innovative method of creating an online Urdu corpus. It addresses the challenges that arise from the lack of technology that supports Urdu script and from the unavailability of online Urdu corpora. These challenges prevent researchers from conducting Urdu linguistic analysis with cutting-edge corpus tools. The proposed method enables researchers to build an online Urdu corpus easily and free of cost.
2. Introduction to the Problem
Urdu is a language with a rich cultural and linguistic heritage, but it faces significant challenges in the field of corpus linguistics. Its script differs from that of other modern languages such as English, French, and German, and suitable tools are scarce, which makes its corpora difficult to build. The few Urdu corpora available online are often outdated, limited in size, paid, and incompatible with cutting-edge corpus tools. Despite the growing importance of Urdu in domains such as literature, media, and academia, the lack of updated and sizable Urdu corpora keeps researchers from carrying out corpus analysis with these tools. Additionally, building an Urdu corpus requires the text in a Word document, yet most Urdu text found online is in other formats such as PDF. Converting Urdu PDF files into Word documents with online file converters poses an additional challenge because the converters do not support Urdu script.
3. Problem Statement
The lack of online Urdu corpora and of an effective method to build them prevents researchers from benefiting from cutting-edge corpus tools and novel methodologies when conducting research on the Urdu language. Converting Urdu PDF text to an editable format is also a major challenge that makes it difficult to build an Urdu corpus. Therefore, it is important to find an innovative approach that overcomes these challenges and facilitates the building of Urdu corpora.
4. Objectives
The primary objective of this proposal is to introduce an innovative approach to building an online Urdu corpus that addresses the challenges the Urdu language faces in corpus analysis. The proposed method aims to streamline the Urdu corpus building process, making it feasible, easier, and accessible to linguists, researchers, and language enthusiasts.
5. Methodology/ Project Detail
5.1 Procedure
The method proposed here took its final shape after numerous careful steps. The first challenge was to convert Urdu text in PDF into an editable format, so that the text would be machine readable and the file could be cleaned of non-essentials. I tried more than 20 commonly used online file converter tools; however, none of them supports Urdu script, and every conversion I attempted produced garbled, unreadable text. Some paid converters claim to convert Urdu PDF to Word documents; however, when I tried their trial versions, the text was still garbled. After a lot of effort and brainstorming, I found an alternative way to address this issue, which is described in the following section. Secondly, when cleaning the text in a Word document, it takes a lot of time to remove the line breaks and spaces. However, I found a website that addresses this issue. Thus, this innovative method of building an Urdu corpus was developed following the steps below:
5.2 Step-by-Step Guide to Building an Urdu Corpus
Step 1
First of all, download all the PDF files from which you want to build the Urdu corpus, such as books and research articles. After that, convert the PDF files to JPG files with the following online file converter tool:
https://www.ilovepdf.com/pdf_to_jpg
Select the PDF files one by one and convert each to JPG.
When a PDF file has been converted into JPG images, download the images into a new folder.
The download arrives as a ZIP file. Extract the ZIP file into the same folder, and you will find one image per page, named with its page number.
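For readers who prefer to script the extraction, the following is a minimal Python sketch; it is not part of the original procedure. It assumes that the downloaded ZIP file contains JPG files whose names end in a page number. The function name `extract_page_images` and any file names are hypothetical.

```python
import re
import zipfile
from pathlib import Path


def extract_page_images(zip_path, out_dir):
    """Extract a ZIP of page images and return the JPG paths in page order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Unpack every file in the archive into the output folder.
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)

    def page_number(path):
        # Assumption: the last run of digits in the file name is the page number.
        digits = re.findall(r"\d+", path.stem)
        return int(digits[-1]) if digits else 0

    # Keep only JPG images and sort them numerically, so page 2 comes before page 10.
    images = [p for p in out_dir.rglob("*") if p.suffix.lower() in {".jpg", ".jpeg"}]
    return sorted(images, key=page_number)
```

Calling `extract_page_images("book.zip", "book_images")` would return the page images in reading order, ready to be scanned one by one in Step 2.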
Step 2
In the second step, use Google Lens to scan the images one by one.
Upload each image, or drag it into Google Lens.
After that, select the recognized text, copy it, and paste it into a Word document.
Step 3
In the third step, manually clean the file of non-essentials, including headings and sub-headings, page numbers, tables, and figures.
After that, to remove the line breaks, spaces between paragraphs, and indentations, use the online tool RemoveLineBreak: https://removelinebreaks.net/
Cut the text from the Word document and paste it into the box, or upload the file, and then click the convert button.
Copy the converted text to the clipboard and paste it into the Word document again.
The text in the Word file is now in its final, cleaned form.
Now, note down the word count.
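The line-break removal and word count can also be approximated offline. The sketch below is only an illustration in Python and does not reproduce the exact behaviour of the online tool or of Word. It assumes that paragraphs are separated by blank lines and that words are separated by whitespace, so its count may differ slightly from the one Word reports. The function names are hypothetical.

```python
import re


def remove_line_breaks(raw):
    """Join the broken lines of each paragraph and drop indentation and blank lines."""
    # Assumption: one or more blank lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", raw.replace("\r\n", "\n"))
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with a single space removes line breaks, tabs, and extra spaces.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Write one paragraph per line and skip empty paragraphs.
    return "\n".join(p for p in cleaned if p)


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())
```

Recording the value of `word_count` for each file alongside its text ID makes it easy to report the total size of the corpus later.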
After that, convert the Word file into a text file: click the File button at the top of the Word document and choose Save As, or simply press Ctrl+S if the document has not been saved before. Then give the file a code or text ID of any range; for example, if it is a research article, give it a code such as RA-001.
After that, open the file type drop-down list and choose Plain Text.
Click the Save button.
After you click Save, a window will pop up. Choose the Other encoding option in its settings.
Select Unicode (UTF-8) and click the OK button.
The TXT file is now ready.
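If the cleaned text is already available as a string, the same result (a plain text file in UTF-8 named after its text ID) can be produced with a short script. This is a hedged Python sketch, not the procedure described above. The helper names, the default `corpus` folder, and the three-digit numbering are assumptions modelled on the RA-001 example.

```python
from pathlib import Path


def make_text_id(prefix, number, width=3):
    """Build a text ID such as RA-001 from a genre prefix and a running number."""
    return f"{prefix}-{number:0{width}d}"


def save_corpus_text(text, text_id, out_dir="corpus"):
    """Write the text to <out_dir>/<text_id>.txt encoded as UTF-8 and return the path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{text_id}.txt"
    # Passing the encoding explicitly avoids the platform default,
    # which may not be able to represent Urdu script.
    path.write_text(text, encoding="utf-8")
    return path
```

For example, `save_corpus_text(cleaned_text, make_text_id("RA", 1))` would write the file `corpus/RA-001.txt`, where `cleaned_text` stands for the text prepared in the earlier steps.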
6. Application and Contributions
The proposed method for building an online Urdu corpus has the potential to advance Urdu linguistics and linguistic research as a whole. It gives linguists and researchers a way to build a sizable Urdu corpus from any type of text, across different genres, that best suits their research objectives. Following this method, researchers can easily build corpora of their own to analyse linguistic patterns, study language variation and evolution, explore sociolinguistic phenomena, and develop computational models for Urdu language processing. In conclusion, this method for online Urdu corpus building marks a significant step in the advancement of Urdu linguistics and linguistic research methodologies. It will also promote inclusivity and diversity in Urdu linguistic studies, amplifying voices and perspectives that may previously have been marginalized in scholarly discourse.