An Innovative Method for Building an Online Urdu Corpus (اردو کارپس)

  • Nasir Hussain

1.      Summary

This proposal introduces an innovative method for creating an online Urdu corpus. It addresses the challenges that arise from limited technological support for the Urdu script and from the unavailability of online Urdu corpora. These challenges prevent researchers from conducting Urdu linguistic analysis with cutting-edge corpus tools. The method proposed here enables researchers to build an online Urdu corpus easily and free of cost.

2.      Introduction to the Problem

Urdu is a language with a rich cultural and linguistic heritage, but it faces significant challenges in the field of corpus linguistics. Its script differs from that of languages such as English, French, and German, and suitable tools are in limited supply, which makes it difficult to build Urdu corpora. The few Urdu corpora available online are often outdated, limited in size, paid, and incompatible with cutting-edge corpus tools. Despite the growing importance of Urdu in domains including literature, media, and academia, the lack of updated and sizable Urdu corpora keeps researchers from carrying out corpus analysis with these tools. Additionally, building an Urdu corpus requires the text in a Word document, whereas most Urdu text found online is in other formats such as PDF. Converting Urdu PDF files into Word documents with online file converters poses a further challenge because the converters do not support the Urdu script.

3.      Problem Statement

The lack of online Urdu corpora, and of an effective method for building them, prevents researchers from benefiting from cutting-edge corpus tools and novel methodologies in Urdu language research. Converting Urdu PDF text to an editable format is also a major challenge that makes it difficult to build an Urdu corpus. It is therefore important to find an innovative approach that overcomes these challenges and facilitates the building of Urdu corpora.

4.      Objectives

The primary objective of this proposal is to introduce an innovative approach to building an online Urdu corpus that addresses the challenges the Urdu language faces in corpus analysis. The proposed method aims to streamline the Urdu corpus building process, making it feasible, easier, and accessible to linguists, researchers, and language enthusiasts.

5.      Methodology/ Project Detail

5.1 Procedure

The idea proposed here took its final shape after numerous careful steps. The first challenge was to convert Urdu text in PDF into an editable format, in order to make the text machine-readable and to clean the file of non-essentials. I tried more than 20 commonly used online file converters; however, none of them supports the Urdu script: when I attempted a conversion, the text came out garbled and unreadable. Some paid converters claim to convert Urdu PDF files to Word documents; however, when I tried their trial versions, the text was still garbled. After considerable effort and brainstorming, I found an alternative way to address this issue, which is described in the following section. Secondly, when cleaning the text in a Word document, removing the line breaks and spaces takes a lot of time; I found a website that addresses this issue. This method of building an Urdu corpus was thus developed, following the steps below:

5.2 Step-by-step guide to building an Urdu corpus

Step 1

First, download all the PDF files from which you want to build the Urdu corpus, such as books and research articles. Then convert each PDF file to JPG images with the following online file converter:

https://www.ilovepdf.com/pdf_to_jpg

Select the PDF files one by one and convert each to JPG.

Once a PDF file has been converted, download the JPG images into a new folder.

The download arrives as a ZIP file. Extract the ZIP file into the same folder and you will find the images, labelled with page numbers.
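For readers who prefer to script the extraction, the unzip step can be sketched in Python using only the standard library. This is a minimal sketch, not part of the original workflow: the function name `extract_page_images` is hypothetical, and it assumes the page number is the last run of digits in each image file name, which may not match the converter's actual naming.

```python
import re
import zipfile
from pathlib import Path


def extract_page_images(zip_path, out_dir):
    """Extract a ZIP of page images and return the JPG paths in page order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)

    def page_number(path):
        # Assumption: the page number is the last run of digits in the name.
        digits = re.findall(r"\d+", path.stem)
        return int(digits[-1]) if digits else 0

    # Keep only JPG files, wherever the archive placed them.
    images = [
        p for p in out_dir.rglob("*")
        if p.suffix.lower() in {".jpg", ".jpeg"}
    ]
    # Sort numerically so that page 10 comes after page 2, not before it.
    return sorted(images, key=page_number)
```

Calling `extract_page_images("pages.zip", "pages")` returns the image paths in page order, so they can be scanned in sequence in Step 2.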

Step 2

In the second step, use Google Lens to scan the images one by one.

Upload each image, or drag it into Google Lens.

Then select the recognized text, copy it, and paste it into a Word document.

Step 3

In the third step, manually clean the file of non-essentials, including headings and sub-headings, pagination, tables, and figures.

Then, to remove the line breaks, spaces between paragraphs, and indentation, use the online tool RemoveLineBreak: https://removelinebreaks.net/

Cut the text from the Word document and paste it into the box, or upload the file, and then click the convert button.

Copy the result to the clipboard and paste it back into the Word document.
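The same clean-up can also be done offline. The following is a minimal sketch in Python of the kind of operation such a tool performs; it is not the website's actual implementation. The function name `remove_line_breaks` is hypothetical, and the sketch assumes that one or more blank lines mark a paragraph boundary.

```python
import re


def remove_line_breaks(text):
    """Join hard-wrapped lines into paragraphs and collapse extra whitespace."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for paragraph in paragraphs:
        # str.split() with no argument splits on any run of whitespace,
        # so this removes line breaks, indentation, and repeated spaces.
        words = paragraph.split()
        if words:
            cleaned.append(" ".join(words))
    # One paragraph per line, with no blank lines in between.
    return "\n".join(cleaned)
```

Because the function only touches whitespace, the Urdu characters themselves are left unchanged.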

 

 

 

The text in the Word file is now in its final, cleaned form.

Now, note down the word count.
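If the cleaned text is also available as a string, the word count can be approximated in Python. This is a rough sketch with a hypothetical function name, `count_words`; it counts whitespace-separated tokens, so punctuation attached to a word is counted with it, and the result may differ slightly from the figure Word reports.

```python
def count_words(text):
    """Count whitespace-separated tokens in the text."""
    # Assumption: words are separated by whitespace, as in the cleaned file.
    return len(text.split())
```

For instance, the sentence "یہ ایک مثال ہے۔" is counted as four words, because the final "۔" stays attached to the last word.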

 

 

Next, convert the Word file into a text file: open the File menu at the top of the Word document and click Save As, or, if the document has not been saved yet, simply press Ctrl+S. Give the file a code or text ID; for example, for a research article, use a code such as RA-001.

Then open the file type drop-down list and choose Plain Text.

Click the Save button.

After you click Save, a window will pop up. Choose the Other encoding option in the settings.

Select Unicode (UTF-8) and click OK.

The TXT file is ready.
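The saving step can likewise be sketched in Python for anyone who keeps the cleaned text in a script rather than in Word. This is a minimal sketch: the function name `save_corpus_text`, the `corpus` folder, and the three-digit numbering are assumptions modelled on the RA-001 example above.

```python
from pathlib import Path


def save_corpus_text(text, prefix, number, out_dir="corpus"):
    """Write text to a UTF-8 .txt file named with a text ID such as RA-001."""
    # Assumption: IDs are a prefix plus a zero-padded three-digit number.
    text_id = f"{prefix}-{number:03d}"
    out_path = Path(out_dir) / f"{text_id}.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # encoding="utf-8" corresponds to choosing Unicode (UTF-8) in Word.
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling `save_corpus_text(text, "RA", 1)` writes the text to `corpus/RA-001.txt`. Stating the encoding explicitly matters because the platform default is not always UTF-8.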

 

 

 

6.      Application and Contributions

 

The proposed method for building an online Urdu corpus has considerable potential to advance Urdu linguistics and linguistic research more broadly. It gives linguists and researchers a way to build a sizable Urdu corpus from any type of text, across genres, that best suits their research objectives. Following this method, researchers can easily build their own corpora to analyse linguistic patterns, study language variation and evolution, explore sociolinguistic phenomena, and develop computational models for Urdu language processing. In conclusion, this method marks a significant step in the advancement of Urdu linguistics and linguistic research methodologies. It can thereby promote inclusivity and diversity in Urdu linguistic studies, amplifying voices and perspectives that may previously have been marginalized in scholarly discourse.

Nasir Hussain (Author)

Nasir Hussain graduated in English Linguistics and Literature from the University of Baltistan, Skardu. He is currently pursuing an M.Phil degree in Linguistics, specializing in Corpus Linguistics, at Air University Islamabad. He is also working as a Teach for Pakistan fellow. He is a researcher who has published a couple of research articles in the field of linguistics, and a dedicated Facebook blogger committed to sharing valuable content focused on .........
