Content on the internet is growing very quickly, and so is the need to convert unstructured data into structured data. Thousands of data types are available on the web today. Data is stored in many formats, including text documents, images, Excel spreadsheets and many more. The need to obtain more data with less effort has led to increased research in this field. The purpose of our project is to fetch data from different file formats such as .pdf, .txt, .doc and .xls. After the data is fetched from a file, it is transformed and exported into a database. The tool does not require any special format editor on the user's machine. It serves as a way of accessing files in different formats irrespective of the editors that support them.
It is very difficult for people to manually prepare summaries of large documents or to extract their keywords and key phrases. A tool is required to simplify these tasks. The tool handles English-language files only. Very large files can be extracted efficiently through parallel processing, a technique in which a very large file is divided into smaller parts and the parts are processed in parallel with one another. This makes processing of the file faster. In this context, a file of more than 5 pages is considered a large file. After the file has been extracted, a repository of the data is created. Many different operations can later be performed on it, including sentence extraction, keyword extraction, metadata extraction, key phrase extraction and summarization, which provides a short form of the entire extracted content. Extracting the text from a file involves three basic steps: analyzing the unstructured data, locating specific pieces of the text, and filling in the database. Text extraction is supported by many APIs; Apache Tika has been used in this project. The basic task of text extraction is to transform the data and store it in a database for further use.
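As a rough illustration of the parallel processing step, the following Python sketch extracts the pages of a document concurrently when the page count exceeds the 5-page threshold. The names `extract_page` and `extract_document`, the worker count and the use of a thread pool are assumptions made for this example; in the tool itself the extraction work is done by Apache Tika.

```python
from concurrent.futures import ThreadPoolExecutor

LARGE_FILE_PAGES = 5  # from the text: more than 5 pages counts as a large file


def extract_page(page: str) -> str:
    """Hypothetical stand-in for the real per-page extraction step."""
    return " ".join(page.split())  # here we only normalize whitespace


def extract_document(pages: list[str], workers: int = 4) -> str:
    """Extract every page, in parallel when the document is large."""
    if len(pages) > LARGE_FILE_PAGES:
        # Executor.map returns results in input order, so the parts
        # can be joined back together in their original sequence.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            parts = list(pool.map(extract_page, pages))
    else:
        # Small files are processed as a whole, one page after another.
        parts = [extract_page(page) for page in pages]
    return "\n".join(parts)
```

A thread pool is used here only to keep the sketch short. For CPU-heavy extraction, a process pool would probably be the more suitable choice.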
Naive Bayes methods-A classification function based on a naive Bayes classifier decides whether or not a sentence is suitable for extraction. Another technique built on this classifier also used the presence of uppercase words and the sentence length as features. The n top-ranked sentences were then used to make the summary.
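The idea can be sketched in Python as a small naive Bayes classifier over two binary features, the presence of an uppercase word and the sentence length. The feature definitions, the length threshold of 8 words, the add-one smoothing and all function names are assumptions chosen for illustration, not the exact formulation used in the surveyed work.

```python
def features(sentence):
    """Two binary features named in the text (thresholds are assumptions)."""
    words = sentence.split()
    return (
        any(w.isupper() and len(w) > 1 for w in words),  # has an uppercase word
        len(words) >= 8,                                  # counts as a long sentence
    )


def train(examples):
    """examples: list of (feature_tuple, label); label True = summary sentence."""
    n_features = len(examples[0][0])
    model = {}
    for label in (True, False):
        rows = [f for f, y in examples if y == label]
        # Add-one smoothing keeps every probability strictly between 0 and 1.
        prior = (len(rows) + 1) / (len(examples) + 2)
        likelihood = [
            (sum(r[i] for r in rows) + 1) / (len(rows) + 2)
            for i in range(n_features)
        ]
        model[label] = (prior, likelihood)
    return model


def score(model, feats):
    """Posterior probability that a sentence belongs in the summary."""
    joint = {}
    for label, (prior, likelihood) in model.items():
        p = prior
        for present, theta in zip(feats, likelihood):
            p *= theta if present else 1 - theta
        joint[label] = p
    return joint[True] / (joint[True] + joint[False])


def top_sentences(model, sentences, n):
    """Return the n sentences with the highest posterior score."""
    ranked = sorted(sentences, key=lambda s: score(model, features(s)), reverse=True)
    return ranked[:n]
```

The training examples would come from documents with reference summaries; any data used to try the sketch out is toy data.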
K-means clustering-The entire file is divided into sentences, which serve as points on the Cartesian plane. The frequency of each word is calculated using term frequency, and a sentence score is computed for every sentence from these term frequencies. The sentence scores are used as the coordinates that represent the sentences, and the coordinates serve as input for the clustering algorithm. Applying the algorithm to the coordinates generates k cluster centres. Each sentence is assigned to one of the clusters based on its score. The condensed form is created from the cluster that contains the most sentences, and the summary is generated by placing these sentences in the same order in which they appear in the original file. This approach is reported to yield better results than human-written summaries. By contrast, abstractive summaries are those created by rephrasing the information in the file.
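A minimal Python sketch of this procedure is given below. It assumes that each sentence is represented by a single coordinate (the sum of the term frequencies of its words) and that the initial cluster centres are spread evenly between the lowest and highest score. Both choices, and all function names, are assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_scores(sentences):
    """Score each sentence as the sum of the term frequencies of its words."""
    words = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    tf = Counter(w for ws in words for w in ws)
    return [float(sum(tf[w] for w in ws)) for ws in words]


def kmeans_1d(values, k, rounds=20):
    """Plain k-means on one-dimensional points; returns a cluster label per point."""
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * i / max(k - 1, 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(rounds):
        # Assignment step: attach each point to its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels


def kmeans_summary(text, k=2):
    """Keep the sentences of the largest cluster, in document order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    labels = kmeans_1d(sentence_scores(sentences), k)
    biggest = Counter(labels).most_common(1)[0][0]
    return [s for s, lab in zip(sentences, labels) if lab == biggest]
```

Because the score is a plain sum, long sentences tend to score higher. Normalizing by sentence length would be one possible refinement.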
Graph theoretic approach-In this approach, after the stop words have been removed, the sentences are represented as nodes in an undirected graph. Each node represents one sentence. Two sentences are connected by an edge if their similarity is above a threshold level. This approach yields two results. First, the parts of the graph that are not connected to one another (in the simplest case, nodes with no edge to any other node) form the distinct topics covered in the document. Second, the graph identifies the sentences of greater significance: nodes that are connected to many other nodes are preferred for inclusion in the summary, since they share information with many other sentences.
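The two results can be illustrated with a short Python sketch. The text does not name a similarity measure, so Jaccard overlap of the remaining words is used here as an assumption, together with an illustrative threshold of 0.2 and a deliberately tiny stop word list.

```python
import re
from itertools import combinations

# Tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "and", "to", "it"}


def content_words(sentence):
    words = re.findall(r"[a-z']+", sentence.lower())
    return {w for w in words if w not in STOP_WORDS}


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def build_graph(sentences, threshold=0.2):
    """Undirected graph: node = sentence index, edge = similarity above threshold."""
    bags = [content_words(s) for s in sentences]
    graph = {i: set() for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        if jaccard(bags[i], bags[j]) > threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def components(graph):
    """Connected components; each one approximates a distinct topic."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        result.append(sorted(group))
    return result


def rank_by_degree(graph):
    """Sentence indices ordered from most to fewest connections."""
    return sorted(graph, key=lambda n: (-len(graph[n]), n))
```

The sentences at the front of `rank_by_degree` are the candidates for the summary, and `components` gives the topic groups.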
Complex Natural Language Analysis Methods-A lexical chain (a sequence of related words) is defined. Consider an example: "Geeta bought a Duster. She loves the car." Here "car" refers to "Duster", so the two words form a lexical chain. Chains can also occur at the level of word sequences. To find the lexical chains, three steps are followed: (1) select a set of candidate words; (2) for each candidate word, find a chain based on its relatedness to the members of the chain; (3) if such a chain is found, insert the word into it and update the chain accordingly.
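The three steps can be sketched in Python as follows. Relatedness would normally come from a thesaurus or similar lexical resource; here a small hand-written table, `RELATED`, stands in for it and is purely hypothetical. Starting a new chain when no matching chain exists is also an assumption of this sketch.

```python
# Hypothetical relatedness table standing in for a real lexical resource.
RELATED = {
    frozenset({"duster", "car"}),
    frozenset({"car", "vehicle"}),
    frozenset({"bank", "money"}),
}


def related(a, b):
    return a == b or frozenset({a, b}) in RELATED


def build_chains(candidates):
    """Greedily place each candidate word into the first chain it is related to."""
    chains = []
    for word in candidates:
        for chain in chains:
            # Step 2: look for a chain with at least one related member.
            if any(related(word, member) for member in chain):
                chain.append(word)  # Step 3: insert the word and update the chain.
                break
        else:
            chains.append([word])  # No chain found: the word starts a new one.
    return chains
```

With the example from the text, the candidate words "duster" and "car" end up in the same chain.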
Position method-In this method, the position of a sentence in the document is considered. The text in a document usually follows a defined structure, and the sentences of greater importance usually occur at certain specific locations, for example in titles or in the introduction. However, because the structure varies from one document to another, this cannot be relied upon as a general method.
The user browses for the file that needs to be extracted and presses the “upload” button to upload it to the server. The size of the file is then checked. If the file contains more than 5 pages, it is regarded as a large file and undergoes parallel processing; otherwise it is processed as a whole. The extracted content is segmented into sentences and a count of the sentences is produced. Stop words are removed in order to generate keywords and key phrases. Stop words include special characters, punctuation marks and words that occur many times but carry little meaning, such as is, like, are and not. The links, images and metadata are also extracted from the file. Auto summarization is performed based on the keywords present in the sentences. The modules are described below.
Home Page-On this page the user selects the file to upload and clicks the “upload” button.
Result Page-This page provides a separate tab for each option: metadata, auto summary, keywords, key phrases and plain text. The auto summary tab contains a button that generates the auto summary when clicked.
Metadata-This tab displays the metadata attributes and their values.
Plain text-This tab displays all the sentences present in the file, together with the number of times each sentence occurs. Each sentence is shown along with its count in the form of a table.
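A sentence table of this kind can be sketched in a few lines of Python. The sentence splitting rule (a break after ".", "!" or "?" followed by whitespace) and the function name are assumptions made for the example.

```python
import re
from collections import Counter


def sentence_table(text):
    """Return (sentence, count) rows in order of first appearance."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Counter keeps keys in insertion order, so rows follow the document.
    return list(Counter(sentences).items())
```

Each returned pair corresponds to one row of the table shown in the tab.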
Links-This tab displays all the links present in the file. If the file does not contain any links, the tab is left blank.
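One simple way to collect links from already extracted text is a regular expression, as in the Python sketch below. The pattern covers only http and https URLs and is an assumption of this example; links stored as document annotations would need the extraction library itself.

```python
import re

# Matches http/https URLs up to the next whitespace or closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_links(text):
    """Return the distinct links in the text, in order of first appearance."""
    links = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop punctuation that trails the URL
        if url and url not in links:
            links.append(url)
    return links
```

An empty list corresponds to the blank tab described above.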
Keywords-These are the important words present in the file. First, all the stop words are removed from the content. A term frequency and inverse document frequency approach is then followed, in which the term frequency measures how frequently a term occurs in the file. The frequency of each word is calculated and the most frequently occurring terms are fetched; these are the keywords. A stemming algorithm is applied so that different forms of the same word are counted together.
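The keyword step can be sketched in Python as shown below. The sketch uses term frequency only, since an inverse document frequency needs a collection of documents to compare against. The stop word list is a small illustrative sample, and `stem` is a deliberately crude suffix stripper standing in for a real stemming algorithm; both are assumptions.

```python
import re
from collections import Counter

# Small illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"the", "is", "are", "it", "and", "a", "an", "of", "to", "in", "not", "like"}


def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def keywords(text, top=5):
    """Return the most frequent stems after stop word removal."""
    tokens = re.findall(r"[a-z]+", text.lower())  # also drops punctuation and digits
    stems = [stem(t) for t in tokens if t not in STOP_WORDS]
    return [word for word, _ in Counter(stems).most_common(top)]
```

With this stemmer, "extracts", "extracting" and "extracted" are all counted as "extract".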
Images-This tab displays the images present in the file. If no image is present, it displays nothing.
Key phrases-These are keywords that consist of more than one word. They are extracted using the same approach as the keywords.
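As an illustration, key phrases can be approximated by counting pairs of adjacent words in which neither word is a stop word. Restricting phrases to two words, and the small stop word list, are assumptions of this Python sketch.

```python
import re
from collections import Counter

# Small illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"the", "is", "are", "it", "and", "a", "an", "of", "to", "in", "not", "like"}


def key_phrases(text, top=3):
    """Return the most frequent two-word phrases that contain no stop word."""
    phrases = Counter()
    # Work sentence by sentence so that phrases never span a sentence break.
    for sentence in re.split(r"[.!?]", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for first, second in zip(tokens, tokens[1:]):
            if first not in STOP_WORDS and second not in STOP_WORDS:
                phrases[f"{first} {second}"] += 1
    return [phrase for phrase, _ in phrases.most_common(top)]
```

Longer phrases could be handled in the same way by counting runs of three or more adjacent words.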
Auto Summarization-In this process the fetched data is segmented into sentences. The keywords and key phrases are used to generate the summary: each sentence is rated according to the keywords and key phrases it contains, and the sentences are arranged in the summary according to these ratings.
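The rating step can be sketched in Python as follows. The rating used here is the number of keyword and key phrase occurrences in a sentence, and the selected sentences are kept in their original order. Both choices, the summary length and the function name are assumptions made for illustration.

```python
import re


def keyword_summary(text, keywords, max_sentences=2):
    """Pick the sentences that contain the most keyword occurrences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    keys = [k.lower() for k in keywords]

    def rating(sentence):
        lowered = sentence.lower()
        # Plain substring counting: crude, but it also matches multi-word
        # key phrases and inflected forms such as "extracts" for "extract".
        return sum(lowered.count(k) for k in keys)

    # Rank by rating (highest first); ties go to the earlier sentence.
    ranked = sorted(range(len(sentences)), key=lambda i: (-rating(sentences[i]), i))
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]
```

The `keywords` argument can hold single keywords and multi-word key phrases alike.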
The biggest challenge for text extraction and auto summarization is to extract data from different semi-structured sources, including databases and web pages, in the proper format and size and within a reasonable time. Another challenge is to extract files written in other languages such as Hindi, Urdu and French. The summary should not be too short, and it should not contain redundancy.
Dipanjan Das and André F. T. Martins, “A Survey on Automatic Text Summarization”, Language Technologies Institute, Carnegie Mellon University, November 21, 2007.
Vishal Gupta and Gurpreet Singh Lehal, “A Survey of Text Summarization Extractive Techniques”, Journal of Emerging Technologies in Web Intelligence, Volume 2, Number 3, August 2010.
Ayush Agrawal and Utsav Gupta, “Extraction based approach for text summarization using k-means clustering”, International Journal of Scientific and Research Publications, Volume 4, Issue 11, November 2014, ISSN 2250-3153.