Web Scraping with HTML DOM Method for Data Collection of Scientific Articles from Google Scholar

Indonesian Journal of Information Systems (IJIS), Vol. 2, No. 2, February 2020. Rahmatulloh, Gunawan.

Google Scholar is a web-based service for searching academic literature broadly. Various types of references can be accessed, such as peer-reviewed papers, theses, books, abstracts, and articles from academic publishers, professional communities, preprint repositories, universities, and other academic organizations. Google Scholar provides a profile creation feature for every researcher, expert, and lecturer. The number of publications from an academic institution, along with detailed data on its scientific articles, can be accessed through Google Scholar. A recapitulation of the scientific articles published by each researcher in an institution is needed to assess research performance collectively. However, no service is available that recapitulates the scientific article publications of every researcher in an institution. Therefore, this study builds such a recapitulation. Data were collected from Google Scholar by applying web scraping technology. The scraping experiment in this study succeeded in retrieving data on 238 researchers and 2,523 articles. The downloaded data were stored in a database and then used to recapitulate the publication of scientific articles. The recapitulation can display a list of researcher profiles, a list of affiliations, a list of citations, and a list of article titles; it can be printed in *.pdf or *.xlsx format and is equipped with data search and sorting features.


Introduction
For scientists and researchers, publishing research results is an obligation. Forms of research publication include books, Intellectual Property Rights (IPR), and scientific articles. Publications in the academic community, especially at universities, have a significant impact on lecturers' awareness of the importance of conducting studies and research and of writing scientific papers [1]. Google Scholar is a web-based service from Google for searching academic literature broadly. One can search across all fields of science and types of reference, such as peer-reviewed papers, theses, books, abstracts, and articles from academic publishers, professional communities, preprint repositories, universities, and other academic organizations. Google Scholar is designed to gather the articles written by a researcher into an account that displays how often each article is cited, which can in turn increase the citation of those articles in other academic literature [2].
Google Scholar provides a profile creation feature for every researcher, expert, or lecturer. Through it, the number of publications from an academic institution, researchers' profiles, and scientific article publication data can all be accessed. A scientific article published in an online journal needs only a short time to be indexed by Google Scholar. A recap of the scientific articles published by each researcher in an institution or organization is needed to assess research performance collectively. However, a problem emerges: no service is available to recap the scientific article publications of every researcher in an institution or organization. Obtaining collective data, or a recapitulation of the publications of all researchers or lecturers from an institution or college, takes time and effort. Once recapitulated, the publication data can be utilized by academic institutions or organizations. This research obtained data from Google Scholar to recapitulate scientific article publication data by applying web scraping technology. Web scraping is a technology that retrieves resources from the web so that the results can be reused by other systems. The process of retrieving data or information from sites on the internet is variously called web scraping [3], [4], [5]; web extraction [6], [7], [8]; web mining [9], [10]; and web harvesting [11], [12].
Several studies on web scraping of scientific articles or literature from the internet have been carried out previously, including web scraping for an Indonesian-English parallel corpus using the HTML DOM method [4], web scraping software for searching gray literature [5], the application of web scraping techniques to scientific article search engines [13], and the application of web scraping and winnowing for detecting plagiarism in final project titles [14], [15]. Several methods can be used in web scraping, such as regular expressions, HTML DOM, and XPath [16]. Each method has its own characteristics, so it needs to be well understood before it is applied. The regular expression method requires less memory than the HTML DOM and XPath methods, whereas HTML DOM takes the least time and uses the least data of the three [15]. In this study, web scraping with the HTML DOM method was used to download scientific article publication data from Google Scholar based on the id or class contained in the Google Scholar web source code. The data that were successfully downloaded were stored in a database and then used to recapitulate the publication of scientific articles. The recapitulation is designed to display a list of researcher profiles, a list of affiliations, a list of citations, and a list of article titles, which can be printed in *.pdf or *.xlsx format and are complemented by data search and sorting features.
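The difference between the regular expression and element-based approaches can be illustrated with a minimal sketch. It is not the authors' PHP code: the HTML fragment is hypothetical, the function and class names are invented for illustration, and Python's event-based html.parser stands in for a full HTML DOM parser. Only the class name gsc_a_t comes from the text.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment; gsc_a_t is one of the classes named in the paper.
SAMPLE = '<td class="gsc_a_t"><a href="#">Web scraping with <b>HTML DOM</b></a></td>'


def titles_by_regex(html):
    """Regular-expression approach: match the raw markup, then strip inner tags."""
    cells = re.findall(r'<td class="gsc_a_t">(.*?)</td>', html, flags=re.S)
    return [re.sub(r"<[^>]+>", "", cell).strip() for cell in cells]


class ClassTextParser(HTMLParser):
    """Element-based approach: collect the text nested under one class.

    Void elements such as <br> are not handled in this sketch.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.wanted_class in classes:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def titles_by_dom(html):
    parser = ClassTextParser("gsc_a_t")
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


if __name__ == "__main__":
    print(titles_by_regex(SAMPLE))
    print(titles_by_dom(SAMPLE))
```

Both functions return the same title for the sample, but the pattern-based version stops matching as soon as the attribute order or class list changes, whereas the element-based version keys only on the class name.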

Literature Review
A parallel corpus is a pair of interconnected text documents: the first contains a collection of source sentences, and the second contains their translations. A parallel corpus serves as the main resource for developing statistical machine translation. Collecting a parallel corpus manually takes considerable time and cost. The research in [4] implemented web scraping with the HTML DOM method to collect an Indonesian-English parallel corpus. Its experiments produced 38,712 parallel corpus pairs from the bilingual news website http://www.berita2bahasa.com/, with Indonesian news documents as the source and English as the translation.
The research in [5] surveys a variety of tools that can be used to search for references and to scrape gray literature. About 15 platforms that can be used for scraping data are presented, each with a description, price, and URL. The study provides information on the availability of free and low-cost web scraping software, which creates opportunities for those with limited resources, especially researchers who work alone or in small organizations.
The research in [13] applied web scraping to retrieve data from several scientific article search engines. Three were chosen: Garba Rujukan Digital (Garuda) http://garuda.ristekdikti.go.id/, the Indonesian Scientific Journal Database (ISJD) http://isjd.pdii.lipi.go.id/, and Google Scholar https://scholar.google.com/. The study succeeded in applying web scraping techniques and downloading some data from the selected search engines. The downloaded data were then stored in a database table consisting of six attributes: id, website, keywords, results, file_download, and date_time_update.
The research in [14] applied the winnowing algorithm to measure the level of similarity between scientific article titles. Google Scholar was used to obtain existing research titles for comparison with the title entered. Web scraping with cURL (Client URL) and a Hypertext Markup Language Document Object Model (HTML DOM) parser was used to retrieve the title data from Google Scholar. The experiments succeeded in presenting the level of similarity as a percentage, categorized as low, middle, or high plagiarism.
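How winnowing yields a similarity percentage for two titles can be sketched as follows. This is a minimal illustration, not the implementation in [14]: the k-gram length, the window size, the CRC32 hash, and the use of Jaccard overlap to turn fingerprints into a percentage are all assumptions, and fingerprint positions are discarded for brevity.

```python
import zlib


def kgram_hashes(text, k):
    """Hash every k-character substring of the normalised text."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return [zlib.crc32(cleaned[i:i + k].encode("utf-8"))
            for i in range(len(cleaned) - k + 1)]


def winnow(text, k=5, window=4):
    """Return the fingerprint set: the minimum hash of each sliding window."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window])
            for i in range(len(hashes) - window + 1)}


def similarity_percent(title_a, title_b, k=5, window=4):
    """Jaccard overlap of the two fingerprint sets, as a percentage."""
    a = winnow(title_a, k, window)
    b = winnow(title_b, k, window)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)


if __name__ == "__main__":
    print(similarity_percent("Web Scraping with HTML DOM",
                             "Web scraping using the HTML DOM method"))
```

The thresholds that separate low, middle, and high plagiarism are not given in the text, so the sketch stops at the percentage.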
The creation of a recapitulation service for publishing scientific articles collectively for an institution is the main focus of this research. Therefore, the recapitulation of scientific article publications can be done easily. In this study, the data were scraped using the HTML DOM method based on the Id or class contained in the Google Scholar web source code. Data scraping was done based on each researcher's Google Scholar Id. The downloaded data were then stored in a database to recapitulate the publication of scientific articles.

Data Collection
There are four main steps in the process of retrieving data from Google Scholar: mapping Google Scholar web pages, developing the web scraping source code, saving the scraped data to the database, and reporting, as shown in Figure 1.

Mapping Google Scholar web pages
Mapping was done by displaying the source code of the web pages in a web browser and identifying each id or class on the web page elements. From the identified ids and classes, those matching the data attributes to be scraped were chosen. Figure 2 shows examples of classes identified in the source code of Google Scholar web pages, such as gsc_a_x, gsc_a_t, gsc_a_c, and gsc_a_y.
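The mapping step amounts to pairing each class found in the page source with a data attribute and then reading the matching elements row by row. The sketch below shows one way to do this. The meaning given to each class (title, citations, year) is an assumption that should be checked against the live page, the sample markup is hypothetical, and gsc_a_x is left unmapped because the text does not state its role.

```python
from html.parser import HTMLParser

# Assumed meaning of the classes named in the text.
CLASS_TO_FIELD = {"gsc_a_t": "title", "gsc_a_c": "citations", "gsc_a_y": "year"}

# Hypothetical markup shaped like an article listing.
SAMPLE = """
<table>
<tr><th>Title</th><th>Cited by</th><th>Year</th></tr>
<tr>
  <td class="gsc_a_x"><input type="checkbox"></td>
  <td class="gsc_a_t"><a href="#">Sample article one</a></td>
  <td class="gsc_a_c"><a href="#">12</a></td>
  <td class="gsc_a_y"><span>2019</span></td>
</tr>
<tr>
  <td class="gsc_a_t"><a href="#">Sample article two</a></td>
  <td class="gsc_a_c"><a href="#">3</a></td>
  <td class="gsc_a_y"><span>2020</span></td>
</tr>
</table>
"""


class ArticleRowParser(HTMLParser):
    """Collect one record per <tr>, reading the text of each mapped <td>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None     # record being filled, or None outside a <tr>
        self.field = None   # field name of the current <td>, if it is mapped

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = {}
        elif tag == "td" and self.row is not None:
            classes = (dict(attrs).get("class") or "").split()
            self.field = next(
                (CLASS_TO_FIELD[c] for c in classes if c in CLASS_TO_FIELD), None)
            if self.field:
                self.row[self.field] = ""

    def handle_endtag(self, tag):
        if tag == "td":
            self.field = None
        elif tag == "tr" and self.row is not None:
            if self.row:  # header rows have no mapped cells and are skipped
                self.rows.append(
                    {key: " ".join(value.split()) for key, value in self.row.items()})
            self.row = None

    def handle_data(self, data):
        if self.field and self.row is not None:
            self.row[self.field] += data


def parse_articles(html):
    parser = ArticleRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for record in parse_articles(SAMPLE):
        print(record)
```

Keeping the class-to-attribute pairs in one dictionary means that a change in the page markup requires editing only that dictionary.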

Developing web scraping source code
Development used PHP version 5, the Apache web server, and a MySQL database. Functions in the PHP source code were designed to scrape data based on the id or class on the Google Scholar web page. A snippet of the scraping code, written as pseudocode, is shown in Figure 3. It contains one class named request_paper, which has 10 methods.
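Because Figure 3 is not reproduced here, the following is only a hypothetical analogue of the request_paper class, written in Python rather than PHP. The ten methods of the original are not listed in the text, so the three shown are invented for illustration, and the query parameters user, cstart, and pagesize are assumptions about the profile URL. The download function is injectable so that the sketch can be exercised without network access.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://scholar.google.com/citations"


class RequestPaper:
    """Hypothetical analogue of request_paper; method names are illustrative."""

    def __init__(self, scholar_id, fetch=None):
        self.scholar_id = scholar_id
        # A custom fetch function can be supplied, for example in tests.
        self._fetch = fetch or self._http_get

    def profile_url(self, start=0, page_size=100):
        # 'user', 'cstart' and 'pagesize' are assumed query parameters.
        query = urlencode(
            {"user": self.scholar_id, "cstart": start, "pagesize": page_size})
        return f"{BASE_URL}?{query}"

    @staticmethod
    def _http_get(url):
        request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request, timeout=30) as response:
            return response.read().decode("utf-8", errors="replace")

    def download_pages(self, max_pages=1, page_size=100):
        """Return the raw HTML of up to max_pages listing pages."""
        return [self._fetch(self.profile_url(page * page_size, page_size))
                for page in range(max_pages)]


if __name__ == "__main__":
    # Stub fetcher: echoes the URL instead of contacting the network.
    scraper = RequestPaper("EXAMPLE_ID", fetch=lambda url: f"<html>{url}</html>")
    for page in scraper.download_pages(max_pages=2, page_size=20):
        print(page)
```

Separating URL construction from downloading keeps the part that depends on the site layout small and easy to adjust.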

Saving the scraped data to the database
The data were saved after the scraping process was complete. MySQL Server, connected to the PHP-based application, was used to store the data. Several tables were designed to store them, including a researcher profile table, a title table, a citation table, an affiliation table, and others.
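A storage layout along these lines can be sketched as follows. The sketch uses SQLite from the Python standard library in place of MySQL so that it runs without a server. All table and column names are assumptions, since the actual schema is not given, and the citation count is folded into the article table rather than kept in a separate citation table.

```python
import sqlite3

# Assumed schema; the paper's actual tables and columns are not listed.
SCHEMA = """
CREATE TABLE affiliation (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE researcher (
    nidn           TEXT PRIMARY KEY,
    name           TEXT NOT NULL,
    scholar_id     TEXT NOT NULL UNIQUE,
    affiliation_id INTEGER NOT NULL REFERENCES affiliation(id)
);
CREATE TABLE article (
    id        INTEGER PRIMARY KEY,
    nidn      TEXT NOT NULL REFERENCES researcher(nidn),
    title     TEXT NOT NULL,
    year      INTEGER,
    citations INTEGER NOT NULL DEFAULT 0,
    UNIQUE (nidn, title)
);
"""


def open_store(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def save_articles(conn, nidn, rows):
    """Insert scraped rows; a repeated title replaces the earlier record."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO article (nidn, title, year, citations) "
            "VALUES (?, ?, ?, ?)",
            [(nidn, row["title"], row.get("year"), row.get("citations", 0))
             for row in rows],
        )


if __name__ == "__main__":
    conn = open_store()
    conn.execute("INSERT INTO affiliation (id, name) VALUES (1, 'Informatics')")
    conn.execute(
        "INSERT INTO researcher VALUES ('0001', 'A. Lecturer', 'EXAMPLE_ID', 1)")
    save_articles(conn, "0001", [{"title": "Sample", "year": 2019, "citations": 3}])
    print(conn.execute("SELECT title, year, citations FROM article").fetchall())
```

The uniqueness constraint on researcher and title means that synchronizing the same profile twice refreshes the citation counts instead of duplicating the articles.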

Reporting
Reporting was done by accessing the data stored in the database. The data were then displayed as a recapitulation of scientific article publications, designed to show a list of researcher profiles, a list of affiliations, a list of citations, and a list of article titles, which can be printed in *.pdf or *.xlsx format.
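The reporting step can be illustrated by summarizing stored articles per researcher and writing the summary to a file. In this sketch CSV stands in for the *.xlsx and *.pdf outputs, which would require third-party libraries. The record fields and function names are assumptions made for the example.

```python
import csv
import io

FIELDS = ["researcher", "affiliation", "articles", "citations"]


def recap_rows(articles):
    """Summarise per researcher: number of titles and total citations."""
    recap = {}
    for article in articles:
        entry = recap.setdefault(article["researcher"], {
            "researcher": article["researcher"],
            "affiliation": article["affiliation"],
            "articles": 0,
            "citations": 0,
        })
        entry["articles"] += 1
        entry["citations"] += article["citations"]
    return list(recap.values())


def to_csv(rows):
    """Render the recap as CSV text, one line per researcher."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = [
        {"researcher": "A", "affiliation": "X", "citations": 3},
        {"researcher": "A", "affiliation": "X", "citations": 2},
        {"researcher": "B", "affiliation": "Y", "citations": 0},
    ]
    print(to_csv(recap_rows(sample)))
```

A spreadsheet application can open the resulting file directly, which covers the tabular use case while leaving the choice of an *.xlsx or *.pdf library open.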

Data Requirement
Various data attributes are available on Google Scholar. The data attributes needed to produce the recapitulation of scientific article publications in this study are shown in Table 1.

Table 1
…   …                Information       Information
8   #gsc_rsb_st      Citation_recap_   Citation
9   #gsc_md_hist_b   Graph             Graph

Results and Analysis
As designed, the web scraper was implemented in a web-based application that can be accessed at http://adagos.yucoding.com. The application has two access levels: admin and public. The main page shown to users with admin privileges after a successful login is presented in Figure 4. In Figure 4, the left section displays the menu available to the admin, and the right section lists the records that have been input. For example, when the "Lecturer Profile" menu on the left is selected, the lecturer profiles that have been input and stored in the database are listed on the right. Besides accessing data, users with admin access rights can also manipulate it.
In this study, data on lecturers as university researchers were chosen as the sample data for the experiment. Lecturer data with the attributes NIDN, Name, Affiliation, and Google Scholar ID were input at an early stage, before scraping. The lecturer data input form is shown in Figure 5. Each lecturer whose data are input into the system must have a Google Scholar ID. Lecturers who do not yet have one are required to create a research profile through https://scholar.google.com/. The experiments in this study succeeded in inputting data on 238 lecturers from 10 affiliations.
Scraping can be done after a lecturer profile has been added and stored in the database. The process begins by selecting one of the lecturer profiles and then selecting the "Syncronization" menu, as shown in Figure 6. Scraping is performed based on the Google Scholar ID and the ids or classes selected according to the required data attributes. After the scraping process finishes, the article data that were successfully downloaded are displayed, as shown in Figure 6.
The display in Figure 6 is similar to the profile display on Google Scholar because all the data shown result from scraping Google Scholar. The recapitulation is created automatically after scraping is done. The lecturer list can be sorted by NIDN, name, or department, as shown in Figure 7. The scientific articles within an affiliation can be displayed in order, for example by number of citations, as shown in Figure 11. An example of a scraped publication report exported to *.pdf format is shown in Figure 12. Besides PDF, reports can also be downloaded in Excel format.
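The sorting behaviour described above can be sketched in a few lines. The record fields and function names are assumptions, and breaking citation ties alphabetically by title is an added choice that the text does not specify.

```python
SORTABLE = {"nidn", "name", "department"}


def sort_lecturers(lecturers, key="nidn"):
    """Sort the lecturer list by NIDN, name, or department, ignoring case."""
    if key not in SORTABLE:
        raise ValueError(f"cannot sort by {key!r}")
    return sorted(lecturers, key=lambda record: str(record[key]).casefold())


def articles_by_citations(articles, affiliation):
    """List one affiliation's articles, most cited first, ties by title."""
    selected = [a for a in articles if a["affiliation"] == affiliation]
    return sorted(selected,
                  key=lambda a: (-a["citations"], a["title"].casefold()))


if __name__ == "__main__":
    lecturers = [
        {"nidn": "0002", "name": "budi", "department": "Informatics"},
        {"nidn": "0001", "name": "Ani", "department": "Physics"},
    ]
    for record in sort_lecturers(lecturers, key="name"):
        print(record)
```

Restricting the sort key to a fixed set mirrors the three options named in the text and rejects any other field.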