Intelligent information extraction from scholarly document databases


  • Fernando Vegas Fernandez



Market Intelligence, Business Intelligence, Competitive Intelligence, Information Systems, Geo-Economics


Extracting knowledge from large document databases has long been a challenge. Most researchers carry out a literature review and manage their document databases with tools that only provide a bibliography; when it comes to retrieving information (a list of concepts and ideas), functionality is severely lacking. Researchers need to extract specific information from their scholarly document databases according to a predefined breakdown structure. Those databases usually contain a few hundred documents, information requirements differ from one research project to the next, and algorithmic techniques are not always the answer. Because most retrieval and information extraction algorithms require manual training, supervision, and tuning, it can be quicker and more efficient to do the extraction by hand and to dedicate time and effort to defining an effective semantic search list, which is the key to obtaining the desired results. A robust definition of a relative importance index is the final step: it yields a list of concepts ranked by importance, which helps both to measure trends and to find a quick path to the most appropriate paper in each case.
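
As a rough illustration of the last two steps, the sketch below matches a hand-built semantic search list against a small set of documents and ranks the concepts by the share of documents that mention them. The index formula, the function name rank_concepts, and the sample texts and search terms are all assumptions made for illustration; the abstract does not state how the paper defines its relative importance index.

```python
from collections import Counter


def rank_concepts(documents, search_lists):
    """Rank concepts by the share of documents that mention them.

    documents: dict mapping a document id to its plain text.
    search_lists: dict mapping a concept to the list of search terms
        (the semantic search list) that count as a mention of it.
    The share-of-documents index is an assumed stand-in, not the
    paper's own relative importance index.
    """
    total = len(documents)
    hits = Counter()
    for text in documents.values():
        lowered = text.lower()
        for concept, terms in search_lists.items():
            # A document counts once per concept, however many terms match.
            if any(term.lower() in lowered for term in terms):
                hits[concept] += 1
    index = {
        concept: (hits[concept] / total if total else 0.0)
        for concept in search_lists
    }
    # Highest index first; ties are broken alphabetically by concept name.
    return sorted(index.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample data, not taken from the paper.
documents = {
    "paper_a": "Delays in payment and political unrest increase project risk.",
    "paper_b": "Payment delays were the top-ranked factor.",
    "paper_c": "Design changes and unrest were reported.",
}
search_lists = {
    "payment delay": ["delays in payment", "payment delays"],
    "political unrest": ["unrest"],
    "design change": ["design changes"],
}

ranking = rank_concepts(documents, search_lists)
for concept, score in ranking:
    print(f"{concept}: {score:.2f}")
```

Listing several phrasings per concept is what lets a hand-defined search list catch variants such as "delays in payment" and "payment delays" without any trained model, which is the trade-off the abstract argues for in databases of a few hundred documents.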

