A Clustering Backed Deep Learning Approach for Document Layout Analysis

Rhys Agombar; Max Luebbering; Rafet Sifa

doi:10.1007/978-3-030-57321-8_23

Conference Papers Year : 2020

Rhys Agombar

Fraunhofer Institute for Intelligent Analysis and Information Systems

Max Luebbering

Fraunhofer Institute for Intelligent Analysis and Information Systems

Rafet Sifa

Fraunhofer Institute for Intelligent Analysis and Information Systems

Abstract

Large organizations generate documents and records daily, often in such volume that processing them manually becomes unduly time-consuming. Automated document processing systems are therefore desirable, as they would reduce the time spent handling these documents. Unfortunately, documents are often not designed to be machine-readable, so parsing them is a difficult problem. Image segmentation techniques and deep-learning architectures have been proposed as a solution, but they have difficulty retaining accuracy when page layouts are especially dense. As a result, data may be duplicated, lost, or retrieved inaccurately. We propose a way of refining these segmentations using a clustering-based approach that can be easily combined with existing rule-based refinements. We show that, on a financial document corpus of 2675 pages, this method, when using DBSCAN, is capable of significantly increasing the accuracy of existing deep-learning methods for image segmentation. This improves the reliability of the results in the context of automatic document analysis.
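The abstract does not spell out the refinement procedure, so the following is only an illustrative sketch of how a clustering step could tighten a detector's bounding box, not the authors' implementation. It assumes that text locations are available as 2D points (for example, word centers), that a predicted box is snapped to the extent of every cluster whose points lie mostly inside it, and that a small pure-Python DBSCAN stands in for a library implementation. The names `dbscan`, `refine_box`, `eps`, and `min_pts`, the majority rule, and the sample coordinates are all hypothetical.

```python
from math import hypot


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:      # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:       # noise reachable from a core point
                labels[j] = cluster   # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:   # only core points expand
                queue.extend(reachable)
    return labels


def refine_box(box, points, labels):
    """Snap a box (x0, y0, x1, y1) to the clusters it mostly contains.

    Assumed rule: a cluster belongs to the box when more than half of its
    points fall inside the box. The box is returned unchanged if no
    cluster qualifies.
    """
    x0, y0, x1, y1 = box
    kept = []
    for c in set(labels) - {-1}:
        members = [p for p, l in zip(points, labels) if l == c]
        inside = sum(x0 <= x <= x1 and y0 <= y <= y1 for x, y in members)
        if 2 * inside > len(members):
            kept.extend(members)
    if not kept:
        return box
    xs, ys = zip(*kept)
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical word centers: one dense block, one short line, one stray mark.
points = [(10, 10), (12, 10), (14, 10), (10, 12), (12, 12), (14, 12),
          (10, 50), (12, 50), (14, 50),
          (100, 100)]
labels = dbscan(points, eps=3, min_pts=3)

# A predicted box that cuts off the right edge of the dense block.
refined = refine_box((9, 9, 13, 13), points, labels)
print(refined)
```

In this toy input the predicted box covers four of the six points in the dense block, so it is expanded to that block's full extent, while the distant line and the stray point are ignored. A rule-based refinement could be applied before or after such a step.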

Keywords

Document layout analysis; Faster R-CNN; DBSCAN; Post-processing; Bounding box refinement

Domains

Computer Science [cs]; Library and information sciences

Main file

497121_1_En_23_Chapter.pdf (1.03 MB)

Origin: Files produced by the author(s)

https://inria.hal.science/hal-03414749

Submitted on: Thursday, November 4, 2021, 3:58:28 PM

Last modification on: Friday, November 5, 2021, 3:57:59 AM

Long-term archiving on: Saturday, February 5, 2022, 7:10:38 PM

Dates and versions

hal-03414749 , version 1 (04-11-2021)

Licence

Attribution

Identifiers

HAL Id : hal-03414749 , version 1
DOI : 10.1007/978-3-030-57321-8_23

Cite

Rhys Agombar, Max Luebbering, Rafet Sifa. A Clustering Backed Deep Learning Approach for Document Layout Analysis. 4th International Cross-Domain Conference for Machine Learning and Knowledge Extraction (CD-MAKE), Aug 2020, Dublin, Ireland. pp.423-430, ⟨10.1007/978-3-030-57321-8_23⟩. ⟨hal-03414749⟩

Collections

IFIP-LNCS IFIP IFIP-TC IFIP-TC5 IFIP-WG IFIP-TC12 IFIP-TC8 IFIP-WG8-4 IFIP-WG8-9 IFIP-CD-MAKE IFIP-WG12-9 IFIP-LNCS-12279
