Representation learning of writing style

Julien Hay; Bich-Lien Doan; Fabrice Popineau; Ouassim Ait Elhara

doi:10.18653/v1/2020.wnut-1.30

Conference paper, Year: 2020



Julien Hay

Role: Author

Données et Connaissances Massives et Hétérogènes (LRI)

Bich-Lien Doan

Role: Author

TAckling the Underspecified

Fabrice Popineau

Role: Author
PersonId : 6988
IdHAL : fabrice-popineau
IdRef : 078844967

Données et Connaissances Massives et Hétérogènes (LRI)

Ouassim Ait Elhara

Role: Author

Société Octopeek (Enghien Les bains, France)

Abstract

In this paper, we introduce a new representation learning method that embeds documents in a stylometric space. Previous studies in authorship analysis focused on feature engineering techniques to represent document styles and to improve model performance on specific tasks. Instead, we directly embed documents in a stylometric space by relying on a reference set of authors and on the intra-author consistency property, which is one of the two components of our definition of writing style. The main intuition of this paper is that we can define a general stylometric space from a set of reference authors such that, in this space, the coordinates of documents are close when the documents are by the same author and far apart when they are by different authors, even for documents by authors who are not in the reference set. The proposed method allows documents to be clustered based on stylistic clues that reflect their authorship. To validate the method empirically, we train a deep neural network model to predict the authors of a large reference dataset consisting of news and blog articles. Although the learning process is supervised, it does not require dedicated labeling of the data; it relies only on article metadata, which is available in huge amounts. We evaluate the model on multiple datasets, on both the authorship clustering and the authorship attribution tasks.
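The property the abstract relies on (documents by the same author lie close together, documents by different authors lie far apart) can be checked on any set of embeddings. The following is a minimal sketch, not the authors' implementation: the function names, the use of cosine distance, and the toy 3-dimensional vectors are all assumptions made for illustration. In practice the vectors would come from the trained model.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def intra_inter_distances(embeddings, authors):
    """Mean cosine distance over same-author and different-author pairs.

    Hypothetical helper: assumes at least one pair of each kind exists.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        d = cosine_distance(embeddings[i], embeddings[j])
        (intra if authors[i] == authors[j] else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Made-up embeddings for four documents by two authors (illustration only).
toy_embeddings = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.3],
]
toy_authors = ["a", "a", "b", "b"]

mean_intra, mean_inter = intra_inter_distances(toy_embeddings, toy_authors)
print(f"intra-author: {mean_intra:.3f}, inter-author: {mean_inter:.3f}")
```

A well-formed stylometric space should give a mean intra-author distance clearly below the mean inter-author distance, including for authors outside the reference set.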

Domains

Computer Science [cs]

Main file

2020.wnut-1.30.pdf (703.39 KB)

Origin: Publisher files allowed on an open archive

Contributor: Kumar Guha

https://hal.science/hal-04244991

Submitted on: Tuesday, December 5, 2023, 17:11:42

Last modified on: Tuesday, December 5, 2023, 17:41:19

Dates and versions

hal-04244991 , version 1 (05-12-2023)

Identifiers

HAL Id : hal-04244991 , version 1
DOI : 10.18653/v1/2020.wnut-1.30

Cite

Julien Hay, Bich-Lien Doan, Fabrice Popineau, Ouassim Ait Elhara. Representation learning of writing style. Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), Nov 2020, Online, France. pp.232-243, ⟨10.18653/v1/2020.wnut-1.30⟩. ⟨hal-04244991⟩


Collections

CNRS INRIA UMR8623 CENTRALESUPELEC INRIA2 LRI-LAHDAK LRI-AO UNIV-PARIS-SACLAY LISN GS-COMPUTER-SCIENCE GS-LIFE-SCIENCES-HEALTH LISN-LAHDAK LISN-AO HUB-IA

244 views

40 downloads
