Representation learning of writing style - Laboratoire de recherche en informatique. Team: Apprentissage et optimisation
Conference paper, Year: 2020

Representation learning of writing style

Abstract

In this paper, we introduce a new method of representation learning that aims to embed documents in a stylometric space. Previous studies in the field of authorship analysis focused on feature engineering techniques to represent document styles and to enhance model performance on specific tasks. Instead, we directly embed documents in a stylometric space by relying on a reference set of authors and on the intra-author consistency property, which is one of the two components in our definition of writing style. The main intuition of this paper is that we can define a general stylometric space from a set of reference authors such that, in this space, the coordinates of documents are close when the documents are by the same author and far apart when they are by different authors, even for documents by authors who are not in the reference set. The proposed method allows documents to be clustered based on stylistic clues reflecting their authorship. For the empirical validation of the method, we train a deep neural network model to predict the authors of a large reference dataset consisting of news and blog articles. Although the learning process is supervised, it does not require dedicated labeling of the data: it relies only on the metadata of the articles, which is available in huge amounts. We evaluate the model on multiple datasets, on both the authorship clustering and the authorship attribution tasks.
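
The geometric property described in the abstract (same-author documents close together, different-author documents far apart) can be checked with a few lines of code. The following is a minimal sketch and not the authors' implementation: the function names `cosine` and `intra_inter_similarity`, the choice of cosine similarity, and the toy coordinates are all assumptions made for illustration. In the paper's setting the coordinates would come from the trained network rather than being written by hand.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def intra_inter_similarity(embeddings, authors):
    """Return (mean same-author similarity, mean different-author similarity).

    Assumes at least one same-author pair and one different-author pair.
    """
    intra, inter = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        if authors[i] == authors[j]:
            intra.append(sim)
        else:
            inter.append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical 3-dimensional document coordinates, invented for this example.
toy_embeddings = [
    [0.9, 0.1, 0.0],  # document 1 by author A
    [0.8, 0.2, 0.1],  # document 2 by author A
    [0.1, 0.9, 0.2],  # document 1 by author B
    [0.0, 0.8, 0.3],  # document 2 by author B
]
toy_authors = ["A", "A", "B", "B"]

intra, inter = intra_inter_similarity(toy_embeddings, toy_authors)
print(f"same-author: {intra:.3f}, different-author: {inter:.3f}")
```

A space that respects intra-author consistency should yield a clearly higher same-author mean than different-author mean, including for authors who were not part of the reference set used in training.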
Main file
2020.wnut-1.30.pdf (703.39 KB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-04244991 , version 1 (05-12-2023)

Cite

Julien Hay, Bich-Lien Doan, Fabrice Popineau, Ouassim Ait Elhara. Representation learning of writing style. Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), Nov 2020, Online, France. pp.232-243, ⟨10.18653/v1/2020.wnut-1.30⟩. ⟨hal-04244991⟩