Article | Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16 | Simple and Accountable Segmentation of Marked-up Text

Title:
Simple and Accountable Segmentation of Marked-up Text
Author:
Jonathon Read: School of Computing, Teesside University, UK
Rebeca Dridan: Department of Informatics, University of Oslo, Norway
Stephan Oepen: Department of Informatics, University of Oslo, Norway
Year:
2013
Conference:
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), May 22-24, 2013, Oslo University, Norway. NEALT Proceedings Series 16
Issue:
085
Article no.:
033
Pages:
365-373
No. of pages:
9
Publication type:
Abstract and Fulltext
Published:
2013-05-17
ISBN:
978-91-7519-589-6
Series:
Linköping Electronic Conference Proceedings
ISSN (print):
1650-3686
ISSN (online):
1650-3740
Series:
NEALT Proceedings Series
Publisher:
Linköping University Electronic Press; Linköpings universitet



Segmenting documents into discrete, sentence-like units is usually a first step in any natural language processing pipeline. However, current segmentation tools perform poorly on text that contains markup. While stripping markup is a simple solution, we argue for the utility of the extra-linguistic information encoded by markup and present a scheme for normalising markup across disparate formats. We further argue for the need to maintain accountability when preprocessing text, such that a record of modifications to source documents is maintained. Such records are necessary in order to augment documents with information derived from subsequent processing. To facilitate adoption of these principles we present a novel tool for segmenting text that contains inline markup. By converting to plain text and tracking alignment, the tool is capable of state-of-the-art sentence boundary detection using any external segmenter, while producing segments containing normalised markup, with an account of how to recreate the original form.
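
The idea of converting to plain text while tracking alignment can be illustrated with a minimal sketch. This is not the authors' tool: the tag pattern, the function names (`strip_with_alignment`, `naive_segmenter`, `segment`) and the toy punctuation-based segmenter are all assumptions made for illustration, and the sketch does not attempt markup normalisation. It only shows how character offsets recorded during stripping let any plain-text segmenter's output be mapped back to spans of the original marked-up source.

```python
import re
from typing import Callable, List, Tuple

# Assumption: inline markup consists of simple XML/HTML-style tags.
TAG = re.compile(r"<[^>]+>")

Span = Tuple[int, int]


def strip_with_alignment(marked_up: str) -> Tuple[str, List[int]]:
    """Strip tags; record each plain-text character's offset in the source."""
    chars: List[str] = []
    offsets: List[int] = []
    pos = 0
    for tag in TAG.finditer(marked_up):
        for i in range(pos, tag.start()):
            chars.append(marked_up[i])
            offsets.append(i)
        pos = tag.end()
    for i in range(pos, len(marked_up)):
        chars.append(marked_up[i])
        offsets.append(i)
    return "".join(chars), offsets


def naive_segmenter(text: str) -> List[Span]:
    """Hypothetical stand-in for an external segmenter: split after . ! or ?"""
    spans: List[Span] = []
    start = 0
    for m in re.finditer(r"[.!?](?=\s|$)", text):
        spans.append((start, m.end()))
        start = m.end()
        while start < len(text) and text[start].isspace():
            start += 1
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def segment(
    marked_up: str,
    segmenter: Callable[[str], List[Span]] = naive_segmenter,
) -> List[Tuple[str, Span]]:
    """Segment the plain text, then map each segment back to a source span."""
    plain, offsets = strip_with_alignment(marked_up)
    results = []
    for start, end in segmenter(plain):
        source_span = (offsets[start], offsets[end - 1] + 1)
        results.append((plain[start:end], source_span))
    return results


source = "This is <b>bold</b>. Next <i>one</i> here."
for text, (s, e) in segment(source):
    print(repr(text), "<-", repr(source[s:e]))
```

Because the source span runs from the first to the last plain-text character of a segment, a tag that opens before the first character or closes after the last one falls outside the span. A fuller treatment would need a policy for assigning such boundary tags to segments.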

Keywords: Accountability; Markup; Normalisation; Sentence Boundary Detection; Traceability




Responsible for this page: Peter Berkesand
Last updated: 2017-02-21