Publications

You can also find my articles on my Google Scholar profile.

Breakthroughs in Tibetan NLP & Digital Humanities

Published in Revue d'Etudes Tibétaines, 2024

This paper discusses recent advancements in Tibetan Natural Language Processing and Digital Humanities.

Recommended citation: Meelen, M., Nehrdich, S., & Keutzer, K. (2024). "Breakthroughs in Tibetan NLP & Digital Humanities." Revue d'Etudes Tibétaines, 72, 5-25.

One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks

Published in The 2024 Conference on Empirical Methods in Natural Language Processing (Findings), 2024

This paper introduces ByT5-Sanskrit, a new pretrained language model for Sanskrit NLP tasks, demonstrating superior performance in word segmentation, dependency parsing, and OCR post-correction, while also introducing a novel multitask dataset.

Recommended citation: Nehrdich, S., Hellwig, O., & Keutzer, K. (2024). "One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks." In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (Findings).

Observations on the Intertextuality of Selected Abhidharma Texts Preserved in Chinese Translation

Published in Religions, 2023

This study applies computer-aided methods to detect textual reuse in Xuanzang’s translation corpus and selected Abhidharma texts in Chinese. It presents network graph visualizations and examines reuse patterns, demonstrating alignment with established scholarship and providing a foundation for future detailed studies.

Recommended citation: Nehrdich, S. (2023). "Observations on the Intertextuality of Selected Abhidharma Texts Preserved in Chinese Translation." Religions, 14(7), 911.

MITRA-zh: An efficient, open machine translation solution for Buddhist Chinese

Published in NLP4DH, 2023

This paper presents a novel dataset and fine-tuned models for machine translation of Buddhist Classical Chinese, outperforming commercial solutions in efficiency and performance.

Recommended citation: Nehrdich, S., Bingenheimer, M., Brody, J., & Keutzer, K. (2023). "MITRA-zh: An efficient, open machine translation solution for Buddhist Chinese." NLP4DH.

Data-driven dependency parsing of Vedic Sanskrit

Published in Language Resources and Evaluation, 2023

This paper introduces the first data-driven parser for Vedic Sanskrit, exploring various input feature representations and analyzing parsing errors. The optimal model achieves an unlabeled attachment score (UAS) of 87.61 and a labeled attachment score (LAS) of 81.84, demonstrating good performance for this under-resourced ancient Indo-Aryan language.

Recommended citation: Hellwig, O., Nehrdich, S., & Sellmer, S. (2023). "Data-driven dependency parsing of Vedic Sanskrit." Language Resources and Evaluation, 57, 1173-1206.

SansTib, a Sanskrit - Tibetan Parallel Corpus and Bilingual Sentence Embedding Model

Published in Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022

This paper introduces SansTib, a large-scale Sanskrit-Classical Tibetan parallel corpus with 317,289 automatically aligned sentence pairs. It also presents a bilingual sentence embedding model and evaluates the quality of the automatic alignment using a gold evaluation dataset.

Recommended citation: Nehrdich, S. (2022). "SansTib, a Sanskrit - Tibetan Parallel Corpus and Bilingual Sentence Embedding Model." In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6728-6734, Marseille, France. European Language Resources Association.

Accurate Dependency Parsing and Tagging of Latin

Published in Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, 2022

This paper explores the use of Latin BERT word embeddings for morpho-syntactic tagging and introduces a graph-based dependency parser for Latin. The proposed models show competitive performance in tagging and outperform various baselines in parsing.

Recommended citation: Nehrdich, S., & Hellwig, O. (2022). "Accurate Dependency Parsing and Tagging of Latin." In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, pages 20-25, Marseille, France. European Language Resources Association.

Obtaining More Expressive Corpus Distributions for Standardized Ancient Languages

Published in CHR 2021: Computational Humanities Research Conference, 2021

This paper introduces a latent variable model for ancient languages that quantifies the lexical influence of early authoritative works on their literary successors. The model jointly estimates word reuse and composition dates and is applied to a corpus of pre-Renaissance Latin texts.

Recommended citation: Hellwig, O., Sellmer, S., & Nehrdich, S. (2021). "Obtaining More Expressive Corpus Distributions for Standardized Ancient Languages." In Proceedings of the Computational Humanities Research Conference (CHR 2021), Amsterdam, The Netherlands.

A Method for the Calculation of Parallel Passages for Buddhist Chinese Sources Based on Million-scale Nearest Neighbor Search

Published in Journal of the Japanese Association for Digital Humanities, 2020

This paper introduces a novel approach to detect parallel passages in the Chinese Buddhist canon using continuous word representations and nearest neighbor search. It evaluates the quality of detected parallels and demonstrates a web application for philological research.

Recommended citation: Nehrdich, S. (2020). "A Method for the Calculation of Parallel Passages for Buddhist Chinese Sources Based on Million-scale Nearest Neighbor Search." Journal of the Japanese Association for Digital Humanities, 5(2), 132-153.

Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks

Published in Conference on Empirical Methods in Natural Language Processing, 2018

This paper presents end-to-end neural network models for Sanskrit tokenization that jointly handle compound splitting and Sandhi resolution. The language-agnostic models outperform previous approaches for Sanskrit and also perform well on German compound splitting.

Recommended citation: Hellwig, O., & Nehrdich, S. (2018). "Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).