Abstract
Morphemes are the smallest units of meaning in a language and play an important role in both semantics and grammatical structure. This is especially true in agglutinative languages, such as the Nguni languages, which construct words by concatenating many morphemes. Many Natural Language Processing (NLP) tasks, such as machine translation, can be improved by incorporating morphological information. However, methods for extracting this information for the Nguni languages still need to be refined.
This paper evaluates the use of neural methods for morphological parsing, that is, the labelling of morphemes with their corresponding grammatical roles. We compare two main approaches: fine-tuning pre-trained language models, and training models from scratch.
We compare the performance of these models with each other as well as with traditional methods of solving the task, such as Finite State Transducer (FST) models.
We found that models trained from scratch outperformed both fine-tuned pre-trained language models and the traditional FST model. Models using morpheme-level embeddings and sentence-level context tended to perform the best.
Introduction
Morphemes are the basic building blocks of meaning (semantics) in a language. By understanding the grammatical role that morphemes play in a sentence, we can solve downstream Natural Language Processing (NLP) tasks better. These tasks range from information retrieval to machine translation.
Splitting up text into its morphemes is known as morphological segmentation. There are two kinds of morphological segmentation: canonical and surface segmentation. Canonical segmentation splits the text into the full, underlying linguistic morphemes in their canonical forms. Surface segmentation splits the text into its morphs, which are the morphemes as they appear in the word after undergoing any sound or spelling changes that may occur.
Morphological parsing is the task of identifying the grammatical role of each morpheme within a word. For example, "zobomi" (meaning "of life" in isiXhosa) is parsed as "za[PossConc14] - u[NPrePre14] - (bu)[BPre14] - bomi[NStem]". The goal of morphological parsing is to predict these morpheme tags for arbitrary text. In our project, we focus on the tagging step and assume the text has already been segmented by another algorithm.
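Concretely, the task can be framed as sequence labelling over pre-segmented morphemes. The sketch below (plain Python with illustrative names) shows the expected input and output for the canonical morphemes of the "zobomi" example above.

```python
# Morphological parsing as sequence labelling: the input is a word that has
# already been segmented into morphemes, and the output is one grammatical
# tag per morpheme (from the isiXhosa parse of "zobomi" above).
morphemes = ["za", "u", "bu", "bomi"]
tags      = ["PossConc14", "NPrePre14", "BPre14", "NStem"]

# A parser is any function mapping a morpheme sequence to a tag sequence,
# e.g. the bi-LSTM or CRF taggers described later on this page.
def parse(morphemes: list[str]) -> list[str]:
    ...

assert len(morphemes) == len(tags)  # one tag per morpheme
```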
The Nguni languages are a group of widely-spoken South African languages, which include isiXhosa, isiZulu, isiNdebele, and siSwati. These languages are low-resourced, meaning that there are few tools and corpora (collections of documents) available for them.
Morphological information is especially important for NLP in Nguni languages for two main reasons:
- Nguni languages are agglutinative, meaning many words are created by combining multiple morphemes.
- Nguni languages are written conjunctively, meaning that morphemes are concatenated into a single word. For example, in isiXhosa, "andikambuzi" means "I haven't yet asked him", and is composed of the morphemes "a", "ndi", "ka", "m", "buza", and "i".
Few morphological parsers exist for the Nguni languages. One example of a morphological parser is the rule-based ZulMorph parser for isiZulu. Rule-based parsers require linguists to manually incorporate stems, affixes, and grammar rules into the software. This is a tedious process which requires a high degree of expertise.
By comparison, machine-learning approaches are data driven. Instead of manually incorporating information into the algorithm, the parser can be automatically generated from linguistically-annotated data. This means that the process is language agnostic and can leverage previously-existing datasets.
In this project, we investigated the use of neural methods for morphological parsing of Nguni languages. We took two main approaches to this:
- Simbarashe Mawere examined the fine-tuning of pre-trained language models. Three different pre-trained language models (PLMs), with varying levels of inclusion of the Nguni languages, were evaluated: XLM-RoBERTa, Afro-XLMR, and Nguni-XLMR.
- Cael Marquard examined training models from scratch. Two kinds of models were evaluated: bidirectional Long Short-Term Memory (bi-LSTM) models and neural Conditional Random Fields (CRFs).
The three research questions that we aimed to answer were:
- Can neural approaches outperform traditional approaches to morphological parsing for the Nguni languages?
- Do models trained from scratch or fine-tuned pre-trained language models perform better?
- Do models classifying surface segmentations or models classifying canonical segmentations perform better?
In order to answer these questions, we compared the tagging quality of these models to each other and to a traditional, rule-based approach (ZulMorph) as a baseline. Comparisons were also made across segmentation types.
Models Trained From Scratch
Models
Two architectures were chosen: Bidirectional Long Short-Term Memory (Bi-LSTM) and neural Conditional Random Fields (CRFs). These architectures have both been successfully applied to the closely-related tasks of morphological segmentation and part-of-speech tagging for the Nguni languages.
Bi-LSTMs
Bi-LSTM models are a type of recurrent neural network (RNN), a class of neural networks able to "remember" past inputs when computing future outputs. LSTMs are an RNN architecture that avoids issues such as vanishing and exploding gradients, and they are popular for NLP tasks. Bidirectional LSTMs combine two separate LSTMs, one reading the input forward and one in reverse. This allows bi-LSTMs to take both the past and the future of the sequence they are classifying into account, giving them better context.
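A minimal PyTorch sketch of such a tagger is shown below; the embedding size, hidden size, and vocabulary and tag-set sizes are illustrative placeholders rather than the values used in the project.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal bi-LSTM sequence tagger: embed each morpheme (or character),
    run a bidirectional LSTM over the sequence, and score each tag."""

    def __init__(self, vocab_size: int, num_tags: int,
                 embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True concatenates the forward and backward states,
        # so the output dimension is 2 * hidden_dim.
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> scores: (batch, seq_len, num_tags)
        embedded = self.embed(token_ids)
        hidden, _ = self.lstm(embedded)
        return self.out(hidden)

# Example: score a batch containing one 4-morpheme sequence.
tagger = BiLSTMTagger(vocab_size=1000, num_tags=50)
scores = tagger(torch.randint(0, 1000, (1, 4)))
predicted_tags = scores.argmax(dim=-1)  # (1, 4) tag indices
```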
CRFs
A CRF is a probabilistic model which explicitly models the statistical dependence of the output (label) sequence on the input sequence, as well as the dependence of the output sequence on itself. This allows it to explicitly model the grammar of the language. CRFs have been used for morphological segmentation as well as part-of-speech tagging in the Nguni languages. For this project, the CRF only models the dependence of each label on its neighbouring labels and the corresponding input item. This linear-chain approach is simpler to implement and more computationally efficient to train.
CRFs usually rely on a set of hand-crafted features in order to assign probabilities. However, an alternative to this is to use a neural network to generate these features. In this project, a bi-LSTM is used to generate these features.
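A sketch of this bi-LSTM-CRF combination is shown below. It uses the third-party pytorch-crf package for the linear-chain CRF layer, which is an assumption made for illustration; the project's actual implementation may differ.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party pytorch-crf package (assumed here)

class BiLSTMCRFTagger(nn.Module):
    """Bi-LSTM emission scores feeding a linear-chain CRF layer, which
    models the dependence of each tag on its neighbouring tags."""

    def __init__(self, vocab_size: int, num_tags: int,
                 embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _emit(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Per-position tag scores from the bi-LSTM ("features" for the CRF).
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.emissions(hidden)

    def loss(self, token_ids: torch.Tensor, tags: torch.Tensor) -> torch.Tensor:
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emit(token_ids), tags, reduction="mean")

    def predict(self, token_ids: torch.Tensor) -> list[list[int]]:
        # Viterbi decoding of the most likely tag sequence.
        return self.crf.decode(self._emit(token_ids))
```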
Implementation and training
All the models were implemented using the Torch machine learning library.
Models were trained on the Centre for High-Performance Computing's (CHPC) Lengau GPU cluster. Each model was also tuned to find optimal hyperparameters: the learning rate, weight decay, hidden state dimension, dropout, and gradient clipping value. The Ray Tune library was used to assist in this process.
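A minimal sketch of how such a search could be set up with Ray Tune's function API is shown below; the trainable, the search ranges, and the number of samples are illustrative placeholders rather than the project's actual configuration, and the exact Ray API differs slightly between versions.

```python
from ray import tune

def train_model(config):
    """Illustrative trainable: build and train a tagger with the sampled
    hyperparameters, then report its validation macro F1 to Ray Tune."""
    # model = BiLSTMTagger(..., hidden_dim=config["hidden_dim"], ...)
    # ... training loop using config["lr"], config["weight_decay"],
    #     config["dropout"] and config["grad_clip"] ...
    macro_f1 = 0.0  # placeholder for the validation score
    return {"macro_f1": macro_f1}

# Illustrative search space over the tuned hyperparameters.
search_space = {
    "lr": tune.loguniform(1e-4, 1e-2),
    "weight_decay": tune.loguniform(1e-6, 1e-3),
    "hidden_dim": tune.choice([128, 256, 512]),
    "dropout": tune.uniform(0.0, 0.5),
    "grad_clip": tune.choice([0.5, 1.0, 5.0]),
}

analysis = tune.run(train_model, config=search_space,
                    num_samples=20, metric="macro_f1", mode="max")
print(analysis.best_config)
```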
Pre-trained Language Models
Pre-trained language models (PLMs) are transformer-based models that have been trained on large corpora of text, allowing them to learn the structure of language. They are pre-trained with masked language modelling (MLM), a task where a word in the input is masked and the model is tasked with predicting it; this is how the models acquire their word contexts and embeddings. After pre-training, a model can be fine-tuned on data for a different task, achieving strong performance with reduced time and resource expenditure.
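As a small illustration of the MLM objective (not part of the project's pipeline), the Hugging Face fill-mask pipeline can ask a pre-trained XLM-RoBERTa checkpoint to recover a masked word; the sentence below is purely illustrative.

```python
from transformers import pipeline

# XLM-RoBERTa uses "<mask>" as its mask token; the model predicts the
# hidden word from its surrounding context.
unmasker = pipeline("fill-mask", model="xlm-roberta-base")
for prediction in unmasker("The capital of France is <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```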
Beginning with BERT, many models have been developed and fine-tuned for different languages and tasks, and they have proven effective across a wide range of NLP tasks, including in low-resource languages. In this exploration, we fine-tuned three models on the morphological parsing task for the Nguni languages: XLM-RoBERTa, Afro-XLMR, and Nguni-XLMR, chosen for their varying levels of inclusion of the Nguni languages.
Models
Training and Fine-tuning
All three models were adapted from the Hugging Face Transformers library in their largest variants to ensure the best performance. Since they are all derivatives of XLM-RoBERTa, the SentencePiece tokenizer was used for tokenization and alignment of the input. Due to the size of the models and the storage and time limitations of the CHPC cluster, the fine-tuning hyperparameters were limited to the number of training epochs, the batch size, and the initial learning rate. With three choices for each hyperparameter, four languages in the project, and three models to consider, 108 different configurations had to be trained and evaluated on a validation set to find the best setting for each model-language pair. The selection was kept small to ensure that training and testing fit within the limited 12-hour cluster slots. (See the grid search results for the full breakdown.)
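The sketch below shows the standard Transformers token-classification recipe that such fine-tuning typically follows; the checkpoint name, tag-set size, and the convention of labelling only the first subword piece of each morpheme are assumptions for illustration, not necessarily the project's exact training code.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "xlm-roberta-large"   # or an Afro-XLMR / Nguni-XLMR checkpoint
num_tags = 50                      # placeholder tag-set size

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name,
                                                        num_labels=num_tags)

def encode(morphemes: list[str], tag_ids: list[int]) -> dict:
    """Tokenize pre-segmented morphemes and align one label per morpheme.

    SentencePiece may split a single morpheme into several subword pieces;
    only the first piece keeps the morpheme's tag, the rest receive -100 so
    the loss function ignores them.
    """
    enc = tokenizer(morphemes, is_split_into_words=True, truncation=True)
    labels, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None:
            labels.append(-100)              # special tokens (<s>, </s>)
        elif word_id != previous:
            labels.append(tag_ids[word_id])  # first piece of this morpheme
        else:
            labels.append(-100)              # continuation pieces
        previous = word_id
    enc["labels"] = labels
    return enc

example = encode(["za", "u", "bu", "bomi"], [3, 7, 12, 25])
# The encoded examples can then be fine-tuned with the usual Trainer /
# TrainingArguments loop, sweeping epochs, batch size, and learning rate.
```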
Outcomes
Results
The macro and micro F1 scores were chosen to evaluate the quality of the models. The macro F1 score was the main metric, though, as it weights every tag class equally and is therefore harder for a model to optimise by focusing only on frequent tags.
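Both scores can be computed with scikit-learn, as in the toy sketch below (the gold and predicted tags are illustrative): macro averaging treats every tag class equally, while micro averaging is dominated by the frequent tags.

```python
from sklearn.metrics import f1_score

# Toy gold and predicted tag sequences, flattened across all morphemes.
gold = ["PossConc14", "NPrePre14", "BPre14", "NStem"]
pred = ["PossConc14", "NPrePre14", "NStem",  "NStem"]

print(f1_score(gold, pred, average="micro"))  # dominated by frequent tags
print(f1_score(gold, pred, average="macro"))  # every tag class weighted equally
```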
The results of the exploration were split based on the data on which the models were trained. The first exploration used expert-annotated canonical segmentations obtained from the dataset. For the other result sets, we used canonical and surface segmentations predicted by the morphological segmenters from MORPH-SEGMENT. For the models trained from scratch, the models are further split into word-level and sentence-level variants.
Click to view the full tables of results:
Discussion
Models Trained from Scratch
Both bi-LSTMs and CRFs performed well. The CRF layer did not improve significantly over the bi-LSTM used to generate its features. This could be because the CRF is a simple linear chain; higher-order CRFs, which model dependence across more than just neighbouring labels, might improve on this.
Sentence-level was better than word-level. This makes intuitive sense as the added context allows for easier disambiguation.
Morpheme-level models outperformed character-level models. This could be because morphemes are a more effective representation, or because morpheme embeddings are more sensitive to small changes in the morpheme. Since each morpheme is mapped to its own learnt embedding, even a single differing character can yield an entirely different embedding. This lets the model identify rare classes more easily.
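The sketch below illustrates this property with a toy morpheme vocabulary (the vocabulary and embedding size are made up): two morphemes that differ by a single character occupy separate rows of the embedding table, so their vectors are learnt independently.

```python
import torch
import torch.nn as nn

# Toy morpheme vocabulary: "buza" and "buzi" differ by one character but are
# assigned separate rows of the embedding table, so their vectors are
# unrelated (each row is learnt independently).
vocab = {"buza": 0, "buzi": 1, "ndi": 2}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

vec_buza = embed(torch.tensor(vocab["buza"]))
vec_buzi = embed(torch.tensor(vocab["buzi"]))
print(torch.allclose(vec_buza, vec_buzi))  # False: entirely different vectors
```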
Pre-trained Language Models
Effect of Nguni-specific transfer learning. We expected the Nguni-XLMR model to outperform the other PLMs due to its specialised pre-training on Nguni-language tasks; however, this was not the case. The model performed well but was not the best on any of the tasks. This could be because it was pre-trained on a narrower linguistic scope than the other PLMs and therefore did not generalise as well. It was also not able to leverage the similarity of the Nguni languages as effectively as expected.
Effect of subword tokenization. Since the models were adapted from XLM-RoBERTa, they all used a SentencePiece tokenizer, which is optimised for subword tokenization. This is suboptimal for our task, since the inputs are already subword units: by further subword-tokenizing individual morphemes, we reduce the models' ability to learn the morphological structure.
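This can be inspected directly, as in the sketch below, which counts how many SentencePiece pieces each already-segmented morpheme is broken into (the morphemes are taken from the examples earlier on this page; the exact splits depend on the tokenizer and checkpoint).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Morphemes are already subword units; the SentencePiece tokenizer may still
# split them further into pieces that carry no morphological meaning.
for morpheme in ["za", "u", "bu", "bomi", "andikambuzi"]:
    pieces = tokenizer.tokenize(morpheme)
    print(f"{morpheme!r} -> {pieces} ({len(pieces)} piece(s))")
```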
Lexical analysis of datasets. Analysis of the datasets revealed that isiXhosa had an advantage over the other languages: it had fewer unique morphemes, so the models were able to learn its morphological structure more effectively. It achieved a significantly higher macro F1 score (+9%) than the other languages.
Conclusions
Our research questions can be answered as follows:
- Can neural approaches outperform traditional approaches to morphological parsing for the Nguni languages? Yes. Our deep-learning approaches (MorphParse) outperformed the rule-based baseline (ZulMorph).
- Do models trained from scratch or fine-tuned pre-trained language models perform better? Models trained from scratch. The gap was not large, though, and could be due to issues such as the suitability of the PLMs' tokenisers. Both approaches performed well overall and outperformed the baseline by similar margins.
- Do models classifying surface segmentations or models classifying canonical segmentations perform better? Canonical segmentations. The models performed significantly better on canonical segmentations than on surface segmentations. This could be because canonical segmentations provide the model with more linguistic information and tend to segment the text into more morphemes, which advantages our sequence-tagging models.
Our project's contributions are as follows:
- We demonstrated the viability of neural methods for morphological parsing of the Nguni languages: both PLMs and models trained from scratch performed well at the task.
- We developed new state-of-the-art morphological taggers for the Nguni languages, which outperform the previous baseline. They are available on our GitHub page, with each subsection on its own branch.