Abstract
Morphemes are the smallest units of meaning in a language and play an important role in both semantics and grammatical structure. This is especially true in agglutinative languages, such as the Nguni languages, which construct words by concatenating many morphemes. Many Natural Language Processing (NLP) tasks, such as machine translation, can be improved by incorporating morphological information. However, methods for extracting this information for the Nguni languages still need to be refined.
This paper evaluates the use of neural methods for morphological parsing, that is, the labelling of morphemes with their corresponding grammatical roles. We compare two main approaches: fine-tuning pre-trained language models, and training models from scratch.
We compare the performance of these models with each other as well as with traditional methods of solving the task, such as Finite State Transducer (FST) models.
We found that models trained from scratch outperformed both fine-tuned pre-trained language models and the traditional FST model. Models using morpheme-level embeddings and sentence-level context tended to perform the best.
Introduction
Morphemes are the basic building blocks of meaning (semantics) in a language. By understanding the grammatical role that morphemes play in a sentence, we can solve downstream Natural Language Processing (NLP) tasks better. These tasks range from information retrieval to machine translation.
Splitting up text into its morphemes is known as morphological segmentation. There are two kinds of morphological segmentation: canonical and surface segmentation. Canonical segmentation splits the text into the full, underlying linguistic morphemes in their canonical forms. Surface segmentation splits the text into its morphs, which are the morphemes as they appear in the word after undergoing any sound or spelling changes that may occur.
Morphological parsing is the task of identifying the grammatical role of each morpheme within a word. For example, "zobomi" (meaning "of life" in isiXhosa) is parsed as "za[PossConc14] - u[NPrePre14] - (bu)[BPre14] - bomi[NStem]". The goal of morphological parsing is to predict these morpheme tags for arbitrary text. In our project, we focus on the tagging step and assume the text has already been segmented by another algorithm.
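Concretely, the task can be framed as sequence labelling over pre-segmented morphemes. The sketch below (plain Python with illustrative names) shows the expected input and output for the canonical morphemes of the "zobomi" example above.

```python
# Morphological parsing as sequence labelling: the input is a word that has
# already been segmented into morphemes, and the output is one grammatical
# tag per morpheme (from the isiXhosa parse of "zobomi" above).
morphemes = ["za", "u", "bu", "bomi"]
tags      = ["PossConc14", "NPrePre14", "BPre14", "NStem"]

# A parser is any function mapping a morpheme sequence to a tag sequence,
# e.g. the bi-LSTM or CRF taggers described later on this page.
def parse(morphemes: list[str]) -> list[str]:
    ...

assert len(morphemes) == len(tags)  # one tag per morpheme
```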
The Nguni languages are a group of widely-spoken South African languages, which include isiXhosa, isiZulu, isiNdebele, and siSwati. These languages are low-resourced, meaning that there are few tools and corpora (collections of documents) available for them.
Morphological information is especially important for NLP in Nguni languages for two main reasons:
- Nguni languages are agglutinative, meaning many words are created by combining multiple morphemes.
- Nguni languages are written conjunctively, meaning that morphemes are concatenated into a single word. For example, in isiXhosa, "andikambuzi" means "I haven't yet asked him", and is composed of the morphemes "a", "ndi", "ka", "m", "buza", and "i".
Few morphological parsers exist for the Nguni languages. One example of a morphological parser is the rule-based ZulMorph parser for isiZulu. Rule-based parsers require linguists to manually incorporate stems, affixes, and grammar rules into the software. This is a tedious process which requires a high degree of expertise.
By comparison, machine-learning approaches are data driven. Instead of manually incorporating information into the algorithm, the parser can be automatically generated from linguistically-annotated data. This means that the process is language agnostic and can leverage previously-existing datasets.
In this project, we investigated the use of neural methods for morphological parsing of Nguni languages. We took two main approaches to this:
- Simbarashe Mawere examined the fine-tuning of pre-trained language models. Three different pre-trained language models (PLMs), with varying levels of inclusion of the Nguni languages, were evaluated: XLM-RoBERTa, Afro-XLMR, and Nguni-XLMR.
- Cael Marquard examined training models from scratch. Two kinds of models were evaluated: bidirectional Long Short-Term Memory (bi-LSTM) models and neural Conditional Random Fields (CRFs).
The three research questions that we aimed to answer were:
- Can neural approaches outperform traditional approaches to morphological parsing for the Nguni languages?
- Do models trained from scratch or fine-tuned pre-trained language models perform better?
- Do models classifying surface segmentations or models classifying canonical segmentations perform better?
In order to answer these questions, we compared the tagging quality of these models to each other and to a traditional, rule-based approach (ZulMorph) as a baseline. Comparisons were also made across segmentation types.
Models Trained From Scratch
Models
Two architectures were chosen: Bidirectional Long Short-Term Memory (Bi-LSTM) and neural Conditional Random Fields (CRFs). These architectures have both been successfully applied to the closely-related tasks of morphological segmentation and part-of-speech tagging for the Nguni languages.
Bi-LSTMs
Bi-LSTM models are a type of recurrent neural network (RNN), a class of neural networks able to "remember" past inputs when computing future outputs. LSTMs are an RNN architecture that avoids issues such as vanishing and exploding gradients, and they are popular for NLP tasks. Bidirectional LSTMs combine two separate LSTMs, one reading the input forward and one in reverse. This allows bi-LSTMs to take both the past and the future of the sequence they are classifying into account, giving them better context.
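A minimal PyTorch sketch of such a tagger is shown below; the embedding size, hidden size, and vocabulary and tag-set sizes are illustrative placeholders rather than the values used in the project.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Minimal bi-LSTM sequence tagger: embed each morpheme (or character),
    run a bidirectional LSTM over the sequence, and score each tag."""

    def __init__(self, vocab_size: int, num_tags: int,
                 embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True concatenates the forward and backward states,
        # so the output dimension is 2 * hidden_dim.
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> scores: (batch, seq_len, num_tags)
        embedded = self.embed(token_ids)
        hidden, _ = self.lstm(embedded)
        return self.out(hidden)

# Example: score a batch containing one 4-morpheme sequence.
tagger = BiLSTMTagger(vocab_size=1000, num_tags=50)
scores = tagger(torch.randint(0, 1000, (1, 4)))
predicted_tags = scores.argmax(dim=-1)  # (1, 4) tag indices
```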
CRFs
A CRF is a probabilistic model which explicitly models the statistical dependence of the output (label) sequence on the input sequence, as well as the dependence of the output sequence on itself. This allows it to explicitly model the grammar of the language. CRFs have been used for morphological segmentation as well as part-of-speech tagging in the Nguni languages. For this project, the CRF only models the dependence of each label on its neighbouring labels and the corresponding input item. This linear-chain approach is simpler to implement and more computationally efficient to train.
CRFs usually rely on a set of hand-crafted features in order to assign probabilities. However, an alternative to this is to use a neural network to generate these features. In this project, a bi-LSTM is used to generate these features.
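A sketch of this bi-LSTM-CRF combination is shown below. It uses the third-party pytorch-crf package for the linear-chain CRF layer, which is an assumption made for illustration; the project's actual implementation may differ.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party pytorch-crf package (assumed here)

class BiLSTMCRFTagger(nn.Module):
    """Bi-LSTM emission scores feeding a linear-chain CRF layer, which
    models the dependence of each tag on its neighbouring tags."""

    def __init__(self, vocab_size: int, num_tags: int,
                 embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _emit(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Per-position tag scores from the bi-LSTM ("features" for the CRF).
        hidden, _ = self.lstm(self.embed(token_ids))
        return self.emissions(hidden)

    def loss(self, token_ids: torch.Tensor, tags: torch.Tensor) -> torch.Tensor:
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emit(token_ids), tags, reduction="mean")

    def predict(self, token_ids: torch.Tensor) -> list[list[int]]:
        # Viterbi decoding of the most likely tag sequence.
        return self.crf.decode(self._emit(token_ids))
```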
Implementation and training
All the models were implemented using the Torch machine learning library.
Models were trained on the Centre for High-Performance Computing's (CHPC) Lengau GPU cluster. Each model was also tuned to find optimal hyperparameters: the learning rate, weight decay, hidden state dimension, dropout, and gradient clipping value. The Ray Tune library was used to assist in this process.
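A minimal sketch of how such a search could be set up with Ray Tune's function API is shown below; the trainable, the search ranges, and the number of samples are illustrative placeholders rather than the project's actual configuration, and the exact Ray API differs slightly between versions.

```python
from ray import tune

def train_model(config):
    """Illustrative trainable: build and train a tagger with the sampled
    hyperparameters, then report its validation macro F1 to Ray Tune."""
    # model = BiLSTMTagger(..., hidden_dim=config["hidden_dim"], ...)
    # ... training loop using config["lr"], config["weight_decay"],
    #     config["dropout"] and config["grad_clip"] ...
    macro_f1 = 0.0  # placeholder for the validation score
    return {"macro_f1": macro_f1}

# Illustrative search space over the tuned hyperparameters.
search_space = {
    "lr": tune.loguniform(1e-4, 1e-2),
    "weight_decay": tune.loguniform(1e-6, 1e-3),
    "hidden_dim": tune.choice([128, 256, 512]),
    "dropout": tune.uniform(0.0, 0.5),
    "grad_clip": tune.choice([0.5, 1.0, 5.0]),
}

analysis = tune.run(train_model, config=search_space,
                    num_samples=20, metric="macro_f1", mode="max")
print(analysis.best_config)
```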
Pre-trained Language Models
Pre-trained language models (PLMs) are transformer-based models that have been trained on large corpora of text, allowing them to learn the structure of language. They are pre-trained with masked language modelling (MLM), a task where a word in the input is masked and the model is tasked with predicting it; this is how the models acquire their word contexts and embeddings. After pre-training, a model can be fine-tuned on data for a different task, achieving strong performance with reduced time and resource expenditure.
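As a small illustration of the MLM objective (not part of the project's pipeline), the Hugging Face fill-mask pipeline can ask a pre-trained XLM-RoBERTa checkpoint to recover a masked word; the sentence below is purely illustrative.

```python
from transformers import pipeline

# XLM-RoBERTa uses "<mask>" as its mask token; the model predicts the
# hidden word from its surrounding context.
unmasker = pipeline("fill-mask", model="xlm-roberta-base")
for prediction in unmasker("The capital of France is <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```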
Beginning with BERT, many models have been developed and fine-tuned for different languages and tasks, and they have proven effective across a wide range of NLP tasks, including in low-resource languages. In this exploration, we fine-tuned three models on the morphological parsing task for the Nguni languages: XLM-RoBERTa, Afro-XLMR, and Nguni-XLMR, chosen for their varying levels of inclusion of the Nguni languages.
Models
Training and Fine-tuning
All three models were adapted from the Hugging Face Transformers library in their largest variants to ensure the best performance. Since they are all derivatives of XLM-RoBERTa, the SentencePiece tokenizer was used for tokenization and alignment of the input. Due to the size of the models and the storage and time limitations of the CHPC cluster, the fine-tuning hyperparameters were limited to the number of training epochs, the batch size, and the initial learning rate. With three choices for each hyperparameter, four languages in the project, and three models to consider, 108 different configurations had to be trained and evaluated on a validation set to find the best setting for each model-language pair. The selection was kept small to ensure that training and testing fit within the limited 12-hour cluster slots. (See the grid search results for the full breakdown.)
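The sketch below shows the standard Transformers token-classification recipe that such fine-tuning typically follows; the checkpoint name, tag-set size, and the convention of labelling only the first subword piece of each morpheme are assumptions for illustration, not necessarily the project's exact training code.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "xlm-roberta-large"   # or an Afro-XLMR / Nguni-XLMR checkpoint
num_tags = 50                      # placeholder tag-set size

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name,
                                                        num_labels=num_tags)

def encode(morphemes: list[str], tag_ids: list[int]) -> dict:
    """Tokenize pre-segmented morphemes and align one label per morpheme.

    SentencePiece may split a single morpheme into several subword pieces;
    only the first piece keeps the morpheme's tag, the rest receive -100 so
    the loss function ignores them.
    """
    enc = tokenizer(morphemes, is_split_into_words=True, truncation=True)
    labels, previous = [], None
    for word_id in enc.word_ids():
        if word_id is None:
            labels.append(-100)              # special tokens (<s>, </s>)
        elif word_id != previous:
            labels.append(tag_ids[word_id])  # first piece of this morpheme
        else:
            labels.append(-100)              # continuation pieces
        previous = word_id
    enc["labels"] = labels
    return enc

example = encode(["za", "u", "bu", "bomi"], [3, 7, 12, 25])
# The encoded examples can then be fine-tuned with the usual Trainer /
# TrainingArguments loop, sweeping epochs, batch size, and learning rate.
```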
Outcomes
Results
The macro and micro F1 scores were chosen to evaluate the quality of the models. The macro F1 score was the main metric, though, as it weights every tag class equally and is therefore harder for a model to optimise by focusing only on frequent tags.
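Both scores can be computed with scikit-learn, as in the toy sketch below (the gold and predicted tags are illustrative): macro averaging treats every tag class equally, while micro averaging is dominated by the frequent tags.

```python
from sklearn.metrics import f1_score

# Toy gold and predicted tag sequences, flattened across all morphemes.
gold = ["PossConc14", "NPrePre14", "BPre14", "NStem"]
pred = ["PossConc14", "NPrePre14", "NStem",  "NStem"]

print(f1_score(gold, pred, average="micro"))  # dominated by frequent tags
print(f1_score(gold, pred, average="macro"))  # every tag class weighted equally
```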
The results of the exploration were split based on the data on which the models were trained. The first exploration used expert-annotated canonical segmentations obtained from the dataset. For the other result sets, we used canonical and surface segmentations predicted by the morphological segmenters from MORPH-SEGMENT. For the models trained from scratch, the models are further split into word-level and sentence-level variants.
Click to view the full tables of results:
Discussion
Models Trained from Scratch
Both bi-LSTMs and CRFs performed well. The CRF layer did not improve significantly over the bi-LSTM used to generate its features. This could be because the CRF is a simple linear chain; higher-order CRFs, which model dependence across more than just neighbouring labels, might improve on this.
Sentence-level was better than word-level. This makes intuitive sense as the added context allows for easier disambiguation.
Morpheme-level models outperformed character-level models. This could be because morphemes are a more effective representation, or because morpheme embeddings are more sensitive to small changes in the morpheme. Since each morpheme is mapped to its own learnt embedding, even a single differing character can yield an entirely different embedding. This lets the model identify rare classes more easily.
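The sketch below illustrates this property with a toy morpheme vocabulary (the vocabulary and embedding size are made up): two morphemes that differ by a single character occupy separate rows of the embedding table, so their vectors are learnt independently.

```python
import torch
import torch.nn as nn

# Toy morpheme vocabulary: "buza" and "buzi" differ by one character but are
# assigned separate rows of the embedding table, so their vectors are
# unrelated (each row is learnt independently).
vocab = {"buza": 0, "buzi": 1, "ndi": 2}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

vec_buza = embed(torch.tensor(vocab["buza"]))
vec_buzi = embed(torch.tensor(vocab["buzi"]))
print(torch.allclose(vec_buza, vec_buzi))  # False: entirely different vectors
```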
Pre-trained Language Models
Effect of Nguni-specific transfer learning. We expected the Nguni-XLMR model to outperform the other PLMs due to its specialised pre-training on Nguni-language tasks; however, this was not the case. The model performed well but was not the best on any of the tasks. This could be because it was pre-trained on a narrower linguistic scope than the other PLMs and therefore did not generalise as well. It was also not able to leverage the similarity of the Nguni languages as effectively as expected.
Effect of subword tokenization. Since the models were adapted from XLM-RoBERTa, they all used a SentencePiece tokenizer, which is optimised for subword tokenization. This is suboptimal for our task, since the inputs are already subword units: by further subword-tokenizing individual morphemes, we reduce the models' ability to learn the morphological structure.
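This can be inspected directly, as in the sketch below, which counts how many SentencePiece pieces each already-segmented morpheme is broken into (the morphemes are taken from the examples earlier on this page; the exact splits depend on the tokenizer and checkpoint).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Morphemes are already subword units; the SentencePiece tokenizer may still
# split them further into pieces that carry no morphological meaning.
for morpheme in ["za", "u", "bu", "bomi", "andikambuzi"]:
    pieces = tokenizer.tokenize(morpheme)
    print(f"{morpheme!r} -> {pieces} ({len(pieces)} piece(s))")
```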
Lexical analysis of datasets. Analysis of the datasets revealed that isiXhosa had an advantage over the other languages: it had fewer unique morphemes, so the models were able to learn its morphological structure more effectively. It achieved a significantly higher macro F1 score (+9%) than the other languages.
Conclusions
Our research questions can be answered as follows:
- Can neural approaches outperform traditional approaches to morphological parsing for the Nguni languages? Yes. Our deep-learning approaches (MorphParse) outperformed the rule-based baseline (ZulMorph).
- Do models trained from scratch or fine-tuned pre-trained language models perform better? Models trained from scratch. The gap was not large, though, and could be due to issues such as the suitability of the PLMs' tokenisers. Both approaches performed well overall and outperformed the baseline by similar margins.
- Do models classifying surface segmentations or models classifying canonical segmentations perform better? Canonical segmentations. The models performed significantly better on canonical segmentations than on surface segmentations. This could be because canonical segmentations provide the model with more linguistic information and tend to segment the text into more morphemes, which advantages our sequence-tagging models.
Our project's contributions are as follows:
- We demonstrated the viability of neural methods for morphological parsing of the Nguni languages: both PLMs and models trained from scratch performed well at the task.
- We developed new state-of-the-art morphological taggers for the Nguni languages, which outperform the previous baseline. They are available on our GitHub page, with each subsection on its own branch.