Parasynthetic Morpholexical Relationships Of The Spanish: Lexical Search Beyond The Lexicographical Regularity
Octavio Santana Suárez, Departamento de Informática y Sistemas, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, 35017, Spain, osantana@dis.ulpgc.es
Francisco J. Carreras Riudavets, Departamento de Informática y Sistemas, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, 35017, Spain, fcarreras@dis.ulpgc.es
Jose R. Pérez Aguiar, Departamento de Informática y Sistemas, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, 35017, Spain, jperez@dis.ulpgc.es
Juan Carlos Rodríguez del Pino, Departamento de Informática y Sistemas, Universidad de Las Palmas de Gran Canaria, Las Palmas de Gran Canaria, 35017, Spain, jcrodriguez@dis.ulpgc.es
ABSTRACT
This work addresses parasynthesis in the Spanish language, a word-formation process that is useful for establishing morpholexical relationships. From a lexicon of over 4 million different words, around 6 million parasynthetic morpholexical relationships are established. All the irregularities and exceptions found in the reference lexicon have been considered, and they are numerous in a highly inflected language. These relationships are useful because they allow, among other possibilities, semantic searches, the suggestion of alternative sentences in style correction or summarization, and the finding of semantically synonymous sentences. The main function of this application is to allow lexical searches beyond lexicographical regularity. The web tool developed is capable of resolving any morpholexical aspect of a Spanish word; it includes suffixation and prefixation processing and also shows the graph of morpholexical relationships. This tool is only one way to show the potential of the system, which can be incorporated into other tools of a high linguistic level.
KEYWORDS
Morpholexical Relationships, Computational Linguistics, Information Retrieval.
1. INTRODUCTION
The problem faced by research on the recognition of lexical morphology, as an essential and autonomous component of the grammar, is to account for the derivative properties of the lexicon through the relationships established between its constituent morphemes. The existence of many morphemes (bases, roots and affixes) and the very large number of allowed combinations make this study difficult. However, words share common patterns in their morphological behaviour, and these patterns are the subject of the present work. Another controversial question in research on morphology is the definition of a synchronic study: words are fully tied to their history, in both the morphological and the semantic fields, and that history generally determines their current lexicography and semantics. Therefore, it is necessary to take into account the etymological references of words, which complicates the tagging. Some words, already archaic or belonging to source languages such as Latin or Greek, carry relevant information about their historical morphological process. This information increases the number of connections between the different formation types of the current lexicon in a generational sequence of words. In this work, a wide lexicon of Spanish words has been considered, without extending to other languages, to minimize the absence of connectors between words that are morpholexically related by formative processes. This wide lexicon makes a synchronic study possible without neglecting some necessary information about archaic etymological processes.
The resulting system is characterized by simple and flexible algorithms with 100% success in the recognition and generation of the morpholexical relationships of the language, independently of the complexity of the data they handle. Similar works have been developed for other highly inflectional languages such as Greek (Socrates, 2005) and French (Polguère, 2000); the latter emphasizes the usefulness of these relations for natural language processing applications.
2. LEXICON
The lexicon handled in this work has been created from: the Diccionario de
A canonical form is defined as any word with its own identity that can undergo derivational processes to form other words. Such a word may itself have been formed from another by similar processes. In the reference lexicon, a canonical form is any entry word of the consulted sources that has its own meaning; entries that are appreciative forms of others and add no substantial variation in meaning are discarded.
The universe of words analyzed in this work consists of 148,798 canonical forms that provide about 4 million different inflected words, among which 6 million parasynthetic morpholexical relationships have been established, of which 3,800 hold between canonical forms. A morpholexical relationship between canonical forms is projected onto all the words generated from each of the canonical forms through any of the inflectional or appreciative processes of Spanish: gender, number, augmentative, diminutive, pejorative, superlative, conjugation... (Santana et al., 1997; Santana et al., 1999).
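The projection of a canonical relationship onto inflected words can be sketched as follows. This is only one possible reading of the projection, not the authors' implementation: the `INFLECTIONS` table is a tiny hand-written stand-in for the output of a morphological processor, and the function name `project` is hypothetical.

```python
from itertools import product

# Toy stand-in for a morphological processor: canonical form -> some inflected forms.
# (Assumption: real inflection sets are far larger and are generated automatically.)
INFLECTIONS = {
    "grupo": ["grupo", "grupos"],
    "agrupar": ["agrupar", "agrupo", "agrupas", "agrupamos"],
}

def project(original, derived, inflections):
    """Project one canonical relationship onto every pair of inflected words."""
    return list(product(inflections[original], inflections[derived]))

pairs = project("grupo", "agrupar", INFLECTIONS)
print(len(pairs))  # 2 forms of grupo x 4 forms of agrupar = 8 pairs
```

Under this reading, a few thousand canonical relationships multiply quickly once each side is expanded into its inflected forms.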
3. AIM
This work aims to obtain a set of parasynthetic morpholexical relationships among Spanish words that is useful for automatic natural language processing applications, such as search engines, spelling correctors, style analyzers, automatic text generators, automated ideological dictionaries, etc. It would be of great use to those who deal with documents in Spanish, such as lexicographers, style analysts, text information retrievers, translators and many others.
For example, the verbs parasynthetically related to loco are alocar, enloquecer and aloquecer. This notably narrows the response of a mask search such as “*loc?r”, which returns aclocar, allocar, alocar, bilocar, blocar, clocar, colocar, descolocar, desflocar, dislocar, enclocar, enllocar, locar, recolocar. Moreover, the mask search misses the verbs that have undergone spelling changes as a consequence of phonetic or other adjustments: enloquecer and aloquecer.
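The contrast between a graphical mask search and a relationship lookup can be illustrated with a short sketch. The lexicon below contains only the verbs quoted in the text, and the `PARASYNTHETIC` table is a hypothetical stand-in for the relationship base; the mask syntax is that of Python's `fnmatch` module, which happens to use the same `*` and `?` wildcards.

```python
from fnmatch import fnmatchcase

# Toy lexicon: only the verbs quoted in the text (the real lexicon is far larger).
LEXICON = [
    "aclocar", "allocar", "alocar", "bilocar", "blocar", "clocar", "colocar",
    "descolocar", "desflocar", "dislocar", "enclocar", "enllocar", "locar",
    "recolocar", "enloquecer", "aloquecer",
]

# Hypothetical relationship table: original word -> parasynthetically related verbs.
PARASYNTHETIC = {"loco": ["alocar", "enloquecer", "aloquecer"]}

def mask_search(mask, lexicon):
    """Purely graphical search: '*' matches any run of letters, '?' exactly one."""
    return [word for word in lexicon if fnmatchcase(word, mask)]

graphical = mask_search("*loc?r", LEXICON)
related = PARASYNTHETIC["loco"]

print(len(graphical))                             # words matched by the mask
print(sorted(set(related) - set(graphical)))      # related verbs the mask misses
```

The mask returns many graphically similar verbs, most of them unrelated to loco, and still misses the related verbs whose spelling changed.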
Morpholexical relationships allow lexical searches that notably improve on search mechanisms based on the graphic word: exact searches and searches with wildcards, masks or truncations. On the one hand, they avoid introducing words whose relationship with the search pattern is merely graphical; on the other hand, they bridge the gaps produced by the strong irregularities of inflected languages such as Spanish. In this way, for any word it is possible to determine the lemma from which it comes, the relationships it holds with other lemmas and the words to which these lemmas give rise. All this is applied in the context of interest, for example a paragraph, a text, a web address, an electronic library or a textual corpus.
This search system offers all the flexibility that could be needed: from an exact search (casero), to including changes of gender and/or number (casero, casera, caseros, caseras), to including the appreciative and/or superlative forms with or without changes in gender and number (caserote, caserito, caserejo, caserísimo...), to including morpholexically related canonical forms at the first descending suffixal level (caserillo, caseramente), at the prefixal or parasynthetic level (acaserar), at the ascending level (casa) and at other levels (acaserado, acaseramiento, casariego, caserío, caserón, caseta, casona...), up to considering all the inflections of every canonical form (in this example, there would be 542 different words with some kind of relationship with the considered entry).
On the other hand, a search whose emphasis is semantic rather than morphological can be carried out from the meanings that the attached affixes provide. For example, to look for adjectives derived from verbs that mean 'unable to carry out the action of the verb', using parasynthetic relations, it would be enough to select by the affixes in‑...-ble: incansable, inconmovible, inconvencible, infatigable, irreprimible...
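A selection by affixes could look like the sketch below. It assumes that each relationship is stored with canonical affix labels, so that spelling variants of a prefix (as in irreprimible) are found under the same label; the `Relation` record, the labels and the base verbs are illustrative assumptions, not the authors' data model.

```python
from typing import NamedTuple

class Relation(NamedTuple):
    original: str   # base word
    derived: str    # related word
    prefix: str     # canonical label of the prefix, independent of its spelling
    suffix: str     # canonical label of the suffix

# Hypothetical records; affix labels and base verbs are illustrative assumptions.
RELATIONS = [
    Relation("cansar", "incansable", "in-", "-ble"),
    Relation("reprimir", "irreprimible", "in-", "-ble"),
    Relation("grupo", "agrupar", "a-", "-ar"),
    Relation("cajón", "encajonar", "en-", "-ar"),
]

def by_affixes(relations, prefix, suffix):
    """Return the derived words whose relationship carries both affix labels."""
    return [r.derived for r in relations if r.prefix == prefix and r.suffix == suffix]

hits = by_affixes(RELATIONS, "in-", "-ble")
print(hits)
```

A graphical mask such as “in*ble” would miss irreprimible, whereas the label-based selection retrieves it.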
A parasynthetic morpholexical relationship exists between two words when one of them has been formed from the other by the simultaneous addition of an affix on the left and another on the right (usually a prefix and a suffix), and the grammatical categories and the semantics are also appropriate. For example, agrupar has a semantic and functional relationship with grupo.
In a synchronic study, with automated analysis of the morphology in mind, formal or theoretical aspects may not coincide with strictly linguistic ones. There are Spanish words that have a strong functional and semantic relationship resembling parasynthesis without strictly being so. However, a formal relationship exists through other stages in the evolution of the languages, so it is considered necessary to include them: cabeza with acapiza, pez with empegar, edad with coetáneo. This concept must be restricted to avoid reaching the concept of related idea, which exceeds the objectives of this work (nuevo with remozar, color with anaranjado, pelo with encabellecer). Therefore, a historical-etymological meeting-point criterion is applied. It is obvious that for the speaker aovado, arrocado and amelonado are forms equally related to huevo, rueca and melón respectively, and it must also be so for automatic data processing. In order to overcome the linguistic boundaries that prevent treating relationships beyond strict parasynthesis, it is necessary to work at a level different from the strictly morphological one. Thus, the concept of morpholexical relationship is extended to improve the quality of language processing in this respect.
4. PARASYNTHETIC RELATIONSHIPS
A word may have undergone, with respect to another one, both suffixal and prefixal alterations (parasynthesis). These alterations have been studied to establish the morpholexical relationships between two words from a synchronic point of view, applying the extended criteria explained in the previous section. When considering this type of relationship, the simultaneous application of both affixes to the original word is significant, as in the following examples. The words *afrancés / *francesar, *encajón / *cajonar, *amujer / *mujerado do not exist; therefore, the principle of simultaneity is conclusive.
francés --> a‑frances‑ar
cajón --> en‑cajon‑ar
mujer --> a‑mujer‑ado
Consequently, a morpholexical relationship is established between an original word and a word related to it by both a prefixal and a suffixal process when neither of these processes has separately produced an intermediate word. Words in which a prefixal alteration has produced a new word with its own identity, which has then undergone a suffixal alteration (or vice versa), are not considered to share this relationship, since the principle of simultaneity is lost. In this work the intermediate word is called the previous-word.
salir --> sobre‑salir --> sobresal‑iente
América --> americ‑ano --> anti‑americano
verbo --> verb‑al --> de‑verbal
mar --> mar‑ino --> sub‑marino
músculo --> muscul‑ar --> intra‑muscular
The lexicogenesis of some words (deverbal, submarino, intramuscular) raises doubts and has been interpreted by different authors either as parasynthetic (mainly by means of a semantic characterization) or as prefixal (mainly by means of a formal characterization based on the continuity of the formation). In the cases that admit this double analysis, the prefixal interpretation is preferred, since the following criterion is applied to establish the relationships: if there is no previous-word between two forms, a parasynthetic relationship is established; otherwise, a suffixal or prefixal relationship is established, as appropriate.
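The previous-word criterion can be sketched as a lookup in the lexicon. This is a minimal illustration under stated assumptions: the candidate intermediate forms are supplied by the caller rather than generated, the toy lexicon holds only the words needed here, the order in which the two candidates are checked is arbitrary, and the semantic suitability of a previous-word is not modelled. The function name `classify` is hypothetical.

```python
# Toy lexicon (assumption): only the words needed for the examples in the text.
LEXICON = {
    "francés", "afrancesar", "cajón", "encajonar",
    "salir", "sobresalir", "sobresaliente",
    "América", "americano", "antiamericano",
}

def classify(original, prefixed_candidate, suffixed_candidate, lexicon):
    """Classify a word that carries both a prefix and a suffix.

    prefixed_candidate: the form the original would take with the prefix only.
    suffixed_candidate: the form the original would take with the suffix only.
    Returns the type of relationship and the word it starts from.
    """
    if prefixed_candidate in lexicon:
        # A prefixed previous-word exists, so the last step is a suffixation.
        return ("suffixal", prefixed_candidate)
    if suffixed_candidate in lexicon:
        # A suffixed previous-word exists, so the last step is a prefixation.
        return ("prefixal", suffixed_candidate)
    # Neither affix produced an intermediate word on its own.
    return ("parasynthetic", original)

print(classify("francés", "afrancés", "francesar", LEXICON))
print(classify("América", "antiamérica", "americano", LEXICON))
print(classify("salir", "sobresalir", "saliente", LEXICON))
```

For afrancesar neither candidate exists, so the relationship is attached directly to francés; for antiamericano and sobresaliente the relationship is attached to the previous-word instead.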
Words that coincide lexicographically but do not have the suitable semantics are obviously excluded from the previous-word concept. A parasynthetic morpholexical relationship is established between agua and enaguar in spite of the existence of the word enagua, which must not be confused with a previous-word of enaguar because it has neither a semantic nor an etymological relationship with agua or with enaguar. The same happens with broma and abromar with respect to abroma.
Words that have a suffixal or prefixal relationship parallel to the parasynthetic one are also excluded. For example, since desgarbo has the same meaning as desgarbado (‘without grace’), it is preferred to relate the two words morpholexically directly to garbo (prefixal and parasynthetic respectively), and desgarbo is not considered a previous-word of desgarbado. The same happens with atóxico and atoxicar with respect to tóxico: the prefix a- in atóxico (‘not toxic’) has a meaning of negation, whereas in atoxicar (‘to poison with something toxic’) it does not.
Likewise, a relationship is considered parasynthetic when there could be a possible previous-word which, however, has not been consolidated in the lexicon studied, unless several words are in the same situation with respect to that possible previous-word. In that case, a non-existent word is kept to maintain the relationship among them, as in the example of hipotiroideo and hipotiroidismo with respect to tiroides.
Some words, irrespective of the formation process they have undergone, keep a close semantic and functional relationship, as explained in the previous section, and therefore they have to be considered in the applications expected to be developed. For instance, entenebrecer is closely related to tiniebla and there are no previous-words: *tenebrecer, *entenebro, *tineblecer, *entinieblar or *tenebro. In computer applications, it is essential to consider this type of relationship if the aim is to cover a homogeneous set of word families related with the same objective. Relationships of this type, which share a common point in the etymological history of the words involved and are analogous to parasynthetic relationships in the affixes used and the semantics they add, are treated as extended morpholexical relationships and classified as parasynthesis in this work. Some examples of parasynthetic formation analogous to the case of entenebrecer are endurecer from duro, embellecer from
5. PARASYNTHETIC FAMILIES
Once the extended parasynthetic morpholexical relationships between pairs of words have been established, we can consider groups of words that have the same kind of extended morpholexical relationship with respect to a common word, called the original word. They are words that belong to the same morpholexical field. Thus, all the words that are parasynthetically related to a given word are called its parasynthetic family (figure 1).
Figure 1. Parasynthetic family of plaza.
Since a word can be related to an original word and at the same time be the original word in relationships with other words, a family tie is established between different families through this word. All the families related in this way make up a clan.
5.1 Logical Structure
The richness of Spanish word formation over time, together with its irregularities and peculiarities, makes it difficult (though not impossible) to represent the extended morpholexical relationships between the elements of the lexicon in a non-diachronic way. To represent the different types of relationships deduced from the Spanish word formation rules and from the extended criteria applied, we have opted for a directed graph. The nodes identify Spanish words and the arcs show that there is an extended morpholexical relationship between them; the direction of each arc corresponds to the direction of the relationship between the nodes, and the labels of the arcs classify the type of extended morpholexical relationship. Spanish words are thus grouped into disjoint sets of interconnected elements: the connected components of the graph.
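The structure described above can be sketched with labelled arcs and a search for connected components. The arcs below are a tiny sample taken from examples in the text, the label names are illustrative, and `connected_components` is a hypothetical helper rather than the authors' code.

```python
from collections import defaultdict

# Labelled arcs: (original word, related word, type of relationship).
# Assumption: a tiny sample built from examples in the text.
ARCS = [
    ("grupo", "agrupar", "parasynthetic"),
    ("cómodo", "incómodo", "prefixal"),
    ("cómodo", "incomodar", "parasynthetic"),
    ("incómodo", "incomodar", "suffixal"),
]

def connected_components(arcs):
    """Group words into disjoint sets, ignoring the direction of the arcs."""
    neighbours = defaultdict(set)
    for source, target, _label in arcs:
        neighbours[source].add(target)
        neighbours[target].add(source)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components

components = connected_components(ARCS)
print(len(components))  # one component per group of interconnected words
```

Direction and labels stay on the arcs for navigation; only the grouping into components ignores direction.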
Occasionally, a node has more than one arc directed towards it, which breaks the possible tree structure and makes the representation by means of a directed graph necessary. For instance, the word incomodar is parasynthetic of cómodo and a verbalization of incómodo, which is in turn a prefixation of cómodo; there are thus two paths through the graph from cómodo to incomodar (figure 2).
Figure 2. Graph that represents the morpholexical relationships with incomodar.
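The two paths from cómodo to incomodar can be enumerated with a small depth-first search. The adjacency table and the function name `all_paths` are illustrative assumptions; the arc labels follow the description in the text.

```python
# Outgoing arcs of each word: (related word, type of relationship).
ARCS = {
    "cómodo": [("incómodo", "prefixal"), ("incomodar", "parasynthetic")],
    "incómodo": [("incomodar", "suffixal")],
}

def all_paths(arcs, start, goal, path=None):
    """Return every directed path from start to goal as a list of words."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found = []
    for target, _label in arcs.get(start, []):
        if target not in path:  # guard against cycles
            found.extend(all_paths(arcs, target, goal, path))
    return found

paths = all_paths(ARCS, "cómodo", "incomodar")
for p in paths:
    print(" --> ".join(p))
```

A tree would allow only one of these paths, which is why a general directed graph is needed.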
Occasionally, two groups of extended morpholexical relationships can lose their connection in a solely synchronic analysis because the link node between them does not exist. This happens when extended morpholexical relationships have been established with other words from an original word that does not exist in present-day Spanish. A word is considered non-existent when it is not mentioned in the reference sources because its use has been lost over time, because it has been replaced by another word with a different morphology, or simply because it has not been consolidated in the Spanish language. In order not to lose these relationships without incorporating new elements (archaisms or neologisms not considered by the reference sources), a link node, labelled as non-existent, is established between the two groups of related nodes. Its label is not visible to the final user of the application, although it keeps the conceptual interconnection within a connected component. In this way, we can establish morpholexical relationships between nihilidad, nihilismo, nihilista, aniquilar and anihilar through the non-existent word *nihil (the Latin word that means ‘nothing’), figure 3.
Figure 3. Morpholexical relationships through a non-existent word.
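The role of a hidden link node can be sketched as follows. The arcs are an assumption (the figure may connect these words differently), and so are the `*` marker for non-existent nodes and the function name `related_words`.

```python
# Assumption: every word is linked directly to the non-existent node *nihil.
ARCS = [
    ("*nihil", "nihilidad"), ("*nihil", "nihilismo"), ("*nihil", "nihilista"),
    ("*nihil", "aniquilar"), ("*nihil", "anihilar"),
]
NON_EXISTENT = {"*nihil"}

def related_words(word, arcs, hidden):
    """Return every word connected to `word`, crossing hidden nodes without showing them."""
    neighbours = {}
    for a, b in arcs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, stack = {word}, [word]
    while stack:
        for nxt in neighbours.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    # Hidden nodes take part in the traversal but are removed from the output.
    return sorted(seen - hidden - {word})

visible = related_words("nihilismo", ARCS, NON_EXISTENT)
print(visible)
```

Without the hidden node, nihilismo and aniquilar would fall into separate components and the search would not relate them.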
Figure 4 shows a connected component of the graph that represents a clan of extended morpholexical relationships. The suffixal and prefixal morpholexical relationships (Santana et al., 2004) are also displayed in order to make the structures and paths easier to understand.
5.2 Navigation
The information has been structured and catalogued in a way that allows effective access. The graph can be traversed in any direction: from any node we can reach any other node in the same connected component of the graph, knowing at all times the extended morpholexical relationships (arcs) that are crossed to reach the destination. Starting from this premise, the different linguistic possibilities offered by this system are detailed below. It should be stressed that the non-existent nodes are kept internally to allow navigation among morpholexically related words.
5.2.1 Direction
It is linguistically interesting to know the family of words morpholexically related to a given word at a certain proximity. According to the path followed in the graph (upward, downward or horizontal), we classify both proximity and morphology. From a word we obtain the words that have undergone fewer derivative processes (upward), those that have undergone the same number of alterations (horizontal), and those that have undergone more derivative processes starting from it (downward). The different types considered are detailed below.
5.2.2 Direct Ancestry
Direct ancestry is the reverse of parasynthesis: the process of obtaining the original word to which a specific word has been related. To recognize this process, all that needs to be done is to go up one level in the graph, if it exists. Thus, the direct ancestry of the verb desquiciar is the substantive quicio. If the direct ancestry is applied twice, we obtain the original of the original of the current node. Thus, the two-level direct ancestry of the adjective desquiciador is quicio.
5.2.3 Indirect Ancestry
The words morpholexically related to the direct ancestry that are at the same level in the graph are regarded as indirect ancestries. They are the related words that have undergone one alteration fewer than the current word. In the quicio clan, the indirect ancestries of the adjective desquiciador are the verb enquiciar and the substantive quicial. As with direct ancestry, several levels of morpholexical relationships can be applied here.
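Direct and indirect ancestry can be sketched on a small part of the quicio clan. The arcs are reconstructed from the examples in the text and are an assumption (the complete clan contains more nodes); the function names are hypothetical.

```python
# Original word -> directly related words, reconstructed from the text's examples.
CHILDREN = {
    "quicio": ["desquiciar", "enquiciar", "quicial"],
    "desquiciar": ["desquiciador"],
}

def parents(word):
    """Words from which `word` has been directly formed (possibly more than one)."""
    return {p for p, kids in CHILDREN.items() if word in kids}

def direct_ancestry(word, levels=1):
    """Go up the graph the given number of levels."""
    current = {word}
    for _ in range(levels):
        current = {p for w in current for p in parents(w)}
    return current

def indirect_ancestry(word):
    """Words at the same level as the direct ancestry, excluding the direct ancestry itself."""
    direct = direct_ancestry(word)
    grandparents = direct_ancestry(word, levels=2)
    same_level = {c for g in grandparents for c in CHILDREN.get(g, [])}
    return same_level - direct

print(direct_ancestry("desquiciar"))
print(direct_ancestry("desquiciador", levels=2))
print(sorted(indirect_ancestry("desquiciador")))
```

Sets are used because a node may have more than one incoming arc, as in the incomodar example.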
5.2.4 Horizontal Direction
The words morpholexically related to the same original word, which have therefore undergone the same number of alterations, are regarded as the horizontal direction. They are obtained by retrieving the direct ancestry and going down only one level along all the arcs of that node. This option retrieves all the members of a family from one of them; it does not include the original word. The verbs desquiciar and enquiciar are obtained from the substantive quicial. In turn, the words directly related to some indirect ancestry of level one are regarded as horizontal of the second level. The substantive quicialera and the adjective enquiciado are obtained from the substantive desquicio.
5.2.5 Descendants
The descendants are the words that form the family of a given original word. The descendants of level two are the words related to the original word through one intermediate word: they are obtained by retrieving the descendant family of each member of the descendant family of the original word. In the quicio clan, the descendant of the substantive quicial is the substantive quicialera. The descendants of level two of the substantive quicio are all the substantives and adjectives in the nodes at the base of the graph of morpholexical relationships shown previously.
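Horizontal navigation and descendants can be sketched in the same way. The arcs are again reconstructed from the examples in the text and are an assumption, as are the function names; the second-level horizontal follows the description above, that is, the words directly related to the indirect ancestries.

```python
# Original word -> directly related words, reconstructed from the text's examples.
CHILDREN = {
    "quicio": ["desquiciar", "enquiciar", "quicial"],
    "desquiciar": ["desquiciador", "desquicio"],
    "enquiciar": ["enquiciado"],
    "quicial": ["quicialera"],
}

def parents(word):
    return {p for p, kids in CHILDREN.items() if word in kids}

def descendants(word, level=1):
    """Words reached by going down exactly `level` arcs from `word`."""
    current = {word}
    for _ in range(level):
        current = {c for w in current for c in CHILDREN.get(w, [])}
    return current

def horizontal(word, level=1):
    """Level 1: the other members of the family. Level 2: words directly
    related to the indirect ancestries of level one."""
    result = set()
    for parent in parents(word):
        relatives = {parent} if level == 1 else horizontal(parent, level - 1)
        for relative in relatives:
            result.update(CHILDREN.get(relative, []))
    result.discard(word)
    return result

print(sorted(horizontal("quicial")))
print(sorted(horizontal("desquicio", level=2)))
print(sorted(descendants("quicio", level=2)))
```

On this sample the results reproduce the examples given in the text for quicial, desquicio and quicio.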
5.3 Filters
The output of the different kinds of navigation from a specific word can sometimes be so large that it becomes difficult to find the words being sought and the relationships of interest. Filters on the extended morpholexical relationships allow the selective discrimination of the navigation output. All the results of the different types of navigation can be subjected to different kinds of filters: functional, by regularity and by affixes.
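The three kinds of filters can be sketched as optional predicates over the navigation output. The `Result` record, its fields and the sample values are illustrative assumptions; in particular, the functional filter is read here as a filter on grammatical category.

```python
from typing import NamedTuple

class Result(NamedTuple):
    word: str
    category: str   # grammatical category (assumed meaning of the functional filter)
    regular: bool   # True if the formation is lexicographically regular
    prefix: str
    suffix: str

# Hypothetical navigation output; categories, flags and labels are illustrative.
OUTPUT = [
    Result("agrupar", "verb", True, "a-", "-ar"),
    Result("enloquecer", "verb", False, "en-", "-ecer"),
    Result("amujerado", "adjective", True, "a-", "-ado"),
]

def apply_filters(results, category=None, regular=None, prefix=None, suffix=None):
    """Keep only the words that satisfy every filter that is not None."""
    kept = []
    for r in results:
        if category is not None and r.category != category:
            continue
        if regular is not None and r.regular != regular:
            continue
        if prefix is not None and r.prefix != prefix:
            continue
        if suffix is not None and r.suffix != suffix:
            continue
        kept.append(r.word)
    return kept

print(apply_filters(OUTPUT, category="verb"))
print(apply_filters(OUTPUT, category="verb", regular=False))
print(apply_filters(OUTPUT, prefix="a-"))
```

Because each filter is optional, they can be combined freely on the output of any of the navigation types.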
6. RESULTS
Because no similar tools exist for Spanish, this system has been tested on a textual corpus of more than 8 thousand texts, both literary and non-literary (narrative, theatre, poetry, law, politics, history...), containing more than three hundred million words, of which more than half a million are different, derived from more than one hundred thousand different canonical forms.
The words are first recognized morphologically in order to study the effect of the morpholexical relationships handled by this system on lexical searches in this corpus. The processor of morpholexical relationships is applied to every recognized word and its canonical morpholexical relationships are obtained. More than 3,300 million first-level canonical morpholexical relationships are identified (150,243 of them different), of which more than 164 million are parasynthetic (6,603 different). By extending the canonical morpholexical relationships to all their inflected words, generated by our morphological processor, more than 50 billion relationships are obtained (suffixal, prefixal and parasynthetic, but only those of the first level).
With the treatment of prefixes and enclitic pronouns, which can be carried out by our morphological processor, the identified morpholexical relationships would exceed one trillion. Although the relations become weaker, their number keeps rising as the levels considered increase, until the whole clan is included (the whole morpholexical paradigm of a word; see the example of the clan shown previously). This makes clear the potential of this system, for example, to locate words in a text.
7. CONCLUSIONS
A taxonomic, exhaustive and systematic study has been made of the affixes used in the derivative, prefixal and parasynthetic morphology of Spanish, on a lexicon sufficiently wide to cover all the casuistry of each of the affixes existing in this language. The way the affixes are used, the transcategorization, the meaning and the lexicographical regularity of each relation provide an overview of the formative behaviour of Spanish words, since the affixes involved in the main formation processes (suffixation, prefixation and parasynthesis) appear in a web application. The web application is designed to be useful to those who work with documents in Spanish: lexicologists, style analysts, extractors of textual information, translators, etc. It is important to emphasize that all the irregularities and exceptions of the lexicon described in section 2 have been studied, and they are numerous in a highly inflectional language: 20% irregularity in suffixation, 7% in prefixation and 15% in parasynthesis.
The resulting system is characterized by simple and flexible algorithms with 100% success in the recognition and generation of the morpholexical relationships of the language, independently of the complexity of the data they handle. This system can easily be incorporated into other document-processing tools to solve natural language processing problems. The main potential of this application is that it allows lexical searches beyond lexicographical regularity.
This kind of system is a first step towards the many computer applications and specialized programs that can be developed on this knowledge base.
REFERENCES
Alcoba Rueda, S. 1992. Tema verbal y formación de palabras en español. Actas do XIX Congreso Internacional de Lingüística e filoloxía románicas, Vol. II. Universidad de Santiago de Compostela,
Almela Pérez, R. 1999. Procedimientos de formación de palabras en español. Ed. Ariel, Barcelona.
Alvar Ezquerra, M. 2002. La formación de las palabras en español. 5th edn. Ed. Arco/Libros, Madrid.
Alvar Ezquerra, M. 2003. Nuevo diccionario de voces de uso actual. Ed. Arco/Libros,
Bajo Pérez, E. 1997. La derivación nominal en español. Ed. Arco/Libros,
Biblograf, s.a. 1997. Diccionario General de
Casares, J. 1990. Diccionario Ideológico de
Clave SM. 1997. Diccionario de Uso del Español Actual. Clave SM, edición en CD‑ROM. Madrid.
Dee, J. 1997. Volume I Introduction and Lexicon. A lexicon of Latin derivatives in Italian, Spanish, French and English.
Dee, J. 1997. Volume II Index. A lexicon of Latin derivatives in Italian, Spanish, French and English.
Espasa‑Calpe. 1991. Gran Diccionario de Sinónimos y Antónimos, 4ª edic. Espasa‑Calpe,
Faitelson-Weiser, S. 1993. Sufijación y derivación sufijal: sentido y forma. La formación de palabras. Varela (ed.), Taurus,
Lang, M. 1992. Formación de palabras en español. Morfología derivativa productiva en léxico moderno.
Larousse Planeta, s.a. 1996. Gran Diccionario de
Moliner, M. 1996. Diccionario de Uso del Español, edición en CD‑ROM.
Polguère, A. 2000. Towards a theoretically-motivated general public dictionary of semantic derivations and collocations for French. Proceedings of EURALEX'2000. Stuttgart, pp. 517-528.
Real Academia Española and Espasa‑Calpe. 1995. Diccionario de
Santana, O., Carreras, F., Pérez, J. and Rodríguez, G. 2003. Relaciones morfoléxicas sufijales
Santana, O., Pérez, J., Carreras, F., Duque, J., Hernández, Z. and Rodríguez, G. 1999. FLANOM: Flexionador y lematizador automático de formas nominales. Lingüística Española Actual, XXI, Vol. 2, Ed. Arco/Libros, S.L. Madrid. pp. 253‑297.
Santana, O., Pérez, J., Carreras, F. and Rodríguez, G. 2004. Relaciones morfoléxicas prefijales del español. Procesamiento de Lenguaje Natural, 32, Ed. SEPLN, Madrid. pp. 9-36.
Santana, O., Pérez, J., Hernández, Z., Carreras, F. and Rodríguez, G. 1997. FLAVER: Flexionador y lematizador automático de formas verbales. Lingüística Española Actual, XIX, Vol. 2, Ed. Arco/Libros, S.L. Madrid. pp. 229‑282.
Santana, O., Pérez, J., Carreras, F. and Rodríguez, G. 2004. Suffixal and Prefixal Morpholexical Relationships of the Spanish. Lecture Notes in Artificial Intelligence, 3230, Ed. Springer-Verlag. pp. 407‑418.
Seco, M. 1991. Diccionario de dudas y dificultades de la lengua española, 9ª Edición. Espasa‑Calpe, Madrid.
Socrates, D., Baldzis, S., Kolalas, A. and Eumeridou, E. 2005. The Computational Modern Greek Morphological Lexicon: An Efficient and Comprehensive System for Morphological Analysis and Synthesis. Literary and Linguistic Computing, Vol 2, No. 20, pp. 153-187.
Varela, S. (ed.). 1993. La formación de palabras. Taurus,