Statistical approaches for natural language modelling and monotone statistical machine translation

  • Author: Jesus Andres Ferrer
  • Thesis supervisors: Alfons Juan Císcar, Francisco Casacuberta Nolla
  • Defence: Universitat Politècnica de València (Spain), 2010
  • Language: English
  • Examination committee: Enrique Vidal Ruiz (chair), José Oncina Carratalá (secretary), Marcello Federico (member), Philipp Koehn (member), Hermann Ney (member)
  • Links
    • Open-access thesis available at: RiuNet
  • Abstract
    • This thesis gathers some contributions to statistical pattern recognition and, more specifically, to several natural language processing (NLP) tasks. Several well-known statistical techniques are revisited in this thesis: parameter estimation, loss function design and probability modelling. These techniques are applied to NLP tasks such as text classification (TC), language modelling (LM) and statistical machine translation (SMT).

      In parameter estimation, we tackle the smoothing problem by proposing a constrained domain maximum likelihood estimation (CDMLE) technique.

      The CDMLE avoids the need for the smoothing stage that makes maximum likelihood estimation (MLE) lose its good theoretical properties. This technique is applied to text classification by means of the naive Bayes classifier. Afterwards, the CDMLE technique is extended to leaving-one-out MLE and then applied to LM smoothing. The results obtained on several LM tasks show an improvement in terms of perplexity over standard smoothing techniques.
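      A minimal Python sketch of this idea is given below, applied to a toy multinomial naive Bayes text classifier: instead of smoothing a maximum likelihood estimate after the fact, the likelihood is maximised over a constrained domain in which every word probability is kept above a small lower bound, so no event receives zero probability. The lower bound, the clamping procedure and the toy data are assumptions made for this illustration, not the exact CDMLE formulation of the thesis.

        import math
        from collections import Counter

        def constrained_mle(counts, vocab, eps=1e-3):
            # Maximum likelihood estimate of a multinomial subject to p(w) >= eps
            # for every word w (an illustrative stand-in for the CDMLE idea).
            # Words whose proportional share falls below eps are clamped to eps;
            # the remaining mass is shared among the others according to their counts.
            clamped = set()
            while True:
                free = [w for w in vocab if w not in clamped]
                free_mass = 1.0 - eps * len(clamped)
                free_count = sum(counts[w] for w in free)
                probs = {w: eps for w in clamped}
                changed = False
                for w in free:
                    p = free_mass * counts[w] / free_count if free_count else free_mass / len(free)
                    if p < eps:
                        clamped.add(w)
                        changed = True
                    else:
                        probs[w] = p
                if not changed:
                    return probs

        def train_naive_bayes(docs, eps=1e-3):
            # docs: list of (list_of_words, class_label) pairs.
            vocab = {w for words, _ in docs for w in words}
            classes = {c for _, c in docs}
            prior = {c: sum(1 for _, y in docs if y == c) / len(docs) for c in classes}
            cond = {c: constrained_mle(Counter(w for words, y in docs if y == c for w in words),
                                       vocab, eps)
                    for c in classes}
            return prior, cond

        def classify(words, prior, cond):
            # Standard naive Bayes decision: argmax of log prior plus word log-likelihoods.
            def score(c):
                return math.log(prior[c]) + sum(math.log(cond[c][w]) for w in words if w in cond[c])
            return max(prior, key=score)

        docs = [("buy cheap pills now".split(), "spam"),
                ("cheap pills cheap offer".split(), "spam"),
                ("meeting agenda for monday".split(), "ham"),
                ("project meeting notes".split(), "ham")]
        prior, cond = train_naive_bayes(docs)
        print(classify("cheap meeting pills".split(), prior, cond))   # -> spam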

      Concerning the loss function, we carefully study the design of loss functions other than the 0-1 loss. We focus on loss functions that, while retaining a decoding complexity similar to that of the 0-1 loss function, provide more flexibility.

      Many candidate loss functions are presented and analysed on several statistical machine translation tasks and for several translation models. We also analyse some outstanding translation rules, such as the direct translation rule, and we give further insight into log-linear models, which are, in fact, particular cases of loss functions.
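      The sketch below illustrates, with invented toy numbers, how the choice of loss function changes the decision rule: with the 0-1 loss the Bayes rule reduces to picking the maximum-posterior (MAP) translation, whereas a simple word-mismatch loss (a crude stand-in for the more refined losses studied in the thesis) may prefer a candidate that lies close to many probable translations. The candidates, probabilities and losses are assumptions for this example only.

        def zero_one_loss(y, y_ref):
            return 0.0 if y == y_ref else 1.0

        def word_mismatch_loss(y, y_ref):
            # Position-wise word mismatches plus a length penalty.
            a, b = y.split(), y_ref.split()
            return sum(u != v for u, v in zip(a, b)) + abs(len(a) - len(b))

        def expected_loss(candidate, posterior, loss):
            # Bayes risk of outputting `candidate` when the reference translation
            # is distributed according to `posterior` (a dict: translation -> p(y | x)).
            return sum(p * loss(candidate, y) for y, p in posterior.items())

        def decide(posterior, loss):
            # General Bayes decision rule: minimise the expected loss over the candidates.
            return min(posterior, key=lambda c: expected_loss(c, posterior, loss))

        # Toy posterior over candidate translations of one source sentence.
        posterior = {
            "my house is green": 0.40,
            "the house is red":  0.32,
            "the house is blue": 0.28,
        }

        print(decide(posterior, zero_one_loss))       # MAP choice: "my house is green"
        print(decide(posterior, word_mismatch_loss))  # "the house is red", closer to the rest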

      Finally, several monotone translation models are proposed based on well-known modelling techniques. Firstly, an extension to the GIATI technique is proposed to infer finite-state transducers (FSTs). Afterwards, a phrase-based monotone translation model inspired by hidden Markov models is proposed. Lastly, a phrase-based hidden semi-Markov model is introduced. The latter model yields slight improvements over the baseline under some circumstances.
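      As a rough illustration of monotone phrase-based translation (not the exact hidden Markov or hidden semi-Markov formulation of the thesis), the following sketch segments the source sentence from left to right into phrases and chooses, by a Viterbi-style dynamic program, the segmentation and phrase translations with the highest total log score; the phrase table, its scores and the maximum phrase length are invented for this example.

        import math

        # Invented toy phrase table: source phrase -> list of (target phrase, score).
        PHRASE_TABLE = {
            ("la",):           [("the", 0.9)],
            ("casa",):         [("house", 0.7), ("home", 0.3)],
            ("verde",):        [("green", 1.0)],
            ("la", "casa"):    [("the house", 0.6), ("the home", 0.4)],
            ("casa", "verde"): [("green house", 0.8)],
        }
        MAX_PHRASE_LEN = 2

        def decode_monotone(source_words):
            # best[j] = (log score, translation) of the best segmentation of source_words[:j].
            n = len(source_words)
            best = [(-math.inf, "")] * (n + 1)
            best[0] = (0.0, "")
            for j in range(1, n + 1):
                for k in range(max(0, j - MAX_PHRASE_LEN), j):
                    src_phrase = tuple(source_words[k:j])
                    for tgt_phrase, p in PHRASE_TABLE.get(src_phrase, []):
                        score = best[k][0] + math.log(p)
                        if score > best[j][0]:
                            best[j] = (score, (best[k][1] + " " + tgt_phrase).strip())
            return best[n]

        print(decode_monotone("la casa verde".split()))   # -> (approx -0.33, 'the green house')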

