8 votes

Remove diacritics in JavaScript: accents, tildes, umlauts/diaereses, cedillas, etc.

Context: I'm trying to strip accents and other diacritics in order to compare two words. I wrote this function:

let sinDiacriticos = (function(){
    let de = 'ÁÃÀÄÂÉËÈÊÍÏÌÎÓÖÒÔÚÜÙÛÑÇáãàäâéëèêíïìîóöòôúüùûñç',
         a = 'AAAAAEEEEIIIIOOOOUUUUNCaaaaaeeeeiiiioooouuuunc',
        re = new RegExp('['+de+']' , 'ug');

    return texto =>
        texto.replace(
            re, 
            match => a.charAt(de.indexOf(match))
        );
})();

let prue = 'Épico año de mal agüero, sólo Óscar y Ángel ganarán ésta. -Ímpetú Úrsula. ¡Ñañdú corre rápido por ahí!';
console.log(sinDiacriticos(prue));
    // ->   Epico ano de mal aguero, solo Oscar y Angel ganaran esta. -Impetu Ursula. ¡Nandu corre rapido por ahi!

Is there a direct way to replace **any diacritical mark** without having to build a replacement map by hand? I'm interested in covering diacritical marks in any language.

Question 2. Given that in Spanish the ñ is a different letter from the n, can the diacritics be removed except when the letter is an ñ?
<sup>* Question asked by <a href="https://es.stackoverflow.com/users/7176/blonfu">blonfu</a> in the <a href="https://es.stackoverflow.com/questions/62031/eliminar-signos-diacr%C3%ADticos-en-javascript-eliminar-tildes-acentos-ortogr%C3%A1ficos/62032?noredirect=1#comment241819_62032">comments</a></sup>

0 votes

I posted this because (1) we have discussed it in comments several times, and (2) others can add alternative answers based on other methods.

27 votes

Mariano Points 21056

Since ECMAScript 6 (2015), you can use String.prototype.normalize() to convert a string to its decomposed Unicode normalization form (see compatibility).

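This means that a character (strictly speaking, a "code point") can be broken down into its equivalent base character followed by its combining marks. For example, "á" in composed form (NFC) is the single code point U+00E1, while in decomposed form (NFD) it is "a" (U+0061) followed by the combining acute accent U+0301.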

Both forms are equivalent and print the same.

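In NFD, the diacritics are separate code points (roughly, separate characters).
The key point is that the most common diacritical marks, including those used in Spanish, fall in the range U+0300 - U+036F (the Combining Diacritical Marks block). Other, less common combining marks live in separate Unicode blocks.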
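The decomposition can be checked directly by listing code points. This is a minimal sketch; `codePoints` is a hypothetical helper written for this illustration and is not part of the original answer.

```javascript
// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
function codePoints(texto) {
    return Array.from(texto, ch =>
        'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
    ).join(' ');
}

const composed = '\u00F1';                    // "ñ" as a single code point
const decomposed = composed.normalize('NFD'); // "n" followed by a combining tilde

console.log(codePoints(composed));                      // U+00F1
console.log(codePoints(decomposed));                    // U+006E U+0303
console.log(codePoints('\u00FC'.normalize('NFD')));     // U+0075 U+0308 ("u" + combining diaeresis)
console.log(composed === decomposed);                   // false: different code point sequences
console.log(composed === decomposed.normalize('NFC'));  // true: recomposed
```

Both strings render the same on screen, but a plain `===` comparison treats them as different, which is why normalizing before comparing matters.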


Code (for all languages)

Convert to the decomposed form and remove the Combining Diacritical Marks block.

// Removes the diacritics from a text (ES6)
//
function eliminarDiacriticos(texto) {
    return texto.normalize('NFD').replace(/[\u0300-\u036f]/g,"");
}
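As a rough sketch of how this function could serve the word comparison from the question, and of where its limits are: `sameWord` and `removeAllMarks` below are hypothetical helpers, not part of the original answer, and `removeAllMarks` assumes an engine with ES2018 Unicode property escapes (`\p{M}` with the `u` flag).

```javascript
// Same function as in the answer.
function eliminarDiacriticos(texto) {
    return texto.normalize('NFD').replace(/[\u0300-\u036f]/g, "");
}

// Hypothetical helper: compare two words ignoring diacritics and case.
function sameWord(a, b) {
    return eliminarDiacriticos(a).toLowerCase() === eliminarDiacriticos(b).toLowerCase();
}

// Hypothetical variant (ES2018): remove every code point in the Unicode
// "Mark" category instead of only the U+0300 - U+036F block.
function removeAllMarks(texto) {
    return texto.normalize('NFD').replace(/\p{M}/gu, "");
}

console.log(sameWord('Ángel', 'angel'));              // true
console.log(eliminarDiacriticos('ø') === 'ø');        // true: "ø" has no canonical decomposition
console.log(eliminarDiacriticos('a\u20D7') === 'a');  // false: U+20D7 lies outside U+0300 - U+036F
console.log(removeAllMarks('a\u20D7') === 'a');       // true
```

Letters such as "ø" or "ł" are not base letter plus mark in Unicode, so neither version changes them. The `\p{M}` variant is broader, but it also strips marks that are an integral part of some scripts, so whether it is appropriate depends on the text being processed.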

Tests:

// Test

function mostrarSinDiacriticos(inp){
    document.getElementById('muestra')
        .innerText = eliminarDiacriticos(inp.value);
}

// Run once with the initial value of the input below
mostrarSinDiacriticos(document.getElementById('texto'));

Text:
<input id="texto" oninput="mostrarSinDiacriticos(this)" style="width:100%" value="áéíóúñüÁÉÍÓÚÑÜ">
Without diacritics:
<div id="muestra"></div>

Removing diacritics except on the ñ (Spanish only)

  1. We can remove only the acute accents on vowels and the diaeresis on the ü.
    We decompose, remove the diacritics only from áéíóúü, and then recompose:

    texto.normalize('NFD')
         .replace(/([aeio])\u0301|(u)[\u0301\u0308]/gi,"$1$2")
         .normalize();
  2. Or we can remove any diacritic (for any language) unless it belongs to an ñ:

    // Removes the diacritics from a text, except on an "ñ" (ES6).
    // n/N are listed explicitly instead of using the `i` flag, so that
    // case folding cannot affect the combining-mark range.
    function eliminarDiacriticosEs(texto) {
        return texto
               .normalize('NFD')
               .replace(/([^nN\u0300-\u036f]|[nN](?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+/g,"$1")
               .normalize();
    }

    Tests:

    // Test
    
    function mostrarSinDiacriticos(inp){
        document.getElementById('muestra')
            .innerText = eliminarDiacriticosEs(inp.value);
    }
    
    // Run once with the initial value of the input below
    mostrarSinDiacriticos(document.getElementById('texto'));
    
    Text:
    <input id="texto" oninput="mostrarSinDiacriticos(this)" style="width:100%" value="áéíóúñüÁÉÍÓÚÑÜ">
    Without diacritics:
    <div id="muestra"></div>
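To see how the approaches in this answer differ, here is a small side-by-side sketch that runs outside the browser. The names `removeAll`, `spanishVowelsOnly` and `allExceptEnye` are hypothetical labels for this illustration. The third one uses a variant of the answer's regex that lists n/N explicitly rather than relying on the `i` flag.

```javascript
// Remove every mark in U+0300 - U+036F.
const removeAll = t => t.normalize('NFD').replace(/[\u0300-\u036f]/g, "");

// Remove only the acute accent on vowels and the diaeresis on u, then recompose.
const spanishVowelsOnly = t => t.normalize('NFD')
    .replace(/([aeio])\u0301|(u)[\u0301\u0308]/gi, "$1$2")
    .normalize();

// Remove every mark unless it is the lone tilde of an n/N, then recompose.
const allExceptEnye = t => t.normalize('NFD')
    .replace(/([^nN\u0300-\u036f]|[nN](?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+/g, "$1")
    .normalize();

const sample = 'Ñandú, pingüino, garçon';
console.log(removeAll(sample));         // Nandu, pinguino, garcon
console.log(spanishVowelsOnly(sample)); // Ñandu, pinguino, garçon
console.log(allExceptEnye(sample));     // Ñandu, pinguino, garcon
```

The second version leaves non-Spanish marks such as the cedilla alone, while the third strips them and still keeps the ñ. Which one fits depends on whether the text is expected to be Spanish only.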

2 votes

One question: how do we keep the ñ? The tilde (virgulilla) of the ñ is not something added to a letter but part of the letter itself, and we don't want to replace ñ with n: año (year) is not the same as ano (anus), nor moño (bow) as mono (monkey), nor uña (nail) as una (a), etc.

1 votes

Very good idea, @blonfu. I edited the question and the answer to include both options. I'm sure there are use cases for both alternatives (with or without the ñ).
