JavaScript strings outside of the BMP

According to JavaScript: The Good Parts:

JavaScript was built at a time when Unicode was a 16-bit character set, so all characters in JavaScript are 16 bits wide.

This leads me to believe that JavaScript uses UCS-2 (not UTF-16!) and can only handle characters up to U+FFFF.

Further investigation confirms this:

> String.fromCharCode(0x20001);

The fromCharCode method seems to use only the lowest 16 bits when returning the Unicode character. Trying to get U+20001 (CJK unified ideograph 20001) instead returns U+0001.
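The truncation is easy to check in a console. The snippet below is a minimal sketch that uses only standard built-ins; the variable names are illustrative and not part of the question.

```javascript
// String.fromCharCode keeps only the low 16 bits of each argument.
const truncated = String.fromCharCode(0x20001);

// 0x20001 & 0xFFFF is 0x0001, so the result is the single code unit U+0001.
const sameAsLow16 = truncated === String.fromCharCode(0x20001 & 0xFFFF);

console.log(truncated.length);                     // 1
console.log(truncated.charCodeAt(0).toString(16)); // "1"
console.log(sameAsLow16);                          // true
```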

Question: is it at all possible to handle post-BMP characters in JavaScript?

2011-07-31: slide twelve from Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly covers the issues related to this quite well:

It depends on what you mean by 'support'. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can.

But each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, slice, etc.) deal with code units, not characters, so they will quite happily split surrogate pairs or hold invalid surrogate sequences.
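To make the code-unit behaviour concrete, here is a small sketch. It assumes U+20001 written as the surrogate pair D840 DC01; the variable names are only for illustration.

```javascript
// U+20001 written as an explicit surrogate pair (high D840, low DC01).
const ideograph = '\uD840\uDC01';
const text = 'a' + ideograph + 'b';

// Three characters, but four UTF-16 code units.
console.log(text.length); // 4

// split('') cuts between the two surrogates of the pair.
const halves = ideograph.split('');
console.log(halves.length); // 2

// slice(0, 2) keeps 'a' plus a lone high surrogate, an invalid sequence.
const cut = text.slice(0, 2);
console.log(cut.charCodeAt(1).toString(16)); // "d840"
```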

If you want surrogate-aware methods, I'm afraid you're going to have to start writing them yourself! For example:

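    String.prototype.getCodePointLength = function() {
        // Each surrogate pair takes two code units but is one code point,
        // so subtract the number of pairs (split pieces minus one).
        return this.length - this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length + 1;
    };

    String.fromCodePoint = function() {
        var chars = Array.prototype.slice.call(arguments);
        // Walk backwards so that splicing in a pair does not shift unvisited indices.
        for (var i = chars.length; i-- > 0;) {
            var n = chars[i] - 0x10000;
            if (n >= 0)
                chars.splice(i, 1, 0xD800 + (n >> 10), 0xDC00 + (n & 0x3FF));
        }
        return String.fromCharCode.apply(null, chars);
    };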

I came to the same conclusion as bobince. If you want to work with strings containing Unicode characters outside of the BMP, you have to reimplement JavaScript's String methods. This is because JavaScript counts each 16-bit code value as a character. Symbols outside of the BMP need two code values to be represented, so you run into a case where some symbols count as two characters and some count as only one.
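The "two code values" follow from the UTF-16 surrogate arithmetic. The helpers below are a hypothetical sketch (the names toSurrogatePair and fromSurrogatePair are assumptions for illustration, not standard functions) and expect a code point above U+FFFF.

```javascript
// Hypothetical helper: split a supplementary code point (> 0xFFFF)
// into its high and low 16-bit code values.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

// Hypothetical helper: recombine a high/low pair into the code point.
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const pair = toSurrogatePair(0x20001);
console.log(pair.map(n => n.toString(16)).join(' '));          // "d840 dc01"
console.log(fromSurrogatePair(pair[0], pair[1]).toString(16)); // "20001"

// A BMP symbol counts as one "character", a non-BMP symbol as two.
console.log('a'.length, String.fromCharCode(pair[0], pair[1]).length); // 1 2
```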

I've reimplemented the following methods to treat each Unicode code point as a single character: .length, .charCodeAt, .fromCharCode, .charAt, .indexOf, .lastIndexOf, .slice, and .split.

You can check it out on jsfiddle: http://jsfiddle.net/Y89Du/

Here's the code without comments. I tested it, but it may still have errors. Comments are welcome.

    if (!String.prototype.ucLength) {
        String.prototype.ucLength = function () {
            // this solution was taken from
            // http://.com/questions/3744721/javascript-strings-outside-of-the-bmp
            return this.length - this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length + 1;
        };
    }

    // Caveat: engines with a native String.prototype.codePointAt skip this
    // fallback, and the native method takes a code unit index, whereas this
    // version takes a code point index.
    if (!String.prototype.codePointAt) {
        String.prototype.codePointAt = function (ucPos) {
            if (isNaN(ucPos)) {
                ucPos = 0;
            }
            var str = String(this);
            var codePoint = null;
            var pairFound = false;
            var ucIndex = -1;
            var i = 0;
            while (i < str.length) {
                ucIndex += 1;
                var code = str.charCodeAt(i);
                var next = str.charCodeAt(i + 1);
                pairFound = (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF);
                if (ucIndex == ucPos) {
                    codePoint = pairFound ? ((code - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000 : code;
                    break;
                } else {
                    i += pairFound ? 2 : 1;
                }
            }
            return codePoint;
        };
    }

    if (!String.fromCodePoint) {
        String.fromCodePoint = function () {
            var strChars = [], codePoint, offset, codeValues, i;
            for (i = 0; i < arguments.length; ++i) {
                codePoint = arguments[i];
                offset = codePoint - 0x10000;
                if (codePoint > 0xFFFF) {
                    codeValues = [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
                } else {
                    codeValues = [codePoint];
                }
                strChars.push(String.fromCharCode.apply(null, codeValues));
            }
            return strChars.join("");
        };
    }

    if (!String.prototype.ucCharAt) {
        String.prototype.ucCharAt = function (ucIndex) {
            var str = String(this);
            var codePoint = str.codePointAt(ucIndex);
            var ucChar = String.fromCodePoint(codePoint);
            return ucChar;
        };
    }

    if (!String.prototype.ucIndexOf) {
        String.prototype.ucIndexOf = function (searchStr, ucStart) {
            if (isNaN(ucStart)) {
                ucStart = 0;
            }
            if (ucStart < 0) {
                ucStart = 0;
            }
            var str = String(this);
            var strUCLength = str.ucLength();
            searchStr = String(searchStr);
            var ucSearchLength = searchStr.ucLength();
            var i = ucStart;
            while (i < strUCLength) {
                var ucSlice = str.ucSlice(i, i + ucSearchLength);
                if (ucSlice == searchStr) {
                    return i;
                }
                i++;
            }
            return -1;
        };
    }

    if (!String.prototype.ucLastIndexOf) {
        String.prototype.ucLastIndexOf = function (searchStr, ucStart) {
            var str = String(this);
            var strUCLength = str.ucLength();
            if (isNaN(ucStart)) {
                ucStart = strUCLength - 1;
            }
            if (ucStart >= strUCLength) {
                ucStart = strUCLength - 1;
            }
            searchStr = String(searchStr);
            var ucSearchLength = searchStr.ucLength();
            var i = ucStart;
            while (i >= 0) {
                var ucSlice = str.ucSlice(i, i + ucSearchLength);
                if (ucSlice == searchStr) {
                    return i;
                }
                i--;
            }
            return -1;
        };
    }

    if (!String.prototype.ucSlice) {
        String.prototype.ucSlice = function (ucStart, ucStop) {
            var str = String(this);
            var strUCLength = str.ucLength();
            if (isNaN(ucStart)) {
                ucStart = 0;
            }
            if (ucStart < 0) {
                ucStart = strUCLength + ucStart;
                if (ucStart < 0) { ucStart = 0; }
            }
            if (typeof(ucStop) == 'undefined') {
                // default: slice through the end of the string
                ucStop = strUCLength;
            }
            if (ucStop < 0) {
                ucStop = strUCLength + ucStop;
                if (ucStop < 0) { ucStop = 0; }
            }
            // never read past the last code point
            if (ucStop > strUCLength) {
                ucStop = strUCLength;
            }
            var ucChars = [];
            var i = ucStart;
            while (i < ucStop) {
                ucChars.push(str.ucCharAt(i));
                i++;
            }
            return ucChars.join("");
        };
    }

    if (!String.prototype.ucSplit) {
        String.prototype.ucSplit = function (delimiter, limit) {
            var str = String(this);
            var strUCLength = str.ucLength();
            var ucChars = [];
            if (delimiter == '') {
                for (var i = 0; i < strUCLength; i++) {
                    ucChars.push(str.ucCharAt(i));
                }
                // apply the limit only when one was passed
                if (typeof(limit) != 'undefined') {
                    ucChars = ucChars.slice(0, limit);
                }
            } else {
                ucChars = str.split(delimiter, limit);
            }
            return ucChars;
        };
    }

More recent JavaScript engines have String.fromCodePoint.

    const ideograph = String.fromCodePoint( 0x20001 ); // outside the BMP

They also have a code point iterator, which lets you count the length in code points.

    function countCodePoints( str )
    {
        const i = str[Symbol.iterator]();
        let count = 0;
        while( !i.next().done ) ++count;
        return count;
    }

    console.log( ideograph.length ); // gives '2'
    console.log( countCodePoints(ideograph) ); // '1'
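The same string iterator is what Array.from and for...of use, so code-point-aware operations can be sketched on top of it. This assumes an ES2015+ engine; sliceByCodePoint is a hypothetical helper name, not a built-in.

```javascript
const sample = 'a' + String.fromCodePoint(0x20001) + 'b';

// Array.from uses the string iterator, which yields whole code points.
const codePoints = Array.from(sample);
console.log(sample.length);     // 4 code units
console.log(codePoints.length); // 3 code points

// Hypothetical helper: slice by code point index instead of code unit index.
function sliceByCodePoint(str, start, end) {
  return Array.from(str).slice(start, end).join('');
}
console.log(sliceByCodePoint(sample, 1, 2) === String.fromCodePoint(0x20001)); // true

// for...of also visits one code point per step.
const hexes = [];
for (const ch of sample) {
  hexes.push(ch.codePointAt(0).toString(16));
}
console.log(hexes.join(' ')); // "61 20001 62"
```

Building an array on every call costs time and memory on long strings, so this is better suited to short inputs than to hot loops.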

Yes, you can. Although support for non-BMP characters directly in source documents is optional according to the ECMAScript standard, modern browsers let you use them. Naturally, the document encoding must be properly declared, and for most practical purposes you would need to use the UTF-8 encoding. Moreover, you need an editor that can handle UTF-8, and you need some input method(s); see, for example, my Full Unicode Input utility.

Using suitable tools and settings, you can write var foo = '𠀁'.

The non-BMP characters will be internally represented as surrogate pairs, so each non-BMP character counts as 2 in the string length.
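The internal representation can be inspected without typing the character itself. This is a minimal sketch assuming an ES2015+ engine, which accepts the \u{...} code point escape; the variable names are illustrative.

```javascript
// Two spellings of the same string: a code point escape and an explicit pair.
const viaEscape = '\u{20001}';
const viaPair = '\uD840\uDC01';

console.log(viaEscape === viaPair); // true
console.log(viaEscape.length);      // 2

// charCodeAt exposes the individual surrogates...
console.log(viaEscape.charCodeAt(0).toString(16)); // "d840"
console.log(viaEscape.charCodeAt(1).toString(16)); // "dc01"

// ...while codePointAt(0) reads the whole pair as one code point.
console.log(viaEscape.codePointAt(0).toString(16)); // "20001"
```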