PHP: convert a Unicode code point to UTF-8
PHP 7+
As of PHP 7, you can use the Unicode code point escape syntax to do this.
echo "\u{597D}";
outputs 好.
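The escape syntax only applies to string literals written in the source file, so code points that arrive at run time as text such as "U+597D" still need a conversion step. A minimal sketch, assuming PHP 7.2+ with the mbstring extension (where mb_chr() is built in); the helper name codepoint_to_utf8 is hypothetical:

```php
<?php
// Hypothetical helper: turns a "U+XXXX" string into the UTF-8 character.
// Assumes PHP 7.2+ with the mbstring extension, where mb_chr() is built in.
function codepoint_to_utf8(string $notation): string
{
    // Accept 4 to 6 hex digits after the "U+" prefix.
    if (!preg_match('/^U\+([0-9A-Fa-f]{4,6})$/', $notation, $m)) {
        throw new InvalidArgumentException("Not a U+XXXX code point: $notation");
    }
    $char = mb_chr(hexdec($m[1]), 'UTF-8');
    if ($char === false) {
        // mb_chr() rejects values that are not valid code points.
        throw new InvalidArgumentException("Invalid code point: $notation");
    }
    return $char;
}

echo codepoint_to_utf8('U+597D'), codepoint_to_utf8('U+6211'), "\n"; // prints 好我
```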
I have my data in this format: U+597D or like this: U+6211. I want to convert them to UTF-8 (the original characters are 好 and 我). How can I do that?
I just wrote a polyfill for the missing multibyte versions of ord and chr, with the following in mind:
It defines the functions mb_ord and mb_chr only if they do not already exist. If they exist in your framework or in a later version of PHP (both are built in as of PHP 7.2), the polyfill is ignored.
It uses the widely used mbstring extension to do the conversion. If the mbstring extension is not loaded, it falls back to the iconv extension instead.
I also added functions for HTML entity encoding/decoding and for encoding/decoding to JSON format, as well as some demo code showing how to use these functions.
Code
// JSON-style escapes: "我" <-> "\u6211"
if (!function_exists('codepoint_encode')) {
    function codepoint_encode($str) {
        return substr(json_encode($str), 1, -1);
    }
}
if (!function_exists('codepoint_decode')) {
    function codepoint_decode($str) {
        return json_decode(sprintf('"%s"', $str));
    }
}
// iconv-based fallbacks, used only when mbstring is not loaded
if (!function_exists('mb_internal_encoding')) {
    function mb_internal_encoding($encoding = NULL) {
        return ($encoding === NULL)
            ? iconv_get_encoding('internal_encoding')
            : iconv_set_encoding('internal_encoding', $encoding);
    }
}
if (!function_exists('mb_convert_encoding')) {
    function mb_convert_encoding($str, $to_encoding, $from_encoding = NULL) {
        return iconv(($from_encoding === NULL) ? mb_internal_encoding() : $from_encoding, $to_encoding, $str);
    }
}
if (!function_exists('mb_chr')) {
    function mb_chr($ord, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            // UCS-4BE is simply the code point as a 32-bit big-endian integer
            return pack("N", $ord);
        } else {
            return mb_convert_encoding(mb_chr($ord, 'UCS-4BE'), $encoding, 'UCS-4BE');
        }
    }
}
if (!function_exists('mb_ord')) {
    function mb_ord($char, $encoding = 'UTF-8') {
        if ($encoding === 'UCS-4BE') {
            list(, $ord) = (strlen($char) === 4) ? @unpack('N', $char) : @unpack('n', $char);
            return $ord;
        } else {
            return mb_ord(mb_convert_encoding($char, 'UCS-4BE', $encoding), 'UCS-4BE');
        }
    }
}
if (!function_exists('mb_htmlentities')) {
    function mb_htmlentities($string, $hex = true, $encoding = 'UTF-8') {
        // Replace every non-ASCII character with a numeric entity
        return preg_replace_callback('/[\x{80}-\x{10FFFF}]/u', function ($match) use ($hex) {
            return sprintf($hex ? '&#x%X;' : '&#%d;', mb_ord($match[0]));
        }, $string);
    }
}
if (!function_exists('mb_html_entity_decode')) {
    function mb_html_entity_decode($string, $flags = null, $encoding = 'UTF-8') {
        return html_entity_decode($string, ($flags === NULL) ? ENT_COMPAT | ENT_HTML401 : $flags, $encoding);
    }
}
How to use
echo "\nGet string from numeric DEC value\n";
var_dump(mb_chr(25105));
var_dump(mb_chr(22909));
echo "\nGet string from numeric HEX value\n";
var_dump(mb_chr(0x6211));
var_dump(mb_chr(0x597D));
echo "\nGet numeric value of character as DEC int\n";
var_dump(mb_ord('我'));
var_dump(mb_ord('好'));
echo "\nGet numeric value of character as HEX string\n";
var_dump(dechex(mb_ord('我')));
var_dump(dechex(mb_ord('好')));
echo "\nEncode / decode to DEC based HTML entities\n";
var_dump(mb_htmlentities('我好', false));
var_dump(mb_html_entity_decode('&#25105;&#22909;'));
echo "\nEncode / decode to HEX based HTML entities\n";
var_dump(mb_htmlentities('我好'));
var_dump(mb_html_entity_decode('&#x6211;&#x597D;'));
echo "\nUse JSON encoding / decoding\n";
var_dump(codepoint_encode("我好"));
var_dump(codepoint_decode('\u6211\u597d'));
Output
Get string from numeric DEC value
string(3) "我"
string(3) "好"
Get string from numeric HEX value
string(3) "我"
string(3) "好"
Get numeric value of character as DEC int
int(25105)
int(22909)
Get numeric value of character as HEX string
string(4) "6211"
string(4) "597d"
Encode / decode to DEC based HTML entities
string(16) "&#25105;&#22909;"
string(6) "我好"
Encode / decode to HEX based HTML entities
string(16) "&#x6211;&#x597D;"
string(6) "我好"
Use JSON encoding / decoding
string(12) "\u6211\u597d"
string(6) "我好"
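The same two functions also cover the "U+XXXX" notation from the question in both directions. A rough sketch, assuming mb_ord() and mb_chr() are available (built in since PHP 7.2 with mbstring, or supplied by a polyfill like the one above) and that the input is valid UTF-8; the helper names to_u_notation and from_u_notation are hypothetical:

```php
<?php
// Hypothetical helpers; assume mb_ord()/mb_chr() are available
// (built in since PHP 7.2 with mbstring, or supplied by a polyfill).
function to_u_notation(string $text): string
{
    $out = [];
    // Split the (valid) UTF-8 string into single characters.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $out[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }
    return implode(' ', $out);
}

function from_u_notation(string $notation): string
{
    // Each "U+XXXX" token (plus any whitespace after it) becomes one character.
    return preg_replace_callback('/U\+([0-9A-Fa-f]{4,6})\s*/', function ($m) {
        return mb_chr(hexdec($m[1]), 'UTF-8');
    }, $notation);
}

echo to_u_notation('我好'), "\n";             // prints U+6211 U+597D
echo from_u_notation('U+6211 U+597D'), "\n"; // prints 我好
```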
With the help of the following table:
http://en.wikipedia.org/wiki/UTF-8#Description
it can't get any simpler :)
Simply mask the Unicode code point according to the range it falls into.
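As a worked illustration of the masking, here is the three-byte row of that table (code points U+0800 to U+FFFF, layout 1110xxxx 10xxxxxx 10xxxxxx) applied to U+597D. This is only a sketch for one range, and the variable names are chosen for this example:

```php
<?php
// Three-byte UTF-8 layout for U+0800..U+FFFF: 1110xxxx 10xxxxxx 10xxxxxx
$codepoint = 0x597D; // 好

$byte1 = 0xE0 | ($codepoint >> 12);         // lead byte carries the top 4 bits
$byte2 = 0x80 | (($codepoint >> 6) & 0x3F); // continuation byte, middle 6 bits
$byte3 = 0x80 | ($codepoint & 0x3F);        // continuation byte, low 6 bits

printf("%02X %02X %02X\n", $byte1, $byte2, $byte3); // prints E5 A5 BD
echo chr($byte1) . chr($byte2) . chr($byte3), "\n"; // prints 好
```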
I was in a position where I needed to filter out specific characters without affecting the HTML, because I was using a WYSIWYG editor, but people copying and pasting from Word would add some nice, unrenderable characters to the content.
My solution boils down to simple replacement lists.
class ReplaceIllegal {
    // Bytes treated as illegal: 0x00-0x1E and 0x80-0xFE.
    // Each one is paired with its numeric HTML entity (decimal form assumed).
    public static $find = array();
    private static $replace = array();
    private static $initialized = false;

    // Build both lists once, keeping them index-aligned.
    private static function init() {
        if (self::$initialized) {
            return;
        }
        foreach (array_merge(range(0x00, 0x1E), range(0x80, 0xFE)) as $byte) {
            self::$find[] = chr($byte);
            self::$replace[] = '&#' . $byte . ';';
        }
        self::$initialized = true;
    }

    /*
     * Replace illegal characters with their escaped HTML character, but don't touch anything else.
     */
    public static function getSaveValue($value) {
        self::init();
        return str_replace(self::$find, self::$replace, $value);
    }

    // Register an additional character and its replacement.
    public static function makeIllegal($find, $replace) {
        self::init();
        self::$find[] = $find;
        self::$replace[] = $replace;
    }
}
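To show the replacement-list idea on a small scale, the sketch below uses a deliberately shortened, hypothetical map of a few Windows-1252 bytes that Word tends to produce. It assumes the input is a single-byte (Windows-1252) string rather than UTF-8, since replacing individual high bytes in UTF-8 text would break multibyte characters:

```php
<?php
// Hypothetical, shortened replacement map: a few Windows-1252 bytes mapped
// to named HTML entities. Assumes single-byte (Windows-1252) input.
$map = [
    "\x85" => '&hellip;', // horizontal ellipsis
    "\x91" => '&lsquo;',  // left single quote
    "\x92" => '&rsquo;',  // right single quote
    "\x93" => '&ldquo;',  // left double quote
    "\x94" => '&rdquo;',  // right double quote
    "\x96" => '&ndash;',  // en dash
];

$input = "<p>\x93Hello\x94 \x96 it\x92s pasted from Word\x85</p>";

// Only the listed bytes are replaced, so the HTML tags stay untouched.
$clean = str_replace(array_keys($map), array_values($map), $input);
echo $clean, "\n";
// prints <p>&ldquo;Hello&rdquo; &ndash; it&rsquo;s pasted from Word&hellip;</p>
```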
This worked fine for me. If you have a string like "Letters u00e1 u00e9 etc.", it is turned into "Letters &#x00e1; &#x00e9; etc.", which a browser renders as "Letters á é etc.".
function unicode2html($str){
    // Set the locale to something that's UTF-8 capable
    setlocale(LC_ALL, 'en_US.UTF-8');
    // Convert the code points ("u" followed by four hex digits) to entities
    $str = preg_replace("/u([0-9a-fA-F]{4})/", "&#x\\1;", $str);
    // Transliterate the rest to ISO-8859-1; the ASCII entities pass through unchanged
    return iconv("UTF-8", "ISO-8859-1//TRANSLIT", $str);
}
$utf8string = html_entity_decode(preg_replace("/U\+([0-9A-F]{4})/", "&#x\\1;", $string), ENT_NOQUOTES, 'UTF-8');
is probably the simplest solution.
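Spelled out step by step, the entity-based approach looks roughly like this. The {4,6} quantifier and the ENT_HTML5 flag are assumptions added in this sketch so that code points above U+FFFF (such as U+1F600) are handled as well; they are not part of the one-liner above:

```php
<?php
// Step 1: rewrite each "U+XXXX" token as a hexadecimal numeric entity.
// {4,6} is an assumption made here to also match code points above U+FFFF.
$string = 'U+597D U+6211 U+1F600';
$entities = preg_replace('/U\+([0-9A-Fa-f]{4,6})/', '&#x$1;', $string);

// Step 2: let html_entity_decode() turn the entities into UTF-8 characters.
$utf8string = html_entity_decode($entities, ENT_NOQUOTES | ENT_HTML5, 'UTF-8');

echo $entities, "\n";   // prints &#x597D; &#x6211; &#x1F600;
echo $utf8string, "\n"; // prints 好 我 😀
```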
<?php
function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
$your_input='U+597D';
echo (chr_utf8(hexdec(ltrim($your_input,'U+'))));
// Output 好
If you want to use a callback function, you can try:
<?php
// Note: function chr_utf8 shown above is required
$your_input='U+597DU+6211';
$result=preg_replace_callback('#U\+([a-f0-9]+)#i',function($a){return chr_utf8(hexdec($a[1]));},$your_input);
echo $result;
// Output 好我
Check it at https://eval.in/748187
function utf8($num)
{
    if($num<=0x7F) return chr($num);
    if($num<=0x7FF) return chr(($num>>6)+192).chr(($num&63)+128);
    if($num<=0xFFFF) return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
    if($num<=0x1FFFFF) return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
    return '';
}
function uniord($c)
{
    // Square brackets replace the $c{0} offset syntax, which was removed in PHP 8
    $ord0 = ord($c[0]); if ($ord0>=0 && $ord0<=127) return $ord0;
    $ord1 = ord($c[1]); if ($ord0>=192 && $ord0<=223) return ($ord0-192)*64 + ($ord1-128);
    $ord2 = ord($c[2]); if ($ord0>=224 && $ord0<=239) return ($ord0-224)*4096 + ($ord1-128)*64 + ($ord2-128);
    $ord3 = ord($c[3]); if ($ord0>=240 && $ord0<=247) return ($ord0-240)*262144 + ($ord1-128)*4096 + ($ord2-128)*64 + ($ord3-128);
    return false;
}
utf8() and uniord() try to mirror the chr() and ord() functions in PHP:
echo utf8(0x6211)."\n";
echo uniord(utf8(0x6211))."\n";
echo "U+".dechex(uniord(utf8(0x6211)))."\n";
//In your case:
$wo='U+6211';
$hao='U+597D';
echo utf8(hexdec(str_replace("U+","", $wo)))."\n";
echo utf8(hexdec(str_replace("U+","", $hao)))."\n";
output:
我
25105
U+6211
我
好
mb_convert_encoding(
    preg_replace("/U\+([0-9A-F]*)/"
        ,"&#x\\1;"
        ,'U+597DU+6211'
    )
    ,"UTF-8"
    ,"HTML-ENTITIES"
);
works fine, too. (Note that the HTML-ENTITIES encoding in mbstring is deprecated as of PHP 8.2.)