java - emojis - emoticons unicode

Remove ✅, 🔥, ✈, ♛ and other such emojis/images/signs from Java strings

I have some strings with all kinds of different emojis/images/signs in them.

Not all of the strings are in English; some are in other, non-Latin languages, for example:

▓ railway??
→ Cats and dogs
I'm on 🔥
Apples ⚛
✅ Vi sign
♛ I'm the king ♛
Corée ♦ du Nord ☁ (French)
gjør at både ◄╗ (Norwegian)
Star me ★
Star ⭐ once more
早上好 ♛ (Chinese)
Καλημέρα ✂ (Greek)
another ✓ sign ✓
добрай раніцы ✪ (Belarus)
◄ शुभ प्रभात ◄ (Hindi)
✪ ✰ ❈ ❧ Let's get together ★. We shall meet at 12/10/2018 10:00 AM at Tony's.❉

... and many more of these.

I would like to get rid of all these signs/images and keep only the letters (and punctuation) in the different languages.

I tried to clean the signs using the EmojiParser library:

String withoutEmojis = EmojiParser.removeAllEmojis(input);

The problem is that EmojiParser cannot remove the majority of the signs. The ♦ sign is the only one I have found so far that it removes. Other signs such as ✪ ❉ ★ ✰ ❈ ❧ ✂ ❋ ⓡ ✿ ♛ 🔥 are not removed.

Is there a way to remove all these signs from the input strings and keep only the letters and punctuation in the different languages?

Based on the full Emoji list, v11.0, you have 1644 different Unicode code points to remove. For example, ✅ is on this list as U+2705.

Having the full list of emojis, you need to filter them out using code points. Iterating over single chars or bytes will not work, because a single code point can span multiple bytes. Because Java uses UTF-16, emojis will usually take two chars.

String input = "ab✅cd";
for (int i = 0; i < input.length();) {
    int cp = input.codePointAt(i);
    // filter out if matches
    i += Character.charCount(cp);
}

Mapping from the Unicode code point U+2705 to a Java int is straightforward:

int viSign = 0x2705;

or, since Java supports Unicode strings:

int viSign = "✅".codePointAt(0);
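To make the loop above concrete, here is a minimal sketch that drops every code point found in a set. The class name `CodePointFilter` and the three-entry `EMOJI` set are assumptions for illustration only; a real set would hold every code point from the Unicode emoji list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CodePointFilter {
    // Hypothetical, deliberately tiny sample: U+2705 (✅), U+1F525 (🔥), U+265B (♛).
    // A real list would be loaded from the Unicode emoji data.
    private static final Set<Integer> EMOJI =
            new HashSet<>(Arrays.asList(0x2705, 0x1F525, 0x265B));

    /** Returns the input without the code points contained in EMOJI. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!EMOJI.contains(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("🔥".length());               // 2 (two UTF-16 chars)
        System.out.println("🔥".codePointCount(0, 2));   // 1 (one code point)
        System.out.println(strip("ab✅cd🔥ef"));          // abcdef
    }
}
```

The first two lines of output show why iterating char by char is not enough: 🔥 occupies two chars but is a single code point.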

I gave some examples below, and thought that Latin would be enough, but...

Is there a way to remove all these signs from the input string and keep only the letters and punctuation in the different languages?

After the edit, I developed a new solution using the Character.getType method, and that appears to be the best shot at this.

package zmarcos.emoji;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TestEmoji {

    public static void main(String[] args) {
        String[] arr = {
            "Remove ✅, 🔥, ✈ , ♛ and other such signs from Java string",
            "→ Cats and dogs",
            "I'm on 🔥",
            "Apples ⚛ ",
            "✅ Vi sign",
            "♛ I'm the king ♛ ",
            "Star me ★",
            "Star ⭐ once more",
            "早上好 ♛",
            "Καλημέρα ✂"};

        // Attempt 1: keep only letters and whitespace.
        System.out.println("---only letters and spaces alike---\n");
        for (String input : arr) {
            int[] filtered = input.codePoints()
                    .filter((cp) -> Character.isLetter(cp) || Character.isWhitespace(cp))
                    .toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }

        // Attempt 2: keep only code points from whitelisted Unicode blocks.
        System.out.println("\n---unicode blocks white---\n");
        Set<Character.UnicodeBlock> whiteList = new HashSet<>();
        whiteList.add(Character.UnicodeBlock.BASIC_LATIN);
        for (String input : arr) {
            int[] filtered = input.codePoints()
                    .filter((cp) -> whiteList.contains(Character.UnicodeBlock.of(cp)))
                    .toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }

        // Attempt 3: drop code points from blacklisted Unicode blocks.
        System.out.println("\n---unicode blocks black---\n");
        Set<Character.UnicodeBlock> blackList = new HashSet<>();
        blackList.add(Character.UnicodeBlock.EMOTICONS);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_TECHNICAL);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_ARROWS);
        blackList.add(Character.UnicodeBlock.MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS);
        blackList.add(Character.UnicodeBlock.ALCHEMICAL_SYMBOLS);
        blackList.add(Character.UnicodeBlock.TRANSPORT_AND_MAP_SYMBOLS);
        blackList.add(Character.UnicodeBlock.GEOMETRIC_SHAPES);
        blackList.add(Character.UnicodeBlock.DINGBATS);
        for (String input : arr) {
            int[] filtered = input.codePoints()
                    .filter((cp) -> !blackList.contains(Character.UnicodeBlock.of(cp)))
                    .toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }

        // Attempt 4: keep only code points whose general category is allowed.
        // Commented-out entries are the categories that get removed.
        System.out.println("\n---category---\n");
        int[] category = {
            Character.COMBINING_SPACING_MARK, Character.CONNECTOR_PUNCTUATION,
            /*Character.CONTROL,*/ Character.CURRENCY_SYMBOL, Character.DASH_PUNCTUATION,
            Character.DECIMAL_DIGIT_NUMBER, Character.ENCLOSING_MARK, Character.END_PUNCTUATION,
            Character.FINAL_QUOTE_PUNCTUATION, /*Character.FORMAT,*/
            Character.INITIAL_QUOTE_PUNCTUATION, Character.LETTER_NUMBER, Character.LINE_SEPARATOR,
            Character.LOWERCASE_LETTER, /*Character.MATH_SYMBOL,*/ Character.MODIFIER_LETTER,
            /*Character.MODIFIER_SYMBOL,*/ Character.NON_SPACING_MARK, Character.OTHER_LETTER,
            Character.OTHER_NUMBER, Character.OTHER_PUNCTUATION, /*Character.OTHER_SYMBOL,*/
            Character.PARAGRAPH_SEPARATOR, /*Character.PRIVATE_USE,*/ Character.SPACE_SEPARATOR,
            Character.START_PUNCTUATION, /*Character.SURROGATE,*/ Character.TITLECASE_LETTER,
            /*Character.UNASSIGNED,*/ Character.UPPERCASE_LETTER};
        Arrays.sort(category); // binarySearch requires a sorted array
        for (String input : arr) {
            int[] filtered = input.codePoints()
                    .filter((cp) -> Arrays.binarySearch(category, Character.getType(cp)) >= 0)
                    .toArray();
            String result = new String(filtered, 0, filtered.length);
            System.out.println(input);
            System.out.println(result);
        }
    }
}

Output:

---only letters and spaces alike---

Remove ✅, 🔥, ✈ , ♛ and other such signs from Java string
Remove and other such signs from Java string
→ Cats and dogs
Cats and dogs
I'm on 🔥
Im on
Apples ⚛
Apples
✅ Vi sign
Vi sign
♛ I'm the king ♛
Im the king
Star me ★
Star me
Star ⭐ once more
Star once more
早上好 ♛
早上好
Καλημέρα ✂
Καλημέρα

---unicode blocks white---

Remove ✅, 🔥, ✈ , ♛ and other such signs from Java string
Remove , , , and other such signs from Java string
→ Cats and dogs
Cats and dogs
I'm on 🔥
I'm on
Apples ⚛
Apples
✅ Vi sign
Vi sign
♛ I'm the king ♛
I'm the king
Star me ★
Star me
Star ⭐ once more
Star once more
早上好 ♛

Καλημέρα ✂


---unicode blocks black---

Remove ✅, 🔥, ✈ , ♛ and other such signs from Java string
Remove , , , and other such signs from Java string
→ Cats and dogs
→ Cats and dogs
I'm on 🔥
I'm on
Apples ⚛
Apples
✅ Vi sign
Vi sign
♛ I'm the king ♛
I'm the king
Star me ★
Star me
Star ⭐ once more
Star once more
早上好 ♛
早上好
Καλημέρα ✂
Καλημέρα

---category---

Remove ✅, 🔥, ✈ , ♛ and other such signs from Java string
Remove , , , and other such signs from Java string
→ Cats and dogs
Cats and dogs
I'm on 🔥
I'm on
Apples ⚛
Apples
✅ Vi sign
Vi sign
♛ I'm the king ♛
I'm the king
Star me ★
Star me
Star ⭐ once more
Star once more
早上好 ♛
早上好
Καλημέρα ✂
Καλημέρα

The code works by streaming the String as code points. Lambdas then filter the code points into an int array, and finally the array is converted back to a String.
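The stream, filter, and rebuild steps are the same in all four attempts, so they can be pulled into one helper. This is only a sketch; the class name `CodePointStreams` and the method `keep` are made up for illustration and are not part of the code above.

```java
import java.util.function.IntPredicate;

class CodePointStreams {
    /** Keeps only the code points accepted by the predicate. */
    static String keep(String input, IntPredicate allowed) {
        int[] kept = input.codePoints()   // IntStream of code points, not chars
                .filter(allowed)          // drop everything the predicate rejects
                .toArray();
        // String(int[] codePoints, int offset, int count) rebuilds the text.
        return new String(kept, 0, kept.length);
    }

    public static void main(String[] args) {
        String result = keep("Star ⭐ once more",
                cp -> Character.isLetter(cp) || Character.isWhitespace(cp));
        System.out.println(result); // "Star  once more" (both spaces around the star remain)
    }
}
```

Each of the four filters then differs only in the predicate passed to `keep`.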

The letters and spaces filter uses the Character methods (Character.isLetter and Character.isWhitespace) to filter; it is not good with punctuation. Failed attempt.

The Unicode blocks white filter keeps only the Unicode blocks that the programmer specifies as allowed. Failed attempt.

The Unicode blocks black filter drops the Unicode blocks that the programmer specifies as not allowed. Failed attempt.

The category filter uses the static method Character.getType. The programmer can define in the category array which types are allowed. WORKS 😨😱😰😲😀.

Instead of blacklisting some elements, how about creating a whitelist of the characters you want to keep? This way you don't need to worry about every new emoji that gets added.

String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = aString.replaceAll(characterFilter, "");

So:

[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s] is a range representing all numeric (\\p{N}), letter (\\p{L}), mark (\\p{M}), punctuation (\\p{P}), whitespace/separator (\\p{Z}), other formatting (\\p{Cf}), and surrogate characters, which are used to encode code points above U+FFFF (\\p{Cs}), plus newline and other whitespace characters (\\s). \\p{L} specifically includes the characters from other alphabets such as Cyrillic, Latin, Kanji, etc.
The ^ in the regex character set negates the match.

Example:

String str = "hello world _# 皆さん、こんにちは！　私はジョンと申します。🔥";
System.out.print(str.replaceAll("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]", ""));
// Output:
// "hello world _# 皆さん、こんにちは！　私はジョンと申します。"

If you need more information, check out the Java documentation for regular expressions.
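If the filter runs on many strings, it may be worth compiling the pattern once instead of calling replaceAll with a string each time. A minimal sketch, where the class name `RegexWhitelist` and the method `strip` are assumptions for illustration:

```java
import java.util.regex.Pattern;

class RegexWhitelist {
    // Same character class as above, compiled once and reused.
    private static final Pattern NOT_ALLOWED =
            Pattern.compile("[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]");

    /** Removes every character that is not covered by the whitelist. */
    static String strip(String input) {
        return NOT_ALLOWED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("hello world _# 🔥")); // "hello world _# "
        System.out.println(strip("a+b=$5"));            // "ab5"
    }
}
```

Note the second call: none of the symbol categories (\\p{S}) are on the whitelist, so math and currency signs such as +, = and $ are removed along with the emojis. Add \\p{Sm} or \\p{Sc} to the class if you want to keep them.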

ICU4J is your friend.

UCharacter.hasBinaryProperty(UProperty.EMOJI);

Remember to keep your version of icu4j up to date, and note that this will only filter out official Unicode emoji, not symbol characters. Combine it with filtering out other character types as desired.

More information: http://icu-project.org/apiref/icu4j/com/ibm/icu/lang/UProperty.html#EMOJI

I'm not super into Java, so I won't try to write example code inline, but the way I would do this is to check what Unicode calls "the general category" of each character. There are a couple of letter and punctuation categories.

You can use Character.getType to find the general category of a given character. You should probably retain the characters that fall into these general categories:

COMBINING_SPACING_MARK
CONNECTOR_PUNCTUATION
CURRENCY_SYMBOL
DASH_PUNCTUATION
DECIMAL_DIGIT_NUMBER
ENCLOSING_MARK
END_PUNCTUATION
FINAL_QUOTE_PUNCTUATION
FORMAT
INITIAL_QUOTE_PUNCTUATION
LETTER_NUMBER
LINE_SEPARATOR
LOWERCASE_LETTER
MATH_SYMBOL
MODIFIER_LETTER
MODIFIER_SYMBOL
NON_SPACING_MARK
OTHER_LETTER
OTHER_NUMBER
OTHER_PUNCTUATION
PARAGRAPH_SEPARATOR
SPACE_SEPARATOR
START_PUNCTUATION
TITLECASE_LETTER
UPPERCASE_LETTER

(All of the characters you listed as specifically wanting to remove have the general category OTHER_SYMBOL, which I did not include in the category whitelist above.)
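A rough sketch of that idea in Java, written as the inverse of the list above: instead of naming the 25 retained categories, it drops the five that are absent from the list (OTHER_SYMBOL, CONTROL, PRIVATE_USE, SURROGATE, UNASSIGNED). The class and method names are assumptions for illustration.

```java
class GeneralCategoryFilter {
    /** True for the general categories that are NOT on the retain list. */
    static boolean isDropped(int cp) {
        switch (Character.getType(cp)) {
            case Character.OTHER_SYMBOL:   // ★ ✅ 🔥 ♛ and similar pictographs
            case Character.CONTROL:        // includes '\n' and '\t'
            case Character.PRIVATE_USE:
            case Character.SURROGATE:      // unpaired surrogates only
            case Character.UNASSIGNED:
                return true;
            default:
                return false;
        }
    }

    /** Removes every code point whose general category is dropped. */
    static String strip(String input) {
        int[] kept = input.codePoints().filter(cp -> !isDropped(cp)).toArray();
        return new String(kept, 0, kept.length);
    }

    public static void main(String[] args) {
        System.out.println(strip("Star ★ me")); // "Star  me"
        System.out.println(strip("1+1=2"));     // "1+1=2" (MATH_SYMBOL is retained)
    }
}
```

Two consequences of following the list literally: CONTROL is not on it, so line breaks and tabs are removed as well, and MATH_SYMBOL is on it, so characters such as → (U+2192, a math symbol) are kept. Adjust the switch if either is not what you want.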

Try this project: simple-emoji-4j

Compatible with Emoji 12.0 (2018.10.15)

Simple to use:

EmojiUtils.removeEmoji(str)

Use a jQuery plugin called RM-Emoji. Here's how it works:

$('#text').remove('emoji').fast()

This is the fast mode, which may miss some emojis because it uses heuristic algorithms to find emojis in the text. Use the .full() method to scan the entire string and remove all emojis, guaranteed.