bash - How do I create a frequency list of every word in a file?

Let's use AWK!

This function lists the frequency of each word that occurs in the input, in descending order:

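function wordfrequency() {
  awk '
    # Treat any run of non-letters as a field separator
    BEGIN { FS = "[^a-zA-Z]+" }
    {
      for (i = 1; i <= NF; i++) {
        word = tolower($i)
        # Skip the empty fields that leading or trailing punctuation produces
        if (word != "") words[word]++
      }
    }
    END {
      for (w in words)
        printf("%3d %s\n", words[w], w)
    }
  ' | sort -rn
}

You can call it on your file like this: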

$ cat your_file.txt | wordfrequency

Source: AWK-ward Ruby

I have a file like this:

This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.

I would like to generate a two-column list. The first column shows which words occur, and the second column shows how often each one occurs, for example:

this@1
is@1
a@1
file@1
with@1
many@1
words@3
some@2
of@2
the@2
only@1
appear@2
more@1
than@1
one@1
once@1
time@1

To keep this simple, before processing the list I will remove all punctuation and convert all text to lowercase.
Unless there is a simple way around it, "words" and "word" can count as two separate words.

So far, I have this:
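As a rough illustration of the goal, the preprocessing described above (strip punctuation, lowercase) and the `word@count` output can be sketched in Python, which one of the answers below also uses. The function name `word_frequencies` is made up for this example, and the sketch assumes that words are separated by whitespace and that only ASCII punctuation needs removing.

```python
import string
from collections import Counter

def word_frequencies(text):
    """Return a Counter of lowercased words with ASCII punctuation removed."""
    # Delete every ASCII punctuation character, then lowercase the text.
    cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
    # split() with no arguments splits on any run of whitespace.
    return Counter(cleaned.split())

sample = ("This is a file with many words. Some of the words appear more "
          "than once. Some of the words only appear one time.")
counts = word_frequencies(sample)

# A Counter keeps keys in first-insertion order, so the words are
# printed in the order in which they first appear in the text.
for word, n in counts.items():
    print(f"{word}@{n}")
```

As in the question, "words" and "word" would still be counted separately here.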

sed -i "s/ /\n/g" ./file1.txt # put all words on a new line
while read line
do
  count="$(grep -c $line file1.txt)"
  echo $line"@"$count >> file2.txt # add word and frequency to file
done < ./file1.txt
sort -u -d # remove duplicate lines

For some reason, this only shows "0" after each word.

How can I generate a list of every word that occurs in a file, along with its frequency?

Let's do it in Python 3!

"""Counts the frequency of each word in the given text; words are defined as
entities separated by whitespaces; punctuations and other symbols are ignored;
case-insensitive; input can be passed through stdin or through a file specified
as an argument; prints highest frequency words first"""
# Case-insensitive
# Ignore punctuations `~!@#$%^&*()_-+={}[]\|:;"'<>,.?/
import sys
# Find if input is being given through stdin or from a file
lines = None
if len(sys.argv) == 1:
    lines = sys.stdin
else:
    lines = open(sys.argv[1])
D = {}
for line in lines:
    for word in line.split():
        # Drop every punctuation character from the word
        word = ''.join(list(filter(
            lambda ch: ch not in "`~!@#$%^&*()_-+={}[]\\|:;\"'<>,.?/",
            word)))
        word = word.lower()
        # A token made only of punctuation is now empty; do not count it
        if not word:
            continue
        if word in D:
            D[word] += 1
        else:
            D[word] = 1
# Print the most frequent words first
for word in sorted(D, key=D.get, reverse=True):
    print(word + ' ' + str(D[word]))

We'll name this script "frequency.py" and add a line to "~/.bash_aliases":

alias freq="python3 /path/to/frequency.py"

Now, to find the word frequencies in your file "content.txt", run:

freq content.txt

You can also pipe the output of another command into it:

cat content.txt | freq

And even analyze the text of several files:

cat content.txt story.txt article.txt | freq

If you are using Python 2, simply replace

''.join(list(filter(args...))) with filter(args...)
python3 with python
print(whatever) with print whatever

Contents of the input file

$ cat inputFile.txt
This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.

Using sed | sort | uniq

$ sed 's/\.//g;s/\(.*\)/\L\1/;s/\ /\n/g' inputFile.txt | sort | uniq -c
      1 a
      2 appear
      1 file
      1 is
      1 many
      1 more
      2 of
      1 once
      1 one
      1 only
      2 some
      1 than
      2 the
      1 this
      1 time
      1 with
      3 words

uniq -ic will count while ignoring case, but the result list will contain This instead of this.

The sorting requires GNU AWK (gawk). If you have another AWK without asort(), this can easily be adjusted and the output piped to sort instead.
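To make that distinction concrete, here is a small Python sketch (the helper name `count_ignoring_case` is invented for this example). It counts case-insensitively but, like `uniq -ic`, reports the first spelling it encounters rather than a lowercased one.

```python
def count_ignoring_case(words):
    """Count words case-insensitively, keeping the first spelling seen."""
    seen = {}  # lowercase key -> [first spelling seen, count]
    for w in words:
        entry = seen.setdefault(w.lower(), [w, 0])
        entry[1] += 1
    return [(spelling, n) for spelling, n in seen.values()]

result = count_ignoring_case(["This", "this", "Some", "some", "SOME"])
print(result)  # [('This', 2), ('Some', 3)]
```

Lowercasing before counting, as the sed command does, avoids the mixed-case labels.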

awk '{gsub(/\./, ""); for (i = 1; i <= NF; i++) {w = tolower($i); count[w]++; words[w] = w}} END {qty = asort(words); for (w = 1; w <= qty; w++) print words[w] "@" count[words[w]]}' inputfile

Broken into multiple lines:

awk '{
    gsub(/\./, "");
    for (i = 1; i <= NF; i++) {
        w = tolower($i);
        count[w]++;
        words[w] = w
    }
}
END {
    qty = asort(words);
    for (w = 1; w <= qty; w++)
        print words[w] "@" count[words[w]]
}' inputfile

This might work for you:

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words. Some of the words appear more than once. Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

uniq -c already does what you want; just sort the input first:

echo 'a s d s d a s d s a a d d s a s d d s a' | tr ' ' '\n' | sort | uniq -c

Output:

      6 a
      7 d
      7 s
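The reason for sorting first is that uniq only merges adjacent identical lines. A short Python sketch of the same idea uses `itertools.groupby`, which likewise groups only adjacent equal items; the helper name `uniq_c` is hypothetical and merely mimics the count-then-value layout of `uniq -c`.

```python
from itertools import groupby

def uniq_c(items):
    """Count runs of adjacent equal items, as `uniq -c` does for lines."""
    return [(len(list(run)), key) for key, run in groupby(items)]

letters = "a s d s d a s d s a a d d s a s d d s a".split()

# Without sorting, equal letters are rarely adjacent, so there are many short runs.
unsorted_counts = uniq_c(letters)

# After sorting, each distinct letter forms exactly one run.
sorted_counts = uniq_c(sorted(letters))
print(sorted_counts)  # [(6, 'a'), (7, 'd'), (7, 's')]
```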

#!/usr/bin/env bash
# Count words with an associative array (requires bash 4 or later)
declare -A map
words="$1"
[[ -f $1 ]] || { echo "usage: $(basename "$0") wordfile"; exit 1; }
while read -r line; do
  for word in $line; do
    ((map[$word]++))
  done
done < "$words"
# The count is the fifth field of each output line, hence -k5
for key in "${!map[@]}"; do
  echo "the word $key appears ${map[$key]} times"
done | sort -nr -k5