
How do I strip headers/footers from Project Gutenberg texts?

I've tried various methods to strip the license from Project Gutenberg texts, for use as a corpus for a language-learning project, but I can't come up with a reliable, unsupervised approach. The best heuristic I've found so far is stripping the first 28 lines and the last 398, which worked for a large number of the texts. Any suggestions on how to strip this text automatically (it is very similar across many texts, but with slight differences in each case, and a few different templates as well), and on how to verify that it has been stripped accurately, would be very useful.
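For reference, the fixed-offset heuristic described above, together with one crude way to check its result, might look like the sketch below. The names `strip_fixed` and `leftover_boilerplate` and the list of telltale phrases are assumptions made for illustration; they do not come from any Gutenberg tooling.

```python
import re

# Phrases that should not survive stripping. This list is an assumption;
# extend it as new templates turn up.
TELLTALES = re.compile(
    r"project gutenberg|gutenberg\.org|\*\*\* ?(start|end) of",
    re.IGNORECASE,
)


def strip_fixed(text, head=28, tail=398):
    """Drop the first `head` and last `tail` lines (the heuristic from the question)."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return ""  # file too short for this heuristic to leave anything
    return "\n".join(lines[head:len(lines) - tail])


def leftover_boilerplate(stripped):
    """Return (line_number, line) pairs that still look like license text."""
    return [
        (number, line)
        for number, line in enumerate(stripped.splitlines(), start=1)
        if TELLTALES.search(line)
    ]
```

An empty result only shows that no known boilerplate phrase remains. It cannot show that body text was not cut away too, so spot-checking the first and last lines of a sample of files by eye is still worthwhile.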

You weren't kidding. It's almost as if they were trying to make the job AI-complete. I can think of only two approaches, neither of them perfect.

1) Set up a script in, say, Perl to tackle the most common patterns (for example, look for the phrase "produced by", keep going down to the next blank line, and cut there), but put in lots of assertions about what is expected (for example, the next text should be the title or the author). That way, when a pattern fails, you will know. The first time a pattern fails, do it by hand. The second time, modify the script.

2) Try Amazon Mechanical Turk.
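The first approach could be sketched roughly as follows. This is a minimal illustration and not a tested stripper: the function name `strip_header`, the `expected_first` check, and the `max_header` limit of 60 lines are all assumptions chosen for the example.

```python
def strip_header(lines, expected_first=None, max_header=60):
    """Return the lines that follow the "Produced by" credit block.

    Raises ValueError whenever the file does not fit the expected pattern,
    so an unfamiliar template is noticed instead of silently mangled.
    """
    # Locate the credit line, e.g. "Produced by ...".
    start = next(
        (i for i, line in enumerate(lines)
         if line.lower().startswith("produced by")),
        None,
    )
    if start is None:
        raise ValueError("no 'Produced by' line found")
    if start > max_header:
        raise ValueError(f"credit line suspiciously late (line {start + 1})")

    # Keep going down to the next blank line and cut there.
    end = start
    while end < len(lines) and lines[end].strip():
        end += 1
    # Skip the blank lines that separate the header from the text.
    while end < len(lines) and not lines[end].strip():
        end += 1
    if end == len(lines):
        raise ValueError("nothing left after the header")

    # Optional assertion: the text should open with the title or author.
    if expected_first and expected_first.lower() not in lines[end].lower():
        raise ValueError(f"unexpected first line: {lines[end]!r}")
    return lines[end:]
```

Files that raise an error go onto a pile to be handled by hand, and any failure that shows up a second time is a cue to add another pattern.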

I've also wanted a tool to strip Project Gutenberg headers and footers for years, so that I can play with natural language processing without contaminating the analysis with boilerplate mixed in with the etext. After reading this question I finally got around to it and wrote a Perl filter whose output you can pipe into any other tool.

It's built as a state machine using per-line regular expressions. It's written to be easy to understand, since speed is not an issue at the typical size of etexts. So far it works on the couple dozen etexts I have here, but in the wild there are sure to be many more variations that need to be added. Hopefully the code is clear enough that anybody can add to it:

    #!/usr/bin/perl

    # stripgutenberg.pl < in.txt > out.txt
    #
    # designed for piping
    # Written by Andrew Dunbar (hippietrail), released into the public domain, Dec 2010

    use strict;
    use warnings;

    my $debug = 0;              # set to 1 to trace state changes and skipped lines

    my $state   = 'beginning';
    my $print   = 0;            # true while inside the etext body
    my $printed = 0;            # number of body lines emitted

    while (<>) {
        # strip UTF-8 BOM
        if ($. == 1 && index($_, "\xef\xbb\xbf") == 0) {
            $_ = substr($_, 3);
        }

        if ($state eq 'beginning') {
            if (/^(The Project Gutenberg [Ee]Book( of|,)|Project Gutenberg's )/) {
                $state = 'normal pg header';
                $debug && print "state: beginning -> normal pg header\n";
                $print = 0;
            } elsif (/^$/) {
                $state = 'beginning blanks';
                $debug && print "state: beginning -> beginning blanks\n";
            } else {
                die "unrecognized beginning: $_";
            }
        } elsif ($state eq 'normal pg header') {
            if (/^\*\*\* ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK,? /) {
                $state = 'end of normal header';
                $debug && print "state: normal pg header -> end of normal header\n";
            } else {
                # body of normal pg header
            }
        } elsif ($state eq 'end of normal header') {
            if (/^(Produced by|Transcribed from)/) {
                $state = 'post header';
                $debug && print "state: end of normal header -> post header\n";
            } elsif (/^$/) {
                # blank lines
            } else {
                $state = 'etext body';
                $debug && print "state: end of normal header -> etext body\n";
                $print = 1;
            }
        } elsif ($state eq 'post header') {
            if (/^$/) {
                $state = 'blanks after post header';
                $debug && print "state: post header -> blanks after post header\n";
            } else {
                # multiline Produced / Transcribed
            }
        } elsif ($state eq 'blanks after post header') {
            if (/^$/) {
                # more blank lines
            } else {
                $state = 'etext body';
                $debug && print "state: blanks after post header -> etext body\n";
                $print = 1;
            }
        } elsif ($state eq 'beginning blanks') {
            # server-side include comment that opens Project Gutenberg of Australia files
            if (/^<!-- #INCLUDE .*-->$/) {
                $state = 'header include';
                $debug && print "state: beginning blanks -> header include\n";
            } elsif (/^Title: /) {
                $state = 'aus header';
                $debug && print "state: beginning blanks -> aus header\n";
            } elsif (/^$/) {
                # more blanks
            } else {
                die "unexpected stuff after beginning blanks: $_";
            }
        } elsif ($state eq 'header include') {
            if (/^$/) {
                # blanks after header include
            } else {
                $state = 'aus header';
                $debug && print "state: header include -> aus header\n";
            }
        } elsif ($state eq 'aus header') {
            if (/^To contact Project Gutenberg of Australia go to http:\/\/gutenberg\.net\.au$/) {
                $state = 'end of aus header';
                $debug && print "state: aus header -> end of aus header\n";
            } elsif (/^A Project Gutenberg of Australia eBook$/) {
                $state = 'end of aus header';
                $debug && print "state: aus header -> end of aus header\n";
            }
        } elsif ($state eq 'end of aus header') {
            if (/^((Title|Author): .*)?$/) {
                # title, author, or blank line
            } else {
                $state = 'etext body';
                $debug && print "state: end of aus header -> etext body\n";
                $print = 1;
            }
        } elsif ($state eq 'etext body') {
            # here's the stuff
            if (/^<!-- #INCLUDE .*-->$/) {
                # include comment that closes Project Gutenberg of Australia files
                $state = 'footer';
                $debug && print "state: etext body -> footer\n";
                $print = 0;
            } elsif (/^(\*\*\* ?)?end of (the )?project/i) {
                $state = 'footer';
                $debug && print "state: etext body -> footer\n";
                $print = 0;
            }
        } elsif ($state eq 'footer') {
            # nothing more of interest
        } else {
            die "unknown state '$state'";
        }

        if ($print) {
            print;
            ++$printed;
        } else {
            $debug && print "## $_";
        }
    }

Wow, this question is so old now. Still, the gutenbergr package in R seems to do a decent job of removing headers, including the junk that follows the "official" end of the header.

First you will need to install R / RStudio, then:

    install.packages('gutenbergr')
    library(gutenbergr)
    t <- gutenberg_download(25519)  # give it the id number of the text

The strip argument of gutenberg_download() is TRUE by default. You will probably also want to remove illustrations:

    library(data.table)
    t <- as.data.table(t)  # I hate tibbles -- data.tables are easier to work with
    head(t)                # get the column names

    # filter out the lines that are illustrations
    # '\\[' matches a literal [ character; the backslashes escape it because [ is special in a regex
    # !like() keeps the rows whose text column does not match the pattern
    no_il <- t[!like(text, '\\[Illustration'), 'text']

    # collapse the text into a single character string, joining the lines with a space
    t_cln <- do.call(paste, c(no_il, collapse = ' '))