Parsing huge XML files in PHP


I'm trying to parse the DMOZ content/structure XML files into MySQL, but all the existing scripts for this are very old and don't work well. How can I open a large (1 GB+) XML file in PHP for parsing?

This is very similar to the question "Best way to process large XML in PHP", but this one has a very good, upvoted answer that addresses the specific problem of parsing the DMOZ catalog. However, since this is a good Google hit for large XML files in general, I'll repost my answer from the other question here as well:

My take on it:

https://github.com/prewk/XmlStreamer

A simple class that extracts all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.

    class SimpleXmlStreamer extends XmlStreamer
    {
        public function processNode($xmlString, $elementName, $nodeIndex)
        {
            $xml = simplexml_load_string($xmlString);

            // Do something with your SimpleXML object

            return true;
        }
    }

    $streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
    $streamer->parse();

This isn't a great solution, but just to throw another option out there:

You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with is).

For example, if your document looks like this:

    <dmoz>
        <listing>....</listing>
        <listing>....</listing>
        <listing>....</listing>
        <listing>....</listing>
        <listing>....</listing>
        <listing>....</listing>
        ...
    </dmoz>

You can read it in a megabyte or two at a time, artificially wrap the few complete <listing> tags you loaded in a root-level tag, and then load them via simplexml/domxml (I used domxml when taking this approach).

Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that you're stuck with either the chunking strategy above or the old SAX/expat lib. And I don't know about the rest of you, but I HATE writing/maintaining SAX/expat parsers.

Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (for example, it works great for any sort of list of files, URLs, etc., but wouldn't make sense for parsing a large HTML document).
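A rough sketch of the chunking strategy described above, using simplexml rather than domxml. The function name `parseInChunks`, its parameters, and the `<root>` wrapper tag are hypothetical, and the sketch assumes the repeated element carries no attributes and is never nested inside itself:

```php
<?php
// Hypothetical helper (not from any library): reads $path in pieces of
// $chunkSize bytes and calls $callback with a SimpleXMLElement for every
// complete <$tag>...</$tag> element. Returns the number of elements seen.
// Assumption: <$tag> has no attributes and does not nest inside itself.
function parseInChunks(string $path, string $tag, callable $callback, int $chunkSize = 1048576): int
{
    $open  = "<$tag>";
    $close = "</$tag>";

    $fh = fopen($path, 'r');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $buffer = '';
    $count  = 0;

    while (!feof($fh)) {
        $buffer .= fread($fh, $chunkSize);

        // Locate the first opening tag and the last complete closing tag.
        $start     = strpos($buffer, $open);
        $lastClose = strrpos($buffer, $close);
        if ($start === false || $lastClose === false || $lastClose < $start) {
            continue; // no complete element buffered yet, keep reading
        }
        $end = $lastClose + strlen($close);

        // Wrap the complete elements in an artificial root and parse them.
        $xml = simplexml_load_string('<root>' . substr($buffer, $start, $end - $start) . '</root>');
        if ($xml === false) {
            fclose($fh);
            throw new RuntimeException('Chunk is not well-formed XML');
        }
        foreach ($xml->{$tag} as $element) {
            $callback($element);
            $count++;
        }

        // Keep only the unparsed tail (a partial element) for the next read.
        $buffer = substr($buffer, $end);
    }

    fclose($fh);
    return $count;
}
```

Calling `parseInChunks('myLargeXmlFile.xml', 'listing', $callback)` would hand each `<listing>` to `$callback` while holding roughly one chunk plus one partial element in memory at a time.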

I recently had to parse some pretty large XML documents and needed a method to read one element at a time.

If you have the following file, complex-test.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <Complex>
      <Object>
        <Title>Title 1</Title>
        <Name>It's name goes here</Name>
        <ObjectData>
          <Info1></Info1>
          <Info2></Info2>
          <Info3></Info3>
          <Info4></Info4>
        </ObjectData>
        <Date></Date>
      </Object>
      <Object></Object>
      <Object>
        <AnotherObject></AnotherObject>
        <Data></Data>
      </Object>
      <Object></Object>
      <Object></Object>
    </Complex>

And you wanted to return the <Object/>s:

PHP:

    require_once('class.chunk.php');

    $file = new Chunk('complex-test.xml', array('element' => 'Object'));

    while ($xml = $file->read()) {
        $obj = simplexml_load_string($xml);
        // do some parsing, insert to DB whatever
    }

Class file (class.chunk.php):

    <?php
    /**
     * Chunk
     *
     * Reads a large file in as chunks for easier parsing.
     *
     * The chunks returned are whole <$this->options['element']/>s found within file.
     *
     * Each call to read() returns the whole element including start and end tags.
     *
     * Tested with a 1.8MB file, extracted 500 elements in 0.11s
     * (with no work done, just extracting the elements)
     *
     * Usage:
     * <code>
     *   // initialize the object
     *   $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
     *
     *   // loop through the file until all lines are read
     *   while ($xml = $file->read()) {
     *     // do whatever you want with the string
     *     $o = simplexml_load_string($xml);
     *   }
     * </code>
     *
     * @package default
     * @author Dom Hastings
     */
    class Chunk {
      /**
       * options
       *
       * @var array Contains all major options
       * @access public
       */
      public $options = array(
        'path' => './',       // string The path to check for $file in
        'element' => '',      // string The XML element to return
        'chunkSize' => 512    // integer The amount of bytes to retrieve in each chunk
      );

      /**
       * file
       *
       * @var string The filename being read
       * @access public
       */
      public $file = '';

      /**
       * pointer
       *
       * @var integer The current position the file is being read from
       * @access public
       */
      public $pointer = 0;

      /**
       * handle
       *
       * @var resource The fopen() resource
       * @access private
       */
      private $handle = null;

      /**
       * reading
       *
       * @var boolean Whether the script is currently reading the file
       * @access private
       */
      private $reading = false;

      /**
       * readBuffer
       *
       * @var string Used to make sure start tags aren't missed
       * @access private
       */
      private $readBuffer = '';

      /**
       * __construct
       *
       * Builds the Chunk object
       *
       * @param string $file The filename to work with
       * @param array $options The options with which to parse the file
       * @author Dom Hastings
       * @access public
       */
      public function __construct($file, $options = array()) {
        // merge the options together
        $this->options = array_merge($this->options, (is_array($options) ? $options : array()));

        // check that the path ends with a /
        if (substr($this->options['path'], -1) != '/') {
          $this->options['path'] .= '/';
        }

        // normalize the filename
        $file = basename($file);

        // make sure chunkSize is an int
        $this->options['chunkSize'] = intval($this->options['chunkSize']);

        // check it's valid
        if ($this->options['chunkSize'] < 64) {
          $this->options['chunkSize'] = 512;
        }

        // set the filename
        $this->file = realpath($this->options['path'].$file);

        // check the file exists
        if (!file_exists($this->file)) {
          throw new Exception('Cannot load file: '.$this->file);
        }

        // open the file
        $this->handle = fopen($this->file, 'r');

        // check the file opened successfully
        if (!$this->handle) {
          throw new Exception('Error opening file for reading');
        }
      }

      /**
       * __destruct
       *
       * Cleans up
       *
       * @return void
       * @author Dom Hastings
       * @access public
       */
      public function __destruct() {
        // close the file resource
        fclose($this->handle);
      }

      /**
       * read
       *
       * Reads the first available occurrence of the XML element $this->options['element']
       *
       * @return string The XML string from $this->file
       * @author Dom Hastings
       * @access public
       */
      public function read() {
        // check we have an element specified
        if (!empty($this->options['element'])) {
          // trim it
          $element = trim($this->options['element']);

        } else {
          $element = '';
        }

        // initialize the buffer
        $buffer = false;

        // if the element is empty
        if (empty($element)) {
          // let the script know we're reading
          $this->reading = true;

          // read in the whole doc, cos we don't know what's wanted
          while ($this->reading) {
            $buffer .= fread($this->handle, $this->options['chunkSize']);

            $this->reading = (!feof($this->handle));
          }

          // return it all
          return $buffer;

        // we must be looking for a specific element
        } else {
          // set up the strings to find
          $open = '<'.$element.'>';
          $close = '</'.$element.'>';

          // let the script know we're reading
          $this->reading = true;

          // reset the global buffer
          $this->readBuffer = '';

          // this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
          $store = false;

          // seek to the position we need in the file
          fseek($this->handle, $this->pointer);

          // start reading
          while ($this->reading && !feof($this->handle)) {
            // store the chunk in a temporary variable
            $tmp = fread($this->handle, $this->options['chunkSize']);

            // update the global buffer
            $this->readBuffer .= $tmp;

            // check for the open string
            $checkOpen = strpos($tmp, $open);

            // if it wasn't in the new buffer
            if (!$checkOpen && !($store)) {
              // check the full buffer (in case it was only half in this buffer)
              $checkOpen = strpos($this->readBuffer, $open);

              // if it was in there
              if ($checkOpen) {
                // set it to the remainder
                $checkOpen = $checkOpen % $this->options['chunkSize'];
              }
            }

            // check for the close string
            $checkClose = strpos($tmp, $close);

            // if it wasn't in the new buffer
            if (!$checkClose && ($store)) {
              // check the full buffer (in case it was only half in this buffer)
              $checkClose = strpos($this->readBuffer, $close);

              // if it was in there
              if ($checkClose) {
                // set it to the remainder plus the length of the close string itself
                $checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
              }

            // if it was
            } elseif ($checkClose) {
              // add the length of the close string itself
              $checkClose += strlen($close);
            }

            // if we've found the opening string and we're not already reading another element
            if ($checkOpen !== false && !($store)) {
              // if we've found the end element too
              if ($checkClose !== false) {
                // append the string only between the start and end element
                $buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));

                // update the pointer
                $this->pointer += $checkClose;

                // let the script know we're done
                $this->reading = false;

              } else {
                // append the data we know to be part of this element
                $buffer .= substr($tmp, $checkOpen);

                // update the pointer
                $this->pointer += $this->options['chunkSize'];

                // let the script know we're gonna be storing all the data until we find the close element
                $store = true;
              }

            // if we've found the closing element
            } elseif ($checkClose !== false) {
              // update the buffer with the data up to and including the close tag
              $buffer .= substr($tmp, 0, $checkClose);

              // update the pointer
              $this->pointer += $checkClose;

              // let the script know we're done
              $this->reading = false;

            // if we've found the closing element, but half in the previous chunk
            } elseif ($store) {
              // update the buffer
              $buffer .= $tmp;

              // and the pointer
              $this->pointer += $this->options['chunkSize'];
            }
          }
        }

        // return the element (or the whole file if we're not looking for elements)
        return $buffer;
      }
    }

There are only two PHP APIs that are really suited to processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read continuous streams rather than loading the entire tree into memory (which is what simplexml and DOM do).

For an example, you may want to look at this partial parser of the DMOZ catalog:

    <?php

    class SimpleDMOZParser
    {
        protected $_stack = array();
        protected $_file = "";
        protected $_parser = null;

        protected $_currentId = "";
        protected $_current = "";

        public function __construct($file)
        {
            $this->_file = $file;

            $this->_parser = xml_parser_create("UTF-8");
            xml_set_object($this->_parser, $this);
            xml_set_element_handler($this->_parser, "startTag", "endTag");
        }

        public function startTag($parser, $name, $attribs)
        {
            array_push($this->_stack, $this->_current);

            // remember the ID of the topic we are currently inside
            if ($name == "TOPIC" && count($attribs)) {
                $this->_currentId = $attribs["R:ID"];
            }

            // print the links that belong to one category subtree
            if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
                echo $attribs["R:RESOURCE"] . "\n";
            }

            $this->_current = $name;
        }

        public function endTag($parser, $name)
        {
            $this->_current = array_pop($this->_stack);
        }

        public function parse()
        {
            $fh = fopen($this->_file, "r");
            if (!$fh) {
                die("Epic fail!\n");
            }

            // feed the file to expat 4 KB at a time
            while (!feof($fh)) {
                $data = fread($fh, 4096);
                xml_parse($this->_parser, $data, feof($fh));
            }
        }
    }

    $parser = new SimpleDMOZParser("content.rdf.u8");
    $parser->parse();
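For comparison, here is a minimal sketch of the other streaming API mentioned above, XMLReader. The function name `streamElements` and its parameters are hypothetical. The sketch assumes each matching element is small enough to be handed to SimpleXML on its own, and it has not been tried against the real DMOZ dump:

```php
<?php
// Hypothetical helper (not from any library): streams $path with XMLReader
// and calls $callback with a SimpleXMLElement for each <$tag> element.
// Returns the number of matching elements. Only the current element is
// expanded in memory, never the whole document.
function streamElements(string $path, string $tag, callable $callback): int
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    $count = 0;
    while ($reader->read()) {
        // React only to opening tags with the requested name.
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tag) {
            // readOuterXml() returns the element, its tags included, as a string.
            $callback(new SimpleXMLElement($reader->readOuterXml()));
            $count++;
        }
    }

    $reader->close();
    return $count;
}
```

With the complex-test.xml file shown earlier, `streamElements('complex-test.xml', 'Object', $callback)` would call `$callback` once per `<Object>` element.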

I suggest using a SAX-based parser rather than DOM-based parsing.

Information on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm