
Can I load an entire HTML document into a document fragment in Internet Explorer?

Here is something I have been having a bit of difficulty with. I have a local client-side script that needs to let a user fetch a remote web page and search the resulting page for forms. To do this (without regex), I need to parse the document into a fully traversable DOM object.

Some limitations I would like to stress:

I don't want to use libraries (like jQuery). There is too much bloat for what I need to do here.
Under no circumstances should scripts from the remote page be executed (for security reasons).
DOM APIs, such as getElementsByTagName, need to be available.
It only needs to work in Internet Explorer, but in 7 at the very least.
Assume I don't have access to a server. I do, but I can't use it for this.

What I've tried

Assuming I have a complete HTML document string (including the DOCTYPE declaration) in the variable html, here is what I have tried so far:

    var frag = document.createDocumentFragment(),
        div  = frag.appendChild(document.createElement("div"));

    div.outerHTML = html;
    //-> results in an empty fragment

    div.insertAdjacentHTML("afterEnd", html);
    //-> HTML is not added to the fragment

    div.innerHTML = html;
    //-> Error (expected, but I tried it anyway)

    var doc = new ActiveXObject("htmlfile");
    doc.write(html);
    doc.close();
    //-> JavaScript executes

I have also tried extracting the <head> and <body> nodes from the HTML and adding them to an <HTML> element inside the fragment, still with no luck.

Does anyone have any ideas?

I just came across this page and am a bit late to be of use :) but the following should help anyone with a similar problem in the future. That said, IE7/8 should really be ignored by now, and there are much better methods supported by more modern browsers.

The following works in nearly everything I have tested. The only two downsides are:

I have added bespoke getElementById and getElementsByName functions to the root div element, so they will not appear as expected further down the tree (unless the code is modified to cater for this).
The doctype will be ignored. I don't think this makes much difference, though, as my experience is that the doctype does not affect how the DOM is structured, only how it is rendered (which obviously won't happen with this method).

Basically, the system relies on the fact that <tag> and <namespace:tag> are treated differently by user agents. As has been found, certain special tags cannot exist inside a div element, so they are removed. Namespaced elements can be placed anywhere (unless there is a DTD stating otherwise). While these namespaced tags won't actually behave like the real tags in question, we are only using them for their structural position in the document, so this is not really a problem.

The markup and code are as follows:

    <!DOCTYPE html>
    <html>
    <head>
    <script>

      /// function for parsing HTML source to a dom structure
      /// Tested in Mac OSX, Win 7, Win XP with FF, IE 7/8/9,
      /// Chrome, Safari & Opera.
      function parseHTML(src){

        /// create a random div, this will be our root
        var div = document.createElement('div'),
            /// specify our namespace prefix
            ns = 'faux:',
            /// state which tags we will treat as "special"
            stn = ['html','head','body','title'],
            /// the reg exp for replacing the special tags
            re = new RegExp('<(/?)('+stn.join('|')+')([^>]*)?>','gi'),
            /// remember the getElementsByTagName function before we override it
            gtn = div.getElementsByTagName;

        /// a quick function to namespace certain tag names
        var nspace = function(tn){
          if ( stn.indexOf ) {
            return stn.indexOf(tn) != -1 ? ns + tn : tn;
          }
          else {
            /// fallback for browsers without Array.prototype.indexOf;
            /// the delimiters make sure only whole names match
            return ('|'+stn.join('|')+'|').indexOf('|'+tn+'|') != -1 ? ns + tn : tn;
          }
        };

        /// search and replace our source so that special tags are namespaced
        /// &nbsp; required for IE7/8 to render tags before first text found
        /// <faux:check /> tag added so we can test how namespaces work
        src = '&nbsp;<'+ns+'check />' + src.replace(re,'<$1'+ns+'$2$3>');

        /// inject to the div
        div.innerHTML = src;

        /// quick test to see how we support namespaces in TagName searches
        if ( !div.getElementsByTagName(ns+'check').length ) {
          ns = '';
        }

        /// create our replacement getByName and getById functions
        var createGetElementByAttr = function(attr, collect){
          var func = function(a,w){
            var i,c,e,f,l,o; w = w||[];
            if ( this.nodeType == 1 ) {
              if ( this.getAttribute(attr) == a ) {
                if ( collect ) {
                  w.push(this);
                }
                else {
                  return this;
                }
              }
            }
            else {
              return false;
            }
            if ( (c = this.childNodes) && (l = c.length) ) {
              for( i=0; i<l; i++ ){
                if( (e = c[i]) && (e.nodeType == 1) ) {
                  if ( (f = func.call( e, a, w )) && !collect ) {
                    return f;
                  }
                }
              }
            }
            return (w.length?w:false);
          }
          return func;
        }

        /// apply these replacement functions to the div container, obviously
        /// you could add these to prototypes for browsers that support element
        /// constructors. For other browsers you could step each element and
        /// apply the functions through-out the node tree... however this would
        /// be quite messy, far better just to always call from the root node -
        /// or use div.getElementsByTagName.call( localElement, 'tag' );
        div.getElementsByTagName = function(t){return gtn.call(this,nspace(t));}
        div.getElementsByName    = createGetElementByAttr('name', true);
        div.getElementById       = createGetElementByAttr('id', false);

        /// return the final element
        return div;
      }

      window.onload = function(){

        /// parse the HTML source into a node tree
        var dom = parseHTML( document.getElementById('source').innerHTML );

        /// test some look ups :)
        var a = dom.getElementsByTagName('head'),
            b = dom.getElementsByTagName('title'),
            c = dom.getElementsByTagName('script'),
            d = dom.getElementById('body');

        /// alert the result
        alert(a[0].innerHTML);
        alert(b[0].innerHTML);
        alert(c[0].innerHTML);
        alert(d.innerHTML);

      }
    </script>
    </head>
    <body>
      <xmp id="source">
        <!DOCTYPE html>
        <html>
        <head>
          <meta charset="utf-8">
          <meta name="robots" content="index, follow">
          <title>An example</title>
          <link href="test.css" />
          <script>alert('of parsing..');</script>
        </head>
        <body id="body">
          <b>in a similar way to createDocumentFragment</b>
        </body>
        </html>
      </xmp>
    </body>
    </html>
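One caveat about the renaming regex in parseHTML: it only checks how the tag name starts, so it also rewrites tags such as <header> into <faux:header>, which a later getElementsByTagName('header') lookup would then miss. Below is a minimal sketch of the renaming step on its own, with a lookahead added so that only exact tag names match. The helper name namespaceSpecialTags and the lookahead are assumptions made for illustration; they are not part of the answer's code.

```javascript
// Hypothetical helper (not from the original answer): prefix only the
// "special" tags, requiring whitespace, "/" or ">" right after the tag name
// so that longer names like <header> or <bodytext> are left alone.
function namespaceSpecialTags(src, prefix) {
    var special = ['html', 'head', 'body', 'title'];
    var re = new RegExp('<(/?)(' + special.join('|') + ')(?=[\\s/>])', 'gi');
    // $1 is the optional closing slash, $2 the tag name as written in the source
    return src.replace(re, '<$1' + prefix + '$2');
}

var renamed = namespaceSpecialTags(
    '<html><head><title>x</title></head><body><header>h</header></body></html>',
    'faux:'
);
// renamed keeps <header> untouched while html, head, title and body gain the prefix
```

Attributes are not touched by this version, because the lookahead does not consume anything after the tag name.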

I'm not sure why you're messing with documentFragments; you can simply set the HTML text as the innerHTML of a new div element. You can then use that div element for getElementsByTagName etc. without adding the div to the DOM:

    var htmlText = '<html><head><title>Test</title></head><body><div id="test_ele1">this is test_ele1 content</div><div id="test_ele2">this is test_ele content2</div></body></html>';

    var d = document.createElement('div');
    d.innerHTML = htmlText;

    console.log(d.getElementsByTagName('div'));

If you're really married to the idea of a documentFragment, you can use this code, but you will still have to wrap it in a div to get the DOM functions you are after:

    function makeDocumentFragment(htmlText) {
        var range = document.createRange();
        var frag = range.createContextualFragment(htmlText);
        var d = document.createElement('div');
        d.appendChild(frag);
        return d;
    }

I'm not sure whether IE supports document.implementation.createHTMLDocument, but if it does, use this algorithm (adapted from my DOMParser HTML extension). Note that the DOCTYPE will not be preserved:

    var doc = document.implementation.createHTMLDocument(""),
        doc_elt = doc.documentElement,
        first_elt;

    doc_elt.innerHTML = your_html_here;
    first_elt = doc_elt.firstElementChild;

    if ( // are we dealing with an entire document or a fragment?
        doc_elt.childElementCount === 1 &&
        first_elt.tagName.toLowerCase() === "html"
    ) {
        doc.replaceChild(first_elt, doc_elt);
    }

    // doc is an HTML document
    // you can now reference stuff like doc.title, etc.

To use full HTML DOM capabilities without triggering any requests, and without having to deal with incompatibilities:

    var doc = document.cloneNode();
    if (!doc.documentElement) {
        doc.appendChild(doc.createElement('html'));
        doc.documentElement.appendChild(doc.createElement('head'));
        doc.documentElement.appendChild(doc.createElement('body'));
    }

All set! doc is an HTML document, but it is not online.

Assuming the HTML is also valid XML, you can use loadXML().
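A minimal sketch of what that could look like, assuming the markup really is well-formed XML. The function name parseXmlString is hypothetical; the "Microsoft.XMLDOM" ActiveX object with its loadXML method is the old-IE route, and the DOMParser branch is added here only as a fallback for other browsers. Note that the result is an XML document, not an HTML one, so HTML-specific properties such as document.forms are not available on it.

```javascript
// Hypothetical helper: parse a well-formed XML string into a document
// that supports getElementsByTagName. Nothing is rendered or executed,
// because the result is a plain XML tree.
function parseXmlString(markup) {
    var hasActiveX = typeof window !== "undefined" && "ActiveXObject" in window;
    if (hasActiveX) {
        // Old Internet Explorer: MSXML parser exposed through ActiveX
        var xml = new ActiveXObject("Microsoft.XMLDOM");
        xml.async = false;
        if (!xml.loadXML(markup)) {
            throw new Error("Not well-formed XML: " + xml.parseError.reason);
        }
        return xml;
    }
    if (typeof DOMParser !== "undefined") {
        // Fallback for browsers that provide DOMParser instead
        return new DOMParser().parseFromString(markup, "application/xml");
    }
    throw new Error("No XML parser available in this environment");
}

// Usage (in a browser):
// var forms = parseXmlString("<html><body><form id='f'/></body></html>")
//                 .getElementsByTagName("form");
```

Most real-world pages are not well-formed XML, so this route only helps when you control or can pre-clean the markup.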

DocumentFragment does not support getElementsByTagName; that is only supported by Document.

You may need to use a library like jsdom, which provides an implementation of the DOM that you can search using getElementsByTagName and other DOM APIs. You can also configure it not to execute scripts. Yes, it is 'heavy', and I don't know whether it works in IE 7.

Fiddle : http://jsfiddle.net/JFSKe/6/

DocumentFragment does not implement DOM methods. Using document.createElement in conjunction with innerHTML removes the <head> and <body> tags (even when the created element is a root element, <html>). Therefore, the solution has to be sought elsewhere. I have created a cross-browser string-to-DOM function, which makes use of an invisible inline frame.

All external resources and scripts will be disabled. See Explanation of the code for more information.

Code

    /*
     @param String html    The string with HTML which has be converted to a DOM object
     @param func callback  (optional) Callback(HTMLDocument doc, function destroy)
     @returns              undefined if callback exists, else: Object
                            HTMLDocument doc  DOM fetched from Parameter:html
                            function destroy  Removes HTMLDocument doc.         */
    function string2dom(html, callback){
        /* Sanitise the string */
        html = sanitiseHTML(html); /*Defined at the bottom of the answer*/

        /* Create an IFrame */
        var iframe = document.createElement("iframe");
        iframe.style.display = "none";
        document.body.appendChild(iframe);

        var doc = iframe.contentDocument || iframe.contentWindow.document;
        doc.open();
        doc.write(html);
        doc.close();

        function destroy(){
            iframe.parentNode.removeChild(iframe);
        }
        if(callback) callback(doc, destroy);
        else return {"doc": doc, "destroy": destroy};
    }

    /* @name sanitiseHTML
       @param String html  A string representing HTML code
       @return String      A new string, fully stripped of external resources.
                           All "external" attributes (href, src) are prefixed by data- */
    function sanitiseHTML(html){
        /* Adds a <!--"'--> before every matched tag, so that unterminated quotes
            aren't preventing the browser from splitting a tag. Test case:
           '<input style="foo;b:url(0);><input onclick="<input type=button onclick="too() href=;>">' */
        var prefix = "<!--\"'-->";
        /*Attributes should not be prefixed by these characters. This list is not
         complete, but will be sufficient for this function.
          (see http://www.w3.org/TR/REC-xml/#NT-NameChar) */
        var att = "[^-a-z0-9:._]";
        var tag = "<[a-z]";
        var any = "(?:[^<>\"']*(?:\"[^\"]*\"|'[^']*'))*?[^<>]*";
        var etag = "(?:>|(?=<))";

        /*
          @name ae
          @description          Converts a given string in a sequence of the
                                 original input and the HTML entity
          @param String string  String to convert
          */
        var entityEnd = "(?:;|(?!\\d))";
        var ents = {" ":"(?:\\s|&nbsp;?|&#0*32"+entityEnd+"|&#x0*20"+entityEnd+")",
                    "(":"(?:\\(|&#0*40"+entityEnd+"|&#x0*28"+entityEnd+")",
                    ")":"(?:\\)|&#0*41"+entityEnd+"|&#x0*29"+entityEnd+")",
                    ".":"(?:\\.|&#0*46"+entityEnd+"|&#x0*2e"+entityEnd+")"};
                    /*Placeholder to avoid tricky filter-circumventing methods*/
        var charMap = {};
        var s = ents[" "]+"*"; /* Short-hand space */
        /* Important: Must be pre- and postfixed by < and >. RE matches a whole tag! */
        function ae(string){
            var all_chars_lowercase = string.toLowerCase();
            if(ents[string]) return ents[string];
            var all_chars_uppercase = string.toUpperCase();
            var RE_res = "";
            for(var i=0; i<string.length; i++){
                var char_lowercase = all_chars_lowercase.charAt(i);
                if(charMap[char_lowercase]){
                    RE_res += charMap[char_lowercase];
                    continue;
                }
                var char_uppercase = all_chars_uppercase.charAt(i);
                var RE_sub = [char_lowercase];
                RE_sub.push("&#0*" + char_lowercase.charCodeAt(0) + entityEnd);
                RE_sub.push("&#x0*" + char_lowercase.charCodeAt(0).toString(16) + entityEnd);
                if(char_lowercase != char_uppercase){
                    RE_sub.push("&#0*" + char_uppercase.charCodeAt(0) + entityEnd);
                    RE_sub.push("&#x0*" + char_uppercase.charCodeAt(0).toString(16) + entityEnd);
                }
                RE_sub = "(?:" + RE_sub.join("|") + ")";
                RE_res += (charMap[char_lowercase] = RE_sub);
            }
            return(ents[string] = RE_res);
        }

        /*
          @name by
          @description  second argument for the replace function.
          */
        function by(match, group1, group2){
            /* Adds a data-prefix before every external pointer */
            return group1 + "data-" + group2
        }

        /*
          @name cr
          @description            Selects a HTML element and performs a
                                      search-and-replace on attributes
          @param String selector  HTML substring to match
          @param String attribute RegExp-escaped; HTML element attribute to match
          @param String marker    Optional RegExp-escaped; marks the prefix
          @param String delimiter Optional RegExp escaped; non-quote delimiters
          @param String end       Optional RegExp-escaped; forces the match to
                                      end before an occurence of <end> when
                                      quotes are missing
         */
        function cr(selector, attribute, marker, delimiter, end){
            if(typeof selector == "string") selector = new RegExp(selector, "gi");
            marker = typeof marker == "string" ? marker : "\\s*=";
            delimiter = typeof delimiter == "string" ? delimiter : "";
            end = typeof end == "string" ? end : "";
            var is_end = end && "?";
            var re1 = new RegExp("("+att+")("+attribute+marker+"(?:\\s*\"[^\""+delimiter+"]*\"|\\s*'[^'"+delimiter+"]*'|[^\\s"+delimiter+"]+"+is_end+")"+end+")", "gi");
            html = html.replace(selector, function(match){
                return prefix + match.replace(re1, by);
            });
        }

        /*
          @name cri
          @description            Selects an attribute of a HTML element, and
                                   performs a search-and-replace on certain values
          @param String selector  HTML element to match
          @param String attribute RegExp-escaped; HTML element attribute to match
          @param String front     RegExp-escaped; attribute value, prefix to match
          @param String flags     Optional RegExp flags, default "gi"
          @param String delimiter Optional RegExp-escaped; non-quote delimiters
          @param String end       Optional RegExp-escaped; forces the match to
                                      end before an occurence of <end> when
                                      quotes are missing
         */
        function cri(selector, attribute, front, flags, delimiter, end){
            if(typeof selector == "string") selector = new RegExp(selector, "gi");
            flags = typeof flags == "string" ? flags : "gi";
            var re1 = new RegExp("("+att+attribute+"\\s*=)((?:\\s*\"[^\"]*\"|\\s*'[^']*'|[^\\s>]+))", "gi");

            end = typeof end == "string" ? end + ")" : ")";
            var at1 = new RegExp('(")('+front+'[^"]+")', flags);
            var at2 = new RegExp("(')("+front+"[^']+')", flags);
            var at3 = new RegExp("()("+front+'(?:"[^"]+"|\'[^\']+\'|(?:(?!'+delimiter+').)+)'+end, flags);

            var handleAttr = function(match, g1, g2){
                if(g2.charAt(0) == '"') return g1+g2.replace(at1, by);
                if(g2.charAt(0) == "'") return g1+g2.replace(at2, by);
                return g1+g2.replace(at3, by);
            };
            html = html.replace(selector, function(match){
                return prefix + match.replace(re1, handleAttr);
            });
        }

        /* <meta http-equiv=refresh content="  ; url= " > */
        html = html.replace(new RegExp("<meta"+any+att+"http-equiv\\s*=\\s*(?:\""+ae("refresh")+"\""+any+etag+"|'"+ae("refresh")+"'"+any+etag+"|"+ae("refresh")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "gi"), "");

        /* Stripping all scripts */
        html = html.replace(new RegExp("<script"+any+">\\s*//\\s*<\\[CDATA\\[[\\S\\s]*?]]>\\s*</script[^>]*>", "gi"), "");
        html = html.replace(/<script[\S\s]+?<\/script\s*>/gi, "");
        cr(tag+any+att+"on[-a-z0-9:_.]+="+any+etag, "on[-a-z0-9:_.]+"); /* Event listeners */

        cr(tag+any+att+"href\\s*="+any+etag, "href"); /* Linked elements */
        cr(tag+any+att+"src\\s*="+any+etag, "src"); /* Embedded elements */

        cr("<object"+any+att+"data\\s*="+any+etag, "data"); /* <object data= > */
        cr("<applet"+any+att+"codebase\\s*="+any+etag, "codebase"); /* <applet codebase= > */

        /* <param name=movie value= >*/
        cr("<param"+any+att+"name\\s*=\\s*(?:\""+ae("movie")+"\""+any+etag+"|'"+ae("movie")+"'"+any+etag+"|"+ae("movie")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "value");

        /* <style> and < style=  > url()*/
        cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "url", "\\s*\\(\\s*", "", "\\s*\\)");
        cri(tag+any+att+"style\\s*="+any+etag, "style", ae("url")+s+ae("(")+s, 0, s+ae(")"), ae(")"));

        /* IE7- CSS expression() */
        cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "expression", "\\s*\\(\\s*", "", "\\s*\\)");
        cri(tag+any+att+"style\\s*="+any+etag, "style", ae("expression")+s+ae("(")+s, 0, s+ae(")"), ae(")"));
        return html.replace(new RegExp("(?:"+prefix+")+", "g"), prefix);
    }

Explanation of the code

The sanitiseHTML function is based on my replace_all_rel_by_abs function (see this answer). The sanitiseHTML function is completely rewritten, though, in order to achieve maximum efficiency and reliability.

Additionally, a new set of RegExps is added to remove all scripts and event handlers (including CSS expression(), IE7-). To make sure that all tags are parsed as expected, the adjusted tags are prefixed by <!--"'-->. This prefix is necessary to correctly parse nested "event handlers" in conjunction with unterminated quotes: <a id="><input onclick="<div onmousemove=evil()>"> .

These RegExps are dynamically created using the internal functions cr / cri (Create Replace [Inline]). These functions accept a list of arguments, and create and execute an advanced RE replacement. To make sure that HTML entities aren't breaking a RegExp (refresh in <meta http-equiv=refresh> could be written in various ways), the dynamically created RegExps are partially constructed by the function ae (Any Entity).
The actual replacements are done by the function by (replace by). In this implementation, by adds data- before all matched attributes.
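To make the idea behind ae more concrete, here is a simplified sketch of the same technique: for each character of a keyword, build an alternation that accepts the literal character or its decimal or hexadecimal character reference, in either case. The name entityTolerantPattern is hypothetical, and unlike ae this version does no caching, does not special-case characters such as spaces or parentheses, and assumes the keyword consists of plain letters and digits only.

```javascript
// Simplified, hypothetical variant of ae(): returns a RegExp source string
// that matches `word` even when some characters are written as numeric
// character references (e.g. "r&#101;fresh" for "refresh").
// Assumption: `word` contains only letters and digits, so no RegExp escaping is needed.
function entityTolerantPattern(word) {
    var entityEnd = "(?:;|(?!\\d))"; // a reference ends with ";" or before a non-digit
    var parts = [];
    for (var i = 0; i < word.length; i++) {
        var lower = word.charAt(i).toLowerCase();
        var upper = word.charAt(i).toUpperCase();
        var alternatives = [
            lower,
            "&#0*" + lower.charCodeAt(0) + entityEnd,
            "&#x0*" + lower.charCodeAt(0).toString(16) + entityEnd
        ];
        if (lower !== upper) {
            alternatives.push("&#0*" + upper.charCodeAt(0) + entityEnd);
            alternatives.push("&#x0*" + upper.charCodeAt(0).toString(16) + entityEnd);
        }
        parts.push("(?:" + alternatives.join("|") + ")");
    }
    return parts.join("");
}

// The "i" flag takes care of literal upper-case letters and hex digits.
var refresh = new RegExp("^" + entityTolerantPattern("refresh") + "$", "i");
refresh.test("refresh");       // true
refresh.test("r&#101;fresh");  // true: &#101; is "e"
refresh.test("&#x52;EFRESH");  // true: &#x52; is "R"
refresh.test("refreshed");     // false
```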

All <script>//<[CDATA[ .. //]]></script> occurrences are stripped. This step is necessary, because CDATA sections allow </script> strings inside the code. After this replacement has been executed, it is safe to move on to the next replacement:
The remaining <script>...</script> tags are removed.
The <meta http-equiv=refresh .. > tag is removed.
All event listeners and external pointers / attributes ( href , src , url() ) are prefixed by data- , as described previously.
An IFrame object is created. IFrames are less likely to leak memory (contrary to the htmlfile ActiveXObject). The IFrame is made invisible and appended to the document, so that the DOM can be accessed. document.write() is used to write the HTML to the IFrame. document.open() and document.close() are used to empty the previous contents of the document, so that the generated document is an exact copy of the given html string.
If a callback function has been specified, the function will be called with two arguments. The first argument is a reference to the generated document object. The second argument is a function that destroys the generated DOM tree when called. This function should be called when you don't need the tree any more.
If the callback function isn't specified, the function returns an object consisting of two properties ( doc and destroy ), which behave the same as the previously mentioned arguments.
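The two script-stripping steps described above can be sketched in isolation as follows. This is a simplified illustration rather than the answer's exact expressions: the function name stripScripts is hypothetical, the opening tag is matched with a plain [^>]* instead of the any helper, and the optional "!" is an addition of this sketch so that the standard <![CDATA[ spelling is matched as well.

```javascript
// Hypothetical helper showing why the order of the two replacements matters.
function stripScripts(html) {
    // Step 1: remove scripts whose body is wrapped in a commented-out CDATA
    // section. Their body may contain the text "</script>", so they have to
    // be matched up to the closing "]]>" first.
    html = html.replace(
        /<script[^>]*>\s*\/\/\s*<!?\[CDATA\[[\S\s]*?\]\]>\s*<\/script[^>]*>/gi,
        ""
    );
    // Step 2: remove every remaining <script>...</script> block, stopping at
    // the first closing tag.
    return html.replace(/<script[\S\s]+?<\/script\s*>/gi, "");
}

var input = '<p>a</p><script>//<![CDATA[\nvar s = "</script>";\n//]]></script>' +
            '<script>alert(1)</script><p>b</p>';
var cleaned = stripScripts(input); // "<p>a</p><p>b</p>"
```

If only the second replacement were run on this input, the match would stop at the "</script>" inside the string literal and leave the rest of the first script behind as text.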

Additional notes

Setting the designMode property to "On" will stop a frame from executing scripts (not supported in Chrome). If you have to preserve the <script> tags for a specific reason, you can use iframe.designMode = "On" instead of the script-stripping feature.
I wasn't able to find a reliable source for the htmlfile ActiveXObject. According to this source, htmlfile is slower than IFrames and more susceptible to memory leaks.
All affected attributes ( href , src , ...) are prefixed by data- . An example of getting / changing these attributes is shown for data-href :
elem.getAttribute("data-href") and elem.setAttribute("data-href", "...")
elem.dataset.href and elem.dataset.href = "..." .
External resources have been disabled. As a result, the page may look completely different:
~~<link rel="stylesheet" href="main.css" />~~ No external styles
~~<script>document.body.bgColor="red";</script>~~ No scripted styles
<img src="128x128.png" /> No images: the size of the element may be completely different.

Examples

sanitiseHTML(html)
Paste this bookmarklet in the location bar. It will offer an option to inject a textarea, showing the sanitised HTML string.

javascript:void(function(){var s=document.createElement("script");s.src="http://rob.lekensteyn.nl/html-sanitizer.js";document.body.appendChild(s)})();

Code examples - string2dom(html) :

    string2dom("<html><head><title>Test</title></head></html>", function(doc, destroy){
        alert(doc.title); /* Alert: "Test" */
        destroy();
    });

    var test = string2dom("<div id='secret'></div>");
    alert(test.doc.getElementById("secret").tagName); /* Alert: "DIV" */
    test.destroy();

Notable references

SO: JS RE to change all relative to absolute URLs. The function sanitiseHTML(html) is based on my previously created replace_all_rel_by_abs(html) function.
Elements - Embedded content - A complete list of standard embedded elements
Elements - Previous HTML elements - An additional list of (deprecated) elements (such as <applet> )
The htmlfile ActiveX object - "Slower than the iframe sandboxes. Leaks memory if not managed"