Encode/Decode URLs in C++


Does anyone know of any good C++ code that does this?

Adding a follow-up to Bill's recommendation to use libcurl: great suggestion, and to bring it up to date:
three years later, the curl_escape function is deprecated, so for future use it is better to call curl_easy_escape.

The URL encoding/decoding algorithm is not that difficult.

I would start with the specification:

URL encoding on Wikipedia

If you want ready-made code, just search the Internet:

http://www.google.it/search?hl=it&q=Encode+Decode+URLs+in+C%2B%2B&meta=

(Yes, that address is URL-encoded.)

The other day I faced the encoding half of this problem. Unhappy with the available options, and after taking a look at this C sample code, I decided to roll my own C++ URL-encoding function:

#include <cctype>
#include <iomanip>
#include <sstream>
#include <string>

using namespace std;

string url_encode(const string &value) {
    ostringstream escaped;
    escaped.fill('0');
    escaped << hex;

    for (string::const_iterator i = value.begin(), n = value.end(); i != n; ++i) {
        string::value_type c = (*i);

        // Keep alphanumeric and other accepted characters intact
        // (the cast avoids undefined behavior in isalnum for negative char values)
        if (isalnum(static_cast<unsigned char>(c)) || c == '-' || c == '_' || c == '.' || c == '~') {
            escaped << c;
            continue;
        }

        // Any other characters are percent-encoded
        escaped << uppercase;
        escaped << '%' << setw(2) << int((unsigned char) c);
        escaped << nouppercase;
    }

    return escaped.str();
}

The implementation of the decoding function is left as an exercise for the reader. :P
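For readers who would rather skip the exercise, a decoder that mirrors the encoder above could look roughly like this. It is a minimal sketch, not the original author's code: the names percent_decode and hex_value are made up for illustration, and copying malformed escapes (such as a trailing '%') through unchanged is an assumed policy, not something the answer specifies.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: numeric value of a hex digit, or -1 if c is not one.
int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Hypothetical decoder: turns each valid "%XX" into the byte 0xXX.
// Malformed escapes are copied through unchanged (an assumed policy).
std::string percent_decode(const std::string &value) {
    std::string out;
    out.reserve(value.size());
    for (std::size_t i = 0; i < value.size(); ++i) {
        // Only look at the two following characters if they exist.
        if (value[i] == '%' && i + 2 < value.size()) {
            int hi = hex_value(value[i + 1]);
            int lo = hex_value(value[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>((hi << 4) | lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out += value[i];
    }
    return out;
}
```

Unlike some of the decoders in this thread, this sketch does not turn '+' into a space; that convention belongs to form encoding and would be an extra branch.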

This version is pure C and can optionally normalize the resource path. Using it from C++ is trivial:

#include <stddef.h>
#include <iostream>
#include <string>

// Declaration of the C function shown further down
// (wrap it in extern "C" if that function is compiled as C)
ptrdiff_t urldecode(char* dst, const char* src, int normalize);

int main(int argc, char** argv)
{
    const std::string src("/some.url/foo/../bar/%2e/");
    std::cout << "src=\"" << src << "\"" << std::endl;

    // either do it the C++ conformant way:
    char* dst_buf = new char[src.size() + 1];
    urldecode(dst_buf, src.c_str(), 1);
    std::string dst1(dst_buf);
    delete[] dst_buf;
    std::cout << "dst1=\"" << dst1 << "\"" << std::endl;

    // or in-place with the &[0] trick to skip the new/delete
    std::string dst2;
    dst2.resize(src.size() + 1);
    dst2.resize(urldecode(&dst2[0], src.c_str(), 1));
    std::cout << "dst2=\"" << dst2 << "\"" << std::endl;
}

Output:

src="/some.url/foo/../bar/%2e/"
dst1="/some.url/bar/"
dst2="/some.url/bar/"

And the actual function:

#include <stddef.h>
#include <ctype.h>

/**
 * decode a percent-encoded C string with optional path normalization
 *
 * The buffer pointed to by @dst must be at least strlen(@src) + 1 bytes
 * (the terminating null is written as well).
 * Decoding stops at the first character from @src that decodes to null.
 * Path normalization will remove redundant slashes and slash+dot sequences,
 * as well as removing path components when slash+dot+dot is found. It will
 * keep the root slash (if one was present) and will stop normalization
 * at the first questionmark found (so query parameters won't be normalized).
 *
 * @param dst       destination buffer
 * @param src       source buffer
 * @param normalize perform path normalization if nonzero
 * @return          number of valid characters in @dst
 * @author          Johan Lindh <[email protected]>
 * @legalese        BSD licensed (http://opensource.org/licenses/BSD-2-Clause)
 */
ptrdiff_t urldecode(char* dst, const char* src, int normalize)
{
    char* org_dst = dst;
    int slash_dot_dot = 0;
    char ch, a, b;
    do {
        ch = *src++;
        /* casts keep isxdigit well-defined for negative char values */
        if (ch == '%' &&
            isxdigit((unsigned char)(a = src[0])) &&
            isxdigit((unsigned char)(b = src[1]))) {
            if (a < 'A') a -= '0';
            else if (a < 'a') a -= 'A' - 10;
            else a -= 'a' - 10;
            if (b < 'A') b -= '0';
            else if (b < 'a') b -= 'A' - 10;
            else b -= 'a' - 10;
            ch = 16 * a + b;
            src += 2;
        }
        if (normalize) {
            switch (ch) {
            case '/':
                if (slash_dot_dot < 3) {
                    /* compress consecutive slashes and remove slash-dot */
                    dst -= slash_dot_dot;
                    slash_dot_dot = 1;
                    break;
                }
                /* fall-through */
            case '?':
                /* at start of query, stop normalizing */
                if (ch == '?')
                    normalize = 0;
                /* fall-through */
            case '\0':
                if (slash_dot_dot > 1) {
                    /* remove trailing slash-dot-(dot) */
                    dst -= slash_dot_dot;
                    /* remove parent directory if it was two dots */
                    if (slash_dot_dot == 3)
                        while (dst > org_dst && *--dst != '/')
                            /* empty body */;
                    slash_dot_dot = (ch == '/') ? 1 : 0;
                    /* keep the root slash if any */
                    if (!slash_dot_dot && dst == org_dst && *dst == '/')
                        ++dst;
                }
                break;
            case '.':
                if (slash_dot_dot == 1 || slash_dot_dot == 2) {
                    ++slash_dot_dot;
                    break;
                }
                /* fall-through */
            default:
                slash_dot_dot = 0;
            }
        }
        *dst++ = ch;
    } while(ch);
    return (dst - org_dst) - 1;
}

Inspired by xperroni, I wrote a decoder. Thanks for the pointer.

#include <cctype>   // isdigit, tolower
#include <iostream>
#include <sstream>
#include <string>

using namespace std;

// Converts one hex digit to its value; the input is not validated.
char from_hex(char ch) {
    unsigned char u = static_cast<unsigned char>(ch);
    return isdigit(u) ? ch - '0' : tolower(u) - 'a' + 10;
}

string url_decode(string text) {
    char h;
    ostringstream escaped;
    escaped.fill('0');

    for (auto i = text.begin(), n = text.end(); i != n; ++i) {
        string::value_type c = (*i);

        if (c == '%') {
            // Decode only when two more characters follow the '%';
            // otherwise the '%' is dropped, as in the original logic.
            if (n - i > 2) {
                h = from_hex(i[1]) << 4 | from_hex(i[2]);
                escaped << h;
                i += 2;
            }
        } else if (c == '+') {
            escaped << ' ';
        } else {
            escaped << c;
        }
    }

    return escaped.str();
}

int main(int argc, char** argv) {
    string msg = "J%C3%B8rn!";
    cout << msg << endl;
    string decodemsg = url_decode(msg);
    cout << decodemsg << endl;

    return 0;
}

Edit: removed the unnecessary iomanip include (cctype is still needed for isdigit and tolower).

The Windows API has the UrlEscape / UrlUnescape functions, exported by shlwapi.dll, for this task.

I couldn't find a URI decoder/unescaper here that also decodes 2- and 3-byte sequences. So I am contributing my own high-performance version, which converts the string input into a wstring on the fly:

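#include <string>

// Hex value of each character from '0' (index 0) to 'f' (index 54);
// -1 marks characters that are not hex digits.
const signed char HEX2DEC[55] =
{
     0, 1, 2, 3,  4, 5, 6, 7,  8, 9,-1,-1, -1,-1,-1,-1,
    -1,10,11,12, 13,14,15,-1, -1,-1,-1,-1, -1,-1,-1,-1,
    -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
    -1,10,11,12, 13,14,15
};

// Value of the hex digit at *s, or -1 if it is not a hex digit.
static int x2d(const char *s)
{
    unsigned idx = static_cast<unsigned char>(*s) - 48u;
    return idx < 55 ? HEX2DEC[idx] : -1;
}

// Value of the two hex digits starting at s.
static int x2d2(const char *s)
{
    return x2d(s) * 16 + x2d(s + 1);
}

// Assumes well-formed input: every '%' is followed by the hex digits it
// needs. Four-byte UTF-8 sequences are not handled.
std::wstring decodeURI(const char *s)
{
    std::wstring ws;
    while (*s) {
        if (*s == '%') {
            unsigned char b = static_cast<unsigned char>(x2d2(s + 1));
            if (b >= 0xE0) {
                // three byte codepoint: %XX%XX%XX
                ws += static_cast<wchar_t>(((b & 0x0F) << 12)
                    | ((x2d2(s + 4) & 0x3F) << 6)
                    | (x2d2(s + 7) & 0x3F));
                s += 9;
            } else if (b >= 0x80) {
                // two byte codepoint: %XX%XX (five payload bits in the lead byte)
                ws += static_cast<wchar_t>(((b & 0x1F) << 6)
                    | (x2d2(s + 4) & 0x3F));
                s += 6;
            } else {
                // one byte codepoint
                ws += static_cast<wchar_t>(b);
                s += 3;
            }
        } else {
            // no %
            ws += static_cast<wchar_t>(static_cast<unsigned char>(*s));
            s++;
        }
    }
    return ws;
}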
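The multi-byte handling in the decoder above is ordinary UTF-8 decoding applied to the percent-decoded bytes. The following sketch separates that step out so it can be read on its own. It is an illustration under stated assumptions, not the answer's code: the name utf8_to_code_points is hypothetical, the input is assumed to be already percent-decoded, malformed input is replaced with U+FFFD (an assumed policy), and overlong encodings and surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes (for example the output of a
// percent-decoder) into code points. Malformed input yields U+FFFD.
std::u32string utf8_to_code_points(const std::string &bytes) {
    const char32_t kReplacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else {
            // stray continuation byte or invalid lead byte
            out += kReplacement;
            ++i;
            continue;
        }

        // Sequence cut off by the end of the input.
        if (extra > bytes.size() - i - 1) {
            out += kReplacement;
            break;
        }

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append six payload bits
        }
        if (ok) {
            out += cp;
            i += extra + 1;
        } else {
            out += kReplacement;
            ++i;
        }
    }
    return out;
}
```

Returning std::u32string instead of std::wstring is a deliberate choice for the sketch: wchar_t is only 16 bits wide on Windows, so code points from four-byte sequences would not fit in a single element there.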

Another solution is available in Facebook's folly library: folly::uriEscape and folly::uriUnescape.

In general, appending the decimal int value of a char to '%' will not work when encoding; the value must be written in hexadecimal. For example, '/' is '%2F', not '%47'.
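To make the point concrete, here is a minimal sketch that formats one byte as a percent escape with two uppercase hex digits. The function name percent_escape is hypothetical and only serves the illustration.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: percent-escapes a single byte as "%XX".
std::string percent_escape(unsigned char byte) {
    char buf[4];  // '%', two hex digits, terminating NUL
    // "%%" prints a literal '%'; "%02X" prints two uppercase hex digits.
    std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(byte));
    return buf;
}
```

percent_escape('/') yields "%2F", because the character code 47 is 2F in hexadecimal.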

I think this is the best and most concise solution for URL encoding and decoding (without many header dependencies).

#include <cctype>
#include <cstdio>
#include <string>

using namespace std;

string urlEncode(const string &str) {
    string new_str;
    char bufHex[10];
    for (size_t i = 0; i < str.length(); i++) {
        // unsigned char keeps bytes above 0x7F from being sign-extended
        unsigned char c = static_cast<unsigned char>(str[i]);
        // uncomment this if you want to encode spaces with +
        /*if (c == ' ') new_str += '+';
        else */
        if (isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            new_str += static_cast<char>(c);
        } else {
            snprintf(bufHex, sizeof bufHex, "%X", static_cast<unsigned>(c));
            if (c < 16)
                new_str += "%0";
            else
                new_str += "%";
            new_str += bufHex;
        }
    }
    return new_str;
}

string urlDecode(const string &str) {
    string ret;
    unsigned int ii;
    size_t len = str.length();
    for (size_t i = 0; i < len; i++) {
        if (str[i] != '%') {
            if (str[i] == '+')
                ret += ' ';
            else
                ret += str[i];
        } else if (i + 2 < len &&
                   isxdigit(static_cast<unsigned char>(str[i + 1])) &&
                   isxdigit(static_cast<unsigned char>(str[i + 2]))) {
            sscanf(str.substr(i + 1, 2).c_str(), "%x", &ii);
            ret += static_cast<char>(ii);
            i = i + 2;
        } else {
            // malformed escape: keep the '%' as it is
            ret += str[i];
        }
    }
    return ret;
}

You can use the g_uri_escape_string() function provided by glib.h. https://developer.gnome.org/glib/stable/glib-URI-Functions.html

#include <stdio.h>
#include <glib.h>

int main()
{
    const char *uri = "http://www.example.com?hello world";
    char *encoded_uri = NULL;
    // as per wiki (https://en.wikipedia.org/wiki/Percent-encoding)
    // these reserved characters are passed as "allowed", so they stay unescaped
    const char *escape_char_str = "!*'();:@&=+$,/?#[]";
    encoded_uri = g_uri_escape_string(uri, escape_char_str, TRUE);
    printf("[%s]\n", encoded_uri);
    g_free(encoded_uri);  // strings allocated by GLib are released with g_free
    return 0;
}

Compile it with:

gcc encoding_URI.c `pkg-config --cflags --libs glib-2.0`

Answering my own question...

libcurl has curl_easy_escape for encoding.

For decoding, there is curl_easy_unescape.

I know the question asks for a C++ method, but for those who need it, I came up with a very short function in plain C to encode a string. It does not create a new string; it alters the existing one instead, which means the buffer must be large enough to hold the encoded string. Very easy to maintain.

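#include <string.h>

void urlEncode(char *string)
{
    unsigned char charToEncode;
    size_t posToEncode;
    /* strspn gives the length of the leading run of unreserved characters;
       the loop ends when that run reaches the terminating null */
    while (string[posToEncode = strspn(string, "1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~")] != '\0')
    {
        charToEncode = (unsigned char)string[posToEncode];
        /* shift the tail (including the null) two bytes to the right */
        memmove(string + posToEncode + 3, string + posToEncode + 1, strlen(string + posToEncode));
        string[posToEncode] = '%';
        string[posToEncode + 1] = "0123456789ABCDEF"[charToEncode >> 4];
        string[posToEncode + 2] = "0123456789ABCDEF"[charToEncode & 0xf];
        string += posToEncode + 3;
    }
}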
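Because the function above works in place, the caller has to size the buffer before calling it. One way to do that is to count how many bytes the encoded form will need. The following is a sketch under the assumption that the same set of unreserved characters is used; the name encoded_size is hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: bytes needed to hold the percent-encoded form of s,
// including the terminating NUL. Unreserved characters take one byte,
// everything else takes three ("%XX").
std::size_t encoded_size(const char *s) {
    static const char unreserved[] =
        "1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_.~";
    std::size_t n = 1;  // room for the terminating NUL
    for (; *s != '\0'; ++s)
        n += std::strchr(unreserved, *s) ? 1 : 3;
    return n;
}
```

A caller could allocate a buffer of encoded_size(src) bytes, copy src into it, and then run the in-place encoder on that buffer.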

I ended up at this question while looking for an API to decode a URL in a Win32 C++ application. Since the question does not specify a platform, assuming Windows is not a bad thing.

InternetCanonicalizeUrl is the API for Windows programs. See its documentation for more information.

// Requires <windows.h> and <wininet.h>; link with wininet.lib.
// strUrl is the input URL.
LPTSTR lpOutputBuffer = new TCHAR[1];
DWORD dwSize = 1;
BOOL fRes = ::InternetCanonicalizeUrl(strUrl, lpOutputBuffer, &dwSize, ICU_DECODE | ICU_NO_ENCODE);
DWORD dwError = ::GetLastError();
if (!fRes && dwError == ERROR_INSUFFICIENT_BUFFER)
{
    // dwSize now holds the required size; retry with a big enough buffer
    delete[] lpOutputBuffer;
    lpOutputBuffer = new TCHAR[dwSize];
    fRes = ::InternetCanonicalizeUrl(strUrl, lpOutputBuffer, &dwSize, ICU_DECODE | ICU_NO_ENCODE);
    if (fRes)
    {
        //lpOutputBuffer has decoded url
    }
    else
    {
        //failed to decode
    }
}
else
{
    //some other error OR the input string url is just 1 char and was successfully decoded
}
// release the buffer on every path
delete[] lpOutputBuffer;
lpOutputBuffer = NULL;

InternetCrackUrl also seems to have flags that specify whether to decode the URL.

I had to do this in a project without Boost, so I ended up writing my own. I'll just put it on GitHub: https://github.com/corporateshark/LUrlParser

clParseURL URL = clParseURL::ParseURL( "https://name:[email protected]:80/path/res" );

if ( URL.IsValid() )
{
    cout << "Scheme    : " << URL.m_Scheme << endl;
    cout << "Host      : " << URL.m_Host << endl;
    cout << "Port      : " << URL.m_Port << endl;
    cout << "Path      : " << URL.m_Path << endl;
    cout << "Query     : " << URL.m_Query << endl;
    cout << "Fragment  : " << URL.m_Fragment << endl;
    cout << "User name : " << URL.m_UserName << endl;
    cout << "Password  : " << URL.m_Password << endl;
}

And the source code...

http://www.codeguru.com/cpp/cpp/string/conversions/article.php/c12759


[Necromancer mode on]
I stumbled upon this question while looking for a fast, modern, platform-independent and elegant solution. I didn't like any of the above. cpp-netlib would be the winner, but it has a horrific memory vulnerability in the "decoded" function. So I came up with a Boost Spirit Qi/Karma solution.

#include <string>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>

namespace bsq = boost::spirit::qi;
namespace bk = boost::spirit::karma;

// Parses exactly two hex digits into one byte
bsq::int_parser<unsigned char, 16, 2, 2> hex_byte;

template <typename InputIterator>
struct unescaped_string
    : bsq::grammar<InputIterator, std::string(char const *)> {
    unescaped_string() : unescaped_string::base_type(unesc_str) {
        unesc_char.add("+", ' ');

        unesc_str = *(unesc_char | "%" >> hex_byte | bsq::char_);
    }

    bsq::rule<InputIterator, std::string(char const *)> unesc_str;
    bsq::symbols<char const, char const> unesc_char;
};

template <typename OutputIterator>
struct escaped_string : bk::grammar<OutputIterator, std::string(char const *)> {
    escaped_string() : escaped_string::base_type(esc_str) {

        esc_str = *(bk::char_("a-zA-Z0-9_.~-") | "%" << bk::right_align(2,0)[bk::hex]);
    }
    bk::rule<OutputIterator, std::string(char const *)> esc_str;
};

The above is used as follows:

std::string unescape(const std::string &input) {
    std::string retVal;
    retVal.reserve(input.size());
    typedef std::string::const_iterator iterator_type;

    char const *start = "";
    iterator_type beg = input.begin();
    iterator_type end = input.end();
    unescaped_string<iterator_type> p;

    if (!bsq::parse(beg, end, p(start), retVal))
        retVal = input;
    return retVal;
}

std::string escape(const std::string &input) {
    typedef std::back_insert_iterator<std::string> sink_type;
    std::string retVal;
    retVal.reserve(input.size() * 3);
    sink_type sink(retVal);
    char const *start = "";

    escaped_string<sink_type> g;
    if (!bk::generate(sink, g(start), input))
        retVal = input;
    return retVal;
}

[Necromancer mode off]

EDIT01: fixed the zero-padding stuff - special thanks to Hartmut Kaiser
EDIT02: Live on CoLiRu

The juicy bits

#include <ctype.h> // isdigit, tolower

char from_hex(char ch) {
    return isdigit(ch) ? ch - '0' : tolower(ch) - 'a' + 10;
}

char to_hex(char code) {
    static char hex[] = "0123456789abcdef";
    return hex[code & 15];
}

noting that

char d = from_hex(hex[0]) << 4 | from_hex(hex[1]);

as in

// %7B = '{'
char d = from_hex('7') << 4 | from_hex('B');
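The to_hex helper is listed above but not shown in use. For the encoding direction, a byte is split into its high and low nibble and each nibble is passed through to_hex. A minimal sketch, with to_hex repeated so that the block is self-contained; the name encode_byte is hypothetical:

```cpp
#include <string>

// Same helper as above, repeated so the sketch is self-contained.
char to_hex(char code) {
    static const char hex[] = "0123456789abcdef";
    return hex[code & 15];
}

// Hypothetical helper: '%' followed by the high and low nibble of the byte.
std::string encode_byte(unsigned char byte) {
    std::string out = "%";
    out += to_hex(static_cast<char>(byte >> 4));   // high nibble
    out += to_hex(static_cast<char>(byte & 15));   // low nibble
    return out;
}
```

Because the lookup table is lowercase, encode_byte('{') gives "%7b"; an uppercase table would give "%7B", and decoders are expected to accept both.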

You can simply use the AtlEscapeUrl() function from atlutil.h; just check its documentation for how to use it.

cpp-netlib has the functions

namespace boost {
namespace network {
namespace uri {
inline std::string decoded(const std::string &input);
inline std::string encoded(const std::string &input);
}
}
}

which make it very easy to encode and decode URL strings.

CGICC includes methods for URL encoding and decoding: form_urlencode and form_urldecode.

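#include <cctype>
#include <cstdio>
#include <string>

using namespace std;

string urlDecode(const string &SRC) {
    string ret;
    unsigned int ii;
    for (size_t i = 0; i < SRC.length(); i++) {
        // decode "%XX" only when two hex digits really follow the '%'
        if (SRC[i] == '%' && i + 2 < SRC.length() &&
            isxdigit(static_cast<unsigned char>(SRC[i + 1])) &&
            isxdigit(static_cast<unsigned char>(SRC[i + 2]))) {
            sscanf(SRC.substr(i + 1, 2).c_str(), "%x", &ii);
            ret += static_cast<char>(ii);
            i = i + 2;
        } else {
            ret += SRC[i];
        }
    }
    return (ret);
}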

It's not the best, but it works fine ;-)