sql server - Compute the MD5 hash of a UTF-8 string


You need to create a UDF that converts the NVARCHAR data to bytes in its UTF-8 representation. Say it is called dbo.NCharToUTF8Binary; then you can do:

hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 1))

Here is a UDF that does that:

create function dbo.NCharToUTF8Binary(@txt NVARCHAR(max), @modified bit)
returns varbinary(max)
as
begin
    -- Note: This is not the fastest possible routine.
    -- If you want a fast routine, use SQLCLR.
    set @modified = isnull(@modified, 0)

    -- First shred into a table.
    declare @chars table (
        ix int identity primary key,
        codepoint int,
        utf8 varbinary(6)
    )
    declare @ix int
    set @ix = 0
    -- datalength()/2 rather than len(), so that trailing spaces are counted.
    while @ix < datalength(@txt)/2
    begin
        set @ix = @ix + 1
        insert @chars(codepoint)
        select unicode(substring(@txt, @ix, 1))
    end

    -- Now look for surrogate pairs.
    -- High surrogate is \uD800 to \uDBFF
    -- Low surrogate is \uDC00 to \uDFFF
    -- Where a high surrogate is followed by a low surrogate, combine them.
    -- Each half carries 10 bits: mask with 0x03ff and scale the high half by 0x0400.
    update c1
    set codepoint = ((c1.codepoint & 0x03ff) * 0x0400) + (c2.codepoint & 0x03ff) + 0x10000
    from @chars c1
    inner join @chars c2 on c1.ix = c2.ix - 1
    where c1.codepoint >= 0xD800 and c1.codepoint <= 0xDBFF
      and c2.codepoint >= 0xDC00 and c2.codepoint <= 0xDFFF

    -- Get rid of the trailing half of the pair where found.
    delete c2
    from @chars c1
    inner join @chars c2 on c1.ix = c2.ix - 1
    where c1.codepoint >= 0x10000

    -- Now UTF-8 encode each codepoint.
    -- Lone surrogate halves will still be here,
    -- so they will be encoded as if they were not surrogates.
    update c
    set utf8 = case
        -- One-byte encodings (modified UTF-8 outputs zero as a two-byte encoding)
        when codepoint <= 0x7f and (@modified = 0 OR codepoint <> 0)
        then cast(substring(cast(codepoint as binary(4)), 4, 1) as varbinary(6))
        -- Two-byte encodings
        when codepoint <= 0x07ff
        then substring(cast((0x00C0 + ((codepoint/0x40) & 0x1f)) as binary(4)),4,1)
           + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
        -- Three-byte encodings
        when codepoint <= 0x0ffff
        then substring(cast((0x00E0 + ((codepoint/0x1000) & 0x0f)) as binary(4)),4,1)
           + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
           + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
        -- Four-byte encodings
        when codepoint <= 0x1FFFFF
        then substring(cast((0x00F0 + ((codepoint/0x00040000) & 0x07)) as binary(4)),4,1)
           + substring(cast((0x0080 + ((codepoint/0x1000) & 0x3f)) as binary(4)),4,1)
           + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
           + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
        end
    from @chars c

    -- Finally concatenate them all and return.
    declare @ret varbinary(max)
    set @ret = cast('' as varbinary(max))
    select @ret = @ret + utf8 from @chars c order by ix
    return @ret
end
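One way to sanity-check a routine like this is to compare its output with what .NET produces for the same string. The sketch below is not part of the original answer, and the names Utf8Check, CombineSurrogates and Utf8Hex are made up for illustration. It combines a surrogate pair by hand with the standard UTF-16 decoding arithmetic and prints the UTF-8 bytes that Encoding.UTF8 yields.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Standard UTF-16 decoding: 10 payload bits from each half, plus 0x10000.
    public static int CombineSurrogates(char high, char low)
    {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    // Hex dump of the UTF-8 bytes .NET produces for the given text.
    public static string Utf8Hex(string text)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }

    public static void Main()
    {
        string s = "\uD83D\uDE00"; // U+1F600 written as a surrogate pair
        Console.WriteLine(CombineSurrogates(s[0], s[1]).ToString("X")); // 1F600
        Console.WriteLine(Utf8Hex(s));     // F0-9F-98-80
        Console.WriteLine(Utf8Hex("abc")); // 61-62-63
    }
}
```

Any mismatch between these bytes and the UDF's result for the same input points at the encoding step rather than at HASHBYTES.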

I have a SQL table in which I store large string values that must be unique. To guarantee uniqueness, I have a unique index on a column in which I store a string representation of the MD5 hash of the large string.

The C# application that saves these records uses the following method to compute the hash:

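using System.Linq;
using System.Security.Cryptography;

public static string CreateMd5HashString(byte[] input)
{
    // Hash the raw bytes, then hex-encode each byte with the "X" format.
    var hashBytes = MD5.Create().ComputeHash(input);
    return string.Join("", hashBytes.Select(b => b.ToString("X")));
}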

To call it, I first convert the string to a byte[] using UTF-8 encoding:

// this is what I use in my app
CreateMd5HashString(Encoding.UTF8.GetBytes("abc"))
// result: 90150983CD24FB0D6963F7D28E17F72
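Note that the result above has 31 characters, not 32: the "X" format specifier does not zero-pad, so the hash byte 0x01 is written as "1". A small sketch of the difference follows; the helper ToHex and the class HexFormatDemo are hypothetical and not part of the application code.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class HexFormatDemo
{
    // Hex-encode bytes with the given per-byte format ("X" or "X2").
    public static string ToHex(byte[] bytes, string format)
    {
        return string.Concat(bytes.Select(b => b.ToString(format)));
    }

    public static void Main()
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes("abc"));
            // Unpadded: bytes below 0x10 contribute a single digit.
            Console.WriteLine(ToHex(hash, "X"));  // 90150983CD24FB0D6963F7D28E17F72
            // Zero-padded: always two digits per byte.
            Console.WriteLine(ToHex(hash, "X2")); // 900150983CD24FB0D6963F7D28E17F72
        }
    }
}
```

If the stored values come from the "X" variant, a SQL-side implementation would have to drop the same leading zeros in order to match them.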

Now I would like to implement this hashing function in SQL, using the HASHBYTES function, but I get a different value:

print hashbytes('md5', N'abc')
-- result: 0xCE1473CF80C6B3FDA8E3DFC006ADC315

This is because SQL computes the MD5 of the string's UTF-16 representation. I get the same result in C# if I call CreateMd5HashString(Encoding.Unicode.GetBytes("abc")).
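The difference is easy to see by dumping the bytes each encoding produces, since MD5 only ever sees bytes. A minimal sketch (the class name EncodingBytesDemo and the helper Dump are made up for illustration):

```csharp
using System;
using System.Text;

public static class EncodingBytesDemo
{
    // Hex dump of the bytes the given encoding produces for the text.
    public static string Dump(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // UTF-8: one byte per ASCII character.
        Console.WriteLine(Dump(Encoding.UTF8, "abc"));    // 61-62-63
        // UTF-16LE (Encoding.Unicode): two bytes per character here.
        Console.WriteLine(Dump(Encoding.Unicode, "abc")); // 61-00-62-00-63-00
    }
}
```

Different input bytes mean different hashes, even though both byte sequences represent the same text.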

I cannot change the way the hash is computed in the application.

Is there a way to make SQL Server compute the MD5 hash of the string's UTF-8 bytes?

I looked for similar questions and tried using collations, but have had no luck so far.

SQL Server does not natively support UTF-8 strings, and has not for quite some time (UTF-8 collations for CHAR and VARCHAR only arrived with SQL Server 2019). As you noticed, NCHAR and NVARCHAR use UCS-2/UTF-16 rather than UTF-8.

If you insist on using the HASHBYTES function, you need to be able to pass the UTF-8 byte[] from your C# code as VARBINARY in order to preserve the encoding. HASHBYTES accepts VARBINARY input as well as NVARCHAR. This could be accomplished with a CLR function that accepts NVARCHAR and returns the result of Encoding.UTF8.GetBytes as VARBINARY.
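A minimal sketch of such a CLR function is shown below. It is an illustration under assumptions, not a tested deployment: the class name Utf8Functions and the method name NCharToUtf8Bytes are hypothetical, the code targets the .NET Framework that SQLCLR requires, and the assembly would still have to be registered in the database and exposed through a T-SQL function declaration.

```csharp
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;

public static class Utf8Functions
{
    // Hypothetical SQLCLR scalar function: NVARCHAR in, UTF-8 bytes out.
    [SqlFunction(IsDeterministic = true, IsPrecise = true)]
    public static SqlBytes NCharToUtf8Bytes(SqlString text)
    {
        // Propagate SQL NULL instead of throwing on text.Value.
        if (text.IsNull)
        {
            return SqlBytes.Null;
        }
        return new SqlBytes(Encoding.UTF8.GetBytes(text.Value));
    }
}
```

Assuming the function ends up registered as dbo.NCharToUtf8Bytes, the call would look like hashbytes('md5', dbo.NCharToUtf8Bytes(N'abc')).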

That said, I suggest keeping this kind of business rule isolated in your application rather than in the database, especially since the application already performs this logic.