performance - Optimización del rendimiento de matriz numérica en Haskell

floating-point polymorphism (1)

Estoy trabajando en un algoritmo de generación de terreno para un mundo parecido a Minecraft. Actualmente, estoy usando el ruido simple basado en la implementación en el documento ''Simplex Noise Demystified'' [PDF] , ya que se supone que el ruido simple es más rápido y tiene menos artefactos que el ruido Perlin. Esto se ve bastante decente (ver imagen), pero hasta ahora también es bastante lento.

Ejecutar la función de ruido 10 veces (necesito ruido con diferentes longitudes de onda para cosas como altura del terreno, temperatura, ubicación del árbol, etc.) con 3 octavas de ruido por cada bloque en un bloque (16x16x128 bloques), o cerca de 1 millón de llamadas al la función de ruido en total, toma alrededor de 700-800 ms. Esto es al menos un orden de magnitud demasiado lento para los propósitos de generar terreno con cualquier tipo de velocidad decente, a pesar del hecho de que no hay operaciones obvias y costosas en el algoritmo (al menos para mí). Solo piso, módulo, algunas búsquedas en matriz y aritmética básica. El algoritmo (escrito en Haskell) se enumera a continuación. Los comentarios del SCC son para perfilar. He omitido las funciones de ruido 2D, ya que funcionan de la misma manera.

g3 :: (Floating a, RealFrac a) => a g3 = 1/6 {-# INLINE int #-} int :: (Integral a, Num b) => a -> b int = fromIntegral grad3 :: (Floating a, RealFrac a) => V.Vector (a,a,a) grad3 = V.fromList $ [(1,1,0),(-1, 1,0),(1,-1, 0),(-1,-1, 0), (1,0,1),(-1, 0,1),(1, 0,-1),(-1, 0,-1), (0,1,1),( 0,-1,1),(0, 1,-1),( 0,-1,-1)] {-# INLINE dot3 #-} dot3 :: Num a => (a, a, a) -> a -> a -> a -> a dot3 (a,b,c) x y z = a * x + b * y + c * z {-# INLINE fastFloor #-} fastFloor :: RealFrac a => a -> Int fastFloor x = truncate (if x > 0 then x else x - 1) --Generate a random permutation for use in the noise functions perm :: Int -> Permutation perm seed = V.fromList . concat . replicate 2 . shuffle'' [0..255] 256 $ mkStdGen seed --Generate 3D noise between -0.5 and 0.5 simplex3D :: (Floating a, RealFrac a) => Permutation -> a -> a -> a -> a simplex3D p x y z = {-# SCC "out" #-} 16 * (n gi0 (x0,y0,z0) + n gi1 xyz1 + n gi2 xyz2 + n gi3 xyz3) where (i,j,k) = {-# SCC "ijk" #-} (s x, s y, s z) where s a = fastFloor (a + (x + y + z) / 3) (x0,y0,z0) = {-# SCC "x0-z0" #-} (x - int i + t, y - int j + t, z - int k + t) where t = int (i + j + k) * g3 (i1,j1,k1,i2,j2,k2) = {-# SCC "i1-k2" #-} if x0 >= y0 then if y0 >= z0 then (1,0,0,1,1,0) else if x0 >= z0 then (1,0,0,1,0,1) else (0,0,1,1,0,1) else if y0 < z0 then (0,0,1,0,1,1) else if x0 < z0 then (0,1,0,0,1,1) else (0,1,0,1,1,0) xyz1 = {-# SCC "xyz1" #-} (x0 - int i1 + g3, y0 - int j1 + g3, z0 - int k1 + g3) xyz2 = {-# SCC "xyz2" #-} (x0 - int i2 + 2*g3, y0 - int j2 + 2*g3, z0 - int k2 + 2*g3) xyz3 = {-# SCC "xyz3" #-} (x0 - 1 + 3*g3, y0 - 1 + 3*g3, z0 - 1 + 3*g3) (ii,jj,kk) = {-# SCC "iijjkk" #-} (i .&. 255, j .&. 255, k .&. 255) gi0 = {-# SCC "gi0" #-} mod (p V.! (ii + p V.! (jj + p V.! kk ))) 12 gi1 = {-# SCC "gi1" #-} mod (p V.! (ii + i1 + p V.! (jj + j1 + p V.! (kk + k1)))) 12 gi2 = {-# SCC "gi2" #-} mod (p V.! (ii + i2 + p V.! (jj + j2 + p V.! (kk + k2)))) 12 gi3 = {-# SCC "gi3" #-} mod (p V.! (ii + 1 + p V.! (jj + 1 + p V.! (kk + 1 )))) 12 {-# INLINE n #-} n gi (x'',y'',z'') = {-# SCC "n" #-} (/a -> if a < 0 then 0 else a*a*a*a*dot3 (grad3 V.! gi) x'' y'' z'') $ 0.6 - x''*x'' - y''*y'' - z''*z'' harmonic :: (Num a, Fractional a) => Int -> (a -> a) -> a harmonic octaves noise = f octaves / (2 - 1 / int (2 ^ (octaves - 1))) where f 0 = 0 f o = let r = int $ 2 ^ (o - 1) in noise r / r + f (o - 1) --Generate harmonic 3D noise between -0.5 and 0.5 harmonicNoise3D :: (RealFrac a, Floating a) => Permutation -> Int -> a -> a -> a -> a -> a harmonicNoise3D p octaves l x y z = harmonic octaves (/f -> simplex3D p (x * f / l) (y * f / l) (z * f / l))

Para el perfil, utilicé el siguiente código,

q _ = let p = perm 0 in sum [harmonicNoise3D p 3 l x y z :: Float | l <- [1..10], y <- [0..127], x <- [0..15], z <- [0..15]] main = do start <- getCurrentTime print $ q () end <- getCurrentTime print $ diffUTCTime end start

que produce la siguiente información:

COST CENTRE MODULE %time %alloc simplex3D Main 18.8 21.0 n Main 18.0 19.6 out Main 10.1 9.2 harmonicNoise3D Main 9.8 4.5 harmonic Main 6.4 5.8 int Main 4.0 2.9 gi3 Main 4.0 3.0 xyz2 Main 3.5 5.9 gi1 Main 3.4 3.4 gi0 Main 3.4 2.7 fastFloor Main 3.2 0.6 xyz1 Main 2.9 5.9 ijk Main 2.7 3.5 gi2 Main 2.7 3.3 xyz3 Main 2.6 4.1 iijjkk Main 1.6 2.5 dot3 Main 1.6 0.7

Para comparar, también porté el algoritmo a C #. El rendimiento allí fue de 3 a 4 veces más rápido, así que imagino que debo estar haciendo algo mal. Pero incluso así, no es tan rápido como me gustaría. Entonces mi pregunta es esta: ¿alguien puede decirme si hay alguna forma de acelerar mi implementación y / o el algoritmo en general o alguien sabe de un algoritmo de ruido diferente que tenga mejores características de rendimiento pero un aspecto similar?

Actualizar:

Después de seguir algunas de las sugerencias que se ofrecen a continuación, el código ahora se ve de la siguiente manera:

module Noise ( Permutation, perm , noise3D, simplex3D ) where import Data.Bits import qualified Data.Vector.Unboxed as UV import System.Random import System.Random.Shuffle type Permutation = UV.Vector Int g3 :: Double g3 = 1/6 {-# INLINE int #-} int :: Int -> Double int = fromIntegral grad3 :: UV.Vector (Double, Double, Double) grad3 = UV.fromList $ [(1,1,0),(-1, 1,0),(1,-1, 0),(-1,-1, 0), (1,0,1),(-1, 0,1),(1, 0,-1),(-1, 0,-1), (0,1,1),( 0,-1,1),(0, 1,-1),( 0,-1,-1)] {-# INLINE dot3 #-} dot3 :: (Double, Double, Double) -> Double -> Double -> Double -> Double dot3 (a,b,c) x y z = a * x + b * y + c * z {-# INLINE fastFloor #-} fastFloor :: Double -> Int fastFloor x = truncate (if x > 0 then x else x - 1) --Generate a random permutation for use in the noise functions perm :: Int -> Permutation perm seed = UV.fromList . concat . replicate 2 . shuffle'' [0..255] 256 $ mkStdGen seed --Generate 3D noise between -0.5 and 0.5 noise3D :: Permutation -> Double -> Double -> Double -> Double noise3D p x y z = 16 * (n gi0 (x0,y0,z0) + n gi1 xyz1 + n gi2 xyz2 + n gi3 xyz3) where (i,j,k) = (s x, s y, s z) where s a = fastFloor (a + (x + y + z) / 3) (x0,y0,z0) = (x - int i + t, y - int j + t, z - int k + t) where t = int (i + j + k) * g3 (i1,j1,k1,i2,j2,k2) = if x0 >= y0 then if y0 >= z0 then (1,0,0,1,1,0) else if x0 >= z0 then (1,0,0,1,0,1) else (0,0,1,1,0,1) else if y0 < z0 then (0,0,1,0,1,1) else if x0 < z0 then (0,1,0,0,1,1) else (0,1,0,1,1,0) xyz1 = (x0 - int i1 + g3, y0 - int j1 + g3, z0 - int k1 + g3) xyz2 = (x0 - int i2 + 2*g3, y0 - int j2 + 2*g3, z0 - int k2 + 2*g3) xyz3 = (x0 - 1 + 3*g3, y0 - 1 + 3*g3, z0 - 1 + 3*g3) (ii,jj,kk) = (i .&. 255, j .&. 255, k .&. 255) gi0 = rem (UV.unsafeIndex p (ii + UV.unsafeIndex p (jj + UV.unsafeIndex p kk ))) 12 gi1 = rem (UV.unsafeIndex p (ii + i1 + UV.unsafeIndex p (jj + j1 + UV.unsafeIndex p (kk + k1)))) 12 gi2 = rem (UV.unsafeIndex p (ii + i2 + UV.unsafeIndex p (jj + j2 + UV.unsafeIndex p (kk + k2)))) 12 gi3 = rem (UV.unsafeIndex p (ii + 1 + UV.unsafeIndex p (jj + 1 + UV.unsafeIndex p (kk + 1 )))) 12 {-# INLINE n #-} n gi (x'',y'',z'') = (/a -> if a < 0 then 0 else a*a*a*a*dot3 (UV.unsafeIndex grad3 gi) x'' y'' z'') $ 0.6 - x''*x'' - y''*y'' - z''*z'' harmonic :: Int -> (Double -> Double) -> Double harmonic octaves noise = f octaves / (2 - 1 / int (2 ^ (octaves - 1))) where f 0 = 0 f o = let r = 2 ^^ (o - 1) in noise r / r + f (o - 1) --3D simplex noise --syntax: simplex3D permutation number_of_octaves wavelength x y z simplex3D :: Permutation -> Int -> Double -> Double -> Double -> Double -> Double simplex3D p octaves l x y z = harmonic octaves (/f -> noise3D p (x * f / l) (y * f / l) (z * f / l))

Junto con la reducción de mi tamaño de porción a 8x8x128, la generación de nuevos trozos de terreno ahora se produce a unos 10-20 fps, lo que significa que moverse ya no es tan problemático como antes. Por supuesto, cualquier otra mejora en el rendimiento es bienvenida.

Lo que se destaca inicialmente es que tu código es altamente polimórfico. Debe especializar su tipo de coma flotante uniformemente en Double , para que GHC (y LLVM) tengan la oportunidad de aplicar optimizaciones más agresivas.

Nota, para aquellos que intentan reproducir, este código importa:

import qualified Data.Vector as V import Data.Bits import Data.Time.Clock import System.Random import System.Random.Shuffle type Permutation = V.Vector Int

De acuerdo. Hay muchas cosas que puedes intentar para mejorar este código.

Mejoras

Representación de datos

Especializarse en un tipo de punto de coma flotante concreto, en lugar de funciones de coma flotante polimórficas
Reemplazar tupla (a,a,a) con triple T !Double !Double !Double sin caja T !Double !Double !Double
Cambiar de Data.Array a Data.Array.Unboxed para las Permutations
Reemplazar el uso de matriz en caja de tripletas con matriz unboxed multidimensional del paquete repa

Banderas del compilador

Compilar con -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native (o equivalente con -fllvm )
Aumentar el umbral de constr especificación - -fspec-constr-count=16

Funciones de biblioteca más eficientes

Use mersenne-random en lugar de StdGen para generar randoms
Reemplazar mod con rem
Reemplazar V.! indexación con indización no comprobada VU.unsafeIndex (después de pasar a Data.Vector.Unboxed

Configuración de tiempo de ejecución

Aumente el área de asignación predeterminada: -A20M o -H

Además, verifique que su algoritmo sea idéntico al C # uno, y está usando las mismas estructuras de datos.