# HPC (High Performance Computing) bookmarks + [Learning and practice of high performance computing](https://github.com/cjmcv/hpc) + CFD + [Algebraic Flux Correction I. Scalar Conservation Laws](http://www.mathematik.tu-dortmund.de/lsiii/cms/papers/Kuzmin2011b.pdf) + [Algebraic Flux Correction II. Compressible Euler Equations](https://www.researchgate.net/publication/226067892_Algebraic_Flux_Correction_II_Compressible_Euler_Equations) + [CFD Notes by Hiroaki Nishikawa](http://www.cfdnotes.com/) + [CFD codes in f90](http://www.cfdbooks.com/cfdcodes.html) + [Tim Warburton's github repositories](https://github.com/tcew?tab=repositories) + [Nodal Discontinuous Galerkin](https://github.com/tcew/nodal-dg) + [Hybrid and Easy Discontinuous Galerkin Environment](https://mathema.tician.de/software/hedge/) + [**Element-based-Galerkin-Methods**](https://github.com/fxgiraldo/Element-based-Galerkin-Methods) + [Extreme-scale Discontinuous Galerkin Environment (EDGE)](https://github.com/3343/edge) + [The Development, Verification, and Validation of a Discontinuous Galerkin Method for the Navier-Stokes Equations](https://dataspace.princeton.edu/handle/88435/dsp01pk02cd90z) + [What are possible methods to solve compressible Euler equations](http://scicomp.stackexchange.com/questions/283/what-are-possible-methods-to-solve-compressible-euler-equations/305) + [I do like CFD, VOL.1, Second Edition](http://www.cfdbooks.com/cfdbooks.html) + [Free CFD Codes](http://www.cfdbooks.com/cfdcodes.html) + [CFD Julia: A Learning Module Structuring an Introductory Course on Computational Fluid Dynamics](https://www.mdpi.com/2311-5521/4/3/159/htm) + [CFD_Julia](https://github.com/surajp92/CFD_Julia) + Lattice Boltzmann codes + [A lattice Boltzmann code for complex fluids](https://github.com/ludwig-cf/ludwig) + [Hydrodynamics in OpenCL](http://christopheremoore.net/hydrodynamics-cl/) + [Roe, HLL, HLLC, Burgers Scheme; 1D, 2D, 3D; Euler Equation, (~Maxwell), (~MHD), ADM Solver in 
OpenCL](https://github.com/thenumbernine/HydrodynamicsGPU) + [Differential Geometry Tensor Library](https://github.com/thenumbernine/Tensor) + [Implementing the discontinuous Galerkin method in CUDA](https://github.com/martyfuhry/DGCUDA) + [Marty Fuhry's Homepage](http://www.martyfuhry.blogspot.co.uk/p/another-page.html) + [Master's Thesis: An Implementation of the Discontinuous Galerkin Method on Graphics Processing Units, defended May, 2013](https://uwspace.uwaterloo.ca/bitstream/handle/10012/7523/Fuhry_Martin.pdf?sequence=1) + [A GPU-accelerated adaptive discontinuous Galerkin method for level set equation](https://www.researchgate.net/publication/299471213_A_GPU-accelerated_adaptive_discontinuous_Galerkin_method_for_level_set_equation) + [A GPU Accelerated Discontinuous Galerkin Incompressible Flow Solver](https://www.researchgate.net/publication/322221361_A_GPU_Accelerated_Discontinuous_Galerkin_Incompressible_Flow_Solver) + [The Development, Verification, and Validation of a Discontinuous Galerkin Method for the Navier-Stokes Equations](https://dataspace.princeton.edu/handle/88435/dsp01pk02cd90z) + [SU2](http://su2.stanford.edu/) + [github repo for SU2](https://github.com/su2code/SU2) + [HiFiLES: High Fidelity Large Eddy Simulation](https://hifiles.stanford.edu/) + [github repo for HiFiLES](https://github.com/HiFiLES/HiFiLES-solver) + [PyWENO + PyPFASST](https://github.com/memmett) + [Clawpack Repositories](https://github.com/clawpack/) + [Riemann Problems and Jupyter Solutions](https://github.com/clawpack/riemann_book#installation) + [Shenfun is a high performance computing platform for solving partial differential equations (PDEs) by the spectral Galerkin method](https://github.com/spectralDNS/shenfun) + [Multilayered Abstractions for Partial Differential Equations by Graham Robert Markall: good review on Nektar++ etc.](http://www.big-grey.co.uk/g_markall_phd_thesis.pdf) + [Making Faster FEM Solvers, Faster MPhil Transfer Report By Graham 
Markall](http://www.doc.ic.ac.uk/~grm08/g_markall_mphil_transfer.pdf) + [Graham Markall](http://www.doc.ic.ac.uk/~grm08/) + Nektar++ + [Nektar++: An efficient h to p finite element framework](http://www.nektar.info/) + [Nektar++: a high-order finite element framework](https://xyloid.org/assets/talks/2014-06-ices.pdf) + [Simple (and not-so-simple) CFD solvers written in Fortran with Python plotting routines](https://github.com/JOThurgood/SimpleCFD) + [MAESTROeX solves the equations of low Mach number hydrodynamics for stratified atmospheres/full spherical stars with a general equation of state, and nuclear reaction networks in an adaptive-grid finite-volume framework. It includes reactions and thermal diffusion and can be used on anything from a single core to 100,000s of processor cores with MPI + OpenMP or 1,000s of GPUs](https://github.com/AMReX-Astro/MAESTROeX) + [Model stars and atmospheres with MAESTROeX](https://amrex-astro.github.io/MAESTROeX/) + [Is there a good tutorial or textbook-like source on implementing ENO/WENO with limiters in one (and more than one) dimension?](https://scicomp.stackexchange.com/questions/8706/is-there-a-good-tutorial-or-textbook-like-source-on-implementing-eno-weno-with-l/8709#8709) + [PyWENO](https://pyweno.readthedocs.io/en/latest/) + [blitzdg is an open-source library offering discontinuous Galerkin (dg) solvers for common partial differential equations systems using blitz++ for array and tensor manipulations in a C++ environment or NumPy as a Python 3 library](https://github.com/WQCG/blitzdg) + [Derek Steinmoeller's blog](https://dsteinmo.github.io/) + [DGSWEM V2](https://github.com/UT-CHG/dgswemv2) + [adaptive multiresolution DG](https://github.com/JuntaoHuang/adaptive-multiresolution-DG) + [Exasim - Generating Discontinuous Galerkin Codes For Extreme Scalable Simulations](https://github.com/exapde/Exasim) + Chebyshev Pseudo-Spectral Method (PSM) + [Chebyshev Polynomials J.C. Mason D.C. 
Handscomb](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-FM.pdf) + [Chapter 1. Definitions](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch01.pdf) + [Chapter 2. Basic Properties and Formulae](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch02.pdf) + [Chapter 3. The Minimax Property and Its Applications](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch03.pdf) + [Chapter 4. Orthogonality and Least-Squares Approximation](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch04.pdf) + [Chapter 5. Chebyshev Series](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch05.pdf) + [Chapter 6. Chebyshev Interpolation](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch06.pdf) + [Chapter 7. Near-Best L∞, L1 and Lp Approximations](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch07.pdf) + [Chapter 8. Integration Using Chebyshev Polynomials](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch08.pdf) + [Chapter 9. Solution of Integral Equations](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch09.pdf) + [Chapter 10. Solution of Ordinary Differential Equations](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch10.pdf) + [**Chapter 11. 
Chebyshev and Spectral Methods for Partial Differential Equations**](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch11.pdf) + [Chapter 12. Conclusion](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-Ch12.pdf) + [Appendices:](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-App.pdf) + [Summary of Notations, Definitions and Important Properties](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-App.pdf#[1,{%22name%22:%22FitH%22},690]) + [Tables of Coefficients](http://inis.jinr.ru/sl/M_Mathematics/MRef_References/Mason,%20Hanscomb.%20Chebyshev%20polynomials%20(2003)/C0355-App.pdf#[7,{%22name%22:%22FitH%22},677]) + [FFTW Discrete Cosine Transform Derivative](http://www.variousconsequences.com/2009/05/fftw-discrete-cosine-transform.html) + [FAST ALGORITHMS FOR DISCRETE POLYNOMIAL TRANSFORMS](https://www.ams.org/journals/mcom/1998-67-224/S0025-5718-98-00975-2/S0025-5718-98-00975-2.pdf) + [A New Method for Chebyshev Polynomial Interpolation Based on Cosine Transforms](https://link.springer.com/article/10.1007/s00034-015-0087-4) + [A brief introduction to pseudo-spectral methods: application to diffusion problems](https://arxiv.org/pdf/1606.05432.pdf) + [Spectral methods in python](http://cpraveen.github.io/teaching/chebpy.html) + [An Introduction to Domain Decomposition Methods: algorithms, theory and parallel implementation](https://hal.archives-ouvertes.fr/cel-01100932/file/bookddm.pdf) + [Chebyshev-Legendre Spectral Domain Decomposition Method for Two-Dimensional Vorticity Equations](https://www.cambridge.org/core/journals/communications-in-computational-physics/article/abs/chebyshevlegendre-spectral-domain-decomposition-method-for-twodimensional-vorticity-equations/18FEEF1F11DA2E8A8F134C8C2FE18052) + [Domain 
Decomposition Methods for Mortar Finite Elements](https://cs.nyu.edu/media/publications/TR2000-804.pdf) + [An efficient domain-decomposition pseudo-spectral method for solving elliptic differential equations](https://eprints.usq.edu.au/4568/) + [A Pseudospectral Multi-Domain Method for the Incompressible Navier-Stokes Equations](https://www.researchgate.net/publication/220395568_A_Pseudospectral_Multi-Domain_Method_for_the_Incompressible_Navier-Stokes_Equations) + [Deep Domain Decomposition Method: Elliptic Problems](https://arxiv.org/pdf/2004.04884.pdf) + [How to Design an Efficient Pseudospectral Code](https://www.math.ualberta.ca/~bowman/talks/caims19.pdf) + [code for **How to Design an Efficient Pseudospectral Code**](https://github.com/dealias/dns) + [Dedalus is a framework for solving a broad range of partial differential equations using spectral methods, including initial-value, boundary-value, and generalized eigenvalue problems](https://dedalus-project.org/about/) + [Dedalus is a flexible framework for solving partial differential equations using spectral methods](https://github.com/DedalusProject/dedalus) + [multiple-interval pseudospectral methods to solve optimal control problems](https://github.com/danielrherber/basic-multiple-interval-pseudospectral) + [pizza is a high-performance numerical code for quasi-geostrophic and non-rotating convection in a 2-D annulus geometry](https://github.com/magic-sph/pizza) + [FDBB (Fluid Dynamics Building Blocks) is a C++ expression template library for fluid dynamics](https://mmoelle1.gitlab.io/FDBB/) + [FDBB - Fluid Dynamics Building Blocks](https://gitlab.com/mmoelle1/FDBB) + Finite Element Methods (FEM) and Spectral Element Methods (SEM) + [deal.II — an open source finite element library](https://www.dealii.org/) + [Amandus: Simulations based on multilevel Schwarz methods Documentation](http://www.mathsim.eu/~gkanscha/amandus/) + [Feel++ finite element embedded library in C++](http://www.feelpp.org/) + [Feel++: 
Finite Element Embedded Library in C++](https://github.com/feelpp/feelpp) + [Veamy: an extensible object-oriented C++ library for the virtual element method](https://camlab.cl/software/veamy/) + [Veamy: an extensible object-oriented C++ library for the virtual element method](https://www.researchgate.net/publication/319057392_Veamy_an_extensible_object-oriented_C_library_for_the_virtual_element_method) + [Two dimensional high-order spectral element method fluid dynamics solver](https://github.com/horsescfd/HORSES2D) + [Two dimensional high-order spectral element method fluid dynamics solver](https://github.com/juanmanzanero/HORSES2D) + [github: ITHACA-SEM - In real Time Highly Advanced Computational Applications for Spectral Element Methods](https://github.com/mathLab/ITHACA-SEM) + [ITHACA-SEM - In real Time Highly Advanced Computational Applications for Spectral Element Methods](https://mathlab.sissa.it/ITHACA-SEM) + [AxiSEM is a parallel spectral-element method to solve 3D wave propagation in a sphere with axisymmetric or spherically symmetric visco-elastic, acoustic, anisotropic structures](https://github.com/geodynamics/axisem) + [HDGlab: An open-source implementation of the hybridisable discontinuous Galerkin method in MATLAB](https://ww2.lacan.upc.edu/scientificPublications/files/pdfs/ACME-GSH-20.pdf) + [HDGlab - A Matlab implementation of the hybridisable discontinuous Galerkin (HDG) method](https://git.lacan.upc.edu/hybridLab/HDGlab) + [Euler Equations for Ideal Gases](https://github.com/IANW-Projects/ConservationLaws/issues/11) + [Split form nodal discontinuous Galerkin schemes with summation-by-parts property for the compressible Euler equations](https://www.sciencedirect.com/science/article/pii/S0021999116304259) + Siemens + [Embedded Multicore Building Blocks (EMB²)](https://github.com/siemens/embb) + Maxeler + [Maxeler Technologies - Maximum Performance Computing](https://github.com/maxeler) + [AirfoilDFE - An unstructured mesh finite volume solver on 
DFE.](https://github.com/maxeler/Airfoil) + [Lattice QCD is a discretization of Quantum Chromodynamics](https://github.com/maxeler/LatticeQCD) + [LatticeBoltzmann](https://github.com/maxeler/LatticeBoltzmann) + [facilities to experiment with Discontinuous Petrov Galerkin (DPG) methods](https://github.com/jayggg/DPG) + [Research papers of Jay Gopalakrishnan](http://web.pdx.edu/~gjay/research/papers.html) + [Free CFD codes](https://www.cfd-online.com/Wiki/Codes) + [Code_Saturne](https://www.code-saturne.org/cms/download/Source-code-access) + [Large-Scale CFD Parallel Computing Dealing with Massive Mesh](https://www.hindawi.com/journals/je/2013/850148/) + [Open source tools in technical photorealistic large-scale visualisation](http://www.vtt.fi/inf/julkaisut/muut/2015/VTT-R-04911-15.pdf) + [An Open Source CFD-DEM Perspective](http://web.student.chalmers.se/groups/ofw5/Presentations/ChristophGonivaSlidesOFW5.pdf) + [3D, block structured, explicit/implicit, Navier-Stokes solver](https://github.com/mnucci32/aither) + [An evaluation of the Eigen linear algebra library for use in the aither CFD solver](https://github.com/mnucci32/eigenVsAither) + [A look at the performance of expression templates in C++: Eigen vs Blaze vs Fastor vs Armadillo vs XTensor](https://romanpoya.medium.com/a-look-at-the-performance-of-expression-templates-in-c-eigen-vs-blaze-vs-fastor-vs-armadillo-vs-2474ed38d982) + CFD + GPU + [Recent progress and challenges in exploiting graphics processors in computational fluid dynamics: slightly outdated but interesting](http://arxiv.org/pdf/1309.3018.pdf) + [Laplace solver running on GPU using CUDA, with CPU version for comparison, slightly outdated](https://github.com/kyleniemeyer/laplace_gpu) + PyFR + [PyFR is an open-source Python based framework for solving advection-diffusion type problems on streaming architectures using the Flux Reconstruction approach of Huynh](http://www.pyfr.org/) + 
[vincentlab/PyFR](https://github.com/vincentlab/PyFR) + [New PyFR Paper “Heterogeneous Computing on Mixed Unstructured Grids with PyFR”](http://www.techenablement.com/new-pyfr-paper-heterogeneous-computing-on-mixed-unstructured-grids-with-pyfr/) + [PyFR: An open source framework for solving advection–diffusion type problems on streaming architectures using the flux reconstruction approach](http://www.sciencedirect.com/science/article/pii/S0010465514002549) + [High Performance Parallelism Pearls Volume Two: Multicore and Many-core Programming Approaches](https://books.google.ru/books?id=MUZ0CAAAQBAJ&pg=PA261&lpg=PA261&dq=mako+python+examples+c%2B%2B&source=bl&ots=nBUXLR84mk&sig=FVLDhAaYRjzoEjDCQleT43deZv4&hl=en&sa=X&ved=0CC8Q6AEwCDgKahUKEwjCjMnGvprHAhXJjywKHQuiBnE#v=onepage&q=mako%20python%20examples%20c%2B%2B&f=false) + Camellia + [Camellia Discontinuous Petrov-Galerkin github repository](https://github.com/CamelliaDPG/Camellia/tree/master/docs/HPC_report) + Co-design at Lawrence Livermore National Lab + [Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH)](https://codesign.llnl.gov/lulesh.php) + [DoE Exascale Co-Design Center for Materials in Extreme Environments : Extreme Materials at Extreme Scale](http://www.exmatex.org/) + [Programming Models - Languages and tools for developing multi-scale applications.](http://www.exmatex.org/prog-models.html) + [Terra is a new low-level system programming language that is designed to interoperate seamlessly with the Lua programming language](http://terralang.org/) + [List of quantum chemistry and solid-state physics software](https://en.wikipedia.org/wiki/List_of_quantum_chemistry_and_solid-state_physics_software) + CP2K + [Mirror of official svn repository at sourceforge. 
Synced every 5 minutes.](https://github.com/cp2k/cp2k) + [Accelerated Sparse Matrix Multiplication for Quantum Chemistry with CP2K on Hybrid Supercomputers](https://www.youtube.com/watch?v=5wppMHxF_Js) + [Evaluation of C, Go, and Rust in the HPC environment](https://news.ycombinator.com/item?id=9477014) + Modern Fortran + [NNSA, national labs team with Nvidia to develop open-source Fortran compiler technology](https://www.llnl.gov/news/nnsa-national-labs-team-nvidia-develop-open-source-fortran-compiler-technology) + [Flang is a ground-up implementation of a Fortran front end written in modern C++. It started off as the f18 project](https://github.com/llvm/llvm-project/tree/master/flang/) + [F18 is a front-end for Fortran intended to replace the existing front-end in the Flang compiler](https://github.com/flang-compiler/f18) tl;dr 301 moved: the code from this repository can now be found at [flang](https://github.com/llvm/llvm-project/tree/master/flang/) + [Flang and F18](https://github.com/flang-compiler/flang/wiki) + [Installing LLVM Flang Fortran compiler](https://www.scivision.dev/flang-compiler-build-tips/) tl;dr ```sh git clone https://github.com/llvm/llvm-project mkdir -p llvm-project/build cd llvm-project/build cmake ../llvm -DLLVM_ENABLE_PROJECTS=flang ``` + [Unknown CMake command “tablegen”](https://stackoverflow.com/questions/59691069/unknown-cmake-command-tablegen) + [libCEED: the CEED Library: Code for Efficient Extensible Discretization](https://github.com/CEED/libCEED) + [CEED Library: Code for Efficient Extensible Discretization](https://ceed.exascaleproject.org/software/) + [MFEM is a free, lightweight, scalable C++ library for finite element methods](https://mfem.org/) + [MFEM is a free, lightweight, scalable C++ library for finite element methods: examples](https://mfem.org/examples/) + [**GPU support in MFEM**](https://mfem.org/gpu-tips-n-tricks/) + [Finite Element Discretization Library (mfem)](https://github.com/mfem/mfem) + [High-order Lagrangian Hydrodynamics Miniapp](https://github.com/CEED/Laghos) + [Modern trends in programming of GPUs DAQFEET 2021](https://indico.cern.ch/event/974424/contributions/4158315/attachments/2186808/3695101/modern-gpu.pdf) + [Toward Performance-Portable PETSc for GPU-based Exascale Systems](https://arxiv.org/pdf/2011.00715.pdf) + AMG + AMG intro + [Iteration methods](https://encyclopediaofmath.org/wiki/Iteration_methods) + [Algebraic multigrid method by smoothed agglomeration for a Stokes problem](http://perso.unifr.ch/ales.janka/papers/emg_slides.pdf) + [Convergence of Algebraic Multigrid Based on Smoothed Aggregation II: Extension to a Petrov-Galerkin Method](https://hal.inria.fr/inria-00072986) + [Lawrence Livermore National Laboratory Robert D. Falgout Center for Applied Scientific Computing An Algebraic Multigrid Tutorial](http://user.it.uu.se/~maya/Courses/NLA_Parallel/Slides_2013/AMG_parallel_Falgout.pdf) + [An Introduction to Algebraic Multigrid](https://www2.karlin.mff.cuni.cz/~hron/NMNV532/An_Introduction_to_Algebraic_Multigrid_Computing-Falgout-2006.pdf) + [An Algebraic Multigrid Tutorial IMA Tutorial – Fast Solution Techniques, November 28-29, 2010](http://user.it.uu.se/~maya/Courses/NLA_Parallel/Slides_2013/AMG_parallel_Falgout.pdf) + [Multigrid Methods: From Geometrical to Algebraic Versions Gundolf HAASE](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.453.4097&rep=rep1&type=pdf) + [A root-node based algebraic multigrid method](https://arxiv.org/pdf/1610.03154.pdf) + [Iterative methods for linear, non-linear and eigenvalue problems](http://www.mcc.uiuc.edu/summerschool/2001/Eric%20de%20Sturler/desturler.htm) + [A Multigrid Tutorial by William L. 
Briggs](https://www.math.ust.hk/~mawang/teaching/math532/mgtut.pdf) + [Algebraic Multigrid Code](https://scicomp.stackexchange.com/questions/1300/algebraic-multigrid-code) + [Performance of Preconditioners for Large-Scale Simulations Using Nek5000](https://link.springer.com/chapter/10.1007/978-3-030-39647-3_20) + [Reducing Complexity in Parallel Algebraic Multigrid Preconditioners, Hans de Sterck, Ulrike Meier Yang and Jeffrey J. Heys](http://www.math.uwaterloo.ca/~hdesterc/websiteW/Data/publications/journal/pmisPreprint.pdf) + [3.2.5. Block Compressed Sparse Row Format (BSR)](https://docs.nvidia.com/cuda/cusparse/index.html#bsr-format) + [I don't find the LU decomposition on the device with cuSolver](https://stackoverflow.com/questions/32242677/i-dont-find-the-lu-decomposition-on-the-device-with-cusolver) + [AMGX](https://github.com/NVIDIA/AMGX) + [AMGX in Julia](https://github.com/JuliaGPU/AMGX.jl) + [pyamgx: Python interface to NVIDIA's AMGX library](https://github.com/shwina/pyamgx) + [pyamgx - GPU accelerated multigrid library for Python](https://pyamgx.readthedocs.io/en/latest/) + [AmgXWrapper](https://github.com/barbagroup/AmgXWrapper) + [An example and benchmark of AmgX and PETSc with Poisson system](https://github.com/barbagroup/AmgXWrapper/blob/master/example/poisson/src/main.cpp) + [PetIBM - toolbox and applications of the immersed-boundary method on distributed-memory architectures](https://github.com/barbagroup/PetIBM) + [geoclaw-landspill](https://github.com/barbagroup/geoclaw-landspill) + [High-productivity, high-performance workflow for virus-scale electrostatic simulations with Bempp-Exafmm](https://github.com/barbagroup/bempp_exafmm_paper) + [Alexa: Simulating Shock Hydrodynamics on the GPU using Kokkos](https://www.osti.gov/servlets/purl/1510909) + [GPGPU acceleration a case study of algebraic multigrid preconditioned GMRES](https://pure.tue.nl/ws/portalfiles/portal/142433633/Master_Thesis_Report_Lucas_Bekker_final_.pdf) + [AmgX: A Library for 
GPU Accelerated Algebraic Multigrid and Preconditioned Iterative Methods](https://www.researchgate.net/publication/283330199_AmgX_A_Library_for_GPU_Accelerated_Algebraic_Multigrid_and_Preconditioned_Iterative_Methods) + [Comparison of AMGX and Hypre](https://github.com/NVIDIA/AMGX/issues/112) + [rocALUTION is a sparse linear algebra library with focus on exploring fine-grained parallelism](https://rocalution.readthedocs.io/en/master/usermanual.html) + [amgcl](https://github.com/ddemidov/amgcl) + [amgcl](https://amgcl.readthedocs.io/en/latest/) + [C++ library for solving large sparse linear systems with algebraic multigrid method](https://bestofcpp.com/repo/ddemidov-amgcl-cpp-scientific-computing) + [Triggering C++11 support in NVCC with CMake](https://stackoverflow.com/questions/36551469/triggering-c11-support-in-nvcc-with-cmake) tl;dr ```diff diff --git a/CMakeLists.txt b/CMakeLists.txt index 6ca3264..b63e326 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -161,9 +161,9 @@ if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" OR CMAKE_CXX_COMPILER_ID MATCHES "MSVC") if (CMAKE_CXX_COMPILER_ID MATCHES "GNU") list(APPEND CUDA_NVCC_FLAGS - ${CUDA_ARCH_FLAGS} -std=c++11 -Wno-deprecated-gpu-targets) + ${CUDA_ARCH_FLAGS} -std=c++17 -Wno-deprecated-gpu-targets) - list(APPEND CUDA_NVCC_FLAGS -Xcompiler -std=c++11 -Xcompiler -fPIC -Xcompiler -Wno-vla) + list(APPEND CUDA_NVCC_FLAGS -Xcompiler -std=c++17 -Xcompiler -fPIC -Xcompiler -Wno-vla) endif() add_library(cuda_target INTERFACE) ``` + [Stokes problem gives NaN by AMG but GMRES works fine](https://github.com/ddemidov/amgcl/issues/144) + [Pressure projection solver for Incompressible Navier-Stokes FEM](https://github.com/ddemidov/amgcl/issues/151) + [**how to perform matrix construction in GPU devices without the data transfer**](https://github.com/ddemidov/amgcl/issues/164) + [Block preconditioners](https://github.com/ddemidov/amgcl/issues/37) + 
[amg_corrector_solver](https://github.com/Andlon/crest/blob/master/include/crest/basis/amg_corrector_solver.hpp) + [schur pressure correction](https://github.com/ddemidov/cppstokes_benchmarks/blob/master/amgcl_spc_pre.cpp) + [code accompanying "Accelerating linear solvers for Stokes problems with C++ metaprogramming"](https://github.com/ddemidov/cppstokes_benchmarks/) + [Accelerating linear solvers for Stokes problems with C++ metaprogramming](https://arxiv.org/pdf/2006.06052.pdf) + [SPARSH-AMG](https://github.com/cmgcds/SParSH-AMG) + [SPARSH-AMG: A LIBRARY FOR HYBRID CPU-GPU ALGEBRAIC MULTIGRID AND PRECONDITIONED ITERATIVE METHODS](https://arxiv.org/pdf/2007.00056.pdf) + [Ginkgo is a high-performance linear algebra library for manycore systems, with a focus on sparse solution of linear systems. It is implemented using modern C++ (you will need at least C++14 compliant compiler to build it), with GPU kernels implemented in CUDA and HIP. HAS support for AMG](https://github.com/ginkgo-project/ginkgo) + [GPGPU acceleration - a case study of algebraic multigrid preconditioned GMRES](https://pure.tue.nl/ws/portalfiles/portal/142433633/Master_Thesis_Report_Lucas_Bekker_final_.pdf) + [BootCMatchG](https://github.com/bootcmatch/BootCMatchG) + [multigrid solver for solving elliptic PDEs using finite differences on a rectangular grid](https://github.com/jesserobertson/multigrid) + [Multigrid HowTo (Part I): A simple Multigrid solver in C++ in less than 200 lines of code](https://www10.cs.fau.de/publications/reports/TechRep_2008-03.pdf) + [Multigrid HowTo (Part II): An Open Source Algebraic Multigrid Solver in C++](https://www10.cs.fau.de/publications/reports/TechRep_2009-02.pdf) + [Multigrid solver prototype (GMG) and simple Lid Cavity solver](https://discourse.julialang.org/t/multigrid-solver-prototype-gmg-and-simple-lid-cavity-solver/41969) + [ExaStencils: Advanced Multigrid Solver Generation](https://link.springer.com/chapter/10.1007/978-3-030-47956-5_14) + [EvoStencils - 
Constructing efficient multigrid solvers through evolutionary computation](https://github.com/jonas-schmitt/evostencils) + Sparse Linear System Solvers on GPUs + [SPARSE LINEAR SYSTEM SOLVERS ON GPUS: PARALLEL PRECONDITIONING, WORKLOAD BALANCING, AND COMMUNICATION REDUCTION](https://www.tdx.cat/bitstream/handle/10803/667096/2019_Tesis_Flegar_Goran.pdf) + [High performance sparse multifrontal solvers on modern GPUs](https://www.sciencedirect.com/science/article/abs/pii/S0167819122000059) + [STRUMPACK -- STRUctured Matrix PACKage, Copyright (c) 2014-2021](https://github.com/pghysels/strumpack) + [How SpaceX uses GPUs to simulate rocket engines](http://habrahabr.ru/post/256081/) + [Rockets Shake And Rattle, So SpaceX Rolls Homegrown CFD](http://www.nextplatform.com/2015/03/27/rockets-shake-and-rattle-so-spacex-rolls-homegrown-cfd/) + [Modern C++ Parallel Task Programming](https://github.com/cpp-taskflow/cpp-taskflow) + [docs for Modern C++ Parallel Task Programming](https://cpp-taskflow.github.io/cpp-taskflow/index.html) + [Freud, a tool to create Performance Annotations for C/C++ programs](https://github.com/usi-systems/freud) + [Eyal Rozenberg, Ph.D.](https://eyalroz.github.io/) + [Eyal Rozenberg](https://github.com/eyalroz) + [Thin C++-flavored wrappers for the CUDA APIs: Runtime, Driver, NVRTC and NVTX](https://github.com/eyalroz/cuda-api-wrappers) + [GPU Kernel Runner](https://github.com/eyalroz/gpu-kernel-runner) + [RAPIDS - Open GPU Data Science](https://github.com/rapidsai) + [RAFT: Reusable Accelerated Functions and Tools](https://github.com/rapidsai/raft) + [cuDF - GPU DataFrames](https://github.com/rapidsai/cudf) tl;dr ```sh cd cpp && mkdir -p build && cd build && cmake .. 
-DCMAKE_BUILD_TYPE=Release -DOPENSSL_INCLUDE_DIR=/usr/include/openssl -DOPENSSL_CRYPTO_LIBRARY=/usr/lib/libcrypto.so -DOPENSSL_SSL_LIBRARY=/usr/lib/libssl.so ``` + [cuSpatial - GPU-Accelerated Spatial and Trajectory Data Management and Analytics Library](https://github.com/rapidsai/cuspatial) + CUDA rehab & NVidia docs + [Documentation of NVIDIA chip/hardware interfaces](https://github.com/NVIDIA/open-gpu-doc) + [CS344 : CUDA Programming in C](https://classroom.udacity.com/courses/cs344) + [UD281 : High Performance Computing](https://classroom.udacity.com/courses/ud281) + [Parallel Computer Architecture and Programming (CMU 15-418/618)](http://15418.courses.cs.cmu.edu/spring2016/) + [Parallel Computer Architecture and Programming (CMU 15-418/618)](https://github.com/cmu15418) + [CMU 15418 Assignment 1: Analyzing Program Performance on a Multi-Core CPU](https://github.com/cmu15418/assignment1) + [Assignment 1: Analyzing Program Performance on a Multi-Core CPU](http://15418.courses.cs.cmu.edu/spring2016/article/3) + [Assignment 2: A Simple CUDA Renderer](http://15418.courses.cs.cmu.edu/spring2017/article/4) + [Course on CUDA Programming on NVIDIA GPUs, July 22-26, 2019](https://people.maths.ox.ac.uk/gilesm/cuda/) + [Lecture 3: control flow and synchronisation: Warp divergence](https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec3-2x2.pdf) + [Is branch divergence really so bad?](https://stackoverflow.com/questions/17223640/is-branch-divergence-really-so-bad) + [Lecture 5: libraries and tools](https://people.maths.ox.ac.uk/gilesm/cuda/lecs/lec5.pdf) + [Maximizing Unified Memory Performance in CUDA](https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/) + [CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES Stephen Jones, GTC 2017](http://on-demand.gputechconf.com/gtc/2017/presentation/s7122-stephen-jones-cuda-optimization-tips-tricks-and-techniques.pdf) + [HIGH THROUGHPUT WITH 
GPUS](https://indico.cern.ch/event/764011/contributions/3214768/attachments/1755004/2845106/RAPID_workshop_20181119.pdf) + [Small tips of optimizing CUDA programs](https://nanxiao.me/en/small-tips-of-optimizing-cuda-programs/) + [Error using __ldg in cuda kernel at compile time](https://stackoverflow.com/questions/24069524/error-using-ldg-in-cuda-kernel-at-compile-time) tl;dr ```sh nvcc -arch=sm_35 ... ``` + [Open-Arch-Group](https://github.com/Open-Arch-Group) + [Matrix Multiplication (MMul) Benchmarks](https://github.com/Open-Arch-Group/mmul) + [Performance engineer that's always happy to answer questions!](https://github.com/CoffeeBeforeArch) + [GPGPU Programming with CUDA](https://github.com/CoffeeBeforeArch/cuda_programming) + [From Scratch: Histograms in CUDA using Atomics](https://www.youtube.com/watch?v=DaEmuL0PYxc) + [Parallel Programming in Modern C++](https://github.com/CoffeeBeforeArch/parallel_programming) + [This program shows off the basics of stop tokens in C++20](https://github.com/CoffeeBeforeArch/parallel_programming/blob/master/basics/jthread/stop_token.cpp) + [Matrix multiplication in cuSparse (cusparseDcsrgemm) outputs wrong results](https://stackoverflow.com/questions/57385060/matrix-multiplication-in-cusparse-cusparsedcsrgemm-outputs-wrong-results) + [C++ (Cpp) cusparseDcsrgemm usage examples](https://cpp.hotexamples.com/ru/examples/-/-/cusparseDcsrgemm/cpp-cusparsedcsrgemm-function-examples.html) + [Problem of two large sparse matrices multiplication in cuParse](https://forums.developer.nvidia.com/t/problem-of-two-large-sparse-matrices-multiplication-in-cuparse/33316/4) + [spgemm_example.c](https://github.com/NVIDIA/CUDALibrarySamples/blob/master/cuSPARSE/spgemm/spgemm_example.c) + [CusparseManager.cu](https://github.com/sintefmath/equelle/blob/master/backends/cuda/src/CusparseManager.cu) + [how to cast thrust::device_vector<int> to raw 
pointer](https://stackoverflow.com/questions/11113485/how-to-cast-thrustdevice-vectorint-to-raw-pointer) + [Parallel computing using the MPI, OpenMP, and OpenACC standards](https://www.youtube.com/playlist?list=PL-_cKNuVAYAWPC1WfK7_6v-gFOm4i7RKy) + Memory Model + [C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?](https://stackoverflow.com/questions/6319146/c11-introduced-a-standardized-memory-model-what-does-it-mean-and-how-is-it-g?rq=1) + [A Primer on Memory Consistency and Cache Coherence](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.225.9278&rep=rep1&type=pdf) + [LPC2018 - Open Source GPU compute stack - Not dancing the CUDA dance](https://www.youtube.com/watch?v=d94N2Lu4x9s) + OpenCL + [OpenCL 3.0 Specification Released With New Khronos Open-Source OpenCL SDK](https://www.phoronix.com/scan.php?page=news_item&px=OpenCL-3.0-Released-SDK) + [The State of OpenCL for Scientific Computing in 2018](https://mathema.tician.de/the-state-of-opencl-for-scientific-computing-in-2018/) + [OpenCL: History & Future](http://www.fz-juelich.de/SharedDocs/Downloads/IAS/JSC/EN/slides/opencl/opencl-10-history-future.pdf?__blob=publicationFile) + [Tuned OpenCL BLAS](https://github.com/CNugteren/CLBlast) + [CLBlast: A Tuned BLAS Library for Faster Deep Learning](https://cnugteren.github.io/downloads/CLBlast_GTC.pdf) + [OpenCL vloadn](https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/vloadn.html) + [Could not find a package configuration file provided by "OpenCLHeaders"](https://github.com/KhronosGroup/OpenCL-CLHPP/issues/173) + [Using OpenCL on Adreno & Mali GPUs is slower than CPU](https://github.com/ggerganov/llama.cpp/issues/5965) + [Zero copy buffer allocation on arm mali midgard gpus?](https://stackoverflow.com/questions/58481560/zero-copy-buffer-allocation-on-arm-mali-midgard-gpus) + SYCL - C++ Single-source Heterogeneous Programming for OpenCL + [Khronos SYCL](https://www.khronos.org/sycl/) 
+ [An open-source implementation of OpenCL SYCL from Khronos Group](https://github.com/triSYCL/triSYCL) + [codeplaysoftware](https://github.com/codeplaysoftware) + [SYCL BLAS](https://github.com/codeplaysoftware/sycl-blas) + [SYCL DNN](https://github.com/codeplaysoftware/SYCL-DNN) + [SYCL VisionCpp](https://github.com/codeplaysoftware/visioncpp) + [Implementation of the SYCL specification.](https://github.com/ProGTX/sycl-gtx) + [Building a brain with SYCL and modern C++](https://www.semanticscholar.org/paper/Building-a-brain-with-SYCL-and-modern-C%2B%2B-Smithe-Potter/01cd48cda17008640076323b8ea10ac59a8b6509) + OneAPI + [Run simple DPC++ application](https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md#run-simple-dpc-application) + [oneAPI Direct Programming](https://github.com/zjin-lcf/oneAPI-DirectProgramming) + [Port a CUDA App to oneAPI and DPC++ in 5 Minutes](https://www.codeproject.com/Articles/5284841/Port-a-CUDA-App-to-oneAPI-and-DPCplusplus-in-5-Min) + [How to run dpc++ code on Intel HD Graphic atop Nvidia GPU](https://community.intel.com/t5/Intel-oneAPI-Data-Parallel-C/How-to-run-dpc-code-on-Intel-HD-Graphic-atop-Nvidia-GPU/m-p/1182497#M374) + Kompute + [The general purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends)](https://kompute.cc/) + [Kompute github repo](https://github.com/KomputeProject/kompute) + HCC is an Open Source, Optimizing C++ Compiler for Heterogeneous Compute currently for the ROCm GPU Computing Platform + [Why did AMD open source ROCm’s OpenCL driver-stack?](https://streamhpc.com/blog/2017-05-21/amd-open-sourced-rocms-opencl-driver-stack/) + [wiki for HCC](https://github.com/RadeonOpenCompute/hcc/wiki) + [github HCC repository](https://github.com/RadeonOpenCompute/hcc) + [Portable Computing Language](http://portablecl.org/) + [A collection of Arch Linux PKGBUILDS for the ROCm platform](https://github.com/rocm-arch/rocm-arch) tl;dr ```sh yay -S rocm-opencl-runtime ``` + [aur 
package rocm-opencl-runtime](https://aur.archlinux.org/packages/rocm-opencl-runtime/) + [Arch GPGPU](https://wiki.archlinux.org/index.php/GPGPU) + [Arch ROCm](https://wiki.archlinux.org/index.php/GPGPU#ROCm) + [ROCm for Arch Linux](https://github.com/rocm-arch/rocm-arch) + [rocm OpenCL Programming Guide](https://rocmdocs.amd.com/en/latest/Programming_Guides/Opencl-programming-guide.html#amd-rocm-implementation) + [clinfo ERROR: clBuildProgram(-11)](https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/issues/110) + [rock-dkms kernel vs mainline clarification](https://github.com/RadeonOpenCompute/ROCm/issues/816) + [Error during installation of rock-dkms 4.0 on 5.4 kernel](https://github.com/RadeonOpenCompute/ROCm/issues/1367) + [dkms build on unsported kernel and supported which makes errors](https://github.com/RadeonOpenCompute/ROCm/issues/1311) + [ROCm support in upstream Linux kernels](https://github.com/RadeonOpenCompute/ROCm#rocm-support-in-upstream-linux-kernels) + [Information for rock-dkms](https://repology.org/project/rock-dkms/information) + [Radeon ROCm 4.1 Released - Still Without RDNA GPU Support](https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/1246716-radeon-rocm-4-1-released-still-without-rdna-gpu-support/page5) + [ROCm 4.1 - Vega 20 (Radeon VII) with upstream amdgpu](https://githubmemory.com/@FilipVaverka) + [AMD dkms fails](https://bbs.archlinux.org/viewtopic.php?id=258940) ```sh dkms install --no-depmod -m amdgpu-4.0 -v 23 -k 5.11.16-arch1-1 Error! Bad return status for module build on kernel: 5.11.16-arch1-1 (x86_64) Consult /var/lib/dkms/amdgpu-4.0/23/build/make.log for more information. 
==> Warning, `dkms install --no-depmod -m amdgpu-4.0 -v 23 -k 5.11.16-arch1-1' returned 10 pacman -Qo /usr/src/amdgpu-4.0-23 /usr/src/amdgpu-4.0-23/ is owned by rock-dkms-bin 4.0-3 /usr/src/amdgpu-4.0-23/ is owned by rock-dkms-firmware-bin 4.0-3 ``` + [Radeon Instinct like : Radeon VII](https://www.ixbt.com/3dv/amd-radeon-vii-review.html) + [RTX 2080 vs. Radeon VII vs. 5700 XT: Rendering and Compute Performance](https://www.extremetech.com/computing/297167-rtx-2080-vs-radeon-vii-vs-5700-xt-rendering-and-compute-performance) + [AMD Radeon VII Review: This Isn’t the 7nm GPU You’re Looking For](https://www.extremetech.com/computing/285286-amd-radeon-vii-review-this-isnt-the-7nm-gpu-youre-looking-for) + [Is a used Radeon VII worth it in 2020?](https://www.quora.com/Is-a-used-Radeon-VII-worth-it-in-2020) + [AMD Radeon Instinct MI50 1725MHz PCI-E 4.0 16384MB 1000MHz 4096 bit](https://market.yandex.ru/product--videokarta-amd-radeon-instinct-mi50-1725mhz-pci-e-4-0-16384mb-1000mhz-4096-bit/674247125?text=AMD%20Radeon%20VII) + [hipSYCL - a SYCL implementation for CPUs and GPUs](https://github.com/illuhad/hipSYCL) + [hipSYCL performance](https://githubmemory.com/repo/FilipVaverka/hipSYCL#performance) + OpenCL => Vulkan + [a prototype implementation of OpenCL 1.2 on top of Vulkan using clspv as the compiler](https://github.com/kpet/clvk) + [**clspv** is a prototype compiler for a subset of OpenCL C to Vulkan compute shaders](https://github.com/google/clspv) + [How To Set The CPU Affinity Of A Running Process In Linux](https://www.youtube.com/watch?v=9VJRsBmmY-4&feature=youtu.be) + OpenMP + [We waited and waited, and it is finally here! 
OpenMP 4.0](http://habrahabr.ru/company/intel/blog/204668/) + [Parallelization of a prefix sum (Openmp)](https://stackoverflow.com/questions/35821844/parallelization-of-a-prefix-sum-openmp) + [Parallel Prefix Sum (Scan) with CUDA](http://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf) + [Parallel prefixsum algorithm in fastflow](https://github.com/pinkgopher/prefixsum) + [GPU prefix scan](https://github.com/mark-poscablo/gpu-prefix-sum/blob/master/scan_standalone/scan.cu) + OpenACC + [IPMACC is a framework for translating/executing OpenACC for C API to/over CUDA or OpenCL runtime](https://github.com/lashgar/ipmacc) + [IPMACC – An Open Source OpenACC to CUDA/OpenCL Translator](http://www.techenablement.com/ipmacc-open-source-openacc-cudaopencl-translator/) + MATOG - GPU Access Auto Tuning + [MATOG Auto-Tuning on GPUs is a tool to automatically optimize performance of NVIDIA CUDA code](https://www.gcc.tu-darmstadt.de/home/proj/matog/) + [MATOG preprint](https://tuprints.ulb.tu-darmstadt.de/6507/) + [MATOG: CUDA Array Access Auto-Tuner](https://github.com/mergian/matog) + [OCCA (Open Concurrent Compute Abstraction)](http://libocca.org/) + [github repository for OCCA](https://github.com/libocca/occa) + [LCSE - Linked Cluster Series Expansions - a framework for high-temperature series expansions](http://comp-phys.org/lcse/) + [VLI is a library for high but fixed precision (128 to 512-bit) arithmetic and symbolic polynomial computations](http://comp-phys.org/vli/) + [Series Expansion Methods for Quantum Lattice Models](https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/123831/eth-50186-02.pdf) + Apache Arrow + [Apache Arrow is a development platform for in-memory analytics. 
It contains a set of technologies that enable big data systems to process and move data fast.](https://github.com/apache/arrow) + Sandia + [Trilinos is a collection of open-source software libraries, called packages, intended to be used as building blocks for the development of scientific applications.](https://en.wikipedia.org/wiki/Trilinos) + [github repo for Trilinos](https://github.com/trilinos/Trilinos) tl;dr ``` $ yay -s trilinos 3 aur/trilinos 12.14.1-2 (+0 0.00%) algorithms for the solution of large-scale scientific problems 2 aur/mingw-w64-trilinos 12.12.1-1 (+0 0.00%) Framework for the solution of large-scale, complex multi-physics engineering and scientific problems (mingw-w64) 1 aur/trilinos-git 12.12.0.gd3b096f4f1-1 (+1 0.00%) (Out-of-date 2019-06-21) An effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems. ``` + [Add option to turn off the install of gtest header and lib even if Gtest package is enabled](https://github.com/trilinos/Trilinos/issues/5341) + ARM + [The ARM Computer Vision and Machine Learning library](https://github.com/ARM-software/ComputeLibrary) + [HPCG for Arm](https://github.com/ARM-software/HPCG_for_Arm) + [Parallelizing HPCG's main kernels](https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/parallelizing-hpcg) + ARM Neon + [Coding for ARM NEON: How to start?](https://stackoverflow.com/questions/28547697/coding-for-arm-neon-how-to-start) + [SIMD Assembly Tutorial:ARM NEON](https://people.xiph.org/~tterribe/daala/neon_tutorial.pdf) + [ARM NEON skinning](https://habr.com/en/post/153015/) + CPU, GPU & DRAM Architecture Simulators + [GPGPU-Sim](http://www.gpgpu-sim.org/) + [Integrated gem5 + GPGPU-Sim Simulator](http://cpu-gpu-sim.ece.wisc.edu/) + [Getting gem5](http://www.m5sim.org/Download) + [SimpleScalar LLC](http://www.simplescalar.com/) + [SimpleScalar LLC 
Intro](http://www.ecs.umass.edu/ece/koren/architecture/Simplescalar/SimpleScalar_introduction.htm) + [Todd Austin : the author](http://web.eecs.umich.edu/~taustin/) + [DRAMSim2](http://www.eng.umd.edu/~blj/dramsim/) + [github repos for DRAMSim2 etc. from University of Maryland](https://github.com/dramninjasUMD) + [Write-back vs Write-Through](https://stackoverflow.com/questions/27087912/write-back-vs-write-through) + [Study of Different Cache Line Replacement Algorithms in Embedded Systems](https://people.kth.se/~ingo/MasterThesis/ThesisDamienGille2007.pdf) + [Chisel: Constructing Hardware in a Scala Embedded Language](https://chisel.eecs.berkeley.edu/) + [UC Berkeley Architecture Research](https://github.com/ucb-bar) + [The RISC-V Instruction Set Architecture](http://riscv.org) + [Rocket Chip Generator](http://riscv.org/download.html#tab_rocket) + [Rocket Microarchitectural Implementation of RISC-V ISA](https://github.com/ucb-bar/rocket) + [Rocket uncore: L2 cache, etc.](https://github.com/ucb-bar/uncore) # CUDA and friends related surveys, papers + [A Survey of CPU-GPU Heterogeneous Computing Techniques](https://www.academia.edu/12355899/A_Survey_of_CPU-GPU_Heterogeneous_Computing_Techniques) + [Hybrid implementation of the MST algorithm using CPU and GPU](http://habrahabr.ru/post/253031/) + [Understanding shared memory bank conflicts in NVIDIA CUDA](http://habrahabr.ru/post/100363/) + [Vulkan: The next Khronos graphics API… that is not OpenGL](http://anki3d.org/vulkan-the-next-khronos-graphics-api-that-is-not-opengl/) + [AMD supported project: HIP : Convert CUDA to Portable C++ Code](https://github.com/ROCm-Developer-Tools/HIP) + [Examples for HIP](https://github.com/ROCm-Developer-Tools/HIP-Examples) # DSLs targeting GPU + [CARP: Correct and Efficient Accelerator Programming](http://carp.doc.ic.ac.uk/external/news.php) + [CARP dissemination](http://carp.doc.ic.ac.uk/external/dissemination.php) + [A taste of CARP: benchmark analysis, language design 
and kernel verification](http://www.cs.bris.ac.uk/Research/Micro/UKMAC2012/UKMAC12_Kravets_ARM.pdf) + PENCIL: a C99-based intermediate language for compute & optimization + [PENCIL summary in one slide: poster](http://carp.doc.ic.ac.uk/external/publications/posters/HiPEAC2013.pdf) + [PENCIL: A Platform-Neutral Language for Accelerator Programming](http://www.many-core.group.cam.ac.uk/ukmac2014/UKMAC2014_04_Grevendonk.pdf) + [PENCIL support in pet and PPCG](http://www.researchgate.net/profile/Sven_Verdoolaege/publication/273911354_PENCIL_support_in_pet_and_PPCG/links/551031d20cf27d62b913cc0b.pdf) + see also PPCG (below) + [Framework for performance-portable parallel computations on unstructured meshes](https://github.com/OP2/PyOP2) + [OP2: Developing an open-source framework for the execution of unstructured grid applications](http://www.oerc.ox.ac.uk/projects/op2) + [Optimising Unstructured Mesh Computational Fluid Dynamics Applications on Multicores via Machine Learning and Code Transformation](http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/r.rusitoru.pdf) + [Compiler Optimizations for Industrial Unstructured Mesh CFD Applications on GPUs](https://www.oerc.ox.ac.uk/sites/default/files/uploads/profile-pages/Gihan/op2-lcpc.pdf) + [Copperhead Data Parallel Python](https://copperhead.github.io/) + [github CU copperhead](https://github.com/copperhead) + [Delite](https://github.com/stanford-ppl/Delite) + [Scalan](https://github.com/scalan) + [Scalan Community Edition](https://github.com/scalan/scalan-ce) + [Generating Performance Portable Code using Rewrite Rules: From High-level Functional Expressions to High-Performance OpenCL Code](http://homepages.inf.ed.ac.uk/slindley/papers/array-gpu-draft-february2015.pdf) + [Performance Comparison of GPU, DSP and FPGA implementations of image processing and computer vision algorithms in embedded systems, Fykse, Egil](http://brage.bibsys.no/xmlui/handle/11250/256108) + ROSE compiler + Mint for C-to-CUDA code 
generation + [ROSE compiler github](https://github.com/rose-compiler) + MINT + [ROSE project MINT](https://github.com/rose-compiler/rose/tree/master/projects/mint) + [MINT google project](https://sites.google.com/site/mintmodel/) + [Mint: Realizing CUDA performance in 3D Stencil Methods with Annotated C: claims 78% of handwritten CUDA performance](http://cseweb.ucsd.edu/groups/hpcl/scg/papers/2011/mint-unat-ics11.pdf) + [MINT PhD thesis](http://cseweb.ucsd.edu/groups/hpcl/scg/papers/2012/DidemUnat_thesis.pdf) + Nested Data Parallelism, Haskell, and friends + [Nested Data Parallelism on GPU](http://people.cs.uchicago.edu/~jhr/papers/2012/icfp-gpu.pdf) + [Compiling a high-level language for GPUs: (via language support for architectures and compilers)](http://hgpu.org/?p=7809) + [NOVA: A Functional Language for Data Parallelism](https://research.nvidia.com/sites/default/files/publications/nvr-2013-002_0.pdf) + [CuNesl: Compiling Nested Data-Parallel Languages for ... ](http://moss.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/icpp12.pdf) + [A Haskell EDSL for Nested Data-parallel Design-space ... 
](http://www.cse.chalmers.se/edu/course/pfp/exploration-draft-Obsidian.pdf) + [Functional programming for nested data parallelism on GPUs](https://wiki.aalto.fi/download/attachments/70779066/T-106.5840_2012_Halme.pdf?version=1&modificationDate=1357205607000) + [Platform-Specific Optimization and Mapping of Stencil Codes through Refinement](https://graphics.cg.uni-saarland.de/2014/platform-specific-optimization-and-mapping-of-stencil-codes-through-refinement/) + [High-Performance Domain-Specific Languages for GPU Computing](https://anydsl.github.io/images/anydsl.pdf) + [Monoids and their efficiency in practice](http://myhaskelljournal.com/monoids-and-their-efficiency-in-practice/) + CUDA kernels generation using C++ expression templates technique + CU++ -- an interesting approach + [CU++, An Object Oriented Tool for CFD Applications: GTC 2012](http://on-demand.gputechconf.com/gtc/2012/presentations/S0264-CU++-An-Object-Oriented-Framework-for-CFD-CFD-Apps.pdf) + [CU++(ET) / UGC- CUDA With C++ Expression Templates with the Unified GPU-CPU Compiler](http://w3.uwyo.edu/~dchandar/CU++.html) + [A Hybrid Multi-GPU/CPU Computational Framework](http://scientific-sims.com/cfdlab/Dimitri_Mavriplis/HOME/NEW_PAPERS/Chandar.2013-2855.pdf) + VexCL is a C++ vector expression template library for OpenCL/CUDA + [VexCL is a C++ vector expression template library for OpenCL/CUDA](https://github.com/ddemidov/vexcl) + [Generating OpenCL/CUDA source code from C++ expressions in VexCL](https://isocpp.org/blog/2015/01/generating-opencl-cuda-source-code-from-c-expressions-in-vexcl) + AnyDSL - A Framework for Rapid Development of Domain-Specific Libraries; thorin (The Higher-ORder INtermediate representation) / impala (An imperative and functional programming language) + [A Framework for Rapid Development of Domain-Specific Libraries](http://anydsl.github.io/) + [AnyDSL Build Instructions](https://github.com/AnyDSL/anydsl/wiki/Build-Instructions) + [Shallow Embedding of DSLs via Online 
Partial Evaluation.(Best Paper Award)](http://compilers.cs.uni-saarland.de/papers/gpce15.pdf) + [thorin - The Higher-ORder INtermediate representation](https://github.com/AnyDSL/thorin) + [impala - An imperative and functional programming language](https://github.com/AnyDSL/impala) + [A DSL for Stencil Codes](https://github.com/AnyDSL/stincilla) + [AnyDSL ports from http://benchmarksgame.alioth.debian.org](https://github.com/AnyDSL/benchmarks-impala) # parallelforall + [An Efficient Matrix Transpose in CUDA C/C++](http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/) + [BIDMach: Machine Learning at the Limit with GPUs](http://devblogs.nvidia.com/parallelforall/bidmach-machine-learning-limit-gpus/) + [High-Performance Geometric Multi-Grid with GPU Acceleration](https://devblogs.nvidia.com/parallelforall/high-performance-geometric-multi-grid-gpu-acceleration/) + [Inside Pascal: NVIDIA’s Newest Computing Platform](https://devblogs.nvidia.com/parallelforall/inside-pascal/) + [GPU Programming in Functional Languages](http://www.cse.chalmers.se/~joels/writing/GPUFL.pdf) + [HIP : Convert CUDA to Portable C++ Code](https://github.com/GPUOpen-ProfessionalCompute-Tools/HIP) # Pencil computations + [Speeding up stencil computations: building and running YASK on Intel processors](https://habrahabr.ru/company/intel/blog/305128/) + [flexible package manager that supports multiple versions, configurations, platforms, and compilers. 
https://spack.io](https://github.com/LLNL/spack) + [Tutorial: Spack 101](https://spack.readthedocs.io/en/latest/tutorial_sc16.html) + [NASA: High Performance Fast Computing Challenge](https://hn.svelte.technology/item/14265751) + [Why Rust fails hard at scientific computing](https://www.reddit.com/r/rust/comments/76olo3/why_rust_fails_hard_at_scientific_computing/) + [Why Rust fails hard at scientific computing](https://internals.rust-lang.org/t/why-rust-fails-hard-at-scientific-computing/6065) + [technicalities: interactive scientific computing #2 of 2, goldilocks languages](https://graydon2.dreamwidth.org/189377.html) # Nim links + [Laser - Primitives for high performance computing](https://github.com/numforge/laser) + [NimTorch](https://github.com/fragcolor-xyz/nimtorch) + [A matrix library https://unicredit.github.io/neo/](https://github.com/unicredit/neo) + [A fast, ergonomic and portable tensor library with a deep learning focus](https://github.com/mratsim/Arraymancer) + [high performance tensor library in Nim](https://andre-ratsimbazafy.com/high-performance-tensor-library-in-nim/#how-controlling-overhead) + [Arraymancer - A n-dimensional tensor (ndarray) library](https://mratsim.github.io/Arraymancer/) + [A curated list of awesome Nim frameworks, libraries and software](https://github.com/VPashkov/awesome-nim) + [Find the nim package](http://nimism.co/) + [Meta Nim Are we scientists yet?](https://github.com/nim-lang/needed-libraries/issues/77) + [Quantum EXpressions lattice field theory framework](https://github.com/jcosborn/qex) + [QEX: a framework for lattice field theories](https://arxiv.org/abs/1612.02750) + tl;dr ```sh nimble refresh nimble install neo nimble install Arraymancer ``` + [Why is nim and nimble in official repo so outdated?](https://amp.reddit.com/r/archlinux/comments/cdv3xu/why_is_nim_and_nimble_in_official_repo_so_outdated/) + [parallel-computing resources list](https://github.zhrichard.me/topics/parallel-computing) + [Portable Hardware 
Locality (hwloc)](https://www.open-mpi.org/projects/hwloc/) + [Overview of the Efficient Programming Languages (v.3) 2018.4](https://sdevprog.blogspot.com/2018/04/overview-of-efficient-programming.html?m=1) + Intel Level Zero + [oneAPI Level Zero](https://github.com/oneapi-src/level-zero) + [Code Generation for High Performance PDE Solvers on Modern Architectures](https://archiv.ub.uni-heidelberg.de/volltextserver/27360/) + [PhD Thesis Software Stack](https://github.com/dokempf/dkempf-phd-software) + [Loopy: Transformation-Based Generation of High-Performance CPU/GPU Code](https://github.com/inducer/loopy) + [HyperHDG - a C++ based library implementing hybrid discontinuous Galerkin methods on extremely general domains ](https://github.com/HyperHDG/HyperHDG) + GPU roof model + [Elias Konstantinidis publications](http://users.uoa.gr/~ekondis/publications/) + [A quantitative roofline model for GPU kernel performance estimation using micro-benchmarks and hardware metric profiling](https://www.sciencedirect.com/science/article/pii/S0743731517301247) + [mixbench - The purpose of this benchmark tool is to evaluate performance bounds of GPUs on mixed operational intensity kernels](https://github.com/ekondis/mixbench) + [Analysis-Driven Optimization: Preparing for Analysis with NVIDIA Nsight Compute, Part 1](https://developer.nvidia.com/blog/analysis-driven-optimization-preparing-for-analysis-with-nvidia-nsight-compute-part-1/) + [GPU Performance Analysis](https://vimeo.com/454873041) + [Roofline and NVIDIA Ampere GPU Architecture Analysis](https://www.youtube.com/watch?v=VtkxhygfNsY) + [Nsight Compute Feature Spotlight: Roofline Analysis, Asynchronous Copy, Sparse Data Compression](https://www.youtube.com/watch?v=DnwZ6ZTLw50) + [Optimizing CUDA Memory Allocations Using NVIDIA Nsight Systems](https://www.youtube.com/watch?v=kTKk05yzuzo&list=UUBHcMCGaiJhv-ESTcWGJPcw) + [Roofline Hackathon 2020 part 1 and 2](https://www.youtube.com/watch?v=Hy48J0Ivz18) + YouTube videos on GPU 
embedded profiling/optimization + [Presentation: Mali Graphics Debugger (GDC 2014)](https://www.youtube.com/watch?v=yv-V9Bl9pO4) + [GPU Compute Optimisation with Hardware Counters](https://www.youtube.com/watch?v=93cWfkyid7k) + [ARM Mali GPU Architecture Overview](https://www.youtube.com/watch?v=mo5zVbCg12I) + [AMD Radeon and NVIDIA GeForce FP32/FP64 GFLOPS Table](https://www.geeks3d.com/20140305/amd-radeon-and-nvidia-geforce-fp32-fp64-gflops-table-computing/) + [RICOS Co. Ltd. Research Institute for Computational Science Co.Ltd.](https://github.com/ricosjp) + [Load-link/store-conditional](https://en.wikipedia.org/wiki/Load-link/store-conditional)
