Unrolled and/or SSE tuned versions of TensorMult_*
Reported by Jed Brown | November 28th, 2008 @ 01:40 AM
Without unrolling over the last dimension, GCC seems limited to
about 1150 MFLOPS, but this operation should be capable of much
better. The first step is naive unrolling over the last dimension.
That has a chance of putting us above 2 GFLOPS by skipping lots of
conditionals and changing lots of mulsd to mulpd and such. If it
doesn't, then some SSE intrinsics should easily do the trick, but
that may call for some custom tuning for each number of dofs and
last dimension. Not a priority.
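
A minimal sketch of what "naive unrolling over the last dimension" could look like, assuming the last dimension is the number of dofs and is stored contiguously. The names tensor_mult_generic and tensor_mult_dofs2, the argument order, and the row-major layout are hypothetical and are not taken from the actual TensorMult_* code. The point is only that fixing the trip count at compile time removes the inner loop test and gives the compiler two adjacent multiplies it may fuse into one packed instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference kernel: y[i][d] += sum_j A[i][j] * x[j][d].
 * The last dimension (dofs) is contiguous and only known at run time,
 * so the innermost loop carries a branch on every iteration. */
static void tensor_mult_generic(size_t m, size_t n, size_t dofs,
                                const double *A, const double *x, double *y)
{
  for (size_t i = 0; i < m; i++)
    for (size_t j = 0; j < n; j++)
      for (size_t d = 0; d < dofs; d++)
        y[i * dofs + d] += A[i * n + j] * x[j * dofs + d];
}

/* Same operation with the last dimension unrolled by hand for dofs == 2.
 * The two accumulators stay in registers and the two multiplies act on
 * adjacent doubles, which is the pattern a compiler can turn into a
 * single packed multiply. restrict (C99) promises the arrays do not alias. */
static void tensor_mult_dofs2(size_t m, size_t n,
                              const double *restrict A,
                              const double *restrict x,
                              double *restrict y)
{
  for (size_t i = 0; i < m; i++) {
    double y0 = y[2 * i];
    double y1 = y[2 * i + 1];
    for (size_t j = 0; j < n; j++) {
      const double a = A[i * n + j];
      y0 += a * x[2 * j];
      y1 += a * x[2 * j + 1];
    }
    y[2 * i]     = y0;
    y[2 * i + 1] = y1;
  }
}
```

Whether GCC actually emits packed instructions for the unrolled version depends on the flags and the target, so the generated assembly would need to be checked before concluding anything about throughput. One such specialization per number of dofs is the "custom tuning" cost mentioned above.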
An implementation of the "dual order hp" version of the finite element method. This project targets parallel domain-decomposition methods for strongly coupled nonlinear problems with PDE constraints.