#9 new
Jed Brown

Unrolled and/or SSE tuned versions of TensorMult_*

Reported by Jed Brown | November 28th, 2008 @ 01:40 AM

Without unrolling over the last dimension, GCC seems limited to about 1150 MFLOPS, but this operation should be capable of much better. The first step is naive unrolling over the last dimension. That has a chance of putting us above 2 GFLOPS by skipping many conditionals and turning many scalar instructions (mulsd) into their packed counterparts (mulpd and the like). If it doesn't, some SSE intrinsics should easily do the trick, but that may call for custom tuning for each combination of number of dofs and size of the last dimension. Not a priority.
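To make the idea concrete, here is a minimal sketch of what "naive unrolling over the last dimension" could look like. It is not the actual TensorMult_* code: the function names, the row-major layout, and the fixed last dimension of 4 are all assumptions chosen for illustration. Both routines compute y[i][k] = sum_j A[i][j] * x[j][k]; the second hard-codes the last dimension so the inner loop and its branches disappear.

```c
#include <stddef.h>

/* Hypothetical reference kernel (not the real TensorMult_* signature).
 * Assumed row-major layout: A is Q x P, x is P x K, y is Q x K.
 * The last dimension K is only known at run time, so the innermost
 * loop keeps its trip-count test and branch. */
static void tensor_mult_generic(size_t Q, size_t P, size_t K,
                                const double *A, const double *x, double *y)
{
  for (size_t i = 0; i < Q; i++) {
    for (size_t k = 0; k < K; k++) y[i*K + k] = 0.0;
    for (size_t j = 0; j < P; j++) {
      const double a = A[i*P + j];
      for (size_t k = 0; k < K; k++) y[i*K + k] += a * x[j*K + k];
    }
  }
}

/* Same contraction, hand-unrolled for an assumed last dimension of 4.
 * The four accumulators are independent, which leaves the compiler
 * free to pair them into packed multiplies and adds. */
static void tensor_mult_unrolled4(size_t Q, size_t P,
                                  const double *restrict A,
                                  const double *restrict x,
                                  double *restrict y)
{
  for (size_t i = 0; i < Q; i++) {
    double y0 = 0.0, y1 = 0.0, y2 = 0.0, y3 = 0.0;
    for (size_t j = 0; j < P; j++) {
      const double a = A[i*P + j];
      const double *xj = &x[j*4];
      y0 += a * xj[0];
      y1 += a * xj[1];
      y2 += a * xj[2];
      y3 += a * xj[3];
    }
    y[i*4 + 0] = y0;
    y[i*4 + 1] = y1;
    y[i*4 + 2] = y2;
    y[i*4 + 3] = y3;
  }
}
```

Whether the unrolled variant actually produces packed instructions depends on the compiler version and flags, so it is worth checking the generated assembly (for example with gcc -O3 -S) before reaching for intrinsics. A real implementation would need one such variant per supported last-dimension size.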



An implementation of the "dual order hp" version of the finite element method. This project targets parallel domain-decomposition methods for strongly coupled nonlinear problems with PDE constraints.
