Having a distributed-memory system simulate shared memory, i.e., cache-coherent non-uniform memory access (ccNUMA), is a very convenient abstraction for programmers. However, what other features could be provided so that a dedicated programmer can break the abstraction and optimize things?
The OS moves blocks of memory closer to the processor that is using them. A program could declare in advance that it plans to frequently read (or write) a block of memory, or it could explicitly request that a block be moved closer. Conversely, the program could prevent the OS from automatically moving a block if it knows the move would be a bad idea.
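To make the three controls concrete (advance declaration, explicit move, and disabling automatic movement), here is a toy model in Python. It is only a sketch: the `Placement` class, its method names, and the "migrate after `threshold` remote accesses" rule are all invented for illustration and do not correspond to any real OS interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Block:
    node: int                          # node that currently holds the block
    pinned: bool = False               # True: automatic migration is disabled
    hinted_node: Optional[int] = None  # node that announced heavy use, if any
    hits: Dict[int, int] = field(default_factory=dict)  # accesses per node


class Placement:
    """Hypothetical OS page-placement policy with program-supplied hints."""

    def __init__(self, threshold: int = 3):
        self.blocks: Dict[str, Block] = {}
        self.threshold = threshold  # assumed policy: remote accesses before a move

    def allocate(self, name: str, node: int) -> None:
        self.blocks[name] = Block(node=node)

    def will_use(self, name: str, node: int) -> None:
        # Advance declaration: `node` expects to access this block often.
        self.blocks[name].hinted_node = node

    def move(self, name: str, node: int) -> None:
        # Explicit request: move now. In this sketch it also overrides a pin.
        self.blocks[name].node = node

    def pin(self, name: str, pinned: bool = True) -> None:
        # Opt out of (or back into) automatic migration.
        self.blocks[name].pinned = pinned

    def access(self, name: str, node: int) -> int:
        """Record an access from `node` and return where the block now lives."""
        b = self.blocks[name]
        b.hits[node] = b.hits.get(node, 0) + 1
        if b.pinned or b.node == node:
            return b.node
        # Migrate at once if the node was hinted, else after enough accesses.
        if b.hinted_node == node or b.hits[node] >= self.threshold:
            b.node = node
            b.hits.clear()
        return b.node
```

With the default threshold, an unhinted block follows a remote node only on its third access, a hinted block follows on the first, and a pinned block stays put until `move` is called. On Linux, the nearest real counterparts are probably system calls such as `madvise`, `mbind`, and `move_pages`, which this sketch does not use.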
Alternatively, there could be some way of migrating a process to the processor closest to the memory that the process will access. The migration could be automatic, explicitly requested, or explicitly disabled.
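The explicit form of process migration can be approximated today with CPU affinity. The sketch below uses Python's `os.sched_setaffinity`, which is available on Linux and some other Unix platforms. The helper name `run_near` is made up, and the sketch assumes the caller already knows which CPUs sit next to the memory in question; it does not look that up.

```python
import os


def run_near(cpus):
    """Restrict the calling process to `cpus` and return the resulting CPU set.

    Returns None if the platform lacks the call, none of the requested
    CPUs are currently allowed, or the OS refuses the request.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(0)  # pid 0 means "the calling process"
    wanted = set(cpus) & allowed
    if not wanted:
        return None
    try:
        os.sched_setaffinity(0, wanted)
    except OSError:
        return None
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Example: confine this process to CPU 0, assuming CPU 0 is allowed.
    print(run_near({0}))
```

Restricting the CPU set also covers the "explicitly disable" case in a crude way, since the scheduler can no longer move the process outside that set.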
A program could issue a batch of memory requests, wait for the first few to respond, then cancel the remaining requests. If responses to the cancelled requests still arrive, they should not displace entries in the cache.
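The issue-many, keep-the-first-few, cancel-the-rest pattern can be sketched in software with Python's `asyncio`. This is an analogy rather than a hardware design: `fetch` is a made-up stand-in for a remote memory read, its `latency` argument models distance, and the `cache` dictionary stands in for the processor cache.

```python
import asyncio


async def fetch(addr, latency):
    # Stand-in for a remote memory read; `latency` models the distance.
    await asyncio.sleep(latency)
    return addr, f"data@{addr:#x}"


async def first_k(requests, k):
    """Issue every request, keep the first k responses, cancel the rest."""
    tasks = [asyncio.ensure_future(fetch(addr, lat)) for addr, lat in requests]
    cache = {}
    try:
        for next_done in asyncio.as_completed(tasks):
            addr, value = await next_done
            cache[addr] = value       # only accepted responses are cached
            if len(cache) >= k:
                break
    finally:
        for t in tasks:
            t.cancel()                # has no effect on tasks already finished
        # Wait for the cancellations to settle and discard their outcomes.
        await asyncio.gather(*tasks, return_exceptions=True)
    return cache


if __name__ == "__main__":
    reqs = [(0x10, 0.2), (0x20, 0.01), (0x30, 0.3), (0x40, 0.02)]
    print(asyncio.run(first_k(reqs, 2)))
```

Because only the consumer loop writes to `cache`, a cancelled request can never displace an entry. Hardware would need an equivalent rule, for example dropping any response whose request has been marked cancelled.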