According to Intel's documentation, “Loads May Be Reordered with Earlier Stores to Different Locations”.
Below is pseudocode.
Initial values:
std::atomic<int> x = 0;
std::atomic<int> y = 0;
So, when using the relaxed memory model:
// Thread running on Processor 0
x.store(1, std::memory_order_relaxed);
register int ry = y.load(std::memory_order_relaxed);
// Thread running on Processor 1
y.store(1, std::memory_order_relaxed);
register int rx = x.load(std::memory_order_relaxed);
// rx == 0 and ry == 0 IS ALLOWED
(Copied from Intel's documentation:) At each processor, the load and the store are to different locations and hence may be reordered. Any interleaving of the operations is thus allowed. One such interleaving has the two loads occurring before the two stores. This would result in each load returning value 0.
Now, when using the sequentially consistent memory order (memory_order_seq_cst) for the loads:
// Thread running on Processor 0
x.store(1, std::memory_order_relaxed);
register int ry = y.load(std::memory_order_seq_cst);
// Thread running on Processor 1
y.store(1, std::memory_order_relaxed);
register int rx = x.load(std::memory_order_seq_cst);
// rx == 0 and ry == 0 SHOULD NOT BE ALLOWED, because the loads use memory_order_seq_cst
If my understanding is correct (please correct me if I am wrong), then the just::thread load implementation should not use a simple mov, but should include an appropriate memory barrier so that earlier stores cannot sink below later loads.
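
One way to check this empirically is a small test harness. The following is a minimal sketch, not just::thread's implementation: count_both_zero and its parameters are hypothetical names introduced here for illustration. It runs the two-thread test repeatedly and counts how often both loads return 0. Only two configurations are ones where the C++ memory model itself is known to rule out that outcome: all four operations using memory_order_seq_cst, or a std::atomic_thread_fence(std::memory_order_seq_cst) placed between the store and the load in each thread. For any other combination the count is only an observation on one particular machine, and a count of 0 proves nothing.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper (not part of just::thread or the standard library).
// Runs the store/load test `iterations` times and returns how many runs
// ended with rx == 0 and ry == 0.
// store_order must be valid for store() (relaxed, release, or seq_cst);
// load_order must be valid for load() (relaxed, consume, acquire, or seq_cst).
int count_both_zero(std::memory_order store_order,
                    std::memory_order load_order,
                    bool fence_between,
                    int iterations) {
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int rx = -1;
        int ry = -1;

        // Thread running on Processor 0
        std::thread t0([&] {
            x.store(1, store_order);
            if (fence_between) {
                std::atomic_thread_fence(std::memory_order_seq_cst);
            }
            ry = y.load(load_order);
        });

        // Thread running on Processor 1
        std::thread t1([&] {
            y.store(1, store_order);
            if (fence_between) {
                std::atomic_thread_fence(std::memory_order_seq_cst);
            }
            rx = x.load(load_order);
        });

        // join() synchronizes with thread completion, so reading the
        // plain ints rx and ry afterwards is race-free.
        t0.join();
        t1.join();

        if (rx == 0 && ry == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

Calling count_both_zero(std::memory_order_relaxed, std::memory_order_seq_cst, false, n) exercises the second listing above. Because each iteration creates fresh threads, the window for reordering is narrow, so the harness is better at confirming that a forbidden outcome never appears than at provoking an allowed one.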