Add atomic load/store, fetch_add, fence, and is-lock-free lowering.

Atomic loads/stores of type i8, i16, and i32 are converted to
plain load/store instructions and lowered with the plain
lowerLoad/lowerStore. Atomic stores are followed by an mfence
for sequential consistency.
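
A minimal sketch (not the Subzero lowering code itself) of the
source-level operation and the x86-32 sequence this is expected
to produce, namely a plain store followed by mfence:

  #include <atomic>
  #include <cstdint>

  void atomic_store_u32(std::atomic<uint32_t> &A, uint32_t V) {
    // Expected lowering, roughly:
    //   mov dword ptr [A], V
    //   mfence
    A.store(V, std::memory_order_seq_cst);
  }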

For 64-bit types, use movq to do the 64-bit memory load/store in
a single instruction (vs the usual lowering, which breaks it into
two separate 32-bit loads/stores and would not be atomic). For a
store, this means first bitcasting the i64 to f64 (which splits
the load of the value to be stored into two 32-bit ops), then
storing in a single op. For a load, load into an f64 and then
bitcast back to i64 (so the split happens after the atomic load).
This follows what GCC does for the C++11 std::atomic<uint64_t>
load/store methods (it uses movq when -mfpmath=sse). This
introduces some redundancy between movq and movsd, but the
convention seems to be to use movq when working with integer
quantities; otherwise movsd would also work. The difference seems
to be whether or not the XMM register's upper 64 bits are filled
with zero. Zero-extending could help avoid partial register
stalls.
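
A minimal sketch of the 64-bit case, mirroring what GCC emits for
std::atomic<uint64_t> with SSE math; the single movq keeps the
access atomic on x86-32, where a plain i64 load/store would be
split into two 32-bit accesses:

  #include <atomic>
  #include <cstdint>

  uint64_t atomic_load_u64(const std::atomic<uint64_t> &A) {
    // Expected lowering, roughly:
    //   movq xmm0, qword ptr [A]  ; single atomic 64-bit load
    //   ... split xmm0 into edx:eax (the f64 -> i64 bitcast)
    return A.load(std::memory_order_seq_cst);
  }

  void atomic_store_u64(std::atomic<uint64_t> &A, uint64_t V) {
    // Expected lowering, roughly:
    //   ... assemble V's 32-bit halves into xmm0 (the i64 -> f64 bitcast)
    //   movq qword ptr [A], xmm0  ; single atomic 64-bit store
    //   mfence                    ; for sequential consistency
    A.store(V, std::memory_order_seq_cst);
  }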

Handle fetch_add for types up to i32. TODO: add i64 support via a
cmpxchg loop.
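
A minimal sketch of an i32 fetch_add and the usual x86 lowering
(a single locked xadd); the i64 variant would need the cmpxchg
loop mentioned in the TODO:

  #include <atomic>
  #include <cstdint>

  uint32_t fetch_add_u32(std::atomic<uint32_t> &A, uint32_t V) {
    // Expected lowering, roughly:
    //   mov eax, V
    //   lock xadd dword ptr [A], eax
    return A.fetch_add(V, std::memory_order_seq_cst);
  }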

TODO: add some runnable crosstests to make sure that this doesn't
do funny things to integer bit patterns that happen to look like
signaling NaNs or quiet NaNs. However, the system clang would not
know how to handle the "llvm.nacl.*" intrinsics if we choose to
target that level directly via .ll files. The alternatives are
(a) using the old-school __sync builtins (__sync_fetch_and_add
with 0 as the load), or (b) requiring the buildbot's clang/gcc to
support C++11. A sketch of option (a) follows.
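
A minimal sketch of the kind of crosstest described above, using
option (a) so the system compiler does not need C++11 atomics;
0x7FF0000000000001 reinterpreted as a double is a signaling NaN,
exactly the kind of bit pattern the movq/bitcast path must
preserve (names here are illustrative, not the actual crosstest):

  #include <cassert>
  #include <cstdint>

  int main() {
    static uint64_t Val;
    const uint64_t SNaNBits = 0x7FF0000000000001ull;
    // Atomic exchange used as a store, then an add of 0 used as a load.
    __sync_lock_test_and_set(&Val, SNaNBits);
    uint64_t Loaded = __sync_fetch_and_add(&Val, (uint64_t)0);
    assert(Loaded == SNaNBits);
    return 0;
  }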

BUG= https://code.google.com/p/nativeclient/issues/detail?id=3882
R=stichnot@chromium.org

Review URL: https://codereview.chromium.org/342763004
10 files changed