### Multicore Synchronization

a pragmatic introduction





# Speaker (@0xF390)

#### **Co-founder of Backtrace**

Building high performance debugging platform for native applications.

http://backtrace.io @0xCD03

#### Founder of Concurrency Kit

A concurrent memory model for C99 and arsenal of tools for high performance synchronization.

http://concurrencykit.org

Previously at AppNexus, Message Systems and GWU HPCL





# Multicore Synchronization

This is a talk on mechanical sympathy of parallel systems on modern multicore systems.

Understanding both your workload and your environment allows for effective optimization.





# Principles of Multicore





Cache coherency guarantees the eventual consistency of shared state.

| Core 0 |                   |   |        |                  |  |                   | Core 1 |   |        |                  |  |
|--------|-------------------|---|--------|------------------|--|-------------------|--------|---|--------|------------------|--|
|        |                   | 7 |        | <b>↓</b>         |  |                   |        |   |        | <b>A</b>         |  |
|        | Cache             |   |        |                  |  | Cache             |        |   |        |                  |  |
|        | -1                | S | А      | Data             |  |                   |        | S | А      | Data             |  |
|        | 0                 | I | 0x0000 | backtrace.io…    |  |                   | 0      | E | 0x1000 | ASDOKADO         |  |
|        | 1                 | I | 0x0040 | 393929191        |  |                   | 1      | Μ | 0x1040 | 213091i491119    |  |
|        | 2                 | S | 0x0080 | asodkadoakdoak\0 |  |                   | 2      | S | 0x0080 | asodkadoakdoak\0 |  |
|        | 3                 | Μ | 0x00c0 | 3.141592         |  |                   | 3      | Μ | 0x10c0 | 9940191          |  |
|        | Memory Controller |   |        |                  |  | Memory Controller |        |   |        |                  |  |
|        | Memory            |   |        |                  |  |                   | Memory |   |        |                  |  |





int x = 443; (&x = 0x20c4)

#### Thread 0 x = x + 10010;

#### Thread 1



Memory

#### printf("%d\n", x);







Load x

Thread 0 x = x + 10010;

| Core 0 |   |        |                       |          |  |  |  |
|--------|---|--------|-----------------------|----------|--|--|--|
|        | 7 |        |                       |          |  |  |  |
|        | S | A      | Cache<br>Data         |          |  |  |  |
| 0      | I | 0x0000 | backtrace.io…         |          |  |  |  |
| 1      | I | 0x0040 | 393929191             |          |  |  |  |
| 2      | S | 0x0080 | asodkadoakdoak\0      |          |  |  |  |
| 3      | E | 0x20c0 | 1921919119 <b>443</b> |          |  |  |  |
|        |   | Memor  | ry Controller         | <b>←</b> |  |  |  |
|        |   |        |                       |          |  |  |  |
| •      |   |        |                       |          |  |  |  |

Memory

#### Thread 1

printf("%d\n", x);







Update the value of x

#### **Thread 0** x = x + 10010;

| Core 0 |       |   |        |                         |  |  |
|--------|-------|---|--------|-------------------------|--|--|
|        |       | 7 |        |                         |  |  |
|        | Cache |   |        |                         |  |  |
|        |       | S | А      | Data                    |  |  |
|        | 0     | I | 0x0000 | backtrace.io…           |  |  |
|        | 1     | I | 0x0040 | 393929191               |  |  |
|        | 2     | S | 0x0080 | asodkadoakdoak\0        |  |  |
|        | 3     | М | 0x20c0 | 1921919119 <b>10453</b> |  |  |
|        |       |   | Memor  | ry Controller           |  |  |

Memory

#### Thread 1









Thread 1 loads x

#### **Thread 0** x = x + 10010;

| Core 0 |       |   |                                       |                         |  |  |  |  |  |
|--------|-------|---|---------------------------------------|-------------------------|--|--|--|--|--|
|        |       | 7 |                                       | 4                       |  |  |  |  |  |
|        | Cache |   |                                       |                         |  |  |  |  |  |
|        | I     | S | А                                     | Data                    |  |  |  |  |  |
|        | 0     | I | 0x0000                                | backtrace.io…           |  |  |  |  |  |
|        | 1     | I | 0x0040                                | 393929191               |  |  |  |  |  |
|        | 2     | S | 0x0080                                | asodkadoakdoak\0        |  |  |  |  |  |
|        | 3     | 0 | 0x20c0                                | 1921919119 <b>10453</b> |  |  |  |  |  |
|        |       |   | Memor                                 | ry Controller           |  |  |  |  |  |
|        |       |   | i i i i i i i i i i i i i i i i i i i | • Controller            |  |  |  |  |  |

Memory

#### Thread 1





Backtrace



MESI, MOESI and MESIF are common cache coherency protocols.

MESI: Modified, Exclusive, Shared, Invalid

MOESI: Modified, Owned, Exclusive, Shared, Invalid

MESIF: Modified, Exclusive, Shared, Forwarding





The **cache line** is the unit of coherency and can become an unnecessary source of contention.



```
Thread 0
for (;;) {
    array[0]++;
}
```

# Thread 1 for (;;) { array[1]++; }





**False sharing** occurs when logically disparate objects share the same cache line and contend on it.

```
struct {
    rwlock_t rwlock;
    int value;
} object;
```

#### Thread 0

```
for (;;) {
    read_lock(&object.rwlock);
    int v = atomic_read(&object.value);
    do_work(v);
    read_unlock(&object.rwlock);
}
```

#### Thread 1

```
for (;;) {
    read_lock(&object.rwlock);
    <short work>
    read_unlock(&object.rwlock);
}
```





**False sharing** occurs when logically disparate objects share the same cache line and contend on it.



Backtrace



Padding can be used to mitigate false sharing.

```
struct {
    rwlock_t rwlock;
    char pad[64 - sizeof(rwlock_t)];
    int value;
} object;
```





#### Padding can be used to mitigate false sharing.



Backtrace



Padding must consider access patterns and overall footprint of application.

Too much padding is bad.





### Simultaneous Multithreading

SMT technology allows for throughput increases by allowing programs to better utilize processor resources.



Figure from "The Architecture of the Nehalem Processor and Nehalem-EP SMP Platforms" (Michael E. Thomadakis)





### Atomic Operations

Atomic operations are typically implemented with the help of the cache coherency mechanism.

lock cmpxchg(target, compare, new):
 register = load\_and\_lock(target);
 if (register == compare)
 store(target, new);
 unlock(target);
 return register;

Cache line locking typically only serializes accesses to the target cache line.





### Atomic Operations

In the old commodity processor days, atomic operations were implemented with a bus lock.

lock cmpxchg(target, compare, new):
 lock(memory\_bus);
 register = load(target);
 if (register == compare)
 store(target, new);
 unlock(memory\_bus);
 return register;

x86 will assert a bus lock if an atomic operations goes across a cache line boundary. Be careful!





### Atomic Operations

Atomic operations are crucial to efficient synchronization primitives.

**COMPARE\_AND\_SWAP(a, b, c):** updates **a** to **c** if **a** is equal to **b**, atomically.

#### **LOAD\_LINKED(a)/STORE\_CONDITIONAL(a, b):** Updates **a** to **b** if **a** was not modified between the loadlinked (LL) and store-conditional (SC).





Most modern multicore systems are NUMA architectures: the throughput and latency of memory accesses varies.







The NUMA factor is a ratio that represents the relative cost of a remote memory access.



Intel Xeon L5640 machine at 2.27 GHz (12x2)





NUMA effects can be pervasive and difficult to mitigate.







Be wary of your operating system's memory placement mechanisms.

First Touch: Allocate page on memory of first processor to touch it.

Interleave: Allocate pages round-robin across nodes.

More sophisticated schemes exist that do hierarchical allocation, page migration, replication and more.





NUMA-oblivious synchronization objects are not only susceptible to performance mismatch but starvation and even livelock under extreme load.





Intel(R) Xeon(R) CPU E7- 4850 @ 2.00GHz @ 10x4





Fair locks guarantee starvation-freedom.















while (ck\_pr\_load\_uint(&ticket->position) != request)
 ck\_pr\_stall();

return;

}







```
CK_CC_INLINE static void
ck_spinlock_ticket_unlock(struct ck_spinlock_ticket *ticket)
{
    unsigned int update;
    update = ck_pr_load_uint(&ticket->position);
    ck_pr_store_uint(&ticket->position, update + 1);
    return;
}
```









Intel(R) Xeon(R) CPU E7- 4850 @ 2.00GHz @ 10x4



Fair locks are not a silver bullet and may negatively impact throughput.

Fairness comes at the cost of increased sensitivity to preemption and other sources of jitter.









Intel(R) Xeon(R) CPU E7- 4850 @ 2.00GHz @ 10x4



Array and queue-based locks provide lock scalability and fairness with distributing spinning and point-topoint wake-up.







The **MCS** lock was a seminal contribution to the area and introduced queue locks to the masses.























































Similar mechanisms exist for read-write locks.







Big reader locks (brlocks) or Read-Mostly Locks (rmlocks) distribute read-side flags so that readers spin only on local memory.







#### Similar mechanisms exist for read-write locks.



Intel Xeon E5-2630L at 2.40 GHz





#### Limitations of Locks

Locks are not composable and are susceptible to priority inversion, livelock, starvation, deadlock and more.

A delicate balance must be found between lock hierarchies, granularity and quality of service.

A significant delay in one thread holding a synchronization object leads to significant delays for all other threads waiting on the same synchronization object.





#### Limitations of Locks



Intel Xeon E5-2630L at 2.40 GHz

QCon

NEW YORK



With lock-based synchronization, it is sufficient to reason in terms of lock dependencies and critical sections.

This model doesn't work with lock-less synchronization where we must guarantee correctness in much more subtle ways.





These days, cache coherency helps implement the consistency model.

The memory model is specified by the runtime environment and defines the correct behavior of shared memory accesses.





int x = 0; int y = 0;

Core 0 Core 1
int r\_0; int r\_1;
x = 1; y = 1;
r\_0 = y; r\_1 = x;





int x = 0; int y = 0;

| Core 0              | Core 1              |
|---------------------|---------------------|
| <pre>int r_0;</pre> | <pre>int r_1;</pre> |
| x = 1;              | y = 1;              |
| r_0 = y;            | r_1 = x;            |
|                     |                     |

(1,1)





int x = 0; int y = 0; Core 0 int r\_0; x = 1; r\_0 = y; y = 1; r\_1 = x;







(1,0)











Modern processors rely on a myriad of techniques to achieve high levels of instruction-level parallelism.







The processor memory model is specified with respect to loads, stores and atomic operations.

|                | TSO             | RMO        |
|----------------|-----------------|------------|
| Load to Load   |                 |            |
| Load to Store  |                 |            |
| Store to Store |                 |            |
| Store to Load  |                 |            |
| Atomics        |                 |            |
| Examples       | x86*, SPARC-TSO | ARM, Power |





Ordering guarantees are provided by serializing instructions such as memory fences.







Serializing instructions are expensive because they disable some processor optimizations.

Atomic instructions are expensive because they either involve serialization (and locking) or are just plain old complex.

|              | Throughput (/ second) |
|--------------|-----------------------|
| lock cmpxchg | 147,304,564           |
| cmpxchg      | 458,940,006           |

Intel Core i7-3615QM at 2.30 GHz





Non-blocking synchronization provides very specific progress guarantees and high levels of resilience at the cost of complexity on the fast path.



Lock-freedom provides system-wide progress guarantees.

Wait-freedom provides peroperation progress guarantees.





```
struct node {
   void *value;
   struct node *next;
};
void
stack push(struct node **top, struct node *entry, void *value)
{
   entry->value = value;
   entry->next = *top;
   *top = entry;
   return;
}
struct node *
stack pop(struct node **top)
{
   struct node *r;
   r = *top;
   *top = r->next;
   return r;
}
```





```
struct node {
    void *value;
    struct node *next;
};
void
stack_push(struct node **top, struct node *entry,
    void *value)
{
    entry->value = value;
    do {
        entry->next = ck_pr_load_ptr(top);
      } while (ck_pr_cas_ptr(top, entry->next,
        entry) == false);
    return;
}
```





```
struct node {
   void *value;
   struct node *next;
void
stack push(struct node **top, struct node *entry,
     void *value)
                                                    struct node *
                                                    stack pop(struct node **top)
   entry->value = value;
                                                        struct node *r, *next;
   do {
       entry->next = ck pr load ptr(top);
   } while (ck pr cas ptr(top, entry->next,
                                                        do 4
                                                            r = ck pr load ptr(top);
         entry) == false);
                                                            if (r == NULL)
                                                               return NULL;
   return;
                                                            next = ck pr load ptr(&r->next);
                                                        } while (ck pr cas ptr(top, r, next) ==
                                                            false);
                                                        return r;
```



# Non-blocking synchronization is not a silver bullet.

| Operation     | Intel Core i7-3615QM | IBM Power 730 Express |
|---------------|----------------------|-----------------------|
| spinlock_push | 17                   | 29                    |
| lockfree_push | 25                   | 12                    |
| spinlock_pop  | 18                   | 29                    |
| lockfree_pop  | 27                   | 12                    |

The cost of complexity on the fast path will outweigh the benefits until sufficient levels of contention are reached.



Source: Nonblocking Algorithms and Scalable Multicore Programming, Samy Bahra





Relaxing correctness constraints and constraining runtime requirements allows for many of the benefits without as much additional complexity and impact on the fast path.





```
#define EMPLOYEE_MAX 8
```

```
struct employee {
    const char *name;
    unsigned long long number;
```

```
};
```

```
struct directory {
    struct employee *employee[EMPLOYEE_MAX];
    rwlock_t rwlock;
```

};







```
unsigned long long
employee_number_get(struct directory *d, const char *n)
    struct employee *em;
    unsigned long number;
    size_t i;
    rwlock_read_lock(&d->rwlock);
    for (i = 0; i < EMPLOYEE_MAX; i++) {
        em = d->employee[i];
        if (em == NULL)
            continue;
        if (strcmp(em->name, n) != 0)
            continue;
       number = em->number;
       rwlock_read_unlock(&d->rwlock);
       return number;
   rwlock_read_unlock(&d->rwlock);
   return 0;
```



# The rwlock\_t object provides correctness at cost of forward progress.





```
bool
employee_add(struct directory *d, const char *n,
    unsigned long long number)
    struct employee *em;
    size_t i;
    rwlock_write_lock(&d->rwlock);
    for (i = 0; i < EMPLOYEE_MAX; i++) {
        if (d->employee[i] != NULL)
            continue;
        em = xmalloc(sizeof *em);
        em - > name = n;
        em->number = number;
        d->employee[i] = em;
        rwlock_write_unlock(&d->rwlock);
        return true;
   rwlock_write_unlock(&d->rwlock);
   return false;
```



# The rwlock\_t object provides correctness at cost of forward progress.





```
void
employee_delete(struct directory *d, const char *n)
    struct employee *em;
    size_t i;
    rwlock_write_lock(&d->rwlock);
    for (i = 0; i < EMPLOYEE_MAX; i++) {
        if (d->employee[i] == NULL)
            continue;
        if (strcmp(d->employee[i]->name, n) != 0)
            continue;
       em = d->employee[i];
       d->employee[i] = NULL;
       rwlock_write_unlock(&d->rwlock);
       free(em);
       return;
    rwlock_write_unlock(&d->rwlock);
```

return;



# The rwlock\_t object provides correctness at cost of forward progress.





```
void
employee_delete(struct directory *d, const char *n)
    struct employee *em;
    size_t i;
    rwlock_write_lock(&d->rwlock);
    for (i = 0; i < EMPLOYEE_MAX; i++) {
        if (d->employee[i] == NULL)
            continue;
        if (strcmp(d->employee[i]->name, n) != 0)
            continue;
       em = d->employee[i];
       d->employee[i] = NULL;
       rwlock_write_unlock(&d->rwlock);
       free(em);
       return;
    rwlock_write_unlock(&d->rwlock);
```

return;



If reachability and liveness are coupled, you also protect against a **read-reclaim** race.

OCON NEW YORK



If reachability and liveness are coupled, you also protect against a **read-reclaim** race.

#### Time







Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.





Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.

```
static void
employee_delref(struct employee *em)
{
    bool z;
    ck_pr_dec_uint_zero(&em->ref, &z);
    if (z == true)
        free(em);
    return;
}
```





# Lock-less Synchronization

Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.





# Lock-less Synchronization

Decoupling is sometimes necessary, but requires a safe memory reclamation scheme to guarantee that an object cannot be physically destroyed if there are active references to it.







# Concurrent Data Structures

```
static bool
```

```
struct employee *em;
size_t i;
```

```
ck_rwlock_write_lock(&d->rwlock);
for (i = 0; i < EMPLOYEE_MAX; i++) {
    if (d->employee[i] != NULL)
        continue;
```

```
em = malloc(sizeof *em);
em->name = n;
em->number = number;
ck_pr_fence_store();
ck_pr_store_ptr(&d->employee[i], em);
ck_rwlock_write_unlock(&d->rwlock);
return true;
```

```
ck_rwlock_write_unlock(&d->rwlock);
```

```
return false;
```

```
return 0;
```

static unsigned long long





# Concurrent Data Structures

```
static void
employee_delete(struct directory *d, const char *n)
        struct employee *em;
        size_t i;
        ck_rwlock_write_lock(&d->rwlock);
        for (i = 0; i < EMPLOYEE_MAX; i++) {</pre>
                if (d->employee[i] == NULL)
                         continue;
                if (strcmp(d->employee[i]->name, n) != 0)
                         continue;
                em = d->employee[i];
                ck_pr_store_ptr(&d->employee[i], NULL);
                ck_rwlock_write_unlock(&d->rwlock);
                /* XXX: When is it safe to free em? */
                return;
        ck_rwlock_write_unlock(&d->rwlock);
        return;
```

```
employee_number_get(struct directory *d, const char *n)
        struct employee *em;
        unsigned long number;
        size t i;
        for (i = 0; i < EMPLOYEE_MAX; i++) {</pre>
                em = ck_pr_load_ptr(&d->employee[i]);
                if (em == NULL)
                         continue;
                ck_pr_fence_load_depends();
                if (strcmp(em->name, n) != 0)
                         continue;
                number = em->number;
                return number;
        return 0;
```





# EXPERIMENT

### Workload

- Uniform read-mostly workload
- Single writer attempts pessimistic add operation at fixed frequency
- Readers attempt to get the number of the first employee

### Environment

- 12 cores across 2 sockets
- Intel Xeon E5-2630L at 2.40 GHz
- Linux 2.6.32

```
Machine (64GB)

NUMANode L#0 (P#0 32GB)

Socket L#0 + L3 L#0 (15MB)

L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
```





### Read Latency No updates







# Read Latency No updates







### Read Latency Single writer







### Write Latency Single writer







# Safe Memory Reclamation

A **read-reclaim** race occurs if an object is destroyed while there are references or accesses to it.







# Safe Memory Reclamation

Techniques such as **hazard pointers**, **quiescent-state-based reclamation** and **epoch-based reclamation** protect against readreclaim races.

Schemes such as QSBR and EBR do so without affecting reader progress but without guaranteeing writer progress.

Schemes provide strong guarantees on forward progress but require heavy-weight instructions and retry logic for readers.



### BLOCKING SMR SCHEMES

Read-side critical sections

smr\_read\_lock();

<protected section>

smr\_read\_unlock();

smr\_read\_begin();

<protected section>

smr\_read\_end();

• Explicit Reclamation

smr\_synchronize();





### **QUIESCENT-STATE-BASED RECLAMATION**



Time





### QUIESCENT-STATE-BASED RECLAMATION

Readers

#### Writer



Backtrace



### **QUIESCENT-STATE-BASED RECLAMATION**

#### Writers

```
static void
qsbr_synchronize(void)
        int i;
        uint64 t goal;
        ck_pr_fence_memory();
        goal = ck_pr_faa_64(&global.value, 1) + 1;
        for (i = 0; i < n_reader; i++) {</pre>
                 uint64 t *c =
                   &threads.readers[i].counter.value;
                 while (ck_pr_load_64(c) < goal)</pre>
                          ck_pr_stall();
        }
        return;
```

#### Readers

```
static void
qsbr_quiesce(struct thread *th)
       uint64_t v;
       ck_pr_fence_memory();
       v = ck_pr_load_64(&global.value);
       ck_pr_store_64(&th->counter.value, v);
       ck_pr_fence_memory();
       return;
static void
qsbr_read_lock(struct thread *th)
       ck_pr_barrier(); /* Compiler barrier. */
       return;
static void
qsbr_read_unlock(struct thread *th)
       ck_pr_barrier(); /* Compiler barrier. */
       return;
```





## Conclusion

There are no silver bullets in multicore synchronization, but a deep understanding of both your workload and your underlying environment may allow you to extract phenomenal performance and reliability increases.





## The End

### @0xF390

http://concurrencykit.org

http://backtrace.io/

A lot of the content can be found on <u>https://</u> <u>queue.acm.org/detail.cfm?id=2492433</u> - along with references.



