Document Number:	N4352
Date:	2015-01-08
Revises:	N4310
Editor:	Jared Hoberock NVIDIA Corporation

General

1.1

Scope

[parallel.general.scope]

This Technical Specification describes requirements for implementations of an interface that computer programs written in the C++ programming language may use to invoke algorithms with parallel execution. The algorithms described by this Technical Specification are realizable across a broad class of computer architectures.

This Technical Specification is non-normative. Some of the functionality described by this Technical Specification may be considered for standardization in a future version of C++, but it is not currently part of any C++ standard. Some of the functionality in this Technical Specification may never be standardized, and other functionality may be standardized in a substantially changed form.

The goal of this Technical Specification is to build widespread existing practice for parallelism in the C++ standard algorithms library. It gives advice on extensions to those vendors who wish to provide them.

1.2

Normative references

[parallel.general.references]

The following referenced document is indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO/IEC 14882:—¹, Programming Languages — C++

1) To be published. Section references are relative to N3937.

ISO/IEC 14882:— is herein called the C++ Standard. The library described in ISO/IEC 14882:— clauses 17-30 is herein called the C++ Standard Library. The C++ Standard Library components described in ISO/IEC 14882:— clauses 25, 26.7 and 20.7.2 are herein called the C++ Standard Algorithms Library.

Unless otherwise specified, the whole of the C++ Standard's Library introduction (C++14 §17) is included into this Technical Specification by reference.

1.3

Namespaces and headers

[parallel.general.namespaces]

Since the extensions described in this Technical Specification are experimental and not part of the C++ Standard Library, they should not be declared directly within namespace std. Unless otherwise specified, all components described in this Technical Specification are declared in namespace std::experimental::parallel::v1.

[ Note: Once standardized, the components described by this Technical Specification are expected to be promoted to namespace std. — end note ]

Unless otherwise specified, references to such entities described in this Technical Specification are assumed to be qualified with std::experimental::parallel::v1, and references to entities described in the C++ Standard Library are assumed to be qualified with std::.
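The effect of the inline namespace v1 on name lookup can be sketched as follows. This is a minimal illustration only: the outer namespace sketch and the alias par_ns are hypothetical stand-ins, since user code cannot declare names in namespace std.

```cpp
#include <type_traits>

// Hypothetical stand-in for namespace std::experimental::parallel; the name
// "sketch" is illustrative only and not part of this Technical Specification.
namespace sketch {
namespace experimental {
namespace parallel {
inline namespace v1 {
  class sequential_execution_policy {};
  constexpr sequential_execution_policy seq{};
}
}
}
}

namespace par_ns = sketch::experimental::parallel;

// Because v1 is an inline namespace, its members are also members of the
// enclosing namespace, so both spellings name the same type.
static_assert(std::is_same<par_ns::sequential_execution_policy,
                           par_ns::v1::sequential_execution_policy>::value,
              "v1 is transparent to users");
```

Under this arrangement, user code can omit v1 from qualified names, which is presumably why the versioned namespace is declared inline.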

Extensions that are expected to eventually be added to an existing header <meow> are provided inside the <experimental/meow> header, which shall include the standard contents of <meow> as if by

    #include <meow>

1.3.1

Terms and definitions

[parallel.general.defns]

For the purposes of this document, the terms and definitions given in the C++ Standard and the following apply.

A parallel algorithm is a function template described by this Technical Specification declared in namespace std::experimental::parallel::v1 with a formal template parameter named ExecutionPolicy.

Parallel algorithms access objects indirectly accessible via their arguments by invoking the following functions:

All operations of the categories of the iterators that the algorithm is instantiated with.
Functions on those sequence elements that are required by its specification.
User-provided function objects to be applied during the execution of the algorithm, if required by the specification.

These functions are herein called element access functions.

[ Example: The sort function may invoke the following element access functions:

Methods of the random-access iterator of the actual template argument, as per 24.2.7, as implied by the name of the template parameter RandomAccessIterator.
The swap function on the elements of the sequence (as per 25.4.1.1 [sort]/2).
The user-provided Compare function object.

— end example ]

Execution policies

[parallel.execpol]

2.1

In general

[parallel.execpol.general]

This clause describes classes that are execution policy types. An object of an execution policy type indicates the kinds of parallelism allowed in the execution of an algorithm and expresses the consequent requirements on the element access functions.

[ Example:

std::vector<int> v = ...

// standard sequential sort
std::sort(v.begin(), v.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, v.begin(), v.end());

// permitting parallel execution
sort(par, v.begin(), v.end());

// permitting vectorization as well
sort(par_vec, v.begin(), v.end());

// sort with dynamically-selected execution
size_t threshold = ...
execution_policy exec = seq;
if (v.size() > threshold)
{
  exec = par;
}

sort(exec, v.begin(), v.end());

— end example ]

[ Note: Because different parallel architectures may require idiosyncratic parameters for efficient execution, implementations of the Standard Library may provide additional execution policies to those described in this Technical Specification as extensions. — end note ]

2.2

Header `<experimental/execution_policy>` synopsis

[parallel.execpol.synopsis]

namespace std {
namespace experimental {
namespace parallel {
inline namespace v1 {
  // 2.3, Execution policy type trait
  template<class T> struct is_execution_policy;
  template<class T> constexpr bool is_execution_policy_v = is_execution_policy<T>::value;

  // 2.4, Sequential execution policy
  class sequential_execution_policy;

  // 2.5, Parallel execution policy
  class parallel_execution_policy;

  // 2.6, Parallel+Vector execution policy
  class parallel_vector_execution_policy;

  // 2.7, Dynamic execution policy
  class execution_policy;
}
}
}
}

2.3

Execution policy type trait

[parallel.execpol.type]

template<class T> struct is_execution_policy { see below };

is_execution_policy can be used to detect parallel execution policies for the purpose of excluding function signatures from otherwise ambiguous overload resolution participation.

is_execution_policy<T> shall be a UnaryTypeTrait with a BaseCharacteristic of true_type if T is the type of a standard or implementation-defined execution policy, otherwise false_type.

[ Note: This provision reserves the privilege of creating non-standard execution policies to the library implementation. — end note ]

The behavior of a program that adds specializations for is_execution_policy is undefined.
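One way an implementation might realize this trait is sketched below. The namespace sketch is hypothetical, and the specialization-per-policy approach is an assumption; this Technical Specification only fixes the observable BaseCharacteristic.

```cpp
#include <type_traits>

namespace sketch {
  // Hypothetical policy tag types mirroring the names in the synopsis.
  class sequential_execution_policy {};
  class parallel_execution_policy {};
  class parallel_vector_execution_policy {};

  // Primary template: an arbitrary type is not an execution policy.
  template<class T> struct is_execution_policy : std::false_type {};

  // The library, not the user, specializes the trait for its own policies.
  template<> struct is_execution_policy<sequential_execution_policy> : std::true_type {};
  template<> struct is_execution_policy<parallel_execution_policy> : std::true_type {};
  template<> struct is_execution_policy<parallel_vector_execution_policy> : std::true_type {};

  template<class T>
  constexpr bool is_execution_policy_v = is_execution_policy<T>::value;
}

static_assert(sketch::is_execution_policy_v<sketch::parallel_execution_policy>,
              "a policy type yields true_type");
static_assert(!sketch::is_execution_policy_v<int>,
              "a non-policy type yields false_type");
```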

2.4

Sequential execution policy

[parallel.execpol.seq]

class sequential_execution_policy{ unspecified };

The class sequential_execution_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and require that a parallel algorithm's execution may not be parallelized.

2.5

Parallel execution policy

[parallel.execpol.par]

class parallel_execution_policy{ unspecified };

The class parallel_execution_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution may be parallelized.

2.6

Parallel+Vector execution policy

[parallel.execpol.vec]

class parallel_vector_execution_policy{ unspecified };

The class parallel_vector_execution_policy is an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm's execution may be vectorized and parallelized.
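The phrase "unique type to disambiguate parallel algorithm overloading" describes ordinary tag dispatch. The following sketch illustrates the idea under stated assumptions: the namespace sketch and the function strategy are hypothetical and are not declared by this Technical Specification.

```cpp
#include <cassert>
#include <string>

namespace sketch {
  // Hypothetical empty tag types, one per execution policy.
  class sequential_execution_policy {};
  class parallel_execution_policy {};
  class parallel_vector_execution_policy {};

  // One overload per tag type; the static type of the argument selects the
  // overload at compile time, with no run-time branching.
  inline std::string strategy(const sequential_execution_policy&)      { return "sequential"; }
  inline std::string strategy(const parallel_execution_policy&)        { return "parallel"; }
  inline std::string strategy(const parallel_vector_execution_policy&) { return "parallel+vector"; }

  // Small self-check showing that each tag reaches its own overload.
  inline void self_check()
  {
    assert(strategy(sequential_execution_policy{}) == "sequential");
    assert(strategy(parallel_vector_execution_policy{}) == "parallel+vector");
  }
}
```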

2.7

Dynamic execution policy

[parallel.execpol.dynamic]

class execution_policy
{
  public:
    // 2.7.1, execution_policy construct/assign
    template<class T> execution_policy(const T& exec);
    template<class T> execution_policy& operator=(const T& exec);

    // 2.7.2, execution_policy object access
    const type_info& type() const noexcept;
    template<class T> T* get() noexcept;
    template<class T> const T* get() const noexcept;
};

The class execution_policy is a container for execution policy objects. execution_policy allows dynamic control over standard algorithm execution.

[ Example:

std::vector<float> sort_me = ...
        
using namespace std::experimental::parallel;
execution_policy exec = seq;

if(sort_me.size() > threshold)
{
  exec = par;
}

sort(exec, std::begin(sort_me), std::end(sort_me));

— end example ]

Objects of type execution_policy shall be constructible and assignable from objects of type T for which is_execution_policy<T>::value is true.

2.7.1

`execution_policy` construct/assign

[parallel.execpol.con]

template<class T> execution_policy(const T& exec);

Effects: Constructs an execution_policy object with a copy of exec's state.
Remarks: This constructor shall not participate in overload resolution unless is_execution_policy<T>::value is true.

template<class T> execution_policy& operator=(const T& exec);

Effects: Assigns a copy of exec's state to *this.
Returns: *this.

2.7.2

`execution_policy` object access

[parallel.execpol.access]


const type_info& type() const noexcept;

Returns: typeid(T), such that T is the type of the execution policy object contained by *this.


template<class T> T* get() noexcept;
template<class T> const T* get() const noexcept;

Returns: If type() == typeid(T), a pointer to the stored execution policy object; otherwise a null pointer.
Requires: is_execution_policy<T>::value is true.
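The construct, assign, type, and get members described above can be sketched with a small type-erased holder. Everything here is an assumption made for illustration: the namespace sketch, the reduced set of policies, and the shared_ptr<void> storage are not prescribed by this Technical Specification, and copies of this sketch share the stored object rather than duplicating it.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <typeinfo>

namespace sketch {
  class sequential_execution_policy {};
  class parallel_execution_policy {};

  template<class T> struct is_execution_policy : std::false_type {};
  template<> struct is_execution_policy<sequential_execution_policy> : std::true_type {};
  template<> struct is_execution_policy<parallel_execution_policy> : std::true_type {};

  class execution_policy
  {
    public:
      // Constrained so that it accepts only policy types and never competes
      // with the implicitly declared copy constructor.
      template<class T,
               class = typename std::enable_if<is_execution_policy<T>::value>::type>
      execution_policy(const T& exec)
        : type_(&typeid(T)), obj_(std::make_shared<T>(exec)) {}

      template<class T,
               class = typename std::enable_if<is_execution_policy<T>::value>::type>
      execution_policy& operator=(const T& exec)
      {
        type_ = &typeid(T);
        obj_ = std::make_shared<T>(exec);
        return *this;
      }

      // Type of the policy object currently stored.
      const std::type_info& type() const noexcept { return *type_; }

      // Pointer to the stored object if it has type T, otherwise null.
      template<class T> T* get() noexcept
      { return *type_ == typeid(T) ? static_cast<T*>(obj_.get()) : nullptr; }

      template<class T> const T* get() const noexcept
      { return *type_ == typeid(T) ? static_cast<const T*>(obj_.get()) : nullptr; }

    private:
      const std::type_info* type_;
      std::shared_ptr<void> obj_;   // owns a copy of the stored policy
  };

  // Starts with a sequential policy, then switches to a parallel one.
  inline void demo()
  {
    execution_policy exec = sequential_execution_policy{};
    assert(exec.get<sequential_execution_policy>() != nullptr);
    exec = parallel_execution_policy{};
    assert(exec.type() == typeid(parallel_execution_policy));
    assert(exec.get<sequential_execution_policy>() == nullptr);
  }
}
```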

2.8

Execution policy objects

[parallel.execpol.objects]

constexpr sequential_execution_policy      seq{};
constexpr parallel_execution_policy        par{};
constexpr parallel_vector_execution_policy par_vec{};

The header <experimental/execution_policy> declares a global object associated with each type of execution policy defined by this Technical Specification.

Parallel exceptions

[parallel.exceptions]

3.1

Exception reporting behavior

[parallel.exceptions.behavior]

During the execution of a standard parallel algorithm, if temporary memory resources are required and none are available, the algorithm throws a std::bad_alloc exception.

During the execution of a standard parallel algorithm, if the invocation of an element access function terminates with an uncaught exception, the behavior of the program is determined by the type of execution policy used to invoke the algorithm:

If the execution policy object is of type parallel_vector_execution_policy, std::terminate shall be called.
If the execution policy object is of type sequential_execution_policy or parallel_execution_policy, the execution of the algorithm terminates with an exception_list exception. All uncaught exceptions thrown during the invocations of element access functions shall be contained in the exception_list.
[ Note: For example, the number of invocations of the user-provided function object in for_each is unspecified. When for_each is executed sequentially, only one exception will be contained in the exception_list object. — end note ]
[ Note: These guarantees imply that, unless the algorithm has failed to allocate memory and terminated with std::bad_alloc, all exceptions thrown during the execution of the algorithm are communicated to the caller. It is unspecified whether an algorithm implementation will "forge ahead" after encountering and capturing a user exception. — end note ]
[ Note: The algorithm may terminate with the std::bad_alloc exception even if one or more user-provided function objects have terminated with an exception. For example, this can happen when an algorithm fails to allocate memory while creating or adding elements to the exception_list object. — end note ]
If the execution policy object is of any other type, the behavior is implementation-defined.

3.2

Header `<experimental/exception_list>` synopsis

[parallel.exceptions.synopsis]

namespace std {
namespace experimental {
namespace parallel {
inline namespace v1 {

  class exception_list : public exception
  {
    public:
      typedef unspecified iterator;
  
      size_t size() const noexcept;
      iterator begin() const noexcept;
      iterator end() const noexcept;

      const char* what() const noexcept override;
  };
}
}
}
}

The class exception_list owns a sequence of exception_ptr objects. The parallel algorithms may use the exception_list to communicate uncaught exceptions encountered during parallel execution to the caller of the algorithm.

The type exception_list::iterator shall fulfill the requirements of ForwardIterator.


size_t size() const noexcept;

Returns: The number of exception_ptr objects contained within the exception_list.
Complexity: Constant time.


iterator begin() const noexcept;

Returns: An iterator referring to the first exception_ptr object contained within the exception_list.


iterator end() const noexcept;

Returns: An iterator that is past the end of the owned sequence.


const char* what() const noexcept override;

Returns: An implementation-defined NTBS.
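How an algorithm might fill an exception_list, and how a caller might inspect one, can be sketched as follows. The namespace sketch, the constructor taking a vector, and the helpers for_each_collecting and count_failures are hypothetical. The sketch also chooses to continue after a failure, which this Technical Specification leaves unspecified, so it does not model the sequential policy's single-exception behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <utility>
#include <vector>

namespace sketch {
  // Hypothetical stand-in for exception_list: owns a sequence of exception_ptr.
  class exception_list : public std::exception
  {
    public:
      typedef std::vector<std::exception_ptr>::const_iterator iterator;

      explicit exception_list(std::vector<std::exception_ptr> errors)
        : errors_(std::move(errors)) {}

      std::size_t size() const noexcept { return errors_.size(); }
      iterator begin() const noexcept { return errors_.begin(); }
      iterator end() const noexcept { return errors_.end(); }
      const char* what() const noexcept override { return "sketch::exception_list"; }

    private:
      std::vector<std::exception_ptr> errors_;
  };

  // Applies f to every element, keeps going after a failure, and reports all
  // captured exceptions together at the end.
  template<class InputIt, class F>
  void for_each_collecting(InputIt first, InputIt last, F f)
  {
    std::vector<std::exception_ptr> errors;
    for (; first != last; ++first) {
      try { f(*first); }
      catch (...) { errors.push_back(std::current_exception()); }
    }
    if (!errors.empty()) throw exception_list(std::move(errors));
  }

  // Returns how many invalid_argument exceptions a run over values reports.
  inline std::size_t count_failures(const std::vector<int>& values)
  {
    try {
      for_each_collecting(values.begin(), values.end(), [](int v) {
        if (v < 0) throw std::invalid_argument("negative element");
      });
    } catch (const exception_list& list) {
      std::size_t n = 0;
      for (std::exception_ptr p : list) {
        try { std::rethrow_exception(p); }
        catch (const std::invalid_argument&) { ++n; }
      }
      assert(n == list.size());   // every captured exception has that type here
      return n;
    }
    return 0;
  }
}
```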

Parallel algorithms

[parallel.alg]

4.1

In general

[parallel.alg.general]

This clause describes components that C++ programs may use to perform operations on containers and other sequences in parallel.

4.1.1

Requirements on user-provided function objects

[parallel.alg.general.user]

Function objects passed into parallel algorithms as objects of type BinaryPredicate, Compare, and BinaryOperation shall not directly or indirectly modify objects via their arguments.
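A Compare object that meets this requirement can be sketched as follows. The names shorter_first and sorted_by_length are hypothetical, and the standard sequential std::sort stands in for a parallel overload, on the assumption that the TS headers may not be available.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Conforming Compare: takes its arguments by reference to const and only
// reads them, so concurrent invocations cannot interfere through them.
struct shorter_first
{
  bool operator()(const std::string& a, const std::string& b) const
  { return a.size() < b.size(); }
};

// Sorts a copy of the input by length using the comparator above.
inline std::vector<std::string> sorted_by_length(std::vector<std::string> words)
{
  std::sort(words.begin(), words.end(), shorter_first{});
  assert(std::is_sorted(words.begin(), words.end(), shorter_first{}));
  return words;
}
```

A comparator that took its arguments by non-const reference and changed them would violate the requirement even if it happened to return a consistent ordering.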

4.1.2

Effect of execution policies on algorithm execution

[parallel.alg.general.exec]

Parallel algorithms have template parameters named ExecutionPolicy which describe the manner in which the execution of these algorithms may be parallelized and the manner in which they apply the element access functions.

The invocations of element access functions in parallel algorithms invoked with an execution policy object of type sequential_execution_policy execute in sequential order in the calling thread.

The invocations of element access functions in parallel algorithms invoked with an execution policy object of type parallel_execution_policy are permitted to execute in an unordered fashion in either the invoking thread or in a thread implicitly created by the library to support parallel algorithm execution. Any such invocations executing in the same thread are indeterminately sequenced with respect to each other. [ Note: It is the caller's responsibility to ensure correctness, for example that the invocation does not introduce data races or deadlocks. — end note ]

[ Example:

using namespace std::experimental::parallel;
int a[] = {0,1};
std::vector<int> v;
for_each(par, std::begin(a), std::end(a), [&](int i) {
  v.push_back(i*2+1);
});

The program above has a data race because of the unsynchronized access to the container v. — end example ]
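One possible way to avoid the race in the preceding example is to size the output in advance so that each invocation writes only to its own element. The sketch below is an assumption-laden illustration: odd_numbers is a hypothetical name, and the standard sequential std::transform stands in for the parallel overload.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Each invocation writes only to its own output element, so no two
// invocations touch the same object and the container is never resized
// while the algorithm runs.
inline std::vector<int> odd_numbers()
{
  int a[] = {0, 1};
  std::vector<int> v(std::end(a) - std::begin(a));   // allocate all slots up front
  std::transform(std::begin(a), std::end(a), v.begin(),
                 [](int i) { return i * 2 + 1; });
  assert(v.size() == 2);
  return v;
}
```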

[ Example:

using namespace std::experimental::parallel;
std::atomic<int> x{0};
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int n) {
  x.fetch_add(1, std::memory_order_relaxed);
  // spin wait for another iteration to change the value of x
  while (x.load(std::memory_order_relaxed) == 1) { }
});

The above example depends on the order of execution of the iterations, and is therefore undefined (may deadlock). — end example ]

[ Example:

using namespace std::experimental::parallel;
int x=0;
std::mutex m;
int a[] = {1,2};
for_each(par, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});

The above example synchronizes access to object x ensuring that it is incremented correctly. — end example ]

The invocations of element access functions in parallel algorithms invoked with an execution policy of type parallel_vector_execution_policy are permitted to execute in an unordered fashion in unspecified threads, and unsequenced with respect to one another within each thread. [ Note: This means that multiple function object invocations may be interleaved on a single thread. — end note ]

[ Note: This overrides the usual guarantee from the C++ standard, Section 1.9 [intro.execution] that function executions do not interleave with one another. — end note ]

Since parallel_vector_execution_policy allows the execution of element access functions to be interleaved on a single thread, synchronization, including the use of mutexes, risks deadlock. Thus the synchronization with parallel_vector_execution_policy is restricted as follows:

A standard library function is vectorization-unsafe if it is specified to synchronize with another function invocation, or another function invocation is specified to synchronize with it, and if it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions may not be invoked by user code called from parallel_vector_execution_policy algorithms.

[ Note: Implementations must ensure that internal synchronization inside standard library routines does not induce deadlock. — end note ]

[ Example:

using namespace std::experimental::parallel;
int x=0;
std::mutex m;
int a[] = {1,2};
for_each(par_vec, std::begin(a), std::end(a), [&](int) {
  m.lock();
  ++x;
  m.unlock();
});

The above program is invalid because the applications of the function object are not guaranteed to run on different threads. — end example ]

[ Note: The application of the function object may result in two consecutive calls to m.lock on the same thread, which may deadlock. — end note ]

[ Note: The semantics of the parallel_execution_policy or the parallel_vector_execution_policy invocation allow the implementation to fall back to sequential execution if the system cannot parallelize an algorithm invocation due to lack of resources. — end note ]

Algorithms invoked with an execution policy object of type execution_policy execute internally as if invoked with the contained execution policy object.

The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are implementation-defined.

4.1.3

`ExecutionPolicy` algorithm overloads

[parallel.alg.overloads]

The Parallel Algorithms Library provides overloads for each of the algorithms named in Table 1, corresponding to the algorithms with the same name in the C++ Standard Algorithms Library. For each algorithm in Table 1, if there are overloads for corresponding algorithms with the same name in the C++ Standard Algorithms Library, the overloads shall have an additional template type parameter named ExecutionPolicy, which shall be the first template parameter. In addition, each such overload shall have the new function parameter as the first function parameter of type ExecutionPolicy&&.

Unless otherwise specified, the semantics of ExecutionPolicy algorithm overloads are identical to those of the corresponding overloads without an ExecutionPolicy parameter.

Parallel algorithms shall not participate in overload resolution unless is_execution_policy<decay_t<ExecutionPolicy>>::value is true.
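The constraint matters when an ExecutionPolicy overload has the same number of parameters as an existing overload. The sketch below shows the idea with a hypothetical function my_equal in a hypothetical namespace sketch; it forwards to the sequential std::equal and is not the TS implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <type_traits>
#include <vector>

namespace sketch {
  class sequential_execution_policy {};
  constexpr sequential_execution_policy seq{};

  template<class T> struct is_execution_policy : std::false_type {};
  template<> struct is_execution_policy<sequential_execution_policy> : std::true_type {};

  // Existing-style overload with four parameters: (first1, last1, first2, pred).
  template<class It1, class It2, class Pred>
  bool my_equal(It1 first1, It1 last1, It2 first2, Pred pred)
  { return std::equal(first1, last1, first2, pred); }

  // Policy overload, also with four parameters. The policy is the first
  // template parameter and is taken as ExecutionPolicy&&. decay removes the
  // reference and const that deduction adds for an lvalue such as seq.
  template<class ExecutionPolicy, class It1, class It2,
           class = typename std::enable_if<
             is_execution_policy<typename std::decay<ExecutionPolicy>::type>::value>::type>
  bool my_equal(ExecutionPolicy&&, It1 first1, It1 last1, It2 first2)
  { return std::equal(first1, last1, first2); }

  // Both call shapes resolve to the intended overload.
  inline void demo()
  {
    std::vector<int> a{1, 2, 3}, b{1, 2, 3};
    assert(my_equal(a.begin(), a.end(), b.begin(), std::equal_to<int>{}));
    assert(my_equal(seq, a.begin(), a.end(), b.begin()));
  }
}
```

Without the enable_if condition, the policy overload would also be a candidate for the first call, with an iterator deduced as the policy and the predicate deduced as an iterator.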

Table 1: Table of parallel algorithms

adjacent_difference       adjacent_find             all_of                any_of
copy                      copy_if                   copy_n                count
count_if                  equal                     exclusive_scan        fill
fill_n                    find                      find_end              find_first_of
find_if                   find_if_not               for_each              for_each_n
generate                  generate_n                includes              inclusive_scan
inner_product             inplace_merge             is_heap               is_heap_until
is_partitioned            is_sorted                 is_sorted_until       lexicographical_compare
max_element               merge                     min_element           minmax_element
mismatch                  move                      none_of               nth_element
partial_sort              partial_sort_copy         partition             partition_copy
reduce                    remove                    remove_copy           remove_copy_if
remove_if                 replace                   replace_copy          replace_copy_if
replace_if                reverse                   reverse_copy          rotate
rotate_copy               search                    search_n              set_difference
set_intersection          set_symmetric_difference  set_union             sort
stable_partition          stable_sort               swap_ranges           transform
transform_exclusive_scan  transform_inclusive_scan  transform_reduce      uninitialized_copy
uninitialized_copy_n      uninitialized_fill        uninitialized_fill_n  unique
unique_copy

[ Note: Not all algorithms in the Standard Library have counterparts in

Working Draft, Technical Specification for C++ Extensions for Parallelism