This PR enhances `BasicPtxBuilder` to support predicates in PTX code
generation. The `BasicPtxBuilder` interface was initially introduced for
generating PTX code automatically for Ops that aren't supported by LLVM
core. Predicates, which are typically not supported in LLVM core, are
now supported using the same mechanism.
In PTX programming, instructions can be guarded by predicates as shown
below:. Here `@p` is a predicate register and guard the execution of the
instruction.
```
@p ptx.code op1, op2, op3
```
This PR introduces the `getPredicate` function in the `BasicPtxBuilder`
interface to set an optional predicate. When a predicate is provided,
the instruction is generated with predicate and guarded, otherwise,
predicate is not genearted. Note that the predicate value must always
appear as the last argument on the Op definition.
Additionally, this PR implements predicate usage for the following ops:
- mbarrier.init
- mbarrier.init.shared
- mbarrier.arrive.expect_tx
- mbarrier.arrive.expect_tx.shared
- cp.async.bulk.tensor.shared.cluster.global
- cp.async.bulk.tensor.global.shared.cta
See for more detail in PTX programing model
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-instructions