Python cudaDeviceSynchronize 예제들

프로그래밍 언어: Python

네임스페이스/패키지 이름: cuda.cudart

메소드/함수: cudaDeviceSynchronize

hotexamples.com에서의 예제들: 2

Python cudaDeviceSynchronize - 2개의 예제가 발견되었습니다. 이것들은 오픈소스 프로젝트에서 추출된 Python의 cuda.cudart.cudaDeviceSynchronize에 대한 실세계 최고 등급의 예제들입니다. 예제들을 평가하여 예제의 품질 향상에 도움을 줄 수 있습니다.

예제 #1

파일 보기

    def benchmark(self, func, gpu_args, threads, grid):
        """runs the kernel and measures time repeatedly, returns average time

        Runs the kernel and measures kernel execution time repeatedly, number of
        iterations is set during the creation of CudaFunctions. Benchmark returns
        a robust average, from all measurements the fastest and slowest runs are
        discarded and the rest is included in the returned average. The reason for
        this is to be robust against initialization artifacts and other exceptional
        cases.

        :param func: A PyCuda kernel compiled for this specific kernel configuration
        :type func: pycuda.driver.Function

        :param gpu_args: A list of arguments to the kernel, order should match the
            order in the code. Allowed values are either variables in global memory
            or single values passed by value.
        :type gpu_args: list( pycuda.driver.DeviceAllocation, numpy.int32, ...)

        :param threads: A tuple listing the number of threads in each dimension of
            the thread block
        :type threads: tuple(int, int, int)

        :param grid: A tuple listing the number of thread blocks in each dimension
            of the grid
        :type grid: tuple(int, int)

        :returns: A dictionary with benchmark results.
        :rtype: dict()
        """

        result = dict()
        err = cudart.cudaDeviceSynchronize()
        for _ in range(self.iterations):
            for obs in self.observers:
                obs.before_start()
            err = cudart.cudaDeviceSynchronize()
            err = cudart.cudaEventRecord(self.start, self.stream)
            self.run_kernel(func, gpu_args, threads, grid, stream=self.stream)
            err = cudart.cudaEventRecord(self.end, self.stream)
            for obs in self.observers:
                obs.after_start()
            while cudart.cudaEventQuery(
                    self.end) != cuda.CUresult.CUDA_SUCCESS:
                for obs in self.observers:
                    obs.during()
                time.sleep(1e-6)
            for obs in self.observers:
                obs.after_finish()

        for obs in self.observers:
            result.update(obs.get_results())

        return result

예제 #2

파일 보기

파일: int8-QDQ.py 프로젝트: NVIDIA/trt-samples-for-hackathon-cn

#

import numpy as np
from cuda import cudart
import tensorrt as trt

nIn, cIn, hIn, wIn = 1, 1, 6, 9
cOut, hW, wW = 1, 3, 3
data = np.tile(
    np.arange(1, 1 + hW * wW, dtype=np.float32).reshape(hW, wW),
    (cIn, hIn // hW, wIn // wW)).reshape(1, cIn, hIn, wIn)
weight = np.full([cOut, hW, wW], 1, dtype=np.float32)
bias = np.zeros(cOut, dtype=np.float32)

np.set_printoptions(precision=8, linewidth=200, suppress=True)
cudart.cudaDeviceSynchronize()

logger = trt.Logger(trt.Logger.ERROR)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
config = builder.create_builder_config()
config.flags = 1 << int(trt.BuilderFlag.INT8)  # 需要打开 int8 模式
inputT0 = network.add_input('inputT0', trt.float32, (nIn, cIn, hIn, wIn))

qValue = 1 / 1
qTensor = network.add_constant([], np.array([qValue],
                                            dtype=np.float32)).get_output(0)
inputQLayer = network.add_quantize(inputT0, qTensor)
inputQLayer.axis = 0
inputQDQLayer = network.add_dequantize(inputQLayer.get_output(0), qTensor)