Question: CUDA program fails with the error "nvcc fatal : Could not open input file"

Question: When building a CUDA program, nvcc fails with "nvcc fatal : Could not open input file". How can this be fixed?
Answer 1: The path to your CUDA runtime libraries is most likely wrong. On Windows, open System Properties / Advanced / Environment Variables and check whether CUDA_PATH points to the correct CUDA installation.
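A quick way to sanity-check this from a Windows Command Prompt (the version folder shown below is only an example, not necessarily your installation) is:
echo %CUDA_PATH%
rem expect something like C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v5.0
where nvcc
rem the nvcc found first should belong to that same installation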
CUDA: nvcc fatal error when trying to make CUDASW++ - Stack Overflow
I keep getting:
nvcc fatal : Value 'sm_20' is not defined for option 'gpu-name'
My GPU is a GTX 590, which is indeed compute capability 2.0, so that's not the problem. I switched to a lower version (sm_20) and get tons of errors with .h files.
Any ideas on what to try? I'm using CUDA 5.0.
You could try compute_20 instead of sm_20.
Looking at the nvcc documentation in CUDA 5.0, the --gpu-name command line option is not mentioned. I guess it is an old option and you should probably instead use the -arch and/or -code options.
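A hedged sketch of that substitution (the file name and remaining flags are placeholders for illustration, not taken from the CUDASW++ makefile):
# old form, rejected by the CUDA 5.0 nvcc:
#   nvcc -gpu-name sm_20 -c kernel.cu
# roughly equivalent form using the current options:
nvcc -arch=compute_20 -code=sm_20,compute_20 -c kernel.cu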
nvidia - Error compiling CUDA from Command Prompt - Stack Overflow
I'm trying to compile a CUDA test program on Windows 7 via the Command Prompt, using this command:
nvcc test.cu
But all I get is this error:
nvcc fatal : Cannot find compiler 'cl.exe' in PATH
What may be causing this error?
You will need to add the folder containing the "cl.exe" file to your path environment variable. For example:
C:\Program Files\Microsoft Visual Studio 10.0\VC\bin
Edit: Ok, go to My Computer -> Properties -> Advanced System Settings -> Environment Variables. Here look for Path in the list, and add the path above (or whatever is the location of your cl.exe).
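If you only want to fix the current Command Prompt session rather than the global setting, something along these lines should work (adjust the path to match your Visual Studio installation):
set PATH=%PATH%;C:\Program Files\Microsoft Visual Studio 10.0\VC\bin
nvcc test.cu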
cl.exe is Microsoft's C/C++ compiler.
So the problem is that you don't have that installed where the command line can find it.
nvcc is only a front end for the CUDA specific part of the program. It must invoke a full compiler to finish the job. In this case it cannot find the Visual Studio compiler 'cl.exe'
Check paths, nvcc documentation etc.
You can solve this problem by adding the following option to the nvcc command line:
nvcc x.cu ...
-ccbin "D:\Program Files\Microsoft Visual Studio 11.0\VC\bin"
For example, my compiler is VS2012, and cl.exe is in that directory.
I see that this is an old question but I recently got this error on my Visual Studio 2012 when I tried to build my CUDA project.
Apparently I had changed my CUDA project to the Nov 2012 pack, changing it back to the v110 that it usually is by default fixed this error.
In Visual Studio, left click on the CUDA project, ->properties->Configuration Properties-> General -> Platform toolset, and choose: Visual Studio 2012 (v110).
I could probably get it to work with the Nov 2012 pack, but the CUDA code does not use any of the additional functions of that pack, so it is not necessary. (That pack contains the variadic templates for C++11.)
cuda - nvcc fatal: A single input file is required for a non-link phase when an outputfile is specified - Stack Overflow
I'm getting this problem with Nsight Eclipse. I just installed CUDA Toolkit 5.0, and I have a project which uses several C files and one CUDA file.
I read that sometimes the problem arises when you use C files alongside CUDA files in Nsight, so I changed all files to .cu and .cuh extensions in my project. Likewise, it said that sometimes the problem comes from having blank spaces in the file paths, which I made sure is not the case here.
The error arises when it tries to compile the first file, Calcular.cu.
This is the compilation output:
Building file: ../Calcular.cu
Invoking: NVCC Compiler
nvcc -I/usr/include/ImageMagick -G -g -O0 -gencode arch=compute_11,code=sm_11 -gencode arch=compute_12,code=sm_12 -gencode arch=compute_13,code=sm_13 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -odir "" -M -o "Calcular.d" "../Calcular.cu"
nvcc –Xcompiler –fopenmp --compile -G -I/usr/include/ImageMagick -O0 -g -gencode arch=compute_11,code=compute_11 -gencode arch=compute_11,code=sm_11 -gencode arch=compute_12,code=compute_12 -gencode arch=compute_12,code=sm_12 -gencode arch=compute_13,code=compute_13 -gencode arch=compute_13,code=sm_13 -gencode arch=compute_20,code=compute_20 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_20,code=sm_21 -gencode arch=compute_30,code=compute_30 -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=compute_35 -gencode arch=compute_35,code=sm_35
"Calcular.o" "../Calcular.cu"
nvcc fatal
: A single input file is required for a non-link phase when an outputfile is specified
make: *** [Calcular.o] Error 255
These are my compile options:
–Xcompiler –fopenmp -I/usr/include/ImageMagick -G -g -O0
The compilation gives no other errors within the files. The files it needs to compile are Calcular.cu, Calcular.cuh, Preprocesamiento.cu, Preprocesamiento.cuh, Principal.cu, Principal.cuh.
Does anyone know how to fix this? Thanks.
The dashes you have here:
–Xcompiler –fopenmp
are not the right kind of dashes.
If you look closely at your question posting, you will see they are a slightly different character than the correct one which precedes this:
-I/usr/include/ImageMagick
for example.
You need to replace those dashes with the same kind of dash used in front of the include switch.
If you manually entered those compiler options, you need to fix those characters.
This dash, – (an en dash), is not correct. Use this dash instead: - (the plain ASCII hyphen-minus).
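With the en dashes replaced by plain ASCII hyphens, the compile options from the question become:
-Xcompiler -fopenmp -I/usr/include/ImageMagick -G -g -O0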
CUDA Compiler Driver NVCC
The CUDA Toolkit targets a class of applications whose control part runs
as a process on a general purpose computing device, and which use one or
more NVIDIA GPUs as coprocessors for accelerating
single program, multiple data (SPMD) parallel jobs.
Such jobs are self-contained, in the sense that they can be executed and
completed by a batch of GPU threads entirely without intervention by the
host process, thereby gaining optimal benefit from the parallel graphics hardware.
The GPU code is implemented as a collection of functions in a language
that is essentially C++, but with some annotations for distinguishing
them from the host code, plus annotations for distinguishing different
types of data memory that exist on the GPU.
Such functions may have parameters, and they can be called using a
syntax that is very similar to regular C function calling, but slightly
extended for being able to specify the matrix of GPU threads that must
execute the called function.
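As a minimal illustration (not part of the original text; the names are arbitrary), such an annotated function and its extended call syntax look like this:
// scale.cu -- sketch of a GPU function and the extended call syntax
__global__ void scale(float *data, float factor)   // __global__ marks a GPU entry function
{
    data[threadIdx.x] *= factor;
}

int main()
{
    float *d_data;
    cudaMalloc((void**)&d_data, 256 * sizeof(float));
    scale<<<1, 256>>>(d_data, 2.0f);   // <<<blocks, threads>>> specifies the matrix of GPU threads
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}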
During its lifetime, the host process may dispatch many parallel GPU tasks.
For more information on the CUDA programming model, consult the CUDA C Programming Guide.
Source files for CUDA applications consist of a mixture of conventional
C++ host code, plus GPU device functions.
The CUDA compilation trajectory separates the device functions from the
host code, compiles the device functions using the proprietary NVIDIA
compilers and assembler, compiles the host code using a C++ host
compiler that is available, and afterwards embeds the compiled GPU
functions as fatbinary images in the host object file.
In the linking stage, specific CUDA runtime libraries are added for
supporting remote SPMD procedure calling and for providing explicit GPU
manipulation such as allocation of GPU memory buffers and host-GPU data transfer.
The compilation trajectory involves several splitting, compilation,
preprocessing, and merging steps for each CUDA source file.
It is the purpose of nvcc, the CUDA compiler driver, to
hide the intricate details of CUDA compilation from developers.
It accepts a range of conventional compiler options, such as for
defining macros and include/library paths, and for steering the
compilation process.
All non-CUDA compilation steps are forwarded to a C++ host compiler that
is supported by nvcc, and nvcc
translates its options to appropriate host compiler command line options.
A general purpose C++ host compiler is needed by nvcc
in the following situations:
During non-CUDA phases (except the run phase), because these phases
will be forwarded by nvcc to this compiler.
During CUDA phases, for several preprocessing stages and host code
compilation (see the CUDA compilation trajectory described below).
nvcc assumes that the host compiler is installed with
the standard method designed by the compiler provider.
If the host compiler installation is non-standard, the user must make
sure that the environment is set appropriately and use relevant
nvcc compile options.
Platform-specific documents provide detailed information about the supported host compilers.
On all platforms, the default host compiler executable
(gcc and g++ on Linux,
clang and clang++ on Mac OS X, and
cl.exe on Windows) found in the current execution
search path will be used, unless specified otherwise with appropriate
options.
nvcc predefines the following macros:
__NVCC__
Defined when compiling C/C++/CUDA source files.
__CUDACC__
Defined when compiling CUDA source files.
__CUDACC_RDC__
Defined when compiling CUDA source files in relocatable device code mode (see the section on separate compilation below).
__CUDACC_VER_MAJOR__
Defined with the major version number of nvcc.
__CUDACC_VER_MINOR__
Defined with the minor version number of nvcc.
__CUDACC_VER_BUILD__
Defined with the build version number of nvcc.
__CUDACC_VER__
Defined with the full version number of nvcc, represented as
__CUDACC_VER_MAJOR__ * 10000 + __CUDACC_VER_MINOR__ * 100 + __CUDACC_VER_BUILD__
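As an illustration (not from the original text), these macros can be used to report, or guard code on, the nvcc version:
// version_probe.cu -- sketch: print the nvcc version macros
#include <cstdio>

int main()
{
#if defined(__CUDACC__)
    printf("compiled by nvcc %d.%d (build %d)\n",
           __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, __CUDACC_VER_BUILD__);
#else
    printf("not compiled by nvcc\n");
#endif
    return 0;
}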
A compilation phase is a logical translation step that can be
selected by command line options to nvcc.
A single compilation phase can still be broken up by
nvcc into smaller steps, but these smaller steps are
just implementations of the phase: they depend on seemingly arbitrary
capabilities of the internal tools that nvcc uses, and
all of these internals may change with a new release of the CUDA Toolkit.
Hence, only compilation phases are stable across releases, and although
nvcc provides options to display the compilation steps
that it executes, these are for debugging purposes only and must not be
copied into build scripts.
nvcc phases are selected by a combination of command
line options and input file name suffixes, and the execution of these
phases may be modified by other command line options.
In phase selection, the input file suffix defines the phase input, while
the command line option defines the required output of the phase.
The following paragraphs will list the recognized file name suffixes and
the supported compilation phases.
A full explanation of the nvcc command line options can
be found in the next chapter.
The following table defines how nvcc interprets its
input files:
Note that nvcc does not make any distinction
between object, library or resource files.
It just passes files of these types to the linker when the linking
phase is executed.
The following table specifies the supported compilation phases, plus
the option to nvcc that enables execution of this phase.
It also lists the default name of the output file generated by this
phase, which will take effect when no explicit output file name is
specified using option --output-file.
The last phase in this list is more of a convenience phase.
It allows running the compiled and linked executable without
having to explicitly set the library path to the CUDA dynamic
libraries.
Unless a phase option is specified, nvcc will
compile and link all its input files.
Each nvcc option has a long name and a short name,
which are interchangeable with each other.
These two variants are distinguished by the number of hyphens that must
precede the option name: long names must be preceded by two hyphens,
while short names must be preceded by a single hyphen.
For example, -I is the short name of --include-path.
Long options are intended for use in build scripts, where size of the
option is less important than descriptive value.
In contrast, short options are intended for interactive use.
nvcc recognizes three types of command options: boolean
options, single value options, and list options.
Boolean options do not have an argument; they are either specified on a command line or not.
Single value options must be specified at most once, and list options
may be repeated.
Examples of each of these option types are, respectively: --verbose (switch to verbose mode), --output-file (specify output file), and --include-path (specify include path).
Single value options and list options must have arguments, which must
follow the name of the option itself by either one or more spaces or an
equals character.
When a one-character short name such as -I is used, the value of the option may also immediately follow the option itself without being separated by spaces or an equals character.
The individual values of list options may be separated by commas in a
single instance of the option, or the option may be repeated, or any
combination of these two cases.
Hence, for the two sample options mentioned above that may take values,
the following notations are legal:
-Idir1,dir2 -I=dir3 -I dir4,dir5
Long option names are used throughout the document, unless specified otherwise; however, short names can be used instead of long names to have the same effect.
This section presents tables of nvcc options.
The option type in the tables can be recognized as follows: boolean
options do not have arguments specified in the first column, while the
other two types do.
List options can be recognized by the repeat indicator
,... at the end of the argument.
Long options are described in the first columns of the options tables,
and short options occupy the second columns.
Options of this category specify up to which stage the input files
must be compiled.
These allow for passing specific options directly to the internal
compilation tools that nvcc encapsulates, without
burdening nvcc with too-detailed knowledge on these tools.
A table of useful sub-tool options can be found at the end of this chapter.
The following sections list some useful options for the lower level
compilation tools.
The following table lists some useful ptxas options which can be specified with the nvcc option --ptxas-options.
The following table lists some useful nvlink options which can be specified with the nvcc option --nvlink-options.
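For instance (an illustrative command, not taken from the document's tables), ptxas's verbose resource report can be requested through this pass-through mechanism:
nvcc x.cu --ptxas-options=-v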
The CUDA phase converts a source file coded in the extended CUDA
language into a regular ANSI C++ source file that can be handed over to
a general purpose C++ host compiler for further compilation and linking.
The exact steps that are followed to achieve this are displayed in Figure 1.
CUDA compilation works as follows: the input program is
preprocessed for device compilation and is compiled to CUDA
binary (cubin) and/or PTX intermediate code, which are
placed in a fatbinary.
The input program is preprocessed once again for host compilation and is
synthesized to embed the fatbinary and transform CUDA specific C++
extensions into standard C++ constructs.
Then the C++ host compiler compiles the synthesized host code with the
embedded fatbinary into a host object.
The embedded fatbinary is inspected by the CUDA runtime system whenever
the device code is launched by the host program to obtain an appropriate
fatbinary image for the current GPU.
The CUDA compilation trajectory is more complicated in the separate
compilation mode.
For more information, see the section on separate compilation below.
Figure 1. CUDA Whole Program Compilation Trajectory
This chapter describes the GPU compilation model that is maintained by
nvcc, in cooperation with the CUDA driver.
It goes through some technical sections, with concrete examples at the end.
In order to allow for architectural evolution, NVIDIA GPUs are released
in different generations.
New generations introduce major improvements in functionality and/or
chip architecture, while GPU models within the same generation show
minor configuration differences that moderately affect
functionality, performance, or both.
Binary compatibility of GPU applications is not guaranteed across
different generations.
For example, a CUDA application that has been compiled for a Fermi GPU
will very likely not run on a Kepler GPU (and vice versa).
This is because the instruction set and instruction encodings of a generation are different from those of other generations.
Binary compatibility within one GPU generation can be guaranteed under
certain conditions because they share the basic instruction set.
This is the case between two GPU versions that do not show functional
differences at all (for instance when one version is a scaled down
version of the other), or when one version is functionally included in
the other.
An example of the latter is the base Kepler version
sm_30 whose functionality is a subset of all other
Kepler versions: any code compiled for sm_30 will run
on all other Kepler GPUs.
The following table lists the names of the current GPU architectures,
annotated with the functional capabilities that they provide.
There are other differences, such as the number of registers and processor clusters, that only affect execution performance.
In the CUDA naming scheme, GPUs are named sm_xy, where
x denotes the GPU generation number, and
y the version in that generation.
Additionally, to facilitate comparing GPU capabilities, CUDA attempts to choose its GPU names such that if x1y1 <= x2y2 then all non-ISA related capabilities of sm_x1y1 are included in those of sm_x2y2.
From this it indeed follows that sm_30 is the
base Kepler model, and it also explains why higher entries in the
tables are always functional extensions to the lower entries.
This is denoted by the plus sign in the table.
Moreover, if we abstract from the instruction encoding, it implies that
sm_30's functionality will continue to be included in
all later GPU generations.
As we will see next, this property will be the foundation for
application compatibility support by nvcc.
sm_20                                    Basic features + Fermi support
sm_30 and sm_32                          + Kepler support + Unified memory programming
sm_35                                    + Dynamic parallelism support
sm_50, sm_52, and sm_53                  + Maxwell support
Binary code compatibility over CPU generations, together with a
published instruction set architecture is the usual mechanism for
ensuring that distributed applications out there in the field
will continue to run on newer versions of the CPU when these become
mainstream.
This situation is different for GPUs, because NVIDIA cannot guarantee
binary compatibility without sacrificing regular opportunities for GPU
improvements.
Rather, as is already conventional in the graphics programming domain,
nvcc relies on a two stage compilation model for
ensuring application compatibility with future GPU generations.
GPU compilation is performed via an intermediate representation, PTX,
which can be considered as assembly for a virtual GPU architecture.
Contrary to an actual graphics processor, such a virtual GPU is defined
entirely by the set of capabilities, or features, that it provides to
the application.
In particular, a virtual GPU architecture provides a (largely) generic
instruction set, and binary instruction encoding is a non-issue because
PTX programs are always represented in text format.
Hence, an nvcc compilation command always uses two
architectures: a virtual intermediate architecture, plus a
real GPU architecture to specify the intended processor to
execute on.
For such an nvcc command to be valid, the real
architecture must be an implementation of the virtual
architecture.
This is further explained below.
The chosen virtual architecture is more of a statement on the GPU
capabilities that the application requires: using the smallest
virtual architecture still allows the widest range of actual
architectures for the second nvcc stage.
Conversely, specifying a virtual architecture that provides features
unused by the application unnecessarily restricts the set of possible
GPUs that can be specified in the second nvcc stage.
From this it follows that the virtual architecture should always be
chosen as low as possible, thereby maximizing the actual GPUs to run on.
The real architecture should be chosen as high as
possible (assuming that this always generates better code), but this is
only possible with knowledge of the actual GPUs on which the application
is expected to run.
As we will see later, this is exactly what happens with just-in-time compilation,
where the driver has this exact knowledge: the runtime GPU is the one on
which the program is about to be launched/executed.
Figure 2. Two-Staged Compilation with Virtual and Real Architectures
compute_20                                Basic features + Fermi support
compute_30 and compute_32                 + Kepler support + Unified memory programming
compute_35                                + Dynamic parallelism support
compute_50, compute_52, and compute_53    + Maxwell support
The above table lists the currently defined virtual architectures.
The virtual architecture naming scheme is the same as the real
architecture naming scheme shown in the previous section.
Clearly, compilation staging in itself does not help towards the goal of
application compatibility with future GPUs.
For this we need the two other mechanisms described in this chapter: just-in-time compilation (JIT) and fatbinaries.
By specifying a virtual code architecture instead of a real GPU,
nvcc postpones the assembly of PTX code until
application runtime, at which point the target GPU is exactly known.
For instance, the command below allows generation of exactly matching
GPU binary code, when the application is launched on an
sm_20 or later architecture.
nvcc x.cu --gpu-architecture=compute_20 --gpu-code=compute_20
The disadvantage of just in time compilation is increased application
startup delay, but this can be alleviated by letting the CUDA driver use
a compilation cache (refer to "Section 3.1.1.2. Just-in-Time
Compilation" of the CUDA C Programming Guide), which is persistent over multiple runs of the application.
A different solution to overcome startup delay by JIT while still
allowing execution on newer GPUs is to specify multiple code instances, as in
nvcc x.cu --gpu-architecture=compute_30 --gpu-code=compute_30,sm_30,sm_35
This command generates exact code for two Kepler variants, plus PTX code
for use by JIT in case a next-generation GPU is encountered.
nvcc organizes its device code in fatbinaries, which
are able to hold multiple translations of the same GPU source code.
At runtime, the CUDA driver will select the most appropriate translation
when the device function is launched.
nvcc provides the options --gpu-architecture and --gpu-code
for specifying the target architectures for both translation stages.
Except for allowed shorthands described below, the --gpu-architecture
option takes a single value, which must be the
name of a virtual compute architecture, while the --gpu-code option
takes a list of values which must all be the
names of actual GPUs.
nvcc performs a stage 2 translation for each of these
GPUs, and will embed the result in the result of compilation (which
usually is a host object file or executable).
Example
nvcc x.cu --gpu-architecture=compute_30 --gpu-code=sm_30,sm_35
nvcc allows a number of shorthands for simple cases.
The --gpu-code arguments can be virtual architectures.
In this case the stage 2 translation will be omitted for such virtual
architecture, and the stage 1 PTX result will be embedded instead.
At application launch, and in case the driver does not find a better
alternative, the stage 2 compilation will be invoked by the driver
with the PTX as input.
Example
nvcc x.cu --gpu-architecture=compute_30 --gpu-code=compute_30,sm_30,sm_35
The --gpu-code option can be omitted.
Only in this case, the --gpu-architecture value can be a non-virtual architecture.
The --gpu-code values default to the closest virtual architecture that is implemented by the GPU specified with --gpu-architecture, plus the --gpu-architecture value itself.
The closest virtual architecture is used as the effective --gpu-architecture value.
If the --gpu-architecture value is a virtual architecture, it is also used as the effective --gpu-code value.
nvcc x.cu --gpu-architecture=sm_35
nvcc x.cu --gpu-architecture=compute_30
are equivalent to
nvcc x.cu --gpu-architecture=compute_35 --gpu-code=sm_35,compute_35
nvcc x.cu --gpu-architecture=compute_30 --gpu-code=compute_30
Both the --gpu-architecture and --gpu-code options can be omitted.
In that case, nvcc x.cu is equivalent to
nvcc x.cu --gpu-architecture=compute_20 --gpu-code=sm_20,compute_20
The options --gpu-architecture and --gpu-code
can be used in all cases where code is to be generated for one or more
GPUs using a common virtual architecture.
This will cause a single invocation of nvcc stage 1
(that is, preprocessing and generation of virtual PTX assembly code),
followed by a compilation stage 2 (binary code generation) repeated for
each specified GPU.
Using a common virtual architecture means that all assumed GPU features
are fixed for the entire nvcc compilation.
For instance, the following nvcc command assumes no
warp shuffle support for both the sm_20 code and the
sm_30 code:
nvcc x.cu --gpu-architecture=compute_20 --gpu-code=compute_20,sm_20,sm_30
Sometimes it is necessary to perform different GPU code generation
steps, partitioned over different architectures.
This is possible using the nvcc option --generate-code,
which then must be used instead of a --gpu-architecture/--gpu-code combination.
Unlike the --gpu-architecture option, --generate-code
may be repeated on the nvcc command line.
It takes sub-options arch and code,
which must not be confused with their main option equivalents, but
behave similarly.
If repeated architecture compilation is used, then the device code must
use conditional compilation based on the value of the architecture
identification macro __CUDA_ARCH__, which is described
in the next section.
For example, the following assumes absence of warp shuffle support for
the sm_20 and sm_21 code, but full
support on sm_3x:
nvcc x.cu \
--generate-code arch=compute_20,code=sm_20 \
--generate-code arch=compute_20,code=sm_21 \
--generate-code arch=compute_30,code=sm_30
Or, leaving actual GPU code generation to the JIT compiler in the CUDA
nvcc x.cu \
--generate-code arch=compute_20,code=compute_20 \
--generate-code arch=compute_30,code=compute_30
The code sub-options can be combined, but for technical reasons must
then be quoted, which causes a slightly more complex syntax:
nvcc x.cu \
--generate-code arch=compute_20,code=\"sm_20,sm_21\" \
--generate-code arch=compute_30,code=\"sm_30,sm_35\"
The architecture identification macro __CUDA_ARCH__
is assigned a three-digit value string xy0 (ending in
a literal 0) during each nvcc
compilation stage 1 that compiles for compute_xy.
This macro can be used in the implementation of GPU functions for
determining the virtual architecture for which it is currently being compiled.
The host code (the non-GPU code) must not depend on it.
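For example (an illustrative sketch, not from the original document), a kernel can select per-architecture code paths like this:
// arch_probe.cu -- each stage 1 pass sees its own __CUDA_ARCH__ value
__global__ void which_arch(int *out)
{
#if __CUDA_ARCH__ >= 300
    *out = 300;   // taken when compiling for compute_30 or higher
#elif __CUDA_ARCH__ >= 200
    *out = 200;   // taken when compiling for compute_20
#else
    *out = 100;   // any lower virtual architecture (and the host pass, where the macro is undefined)
#endif
}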
Prior to the 5.0 release, CUDA did not support separate compilation, so
CUDA code could not call device functions or access variables across files.
Such compilation is referred to as whole program compilation.
We have always supported the separate compilation of host code; it was
just the device CUDA code that needed to all be within one file.
Starting with CUDA 5.0, separate compilation of device code is
supported, but the old whole program mode is still the default, so there
are new options to invoke separate compilation.
The code changes required for separate compilation of device code are
the same as what you already do for host code, namely using
extern and static to control the
visibility of symbols.
Note that previously extern was ignored in CUDA code; now it will be honored.
With the use of static it is possible to have multiple
device symbols with the same name in different files.
For this reason, the CUDA API calls that referred to symbols by their string name are deprecated; instead the symbol should be referenced by its address.
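A small illustrative sketch (the file and symbol names are assumptions, not from the original): static keeps same-named device symbols private to their translation units, while extern declarations resolve to a definition elsewhere.
// a_impl.cu (hypothetical file)
static __device__ int counter;        // private to this translation unit
// b_impl.cu (hypothetical file)
static __device__ int counter;        // a distinct variable; no duplicate-symbol error at device link time
extern __device__ int g_total;        // resolved against a definition in some other .cu file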
CUDA works by embedding device code into host objects.
In whole program compilation, it embeds executable device code into the
host object.
In separate compilation, we embed relocatable device code into the host
object, and run nvlink, the device linker, to link all
the device code together.
The output of nvlink is then linked together with all the host objects
by the host linker to form the final executable.
The generation of relocatable vs. executable device code is controlled by the --relocatable-device-code option.
The --compile option is already used to control stopping a compile at a host object, so a new option --device-c is added that simply does --relocatable-device-code=true --compile.
To invoke just the device linker, the --device-link option can be used, which emits a host object
containing the embedded executable device code.
The output of that must then be passed to the host linker.
nvcc <objects>
can be used to implicitly call both the device and host linkers.
This works because if the device linker does not see any relocatable
code it does not do anything.
The following figure shows the flow (nvcc --device-c has the same flow as the whole program compilation trajectory shown in Figure 1).
Figure 4. CUDA Separate Compilation Trajectory
The device linker has the ability to read the static host library
formats (.a on Linux and Mac OS X,
.lib on Windows).
It ignores any dynamic (.so or .dll)
libraries.
The --library and --library-path options can be used to pass libraries to both the device and host linker.
The library name is specified without the library file extension when the --library option is used.
nvcc --gpu-architecture=sm_20 a.o b.o --library-path=<path> --library=foo
Alternatively, the library name, including the library file extension,
can be used without the --library option on Windows.
nvcc --gpu-architecture=sm_20 a.obj b.obj foo.lib --library-path=<path>
Note that the device linker ignores any objects that do not have
relocatable device code.
Suppose we have the following files:
//---------- b.h ----------
#define N 8
extern __device__ int g[N];
extern __device__ void bar(void);

//---------- b.cu ----------
#include "b.h"
__device__ int g[N];
__device__ void bar (void)
{
  g[threadIdx.x]++;
}

//---------- a.cu ----------
#include <stdio.h>
#include "b.h"
__global__ void foo (void) {
  __shared__ int a[N];
  a[threadIdx.x] = threadIdx.x;
  __syncthreads();
  g[threadIdx.x] = a[blockDim.x - threadIdx.x - 1];
  bar();
}
int main (void) {
  unsigned int i;
  int *dg, hg[N];
  int sum = 0;
  foo<<<1, N>>>();
  if (cudaGetSymbolAddress((void**)&dg, g)) {
    printf("couldn't get the symbol addr\n");
    return 1;
  }
  if (cudaMemcpy(hg, dg, N * sizeof(int), cudaMemcpyDeviceToHost)) {
    printf("couldn't memcpy\n");
    return 1;
  }
  for (i = 0; i < N; i++) {
    sum += hg[i];
  }
  if (sum == 36) {
    printf("PASSED\n");
  } else {
    printf("FAILED (%d)\n", sum);
  }
  return 0;
}
These can be compiled with the following commands (these examples are
for Linux):
nvcc --gpu-architecture=sm_20 --device-c a.cu b.cu
nvcc --gpu-architecture=sm_20 a.o b.o
If you want to invoke the device and host linker separately, you can do:
nvcc --gpu-architecture=sm_20 --device-c a.cu b.cu
nvcc --gpu-architecture=sm_20 --device-link a.o b.o --output-file link.o
g++ a.o b.o link.o --library-path=<path> --library=cudart
Note that a target architecture must be passed to the device linker.
If you want to use the driver API to load a linked cubin, you can
request just the cubin:
nvcc --gpu-architecture=sm_20 --device-link a.o b.o \
--cubin --output-file link.cubin
The objects could be put into a library and used with:
nvcc --gpu-architecture=sm_20 --device-c a.cu b.cu
nvcc --lib a.o b.o --output-file test.a
nvcc --gpu-architecture=sm_20 test.a
Note that only static libraries are supported by the device linker.
A PTX file can be compiled to a host object file and then linked by using:
nvcc --gpu-architecture=sm_20 --device-c a.ptx
An example that uses libraries, the host linker, and dynamic parallelism would be:
nvcc --gpu-architecture=sm_35 --device-c a.cu b.cu
nvcc --gpu-architecture=sm_35 --device-link a.o b.o --output-file link.o
nvcc --lib --output-file libgpu.a a.o b.o link.o
g++ host.o --library=gpu --library-path=<path> \
--library=cudadevrt --library=cudart
It is possible to do multiple device links within a single host
executable, as long as each device link is independent of the other.
This requirement of independence means that they cannot share code
across device executables, nor can they share addresses (e.g., a
device function address can be passed from host to device for a
callback only if the device link sees both the caller and the potential callback; you cannot pass an address from one device executable
to another, as those are separate address spaces).
Only relocatable device code with the same ABI version, same SM target
architecture, and same pointer size (32 or 64) can be linked together.
Incompatible objects will produce a link error.
An object could have been compiled for a different architecture but also
have PTX available, in which case the device linker will JIT the PTX to
cubin for the desired architecture and then link.
Relocatable device code requires CUDA 5.0 or later Toolkit.
If a kernel is limited to a certain number of registers with the launch_bounds attribute or the --maxrregcount option, then all functions that the kernel calls must not use more than that number of registers; if they exceed the limit, a link error will be given.
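For reference, a hedged sketch of how such a limit is typically expressed (the numbers are arbitrary, not from the original):
// Either cap registers on the command line:
//   nvcc --gpu-architecture=sm_20 --device-c --maxrregcount=32 a.cu
// or annotate the kernel with launch bounds (at most 256 threads per block,
// at least 4 resident blocks per multiprocessor):
__global__ void __launch_bounds__(256, 4) my_kernel(float *data)
{
    data[threadIdx.x] += 1.0f;
}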
CUDA 5.0 does not support JIT linking, while CUDA 5.5 does.
This means that to use JIT linking you must recompile your code with
CUDA 5.5 or later.
JIT linking means doing a relink of the code at startup time.
The device linker (nvlink) links at the cubin level.
If the cubin does not match the target architecture at load time, the
driver re-invokes the device linker to generate cubin for the target
architecture, by first JIT'ing the PTX for each object to the
appropriate cubin, and then linking together the new cubin.
A file like b.cu above only contains CUDA device code,
so one might think that the b.o object doesn't need to be passed to the
host linker.
But actually there is implicit host code generated whenever a device
symbol can be accessed from the host side, either via a launch or an
API call like cudaGetSymbolAddress().
This implicit host code is put into b.o, and needs to
be passed to the host linker.
Plus, for JIT linking to work all device code must be passed to the host
linker, else the host executable will not contain device code needed for
the JIT link.
So a general rule is that the device linker and host linker must see the same host object files (if the object files have any device references in them; if a file is pure host code then the device linker doesn't need to see it).
If an object file containing device code is not passed to the host
linker, then you will see an error message about the function
__cudaRegisterLinkedBinary_name calling an
undefined or unresolved symbol
__fatbinwrap_name.
In separate compilation, __CUDA_ARCH__ must not be used
in headers such that different objects could contain different behavior.
Or, it must be guaranteed that all objects will compile for the same
compute_arch.
If a weak function or template function is defined in a header and its
behavior depends on __CUDA_ARCH__, then the instances
of that function in the objects could conflict if the objects are
compiled for different compute arch.
For example, if an a.h contains:
template&typename T&
__device__ T* getptr(void)
#if __CUDA_ARCH__ == 200
return NULL;
__shared__ T arr[256];
Then if a.cu and b.cu both include a.h and instantiate
getptr for the same type, and b.cu expects a non-NULL
address, and compile with:
nvcc --gpu-architecture=compute_20 --device-c a.cu
nvcc --gpu-architecture=compute_30 --device-c b.cu
nvcc --gpu-architecture=sm_30 a.o b.o
At link time only one version of getptr is used, so the behavior
would depend on which version is picked.
To avoid this, either a.cu and b.cu must be compiled for the same
compute arch, or __CUDA_ARCH__ should not be used in
the shared header function.
Cross compilation is controlled by using the following nvcc command line options:
--compiler-bindir is used for cross compilation, where the underlying host compiler is capable of generating objects for the target platform.
--machine=32 signals that the target platform is a 32-bit platform. Use this when the host platform is a 64-bit platform.
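A hedged illustration of these options (the source file name is arbitrary and the cross toolchain directory is hypothetical):
nvcc --machine=32 x.cu
nvcc --machine=32 --compiler-bindir=/opt/cross/bin x.cu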
nvcc stores intermediate results by default into
temporary files that are deleted immediately before it completes.
The location of the temporary file directory used is, depending on
the current platform, as follows:
Windows
Value of environment variable TEMP is used.
If it is not set, C:\Windows\temp is used instead.
Other Platforms
Value of environment variable TMPDIR is used.
If it is not set, /tmp is used instead.
The --keep option makes nvcc store these intermediate files in the current directory or in the directory specified by --keep-dir instead.
All files generated by a particular nvcc command can be
cleaned up by repeating the command with the additional option --clean-targets.
This option is particularly useful after using --keep, because the --keep option usually leaves quite an amount of intermediate files around.
Because using --clean-targets will remove exactly what the original nvcc command
created, it is important to exactly repeat all of the options in the
original command.
For instance, in the following example, omitting --keep will have different cleanup effects.
nvcc acos.cu --keep
nvcc acos.cu --keep --clean-targets
A summary on the amount of used registers and the amount of memory
needed per compiled device function can be printed by passing option --resource-usage to nvcc:
$ nvcc --resource-usage acos.cu
ptxas info    : 1536 bytes gmem, 8 bytes cmem[14]
ptxas info    : Compiling entry function 'acos_main' for 'sm_20'
ptxas info    : Function properties for acos_main
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 6 registers, 1536 bytes smem, 32 bytes cmem[0]
As shown in the above example, the amounts of statically allocated global memory (gmem) and
constant memory in bank 14 (cmem) are listed.
Global memory and some of the constant banks are module scoped resources and not per kernel
resources.
Allocation of constant variables to constant banks is profile specific.
Followed by this, per kernel resource information is printed.
Stack frame is per thread stack usage used by this function.
Spill stores and loads
represent stores and loads done on stack memory which are being used for storing variables
that couldn't be allocated to physical registers.
Similarly, the number of registers, the amount of shared memory, and the total space allocated in the constant bank are shown.
Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation
in the U.S. and other countries.
Other company and product names may be trademarks of
the respective companies with which they are associated.