|Fortran Programming Guide|
This chapter describes how to measure and display program performance. Knowing where a program is spending most of its compute cycles and how efficiently it uses system resources is a prerequisite for performance tuning.
Sun WorkShop Performance Analyzer
Sun WorkShop Performance Analyzer provides a sophisticated pair of tools for collecting and analyzing program performance data:
- The Sampling Collector collects performance data (statistical profiles of call stacks, thread-synchronization delay events, hardware-counter overflow profiles, address space data, and summary information for the operating system), and stores it in an experiment file.
- The Sampling Analyzer displays the data recorded by the Sampling Collector, so you can examine the information. The analyzer processes the data and displays various metrics of performance, at function, caller-callee, source-line, disassembly-instruction, and program levels.
The Collector and Analyzer are designed for use by any software developer, even if performance tuning is not the developer's main responsibility.
Command-line equivalents of the Collector and Analyzer are available:
- dbx includes a data-collection feature that has the same functionality as the Collector.
- The command-line utility er_print(1), which prints out an ASCII version of the various Analyzer displays, operates as a command-line sampling analyzer.
Details can be found in the Sun WorkShop manual Analyzing Program Performance With Sun WorkShop.
The simplest way to gather basic data about program performance and resource utilization is to use the time(1) command or, in csh, its built-in time command:

demo% /bin/time myprog
 ... output from program
6.5u 17.1s 1:16 31% 11+21k 354+210io 135pf+0w

The interpretation is:

  6.5u   17.1s   1:16       31%         11+21k   354+210io   135pf+0w
  user   system  wallclock  resources   memory   I/O         paging
- user - 6.5 seconds in user code, approximately
- system - 17.1 seconds in system code for this task, approximately
- wallclock - 1 minute 16 seconds to complete
- resources - 31% of system resources dedicated to this program
- memory - 11 kilobytes of shared program memory, 21 kilobytes of private data memory
- I/O - 354 reads, 210 writes
- paging - 135 page faults, 0 swapouts
Multiprocessor Interpretation of time Output
Timing results are interpreted in a different way when the program is run in parallel in a multiprocessor environment. Since /bin/time accumulates the user time on different threads, only wall clock time is used.
Since the user time displayed includes the time spent on all the processors, it can be quite large and is not a good measure of performance. A better measure is the real time, which is the wall clock time. This also means that to get an accurate timing of a parallelized program you must run it on a quiet system dedicated to just your program.
The gprof(1) command provides a detailed postmortem analysis of program timing at the subprogram level, including how many times a subprogram was called, who called it, whom it called, and how much time was spent in the routine and by the routines it called.
The program must complete normally for gprof to obtain meaningful timing information.

At program termination, the file gmon.out is automatically written in the working directory. This file contains the profiling data that will be interpreted by gprof.

gprof produces a report on standard output. An example is shown on the next pages. Not only the routines in your program are listed, but also the library procedures and the routines they call.
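A typical sequence for producing a gprof report looks like the following (the program name myprog is illustrative):

```
demo% f77 -pg -o myprog myprog.f
demo% myprog
 ... output from program
demo% gprof myprog > myprog.gprof
```

The first command compiles and links with profiling enabled, the run writes gmon.out in the working directory on normal termination, and gprof formats the accumulated data into a report, redirected here to the file myprog.gprof.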
The report is mostly two profiles of how the total time is distributed across the program procedures: the call graph and the flat profile. They are preceded by an explanation of the column labels, followed by an index. (The gprof -b option eliminates the explanatory text; see the gprof(1) man page for other options that can be used to limit the amount of output generated.)
In the graph profile, each procedure (subprogram or function) is presented in a call-tree representation. The line representing a procedure in its call tree is called the function line, and is identified by an index number in the leftmost column, within square brackets; the lines above it are the parent lines; the lines below it, the descendant lines.
---------------------------------------------------------------
                    parent line      caller 1
                    parent line      caller 2
                    ....
[index] %time       function line    function name
                    descendant line  called 1
                    descendant line  called 2
                    ....
---------------------------------------------------------------
The call graph profile is followed by a flat profile that provides a routine-by-routine overview. An (edited) example of each follows.
Note: User-defined functions appear with their Fortran names followed by an underscore. Library routines appear with leading underscores.
The call graph profile:
granularity: each sample hit covers 2 byte(s) for 0.08% of 12.78 seconds

                                  called/total      parents
index  %time    self descendents  called+self    name
                                  called/total      children

                0.00     12.66       1/1           main
        99.1    0.00     12.66       1          MAIN_
                0.92     10.99    1000/1000        diffr_
                0.62      0.00    2000/2001        code_
                0.11      0.00    1000/1000        shock_
                0.02      0.00    1000/1000        bndry_
                0.00      0.00       1/1           init_
                0.00      0.00       2/2           output_
                0.00      0.00       1/1           input_
-----------------------------------------------
                0.92     10.99    1000/1000     MAIN_
        93.2    0.92     10.99    1000          diffr_
                1.11      4.52    3000/3000        deriv_
                1.29      2.91    3000/6000        cheb1_
                1.17      0.00    3000/3000        dissip_
-----------------------------------------------
                1.29      2.91    3000/6000     deriv_
                1.29      2.91    3000/6000     diffr_
        65.7    2.58      5.81    6000          cheb1_
                5.81      0.00    6000/6000        fftb_
                0.00      0.00     128/321         cos
                0.00      0.00     128/192         __sin
-----------------------------------------------
                5.81      0.00    6000/6000     cheb1_
        45.5    5.81      0.00    6000          fftb_
                0.00      0.00      64/321         cos
                0.00      0.00      64/192         __sin
-----------------------------------------------
...
The flat profile overview:
granularity: each sample hit covers 2 byte(s) for 0.08% of 12.84 seconds

   %   cumulative    self                self     total
 time    seconds   seconds     calls   ms/call   ms/call   name
 45.2       5.81      5.81      6000      0.97      0.97   fftb_
 20.1       8.39      2.58      6000      0.43      1.40   cheb1_
  9.1       9.56      1.17      3000      0.39      0.39   dissip_
  8.6      10.67      1.11      3000      0.37      1.88   deriv_
  7.1      11.58      0.92      1000      0.92     11.91   diffr_
  4.8      12.20      0.62      2001      0.31      0.31   code_
  2.5      12.53      0.33     69000      0.00      0.00   __exp
  0.9      12.64      0.11      1000      0.11      0.11   shock_
 ...
The function line for cheb1_ in the preceding example reveals that:
- cheb1_ was called 6000 times -- 3000 from deriv_, 3000 from diffr_
- 2.58 seconds were spent in cheb1_ itself
- 5.81 seconds were spent in routines called by cheb1_
- 65.7% of the execution time of the program was spent within cheb1_ and the routines it called
- Parent Lines. The parent lines above the function line indicate that cheb1_ was called from two routines, deriv_ and diffr_. The timings on these lines show how much time was spent in cheb1_ when it was called from each of these routines.
- Descendant Lines. The lines below the function line indicate the routines called from cheb1_: fftb_ and cos. The library sine function is called indirectly.
- Function names appear on the right. The profile is sorted by percentage of total execution time.
Overhead and Other Considerations
Profiling (compiling with the -pg option) may greatly increase the running time of a program. This is due to the extra overhead required to clock program performance and subprogram calls. Profiling tools like gprof attempt to subtract an approximate overhead factor when computing relative runtime percentages. All other timings shown may not be accurate due to UNIX and hardware timekeeping inaccuracies.
Programs with short execution times are the most difficult to profile because the overhead may be a significant fraction of the total execution time. The best practice is to choose input data for the profiling run that will result in a realistic test of the program's performance. If this is not possible, consider enclosing the main computational part of the program within a loop that effectively runs the program N times. Estimate actual performance by dividing the profile results by N.
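The repetition technique can be sketched as follows; here the loop count n and the routine compute are placeholders for an actual repetition count and the program's real computation:

```fortran
      program profn
c     Run the main computation n times so that profiling
c     overhead is a small fraction of the total runtime.
      integer n, i
      parameter (n = 100)
      do i = 1, n
         call compute
      end do
      end
```

Divide the times reported in the profile by n to estimate the actual single-run performance.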
The Fortran library includes two routines that return the total time used by the calling process. See the man pages for etime(3F) and dtime(3F).
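As a minimal sketch, assuming the etime(3F) interface (etime fills a two-element real array with the user and system components and returns their sum):

```fortran
      program cputime
      real tarray(2), total, etime
      external etime
      integer i
      double precision s
c     Some work to accumulate measurable CPU time
      s = 0.0d0
      do i = 1, 1000000
         s = s + i
      end do
c     etime reports the total time used so far by this process
      total = etime(tarray)
      print *, 'user   =', tarray(1)
      print *, 'system =', tarray(2)
      print *, 'total  =', total
      end
```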
In some situations, gprof can give misleading results. A well-known limitation is that gprof cannot differentiate time spent in a function called from more than one caller. For example, it may be that function FU takes much more time when called from routine BAR than from any other routine, and knowing this could suggest to you a significant restructuring of the program and better performance. Unfortunately, the results shown by gprof average the total time spent in FU over all calls, obscuring this valuable bit of information. If you intend to do serious performance analysis of a program, the Sun WorkShop Performance Analyzer provides much more detailed and useful information, and should be used instead.
The tcov(1) command, when used with programs compiled with the -a, -xa, or -xprofile=tcov options, produces a statement-by-statement profile of the source code showing which statements executed and how often. It also gives a summary of information about the basic block structure of the program.
There are two implementations of tcov coverage analysis. The original tcov is invoked by the -a and -xa compiler options. Enhanced statement-level coverage is invoked by the -xprofile=tcov compiler option and the tcov -x option. In either case, the output is a copy of the source files annotated with statement execution counts in the margin. Although these two versions of tcov are essentially the same as far as the Fortran user is concerned (most of the enhancements apply to C++ programs), there will be some performance improvement with the newer style.
Note: The code coverage report produced by tcov will be unreliable if the compiler has inlined calls to routines. The compiler inlines calls whenever appropriate at optimization levels above -O3, and according to the -inline option. With inlining, the compiler replaces a call to a routine with the actual code for the called routine. Since there is no call, references to those inlined routines will not be reported by tcov. Therefore, to get an accurate coverage report, do not enable compiler inlining.
Compile the program with the -a (or -xa) option. This produces the file file.d for each source file file.f in the compilation. (If the environment variable $TCOVDIR is not set at compile time, the .d files are stored in the current directory.)
Run the program (execution must complete normally). This produces updated information in the .d files. To view the coverage analysis merged with the individual source files, run tcov on the source files. The annotated source files are named file.tcov for each source file file.f.
The output produced by tcov shows the number of times each statement was actually executed. Statements that were not executed are marked with ####-> to the left of the statement.
Here is a simple example:
demo% f77 -a -o onetwo -silent one.f two.f
demo% onetwo
 ... output from program
demo% tcov one.f two.f
demo% cat one.tcov two.tcov
                program one
     1 ->       do i=1,10
    10 ->         call two(i)
                end do
     1 ->       end

Top 10 Blocks

     Line      Count
        3         10
        2          1
        5          1

        3  Basic blocks in this file
        3  Basic blocks executed
   100.00  Percent of the file executed
       12  Total basic block executions
     4.00  Average executions per basic block

                subroutine two(i)
    10 ->       print*, "two called", i
                return
                end

Top 10 Blocks

     Line      Count
        2         10

        1  Basic blocks in this file
        1  Basic blocks executed
   100.00  Percent of the file executed
       10  Total basic block executions
    10.00  Average executions per basic block
demo%
"New Style" Enhanced tcov
To use new style tcov, compile with -xprofile=tcov. When the program is run, coverage data is stored in program.profile/tcovd, where program is the name of the executable file. (If the executable were a.out, then a.out.profile/tcovd would be created.)
Run tcov -x dirname source_files to create the coverage analysis merged with each source file. For each source file, the report is written to the file source_file.tcov in the current directory.
Running a simple example:
demo% f77 -o onetwo -silent -xprofile=tcov one.f two.f
demo% onetwo
 ... output from program
demo% tcov -x onetwo.profile one.f two.f
demo% cat one.f.tcov two.f.tcov
                program one
     1 ->       do i=1,10
    10 ->         call two(i)
                end do
     1 ->       end
 .....etc
demo%
The environment variable $SUN_PROFDATA_DIR can be used to specify where the intermediary data collection files are kept. These are the .d and tcovd files created by old and new style tcov, respectively.
Each subsequent run accumulates more coverage data into the tcovd file. Data for each object file is zeroed out the first time the program is executed after the corresponding source file has been recompiled. Data for the entire program is zeroed out by removing the tcovd file.
The environment variables $SUN_PROFDATA and $SUN_PROFDATA_DIR can be used to separate the collected data from different runs. With these variables set, the running program writes execution data to the files in $SUN_PROFDATA_DIR/$SUN_PROFDATA/.

Similarly, the directory that tcov reads is specified by tcov -x $SUN_PROFDATA. If $SUN_PROFDATA_DIR is set, tcov will prepend it, looking for files in $SUN_PROFDATA_DIR/$SUN_PROFDATA/, and not in the working directory.
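For example, the data from a particular run can be kept in its own location by setting both variables before the run; the directory and file names below are illustrative, and the onetwo program is the one from the earlier example:

```
demo% setenv SUN_PROFDATA_DIR /tmp/coverage
demo% setenv SUN_PROFDATA run1.profile
demo% onetwo
demo% tcov -x run1.profile one.f two.f
```

With both variables set, the run writes its data under /tmp/coverage/run1.profile/, and tcov -x looks there rather than in the working directory.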
For the details, see the tcov(1) man page.
You can obtain a report about how much data was transferred by your program. For each Fortran unit, the report shows the file name, the number of I/O statements, the number of bytes, and some statistics on these items.
To obtain an I/O profiling report, insert calls to the library routines start_iostats and end_iostats around the parts of the program you wish to measure. (A call to end_iostats is required if the program terminates with a STOP statement rather than an END statement.)
Note: The I/O statements profiled are: READ, WRITE, PRINT, OPEN, CLOSE, INQUIRE, BACKSPACE, ENDFILE, and REWIND. The runtime system opens the standard units, such as stderr, before the first executable statement of your program, so you must explicitly reopen these units after the call to start_iostats to gather statistics on them.
If you want to measure only part of the program, call start_iostats where measurement should begin and end_iostats to stop the process. A call to end_iostats may also be required if your program terminates with a STOP statement rather than an END statement.
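A minimal sketch of instrumenting part of a program for I/O profiling (the unit number and the data written are illustrative):

```fortran
      program ioprof
      external start_iostats, end_iostats
      integer i
c     I/O statements between the two calls are profiled
      call start_iostats
      open(19, file='fort.19')
      do i = 1, 4
         write(19, *) i, i*i
      end do
      close(19)
      call end_iostats
      end
```

Compile with the -pg option and run the program; the report then appears in name.io_stats for the executable name.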
The program must be compiled with the -pg option. When the program terminates, the I/O profile report is written to the file name.io_stats, where name is the name of the executable file.
Here is an example:
demo% f77 -o myprog -pg -silent myprog.f
demo% myprog
 ... output from program
demo% cat myprog.io_stats

INPUT REPORT

1. unit  2. file name   3. input data                    4. map
            cnt    total    avg      std dev             (cnt)
------------------------------------------------------------------------
 0   stderr    0       0       0        0                No
               0       0       0        0
 5   stdin     2       8       4        0                No
               1       8       8        0
 6   stdout    0       0       0        0                No
               0       0       0        0
19   fort.19   8      48       6        4.276            No
               4      48      12        0
20   fort.20   8      48       6        4.276            No
               4      48      12        0
21   fort.21   8      48       6        4.276            No
               4      48      12        0
22   fort.22   8      48       6        4.276            No
               4      48      12        0

OUTPUT REPORT

1. unit  5. output data                  6. blk size   7. fmt   8. direct
            cnt    total    avg      std dev                    (rec len)
-----------------------------------------------------------------------------
 0          4      40      10        0       -1        Yes      seq
            1      40      40        0
 5          0       0       0        0       -1        Yes      seq
            0       0       0        0
 6         26     248       9.538    1.63    -1        Yes      seq
            6     248      41.33     3.266
19          8      48       6        4.276   500548    Yes      seq
            4      48      12        0
20          8      48       6        4.276   503116    No       seq
            4      48      12        0
21          8      48       6        4.276   503116    Yes      dir
            4      48      12        0   ( 12)
22          8      48       6        4.276   503116    No       dir
            4      48      12        0   ( 12)
...
Each pair of lines in the report displays information about an I/O unit. One section shows input operations and another shows output. The first line of a pair displays statistics on the number of data elements transferred before the unit was closed. The second line of the pair is based on the number of I/O statements processed.
In the example, there were 6 calls to write a total of 26 data elements to standard output. A total of 248 bytes was transferred. The display also shows the average and standard deviation in bytes transferred per data element (9.538 and 1.63, respectively), and per I/O statement call (41.33 and 3.266, respectively).
The input report also contains a column to indicate whether a unit was memory mapped or not. If mapped, the number of mmap() calls is recorded in parentheses in the second row of the pair.
The output report indicates block sizes, formatting, and access type. A file opened for direct access shows its defined record length in parentheses in the second row of the pair.
Note: Compiling with the environment variable LD_LIBRARY_PATH set might disable I/O profiling, which relies on its profiling I/O library being in a standard location.
Sun Microsystems, Inc.