Why is there a performance penalty for nested subroutines in Delphi? Unicorn Meta Zoo #1: Why another podcast? Announcing the arrival of Valued Associate #679: Cesar Manara Data science time! April 2019 and salary with experience The Ask Question Wizard is Live!Is it more optimized to use local function or global function?Delphi code completion performanceWhat is Causing This Memory Leak in Delphi?Delphi Interface Performance IssueIs the compiler treatment of implicit interface variables documented?how to define classes inside classes in delphi?Delphi XE custom build target is always disabledAlternative to nested for-loop in DelphiIs there a performance penalty in accessing the Windows API through Delphi?How to affect Delphi XEx code generation for Android/ARM targets?Are there any penalties for using generic types in Delphi?
Can you stand up from being prone using Skirmisher outside of your turn?
Married in secret, can marital status in passport be changed at a later date?
Co-worker works way more than he should
Function to calculate red-edgeNDVI in Google Earth Engine
Will I lose my paid in full property
Justification for leaving new position after a short time
Retract an already submitted recommendation letter (written for an undergrad student)
How to count in linear time worst-case?
Protagonist's race is hidden - should I reveal it?
Does the set of sets which are elements of every set exist?
How long after the last departure shall the airport stay open for an emergency return?
Multiple options vs single option UI
What is it called when you ride around on your front wheel?
Book with legacy programming code on a space ship that the main character hacks to escape
Passing args from the bash script to the function in the script
How do I check if a string is entirely made of the same substring?
Expansion//Explosion and Siren Stormtamer
Is accepting an invalid credit card number a security issue?
Second order approximation of the loss function (Deep learning book, 7.33)
Check if a string is entirely made of the same substring
What is ls Largest Number Formed by only moving two sticks in 508?
Does Mathematica have an implementation of the Poisson Binomial Distribution?
Can I criticise the more senior developers around me for not writing clean code?
What do you call the part of a novel that is not dialog?
Why is there a performance penalty for nested subroutines in Delphi?
Unicorn Meta Zoo #1: Why another podcast?
Announcing the arrival of Valued Associate #679: Cesar Manara
Data science time! April 2019 and salary with experience
The Ask Question Wizard is Live!Is it more optimized to use local function or global function?Delphi code completion performanceWhat is Causing This Memory Leak in Delphi?Delphi Interface Performance IssueIs the compiler treatment of implicit interface variables documented?how to define classes inside classes in delphi?Delphi XE custom build target is always disabledAlternative to nested for-loop in DelphiIs there a performance penalty in accessing the Windows API through Delphi?How to affect Delphi XEx code generation for Android/ARM targets?Are there any penalties for using generic types in Delphi?
.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty height:90px;width:728px;box-sizing:border-box;
A static analyzer we use has a report that says:
Subprograms with local subprograms (OPTI7)
This section lists subprograms that themselves have local subprograms.
Especially when these subprograms share local variables, it can have a
negative effect on performance.
This guide says:
Do not use nested routines Nested routines (routines within other
routines; also known as "local procedures") require some special stack
manipulation so that the variables of the outer routine can be seen by
the inner routine. This results in a good bit of overhead. Instead of
nesting, move the procedure to the unit scoping level and pass the
necessary variables - if necessary by reference (use the var keyword)
- or make the variable global at the unit scope.
We were interested in knowing if we should take this report into consideration when validating our code. The answers to this question suggest that one should profile one's application to see if there is any performance difference, but not much is said about the difference between nested routines and normal subroutines.
What is the actual difference between nested routines and normal routines and how may it cause a performance penalty?
delphi
|
show 2 more comments
A static analyzer we use has a report that says:
Subprograms with local subprograms (OPTI7)
This section lists subprograms that themselves have local subprograms.
Especially when these subprograms share local variables, it can have a
negative effect on performance.
This guide says:
Do not use nested routines Nested routines (routines within other
routines; also known as "local procedures") require some special stack
manipulation so that the variables of the outer routine can be seen by
the inner routine. This results in a good bit of overhead. Instead of
nesting, move the procedure to the unit scoping level and pass the
necessary variables - if necessary by reference (use the var keyword)
- or make the variable global at the unit scope.
We were interested in knowing if we should take this report into consideration when validating our code. The answers to this question suggest that one should profile one's application to see if there is any performance difference, but not much is said about the difference between nested routines and normal subroutines.
What is the actual difference between nested routines and normal routines and how may it cause a performance penalty?
delphi
Any code quality tool that encourages you to refactor into global variables should be treated with a severe degree of scepticism. With that said, it's good to see that the compiler, as expected, is not quite so incompetent as this would suggest.
– J...
9 hours ago
2
@J..., the bit about global variables is not from Peganza.
– Uli Gerhardt
8 hours ago
@UliGerhardt I see that now. I guess the second lesson is...don't trust code quality advice that was written almost two decades ago. Unless, of course, we're meaning to micro-optimize our code for an ancient compiler and the "new" Pentium II and its advanced core features 0_o
– J...
8 hours ago
Actually, what Robert Lee (he was some kind of optimization guru, these days) wrote was generally quite correct. His main objective was to raise the speed of code. So then you get advice like using global variables. They can/could be quite a bit faster, indeed. But indeed, on modern processors the differences are not that big anymore.
– Rudy Velthuis
6 hours ago
2
The difference are never that big, and these kinds of optimizations only count when you are invoking that code very often. In loops of a billion iterations, it could matter, but in a standard business application where you're reading and writing some stuff to files and databases, and now and then have a piece of logic with a nested procedure that is executed for each of the 100 rows you just queried, there is no way you're gonna spot the difference. Doesn't mean, though, that you shouldn't refactor them, if only for quality metrics like readability and re-usability.
– GolezTrol
6 hours ago
|
show 2 more comments
A static analyzer we use has a report that says:
Subprograms with local subprograms (OPTI7)
This section lists subprograms that themselves have local subprograms.
Especially when these subprograms share local variables, it can have a
negative effect on performance.
This guide says:
Do not use nested routines Nested routines (routines within other
routines; also known as "local procedures") require some special stack
manipulation so that the variables of the outer routine can be seen by
the inner routine. This results in a good bit of overhead. Instead of
nesting, move the procedure to the unit scoping level and pass the
necessary variables - if necessary by reference (use the var keyword)
- or make the variable global at the unit scope.
We were interested in knowing if we should take this report into consideration when validating our code. The answers to this question suggest that one should profile one's application to see if there is any performance difference, but not much is said about the difference between nested routines and normal subroutines.
What is the actual difference between nested routines and normal routines and how may it cause a performance penalty?
delphi
A static analyzer we use has a report that says:
Subprograms with local subprograms (OPTI7)
This section lists subprograms that themselves have local subprograms.
Especially when these subprograms share local variables, it can have a
negative effect on performance.
This guide says:
Do not use nested routines Nested routines (routines within other
routines; also known as "local procedures") require some special stack
manipulation so that the variables of the outer routine can be seen by
the inner routine. This results in a good bit of overhead. Instead of
nesting, move the procedure to the unit scoping level and pass the
necessary variables - if necessary by reference (use the var keyword)
- or make the variable global at the unit scope.
We were interested in knowing if we should take this report into consideration when validating our code. The answers to this question suggest that one should profile one's application to see if there is any performance difference, but not much is said about the difference between nested routines and normal subroutines.
What is the actual difference between nested routines and normal routines and how may it cause a performance penalty?
delphi
delphi
asked 9 hours ago
afarahafarah
477211
477211
Any code quality tool that encourages you to refactor into global variables should be treated with a severe degree of scepticism. With that said, it's good to see that the compiler, as expected, is not quite so incompetent as this would suggest.
– J...
9 hours ago
2
@J..., the bit about global variables is not from Peganza.
– Uli Gerhardt
8 hours ago
@UliGerhardt I see that now. I guess the second lesson is...don't trust code quality advice that was written almost two decades ago. Unless, of course, we're meaning to micro-optimize our code for an ancient compiler and the "new" Pentium II and its advanced core features 0_o
– J...
8 hours ago
Actually, what Robert Lee (he was some kind of optimization guru, these days) wrote was generally quite correct. His main objective was to raise the speed of code. So then you get advice like using global variables. They can/could be quite a bit faster, indeed. But indeed, on modern processors the differences are not that big anymore.
– Rudy Velthuis
6 hours ago
2
The difference are never that big, and these kinds of optimizations only count when you are invoking that code very often. In loops of a billion iterations, it could matter, but in a standard business application where you're reading and writing some stuff to files and databases, and now and then have a piece of logic with a nested procedure that is executed for each of the 100 rows you just queried, there is no way you're gonna spot the difference. Doesn't mean, though, that you shouldn't refactor them, if only for quality metrics like readability and re-usability.
– GolezTrol
6 hours ago
|
show 2 more comments
Any code quality tool that encourages you to refactor into global variables should be treated with a severe degree of scepticism. With that said, it's good to see that the compiler, as expected, is not quite so incompetent as this would suggest.
– J...
9 hours ago
2
@J..., the bit about global variables is not from Peganza.
– Uli Gerhardt
8 hours ago
@UliGerhardt I see that now. I guess the second lesson is...don't trust code quality advice that was written almost two decades ago. Unless, of course, we're meaning to micro-optimize our code for an ancient compiler and the "new" Pentium II and its advanced core features 0_o
– J...
8 hours ago
Actually, what Robert Lee (he was some kind of optimization guru, these days) wrote was generally quite correct. His main objective was to raise the speed of code. So then you get advice like using global variables. They can/could be quite a bit faster, indeed. But indeed, on modern processors the differences are not that big anymore.
– Rudy Velthuis
6 hours ago
2
The difference are never that big, and these kinds of optimizations only count when you are invoking that code very often. In loops of a billion iterations, it could matter, but in a standard business application where you're reading and writing some stuff to files and databases, and now and then have a piece of logic with a nested procedure that is executed for each of the 100 rows you just queried, there is no way you're gonna spot the difference. Doesn't mean, though, that you shouldn't refactor them, if only for quality metrics like readability and re-usability.
– GolezTrol
6 hours ago
Any code quality tool that encourages you to refactor into global variables should be treated with a severe degree of scepticism. With that said, it's good to see that the compiler, as expected, is not quite so incompetent as this would suggest.
– J...
9 hours ago
Any code quality tool that encourages you to refactor into global variables should be treated with a severe degree of scepticism. With that said, it's good to see that the compiler, as expected, is not quite so incompetent as this would suggest.
– J...
9 hours ago
2
2
@J..., the bit about global variables is not from Peganza.
– Uli Gerhardt
8 hours ago
@J..., the bit about global variables is not from Peganza.
– Uli Gerhardt
8 hours ago
@UliGerhardt I see that now. I guess the second lesson is...don't trust code quality advice that was written almost two decades ago. Unless, of course, we're meaning to micro-optimize our code for an ancient compiler and the "new" Pentium II and its advanced core features 0_o
– J...
8 hours ago
@UliGerhardt I see that now. I guess the second lesson is...don't trust code quality advice that was written almost two decades ago. Unless, of course, we're meaning to micro-optimize our code for an ancient compiler and the "new" Pentium II and its advanced core features 0_o
– J...
8 hours ago
Actually, what Robert Lee (he was some kind of optimization guru, these days) wrote was generally quite correct. His main objective was to raise the speed of code. So then you get advice like using global variables. They can/could be quite a bit faster, indeed. But indeed, on modern processors the differences are not that big anymore.
– Rudy Velthuis
6 hours ago
Actually, what Robert Lee (he was some kind of optimization guru, these days) wrote was generally quite correct. His main objective was to raise the speed of code. So then you get advice like using global variables. They can/could be quite a bit faster, indeed. But indeed, on modern processors the differences are not that big anymore.
– Rudy Velthuis
6 hours ago
2
2
The difference are never that big, and these kinds of optimizations only count when you are invoking that code very often. In loops of a billion iterations, it could matter, but in a standard business application where you're reading and writing some stuff to files and databases, and now and then have a piece of logic with a nested procedure that is executed for each of the 100 rows you just queried, there is no way you're gonna spot the difference. Doesn't mean, though, that you shouldn't refactor them, if only for quality metrics like readability and re-usability.
– GolezTrol
6 hours ago
The difference are never that big, and these kinds of optimizations only count when you are invoking that code very often. In loops of a billion iterations, it could matter, but in a standard business application where you're reading and writing some stuff to files and databases, and now and then have a piece of logic with a nested procedure that is executed for each of the 100 rows you just queried, there is no way you're gonna spot the difference. Doesn't mean, though, that you shouldn't refactor them, if only for quality metrics like readability and re-usability.
– GolezTrol
6 hours ago
|
show 2 more comments
1 Answer
1
active
oldest
votes
tl;dr
- There are extra
push
/pop
s for nested subroutines - Turning on optimizations may strip those away, such that the generated code is the same for both nested subroutines and normal subroutines
- Inlining results in the same code being generated for both nested and normal subroutines
- For simple routines with few parameters and local variables we perceived no performance difference even with optimizations turned off
I wrote a little test to determine this, where GetRTClock
is measuring the current time with a precision of 1ns:
function subprogram_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
function subprogram_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := subprogram_aux(n, z);
Result := GetRTClock - s;
end;
function normal_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
function normal_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := normal_aux(n, z);
Result := GetRTClock - s;
end;
This compiles to:
subprogram_main
MyFormU.pas.41: begin
005CE7D0 55 push ebp
005CE7D1 8BEC mov ebp,esp
005CE7D3 83C4E0 add esp,-$20
005CE7D6 8945FC mov [ebp-$04],eax
MyFormU.pas.42: s := GetRTClock;
...
MyFormU.pas.45: n := z div 100 * 100 + 100;
...
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7F8 55 push ebp
005CE7F9 8B55FC mov edx,[ebp-$04]
005CE7FC 8B45EC mov eax,[ebp-$14]
005CE7FF E880FFFFFF call subprogram_aux
005CE804 59 pop ecx
005CE805 8945FC mov [ebp-$04],eax
MyFormU.pas.49: Result := GetRTClock - s;
...
normal_main
MyFormU.pas.70: begin
005CE870 55 push ebp
005CE871 8BEC mov ebp,esp
005CE873 83C4E0 add esp,-$20
005CE876 8945FC mov [ebp-$04],eax
MyFormU.pas.71: s := GetRTClock;
...
MyFormU.pas.74: n := z div 100 * 100 + 100;
...
MyFormU.pas.76: z := normal_aux(n, z);
005CE898 8B55FC mov edx,[ebp-$04]
005CE89B 8B45EC mov eax,[ebp-$14]
005CE89E E881FFFFFF call normal_aux
005CE8A3 8945FC mov [ebp-$04],eax
MyFormU.pas.78: Result := GetRTClock - s;
...
subprogram_aux:
MyFormU.pas.31: begin
005CE784 55 push ebp
005CE785 8BEC mov ebp,esp
005CE787 83C4EC add esp,-$14
005CE78A 8955F8 mov [ebp-$08],edx
005CE78D 8945FC mov [ebp-$04],eax
MyFormU.pas.33: for i := 0 to n - 1 do begin
005CE790 8B45FC mov eax,[ebp-$04]
005CE793 48 dec eax
005CE794 85C0 test eax,eax
005CE796 7C29 jl $005ce7c1
005CE798 40 inc eax
005CE799 8945EC mov [ebp-$14],eax
005CE79C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.34: if (i > z) then
005CE7A3 8B45F0 mov eax,[ebp-$10]
005CE7A6 3B45F8 cmp eax,[ebp-$08]
005CE7A9 7E08 jle $005ce7b3
MyFormU.pas.35: z := z + i
005CE7AB 8B45F0 mov eax,[ebp-$10]
005CE7AE 0145F8 add [ebp-$08],eax
005CE7B1 EB06 jmp $005ce7b9
MyFormU.pas.37: z := z - i;
005CE7B3 8B45F0 mov eax,[ebp-$10]
005CE7B6 2945F8 sub [ebp-$08],eax
normal_aux:
MyFormU.pas.55: begin
005CE824 55 push ebp
005CE825 8BEC mov ebp,esp
005CE827 83C4EC add esp,-$14
005CE82A 8955F8 mov [ebp-$08],edx
005CE82D 8945FC mov [ebp-$04],eax
MyFormU.pas.57: for i := 0 to n - 1 do begin
005CE830 8B45FC mov eax,[ebp-$04]
005CE833 48 dec eax
005CE834 85C0 test eax,eax
005CE836 7C29 jl $005ce861
005CE838 40 inc eax
005CE839 8945EC mov [ebp-$14],eax
005CE83C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.58: if (i > z) then
005CE843 8B45F0 mov eax,[ebp-$10]
005CE846 3B45F8 cmp eax,[ebp-$08]
005CE849 7E08 jle $005ce853
MyFormU.pas.59: z := z + i
005CE84B 8B45F0 mov eax,[ebp-$10]
005CE84E 0145F8 add [ebp-$08],eax
005CE851 EB06 jmp $005ce859
MyFormU.pas.61: z := z - i;
005CE853 8B45F0 mov eax,[ebp-$10]
005CE856 2945F8 sub [ebp-$08],eax
The only difference is one push
and one pop
. What happens if we turn on optimizations?
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7C5 8BD3 mov edx,ebx
005CE7C7 8BC6 mov eax,esi
005CE7C9 E8B6FFFFFF call subprogram_aux
MyFormU.pas.76: z := normal_aux(n, z);
005CE82D 8BD3 mov edx,ebx
005CE82F 8BC6 mov eax,esi
005CE831 E8B6FFFFFF call normal_aux
Both compile exactly to the same thing.
What happens when inlining?
MyFormU.pas.76: z := normal_aux(n, z);
005CE804 8BD3 mov edx,ebx
005CE806 8BC8 mov ecx,eax
005CE808 49 dec ecx
005CE809 85C9 test ecx,ecx
005CE80B 7C11 jl $005ce81e
005CE80D 41 inc ecx
005CE80E 33C0 xor eax,eax
005CE810 3BD0 cmp edx,eax
005CE812 7D04 jnl $005ce818
005CE814 03D0 add edx,eax
005CE816 EB02 jmp $005ce81a
005CE818 2BD0 sub edx,eax
005CE81A 40 inc eax
005CE81B 49 dec ecx
005CE81C 75F2 jnz $005ce810
subprogram_main:
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7A8 8BD3 mov edx,ebx
005CE7AA 8BC8 mov ecx,eax
005CE7AC 49 dec ecx
005CE7AD 85C9 test ecx,ecx
005CE7AF 7C11 jl $005ce7c2
005CE7B1 41 inc ecx
005CE7B2 33C0 xor eax,eax
005CE7B4 3BD0 cmp edx,eax
005CE7B6 7D04 jnl $005ce7bc
005CE7B8 03D0 add edx,eax
005CE7BA EB02 jmp $005ce7be
005CE7BC 2BD0 sub edx,eax
005CE7BE 40 inc eax
005CE7BF 49 dec ecx
005CE7C0 75F2 jnz $005ce7b4
Again, no difference.
I also profiled this little example, taking an average of 30 executions for each (normal and subprogram), called in random order:
constructor TForm1.Create(AOwner: TComponent);
const
c_nSamples = 60;
rnd_sample : array[0..c_nSamples - 1] of byte = (1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0);
var
subprogram_gt_ns : Int64;
normal_gt_ns : Int64;
rnd_input : Integer;
i : Integer;
begin
inherited Create(AOwner);
normal_gt_ns := 0;
subprogram_gt_ns := 0;
rnd_input := Random(1000);
for i := 0 to c_nSamples - 1 do
if (rnd_sample[i] = 1) then
Inc(subprogram_gt_ns, subprogram_main(rnd_input))
else
Inc(normal_gt_ns, normal_main(rnd_input));
OutputDebugString(PChar(' Normal ' + FloatToStr(normal_gt_ns / 30) + ' Subprogram ' + FloatToStr(subprogram_gt_ns / 30)));
end;
There is no significant difference even with optimizations turned off:
Debug Output: Normal 1166,66666666667 Subprogram 1203,33333333333 Process MyProject.exe (1824)
Finally, both texts that warn about performance mention something about shared local variables.
If we do not pass z
to subprogram_aux
, instead access it directly, we get:
MyFormU.pas.47: z := subprogram_aux(n);
005CE7D2 55 push ebp
005CE7D3 8BC3 mov eax,ebx
005CE7D5 E8AAFFFFFF call subprogram_aux
005CE7DA 59 pop ecx
005CE7DB 8945FC mov [ebp-$04],eax
Even with optimizations turned on.
2
Would be interesting to see what the other compilers do with this, e.g. Win64 compiler
– David Heffernan
9 hours ago
1
Would also be interesting to see your RTClock (or RTC lock?). Does that require a specific piece of hardware, or is it available for everyone?
– Rudy Velthuis
2 hours ago
@RudyVelthuis It's just a wrapper to Windows'QueryPerformanceCounter
, I was told the resolution is 1ns but the documentation only says "<1us".
– afarah
1 hour ago
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55832314%2fwhy-is-there-a-performance-penalty-for-nested-subroutines-in-delphi%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
tl;dr
- There are extra
push
/pop
s for nested subroutines - Turning on optimizations may strip those away, such that the generated code is the same for both nested subroutines and normal subroutines
- Inlining results in the same code being generated for both nested and normal subroutines
- For simple routines with few parameters and local variables we perceived no performance difference even with optimizations turned off
I wrote a little test to determine this, where GetRTClock
is measuring the current time with a precision of 1ns:
function subprogram_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
function subprogram_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := subprogram_aux(n, z);
Result := GetRTClock - s;
end;
function normal_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
function normal_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := normal_aux(n, z);
Result := GetRTClock - s;
end;
This compiles to:
subprogram_main
MyFormU.pas.41: begin
005CE7D0 55 push ebp
005CE7D1 8BEC mov ebp,esp
005CE7D3 83C4E0 add esp,-$20
005CE7D6 8945FC mov [ebp-$04],eax
MyFormU.pas.42: s := GetRTClock;
...
MyFormU.pas.45: n := z div 100 * 100 + 100;
...
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7F8 55 push ebp
005CE7F9 8B55FC mov edx,[ebp-$04]
005CE7FC 8B45EC mov eax,[ebp-$14]
005CE7FF E880FFFFFF call subprogram_aux
005CE804 59 pop ecx
005CE805 8945FC mov [ebp-$04],eax
MyFormU.pas.49: Result := GetRTClock - s;
...
normal_main
MyFormU.pas.70: begin
005CE870 55 push ebp
005CE871 8BEC mov ebp,esp
005CE873 83C4E0 add esp,-$20
005CE876 8945FC mov [ebp-$04],eax
MyFormU.pas.71: s := GetRTClock;
...
MyFormU.pas.74: n := z div 100 * 100 + 100;
...
MyFormU.pas.76: z := normal_aux(n, z);
005CE898 8B55FC mov edx,[ebp-$04]
005CE89B 8B45EC mov eax,[ebp-$14]
005CE89E E881FFFFFF call normal_aux
005CE8A3 8945FC mov [ebp-$04],eax
MyFormU.pas.78: Result := GetRTClock - s;
...
subprogram_aux:
MyFormU.pas.31: begin
005CE784 55 push ebp
005CE785 8BEC mov ebp,esp
005CE787 83C4EC add esp,-$14
005CE78A 8955F8 mov [ebp-$08],edx
005CE78D 8945FC mov [ebp-$04],eax
MyFormU.pas.33: for i := 0 to n - 1 do begin
005CE790 8B45FC mov eax,[ebp-$04]
005CE793 48 dec eax
005CE794 85C0 test eax,eax
005CE796 7C29 jl $005ce7c1
005CE798 40 inc eax
005CE799 8945EC mov [ebp-$14],eax
005CE79C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.34: if (i > z) then
005CE7A3 8B45F0 mov eax,[ebp-$10]
005CE7A6 3B45F8 cmp eax,[ebp-$08]
005CE7A9 7E08 jle $005ce7b3
MyFormU.pas.35: z := z + i
005CE7AB 8B45F0 mov eax,[ebp-$10]
005CE7AE 0145F8 add [ebp-$08],eax
005CE7B1 EB06 jmp $005ce7b9
MyFormU.pas.37: z := z - i;
005CE7B3 8B45F0 mov eax,[ebp-$10]
005CE7B6 2945F8 sub [ebp-$08],eax
normal_aux:
MyFormU.pas.55: begin
005CE824 55 push ebp
005CE825 8BEC mov ebp,esp
005CE827 83C4EC add esp,-$14
005CE82A 8955F8 mov [ebp-$08],edx
005CE82D 8945FC mov [ebp-$04],eax
MyFormU.pas.57: for i := 0 to n - 1 do begin
005CE830 8B45FC mov eax,[ebp-$04]
005CE833 48 dec eax
005CE834 85C0 test eax,eax
005CE836 7C29 jl $005ce861
005CE838 40 inc eax
005CE839 8945EC mov [ebp-$14],eax
005CE83C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.58: if (i > z) then
005CE843 8B45F0 mov eax,[ebp-$10]
005CE846 3B45F8 cmp eax,[ebp-$08]
005CE849 7E08 jle $005ce853
MyFormU.pas.59: z := z + i
005CE84B 8B45F0 mov eax,[ebp-$10]
005CE84E 0145F8 add [ebp-$08],eax
005CE851 EB06 jmp $005ce859
MyFormU.pas.61: z := z - i;
005CE853 8B45F0 mov eax,[ebp-$10]
005CE856 2945F8 sub [ebp-$08],eax
The only difference is one push
and one pop
. What happens if we turn on optimizations?
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7C5 8BD3 mov edx,ebx
005CE7C7 8BC6 mov eax,esi
005CE7C9 E8B6FFFFFF call subprogram_aux
MyFormU.pas.76: z := normal_aux(n, z);
005CE82D 8BD3 mov edx,ebx
005CE82F 8BC6 mov eax,esi
005CE831 E8B6FFFFFF call normal_aux
Both compile exactly to the same thing.
What happens when inlining?
MyFormU.pas.76: z := normal_aux(n, z);
005CE804 8BD3 mov edx,ebx
005CE806 8BC8 mov ecx,eax
005CE808 49 dec ecx
005CE809 85C9 test ecx,ecx
005CE80B 7C11 jl $005ce81e
005CE80D 41 inc ecx
005CE80E 33C0 xor eax,eax
005CE810 3BD0 cmp edx,eax
005CE812 7D04 jnl $005ce818
005CE814 03D0 add edx,eax
005CE816 EB02 jmp $005ce81a
005CE818 2BD0 sub edx,eax
005CE81A 40 inc eax
005CE81B 49 dec ecx
005CE81C 75F2 jnz $005ce810
subprogram_main:
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7A8 8BD3 mov edx,ebx
005CE7AA 8BC8 mov ecx,eax
005CE7AC 49 dec ecx
005CE7AD 85C9 test ecx,ecx
005CE7AF 7C11 jl $005ce7c2
005CE7B1 41 inc ecx
005CE7B2 33C0 xor eax,eax
005CE7B4 3BD0 cmp edx,eax
005CE7B6 7D04 jnl $005ce7bc
005CE7B8 03D0 add edx,eax
005CE7BA EB02 jmp $005ce7be
005CE7BC 2BD0 sub edx,eax
005CE7BE 40 inc eax
005CE7BF 49 dec ecx
005CE7C0 75F2 jnz $005ce7b4
Again, no difference.
I also profiled this little example, taking an average of 30 executions for each (normal and subprogram), called in random order:
constructor TForm1.Create(AOwner: TComponent);
const
c_nSamples = 60;
rnd_sample : array[0..c_nSamples - 1] of byte = (1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0);
var
subprogram_gt_ns : Int64;
normal_gt_ns : Int64;
rnd_input : Integer;
i : Integer;
begin
inherited Create(AOwner);
normal_gt_ns := 0;
subprogram_gt_ns := 0;
rnd_input := Random(1000);
for i := 0 to c_nSamples - 1 do
if (rnd_sample[i] = 1) then
Inc(subprogram_gt_ns, subprogram_main(rnd_input))
else
Inc(normal_gt_ns, normal_main(rnd_input));
OutputDebugString(PChar(' Normal ' + FloatToStr(normal_gt_ns / 30) + ' Subprogram ' + FloatToStr(subprogram_gt_ns / 30)));
end;
There is no significant difference even with optimizations turned off:
Debug Output: Normal 1166,66666666667 Subprogram 1203,33333333333 Process MyProject.exe (1824)
Finally, both texts that warn about performance mention something about shared local variables.
If we do not pass z
to subprogram_aux
, instead access it directly, we get:
MyFormU.pas.47: z := subprogram_aux(n);
005CE7D2 55 push ebp
005CE7D3 8BC3 mov eax,ebx
005CE7D5 E8AAFFFFFF call subprogram_aux
005CE7DA 59 pop ecx
005CE7DB 8945FC mov [ebp-$04],eax
Even with optimizations turned on.
2
Would be interesting to see what the other compilers do with this, e.g. Win64 compiler
– David Heffernan
9 hours ago
1
Would also be interesting to see your RTClock (or RTC lock?). Does that require a specific piece of hardware, or is it available for everyone?
– Rudy Velthuis
2 hours ago
@RudyVelthuis It's just a wrapper to Windows'QueryPerformanceCounter
, I was told the resolution is 1ns but the documentation only says "<1us".
– afarah
1 hour ago
add a comment |
tl;dr
- There are extra
push
/pop
s for nested subroutines - Turning on optimizations may strip those away, such that the generated code is the same for both nested subroutines and normal subroutines
- Inlining results in the same code being generated for both nested and normal subroutines
- For simple routines with few parameters and local variables we perceived no performance difference even with optimizations turned off
I wrote a little test to determine this, where GetRTClock
is measuring the current time with a precision of 1ns:
function subprogram_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
function subprogram_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := subprogram_aux(n, z);
Result := GetRTClock - s;
end;
function normal_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
function normal_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := normal_aux(n, z);
Result := GetRTClock - s;
end;
This compiles to:
subprogram_main
MyFormU.pas.41: begin
005CE7D0 55 push ebp
005CE7D1 8BEC mov ebp,esp
005CE7D3 83C4E0 add esp,-$20
005CE7D6 8945FC mov [ebp-$04],eax
MyFormU.pas.42: s := GetRTClock;
...
MyFormU.pas.45: n := z div 100 * 100 + 100;
...
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7F8 55 push ebp
005CE7F9 8B55FC mov edx,[ebp-$04]
005CE7FC 8B45EC mov eax,[ebp-$14]
005CE7FF E880FFFFFF call subprogram_aux
005CE804 59 pop ecx
005CE805 8945FC mov [ebp-$04],eax
MyFormU.pas.49: Result := GetRTClock - s;
...
normal_main
MyFormU.pas.70: begin
005CE870 55 push ebp
005CE871 8BEC mov ebp,esp
005CE873 83C4E0 add esp,-$20
005CE876 8945FC mov [ebp-$04],eax
MyFormU.pas.71: s := GetRTClock;
...
MyFormU.pas.74: n := z div 100 * 100 + 100;
...
MyFormU.pas.76: z := normal_aux(n, z);
005CE898 8B55FC mov edx,[ebp-$04]
005CE89B 8B45EC mov eax,[ebp-$14]
005CE89E E881FFFFFF call normal_aux
005CE8A3 8945FC mov [ebp-$04],eax
MyFormU.pas.78: Result := GetRTClock - s;
...
subprogram_aux:
MyFormU.pas.31: begin
005CE784 55 push ebp
005CE785 8BEC mov ebp,esp
005CE787 83C4EC add esp,-$14
005CE78A 8955F8 mov [ebp-$08],edx
005CE78D 8945FC mov [ebp-$04],eax
MyFormU.pas.33: for i := 0 to n - 1 do begin
005CE790 8B45FC mov eax,[ebp-$04]
005CE793 48 dec eax
005CE794 85C0 test eax,eax
005CE796 7C29 jl $005ce7c1
005CE798 40 inc eax
005CE799 8945EC mov [ebp-$14],eax
005CE79C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.34: if (i > z) then
005CE7A3 8B45F0 mov eax,[ebp-$10]
005CE7A6 3B45F8 cmp eax,[ebp-$08]
005CE7A9 7E08 jle $005ce7b3
MyFormU.pas.35: z := z + i
005CE7AB 8B45F0 mov eax,[ebp-$10]
005CE7AE 0145F8 add [ebp-$08],eax
005CE7B1 EB06 jmp $005ce7b9
MyFormU.pas.37: z := z - i;
005CE7B3 8B45F0 mov eax,[ebp-$10]
005CE7B6 2945F8 sub [ebp-$08],eax
normal_aux:
MyFormU.pas.55: begin
005CE824 55 push ebp
005CE825 8BEC mov ebp,esp
005CE827 83C4EC add esp,-$14
005CE82A 8955F8 mov [ebp-$08],edx
005CE82D 8945FC mov [ebp-$04],eax
MyFormU.pas.57: for i := 0 to n - 1 do begin
005CE830 8B45FC mov eax,[ebp-$04]
005CE833 48 dec eax
005CE834 85C0 test eax,eax
005CE836 7C29 jl $005ce861
005CE838 40 inc eax
005CE839 8945EC mov [ebp-$14],eax
005CE83C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.58: if (i > z) then
005CE843 8B45F0 mov eax,[ebp-$10]
005CE846 3B45F8 cmp eax,[ebp-$08]
005CE849 7E08 jle $005ce853
MyFormU.pas.59: z := z + i
005CE84B 8B45F0 mov eax,[ebp-$10]
005CE84E 0145F8 add [ebp-$08],eax
005CE851 EB06 jmp $005ce859
MyFormU.pas.61: z := z - i;
005CE853 8B45F0 mov eax,[ebp-$10]
005CE856 2945F8 sub [ebp-$08],eax
The only difference is one push
and one pop
. What happens if we turn on optimizations?
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7C5 8BD3 mov edx,ebx
005CE7C7 8BC6 mov eax,esi
005CE7C9 E8B6FFFFFF call subprogram_aux
MyFormU.pas.76: z := normal_aux(n, z);
005CE82D 8BD3 mov edx,ebx
005CE82F 8BC6 mov eax,esi
005CE831 E8B6FFFFFF call normal_aux
Both compile exactly to the same thing.
What happens when inlining?
MyFormU.pas.76: z := normal_aux(n, z);
005CE804 8BD3 mov edx,ebx
005CE806 8BC8 mov ecx,eax
005CE808 49 dec ecx
005CE809 85C9 test ecx,ecx
005CE80B 7C11 jl $005ce81e
005CE80D 41 inc ecx
005CE80E 33C0 xor eax,eax
005CE810 3BD0 cmp edx,eax
005CE812 7D04 jnl $005ce818
005CE814 03D0 add edx,eax
005CE816 EB02 jmp $005ce81a
005CE818 2BD0 sub edx,eax
005CE81A 40 inc eax
005CE81B 49 dec ecx
005CE81C 75F2 jnz $005ce810
subprogram_main:
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7A8 8BD3 mov edx,ebx
005CE7AA 8BC8 mov ecx,eax
005CE7AC 49 dec ecx
005CE7AD 85C9 test ecx,ecx
005CE7AF 7C11 jl $005ce7c2
005CE7B1 41 inc ecx
005CE7B2 33C0 xor eax,eax
005CE7B4 3BD0 cmp edx,eax
005CE7B6 7D04 jnl $005ce7bc
005CE7B8 03D0 add edx,eax
005CE7BA EB02 jmp $005ce7be
005CE7BC 2BD0 sub edx,eax
005CE7BE 40 inc eax
005CE7BF 49 dec ecx
005CE7C0 75F2 jnz $005ce7b4
Again, no difference.
I also profiled this little example, taking an average of 30 executions for each (normal and subprogram), called in random order:
constructor TForm1.Create(AOwner: TComponent);
const
c_nSamples = 60;
rnd_sample : array[0..c_nSamples - 1] of byte = (1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0);
var
subprogram_gt_ns : Int64;
normal_gt_ns : Int64;
rnd_input : Integer;
i : Integer;
begin
inherited Create(AOwner);
normal_gt_ns := 0;
subprogram_gt_ns := 0;
rnd_input := Random(1000);
for i := 0 to c_nSamples - 1 do
if (rnd_sample[i] = 1) then
Inc(subprogram_gt_ns, subprogram_main(rnd_input))
else
Inc(normal_gt_ns, normal_main(rnd_input));
OutputDebugString(PChar(' Normal ' + FloatToStr(normal_gt_ns / 30) + ' Subprogram ' + FloatToStr(subprogram_gt_ns / 30)));
end;
There is no significant difference even with optimizations turned off:
Debug Output: Normal 1166,66666666667 Subprogram 1203,33333333333 Process MyProject.exe (1824)
Finally, both texts that warn about performance mention something about shared local variables.
If we do not pass z
to subprogram_aux
, instead access it directly, we get:
MyFormU.pas.47: z := subprogram_aux(n);
005CE7D2 55 push ebp
005CE7D3 8BC3 mov eax,ebx
005CE7D5 E8AAFFFFFF call subprogram_aux
005CE7DA 59 pop ecx
005CE7DB 8945FC mov [ebp-$04],eax
Even with optimizations turned on.
2
Would be interesting to see what the other compilers do with this, e.g. Win64 compiler
– David Heffernan
9 hours ago
1
Would also be interesting to see your RTClock (or RTC lock?). Does that require a specific piece of hardware, or is it available for everyone?
– Rudy Velthuis
2 hours ago
@RudyVelthuis It's just a wrapper to Windows'QueryPerformanceCounter
, I was told the resolution is 1ns but the documentation only says "<1us".
– afarah
1 hour ago
add a comment |
tl;dr
- There are extra
push
/pop
s for nested subroutines - Turning on optimizations may strip those away, such that the generated code is the same for both nested subroutines and normal subroutines
- Inlining results in the same code being generated for both nested and normal subroutines
- For simple routines with few parameters and local variables we perceived no performance difference even with optimizations turned off
I wrote a little test to determine this, where GetRTClock
is measuring the current time with a precision of 1ns:
function subprogram_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
function subprogram_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := subprogram_aux(n, z);
Result := GetRTClock - s;
end;
function normal_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
function normal_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := normal_aux(n, z);
Result := GetRTClock - s;
end;
This compiles to:
subprogram_main
MyFormU.pas.41: begin
005CE7D0 55 push ebp
005CE7D1 8BEC mov ebp,esp
005CE7D3 83C4E0 add esp,-$20
005CE7D6 8945FC mov [ebp-$04],eax
MyFormU.pas.42: s := GetRTClock;
...
MyFormU.pas.45: n := z div 100 * 100 + 100;
...
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7F8 55 push ebp
005CE7F9 8B55FC mov edx,[ebp-$04]
005CE7FC 8B45EC mov eax,[ebp-$14]
005CE7FF E880FFFFFF call subprogram_aux
005CE804 59 pop ecx
005CE805 8945FC mov [ebp-$04],eax
MyFormU.pas.49: Result := GetRTClock - s;
...
normal_main
MyFormU.pas.70: begin
005CE870 55 push ebp
005CE871 8BEC mov ebp,esp
005CE873 83C4E0 add esp,-$20
005CE876 8945FC mov [ebp-$04],eax
MyFormU.pas.71: s := GetRTClock;
...
MyFormU.pas.74: n := z div 100 * 100 + 100;
...
MyFormU.pas.76: z := normal_aux(n, z);
005CE898 8B55FC mov edx,[ebp-$04]
005CE89B 8B45EC mov eax,[ebp-$14]
005CE89E E881FFFFFF call normal_aux
005CE8A3 8945FC mov [ebp-$04],eax
MyFormU.pas.78: Result := GetRTClock - s;
...
subprogram_aux:
MyFormU.pas.31: begin
005CE784 55 push ebp
005CE785 8BEC mov ebp,esp
005CE787 83C4EC add esp,-$14
005CE78A 8955F8 mov [ebp-$08],edx
005CE78D 8945FC mov [ebp-$04],eax
MyFormU.pas.33: for i := 0 to n - 1 do begin
005CE790 8B45FC mov eax,[ebp-$04]
005CE793 48 dec eax
005CE794 85C0 test eax,eax
005CE796 7C29 jl $005ce7c1
005CE798 40 inc eax
005CE799 8945EC mov [ebp-$14],eax
005CE79C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.34: if (i > z) then
005CE7A3 8B45F0 mov eax,[ebp-$10]
005CE7A6 3B45F8 cmp eax,[ebp-$08]
005CE7A9 7E08 jle $005ce7b3
MyFormU.pas.35: z := z + i
005CE7AB 8B45F0 mov eax,[ebp-$10]
005CE7AE 0145F8 add [ebp-$08],eax
005CE7B1 EB06 jmp $005ce7b9
MyFormU.pas.37: z := z - i;
005CE7B3 8B45F0 mov eax,[ebp-$10]
005CE7B6 2945F8 sub [ebp-$08],eax
normal_aux:
MyFormU.pas.55: begin
005CE824 55 push ebp
005CE825 8BEC mov ebp,esp
005CE827 83C4EC add esp,-$14
005CE82A 8955F8 mov [ebp-$08],edx
005CE82D 8945FC mov [ebp-$04],eax
MyFormU.pas.57: for i := 0 to n - 1 do begin
005CE830 8B45FC mov eax,[ebp-$04]
005CE833 48 dec eax
005CE834 85C0 test eax,eax
005CE836 7C29 jl $005ce861
005CE838 40 inc eax
005CE839 8945EC mov [ebp-$14],eax
005CE83C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.58: if (i > z) then
005CE843 8B45F0 mov eax,[ebp-$10]
005CE846 3B45F8 cmp eax,[ebp-$08]
005CE849 7E08 jle $005ce853
MyFormU.pas.59: z := z + i
005CE84B 8B45F0 mov eax,[ebp-$10]
005CE84E 0145F8 add [ebp-$08],eax
005CE851 EB06 jmp $005ce859
MyFormU.pas.61: z := z - i;
005CE853 8B45F0 mov eax,[ebp-$10]
005CE856 2945F8 sub [ebp-$08],eax
The only difference is one push
and one pop
. What happens if we turn on optimizations?
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7C5 8BD3 mov edx,ebx
005CE7C7 8BC6 mov eax,esi
005CE7C9 E8B6FFFFFF call subprogram_aux
MyFormU.pas.76: z := normal_aux(n, z);
005CE82D 8BD3 mov edx,ebx
005CE82F 8BC6 mov eax,esi
005CE831 E8B6FFFFFF call normal_aux
Both compile exactly to the same thing.
What happens when inlining?
MyFormU.pas.76: z := normal_aux(n, z);
005CE804 8BD3 mov edx,ebx
005CE806 8BC8 mov ecx,eax
005CE808 49 dec ecx
005CE809 85C9 test ecx,ecx
005CE80B 7C11 jl $005ce81e
005CE80D 41 inc ecx
005CE80E 33C0 xor eax,eax
005CE810 3BD0 cmp edx,eax
005CE812 7D04 jnl $005ce818
005CE814 03D0 add edx,eax
005CE816 EB02 jmp $005ce81a
005CE818 2BD0 sub edx,eax
005CE81A 40 inc eax
005CE81B 49 dec ecx
005CE81C 75F2 jnz $005ce810
subprogram_main:
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7A8 8BD3 mov edx,ebx
005CE7AA 8BC8 mov ecx,eax
005CE7AC 49 dec ecx
005CE7AD 85C9 test ecx,ecx
005CE7AF 7C11 jl $005ce7c2
005CE7B1 41 inc ecx
005CE7B2 33C0 xor eax,eax
005CE7B4 3BD0 cmp edx,eax
005CE7B6 7D04 jnl $005ce7bc
005CE7B8 03D0 add edx,eax
005CE7BA EB02 jmp $005ce7be
005CE7BC 2BD0 sub edx,eax
005CE7BE 40 inc eax
005CE7BF 49 dec ecx
005CE7C0 75F2 jnz $005ce7b4
Again, no difference.
I also profiled this little example, taking an average of 30 executions for each (normal and subprogram), called in random order:
constructor TForm1.Create(AOwner: TComponent);
const
c_nSamples = 60;
rnd_sample : array[0..c_nSamples - 1] of byte = (1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0);
var
subprogram_gt_ns : Int64;
normal_gt_ns : Int64;
rnd_input : Integer;
i : Integer;
begin
inherited Create(AOwner);
normal_gt_ns := 0;
subprogram_gt_ns := 0;
rnd_input := Random(1000);
for i := 0 to c_nSamples - 1 do
if (rnd_sample[i] = 1) then
Inc(subprogram_gt_ns, subprogram_main(rnd_input))
else
Inc(normal_gt_ns, normal_main(rnd_input));
OutputDebugString(PChar(' Normal ' + FloatToStr(normal_gt_ns / 30) + ' Subprogram ' + FloatToStr(subprogram_gt_ns / 30)));
end;
There is no significant difference even with optimizations turned off:
Debug Output: Normal 1166,66666666667 Subprogram 1203,33333333333 Process MyProject.exe (1824)
Finally, both texts that warn about performance mention something about shared local variables.
If we do not pass z
to subprogram_aux
, instead access it directly, we get:
MyFormU.pas.47: z := subprogram_aux(n);
005CE7D2 55 push ebp
005CE7D3 8BC3 mov eax,ebx
005CE7D5 E8AAFFFFFF call subprogram_aux
005CE7DA 59 pop ecx
005CE7DB 8945FC mov [ebp-$04],eax
Even with optimizations turned on.
tl;dr
- There are extra
push
/pop
s for nested subroutines - Turning on optimizations may strip those away, such that the generated code is the same for both nested subroutines and normal subroutines
- Inlining results in the same code being generated for both nested and normal subroutines
- For simple routines with few parameters and local variables we perceived no performance difference even with optimizations turned off
I wrote a little test to determine this, where GetRTClock
is measuring the current time with a precision of 1ns:
function subprogram_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
function subprogram_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := subprogram_aux(n, z);
Result := GetRTClock - s;
end;
function normal_aux(n, z : Integer) : Integer;
var
i : Integer;
begin
// Do some useless work on the aux program
for i := 0 to n - 1 do begin
if (i > z) then
z := z + i
else
z := z - i;
end;
Result := z;
end;
function normal_main(z : Integer) : Int64;
var
n : Integer;
s : Int64;
begin
s := GetRTClock;
// Do some minor work on the main program
n := z div 100 * 100 + 100;
// Call the aux program
z := normal_aux(n, z);
Result := GetRTClock - s;
end;
This compiles to:
subprogram_main
MyFormU.pas.41: begin
005CE7D0 55 push ebp
005CE7D1 8BEC mov ebp,esp
005CE7D3 83C4E0 add esp,-$20
005CE7D6 8945FC mov [ebp-$04],eax
MyFormU.pas.42: s := GetRTClock;
...
MyFormU.pas.45: n := z div 100 * 100 + 100;
...
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7F8 55 push ebp
005CE7F9 8B55FC mov edx,[ebp-$04]
005CE7FC 8B45EC mov eax,[ebp-$14]
005CE7FF E880FFFFFF call subprogram_aux
005CE804 59 pop ecx
005CE805 8945FC mov [ebp-$04],eax
MyFormU.pas.49: Result := GetRTClock - s;
...
normal_main
MyFormU.pas.70: begin
005CE870 55 push ebp
005CE871 8BEC mov ebp,esp
005CE873 83C4E0 add esp,-$20
005CE876 8945FC mov [ebp-$04],eax
MyFormU.pas.71: s := GetRTClock;
...
MyFormU.pas.74: n := z div 100 * 100 + 100;
...
MyFormU.pas.76: z := normal_aux(n, z);
005CE898 8B55FC mov edx,[ebp-$04]
005CE89B 8B45EC mov eax,[ebp-$14]
005CE89E E881FFFFFF call normal_aux
005CE8A3 8945FC mov [ebp-$04],eax
MyFormU.pas.78: Result := GetRTClock - s;
...
subprogram_aux:
MyFormU.pas.31: begin
005CE784 55 push ebp
005CE785 8BEC mov ebp,esp
005CE787 83C4EC add esp,-$14
005CE78A 8955F8 mov [ebp-$08],edx
005CE78D 8945FC mov [ebp-$04],eax
MyFormU.pas.33: for i := 0 to n - 1 do begin
005CE790 8B45FC mov eax,[ebp-$04]
005CE793 48 dec eax
005CE794 85C0 test eax,eax
005CE796 7C29 jl $005ce7c1
005CE798 40 inc eax
005CE799 8945EC mov [ebp-$14],eax
005CE79C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.34: if (i > z) then
005CE7A3 8B45F0 mov eax,[ebp-$10]
005CE7A6 3B45F8 cmp eax,[ebp-$08]
005CE7A9 7E08 jle $005ce7b3
MyFormU.pas.35: z := z + i
005CE7AB 8B45F0 mov eax,[ebp-$10]
005CE7AE 0145F8 add [ebp-$08],eax
005CE7B1 EB06 jmp $005ce7b9
MyFormU.pas.37: z := z - i;
005CE7B3 8B45F0 mov eax,[ebp-$10]
005CE7B6 2945F8 sub [ebp-$08],eax
normal_aux:
MyFormU.pas.55: begin
005CE824 55 push ebp
005CE825 8BEC mov ebp,esp
005CE827 83C4EC add esp,-$14
005CE82A 8955F8 mov [ebp-$08],edx
005CE82D 8945FC mov [ebp-$04],eax
MyFormU.pas.57: for i := 0 to n - 1 do begin
005CE830 8B45FC mov eax,[ebp-$04]
005CE833 48 dec eax
005CE834 85C0 test eax,eax
005CE836 7C29 jl $005ce861
005CE838 40 inc eax
005CE839 8945EC mov [ebp-$14],eax
005CE83C C745F000000000 mov [ebp-$10],$00000000
MyFormU.pas.58: if (i > z) then
005CE843 8B45F0 mov eax,[ebp-$10]
005CE846 3B45F8 cmp eax,[ebp-$08]
005CE849 7E08 jle $005ce853
MyFormU.pas.59: z := z + i
005CE84B 8B45F0 mov eax,[ebp-$10]
005CE84E 0145F8 add [ebp-$08],eax
005CE851 EB06 jmp $005ce859
MyFormU.pas.61: z := z - i;
005CE853 8B45F0 mov eax,[ebp-$10]
005CE856 2945F8 sub [ebp-$08],eax
The only difference is one push
and one pop
. What happens if we turn on optimizations?
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7C5 8BD3 mov edx,ebx
005CE7C7 8BC6 mov eax,esi
005CE7C9 E8B6FFFFFF call subprogram_aux
MyFormU.pas.76: z := normal_aux(n, z);
005CE82D 8BD3 mov edx,ebx
005CE82F 8BC6 mov eax,esi
005CE831 E8B6FFFFFF call normal_aux
Both compile exactly to the same thing.
What happens when inlining?
MyFormU.pas.76: z := normal_aux(n, z);
005CE804 8BD3 mov edx,ebx
005CE806 8BC8 mov ecx,eax
005CE808 49 dec ecx
005CE809 85C9 test ecx,ecx
005CE80B 7C11 jl $005ce81e
005CE80D 41 inc ecx
005CE80E 33C0 xor eax,eax
005CE810 3BD0 cmp edx,eax
005CE812 7D04 jnl $005ce818
005CE814 03D0 add edx,eax
005CE816 EB02 jmp $005ce81a
005CE818 2BD0 sub edx,eax
005CE81A 40 inc eax
005CE81B 49 dec ecx
005CE81C 75F2 jnz $005ce810
subprogram_main:
MyFormU.pas.47: z := subprogram_aux(n, z);
005CE7A8 8BD3 mov edx,ebx
005CE7AA 8BC8 mov ecx,eax
005CE7AC 49 dec ecx
005CE7AD 85C9 test ecx,ecx
005CE7AF 7C11 jl $005ce7c2
005CE7B1 41 inc ecx
005CE7B2 33C0 xor eax,eax
005CE7B4 3BD0 cmp edx,eax
005CE7B6 7D04 jnl $005ce7bc
005CE7B8 03D0 add edx,eax
005CE7BA EB02 jmp $005ce7be
005CE7BC 2BD0 sub edx,eax
005CE7BE 40 inc eax
005CE7BF 49 dec ecx
005CE7C0 75F2 jnz $005ce7b4
Again, no difference.
I also profiled this little example, taking an average of 30 executions for each (normal and subprogram), called in random order:
constructor TForm1.Create(AOwner: TComponent);
const
c_nSamples = 60;
rnd_sample : array[0..c_nSamples - 1] of byte = (1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0);
var
subprogram_gt_ns : Int64;
normal_gt_ns : Int64;
rnd_input : Integer;
i : Integer;
begin
inherited Create(AOwner);
normal_gt_ns := 0;
subprogram_gt_ns := 0;
rnd_input := Random(1000);
for i := 0 to c_nSamples - 1 do
if (rnd_sample[i] = 1) then
Inc(subprogram_gt_ns, subprogram_main(rnd_input))
else
Inc(normal_gt_ns, normal_main(rnd_input));
OutputDebugString(PChar(' Normal ' + FloatToStr(normal_gt_ns / 30) + ' Subprogram ' + FloatToStr(subprogram_gt_ns / 30)));
end;
There is no significant difference even with optimizations turned off:
Debug Output: Normal 1166,66666666667 Subprogram 1203,33333333333 Process MyProject.exe (1824)
Finally, both texts that warn about performance mention something about shared local variables.
If we do not pass z
to subprogram_aux
, instead access it directly, we get:
MyFormU.pas.47: z := subprogram_aux(n);
005CE7D2 55 push ebp
005CE7D3 8BC3 mov eax,ebx
005CE7D5 E8AAFFFFFF call subprogram_aux
005CE7DA 59 pop ecx
005CE7DB 8945FC mov [ebp-$04],eax
Even with optimizations turned on.
edited 9 hours ago
answered 9 hours ago
afarahafarah
477211
477211
2
Would be interesting to see what the other compilers do with this, e.g. Win64 compiler
– David Heffernan
9 hours ago
1
Would also be interesting to see your RTClock (or RTC lock?). Does that require a specific piece of hardware, or is it available for everyone?
– Rudy Velthuis
2 hours ago
@RudyVelthuis It's just a wrapper to Windows'QueryPerformanceCounter
, I was told the resolution is 1ns but the documentation only says "<1us".
– afarah
1 hour ago
add a comment |
2
Would be interesting to see what the other compilers do with this, e.g. Win64 compiler
– David Heffernan
9 hours ago
1
Would also be interesting to see your RTClock (or RTC lock?). Does that require a specific piece of hardware, or is it available for everyone?
– Rudy Velthuis
2 hours ago
@RudyVelthuis It's just a wrapper to Windows'QueryPerformanceCounter
, I was told the resolution is 1ns but the documentation only says "<1us".
– afarah
1 hour ago
2
2
Would be interesting to see what the other compilers do with this, e.g. Win64 compiler
– David Heffernan
9 hours ago
Would be interesting to see what the other compilers do with this, e.g. Win64 compiler
– David Heffernan
9 hours ago
1
1
Would also be interesting to see your RTClock (or RTC lock?). Does that require a specific piece of hardware, or is it available for everyone?
– Rudy Velthuis
2 hours ago
Would also be interesting to see your RTClock (or RTC lock?). Does that require a specific piece of hardware, or is it available for everyone?
– Rudy Velthuis
2 hours ago
@RudyVelthuis It's just a wrapper to Windows'
QueryPerformanceCounter
, I was told the resolution is 1ns but the documentation only says "<1us".– afarah
1 hour ago
@RudyVelthuis It's just a wrapper to Windows'
QueryPerformanceCounter
, I was told the resolution is 1ns but the documentation only says "<1us".– afarah
1 hour ago
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55832314%2fwhy-is-there-a-performance-penalty-for-nested-subroutines-in-delphi%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Any code quality tool that encourages you to refactor into global variables should be treated with a severe degree of scepticism. With that said, it's good to see that the compiler, as expected, is not quite so incompetent as this would suggest.
– J...
9 hours ago
2
@J..., the bit about global variables is not from Peganza.
– Uli Gerhardt
8 hours ago
@UliGerhardt I see that now. I guess the second lesson is...don't trust code quality advice that was written almost two decades ago. Unless, of course, we're meaning to micro-optimize our code for an ancient compiler and the "new" Pentium II and its advanced core features 0_o
– J...
8 hours ago
Actually, what Robert Lee (he was some kind of optimization guru, these days) wrote was generally quite correct. His main objective was to raise the speed of code. So then you get advice like using global variables. They can/could be quite a bit faster, indeed. But indeed, on modern processors the differences are not that big anymore.
– Rudy Velthuis
6 hours ago
2
The difference are never that big, and these kinds of optimizations only count when you are invoking that code very often. In loops of a billion iterations, it could matter, but in a standard business application where you're reading and writing some stuff to files and databases, and now and then have a piece of logic with a nested procedure that is executed for each of the 100 rows you just queried, there is no way you're gonna spot the difference. Doesn't mean, though, that you shouldn't refactor them, if only for quality metrics like readability and re-usability.
– GolezTrol
6 hours ago