Josef “Jeff” Sipek

DTrace: The utmp_update Debugger

For the past 2 years, I am a happy user of rxvt-unicode, aka urxvt. Recently, I noticed that my logs contained rather mysterious error messages:

Nov  3 22:46:03 meili utmp_update[1613]: [ID 845426 user.error] Wrong number of
	arguments or invalid user 

Sometimes, there were a dozen of these. Of course I filed a bug with the Illumos folks. Rich Lowe suggested using DTrace to figure out what is actually going on. It was time to look at the exit codes for utmp_update.

syscall::rexit:entry
/execname=="utmp_update"/
{
	printf("utmp_update exited with code %d", arg0);
	@[arg0] = count();
}

tick-60sec
{
	printa(@);
}

Since utmp is involved, it had something to do with terminals, so I tried to open some terminals and close them. That did it!

# dtrace -s catch-errors.d 
dtrace: script 'catch-errors.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0     49                      rexit:entry utmp_update exited with code 0
  1     49                      rexit:entry utmp_update exited with code 0
  0     49                      rexit:entry utmp_update exited with code 7
  0     49                      rexit:entry utmp_update exited with code 7
  1     49                      rexit:entry utmp_update exited with code 0
  5     49                      rexit:entry utmp_update exited with code 0
  1     49                      rexit:entry utmp_update exited with code 0
  2     49                      rexit:entry utmp_update exited with code 0
  3     49                      rexit:entry utmp_update exited with code 7
  1  67549                      :tick-60sec 
                7                3
                0                6

It turns out that every time I closed a terminal, utmp_update exited with error 7. A quick glance at usr/src/cmd/utmp_update/utmp_update.c reveals:

/*
 * Return codes
 */
#define	NORMAL_EXIT		0
#define	BAD_ARGS		1
#define	PUTUTXLINE_FAILURE	2
#define	FORK_FAILURE		3
#define	SETSID_FAILURE		4
#define	ALREADY_DEAD		5
#define	ENTRY_NOTFOUND		6
#define	ILLEGAL_ARGUMENT	7
#define	DEVICE_ERROR		8

Aha! It really is an invalid argument. At this point, Rich pointed me to setutxline in libc.so. Sadly, for whatever reason, the probe pid*:libc.so.1:pututxline:entry didn’t work (it didn’t match anything). Rich suggested the following DTrace script:

proc:::exec
/strstr(args[0], "utmp") != NULL/
{
	trace(execname);
}

Pretty straightforward — the output told me that it was urxvt causing all this trouble.

Now, I knew to watch out for pututxline in urxvt. I tried to set a probe pid$target::pututxline:entry and use the new print function in DTrace, but due to a user error (read: sometimes I write stupid code) it didn’t work. Rich helped me navigate through mdb to get a print-out of the utx structure. At this point, it was a bit too late in the night and so I went to bed.

The next morning, I tried the print function again and this time I used it right and it printed out the structure:

struct utmpx {
    char [32] ut_user = [ "" ]
    char [4] ut_id = [ 'v', 't', '0', '2' ]
    char [32] ut_line = [ "pts/2" ]
    pid_t ut_pid = 0x193ec
    short ut_type = 0x8
    struct exit_status ut_exit = {
        short e_termination = 0
        short e_exit = 0
    }
    struct timeval ut_tv = {
        time_t tv_sec = 0x4eb55d2a
        suseconds_t tv_usec = 0x18adc
    }
    int ut_session = 0
    int [5] pad = [ 0, 0, 0, 0, 0 ]
    short ut_syslen = 0
    char [257] ut_host = [ "" ]
}

Everything looks right, except that the ut_user field is blank. I wonder if this could be the cause of it. Time to look at the urxvt code! (The ustack() action in a DTrace probe for pututxline:entry will tell you where to look.) Here’s a snippet from rxvt-unicode-9.12/libptytty/src/logging.C:

/*
 * remove utmp and wtmp entries
 */
void
ptytty_unix::logout ()
{
  ...
#ifdef HAVE_STRUCT_UTMPX
  setutxent ();
  tmputx = getutxid (utx);
  if (tmputx && tmputx->ut_pid == cmd_pid)
    pututxline (utx);
  endutxent ();
#endif
  ...
}

Ok, so it gets a utx struct and then it puts a different one. Let’s see how different those two are:

# cat getutxid.d
#include <utmpx.h>

pid$target::getutxid:return
{
	ustack();
	print(*(struct utmpx*)copyin(arg1, sizeof(struct utmpx)));
}
# dtrace -Cs getutxid.d -p 103403
dtrace: script 'getutxid.d' matched 1 probes
dtrace: pid 103403 has exited
CPU     ID                    FUNCTION:NAME
  4  67555                  getutxid:return 
              libc.so.1`getutxid+0xf1
              urxvt`_ZN11ptytty_unixD0Ev+0x16
              urxvt`_ZN9rxvt_termD1Ev+0x59b
              urxvt`_ZN9rxvt_term10destroy_cbERN2ev4idleEi+0x70
              urxvt`_ZN2ev4baseI7ev_idleNS_4idleEE12method_thunkI9rxvt_termXadL_ZNS5_10destroy_cbERS2_iEEEEvPS1_i+0x27
              urxvt`ev_invoke_pending+0x35
              urxvt`ev_run+0x520
              urxvt`main+0x29b
              urxvt`_start+0x83
struct utmpx {
    char [32] ut_user = [ "jeffpc" ]
    char [4] ut_id = [ 'v', 't', '0', '2' ]
    char [32] ut_line = [ "pts/2" ]
    pid_t ut_pid = 0x193ec
    short ut_type = 0x7
    struct exit_status ut_exit = {
        short e_termination = 0
        short e_exit = 0x2
    }
    struct timeval ut_tv = {
        time_t tv_sec = 0x4eb55d1f
        suseconds_t tv_usec = 0x18adc
    }
    int ut_session = 0
    int [5] pad = [ 0, 0, 0, 0x303a0005, 0x302e ]
    short ut_syslen = 0
    char [257] ut_host = [ "" ]
}

Ok, it’s more or less the same. It does, however, have a username filled in. I wonder what would happen if I filled in the username. (The urxvt code seems to fill it in only on login updates and it leaves the field empty on logout updates.) Now, I had 3 choices…

  1. Change the urxvt code to fill in the username on logout updates.
  2. Set a breakpoint in gdb or mdb and then tweak the structure before it is passed to utmp_update.
  3. Use DTraces “destructive” option to allow me to modify the process’s memory.

I chose #3.

Here’s the script in all its glory:

#pragma D option destructive

#include <utmpx.h>

pid$target::getutxid:return
{
	ustack();
	print(*(struct utmpx*)copyin(arg1, sizeof(struct utmpx)));
}

pid$target::pututxline:entry
{
	ustack();
	print(*(struct utmpx*)copyin(arg0, sizeof(struct utmpx)));
	printf("\nFIXING...\n");
	copyout("jeffpc\0", (uintptr_t)&((struct utmpx*)arg0)->ut_user[0], 7);
	print(*(struct utmpx*)copyin(arg0, sizeof(struct utmpx)));
}

pid$target::pututxline:return
{
	printf("pututxline returned %p", arg1);
}

And here’s the output:

# dtrace -Cs foo.d -p 103403
dtrace: script 'foo.d' matched 3 probes
dtrace: allowing destructive actions
dtrace: pid 103403 has exited
CPU     ID                    FUNCTION:NAME
  4  67555                  getutxid:return 
              libc.so.1`getutxid+0xf1
              urxvt`_ZN11ptytty_unixD0Ev+0x16
              urxvt`_ZN9rxvt_termD1Ev+0x59b
              urxvt`_ZN9rxvt_term10destroy_cbERN2ev4idleEi+0x70
              urxvt`_ZN2ev4baseI7ev_idleNS_4idleEE12method_thunkI9rxvt_termXadL_ZNS5_10destroy_cbERS2_iEEEEvPS1_i+0x27
              urxvt`ev_invoke_pending+0x35
              urxvt`ev_run+0x520
              urxvt`main+0x29b
              urxvt`_start+0x83
struct utmpx {
    char [32] ut_user = [ "jeffpc" ]
    char [4] ut_id = [ 'v', 't', '0', '2' ]
    char [32] ut_line = [ "pts/2" ]
    pid_t ut_pid = 0x193ec
    short ut_type = 0x7
    struct exit_status ut_exit = {
        short e_termination = 0
        short e_exit = 0x2
    }
    struct timeval ut_tv = {
        time_t tv_sec = 0x4eb55d1f
        suseconds_t tv_usec = 0x18adc
    }
    int ut_session = 0
    int [5] pad = [ 0, 0, 0, 0x303a0005, 0x302e ]
    short ut_syslen = 0
    char [257] ut_host = [ "" ]
}
  4  67556                 pututxline:entry 
              libc.so.1`pututxline
              urxvt`_ZN11ptytty_unix6logoutEv+0x15c
              urxvt`_ZN11ptytty_unixD0Ev+0x16
              urxvt`_ZN9rxvt_termD1Ev+0x59b
              urxvt`_ZN9rxvt_term10destroy_cbERN2ev4idleEi+0x70
              urxvt`_ZN2ev4baseI7ev_idleNS_4idleEE12method_thunkI9rxvt_termXadL_ZNS5_10destroy_cbERS2_iEEEEvPS1_i+0x27
              urxvt`ev_invoke_pending+0x35
              urxvt`ev_run+0x520
              urxvt`main+0x29b
              urxvt`_start+0x83
struct utmpx {
    char [32] ut_user = [ "" ]
    char [4] ut_id = [ 'v', 't', '0', '2' ]
    char [32] ut_line = [ "pts/2" ]
    pid_t ut_pid = 0x193ec
    short ut_type = 0x8
    struct exit_status ut_exit = {
        short e_termination = 0
        short e_exit = 0
    }
    struct timeval ut_tv = {
        time_t tv_sec = 0x4eb55d2a
        suseconds_t tv_usec = 0x18adc
    }
    int ut_session = 0
    int [5] pad = [ 0, 0, 0, 0, 0 ]
    short ut_syslen = 0
    char [257] ut_host = [ "" ]
}
FIXING...
struct utmpx {
    char [32] ut_user = [ "jeffpc" ]
    char [4] ut_id = [ 'v', 't', '0', '2' ]
    char [32] ut_line = [ "pts/2" ]
    pid_t ut_pid = 0x193ec
    short ut_type = 0x8
    struct exit_status ut_exit = {
        short e_termination = 0
        short e_exit = 0
    }
    struct timeval ut_tv = {
        time_t tv_sec = 0x4eb55d2a
        suseconds_t tv_usec = 0x18adc
    }
    int ut_session = 0
    int [5] pad = [ 0, 0, 0, 0, 0 ]
    short ut_syslen = 0
    char [257] ut_host = [ "" ]
}
  4  67557                pututxline:return pututxline returned fef6ecf0

We can see the getutxid return a reasonable utmpx. Then we see pututxline get a utmpx without a username set. Then there is the fixed up tmpx. Finally, we see that the pututxline returned a non-NULL pointer. (It returns NULL on error — which does indeed happen without the fix-up.)

There you have it, folks. DTrace let me debug an issue without annoying change-compile-install cycles. Now, all I have to do is fix up urxvt in OpenIndiana and possibly, if it applies to other systems, push the fix upstream.

Recursion with Sun Studio

I mentioned my post about recursion & DTrace in the #dtrace, and one of the folks there asked about Sun’s/Oracle’s compiler — Studio. I happen to have version 12u1, so why not have some more fun? :)

First things first, the results. With -O1, both variants recurse — 6 calls followed by 6 returns. With -O2, the naive code still recurses, while the explicitly-coded-as-tail-recursive code tail-recurses.

Here are the disassemblies (for -O2). Naive-recursive:

08050aa0 <fac>:
 8050aa0:	55                   	push   %ebp
 8050aa1:	8b ec                	mov    %esp,%ebp
 8050aa3:	53                   	push   %ebx
 8050aa4:	83 ec 04             	sub    $0x4,%esp
 8050aa7:	83 e4 f0             	and    $0xfffffff0,%esp
 8050aaa:	8b 5d 08             	mov    0x8(%ebp),%ebx
 8050aad:	83 fb 01             	cmp    $0x1,%ebx
 8050ab0:	7f 07                	jg     8050ab9 <fac+0x19>
 8050ab2:	b8 01 00 00 00       	mov    $0x1,%eax
 8050ab7:	eb 12                	jmp    8050acb <fac+0x2b>
 8050ab9:	83 ec 0c             	sub    $0xc,%esp
 8050abc:	8d 43 ff             	lea    -0x1(%ebx),%eax
 8050abf:	50                   	push   %eax
 8050ac0:	e8 db ff ff ff       	call   8050aa0 <fac>
 8050ac5:	83 c4 10             	add    $0x10,%esp
 8050ac8:	0f af c3             	imul   %ebx,%eax
 8050acb:	8d 65 fc             	lea    -0x4(%ebp),%esp
 8050ace:	5b                   	pop    %ebx
 8050acf:	c9                   	leave  
 8050ad0:	c3                   	ret    

and tail-recursive:

08050aa0 <fac>:
 8050aa0:	55                   	push   %ebp
 8050aa1:	8b ec                	mov    %esp,%ebp
 8050aa3:	8b 55 08             	mov    0x8(%ebp),%edx
 8050aa6:	8b 45 0c             	mov    0xc(%ebp),%eax
 8050aa9:	83 fa 01             	cmp    $0x1,%edx
 8050aac:	7e 0d                	jle    8050abb <fac+0x1b>
 8050aae:	8d 4a ff             	lea    -0x1(%edx),%ecx
 8050ab1:	0f af c2             	imul   %edx,%eax
 8050ab4:	8b d1                	mov    %ecx,%edx
 8050ab6:	83 f9 01             	cmp    $0x1,%ecx
 8050ab9:	7f f3                	jg     8050aae <fac+0xe>
 8050abb:	89 45 0c             	mov    %eax,0xc(%ebp)
 8050abe:	89 55 08             	mov    %edx,0x8(%ebp)
 8050ac1:	c9                   	leave  
 8050ac2:	c3                   	ret 

I find it fascinating that Studio produced nearly identical code to gcc. The big differences that you can see are: (1) Studio defaults to saving the frame pointer while gcc omitted it, (2) gcc inserted a no-op to 16-byte align the loop, and (3) Studio uses an extra register to perform the subtraction (the n-1).

The subtraction surprised me, and so I examined the code more closely. My guess is that Studio attempts to move the subtraction as far up as possible to start it early. This way, by the time we actually need the subtracted value (the cmp instruction) it’s already available. Compare that to gcc’s sub followed immediately by cmp. Some processors are designed in such a way that the result of sub will be available by the time cmp wants to execute. Others are not.

Your Turn

Have you noticed any interesting compiler optimizations? Share them in a comment!

Recursion

Last night while reading about DTrace, I came across an example that involved tracing a simple recursive factorial program. I pointed it out to my girlfriend, Holly, since I thought that she’d find it interesting — the class she teaches has a section about recursion.

Here’s the original code:

extern int fac (int n);

int main(int argc, char **argv)
{
	int f;
	f = fac(6);
	return 0;
}

int fac (int n)
{
	if (n <= 1)
		return 1;
	else
		return n * fac (n - 1);
}

Pretty simple. I compiled it, and ran it with DTrace:

$ gcc -o fac orig.c 
$ cat fac.d 
pid$target::fac:entry
{
        trace (arg0);
}
pid$target::fac:return
{
        trace (arg1);
}
$ sudo dtrace -Fs fac.d -c ./fac
dtrace: script 'fac.d' matched 2 probes
dtrace: pid 1146 has exited
CPU FUNCTION                                 
  0  -> fac                                                   6
  0    -> fac                                                 5
  0      -> fac                                               4
  0        -> fac                                             3
  0          -> fac                                           2
  0            -> fac                                         1
  0            <- fac                                         1
  0          <- fac                                           2
  0        <- fac                                             6
  0      <- fac                                              24
  0    <- fac                                               120
  0  <- fac                                                 720

Cool! I thought I was done. Holly asked whether it would work with tail-recursion. Interesting, I thought…it might - depending on how gcc handles the function prologue and epilogue. The D script is a little different to dump both of the arguments. Here’s the result:

$ gcc -o fact tail.c
$ cat fact.d 
pid$target::fac:entry
{
        trace (arg0); trace(arg1);
}
pid$target::fac:return
{
        trace (arg1);
}
$ sudo dtrace -Fs fact.d -c ./fact
dtrace: script 'fact.d' matched 2 probes
dtrace: pid 1233 has exited
CPU FUNCTION                                 
  2  -> fac                                                   6                1
  2    -> fac                                                 5                6
  2      -> fac                                               4               30
  2        -> fac                                             3              120
  2          -> fac                                           2              360
  2            -> fac                                         1              720
  2            <- fac                                       720
  2          <- fac                                         720
  2        <- fac                                           720
  2      <- fac                                             720
  2    <- fac                                               720
  2  <- fac                                                 720

You can see the values getting calculated and passed down versus them getting calculated on during the return. That was fun to see. But wait! If this function is tail recursive, why are we seeing all these returns? It should be just one return! Doh! I didn’t compile with optimizations so gcc emitted the stupidest possible code. Easy enough to fix:

$ gcc -o fact -O2 tail.c 
$ sudo dtrace -Fs fact.d -c ./fact
dtrace: script 'fact.d' matched 2 probes
dtrace: pid 1304 has exited

$

Huh… nothing happened! fac was never called. Let’s see what gcc emitted:

$ objdump -S fact
...
Disassembly of section .text.startup:

08050d10 <main>:
 8050d10:       31 c0                   xor    %eax,%eax
 8050d12:       c3                      ret  

Great, the main function turned into a return 0 because the value returned by fac() was never used. Easy enough, let’s just return the value from main.

$ cat tail.c 
extern int fac (int n);

int main(int argc, char **argv)
{
        return fac(6);
}

int fac (int n)
{
        if (n <= 1)
                return 1;
        else
                return n * fac (n - 1);
}
$ gcc -o fact -O2 tail.c 
$ objdump -S fact
...
Disassembly of section .text.startup:

08050d10 <main>:
 8050d10:       b8 d0 02 00 00          mov    $0x2d0,%eax
 8050d15:       c3                      ret    

Argh! Thanks gcc, you replaced my fac(6) function call with the value 720 — that is the factorial of 6. Fine, let’s do this the hard way: get an int from the first argument and print it out. Also, to prevent inlining, let’s put it in a separate file. So, now we have:

$ cat fac.c
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

extern int fac (int n);

int main(int argc, char **argv)
{
        printf("%d\n", fac(atoi(argv[1])));
        return 0;
}
$ cat fac_2.c
int fac (int n)
{
        if (n <= 1)
                return 1;
        else
                return n * fac (n - 1);
}
$ cat fact.c
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

extern int fac (int n, int v);

int main(int argc, char **argv)
{
        printf("%d\n", fac(atoi(argv[1]), 1));
        return 0;
}
$ cat fact_2.c
int fac (int n, int v)
{
        if (n <= 1)
                return v;
        else
                return fac(n-1, n*v);
}

Now, if we run the two variants compiled with -O2, we get:

$ sudo dtrace -Fs fac.d -c "./fac 6"
dtrace: script 'fac.d' matched 2 probes
720
dtrace: pid 1397 has exited
CPU FUNCTION                                 
  2  -> fac                                                   6
  2  <- fac                                                 720

$ sudo dtrace -Fs fact.d -c "./fact 6"
dtrace: script 'fact.d' matched 2 probes
720
dtrace: pid 1399 has exited
CPU FUNCTION                                 
  5  -> fac                                                   6                1
  5  <- fac                                                 720

Weird, for both we can see only one function entry and return. Let’s try with -O1:

$ sudo dtrace -Fs fac.d -c "./fac 6"
dtrace: script 'fac.d' matched 2 probes
720
dtrace: pid 1393 has exited
CPU FUNCTION                                 
  6  -> fac                                                   6
  6    -> fac                                                 5
  6      -> fac                                               4
  6        -> fac                                             3
  6          -> fac                                           2
  6            -> fac                                         1
  6            <- fac                                         1
  6          <- fac                                           2
  6        <- fac                                             6
  6      <- fac                                              24
  6    <- fac                                               120
  6  <- fac                                                 720

$ sudo dtrace -Fs fact.d -c "./fact 6"
dtrace: script 'fact.d' matched 2 probes
720
dtrace: pid 1395 has exited
CPU FUNCTION                                 
  2  -> fac                                                   6                1
  2    -> fac                                                 5                6
  2      -> fac                                               4               30
  2        -> fac                                             3              120
  2          -> fac                                           2              360
  2            -> fac                                         1              720
  2            <- fac                                       720
  2          <- fac                                         720
  2        <- fac                                           720
  2      <- fac                                             720
  2    <- fac                                               720
  2  <- fac                                                 720

Ok, now were back to having call and return instructions for both cases — the tail recursive function is not actually tail recursing when it should. So, first moral of the story is: -O1 is not enough to make tail recursive functions tail recurse. The odd behavior of the non-tail recursive code with -O2 is still weird. Let’s disassemble it; first the simple recursive code:

08050d00 <fac>:
 8050d00:       8b 54 24 04             mov    0x4(%esp),%edx
 8050d04:       b8 01 00 00 00          mov    $0x1,%eax
 8050d09:       83 fa 01                cmp    $0x1,%edx
 8050d0c:       7e 0d                   jle    8050d1b <fac+0x1b>
 8050d0e:       66 90                   xchg   %ax,%ax
 8050d10:       0f af c2                imul   %edx,%eax
 8050d13:       83 ea 01                sub    $0x1,%edx
 8050d16:       83 fa 01                cmp    $0x1,%edx
 8050d19:       75 f5                   jne    8050d10 <fac+0x10>
 8050d1b:       f3 c3                   repz ret 
 8050d1d:       90                      nop    
 8050d1e:       90                      nop    
 8050d1f:       90                      nop    

Whoa! gcc turned the plain recursive code into a tail-recursive one. For comparison, here’s the disassembly of the explicitly-coded-as-tail-recursive function:

08050d00 <fac>:
 8050d00:       8b 54 24 04             mov    0x4(%esp),%edx
 8050d04:       8b 44 24 08             mov    0x8(%esp),%eax
 8050d08:       83 fa 01                cmp    $0x1,%edx
 8050d0b:       7e 0e                   jle    8050d1b <fac+0x1b>
 8050d0d:       8d 76 00                lea    0x0(%esi),%esi
 8050d10:       0f af c2                imul   %edx,%eax
 8050d13:       83 ea 01                sub    $0x1,%edx
 8050d16:       83 fa 01                cmp    $0x1,%edx
 8050d19:       75 f5                   jne    8050d10 <fac+0x10>
 8050d1b:       f3 c3                   repz ret 
 8050d1d:       90                      nop    
 8050d1e:       90                      nop    
 8050d1f:       90                      nop

Do you see it? It’s virtually identical to what gcc emitted for the naive code.

So there you have it folks. The compiler is smarter than you, more consistent than you, and less likely to screw up compared to you when converting a recursive function into a tail-recursive one. In general, you should not prematurely optimize.

In case you care, I’ve used gcc 4.6.1 for these experiments on OpenIndiana.

Your Turn

Do you have an interesting compiler optimization story? Share it in a comment!

DTrace: qsort use in Firefox, part 2

Earlier, I wrote about some silly qsort behavior in Firefox. I couldn’t help but dig a bit deeper.

Before, we concluded that there were a lot of 8-element, 4-byte element sorts. What are these used for? What part of Firefox is causing these? DTrace to the rescue.

First, let’s change the last DTrace command from last time a bit. First of all, let’s look at 4-byte element invocations only (arg2 equals 4) and let’s aggregate on the caller function name:

# dtrace -n 'pid1120:libc:qsort:entry/arg2==4/{@[ufunc(ucaller)]=llquantize(arg1, 10,0,6,20)} tick-60sec{printa(@)}'

...
  1  75455                      :tick-60sec 
  libnss3.so`DPCache_GetUpToDate                    
           value  ------------- Distribution ------------- count    
             < 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14       
               1 |                                         0        

  libglib-2.0.so.0.2501.0`g_array_sort              
           value  ------------- Distribution ------------- count    
             700 |                                         0        
             750 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
             800 |                                         0        

  libfontconfig.so.1`FcFontSetSort                  
           value  ------------- Distribution ------------- count    
              55 |                                         0        
              60 |@                                        4        
              65 |                                         2        
              70 |                                         0        
              75 |@@@@                                     20       
              80 |@@@@@                                    24       
              85 |@@@@@@@                                  36       
              90 |@@@@@                                    26       
              95 |                                         0        
             100 |@@@@@@@@@@@@@@                           76       
             150 |@@@@                                     22       
             200 |                                         0        

  libcairo.so.2.10800.10`_cairo_bentley_ottmann_tessellate_bo_edges
           value  ------------- Distribution ------------- count    
               3 |                                         0        
               4 |@                                        59       
               5 |                                         0        
               6 |                                         32       
               7 |                                         0        
               8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            2357     
               9 |                                         0        
              10 |@@                                       137      
              15 |@@@                                      215      
              20 |@                                        52       
              25 |                                         12       
              30 |@                                        58       
              35 |@@                                       200      
              40 |@                                        67       
              45 |                                         3        
              50 |                                         2        
              55 |                                         2        
              60 |                                         7        
              65 |                                         0        
              70 |@                                        67       
              75 |                                         0        
              80 |                                         0        
              85 |                                         0        
              90 |                                         0        
              95 |                                         16       
             100 |                                         0   

We see four unique functions that call qsort. It doesn’t take long to spot the one we were looking for: _cairo_bentley_ottmann_tessellate_bo_edges in libcairo.so. Interesting, so it turns out that it wasn’t Firefox itself doing all these sorts (8-element, 4-byte element) but rather the Cairo graphics library. It would also seem that it is the only place that does these sorts. Let’s see how Firefox is involved in this.

# dtrace -n 'pid1120:libc:qsort:entry/arg2==4 && arg1==8/{@[ustack()]=count()} tick-60sec{printa(@)}'

...
  0  75455                      :tick-60sec 

              libc.so.1`qsort
              libcairo.so.2.10800.10`_cairo_bentley_ottmann_tessellate_bo_edges+0x172
              libcairo.so.2.10800.10`_cairo_bentley_ottmann_tessellate_polygon+0x41f
              libcairo.so.2.10800.10`_cairo_path_fixed_fill_to_traps+0x150
              libcairo.so.2.10800.10`_cairo_clip_clip+0xe3
              libcairo.so.2.10800.10`_cairo_gstate_clip+0x44
              libcairo.so.2.10800.10`cairo_clip_preserve+0x31
              libxul.so`__1cKgfxContextEClip6M_v_+0x24
              libxul.so`__1cInsWindowNOnExposeEvent6MpnK_GtkWidget_pnP_GdkEventExpose__i_+0x56e
              libxul.so`__1cPexpose_event_cb6FpnK_GtkWidget_pnP_GdkEventExpose__i_+0x72
              libgtk-x11-2.0.so.0.2000.0`_gtk_marshal_BOOLEAN__BOXED+0x7b
              libgobject-2.0.so.0.2501.0`g_closure_invoke+0xdf
              libgobject-2.0.so.0.2501.0`signal_emit_unlocked_R+0x9f2
              libgobject-2.0.so.0.2501.0`g_signal_emit_valist+0x6d2
              libgobject-2.0.so.0.2501.0`g_signal_emit+0x28
              libgtk-x11-2.0.so.0.2000.0`gtk_widget_event_internal+0x24d
              libgtk-x11-2.0.so.0.2000.0`gtk_widget_send_expose+0x98
              libgtk-x11-2.0.so.0.2000.0`gtk_main_do_event+0x46f
              libgdk-x11-2.0.so.0.2000.0`_gdk_window_process_updates_recurse+0x291
              libgdk-x11-2.0.so.0.2000.0`_gdk_windowing_window_process_updates_recurse+0x24
              184

This counts the number of 4-byte element, 8-element arrays qsort aggregated on the stack trace. We get to use ustack() here because we are in userspace (stack() would give us the kernel stack trace). Where is Firefox in this mess? libxul.so. This is the limit of my knowledge of Firefox internals and someone more knowledgeable could tell you more.

Your Turn

Do you use DTrace? Do you have some interesting war stories? Let me know in the comments!

DTrace: qsort use in Firefox

I’ve talked about OpenIndiana a bunch. I’ve mentioned several of its features. Let me tell you about my Wikipedia article: DTrace experiments from today. Inspired by Wikipedia article: Con Kolivas, I decided to see how Firefox uses the qsort C function.

First things first, let’s look at what the function signature looks like.

void qsort(void *base, size_t nel, size_t width,
           int (*compar)(const void *, const void *));

The second argument contains the number of elements.

Now, let’s take a look at DTrace. We want the pid provider, which lets us instrument a specific process. In my case, Firefox was pid 1069. pid1069:libc:qsort:entry is the name of the probe that will fire every time qsort in libc.so is called by Firefox (pid 1069). Let’s aggregate the second argument (the number of elements). To keep things sane, I used the llquantize function. It is a log-linear quantization function that was rather well explained by Bryan Cantrill. (Base 10 with buckets between zero and one million seemed reasonable enough.) Additionally, I wanted DTrace to give me the current histogram every minute — that’s why there is the tick-60sec probe.

# dtrace -n 'pid1069:libc:qsort:entry{@=llquantize(arg1, 10,0,6,20)} tick-60sec{printa(@)}'

...
  1  78738                      :tick-60sec 

           value  ------------- Distribution ------------- count    
             < 1 |                                         2        
               1 |                                         0        
               2 |                                         21       
               3 |                                         2        
               4 |@@@@                                     365      
               5 |                                         1        
               6 |@                                        132      
               7 |                                         0        
               8 |@@@@@@@@@@@@@@@@@@@@@                    1923     
               9 |                                         0        
              10 |@                                        135      
              15 |@@                                       194      
              20 |@                                        134      
              25 |                                         9        
              30 |@@@                                      246      
              35 |                                         31       
              40 |                                         8        
              45 |                                         0        
              50 |                                         0        
              55 |                                         10       
              60 |                                         1        
              65 |                                         10       
              70 |                                         39       
              75 |                                         2        
              80 |@                                        112      
              85 |@                                        56       
              90 |@                                        82       
              95 |                                         1        
             100 |@                                        132      
             150 |@                                        90       
             200 |                                         0        
             250 |                                         0        
             300 |                                         4        
             350 |                                         0        

Interesting! After several minutes of browsing various websites, we can see that Firefox really likes to sort 8-element arrays. (The value column is the bucket for the various array lengths. The count column specifies how many times there was a qsort call for each bucket.) Let’s dig a little deeper. Sorting 1 byte array elements is very different from sorting 1 MB elements. It would be really nice if we could break the histogram down into several histograms — one for each size. Well, guess what? DTrace lets you do that very easily.

Note that the command changed only a little. Now, in addition to looking at the second argument (the array length), DTrace breaks down the distribution based on the value of the third argument (the array element size). Since I visited different websites and Firefox does caching, the distribution of qsorts is a bit different — but it is still close enough.

# dtrace -n 'pid1069:libc:qsort:entry{@[arg2]=llquantize(arg1, 10,0,6,20)} tick-60sec{printa(@)}'

...
  4  78738                      :tick-60sec 
               52
           value  ------------- Distribution ------------- count    
               1 |                                         0        
               2 |@@@@@@@@@@@@@@@@@                        60       
               3 |@@                                       9        
               4 |@@@@@                                    17       
               5 |                                         0        
               6 |@@@@@@@@@@@@@@@                          55       
               7 |                                         0        
               8 |@                                        3        
               9 |                                         1        
              10 |                                         0        

                8
           value  ------------- Distribution ------------- count    
             < 1 |@@@@@@@@@@@@@                            2        
               1 |                                         0        
               2 |                                         0        
               3 |                                         0        
               4 |                                         0        
               5 |                                         0        
               6 |                                         0        
               7 |                                         0        
               8 |                                         0        
               9 |                                         0        
              10 |                                         0        
              15 |                                         0        
              20 |                                         0        
              25 |                                         0        
              30 |                                         0        
              35 |                                         0        
              40 |                                         0        
              45 |                                         0        
              50 |                                         0        
              55 |                                         0        
              60 |                                         0        
              65 |                                         0        
              70 |                                         0        
              75 |                                         0        
              80 |                                         0        
              85 |                                         0        
              90 |                                         0        
              95 |                                         0        
             100 |                                         0        
             150 |                                         0        
             200 |@@@@@@@@@@@@@@@@@@@@                     3        
             250 |                                         0        
             300 |@@@@@@@                                  1        
             350 |                                         0        

                4
           value  ------------- Distribution ------------- count    
             < 1 |                                         12       
               1 |                                         0        
               2 |                                         1        
               3 |                                         0        
               4 |@@@@@@@                                  1351     
               5 |                                         0        
               6 |@                                        247      
               7 |                                         0        
               8 |@@@@@@@@@                                1868     
               9 |                                         0        
              10 |@@@                                      594      
              15 |@@                                       422      
              20 |@                                        230      
              25 |                                         4        
              30 |@@@@@@                                   1193     
              35 |@@                                       466      
              40 |                                         57       
              45 |                                         63       
              50 |                                         1        
              55 |                                         18       
              60 |@                                        190      
              65 |@@                                       341      
              70 |@                                        207      
              75 |                                         2        
              80 |                                         56       
              85 |@                                        158      
              90 |                                         46       
              95 |                                         0        
             100 |@@                                       350      
             150 |@                                        206      
             200 |                                         3        
             250 |                                         10       
             300 |                                         8        
             350 |                                         0        
             400 |                                         0        
             450 |                                         0        
             500 |                                         0        
             550 |                                         0        
             600 |                                         0        
             650 |                                         0        
             700 |                                         1        
             750 |                                         10       
             800 |                                         0        
             850 |                                         0        
             900 |                                         0        
             950 |                                         0        
            1000 |                                         0        
            1500 |                                         0        
            2000 |                                         0        
            2500 |                                         0        
            3000 |                                         8        
            3500 |                                         0        

As you can see, there are now three histograms printed — that’s because DTrace saw 3 unique values of arg2. The first histogram is for 52-byte array element sorts. There weren’t many of those over the few minutes of browsing I did. The second is for 8-bytes elements — there are six of those total! The third distribution is where things get interesting. These are all the sorts of 4-byte array elements. Now we know that the large amount of 8-element sorts Firefox performs are on 4-byte element arrays. I wonder what that’s about. We also see that there were eight times that Firefox ended up sorting an array that had somewhere between 3000 and 3500 4-byte elements. Eeek!

DTrace is a really powerful tool. It lets you inspect the operation of a system with minimal disruption (the performance overhead is rather small). I hope to post more analyses of various software in the future.

I should add, this experiment was conducted with Firefox 3.6.12 on OpenIndiana 151a.

CMake

Recently, I looked into various build system in hopes of finding one that sucks less than the custom built (not by me) one I had the “pleasure” of dealing with for about a month. The requirements were that it had to work on Windows, Linux, and Mac OS X. Of all the possibilities, CMake looked really promising. After successfully convincing the others that CMake was better for the project, most of the code got switched over. I have to say, CMake is nice.

Since then, I have replaced Makefiles in several of my projects (e.g., my blogging system) with CMakeLists.txt. Not only is the console output cleaner, but it does proper dependencies, some sanity checks, and in general simplifies one’s life. I am actually considering switching HVF build system to it.

zEnterprise 196

I’m rather late with this post, but here it goes anyway. In August 2010, IBM announced a new mainframe — the zEnterprise 196.

At first, I wasn’t sure if that was supposed to be a z10 replacement, or if it was an entry level system like the Wikipedia article: Multiprise was back in the ESA/390 days. It turns out that it is a z10 replacement.

The system looks like a z10 from the outside, on the inside it’s different of course.

The specs for a fully configured system are:

  • 80 customer cores @ 5.2GHz (96 cores total, the 16 extra ones take care of system stuff as well as provide spare cores for failover)
  • 3TB of memory (z10 has 1.5TB limit)

The big new thing is the addition of the zBX. Now, it is possible to have your mainframe hooked up to several POWER (and soon x86) BladeCenters working as one system. I don’t really know much about how this part of the system works, but it sure looks interesting.

Awful, Awful C++: Operator Overloading Insanity

Ew! This is an emergency post. I just found out about some operator overloading brain damage in C++. Take a look at the Wikipedia article about Wikipedia article: Operators in C and C++. Just to make sure, I checked the C/C++ Language Reference for IBM’s XL C++ compiler (I find it easier to read than many of the other language references).

As you are no doubt aware, C++ has a ++ prefix and a ++ postfix operator (there is also the decrement operator but the same insanity applies to it). They work differently so when overloading them, there is a need for two different functions.

The syntax for operator overloading does not however allow for the difference in a obvious way. operator++ doesn’t seem to indicate whether you are talking about the prefix or the postfix variant. Here’s a code snippet that shows the syntax for both:

class X {
	void operator++() { };
	void operator++(int) { };
};

Do you see the difference? The postfix variant uses a dummy int argument. What?!

C++ is even worse than that! I found out that there are two more operators that C++ has. Here’s an example of how they might be used:

struct X {
	int *foo;
};

void foo(struct X *p)
{
	p->*foo = 1;
}

void bar(struct X &p)
{
	p.*foo = 1;
}

Do you see it? Do you see that abomination? Here’s the equivalent version using a more mainstream syntax.

void foo(struct X *p)
{
	*p->foo = 1;
}

void bar(struct X &p)
{
	*p.foo = 1;
}

We’ve tried to compile a test program that uses this “fancy” notation, but gcc does not like it. I am glad.

Awful, Awful C++: Exceptions & Memory Leaks

Let me start today’s post with a question. Can you spot the issue in the following code snippet? (Yes, I know, it’s a bit artificial, but it is inspired by a lot of real code I have seen.)

void foo(struct S *ptr)
{
	assert(ptr);

	ptr->a = new int[100];
	ptr->b = new int[100];
}

void bar()
{
	struct S s;

	try {
		foo(&s);
	} catch (std::bad_alloc &ba) {
		// foo failed to allocate memory
		return;
	}

	// continue processing
}

Do you see the issue?

It’s not a C++ problem, but C++ exceptions make it possible. The answer is: there is a potential memory leak.

Everyone has always told me how amazing exceptions are, but I always found them counter intuitive. I’ve used several languages that have you manage the memory, and so I am rather familiar with the following construct (I’m using C for this, but the same idea exists in other languages):

ptr = malloc(sizeof(xyz));
if (!ptr) {
	/* allocation failed */
}

Everyone has always told me that with exceptions, you can do a bunch of things in the try block and then figure out what exactly went wrong in one of the catch blocks. In other words, something like:

try {
	ptr = new xyz;
	// more processing
} catch (std::bad_alloc &ba) {
	// oops, failed to allocate memory
} catch (whatever &w) {
	// handle 'whatever' exception
}

That all sounds great, but what if you need to allocate two objects? When you catch a bad_alloc exception, how do you know which of the allocations failed? Sure, you could initialize it to NULL, and then check for it in the catch block. It becomes a bigger issue when the function using new does no exception handling and instead relies on the caller to take care of it. (Java handles this better by forcing you to either catch the exception, or marking the function with an annotation indicating that it may throw a specific exception. Interestingly enough, C++ lets you annotate functions to tell the compiler whether or not it will throw an exception, but that’s a topic for another time.) Since the caller does not have access to the callee’s stack frame, it has no way of determining whether or not there was a successful allocation in addition to the failed one. Again, this is not an issue with the language, but rather with the view that exceptions are magic that makes memory management simple.

Recently, I mentioned this issue to someone who uses C++ on daily basis, and he suggested that this could be fixed by having the function doing the allocation (in my example foo) use auto_ptr instead of raw pointers. That means, that the if execution leaves the scope whatever was allocated gets destroyed. This compilicates the implementation of foo — now, we need two auto_ptrs to store the actual pointers to the integer arrays, and if we managed to allocate both, we can set the raw pointers in the structure and everything is fine and dandy. If there was an allocation failure, we exit the scope because of the exception — the auto_ptrs will destroy things for us.

Finally, I would like to address those of you that are about to tell me that if memory allocation failed, then we are doomed anyway and possibly leaking some other memory is not a big deal. In some software, that is certainly true. However, there is a huge amount of code that can (and should) continue to function even in the presence of such failure. For example, suppose you are writing a web server. A client connects, you try to allocate some sort of structure to keep track of per-client state. The allocation fails. Does it mean that the whole web server process is doomed? No. Just this one client as far as we know. We should clean up, and then close the network connection. That one client may end up unhappy, but there’s nothing we could do for it. Other clients that already happen to have all the memory they need, can continue on their merry way. Additionally, memory may become available in the future — a memory hog may exit freeing up a lot of memory.

Awful, Awful C++: Casting

C++ has several ways of casting. Sadly enough, the names are mostly unintuitive. I asked several people I know (some of which happen to be, or have been instructors for the introductory C++ class), and I managed to construct the following list. Not all of them knew what all the casts do. This is a very C++ symptom; most people don’t know significant fractions of the language. This is mostly because C++ is a very complex language and no one teaches/tells you about everything.

Of course, Stroustrup’s The C++ Programming Language describes all the cast constructs.

static_cast
converts between related types. The types have to be convertible. For example, ints can be trivially converted into enums. Floating point variables can be converted to integral types — truncate the decimal portion.
reinterpret_cast
converts between unrelated types. For example, from an integer to a pointer, or between two unrelated pointer types (e.g., float* and int*).
const_cast
is used when you want to remove const or volatile qualifiers.
dynamic_cast
attempts a type conversion but performs run-time checks to see if the conversion is valid. This is useful for converting pointer types — from a base class type to a subclass type. If the conversion is not valid, dynamic_cast returns a null pointer.
C-style cast
is the confusing cast in C++. It is a combination of static_cast, reinterpret_cast, and const_cast. That is, it can be used to add or remove const-ness, volatile-ness, convert between related or unrelated types (between ints and pointers, different type pointers, and between the various integral types and floats).

All in all, it really isn’t that bad once you see a list.

I started this post with the intent to show how ridiculous C++ casting is, but I think that all I can really say is just that I don’t agree with some of the ways it is done. I don’t like how verbose the cast syntax is. (I am aware that unnecessary casting is bad; in C, I get frustrated whenever I see people casting malloc return value instead of simply assigning it.) On the other hand, I have to agree with people that say that it’s easier to grep for C++-style casts.

The issue with C++ here is that it allows C-style casts to coexist with C-style casts. I’ve been told by C++ fans that one cannot treat C++ as if it were C. One way that would accomplish that would be making C++ look different than C. An easy thing would be to get rid of C-style casts. (I am aware that this will never happen.) Not only do they confuse C programmers trying to use C++, and C++ programmers that haven’t been told about all the features of the language, they also make the compiler’s job more difficult and the software potentially buggier. In C++, the C-style cast is the nuke. It tells the compiler to just do the conversion. C++ gives the programmer 4 different casts, using the right one (or ones if more than one is required) at the right time means the compiler can do more optimizations and more sanity checks making the code faster and less likely to be buggy.

So, C++ isn’t awful because it has a bunch of different casts, but because it allows them co-exist with the C-style cast.

Powered by blahgd