[CUWiN-Dev] memory-hungry/starved nodes

Tue Mar 14 07:02:14 CST 2006

Daniel and I went by Mike's memory-hungry node yesterday and, on Dave's
advice, poked around to see what the problem was.

The most troubling thing we found was, of all things, two rather large 
instances of ntpd running. Consider the following:

Upon node boot:

# du -ks /mfs
2400    /mfs

# df -h
Filesystem    Size      Used     Avail Capacity  Mounted on
/dev/wd0a      30M      25M      3.4M    87%    /
tmpfs         192K     192K        0B   100%    /dev
tmpfs         2.2M     2.2M        0B   100%    /mfs
/etc           30M      25M      3.4M    87%    /permanent/etc
/home          30M      25M      3.4M    87%    /permanent/home
/tmp           30M      25M      3.4M    87%    /permanent/tmp
/var           30M      25M      3.4M    87%    /permanent/var
/mfs/etc      2.2M     2.2M        0B   100%    /etc
/mfs/home     2.2M     2.2M        0B   100%    /home
/mfs/tmp      2.2M     2.2M        0B   100%    /tmp
/mfs/var      2.2M     2.2M        0B   100%    /var

# top
load averages:  0.66,  0.47,  0.21                  up 0 days,  0:03   02:42:18
29 processes:  1 runnable, 27 sleeping, 1 on processor
CPU states:  5.9% user,  0.0% nice, 18.2% system,  0.0% interrupt, 75.9% idle
Memory: 13M Act, 1984K Inact, 3816K Wired, 4300K Exec, 1400K File, 4732K Free
Swap:

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
 1749 root      18    0  1092K 3428K pause      0:00  0.00%  0.00% ntpd
 1724 root      18    0  1092K 1168K pause      0:00  0.00%  0.00% ntpd
 2612 root       2    0   616K  848K select     0:00  0.00%  0.00% dhclient
 1817 nobody     2    0   508K 1148K kqread     0:00  0.00%  0.00% thttpd
 1677 root       2    0   480K 1052K select     0:00  0.00%  0.00% zebra
 1128 root       2    0   316K 2404K netio      0:00  0.00%  0.00% sshd
 2923 twiltziu   2    0   316K 2048K select     0:00  0.00%  0.00% sshd
 1848 root       2    0   248K 1716K select     0:05  0.00%  0.00% sshd
 1929 root      10    0   248K  900K wait       0:01  0.10%  0.10% sh
 1870 root       2    0   204K  900K fifor      0:00  0.00%  0.00% sh
 1690 root      10    0   172K  784K wait       0:01  0.05%  0.05% sh
 6333 root      28    0   168K 1020K CPU        0:00  1.54%  0.34% top
 2930 twiltziu  10    0   152K  760K wait       0:00  0.00%  0.00% sh
 2928 root      10    0   144K  732K wait       0:00  0.00%  0.00% sh
 1703 root       2    0   136K 1256K kqread     0:03  1.03%  1.03% hslsd
 1750 root       2    0   112K  836K kqread     0:00  0.00%  0.00% syslogd
 1911 root      10    0   108K  816K nanoslee   0:00  0.00%  0.00% cron

# pkill ntpd

# df -h
Filesystem    Size      Used     Avail Capacity  Mounted on
/dev/wd0a      30M      25M      3.4M    87%    /
tmpfs         400K     192K      208K    48%    /dev
tmpfs          11M     2.2M      8.7M    20%    /mfs
/etc           30M      25M      3.4M    87%    /permanent/etc
/home          30M      25M      3.4M    87%    /permanent/home
/tmp           30M      25M      3.4M    87%    /permanent/tmp
/var           30M      25M      3.4M    87%    /permanent/var
/mfs/etc       11M     2.2M      8.7M    20%    /etc
/mfs/home      11M     2.2M      8.7M    20%    /home
/mfs/tmp       11M     2.2M      8.7M    20%    /tmp
/mfs/var       11M     2.2M      8.7M    20%    /var

# top
load averages:  1.42,  0.71,  0.31                  up 0 days,  0:04   02:43:13
27 processes:  26 sleeping, 1 on processor
CPU states:  8.9% user,  0.0% nice, 18.8% system,  0.0% interrupt, 72.3% idle
Memory: 10M Act, 1984K Inact, 484K Wired, 3900K Exec, 1800K File, 11M Free
Swap:

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
 2612 root       2    0   616K  848K select     0:00  0.00%  0.00% dhclient
 1817 nobody     2    0   508K 1148K kqread     0:00  0.00%  0.00% thttpd
 1677 root       2    0   480K 1052K select     0:00  0.00%  0.00% zebra
 1128 root       2    0   316K 2404K netio      0:00  0.00%  0.00% sshd
 2923 twiltziu   2    0   316K 2048K select     0:00  0.15%  0.15% sshd
 1848 root       2    0   248K 1716K select     0:05  0.00%  0.00% sshd
 1929 root      10    0   248K  900K wait       0:01  0.10%  0.10% sh
 1870 root       2    0   204K  900K fifor      0:00  0.00%  0.00% sh
 1690 root      10    0   172K  784K wait       0:01  0.05%  0.05% sh
 7227 root      28    0   168K 1020K CPU        0:00  0.88%  0.20% top
 2930 twiltziu  10    0   152K  760K wait       0:00  0.00%  0.00% sh
 2928 root      10    0   144K  732K wait       0:00  0.00%  0.00% sh
 1703 root       2    0   136K 1256K kqread     0:04  1.32%  1.32% hslsd
 1750 root       2    0   112K  836K kqread     0:00  0.00%  0.00% syslogd
 1911 root      10    0   108K  816K nanoslee   0:00  0.00%  0.00% cron
 1905 root       2    0    72K  884K kqread     0:00  0.00%  0.00% inetd
    1 root      10    0    68K  740K wait       0:00  0.00%  0.00% init

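After killing ntpd, free memory went from 4732K to 11M, and /mfs went from
100% full at 2.2M to 20% full at 11M. Killing other programs freed some more
memory as well, but none freed as much as ntpd.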
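
For anyone who wants to total a daemon's footprint from a top listing like
the ones above, here is a minimal sketch. The function name sum_res_kb is
made up for illustration, and it assumes the 11-column layout shown here,
with RES in the sixth column and a K suffix.

```shell
# Sum the RES column (in KB) for one command name from top-style output.
# Assumes RES is field 6 and the command name is the last field.
sum_res_kb() {
    awk -v cmd="$1" '
        $NF == cmd && $6 ~ /K$/ { sub(/K$/, "", $6); total += $6 }
        END { print total + 0 }
    '
}

# The two ntpd lines and one other line from the first top listing:
printf '%s\n' \
  ' 1749 root      18    0  1092K 3428K pause      0:00  0.00%  0.00% ntpd' \
  ' 1724 root      18    0  1092K 1168K pause      0:00  0.00%  0.00% ntpd' \
  ' 2612 root       2    0   616K  848K select     0:00  0.00%  0.00% dhclient' |
  sum_res_kb ntpd    # prints 4596
```

Resident pages can be shared between the two ntpd instances, so this sum is
only a rough upper bound on what they hold.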

Interestingly, the system locked up about 5 minutes after we killed ntpd.
We're not sure whether that has anything to do with the process, since the
node had always locked up about 5-10 minutes after we turned it on.

We don't know whether the lockup might be related to the node not having
enough free space to finish booting properly:

# tail /var/log/messages
Aug 23 02:40:30 cuw cuw_config: Creating pipe /var/run/cuwconf_pipe
Aug 23 02:40:40 cuw /sbin/dhclient-script: reason PREINIT
Aug 23 02:40:44 cuw syslogd: /var/log/daemon: No space left on device
Aug 23 02:40:44 cuw syslogd: /var/log/daemon: No space left on device
Aug 23 02:40:45 cuw /sbin/dhclient-script: reason BOUND
Aug 23 02:40:45 cuw /sbin/dhclient-script: Routers: 192.168.1.1
Aug 23 02:41:04 cuw su: twiltziu to root on /dev/ttyp0
Aug 23 02:42:50 cuw hslsd: send LSU on interface fdb4:542d:dc11:b792:202:6fff:fe01:b792 failed
Aug 23 02:42:50 cuw hslsd: send LSU on interface fdb4:542d:dc11:1461:200:24ff:fec1:1461 failed
Aug 23 02:42:50 cuw hslsd: send LSU on interface fdb4:542d:dc11:1460:200:24ff:fec1:1460 failed
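
To spot the "No space left on device" condition quickly on other nodes, a
small filter over df output might help. This is only a sketch: full_mounts
is a made-up name, and it assumes the column layout in the df listings
above (Capacity in column 5, mount point in column 6).

```shell
# Print mount points whose Capacity column is at or above a threshold.
# Reads df-style output on stdin; the threshold defaults to 100 (percent).
full_mounts() {
    awk -v min="${1:-100}" '$5 ~ /%$/ && $5 + 0 >= min { print $6 }'
}

# On a node this would be run as:
#   df -h | full_mounts 100
```

Run against the boot-time listing above, this would report /dev, /mfs and
the four /mfs-backed mounts, which matches what syslogd is complaining about.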

Daniel and I thought we could edit the rc.d setup so that ntpd does not
start at boot, which might give us enough breathing room to "downgrade" the
node to an older version until a solution is found for nodes with less
memory. I realize this isn't terribly useful development-wise, but it would
help the network in that area (since Mike's house is a gateway).
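
A rough sketch of what that edit could look like, under two assumptions that
have not been checked against the CUWiN image: that ntpd is switched on by
an "ntpd=..." line in a NetBSD-style rc.conf, and that /permanent/etc is the
on-disk copy of /etc (which is what the df listing suggests). The function
name disable_in_rc_conf is made up for illustration.

```shell
# Set "<service>=NO" in an rc.conf-style file, appending the line if the
# service has no entry yet. Uses a temp file rather than in-place sed.
disable_in_rc_conf() {
    conf="$1"
    svc="$2"
    tmp="$conf.tmp.$$"
    if grep -q "^${svc}=" "$conf" 2>/dev/null; then
        sed "s/^${svc}=.*/${svc}=NO/" "$conf" > "$tmp" && mv "$tmp" "$conf"
    else
        echo "${svc}=NO" >> "$conf"
    fi
}

# Hypothetical usage on the node (the path is an assumption, see above):
#   disable_in_rc_conf /permanent/etc/rc.conf ntpd
```

The temp file is written next to the target, so it lands on the disk
partition (3.4M free) rather than on the full /mfs.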

Tom