Promising Linux Admin Blog
Maintaining compiled versions of elisp packages
Looks like my “portable elisp system,” which started with Peter Teichman’s work, needs some slickening of its auto-recompilation. There are a few issues:
- Some packages, e.g. w3m-el, have internal recompilation dependencies among their files. In other words, if you want things to Just Work, you’d better use their makefile.
- Some packages have invalid elisp in them (see org-mode’s EXPERIMENTAL directory)
- It doesn’t account for files being removed from packages. For example, Git’s vc integration was moved out of the Git tree and into the emacs tree, but I still had the .elc file hanging around, masking the file it should have been using.
Not sure what I’ll do about all this yet. If anyone has really bright ideas, please comment!
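One possible starting point for the stale-file problem, as a minimal sketch rather than a finished fix: walk a package tree and report every .elc whose .el source has disappeared or is newer than the compiled file. The function name find_stale_elc is made up for illustration, and the sketch assumes sources sit next to their compiled files as plain .el (it does not know about compressed .el.gz sources or makefile-driven builds).

```python
import os


def find_stale_elc(root):
    """Return (orphaned, outdated) lists of .elc paths found under root.

    orphaned: the matching .el file no longer exists.
    outdated: the matching .el file is newer than the .elc.
    Assumes each foo.elc is compiled from a foo.el in the same directory.
    """
    orphaned, outdated = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".elc"):
                continue
            elc = os.path.join(dirpath, name)
            el = elc[:-1]  # strip the trailing "c": foo.elc -> foo.el
            if not os.path.exists(el):
                orphaned.append(elc)
            elif os.path.getmtime(el) > os.path.getmtime(elc):
                outdated.append(elc)
    return sorted(orphaned), sorted(outdated)
```

Orphaned files are the ones that can mask a newer copy elsewhere on the load path, so those are the candidates for deletion; outdated ones only need recompiling.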
Speedy Mail (Sending) From Emacs
I long ago gave up resisting Emacs as the most productive working environment for me. But a few things still make it a drag. One is that Emacs sends mail in the foreground, holding up my UI while it connects to my server and ships the message out.
The way around this problem is to use a local MTA (Mail Transfer Agent). I’ve done it before with exim (I’ll try to dig up my old config if you ask for it) and for my Mac I’ve just done it again with postfix. I followed this thread (and posted followups about the parts that didn’t work for me as initially prescribed).
Once done, I set message-send-mail-function to ‘message-send-mail-with-sendmail, and was off to the races.
One remaining worry is what happens when the connection to the server fails; is the mail lost, postponed, or what? When I have time to look into that, I’ll post an update here.
The other thing this brings up: what do I need to do to make reading from my IMAP server more responsive? That’s another project for another day.
News Flash: MacOS NDG Either
It turns out that MacOS has its share of annoying quirks that can result in hours of delicious yak shaving. Here are a couple I just wasted time on:
- My bash prompts started showing up as “108:~”. Turns out this was due to the Mac’s goofy way of choosing a hostname combined with the fact that when I rebooted my machine I was connected at a TMobile hotspot. Adding
HOSTNAME=my.fully.qualified.host.name
to /etc/hostconfig works around the issue, but it’s not encouraging that that file starts with the comment “# This file is going away”, and isn’t changed when I update the hostname in my sharing preferences pane.
- Spotlight started indexing my remotely-mounted TimeMachine volume, and my attempts to tell it not to by using the Spotlight prefs pane resulted in an unknown error. Eventually I stumbled across these commands, though I’m still not entirely sure they will work:
# mdutil -i off /Volumes/TimeMachine
# mdutil -E /Volumes/TimeMachine/
Fingers crossed.
Iozone Author Speaks Out… Hilariously
When my experiment in benchmarking ZFS-Fuse yielded more data than I knew what to do with, I googled around a bit and found at least one other person in a similar position who contacted the author (Don Capps) of my benchmarking tool (Iozone) to get his take on the results. I figured it was worth a shot, and found Don to be extremely generous with his time and expertise. He distilled my numbers down into this graph, which made things a lot easier for me to grasp:
Basically, Don only looked at the results where the file size exceeded the system’s RAM size, since any transfer that fits in RAM isn’t going to tell you much about the underlying filesystem technology.
He also told me that if I was getting those kinds of speeds on commodity hardware, I should feel pretty good about my results.
So I posted a little about these results in a relevant mailing list, and someone else posted some numbers for a competing FUSE filesystem that were so much better that I had to ask Don for his opinion. Naturally, Don wanted details about the poster’s hardware and testing protocol, and suggested that we were very likely seeing pure cache effect in those numbers. Unfortunately, I’ve been unable to get any details—just one among several reasons I’m not giving those numbers much weight. In any case, Don followed up with a few messages about what sort of setup one would need to reproduce those claimed speeds. Those followups are the point of this post. They’re reposted here with Don’s permission, and IMO they speak for themselves:
Since you are interested in the science, I thought
I would describe some ways to get 1.2 Gbytes/sec
off the platter. ( It can be done, not easily, but
if one has ~infinite resources…..)

Note: Assume all values below are ballpark and not any
specific hw.

---------

Assuming ~40 Mbytes/sec/disk ( Typical modern disk drive )

Then to get to 1.2 Gbytes/sec == 1200 Mbytes/sec
1200/40 == 120/4 == 30 disk drives.

Now we need someway to connect 30 disks. Assuming
we can get 10 disks in a JBOD, we’ll need 3 disk
enclosures… Well… Not exactly. We still need
to have an aggregate interconnect of 1200 Mbytes/sec.
Ok.. Fibre Channel (1 Gigabit FC) can do around 100 Mbytes/sec
so… 1200/100 = 12 fibre channel connections. That’s
a bit of a bummer as most PC’s don’t have 12 PCI slots.
So… We will need to go to 2 Gigabit fibre and use 6
slots,… Oh darn… Most PC’s doesn’t have 6 free
PCI slots, so we’ll probably need 3 dual ported 2 Gigabit
FC cards. Since each of these cards is going to be
sustaining 400 Mbytes/sec, it’s probably a good
idea to make these PCI express slots.
So far we now have:
30 disks
6 Disk enclosures with 5 disks in each.
3 Dual ported 2 Gigabit FC cards.

Now on to the next bottleneck… Be sure that one
starts with a motherboard that has a backplane that
can sustain 1200 Mbytes/sec.

Next up, integrity… I doubt that most folks are
going to want to be ripping through data a 1.2 Gbytes/sec
and not care about their data. So… Chances are good
that they’ll want some level of RAID. RAID 1 would be
a good choice for speed, but it does mean that we’ll
need 60 disks instead of 30. If we use RAID 5 then we
will still need more disks, but not as many more. The
bummer of RAID 5 is that it generally slows down the
writer. To make up for that issue, we’ll have to
either choose more disks (RAID1) or a smarter RAID
enclosure, that can do the RAID5 XOR ops independently
of the system CPU, and hopefully double buffered, and
with multiple XOR engines and data paths. All doable,
but it does increase the cost of the system.

So.. Here we are. We can do 1.2 Gbytes/sec, but it
is not going to be cheap or easily achieved. If we
ballpark this we get something like:

* 3 dual ported 2Gbit FC controllers with multiple RAID5
XOR engines… ~ $3,000
* 40 to 60 disks .. ~$4,000 to $6,000.
* 6 JBOD enclosures.. ~$6,000
* 6 FC cables… ~$600
* PC with nice MB for that 1200 Mbytes/sec
backplane.. ~$2,000
…..
~$15,600 to ~$17,600

( And that could go up higher if one wanted dual path
HA type connectivity as it would push one to dual
ported enclosures and quad ported FC cards )

Enjoy,
Don Capps

P.S. The above may get one to 1.2 Gbytes/sec for sequential
workloads, but it will not be nearly so speedy if
that workload were to shift towards a random I/O
access pattern…
P.P.S. Once you have this beast built, then you can
start thinking about the environmental impact.
It’s pretty likely that these 6 disk enclosures,
60 disks, and the PC, are generating a fairly
significant quantity of heat, noise, and making
the electric meter spin at rate you have never
seen before, and can not afford to sustain
BUT, it will be beautiful and a work, of both,
science and art… Make sure you install
plenty of blue LEDs, as the blinking lights with
this many disks is mesmerizing, and will satisfy your
wife that you have constructed something really
interesting and have not been simply wasting your
time…
Don, this is hilarious and educational. Do you mind if I post it on my blog
(with attribution, of course)?
David,
I don’t mind. But I did leave off a few other thoughts…
Environmental impact continued:
It’s fairly likely you’ll need to hire an electrician to
come out and put in a special circuit and rewire the
bedroom (where you have the storage system) as the current
draw is probably going to exceed the typical breaker used
for a bedroom. + $500

You also may need to call the air-conditioner folks and
upgrade that 3 ton handler to a 5 ton handler, as the
thermal load is pretty high and without addition cooling
capacity, your house may become a sauna. + $6,000
It may also be possible to construct the beast inside of
a water cooled chamber and put a heat exchanger outside
your house. A small cooling tower should do the trick
but you may wish to check with the homeowners association
before you install the external tower.

Make sure the room is very dark and those blinking blue
LEDs look their best, otherwise you may need to explain
to your wife why you spent $23,500 on this project, instead
of a new car, a fur coat, a diamond ring, or a European
vacation … Trust me, you really want those LEDs to be
awesome…

Enjoy,
Don Capps

P.S. If you would like I could send photos of my home
bedroom lab. Yep… My wife really liked the blinking
blue lights, but then again, she is a computer scientist
too
I’m not going to post a picture of the inside of Don’s bedroom here, but I can tell you that while the rack is impressive, it’s not nearly as scary as I expected. My guess is he’s not trying to reproduce these claimed performance numbers.
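For anyone who wants to recheck the back-of-the-envelope arithmetic in Don’s messages, here is a small sketch that replays it. The throughput and price figures are the ballpark values from the email; the 200 Mbytes/sec figure for 2 Gigabit FC is an assumption (double the 1 Gigabit number, which is what the six-link count implies), and the variable names are made up for illustration.

```python
import math

TARGET_MB_S = 1200   # 1.2 Gbytes/sec, the claimed rate
DISK_MB_S = 40       # ballpark per-disk throughput from the email
FC_1G_MB_S = 100     # ballpark 1 Gigabit Fibre Channel throughput
FC_2G_MB_S = 200     # assumption: 2 Gigabit FC doubles the 1 Gigabit figure

disks = math.ceil(TARGET_MB_S / DISK_MB_S)
fc_1g_links = math.ceil(TARGET_MB_S / FC_1G_MB_S)
fc_2g_links = math.ceil(TARGET_MB_S / FC_2G_MB_S)
dual_port_cards = math.ceil(fc_2g_links / 2)
mb_s_per_card = TARGET_MB_S / dual_port_cards

# Ballpark prices from the email, in dollars.
fixed_costs = {"fc_cards": 3000, "enclosures": 6000, "cables": 600, "pc": 2000}
disk_cost_low, disk_cost_high = 4000, 6000
total_low = sum(fixed_costs.values()) + disk_cost_low
total_high = sum(fixed_costs.values()) + disk_cost_high

# Follow-up message: electrician plus air-conditioner upgrade.
extras = 500 + 6000

print("disks needed:", disks)
print("1G FC links:", fc_1g_links, "| 2G FC links:", fc_2g_links)
print("dual-port cards:", dual_port_cards, "at", mb_s_per_card, "Mbytes/sec each")
print("hardware total:", total_low, "to", total_high)
print("with home upgrades:", total_low + extras, "to", total_high + extras)
```

The numbers come out the same as in the email: 30 disks, 12 or 6 links, 3 cards at 400 Mbytes/sec each, and $15,600 to $17,600 of hardware. With the home upgrades added the range is $22,100 to $24,100, which brackets the $23,500 Don quotes.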
So Outdated Already?
Today I finally got VMWare up and running on Hydra, and simultaneously installed Windows XP Pro x64 edition on that and on my new MacBookPro 17″ (more on that later!) under VMWare Fusion. The neverending sequence of updates required every time you install Windows completed in—well, I’m not sure; I’m still waiting—let’s just say a lot less time on the MacBook than on my relatively muscular server. Yeah, the MacBook has a higher clock rate but the server has twice as many cores, eight times as many disks, and dual I/O channels.
In fact, these updates seem to be deadly slow on that VM. I don’t know how to account for it; top says vmware-vmx is using a maximum of 36% of one core and zfs-fuse is using maybe 11% max. I’m looking at an “Installing Updates” window that has had an empty progress bar for the past ten minutes. It’s almost as though there was some sort of deadlock in the filesystem. Grrr…
Never Partition Part of an Active RAID Array
Repeat after me: I will never change the partition table on a disk that’s part of an active RAID array. I will never change the partition table on a disk that’s part of an active RAID array. I will never change the partition table on a disk that’s part of an active RAID array. I will never change the partition table on a disk that’s part of an active RAID array…
I should’ve known this, but naturally all these lovely device aggregation technologies such as LVM, RAID, and ZFS have to store their meta-information somewhere, and that ends up being in the little holes between partitions and at the beginning and end of the disk. Changing the partition table while a RAID array is known by the system to exist on a device is likely to earn you all kinds of pain, and apparently partitioners are likely to stomp on these little areas; I found this out the hard way.
So if your root directory is on RAID and you ever need to partition some free space on one of those disks, reboot the system from a rescue disk and do your work with all the RAID arrays stopped. You might even need to destroy the arrays (i.e. lose the array metadata, but not the data on the RAID partitions) and re-construct them after partitioning. And if you are going to re-partition and format the disks: explicitly destroy all the RAID, ZFS, and LVM structures first. That information can hang around and come back to bite you if you don’t tell the system that your pools and volume groups are no more.
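A cheap sanity check before touching a partition table is to ask the kernel which md arrays still have members on the disk. The sketch below parses text in the usual /proc/mdstat layout; the function name arrays_using_disk and the sample text are illustrative, not taken from a real machine, and the parser only understands the common "mdN : active raidX sdf1[0] ..." line shape.

```python
import re

# Illustrative sample in the usual /proc/mdstat layout (not real output).
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6]
md1 : active raid6 sdf6[5] sde6[4] sdd6[3] sdc6[2] sdb6[1] sda6[0]
      19534848 blocks level 6, 64k chunk, algorithm 2 [6/6] [UUUUUU]

md0 : active raid1 sdf1[1] sde1[0]
      48064 blocks [2/2] [UU]

unused devices: <none>
"""


def arrays_using_disk(mdstat_text, disk):
    """Return the md arrays that list a partition of `disk` (e.g. 'sdf')."""
    arrays = []
    member_of_disk = re.compile(re.escape(disk) + r"\d*")
    for line in mdstat_text.splitlines():
        head = re.match(r"^(md\w+)\s*:\s*(.*)$", line)
        if not head:
            continue
        name, rest = head.groups()
        # Members look like "sdf6[5]", optionally followed by "(F)" or "(S)".
        members = re.findall(r"(\w+)\[\d+\]", rest)
        if any(member_of_disk.fullmatch(m) for m in members):
            arrays.append(name)
    return arrays


if __name__ == "__main__":
    print(arrays_using_disk(SAMPLE_MDSTAT, "sdf"))
```

On a live system the text would come from reading /proc/mdstat; if the returned list is non-empty, the disk is still part of an array and its partition table should be left alone.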
Lesson learned.
Stubbornly Persistent ZFS Pools (and what to do about them)
The first time I set up ZFS-Fuse on Hydra, I misinterpreted the admonition to use “whole disk vdevs” to mean that I should create pools from the entire disk device:
# zpool create <pool> raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd
But it turns out that when you do that on Solaris, it will actually create a regular partition for you. Since I need partitions anyway, this is good news. But I found that after re-partitioning my disks, it still looked to ZFS as though there were pools there. The information was hiding in some unused areas of the disks. It wasn’t a big deal, but every time I started ZFS, it would complain that it couldn’t find my old pool, “olympic” (get it, olympic pool? oh, never mind).
Fortunately these posts by Ricardo Correia made it clear exactly what I needed to do, and because I had set up fully-redundant RAID, it was fairly easy, if a bit nerve-racking, to zero those areas of the disks and re-sync them. My first real experience with RAID. Here’s how it went.
First, find a disk for which zdb -l will report some of the broken pools present:
# zdb -l /dev/sdf
--------------------------------------------
LABEL 0
--------------------------------------------
version=13
name='olympic'
state=0
txg=279903
pool_guid=4681941973924109929
hostid=8323329
hostname='recovery'
top_guid=8406764786620297180
guid=8406764786620297180
vdev_tree
type='disk'
id=1
guid=8406764786620297180
path='/dev/sdh'
whole_disk=0
metaslab_array=14
metaslab_shift=32
ashift=9
asize=500103118848
is_log=0
DTL=89
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
--------------------------------------------
LABEL 2
--------------------------------------------
version=13
name='olympic'
state=0
txg=279903
pool_guid=4681941973924109929
hostid=8323329
hostname='recovery'
top_guid=8406764786620297180
guid=8406764786620297180
vdev_tree
type='disk'
id=1
guid=8406764786620297180
path='/dev/sdh'
whole_disk=0
metaslab_array=14
metaslab_shift=32
ashift=9
asize=500103118848
is_log=0
DTL=89
--------------------------------------------
LABEL 3
--------------------------------------------
version=13
name='olympic'
state=0
txg=279903
pool_guid=4681941973924109929
hostid=8323329
hostname='recovery'
top_guid=8406764786620297180
guid=8406764786620297180
vdev_tree
type='disk'
id=1
guid=8406764786620297180
path='/dev/sdh'
whole_disk=0
metaslab_array=14
metaslab_shift=32
ashift=9
asize=500103118848
is_log=0
DTL=89
Now take the disk’s partitions out of all md arrays in which they participate:
# mdadm /dev/md0 -f /dev/sdf1 -r /dev/sdf1
mdadm: set /dev/sdf1 faulty in /dev/md0
mdadm: hot removed /dev/sdf1
# mdadm /dev/md1 -f /dev/sdf6 -r /dev/sdf6
mdadm: set /dev/sdf6 faulty in /dev/md1
mdadm: hot remove failed for /dev/sdf6: Device or resource busy
Occasionally the remove will fail as in the 2nd example above; just repeat the command in that case.
Note: in my case, there were no active ZFS pools that I wanted to keep on the system; only mdRAID arrays. If you have active ZFS pools you’ll want to do something similar with those: take them offline before the dd and allow them to resilver afterward.
So, just to be safe, I’m going to clear the first and last 2MB of each disk. The first 2M is easy:
# dd if=/dev/zero of=/dev/sdf bs=1M count=2
To zero out the final 2M, use fdisk to discover the actual size of the disk:
# fdisk /dev/sdf

The number of cylinders for this disk is set to 60801.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sdf: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdf1   *           1           6       48163+  fd  Linux raid autodetect
/dev/sdf2               7         732     5831595    5  Extended
/dev/sdf5               7         124      947803+  82  Linux swap / Solaris
/dev/sdf6             125         732     4883728+  fd  Linux raid autodetect
Now calculate the offset, in 1M blocks, at which to start zeroing, and issue another dd. The division rounds down, so zero three blocks instead of two.
# python -c 'print(500107862016 // (1024*1024) - 2)'
476938
# dd if=/dev/zero of=/dev/sdf bs=1M count=3 seek=476938
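The offset arithmetic is easy to get wrong, so here is a sketch that spells it out and checks that the two zeroed regions cover where the ZFS labels should be. The label layout used here (four 256 KiB labels, two at the front of the device and two at the end of its 256 KiB-aligned size) is my understanding of the ZFS on-disk format rather than something stated in this post, so treat it as an assumption; the function names are made up for illustration.

```python
MIB = 1024 * 1024
LABEL_BYTES = 256 * 1024  # assumed size of one ZFS vdev label


def tail_seek_blocks(disk_bytes, tail_mib=2):
    """1 MiB block index at which to start zeroing the tail of the disk."""
    return disk_bytes // MIB - tail_mib


def label_offsets(disk_bytes):
    """Assumed byte offsets of labels 0..3 on a device of this size."""
    usable = disk_bytes - disk_bytes % LABEL_BYTES
    return [0, LABEL_BYTES, usable - 2 * LABEL_BYTES, usable - LABEL_BYTES]


disk_bytes = 500107862016            # from the fdisk output
seek = tail_seek_blocks(disk_bytes)  # value passed to dd's seek=
tail_bytes = disk_bytes - seek * MIB
blocks_to_end = -(-tail_bytes // MIB)  # ceiling division

print("seek =", seek)
print("bytes from seek point to end of disk:", tail_bytes)
print("1 MiB blocks needed to reach the end:", blocks_to_end)

# Every assumed label should fall inside the first 2 MiB or the zeroed tail.
for number, offset in enumerate(label_offsets(disk_bytes)):
    in_head = offset + LABEL_BYTES <= 2 * MIB
    in_tail = offset >= seek * MIB
    print("label", number, "at byte", offset, "covered:", in_head or in_tail)
```

Because the disk size is not a whole number of MiB, the span from the seek point to the end of the disk is slightly more than 2 MiB, which is why the dd above uses count=3. Expect that dd to stop with an out-of-space error when it reaches the end of the device.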
Now we’ve clobbered the partition table, so copy it from an identically-partitioned disk:
# sfdisk -d /dev/sde | sfdisk /dev/sdf
Checking that no-one is using this disk right now ...
OK

Disk /dev/sdf: 60801 cylinders, 255 heads, 63 sectors/track

Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sdf1          0+    729     730-   5863693+   5  Extended
/dev/sdf2          0       -       0          0    0  Empty
/dev/sdf3          0       -       0          0    0  Empty
/dev/sdf4          0       -       0          0    0  Empty
/dev/sdf5          0+    121     122-    979902   82  Linux swap / Solaris
/dev/sdf6        122+    729     608-   4883728+  fd  Linux raid autodetect
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sdf1   *        63     96389      96327  fd  Linux raid autodetect
/dev/sdf2         96390  11759579   11663190   5  Extended
/dev/sdf3             0         -          0   0  Empty
/dev/sdf4             0         -          0   0  Empty
/dev/sdf5         96453   1992059    1895607  82  Linux swap / Solaris
/dev/sdf6       1992123  11759579    9767457  fd  Linux raid autodetect
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
Add the partitions back, and wait for them to re-synchronize:
# mdadm -a /dev/md0 /dev/sdf1 && mdadm -a /dev/md1 /dev/sdf6
# watch cat /proc/mdstat
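If you would rather have a program notice when the rebuild finishes than watch the screen, the progress line in /proc/mdstat is easy to pick out. This is a rough sketch: the function name recovery_progress and the sample text are illustrative, and it assumes the progress line has the usual "recovery = 12.6% (...)" or "resync = ..." shape.

```python
import re

# Illustrative sample of /proc/mdstat during a rebuild (not real output).
SAMPLE_REBUILD = """\
Personalities : [raid1] [raid6]
md1 : active raid6 sdf6[6] sde6[4] sdd6[3] sdc6[2] sdb6[1] sda6[0]
      19534848 blocks level 6, 64k chunk, algorithm 2 [6/5] [UUUUU_]
      [==>..................]  recovery = 12.6% (617088/4883712) finish=1.9min speed=36299K/sec

md0 : active raid1 sdf1[1] sde1[0]
      48064 blocks [2/2] [UU]

unused devices: <none>
"""


def recovery_progress(mdstat_text):
    """Map array name -> percent complete for arrays that are rebuilding."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        head = re.match(r"^(md\w+)\s*:", line)
        if head:
            current = head.group(1)
            continue
        status = re.search(r"(recovery|resync)\s*=\s*([\d.]+)%", line)
        if status and current is not None:
            progress[current] = float(status.group(2))
    return progress


if __name__ == "__main__":
    print(recovery_progress(SAMPLE_REBUILD))
```

An empty result means no array reports a rebuild in progress, which is the state to wait for before moving on to the next disk.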
When synchronization is done, the only step that remains is to reinstall GRUB if the disk was bootable:
# grub
Probing devices to guess BIOS drives. This may take a long time.
[ Minimal BASH-like line editing is supported. For
the first word, TAB lists possible command
completions. Anywhere else TAB lists the possible
completions of a device/filename. ]
grub> device (hd1) /dev/sdf
device (hd1) /dev/sdf
grub> root (hd1,0)
root (hd1,0)
grub> setup (hd1)
setup (hd1)
Checking if "/boot/grub/stage1" exists... no
Checking if "/grub/stage1" exists... yes
Checking if "/grub/stage2" exists... yes
Checking if "/grub/xfs_stage1_5" exists... yes
Running "embed /grub/xfs_stage1_5 (hd1)"... 19 sectors are embedded.
succeeded
Running "install /grub/stage1 (hd0) (hd1)1+19 p (hd1,0)/grub/stage2 /grub/menu.lst"... succeeded
Done.
Did I remove all traces of the pool? Yup:
# zdb -l /dev/sdf
--------------------------------------------
LABEL 0
--------------------------------------------
failed to unpack label 0
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3
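Reading that output by eye gets old after a few disks. A small parser, sketched here, can report which labels still unpack and which pool they name. It assumes the zdb -l layout shown in this post (a "LABEL n" header followed by name='...' lines); other zdb versions may print things differently. The function name surviving_labels is made up for illustration, and the sample strings are abbreviated copies of the output above.

```python
import re

# Abbreviated copies of the zdb -l output shown in this post.
BEFORE = """\
--------------------------------------------
LABEL 0
--------------------------------------------
version=13
name='olympic'
state=0
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
"""

AFTER = """\
--------------------------------------------
LABEL 0
--------------------------------------------
failed to unpack label 0
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
"""


def surviving_labels(zdb_output):
    """Map label number -> pool name for every label that still unpacks."""
    found = {}
    current = None
    for raw in zdb_output.splitlines():
        line = raw.strip()
        header = re.match(r"LABEL (\d+)$", line)
        if header:
            current = int(header.group(1))
            continue
        name = re.match(r"name='(.*)'$", line)
        if name and current is not None:
            found[current] = name.group(1)
    return found


if __name__ == "__main__":
    print(surviving_labels(BEFORE))
    print(surviving_labels(AFTER))
```

An empty dictionary means no label unpacked, which is the result to look for after the wipe.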
Just to be sure, I rebooted and selected /dev/sdf as my BIOS boot device. It worked! Lather, rinse, repeat (because this is going to take a while).
Update: After handling the three disks that contribute to more than one MD device, I wrote a little script to handle the rest of them. Voilà:
set -e
# set -x
if ! mdadm --detail -t /dev/md1 > /dev/null; then
echo waiting for healthy array
until mdadm --detail -t /dev/md1 > /dev/null ; do echo -n . ; sleep 1 ; done
fi
for x in /dev/sd[cdei]; do
echo '**' taking $x offline:
mdadm /dev/md1 -f ${x}6
sleep 1
mdadm /dev/md1 -r ${x}6
echo clearing beginning
dd if=/dev/zero of=$x bs=1M count=2 > /dev/null 2>&1
echo clearing end
dd if=/dev/zero of=$x bs=1M count=2 seek=476938 > /dev/null 2>&1
echo checking for cleared pools
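# grep -qv succeeds whenever any line lacks the pool name, so test for a match and bail out instead
if zdb -l $x | grep -q olympic; then echo pool still present on $x; exit 1; fi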
echo no pool remains on $x. copying partition table
sfdisk -d /dev/sdb | sfdisk $x > /dev/null 2>&1
echo bringing disk back online
mdadm /dev/md1 -a ${x}6
echo waiting for resilver to complete
until mdadm --detail -t /dev/md1 > /dev/null ; do echo -n . ; sleep 1 ; done
echo resilvering complete. Starting next disk in 5 seconds
sleep 5
done
GRUB2: Still NDG
Today (yesterday?) I foolishly decided to try again with GRUB2 svn to /boot off RAID6 after being told by one of its devs that my distro’s version was way out of date. That didn’t go so well. I guess it’s still not ready for prime-time. I did, however, manage to port most of Ubuntu’s debian package over to the new version of the source, so if anyone wants to pick up where I left off, just ask for it.
Oh, did I mention that I hosed my OS installation and needed to start over? Well, I hosed my OS installation and needed to start over. What a pain.
Phew
My server finally has a backup system. Well, we’ll have to see whether the backup cron job fires off tonight, but aside from that, it seems to be working. The code is available in our GitHub repo. Now I’m off to document that setup.
