Copyright notice: this article may be freely reproduced, provided the reproduction indicates the original source, the author, and this copyright notice via hyperlink. (Author: Zhang Hua, published: 2018-07-25)
Problem
After the customer rebooted a physical machine, some OSDs failed to start, although running the ceph-disk command manually (ceph-disk -v activate --mark-init systemd --mount /var/lib/ceph/osd/ceph-1) succeeded. The following useful lines were extracted from a large amount of unrelated log output; apparently the activation did not finish within the 120-second timeout (exit status 124 is what timeout(1) returns when the wrapped command exceeds its time limit):
May 22 06:05:19 cephosd06 systemd[1]: Starting Ceph disk activation: /dev/sdh1...
May 22 06:05:30 cephosd06 sh[3926]: main_trigger: main_activate: path = /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: get_dm_uuid: get_dm_uuid /dev/sdh1 uuid path is /sys/dev/block/8:113/dm/uuid
May 22 06:05:30 cephosd06 sh[3926]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: command: Running command: /sbin/blkid -p -s TYPE -o value -- /dev/sdh1
May 22 06:05:30 cephosd06 sh[3926]: mount: Mounting /dev/sdh1 on /var/lib/ceph/tmp/mnt.0xG2_W with options noatime,inode64
May 22 06:05:30 cephosd06 sh[3926]: command_check_call: Running command: /bin/mount -t xfs -o noatime,inode64 -- /dev/sdh1 /var/lib/ceph/tmp/mnt.0xG2_W
May 22 06:05:30 cephosd06 sh[3926]: command_check_call: Running command: /bin/mount -o noatime,inode64 -- /dev/sdh1 /var/lib/ceph/osd/ceph-45
May 22 06:05:30 cephosd06 ceph-osd[8052]: starting osd.45 at :/0 osd_data /var/lib/ceph/osd/ceph-45 /var/lib/ceph/osd/ceph-45/journal
May 22 06:05:31 cephosd06 sh[6944]: main_trigger: main_trigger: Namespace(cluster='ceph', dev='/dev/sdh1', dmcrypt=None, dmcrypt_key_dir='/etc/ceph/dmcrypt-keys', func=<function main_trigger at 0x7f574b3fc7d0>, log_stdout=True, prepend_to_path='/usr/bin', prog='ceph-disk', setgroup=None, setuser=None, statedir='/var/lib/ceph', sync=True, sysconfdir='/etc/ceph', verbose=True)
May 22 06:05:31 cephosd06 sh[6944]: command_check_call: Running command: /bin/chown ceph:ceph /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /sbin/blkid -o udev -p /dev/sdh1
May 22 06:05:31 cephosd06 sh[6944]: main_trigger: trigger /dev/sdh1 parttype 4fbd7e29-9d25-41b8-afd0-062c0ceff05d uuid ff8f7341-1c1e-4912-b680-41fd6999fcc8
May 22 06:05:31 cephosd06 sh[6944]: command: Running command: /usr/sbin/ceph-disk --verbose activate /dev/sdh1
May 22 06:07:20 cephosd06 systemd[1]: ceph-disk@dev-sdh1.service: Main process exited, code=exited, status=124/n/a
May 22 06:07:20 cephosd06 systemd[1]: Failed to start Ceph disk activation: /dev/sdh1.
Cause
Several factors cause or aggravate this problem:
- ceph-disk (https://github.com/ceph/ceph/blob/jewel/src/ceph-disk/ceph_disk/main.py#L2966) itself calls ceph-osd@{osd_id}.service once with '--runtime' ('--runtime' means the ceph-osd service will not be started automatically after the machine reboots). So only ceph-disk@.service needs to be enabled in systemd; ceph-osd@.service does not need to be enabled (enabling ceph-osd@.service as well makes concurrency problems more likely) - see the systemctl sketch after this list.
- Second, the timeout in ceph-disk@.service is 120 seconds, which is rather small (upstream also realized this value is too small - https://github.com/ceph/ceph/pull/15585); problems easily appear when a large number of ceph-disk@.service instances run at boot. The value can be raised locally, as sketched after the unit excerpts below.
- Finally, the ceph charm on top can also cause problems (https://bugs.launchpad.net/charm-ceph-osd/+bug/1779828).
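For example, the unit enablement state on the node from the log above can be checked and corrected with plain systemctl commands (a minimal sketch; osd.45 and /dev/sdh1 are taken from the log and are only illustrative):
# ceph-disk@.service should be enabled so that disk activation runs at boot
systemctl is-enabled ceph-disk@dev-sdh1.service
systemctl enable ceph-disk@dev-sdh1.service
# ceph-osd@.service should not be enabled permanently; ceph-disk itself enables it with --runtime during activation
systemctl is-enabled ceph-osd@45.service
systemctl disable ceph-osd@45.service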
The relevant parts of the two units look like this:
vi systemd/ceph-disk@.service
ExecStart=/bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'

vi systemd/ceph-osd@.service
ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph
ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i
Restart=on-failure
StartLimitInterval=30min
StartLimitBurst=30
RestartSec=20s
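If a ceph-disk build with the larger upstream timeout is not available, the hard-coded 120-second limit in the ceph-disk@.service unit shown above can be raised with a systemd drop-in that overrides ExecStart (a sketch only; 10000 seconds is used here to match the much larger value adopted upstream):
mkdir -p /etc/systemd/system/ceph-disk@.service.d
cat << 'EOF' > /etc/systemd/system/ceph-disk@.service.d/override.conf
[Service]
ExecStart=
ExecStart=/bin/sh -c 'timeout 10000 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
EOF
systemctl daemon-reload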
The ceph-disk trigger flow (from partition creation to OSD activation)
1, First ceph creates the journal partition with the typecode 45b0969e-9b03-4f30-b4c6-b4b80ceff106 (the journal size comes from the config value below):
ceph-osd --cluster=ceph --show-config-value=osd_journal_size
uuid=$(uuidgen)
num=2
sgdisk --new=${num}:0:+128M --change-name=${num}:"ceph journal" --partition-guid=${num}:${uuid} --typecode=${num}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/loop0
udevadm settle --timeout=600
flock -s /dev/loop0 partprobe /dev/loop0
udevadm settle --timeout=600
2, The partx/partprobe command is called to update the partition table after sgdisk creates the partition; partprobe sends a udev event to the udev daemon.
3, The udev daemon calls '/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name' after receiving the udev event created by partprobe, according to the following udev rules:
./udev/95-ceph-osd.rules
# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER="ceph", GROUP="ceph", MODE="660"
./src/ceph-disk-udev
45b0969e-9b03-4f30-b4c6-b4b80ceff106)
    # JOURNAL_UUID
    # activate ceph-tagged journal partitions.
    /usr/sbin/ceph-disk -v activate-journal /dev/${NAME}
    ;;
4, So the device /dev/disk/by-partuuid/9195fa44-68ba-49f3-99f7-80d9bcb50430 will be created.
5, Then the uuid of the journal partition is written into the file /var/lib/ceph/osd/ceph-1/journal_uuid, and a symlink to the journal partition is created at /var/lib/ceph/osd/ceph-1/journal:
root@juju-332891-mitaka-ceph-0:~# ll /var/lib/ceph/osd/ceph-1/journal
lrwxrwxrwx 1 ceph ceph 58 Jun 1 02:46 /var/lib/ceph/osd/ceph-1/journal -> /dev/disk/by-partuuid/9195fa44-68ba-49f3-99f7-80d9bcb50430
root@juju-332891-mitaka-ceph-0:~# cat /var/lib/ceph/osd/ceph-1/journal_uuid
9195fa44-68ba-49f3-99f7-80d9bcb50430
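To see why a particular partition did or did not match these rules, the udev properties and rule processing can be inspected directly (a sketch; /dev/sdh1 is the device from the log in the problem section):
# Show the partition-type GUID that the rules above match on
udevadm info --query=property --name=/dev/sdh1 | grep ID_PART_ENTRY_TYPE
# Replay rule processing for the device and show which ceph-disk RUN commands would be queued
udevadm test $(udevadm info --query=path --name=/dev/sdh1) 2>&1 | grep ceph-disk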
Pseudocode describing the execution flow of ceph-disk
1, Prepare test disk
dd if=/dev/zero of=test.img bs=1M count=8096 oflag=direct
#sudo losetup -d /dev/loop0
sudo losetup --show -f test.img
sudo ceph-disk -v prepare --zap-disk --cluster ceph --fs-type xfs -- /dev/loop0
2, Clear the partition
parted --machine -- /dev/loop0 print
sgdisk --zap-all -- /dev/loop0
sgdisk --clear --mbrtogpt -- /dev/loop0
udevadm settle --timeout=600
flock -s /dev/loop0 partprobe /dev/loop0
udevadm settle --timeout=600
3, Create journal partition
ceph-osd --cluster=ceph --show-config-value=osd_journal_size
uuid=$(uuidgen)
num=2
sgdisk --new=${num}:0:+128M --change-name=${num}:"ceph journal" --partition-guid=${num}:${uuid} --typecode=${num}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/loop0
udevadm settle --timeout=600
flock -s /dev/loop0 partprobe /dev/loop0
udevadm settle --timeout=600
4, Create data partition
uuid=$(uuidgen)
sgdisk --largest-new=1 --change-name=1:"ceph data" --partition-guid=1:${uuid} --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/loop0
udevadm settle --timeout=600
flock -s /dev/loop0 partprobe /dev/loop0
udevadm settle --timeout=600
5, Format data partition
parted --machine -- /dev/loop0 print
mkfs -t xfs -f -i size=2048 -- /dev/loop0p1
6, All default mount attributes should be empty
ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
7, Mount tmp directory - https://github.com/ceph/ceph/blob/jewel/src/ceph-disk/ceph_disk/main.py#L3169
mkdir /var/lib/ceph/tmp/mnt.uCrLyH
mount -t xfs -o noatime,inode64 -- /dev/loop0p1 /var/lib/ceph/tmp/mnt.uCrLyH
restorecon /var/lib/ceph/tmp/mnt.uCrLyH
cat /proc/mounts
8, Activate - https://github.com/ceph/ceph/blob/jewel/src/ceph-disk/ceph_disk/main.py#L3192
#Get fsid and write it to the tmp file ceph_fsid (done by the activate function)
fsid=$(ceph-osd --cluster=ceph --show-config-value=fsid)
cat << EOF > /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid
$fsid
EOF
restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid
chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/ceph_fsid
#Get osd_uuid and write it to the tmp file
osd_uuid=$(uuidgen)
cat << EOF > /var/lib/ceph/tmp/mnt.uCrLyH/fsid
$osd_uuid
EOF
restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/fsid
chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/fsid
#Write magic to the tmp file
cat << EOF > /var/lib/ceph/tmp/mnt.uCrLyH/magic
ceph osd volume v026
EOF
restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/magic
chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/magic
#Get journal_uuid and write it to the tmp file
journal_uuid=$(blkid -o value -s PARTUUID /dev/loop0p2)  # or read it from 'll /dev/disk/by-partuuid/ | grep loop0p2'
cat << EOF > /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid
$journal_uuid
EOF
restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid
chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH/journal_uuid
#Create journal link
ln -s /dev/disk/by-partuuid/f15b0bc2-8462-44c3-83f3-275646923f4a /var/lib/ceph/tmp/mnt.uCrLyH/journal
#Restore file security context for tmp directory
restorecon -R /var/lib/ceph/tmp/mnt.uCrLyH
chown -R ceph:ceph /var/lib/ceph/tmp/mnt.uCrLyH
#Umount tmp directory
umount -- /var/lib/ceph/tmp/mnt.uCrLyH
rm -rf /var/lib/ceph/tmp/mnt.uCrLyH
#Modify the typecode of OSD to 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, which means READY
sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- /dev/loop0
udevadm settle --timeout=600
flock -s /dev/loop0 partprobe /dev/loop0
udevadm settle --timeout=600
udevadm trigger --action=add --sysname-match loop0
9, Start OSD daemon - https://github.com/ceph/ceph/blob/jewel/src/ceph-disk/ceph_disk/main.py#L3471
#ceph-disk -v activate --mark-init systemd --mount /dev/loop0
blkid -p -s TYPE -o value -- /dev/loop0
ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
mkdir /var/lib/ceph/tmp/mnt.GoeBOu
mount -t xfs -o noatime,inode64 -- /dev/loop0 /var/lib/ceph/tmp/mnt.GoeBOu
restorecon /var/lib/ceph/tmp/mnt.GoeBOu
umount -- /var/lib/ceph/tmp/mnt.GoeBOu
rm -rf /var/lib/ceph/tmp/mnt.GoeBOu
systemctl disable ceph-osd@3
systemctl enable --runtime ceph-osd@3
systemctl start ceph-osd@3
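Once the unit is up, it is worth confirming that the OSD actually joined the cluster (a sketch; osd id 3 follows the example above):
systemctl status ceph-osd@3
ceph osd tree | grep osd.3
ceph -s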
Reference
[1] https://bugs.launchpad.net/charm-ceph-osd/+bug/1783113