Thursday, 5 April 2018

Monitoring VDO Volumes

My previous post showed you how to get deduplication working on Linux with VDO. In some ways, that's the post that could cause trouble - if you start using vdo across a number of hosts, how can you easily establish monitoring or even alerting?

So that's the problem we're going to focus on in this post.

Monitoring

There are a heap of different ways to monitor systems, but the rising star currently is Prometheus. Historically, I've used monitoring systems that require clients to push data to a central server, but Prometheus turns this around. With Prometheus, data collection is initiated by the Prometheus server itself - it's called a 'scrape' job. This approach simplifies client configuration and management, which is a huge bonus for large installations.

To make vdo data available, we need an exporter. The exporter provides an HTTP endpoint that the Prometheus server will scrape metrics from. There are a heap of exporters available to Prometheus covering a plethora of different subsystems, but since vdo is new there isn't something you can just pick up and run with. Well that was the case...

vdo_exporter Project

The scrape job simply issues a GET request to the "/metrics" HTTP API endpoint on a host. Developing an API endpoint for this in python is fairly straightforward, and given the metrics themselves are all nicely grouped together under sysfs, it seemed a bit of a no-brainer to develop an exporter. My exporter can be found here. The project's repo contains the python code, a systemd unit file and what I hope is a sensible README file documenting how to install the exporter (if you have a firewall active, remember to open port 9286!)
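To give a feel for how little is involved, here's a minimal sketch of the idea. This is not the vdo_exporter code, just an illustration that assumes each statistic is a single numeric value in its own file under /sys/kvdo/<vol_name>/statistics. The "vdo_" metric prefix, the "vol_name" label and the function names are made up for the example.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: one file per statistic, laid out as <root>/<volume>/statistics/<name>
SYSFS_ROOT = "/sys/kvdo"


def collect(root=SYSFS_ROOT):
    """Return every numeric statistic as Prometheus text exposition lines."""
    lines = []
    if not os.path.isdir(root):
        return ""
    for volume in sorted(os.listdir(root)):
        stats_dir = os.path.join(root, volume, "statistics")
        if not os.path.isdir(stats_dir):
            continue
        for stat in sorted(os.listdir(stats_dir)):
            try:
                with open(os.path.join(stats_dir, stat)) as f:
                    raw = f.read().strip()
                float(raw)  # only keep values Prometheus can parse as a number
            except (OSError, ValueError):
                continue  # skip unreadable or non-numeric entries
            # hypothetical naming: "vdo_" prefix plus a vol_name label
            lines.append('vdo_%s{vol_name="%s"} %s' % (stat, volume, raw))
    return ("\n".join(lines) + "\n") if lines else ""


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer the scrape job's GET /metrics request."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = collect().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9286):
    """Listen on all interfaces and serve metrics until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() on a vdo host and pointing a browser at port 9286 under /metrics should return one line per statistic. A real exporter would also want metric type/help metadata and some error handling.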

I'm leaving the installation of the exporter as an exercise for the reader, and will use the rest of this article to show you how to quickly stand up prometheus and grafana to collect and visualise the vdo statistics. For this example, I'm again using Fedora so for other distributions you may have to tweak 'stuff'.

Containers to the Rescue!

The prometheus and grafana projects both provide docker images on docker hub, so assuming you already have docker installed on your machine you can grab the images with the following;

docker pull docker.io/prom/prometheus
docker pull docker.io/grafana/grafana:4.6.3

Containers are inherently stateless, but for monitoring and dashboards we need to make sure that these containers use either different docker volumes, or persist data to the host's filesystem. For this exercise, I'll be exposing some directories on the host's filesystem (change these to suit!)

mkdir -p /opt/docker/grafana-prom/{etc,data}
chown 104 /opt/docker/grafana-prom/{etc,data}
chgrp 107 /opt/docker/grafana-prom/{etc,data}
mkdir -p /opt/docker/grafana-prom/prom-{etc,data}
chown 65534 /opt/docker/grafana-prom/prom-{etc,data}
chgrp 65534 /opt/docker/grafana-prom/prom-{etc,data}


To launch the containers and manage them as a unit, I'm using "docker-compose" - so if you don't have that installed, talk nicely to your package manager :)

Assuming you have docker-compose available, you just need a compose file (docker-compose.yml) to bring the containers together;

version: '2'

services:
  grafana:
    image: docker.io/grafana/grafana:4.6.3
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - /opt/docker/grafana-prom/etc:/etc/grafana:Z
      - /opt/docker/grafana-prom/data:/var/lib/grafana:Z
    depends_on:
      - prometheus
  prometheus:
    image: docker.io/prom/prometheus
    container_name: prometheus
    network_mode: "host"
    ports:
      - "9090:9090"
    volumes:
      - /opt/docker/grafana-prom/prom-etc:/etc/prometheus:Z
      - /opt/docker/grafana-prom/prom-data:/prometheus:Z

With the directories in place for the persistent data within the containers, and the compose file ready you just need to start the containers. Run the docker-compose command from the directory that holds your docker-compose.yml file.

[root@myhost grafana_prom]# docker-compose up -d
Creating network "grafanaprom_default" with the default driver
Creating prometheus ...
Creating prometheus ... done
Creating grafana ...
Creating grafana ... done

Configuring Prometheus

You should already have the vdo_exporter service running on your hosts that are using vdo, so the next step is to create a scrape job in prometheus to tell it to go and fetch the data. This is done by editing the prometheus.yml file - in my case this is in /opt/docker/grafana-prom/prom-etc.  Under the scrape_configs section add something like this to collect data from your vdo host(s)

# VDO Information
- job_name: "vdo_stats"
  static_configs:
    - targets: [ '192.168.122.98:9286']

Now reload Prometheus to start the data collection
[root@myhost grafana_prom]# docker exec -it prometheus sh
/prometheus $ kill -SIGHUP 1 

Configuring Grafana

To visualize the vdo statistics that Prometheus is collecting, Grafana needs two things; the data source definition pointing to the prometheus container, and a dashboard that presents the data.

  1. Log in to your grafana instance (http://localhost:3000), using the default credentials (admin/admin)
  2. Click on the Grafana icon in the top left, and select Data Sources
  3. Click the "Add data source" button
  4. Enter the prometheus details (and ensure you set the data source as the default)



  5. The grafana directory in the vdo_exporter project holds a file called VDO_Information.json. This json file is the dashboard definition, so we need to import it.
    • Click on the grafana icon again, highlight the Dashboards entry, then select the import option from the pop-up menu.
    • Click on the "Upload .json File" button, and pick the VDO_Information.json file to upload.
  6. Now select the dashboard icon (to the right of the Grafana logo), and select "VDO Information". You should then see something like this

  7. As you add more hosts that are vdo enabled, just add the host's ip to the prometheus scrape configuration and reload prometheus. Simples..
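If the list of vdo hosts grows, you may prefer to generate the scrape job rather than edit it by hand. Here's a small sketch that renders the same stanza shown earlier for any number of hosts; the function name is made up, and the port defaults to the exporter's 9286.

```python
def vdo_scrape_job(hosts, port=9286, job_name="vdo_stats"):
    """Render a scrape job entry for the scrape_configs section of prometheus.yml."""
    # each target is quoted 'host:port', matching the hand-written example
    targets = ", ".join("'%s:%d'" % (host, port) for host in hosts)
    return (
        "# VDO Information\n"
        '- job_name: "%s"\n'
        "  static_configs:\n"
        "    - targets: [%s]\n" % (job_name, targets)
    )


if __name__ == "__main__":
    print(vdo_scrape_job(["192.168.122.98", "192.168.122.99"]))
```

Paste the output under scrape_configs (keeping the indentation consistent with the rest of the file), then reload prometheus as before.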

Grafana provides a notifications feature which enables you to define threshold based alerting. You could define a trigger for low "physical space" conditions, or alert based on recovery being active - I leave that up to you! Grafana supports a number of different notification endpoints including PagerDuty, Sensu and even email! So take some time and review the docs to see how Grafana could best integrate into your environment.


And Remember...

VDO is not the proverbial "silver bullet". The savings from any compression and deduplication technology are dependent on the data you're storing, and vdo is no different. Also, each vdo volume requires additional RAM, so if you want to move vdo out of the test environment into production you'll need to plan for additional CPU and RAM to "make the magic happen"™.


Wednesday, 4 April 2018

Shrinking Your Storage Requirements with VDO

Whether you're using proprietary storage arrays or software defined storage, the actual cost of capacity can sometimes provoke responses like, "why do you need all that space?" or "OK, but that's all the storage you're going to get, so make it last".

The problem is that storage is a commodity resource; it's like toner or ink in a printer. When you run out, things will stop and lots of people tend to lose their sense of humor. Controlling storage growth has been going on for over 10 years in the proprietary storage space, with one of the most successful companies being NetApp, who introduced data deduplication with their ASIS (Advanced Single Instance Storage) feature back in 2007. The message was that if you wanted to reduce storage consumption, you basically had to buy the more expensive "stuff" in the first place.

This was the "status quo" until Red Hat acquired Permabit in mid 2017...now compression and deduplication features are heading towards a Linux server near you!

That's the history lesson, now let's look at how you can kick the tyres on open source based compression and deduplication. For the remainder of this article, I'll walk through the steps you need to quickly get "dedupe" up and running with Fedora.


Installation

Since we're just testing, create a vm and install Fedora 27. Use libvirt, parallels, virtualbox...whatever takes your fancy - or maybe just use a cloud image in AWS. The choice is yours! Just try to ensure the vm has something like; 2 vcpus, 4GB RAM, an OS disk (20GB) and a data disk for vdo testing.

Once installed you'll need to enable an additional repository to pick up the vdo deduplication modules (kvdo - kernel virtual data optimizer)

dnf copr enable rhawalsh/dm-vdo
dnf install vdo kmod-kvdo
depmod

Configuration

In my test environment, I'm using a 20g vdisk for my vdo testing.
[root@f27-vdo ~]# lsblk
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda    252:0    0   4G  0 disk 
└─vda1 252:1    0   4G  0 part /
vdb    252:16   0  20G  0 disk 

Now with the kvdo module in place, let's create a vdo volume of 100G using the 20G /dev/vdb device

[root@f27-vdo ~]# vdo create --name=vdo0 --device=/dev/vdb \
--vdoLogicalSize=100g
Creating VDO vdo0
Starting VDO vdo0
Starting compression on VDO vdo0
VDO instance 0 volume is ready at /dev/mapper/vdo0

Not exactly complicated :) Couple of things worth noting though;
  • by default new volumes are created with compression and deduplication enabled. If you don't like that you can play with the  --compression or --deduplication flags.
  • a vdo volume is actually a device mapper device, in this case /dev/mapper/vdo0. It's this 'dm' device that you'll use from here on in.

Usage

Now you have a vdo volume, next step is to get it deployed and understand how to report on space savings. The first thing is filesystem formatting. Make sure you use the -K switch to avoid issuing discards, remember a vdo volume is in effect a thin provisioned volume.

[root@f27-vdo ~]# mkfs.xfs -K /dev/mapper/vdo0

With the filesystem in place, the next step would normally be updating fstab...right? Well not this time. For vdo volumes, the boot time startup sequence between fstab and the vdo service is a problem - so we need to use a mount service to ensure vdo volumes are mounted correctly. 
The vdo rpm provides a sample mount service definition (/usr/share/doc/vdo/examples/systemd/VDO.mount.example). For this example, I'm going to mount the vdo volume at /mnt/vdo0

mkdir /mnt/vdo0
cp /usr/share/doc/vdo/examples/systemd/VDO.mount.example /etc/systemd/system/mnt-vdo0.mount

Then update the mount unit to look like this
[Unit]
Description = Mount filesystem that lives on VDO0
name = mnt-vdo0.mount
Requires = vdo.service systemd-remount-fs.service
After = multi-user.target
Conflicts = umount.target

[Mount]
What = /dev/mapper/vdo0
Where = /mnt/vdo0
Type = xfs
Options = discard

[Install]
WantedBy = multi-user.target

Reminder: mount services are named to reflect their intended mount location within the filesystem.
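The naming rule is mechanical: drop the leading slash, turn the remaining slashes into dashes, and escape anything awkward (including a literal dash in the path). Here's a sketch of that rule as I understand it from systemd's path escaping; the function name is made up, and on a real system the systemd-escape tool is the authority.

```python
import string

# characters systemd leaves untouched in an escaped path
_SAFE = set(string.ascii_letters + string.digits + ":_.")


def mount_unit_name(mount_point):
    """Derive the systemd mount unit name for an absolute mount point."""
    parts = [p for p in mount_point.split("/") if p]  # drop empty segments
    if not parts:
        return "-.mount"  # the root filesystem is a special case
    path = "/".join(parts)
    escaped = []
    for i, ch in enumerate(path):
        if ch == "/":
            escaped.append("-")
        elif ch in _SAFE and not (i == 0 and ch == "."):
            escaped.append(ch)
        else:
            # everything else becomes a C-style \xNN escape, byte by byte
            escaped.extend("\\x%02x" % b for b in ch.encode("utf-8"))
    return "".join(escaped) + ".mount"


if __name__ == "__main__":
    print(mount_unit_name("/mnt/vdo0"))
```

For /mnt/vdo0 this gives mnt-vdo0.mount, which is the file name used above. A path containing a dash, such as /mnt/my-vol, needs the dash escaped, which is why it's easier to avoid dashes in mount points.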

Now reload systemd, enable the mount and start it
systemctl daemon-reload
systemctl enable mnt-vdo0.mount
systemctl start mnt-vdo0.mount
[root@f27-vdo ~]# df -h /mnt/vdo0
Filesystem         Size Used Avail Use% Mounted on
/dev/mapper/vdo0   100G 135M 100G    1% /mnt/vdo0

At this point you've used the vdo command to create the volume, but there is also a command to look at the volume's statistics called vdostats. To give us something to look at I copied the same 200MB disk image to the volume 20 times, which will also help to explain vdo overheads.

[root@f27-vdo ~]# df -h /mnt/vdo0
Filesystem        Size  Used Avail Use% Mounted on
/dev/mapper/vdo0  100G  4.5G   96G   5% /mnt/vdo0

[root@f27-vdo ~]# vdostats --hu vdo0
Device               Size   Used   Available   Use% Space saving%
vdo0                20.0G   4.2G       15.8G    21%           95%

Wait a minute...at a logical layer, the filesystem says that it's 4.5G used, but at the physical vdo layer it's saying practically the same thing AND that there's a 95% saving! So which is right? The answer is both :) The vdo subsystem persists metadata on the volume (lookup maps etc), which accounts for a chunk of the physical space used, and the savings value is derived purely from the logical blocks "in" and the physical, unique blocks written. If you need to understand more you can dive into the sysfs filesystem. 
Each vdo volume stores and maintains statistics under /sys/kvdo/<vol_name>/statistics (which is where vdostats gets its information from!)

The most useful stats I've found to understand how space is consumed are;

  • overhead_blocks_used : metadata for the volume. The overhead is proportional to the physical size of the volume; for example, on an 8TB device, the overhead was around 9GB
  • data_blocks_used: this is the count of the physical blocks consumed by user data
  • logical_blocks_used: the count of blocks consumed at the filesystem level

In my case, the "overhead_blocks_used" was 4GB, and the "data_blocks_used" around 200MB. The savings% value is derived from the ratio data_blocks_used / logical_blocks_used, since it only applies to actual user data written to the volume: roughly 200MB of physical blocks backing over 4GB of logical blocks is a ratio of about 5%, which equates to a saving of around 95%. Now it makes sense!
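The arithmetic can be sketched in a few lines. This is my reading of how the numbers relate, not code lifted from vdostats: the saving is the share of logical blocks that didn't need a physical block of their own, and the physical "Used" figure is user data plus overhead. The block counts in the example are illustrative values chosen to resemble my test (200MB of unique data, 4GB of overhead), and the 4KB block size is an assumption.

```python
BLOCK_SIZE = 4096  # assumption: statistics are counted in 4 KiB blocks


def savings_percent(data_blocks_used, logical_blocks_used):
    """Share of logical blocks that did not need their own physical block."""
    if logical_blocks_used == 0:
        return 0.0
    saved = logical_blocks_used - data_blocks_used
    return 100.0 * saved / logical_blocks_used


def physical_used_bytes(data_blocks_used, overhead_blocks_used):
    """Physical space consumed: unique user data plus volume metadata."""
    return (data_blocks_used + overhead_blocks_used) * BLOCK_SIZE


if __name__ == "__main__":
    data = 51200        # about 200 MiB of unique data (illustrative)
    overhead = 1048576  # about 4 GiB of metadata (illustrative)
    logical = 1100000   # about 4.2 GiB written by the filesystem (illustrative)
    print("saving: %.0f%%" % savings_percent(data, logical))
    print("used:   %.1f GiB" % (physical_used_bytes(data, overhead) / 2**30))
```

With those inputs the saving works out at roughly 95% while the physical usage is still a little over 4GB, which is the same apparent contradiction seen in the vdostats output.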

Final Words

Deduplication is a complex beast, but hopefully the above will at least get you up and running with this new Linux feature.

If you decide to use vdo across a number of servers, running vdostats isn't really a viable option. For that it would be more useful to leave the command line behind and look at solutions like prometheus and grafana to track capacity usage and generate alerts. Spoiler alert!...that's the subject of my next post :)

Sunday, 10 December 2017

Want to Install Ceph, but afraid of Ansible?



There is no doubt that Ansible is a pretty cool automation engine for provisioning and configuration management. ceph-ansible builds on this versatility to deliver what is probably the most flexible Ceph deployment tool out there. However, some of you may not want to get to grips with Ansible before you install Ceph...weird right?

No, not really.


If you're short on time, or just want a cluster to try ceph for the first time, a more guided installation approach may help. So I started a project called ceph-ansible-copilot.

The idea is simple enough; wrap the ceph-ansible playbook with a text GUI. Very 1990's, I know, but now instead of copying and editing various files you simply start the copilot tool, enter the details and click 'deploy'. The playbook runs in the background within the GUI and any errors are shown there and then...no more drowning in an ocean of scary ansible output :)

The features and workflows of the UI are described in the project page's README file.

Enough rambling, let's look at how you test this stuff out. The process is fairly straightforward;
  1. configure some hosts for Ceph
  2. create the Ansible environment
  3. run copilot
The process below describes each of these steps using CentOS7 as the deployment target for Ansible and the Ceph cluster nodes.
1. Configure Some Hosts for Ceph
Call me lazy, but I'm not going to tell you how to build vm's or physical servers. To follow along, the bare minimum you need are a few virtual machines - as long as they have some disks on them for Ceph, you're all set!

2. Create the Ansible environment
Typically for a Ceph cluster you'll want to designate a host as the deployment or admin host. The admin host is just a deployment manager, so it can be a virtual machine, a container or even a real (gasp!) server. All that really matters is that your admin host has network connectivity to the hosts you'll be deploying ceph to.

On the admin host, perform these tasks (copilot needs ansible 2.4 or above)
> yum install git ansible python-urwid -y
Install ceph-ansible (full installation steps can be found here)
> cd /usr/share
> git clone https://github.com/ceph/ceph-ansible.git
> cd ceph-ansible
> git checkout master
Set up passwordless ssh between the admin host and the candidate ceph hosts
> ssh-keygen
> ssh-copy-id root@<ceph_node>
On the admin host install copilot
> cd ~
> git clone https://github.com/pcuzner/ceph-ansible-copilot.git
> cd ceph-ansible-copilot
> python setup.py install 
3. Run copilot
The main playbook for ceph-ansible is in /usr/share/ceph-ansible - this is where you need to run copilot from (it will complain if you try to run it in some other place!)
> cd /usr/share/ceph-ansible
> copilot
Then follow the UI..

Example Run
Here's a screen capture showing the whole process, so you can see what you get before you hit the command line.



The video shows the deployment of a small 3 node ceph cluster, 6 OSDs, a radosgw (for S3), and an MDS for cephfs testing. It covers the configuration of the admin host, the copilot UI and finally a quick look at the resulting ceph cluster. The video is 9mins in length, but for those of us with short attention spans, here's the timeline so you can jump to the areas that interest you.

00:00 Pre-requisite rpm installs on the admin host
01:12 Installing ceph-ansible from github
01:52 Installing copilot
02:58 Setting up passwordless ssh from the admin host to the candidate ceph hosts
04:04 Ceph hosts before deployment
05:04 Starting copilot
08:10 Copilot complete, review the Ceph hosts



What's next?
More testing...on more and varied hardware...

So far I've only tested 'simple' deployments using the packages from ceph.com (community deployments) against a CentOS target. So like I said, more testing is needed, a lot more...but for now there's enough of the core code there for me to claim a victory and write a blog post!

Aside from the testing, these are the kinds of things that I'd like to see copilot handle
  • collocation rules (which daemons can safely run together)
  • resource warnings (if you have 10 HDDs but not enough RAM, or CPU...issue a warning)
  • handle the passwordless ssh setup. copilot already checks for passwordless ssh, so instead of leaving it to the admin to resolve any issues, just add another page to the UI.
That's my wishlist - what would you like copilot to do? Leave a comment, or drop by the project on github.

Demo'd Versions
  • copilot 0.9.1
  • ceph-ansible MASTER as at December 11th 2017
  • ansible 2.4.1 on CentOS




Tuesday, 5 July 2016

De-mystifying gluster shards

Recently I've been working on converging glusterfs with oVirt - hyperconverged, open source style. oVirt has supported glusterfs storage domains for a while, but in the past a virtual disk was stored as a single file on a gluster volume. This helps some workloads, but file distribution and functions like self heal and rebalance have more work to do. The larger the virtual disk, the more work gluster has to do in one go.

Enter sharding

The shard translator was introduced with version 3.7, and enables large files to be split into smaller chunks (shards) of a user defined size. This addresses a number of legacy issues when using glusterfs for virtual machine storage - but does introduce an additional level of complexity. For example, how do you now relate a file to its shards, or vice-versa?

The great thing is that even though a file is split into shards, the implementation still allows you to relate files to shards with a few simple commands.
  
Firstly, let's look at how to relate a file to its shards;


And now, let's go the other way. We start with a shard, and end with the parent file.
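As a rough sketch of the naming scheme that links the two, written from my understanding of the shard translator rather than taken from the gluster source: the first block of a file stays in the file itself, later blocks live in a hidden .shard directory on the bricks and are named <gfid>.<n>, and the brick also keeps an entry for every GFID under .glusterfs. The function names and the example GFID are made up; on a real brick you would read the GFID from the file's trusted.gfid extended attribute.

```python
def shard_paths(gfid, file_size, shard_size):
    """List the .shard entries a fully written file could have on the bricks."""
    count = max(1, -(-file_size // shard_size))  # ceiling division
    # block 0 stays in the original file, so shard numbering starts at 1
    return [".shard/%s.%d" % (gfid, n) for n in range(1, count)]


def parent_gfid(shard_path):
    """Extract the parent file's GFID from a shard path or file name."""
    name = shard_path.rsplit("/", 1)[-1]
    gfid, _, index = name.rpartition(".")
    if not gfid or not index.isdigit():
        raise ValueError("not a shard name: %r" % shard_path)
    return gfid


def gfid_backend_path(gfid):
    """Location of a GFID's entry under a brick's .glusterfs directory."""
    return ".glusterfs/%s/%s/%s" % (gfid[0:2], gfid[2:4], gfid)


if __name__ == "__main__":
    gfid = "0d5f7c3a-1111-2222-3333-444455556666"  # hypothetical GFID
    for path in shard_paths(gfid, 10 * 2**20, 4 * 2**20):
        print(path)
    print(gfid_backend_path(parent_gfid(".shard/%s.2" % gfid)))
```

Sparse files will have fewer shards than this lists, since a shard only appears once something has been written to that region of the file.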


Hopefully this helps others getting to grips with glusterfs sharding (and maybe even oVirt!)

Sunday, 29 May 2016

Making gluster play nicely with others

These days hyperconverged strategies are everywhere. But when you think about it, sharing the finite resources within a physical host requires an effective means of prioritisation and enforcement. Luckily, the Linux kernel already provides an infrastructure for this in the shape of cgroups, and the interface to these controls is now simplified with systemd integration.

So let's look at how you could use these capabilities to make Gluster a better neighbour in a collocated or hyperconverged model.

First, some common systemd terms we should be familiar with;
slice : a slice is a concept that systemd uses to group together resources into a hierarchy. Resource constraints can then be applied to the slice, which defines 
  • how different slices may compete with each other for resources (e.g. weighting)
  • how resources within a slice are controlled (e.g. cpu capping)
unit : a systemd unit is a resource definition for controlling a specific system service
NB. More information about control groups with systemd can be found here

In this article, I'm keeping things simple by implementing a cpu cap on glusterfs processes. Hopefully, the two terms above are big clues, but conceptually it breaks down into two main steps;
  1. define a slice which implements a CPU limit
  2. ensure gluster's systemd unit(s) start within the correct slice.
So let's look at how this is done.

Defining a slice

Slice definitions can be found under /lib/systemd/system, but systemd provides a neat feature where /etc/systemd/system can be used to provide local "tweaks". This override directory is where we'll place a slice definition. Create a file called glusterfs.slice, containing;

[Slice]
CPUQuota=200%

CPUQuota is our means of applying a cpu limit on all resources running within the slice. A value of 200% defines a 2 cores/execution threads limit.
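Since the value is just a percentage of a single core, converting a core count into a slice definition is a one-liner. A small sketch (the function names are made up) that produces the same file contents as above:

```python
def cpu_quota(cores):
    """CPUQuota value for a cap of `cores` CPUs (100% equals one full core)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return "%d%%" % round(cores * 100)


def slice_unit(cores):
    """Contents of a .slice file that caps CPU at the given number of cores."""
    return "[Slice]\nCPUQuota=%s\n" % cpu_quota(cores)


if __name__ == "__main__":
    print(slice_unit(2), end="")
```

Fractions work too, so cpu_quota(3.5) gives 350%, or three and a half cores.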

Updating glusterd


Next step is to give gluster a nudge so that it shows up in the right slice. If you're using RHEL7 or CentOS7, cpu accounting may be off by default (you can check in /etc/systemd/system.conf). This is OK, it just means we have an extra parameter to define. Follow these steps to change the way glusterd is managed by systemd

# cd /etc/systemd/system
# mkdir glusterd.service.d
# echo -e "[Service]\nCPUAccounting=true\nSlice=glusterfs.slice" > glusterd.service.d/override.conf

glusterd is responsible for starting the brick and self heal processes, so by ensuring glusterd starts in our cpu limited slice, we capture all of glusterd's child processes too. Now the potentially bad news...this 'nudge' requires a stop/start of gluster services. If you're doing this on a live system you'll need to consider quorum, self heal etc. However, with the settings above in place, you can get gluster into the right slice by;

# systemctl daemon-reload
# systemctl stop glusterd
# killall glusterfsd && killall glusterfs
# systemctl daemon-reload
# systemctl start glusterd


You can see where gluster is within the control group hierarchy by looking at its runtime settings

# systemctl show glusterd | grep slice
Slice=glusterfs.slice
ControlGroup=/glusterfs.slice/glusterd.service
Wants=glusterfs.slice
After=rpcbind.service glusterfs.slice systemd-journald.socket network.target basic.target

or use the systemd-cgls command to see the whole control group hierarchy

├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 19
├─glusterfs.slice
│ └─glusterd.service
│   ├─ 867 /usr/sbin/glusterd -p /var/run/glusterd.pid --log-level INFO
│   ├─1231 /usr/sbin/glusterfsd -s server-1 --volfile-id repl.server-1.bricks-brick-repl -p /var/lib/glusterd/vols/repl/run/server-1-bricks-brick-repl.pid
│   └─1305 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/lib/glusterd/glustershd/run/glustershd.pid -l /var/log/glusterfs/glustershd.log
├─user.slice
│ └─user-0.slice
│   └─session-1.scope
│     ├─2075 sshd: root@pts/0  
│     ├─2078 -bash
│     ├─2146 systemd-cgls
│     └─2147 less
└─system.slice

At this point gluster is exactly where we want it! 

Time for some more systemd coolness ;) The resource constraints that are applied by the slice are dynamic, so if you need more cpu, you're one command away from getting it;

# systemctl set-property glusterfs.slice CPUQuota=350%

Try the 'systemd-cgtop' command to show the cpu usage across the complete control group hierarchy.

Now if jumping straight into applying resource constraints to gluster is a little daunting, why not test this approach with a tool like 'stress'. Stress is designed to simply consume components of the system - cpu, memory, disk. Here's an example .service file which uses stress to consume 4 cores

[Unit]
Description=CPU soak task

[Service]
Type=simple
CPUAccounting=true
ExecStart=/usr/bin/stress -c 4
Slice=glusterfs.slice

[Install]
WantedBy=multi-user.target

Now you can tweak the service, and the slice with different thresholds before you move on to bigger things! Use stress to avoid stress :)

And now the obligatory warning. Introducing any form of resource constraint may result in unexpected outcomes, especially in hyperconverged/collocated systems - so adequate testing is key.

With that said...happy hacking :)




Tuesday, 26 April 2016

Using LIO with Gluster

In the past, gluster users have been able to open up their gluster volumes to iSCSI using the tgt daemon. This has been covered on other blogs and is also documented on gluster.org.

But tgt has been superseded in more recent distros by LIO. LIO provides a number of different local storage options to be utilised as SCSI targets, including; FILEIO, BLOCK, PSCSI and RAMDISK. These SCSI targets are implemented as modules in kernel space, but what isn't immediately obvious is that LIO also provides a userspace framework called TCMU. TCMU enables userspace files to become iSCSI targets.

With LIO, the easiest way to exploit gluster as an iSCSI target was through the FILEIO 'storage engine' over FUSE. However, the high number of context switches incurred within FUSE is likely to reduce the performance potential to your 'client' -  especially for random I/O access patterns.

Until now, FUSE was your only option. But Andy Grover at Red Hat has just changed things. Andy has developed tcmu-runner which utilises the TCMU framework, allowing a glusterfs target to be used over gluster's libgfapi interface. Typically, with libgfapi you can expect less context switching, and improved performance.

For those like me, with short attention spans, here's what the improvement looked like when I compared LIO/FUSE with LIO/gfapi using a couple of fio  based workloads.

Read Improvement
Mixed Workload Improvement

In both charts, IOPS and latency improve significantly using LIO/GFAPI, and further still by adopting the arbiter volume.

As you can see, for a young project, these results are really encouraging. The bad news is that to try tcmu-runner you'll need to either build systems based on Fedora F24/rawhide or compile it yourself from the github repo. Let's face it, there's always a price to pay for new shiny stuff :)

For the remainder of this article, I'll walk through the configuration of LIO and the iSCSI client that I used during my comparisons.

Preparing Your Environment

In the interests of brevity, I'm assuming that you know how to build servers,  create a gluster trusted pool and define volumes. Here's a checklist of the tasks you should do in order to prepare a test environment;
  1. build 3 Fedora24 nodes and install gluster (3.7.11) on each peer/node
  2. on each node, ensure /etc/glusterfs/glusterd.vol contains the following setting - option rpc-auth-allow-insecure on. This is needed for gfapi access. Once added, you'll need to restart glusterd.
  3. install targetcli (targetcli-2.1.fb43-1) and tcmu-runner (tcmu-runner-1.0.4-1) on each of your gluster nodes
  4. form a gluster trusted pool, and create a replica 3 volume or replica with arbiter volume (or both!) 
  5. issue "gluster vol set <vol_name> server.allow-insecure on" to enable libgfapi access to the volume
There are several ways to configure the iSCSI environment, but for my tests I adopted the following approach;
  • two of my three gluster nodes will be iSCSI gateways (LIO targets)
  • each gateway will have its own iqn (iSCSI Qualified Name)
  • each gateway will only access the gluster volume from itself, so if gluster is down on this node so is the path for any attached client (makes things simple)
  • high availability for the LUN is provided by client side multipathing
Before moving on, you can confirm that targetcli/tcmu-runner are providing the gluster integration by simply running 'ls' from the targetcli.

# targetcli ls
o- / ...............
  o- backstores ....
  | o- block .......
  | o- fileio ......
  | o- pscsi .......
  | o- ramdisk .....
  | o- user:glfs ...    <--- gluster gfapi available through tcmu
  | o- user:qcow ...
  o- iscsi .........
  o- loopback ......
  o- vhost ......

With the preparation complete, let's configure the LIO gateways.

Configuring LIO - Node 1

The following steps provide an example configuration. You'll need to make changes to naming etc specific to your test environment.

  1. Mount the volume (called iscsi-pool), and allocate the file that will become the LUN image
       # fallocate -l 100G mytest.img
  2. Enter the targetcli shell. The remaining steps all take place within this shell.
  3. Create the backing store connection to the glusterfs file
       /backstores/user:glfs create myLUN 100G iscsi-pool@iscsi-3/mytest.img
  4. Create the node's target portal (this is the name the client will connect to). In this example 'iscsi-3' is the node name
       /iscsi/ create iqn.2016-04.org.gluster:iscsi-3
     NB. this will create the target IQN and the iscsi portal will be enabled and listening on port 3260
  5. On the client, 'grab' its iqn from /etc/iscsi/initiatorname.iscsi, then add it to the gateway
       /iscsi/iqn.2016-04.org.gluster:iscsi-3/tpg1/acls/ create iqn.1994-05.com.redhat:14a2b41fe9e4
  6. Add the LUN, "myLUN", to the target and automatically map it to the client(s)
       /iscsi/iqn.2016-04.org.gluster:iscsi-3/tpg1/luns create /backstores/user:glfs/myLUN 0
  7. Issue saveconfig to commit the configuration (config is stored in /etc/target/saveconfig.json)
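The target and initiator names used in these steps follow the standard IQN layout: the literal "iqn", a year and month, a reversed domain name, then a colon and a free-form identifier. If you're defining several gateways, a tiny helper keeps the names consistent. This is just a sketch with a made-up function name, and it doesn't validate the inputs.

```python
def make_iqn(year, month, domain, name):
    """Build an iSCSI qualified name: iqn.<yyyy-mm>.<reversed domain>:<name>."""
    reversed_domain = ".".join(reversed(domain.lower().split(".")))
    return "iqn.%04d-%02d.%s:%s" % (year, month, reversed_domain, name)


if __name__ == "__main__":
    # one target name per gateway node, as in the configuration above
    for node in ("iscsi-1", "iscsi-3"):
        print(make_iqn(2016, 4, "gluster.org", node))
```

For node iscsi-3 this produces iqn.2016-04.org.gluster:iscsi-3, matching the target created in step 4.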

Configuring LIO - Node 2 

When a LUN is defined by targetcli, a wwn is automatically generated for it. This is neat, but to ensure multipathing works we need the LUN exported by the gateways to share the same wwn - if they don't match, the client will see two devices, not two paths to the same device.

So for subsequent nodes, the steps are slightly different.
  1. On the first node, look at /etc/target/saveconfig.json. You'll see a storage object item for the gluster file you've just created, together with the wwn that was assigned (the "wwn" field below).
       "storage_objects": [
         {
           "config": "glfs/iscsi-pool@iscsi-3/mytest.img",
           "name": "myLUN",
           "plugin": "user",
           "size": 107374182400,
           "wwn": "653e4072-8aad-4e9d-900e-4059f0e19e7e"
         }
  2. Open the targetcli shell on node 2, and define a LUN pointing to the same backing file as node 1, but this time explicitly specifying the wwn (from step 1)
       /backstores/user:glfs create myLUN 100G iscsi-pool@iscsi-1/mytest.img 653e4072-8aad-4e9d-900e-4059f0e19e7e
     (if you cd to /backstores/user:glfs and use help create you'll see a summary of the options available when creating the LUN)
  3. With the LUN in place, you can follow steps 4-7 above to create the iqn, portal and LUN masking for this node.
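Rather than picking the wwn out of the file by eye, you could pull it out with a few lines of python. This sketch assumes the saveconfig.json layout shown in the extract above (a top-level "storage_objects" list with "name" and "wwn" keys); the function name is made up.

```python
import json


def lun_wwn(saveconfig_text, lun_name):
    """Return the wwn recorded for a named storage object, or None if absent."""
    config = json.loads(saveconfig_text)
    for obj in config.get("storage_objects", []):
        if obj.get("name") == lun_name:
            return obj.get("wwn")
    return None


if __name__ == "__main__":
    with open("/etc/target/saveconfig.json") as f:
        print(lun_wwn(f.read(), "myLUN"))
```

Run it on the first gateway, and use the value it prints when creating the LUN on the other gateways.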

At this point you have;
  • 3 gluster nodes
  • a gluster volume with a file defined, serving as an iscsi target
  • 2 gluster nodes defined as iscsi gateways
  • each gateway exports the same LUN to a client (supporting multipathing)

Next up...configuring the client.

Client Configuration

To get the client to connect to your 'exported' LUN(s), you first need to ensure that the following rpms are installed on the client; device-mapper-multipath, iscsi-initiator-utils and preferably sg3_utils. With these packages in place you can move on to configure multipathing and connect to your LUN(s).
  • Multipathing : the example below shows a devices section from /etc/multipath.conf that I used to ensure my exported LUNs are seen as multipath devices. With this in place, you can take a node down for maintenance and your LUN remains accessible (as long as your volume has quorum!)
#
# LIO iSCSI
devices {
    device {
        vendor "LIO-ORG"
        path_grouping_policy "multibus"
# I tested with a path_selector of "round-robin" and "queue-length"
        path_selector "queue-length 0"
        path_checker "directio"
        prio "const"
        rr_weight "uniform"
    }
}

  • iscsi discovery/login : to log in to the gluster iscsi gateways, just use the iscsiadm command (from the iscsi-initiator-utils rpm)

# iscsiadm -m discovery -t st -p <your_gluster_node_1> -l
# iscsiadm -m discovery -t st -p <your_gluster_node_2> -l

# #check your paths are working as expected with multipath command
# multipath -ll
mpathd (36001405891b9858f4b0440285cacbcca) dm-2 LIO-ORG ,TCMU device   
size=8.0G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 33:0:0:1 sdc 8:32 active ready running
  `- 34:0:0:1 sde 8:64 active ready running
mpathb (3600140596a3a65692104740a88516aba) dm-3 LIO-ORG ,TCMU device   
size=8.0G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 33:0:0:0 sdb 8:16 active ready running
  `- 34:0:0:0 sdd 8:48 active ready running
mpathf (36001405653e40728aad4e9d900e4059f) dm-6 LIO-ORG ,TCMU device   
size=1.0G features='0' hwhandler='0' wp=rw
`-+- policy='queue-length 0' prio=1 status=active
  |- 35:0:0:0 sdf 8:80 active ready running
  `- 33:0:0:2 sdg 8:96 active ready running

You can see in this example that I have three LUNs exported, and each one has two active paths (one to each gluster node). By default, the iscsi node definition (in /var/lib/iscsi/nodes) uses a setting of node.startup=automatic, which means the LUN(s) will automagically reappear on the client following a reboot.
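
With more than a handful of LUNs, checking the path count by eye gets tedious. The sketch below summarises multipath -ll output as one line per map. count_active_paths is a hypothetical helper name, and it assumes user friendly map names (mpathX) and the "active ready running" path state shown in the output above.

```shell
# Read `multipath -ll` output on stdin and print "<map> <active path count>".
# Assumptions: map names start with "mpath" (user_friendly_names enabled),
# and healthy paths are reported as "active ready running".
count_active_paths() {
    awk '
        # A new map header: flush the previous map, then start counting again.
        /^mpath/ { if (map != "") print map, n; map = $1; n = 0 }
        # Count only the path lines that are fully healthy.
        /active ready running/ { n++ }
        END { if (map != "") print map, n }
    '
}
```

Typical use would be multipath -ll | count_active_paths, where any map reporting fewer than 2 paths deserves a closer look.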

But from the client's perspective, how do you know which LUN is from which glusterfs volume/file? For this, sg_inq is your friend...

# sg_inq -i /dev/dm-6
VPD INQUIRY: Device Identification page
  Designation descriptor number 1, descriptor length: 49
    designator_type: T10 vendor identification,  code_set: ASCII
    associated with the addressed logical unit
      vendor id: LIO-ORG
      vendor specific: 653e4072-8aad-4e9d-900e-4059f0e19e7e
  Designation descriptor number 2, descriptor length: 20
    designator_type: NAA,  code_set: Binary
    associated with the addressed logical unit
      NAA 6, IEEE Company_id: 0x1405
      Vendor Specific Identifier: 0x653e40728
      Vendor Specific Identifier Extension: 0xaad4e9d900e4059f
      [0x6001405653e40728aad4e9d900e4059f]
  Designation descriptor number 3, descriptor length: 39
    designator_type: vendor specific [0x0],  code_set: ASCII
    associated with the addressed logical unit
      vendor specific: glfs/iscsi-pool@iscsi-3/mytest.img

The final 'vendor specific' descriptor shows the configuration string you specified when you created the LUN in targetcli. If you run the same command against the underlying devices (/dev/sdf or /dev/sdg), you'll see the connection string from each of the respective gateways. Nice and easy!
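
The NAA descriptor above also explains the id that multipath reports. In this example, the WWID for mpathf is "3", then the NAA prefix 6001405, then the first 25 hex digits of the wwn from targetcli. The sketch below reproduces that mapping. It is inferred purely from the output in this post, so treat it as an observation rather than a documented guarantee. wwn_to_wwid is a hypothetical helper name.

```shell
# Derive the multipath WWID expected for a LIO LUN from its targetcli wwn.
# Assumption (taken from the sg_inq and multipath output in this post):
#   WWID = "3" + "6001405" + first 25 hex digits of the wwn (dashes removed)
# Usage: wwn_to_wwid <wwn>
wwn_to_wwid() {
    # Strip the dashes from the UUID-style wwn and keep the first 25 digits.
    hex=$(printf '%s' "$1" | sed 's/-//g' | cut -c1-25)
    printf '36001405%s\n' "$hex"
}
```

For the LUN defined earlier, wwn_to_wwid 653e4072-8aad-4e9d-900e-4059f0e19e7e gives the id shown against mpathf in the multipath -ll output, which lets you match a client device to a LUN without running sg_inq against every device.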


And Finally...

Remember, this is all shiny and new - so if you try it, expect some rough edges! However, I have to say that it looks promising, and during my tests I didn't lose any data...but YMMV :)

Happy testing!

Tuesday, 31 March 2015

Using SSL with Glusterfs

Wow - it's been a while since my last post!

Recently, I needed to configure glusterfs with SSL and found that the documentation that describes how to do it is actually pretty thin. What's annoying is that this feature has been around since 2013!

First the caveat - I'm not an expert with SSL, but I arrived at this working process after digging through mailing lists and a great article from Zbyszek Żółkiewski.

There are 8 steps to follow, so nothing too taxing :)
  1. Create the keys and certificates
     • On each node, perform the following:
       # cd /etc/ssl
       # openssl genrsa -out glusterfs.key 1024
       # openssl req -new -x509 -days 3650 -key glusterfs.key -subj /CN=<hostname> -out glusterfs.pem
     • This step creates a private key (.key) and associated certificate (.pem) on each node. I've used the hostname as the common name (CN), so each certificate is unique to each gluster node and/or client. You may opt for a different scheme, but the important thing is that the CN chosen here is reflected in step 6.
  2. Combine the pem files into a single file
     • Use scp to copy the .pem file from each node to a single node in the cluster (I'm calling it the primary host for the purpose of this article)
       # scp glusterfs.pem root@<primary-host>:/etc/ssl/<this-hostname>.pem
     • On the primary host, concatenate the files
       # cat glusterfs.pem host2.pem host3.pem > glusterfs.ca
  3. Distribute the common 'ca' file to all nodes
     • On the primary host, distribute the common CA containing the certs from all nodes/clients
       # scp /etc/ssl/glusterfs.ca root@<hostX>:/etc/ssl/.
  4. Stop the volume you want to enable SSL on
       # gluster vol stop <volume-name>
  5. Restart glusterd
       # systemctl restart glusterd
  6. Update the volume to enable SSL
       # gluster vol set <volume-name> client.ssl on
       # gluster vol set <volume-name> server.ssl on
       # gluster vol set <volume-name> auth.ssl-allow host-1,host-2,host-3
     • The comma separated list should consist of the CNs used when generating the .pem files on each host in step 1.
  7. Start the volume
       # gluster vol start <volume-name>
  8. Check SSL is enabled on the I/O path
     • Although you can use vol info to check the SSL setting is in place, the best way to confirm that SSL is actually being used is to look at one of the log files:
       # grep SSL /var/log/glusterfs/glustershd.log
       [2015-03-31 06:58:34.674091] I [socket.c:3799:socket_init] 0-vol-client-2: SSL support on the I/O path is ENABLED
       [2015-03-31 06:58:34.679316] I [socket.c:3799:socket_init] 0-vol-client-1: SSL support on the I/O path is ENABLED
       [2015-03-31 06:58:34.680784] I [socket.c:3799:socket_init] 0-vol-client-0: SSL support on the I/O path is ENABLED
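
If you have several volumes to convert, the stop/set/start sequence (steps 4 to 7) is easy to get wrong by hand. The sketch below only prints the commands for a given volume and list of CNs, so you can review them before running anything. ssl_enable_cmds is a hypothetical helper name, and the commands it prints are exactly the ones listed in the steps above.

```shell
# Print (do not execute) the commands for steps 4-7.
# Usage: ssl_enable_cmds <volume-name> <CN> [<CN> ...]
ssl_enable_cmds() {
    vol=$1
    shift
    # Join the remaining arguments (the certificate CNs) with commas,
    # which is the format auth.ssl-allow expects in step 6.
    allow=$(IFS=,; printf '%s' "$*")
    cat <<EOF
gluster vol stop $vol
systemctl restart glusterd
gluster vol set $vol client.ssl on
gluster vol set $vol server.ssl on
gluster vol set $vol auth.ssl-allow $allow
gluster vol start $vol
EOF
}
```

For example, ssl_enable_cmds myvol host-1 host-2 host-3 prints the six commands with the allow list already joined. Printing rather than executing is deliberate, since stopping a volume is disruptive.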
That's it - enjoy a more secure glusterfs!