Hey Gris,
Thanks for the reply. I'll insert my replies inline w/ a [JH] tag.
TL;DR: I want to find a way to preserve some semblance of all the functionality I've put together in the hpsa plugin. I'm very open to suggestions once I get the code reviews up; none of it is perfect, but it has at least proven useful within one solution already, and I have line-of-sight to how I'd integrate it into Ceph.
Joe
-----Original Message-----
From: Gris Ge [mailto:***@redhat.com]
Sent: Thursday, October 22, 2015 8:48 PM
To: Handzik, Joe
Cc: libstoragemgmt-***@lists.fedorahosted.org
Subject: Re: Supporting JBOD/HBA disks in libstoragemgmt.
Post by Handzik, Joe
Resending a second time, the mailing list didn't like that I hadn't
signed up for it yet, and the links are wrong on the lsm wiki...
https://github.com/HP-Scale-out-Storage/libstoragemgmt/tree/wip-all
They implement a bunch of really useful (to me and solutions I work
with) features:
- support for exposure of the SCSI device node (some code already existed
  for this, but I now expose the node via a method call)
As libstoragemgmt also supports SAN arrays, I would suggest users use the VPD83 NAA ID like this on Linux:
RAID mode:
/dev/disk/by-id/wwn-0x<lsm.Volume.vpd83>
HBA mode:
/dev/disk/by-id/wwn-0x<lsm.Disk.id>
# Not implemented yet, but we could change HBA/JBOD disks
# to use the VPD83 NAA ID as lsm.Disk.id.
# MegaRAID provides this. hpssacli does not expose it
# yet, but it does report /dev/sdX, from which I could get
# the VPD83 NAA ID.
Meanwhile, a dual-domain SAS disk could show up as two sdX nodes in the OS; when multipath is enabled, users should not use the /dev/sdX returned by this method, but should instead look up the mpath device using the VPD83 NAA ID.
We just provide an ID that users can match against OS-level devices; that should be enough for libstoragemgmt.
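For example, a user could match that ID back to OS devices like this. Just a sketch: it assumes the VPD83 NAA ID is exposed as described above (not implemented for HBA/JBOD disks yet) and that the default udev /dev/disk/by-id rules are in place; the helper names are only for illustration.

    import glob
    import os

    def vpd83_to_block_device(vpd83):
        """Resolve a VPD83 NAA ID to its kernel block device, e.g. /dev/sdb."""
        link = "/dev/disk/by-id/wwn-0x%s" % vpd83.lower()
        return os.path.realpath(link) if os.path.exists(link) else None

    def vpd83_to_mpath_device(vpd83):
        """On a multipath setup, prefer the dm device over a single /dev/sdX."""
        links = glob.glob("/dev/disk/by-id/dm-uuid-mpath-*%s" % vpd83.lower())
        return os.path.realpath(links[0]) if links else None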
[JH] - I'm not opposed to pulling in the VPD83 data; I actually think that's a good idea. I also understand your concern about dual-domain disks...unfortunately, lsm is equally susceptible to that issue today, correct? For our purposes we're just reporting device topology, so I feel it should be left up to the user to understand that they're in that sort of multipathing config. I think lsm being able to expose that data would actually be very valuable. My goal for lsm would be to avoid needing other tools (take a look at my patches...I'm using both lsscsi and sg_ses within the hpsa plugin in my branch, with the goal of accomplishing the sort of matching I think you're describing).
Post by Handzik, Joe
- support for exposure of the SCSI generic node
When users need 'SCSI pass-through', they could use /dev/bsg/<sdX_scsi_id>.
The sysfs path is /sys/block/sdb/device/bsg, so there should be a udev way to do this.
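A sketch of that lookup (assuming the sg driver is loaded; the function name is just for illustration):

    import os

    def scsi_nodes_of(sd_name):
        """Return (/dev/bsg/<H:C:T:L>, /dev/sgN) for a block device like 'sdb'."""
        dev_link = "/sys/block/%s/device" % sd_name
        # The device symlink resolves to the SCSI address, e.g. '1:0:0:0'.
        hctl = os.path.basename(os.path.realpath(dev_link))
        # scsi_generic/ holds the matching sg name, e.g. 'sg1'.
        sg_entries = os.listdir(os.path.join(dev_link, "scsi_generic"))
        sg_node = ("/dev/%s" % sg_entries[0]) if sg_entries else None
        return "/dev/bsg/%s" % hctl, sg_node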
[JH] - I don't fully understand the harm in what I'm exposing, though. As long as the output is well understood, it's no more troublesome than most Linux tools to date. With my implementation, the user can avoid invoking those other tools separately. That's the win for my purposes, along with C and Python bindings to query the data.
Post by Handzik, Joe
- support for discovery of a SEP associated with a disk, and caching of
  the SEP's SCSI generic node (leveraged for features below)
I believe a SES plugin could do so.
[JH] - Yes, it certainly can...again, not sure why we couldn't put it in the hpsa plugin too.
Post by Handzik, Joe
- support for disk's SAS address
lsm.Disk represents a physical disk, so we should allow each disk to have two SAS addresses, or, more precisely, two port IDs, since a disk might be SATA/FC/NVMe.
[JH] - Smart Array today will not expose both ports. I'm willing to go back to the drawing board on this one a bit if necessary.
Post by Handzik, Joe
- support for disk's port/bay/box information (some code already
  existed for this, but I'm querying via a mechanism that doesn't
  involve hpssacli)
I will check what I can do via SES.
[JH] - I know sg_ses can pull some of this, but we have ready-made ways to get this data out of Smart Array tools and drivers today. I'd like to use those if possible, if for no other reason than to provide a quick turnaround.
Post by Handzik, Joe
- support to enable/disable physical disk IDENT LEDs via sg_ses
Can you rename these to follow the <noun>+<verb> convention:
disk_ident_led_set()
disk_ident_led_clear()
[JH] - Definitely can! I'll try to remember to do that before submitting a patch.
Post by Handzik, Joe
- support to enable/disable physical disk FAULT LEDs via sg_ses
Is there any use case where we need the fault LED?
[JH] - Sure, lots. For an integrated solution, if you want to communicate to the user that a drive should be replaced (from your software stack's perspective), you could enable the FAULT LED through the Python and C APIs instead of cobbling together a bunch of bash scripting.
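To make that concrete, a solution could do something like the following. Purely illustrative: disk_fault_led_set() follows your naming suggestion above and doesn't exist in lsm today, the URI depends on the plugin, and the health check is a placeholder for the stack's own logic.

    import lsm

    def my_stack_says_disk_is_failing(disk):
        # Placeholder for whatever health/replacement logic the stack uses.
        return False

    conn = lsm.Client("hpsa://")           # plugin URI is an assumption here
    for disk in conn.disks():
        if my_stack_says_disk_is_failing(disk):
            conn.disk_fault_led_set(disk)  # hypothetical, proposed API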
Post by Handzik, Joe
- support for exposure of volume's SCSI device node
- support for exposure of volume's SCSI generic node
Check above. It's not libstoragemgmt's job to handle OS stuff.
[JH] - I still don't understand this. This data is valuable for multiple solutions that I work with. I'm just implementing a passthrough mechanism and exposing what other tools already collect...just removing the collection responsibility from other portions of the application. I'm open to reworking this stuff, but I don't want to throw this functionality away; I'll 100% use it in Ceph, and it's already being used in early Lustre software today.
Post by Handzik, Joe
- support to enable/disable volume's IDENT LEDs via hpssacli
Just wanted to get that out there to make sure we don't duplicate
work. If you haven't started much implementation work yet, I'd
I am working on the open source SES plugin; it should be ready in 2 weeks.
https://github.com/cathay4t/libstoragemgmt/tree/ses
So no overlap yet.
[JH] - I saw that, looks good! I'll be interested to take it for a spin when it's done. Given that it's a native C implementation, would there be python bindings for it? If not, that'd be a dealbreaker for use in a couple of solutions that I can think of.
Post by Handzik, Joe
prefer that you take a look at my patches as they're submitted and we
can rework the implementation if necessary. The code I have here is
already in use by a teammate of mine for a Lustre solution, and I will
be using this code within Ceph in the near future (that work hasn't
started yet, but is my next task after upstreaming everything you see
in wip-all).
Thanks. I will try to merge them and open a PR for review.
[JH] - I have lots of cleanup to do; the code sprawled quite a bit on me and it's over 2k lines...gotta chop it down and clean it up. I'm optimistic that we can work in what I need for some higher-level solution work!
Best regards.
--
Gris Ge