Properly identifying SAN LUNs

I frequently work in large SAN environments, and I always verify the identity of any SAN disks (LUNs) I receive before writing data to them. The rule is trust but verify: accidentally overwriting critical data in a shared storage environment is disastrous.

SAN LUNs are virtual disks presented from a storage appliance to our AIX systems. The storage appliance is responsible for data integrity (e.g. scrubbing), redundancy (e.g. RAID), and often replication to backup systems or offsite to DR. A centralized storage solution is generally preferred over each system maintaining its own RAID array and related features.

This guide assumes new LUNs have been presented to an AIX system, and the administrator wants to confirm the LUNs before writing data to them. Once the LUNs have been positively identified, they can be used with confidence.

Characteristics of a LUN

LUNs should be identified by:

Device name

Dynamically generated in order of discovery (e.g. hdisk12), and subject to change if the device table is cleared and re-scanned.

Storage array

WWN of the storage array providing the LUN, which matters when there is more than one array (e.g. 0x5005082810000bdb).

LUN Serial Number or Unique ID

The serial number is typically a long hexadecimal string, visible both in AIX and on the storage server UI (e.g. 332136005046810810023C200000000000004).

Size

The expected size of the LUN, generally requested at allocation time.

Paths

Each HBA connected to the SAN can contribute one or more paths from the AIX host to the storage array. Generally there are 4 or more paths for redundancy, depending on the type of array.

PVID

The 16-digit hexadecimal AIX LVM Physical Volume ID shown by lspv. Any AIX disk belonging to a volume group has a PVID, and PVIDs should be unique across systems, with few exceptions.

Device Names

SAN LUNs in AIX are classified as disk devices and typically listed as an hdiskX device in the device table (e.g. from lsdev -Cc disk).

Some older device drivers (e.g. IBM's original SDD, EMC's PowerPath) used alternate device names like vpathX or hdiskpowerX respectively, but most systems use AIX's native MPIO now.

% lsdev -Cc disk
hdisk0  Available 01-00-00 SAS RAID 0 Disk Array
hdisk1  Available 01-00-00 SAS RAID 0 Disk Array
hdisk2  Available 03-01-01 MPIO IBM 2076 FC Disk
hdisk3  Available 03-01-01 MPIO IBM 2076 FC Disk

That doesn't tell us much, as AIX hdisk device names are just numbered in the order discovered. We do know that we have some local SAS disks, as well as Fibre Channel (FC) LUNs.

When troubleshooting a SAN, it's common to run rmdev -dRl fcsX to purge the SAN-attached HBAs (fcsX) along with their paths and LUNs, then re-scan with cfgmgr and confirm the result. This clears counters and dead paths, but the hdisk and fcs device numbers may change! The new numbering depends on the device discovery order and on whether any new hardware has been added.

Do not rely on the device names in backup scripts or other automated processes. They can change over the lifetime of a system.
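One way to avoid hard-coding hdisk names is to resolve the name from the LUN serial number each time a script runs. The following is a minimal sketch, not a supported tool: find_hdisk_by_serial is a hypothetical helper name, it reads lspv -u style output on standard input, and it assumes the serial appears as a substring of the unique ID column.

```shell
# find_hdisk_by_serial SERIAL
# Hypothetical helper: reads `lspv -u` style output on stdin and prints the
# name of each disk whose unique ID (any field from the 4th onward) contains
# SERIAL. The comparison is case-insensitive.
find_hdisk_by_serial() {
    awk -v serial="$1" '
        BEGIN { serial = toupper(serial) }
        serial == "" { next }          # an empty serial must never match
        {
            # The state column can be blank, so the unique ID may be field 4
            # or field 5; check every field from the 4th to the last.
            for (i = 4; i <= NF; i++)
                if (index(toupper($i), serial) > 0) { print $1; next }
        }
    '
}

# On AIX the helper would be fed from the live device table, for example:
#   disk=$(lspv -u | find_hdisk_by_serial "$serial")
```

A script that gets back no name, or more than one, should stop rather than guess.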

Storage Array

It's important to know what storage array to expect LUNs to be served from when there is more than one array in the environment.

IBM arrays are listed by major model number (4 digits) in the device description. Other manufacturers may vary.

% lsdev -Cc disk
...
hdisk2  Available 03-01-01 MPIO IBM 2076 FC Disk
hdisk3  Available 03-01-01 MPIO IBM 2076 FC Disk

These LUNs are from an IBM 2076 V7000 storage array.

The best attributes to verify are the storage array WWN and the WWPNs of the controller ports.

AIX's MPIO shows this under the lsattr -El hdiskX attributes, as node_name and ww_name.

% lsattr -El hdisk3
PCM             PCM/friend/fcpother                                 Path Control Module              False
PR_key_value    none                                                Persistant Reserve Key Value     True+
algorithm       shortest_queue                                      Algorithm                        True+
...
node_name       0x5005079230000128                                  FC Node Name                     False
...
ww_name         0x5005079230150128                                  FC World Wide Name               False
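To compare these values against what the storage team supplies, it helps to pull out a single attribute and normalize its format. The sketch below uses two hypothetical helpers, lsattr_value and normalize_wwn. It assumes the two-column name/value layout shown above, and that the expected WWN may arrive with or without a 0x prefix or colon separators.

```shell
# lsattr_value ATTR
# Hypothetical helper: reads `lsattr -El hdiskX` output on stdin and prints
# the value column of the first attribute named ATTR.
lsattr_value() {
    awk -v attr="$1" '$1 == attr { print $2; exit }'
}

# normalize_wwn WWN
# Lower-cases the hex digits, then strips an optional leading "0x" and any
# colons, so values copied from different tools compare equal.
normalize_wwn() {
    printf '%s\n' "$1" | tr 'A-F' 'a-f' | sed -e 's/^0[xX]//' -e 's/://g'
}

# On AIX the comparison might look like this ($expected is supplied by the
# storage administrator):
#   actual=$(lsattr -El hdisk3 | lsattr_value node_name)
#   [ "$(normalize_wwn "$actual")" = "$(normalize_wwn "$expected")" ]
```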

If provided with specific field names, lspath can also show the WWPNs of the storage array:

% lspath -F status:name:parent:connection
Enabled:hdisk1:fscsi0:5005079230132bc3,1000000000000
Enabled:hdisk2:fscsi0:5005079230132bc3,2000000000000
Enabled:hdisk3:fscsi0:5005079230132bc3,3000000000000
Enabled:hdisk4:fscsi0:5005079230132bc3,4000000000000
...
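The distinct storage ports can be pulled from that output and checked against the list of expected array WWPNs. This is a sketch only: list_target_wwpns is a hypothetical helper, and it assumes the FC connection format shown above (WWPN, then a comma, then the LUN ID). Other disk types may use a different connection format.

```shell
# list_target_wwpns
# Hypothetical helper: reads `lspath -F status:name:parent:connection` output
# on stdin and prints each distinct storage port WWPN once, sorted.
list_target_wwpns() {
    awk -F: '
        NF >= 4 {
            # The connection field looks like "WWPN,LUN-ID";
            # keep only the part before the comma.
            split($4, conn, ",")
            if (conn[1] != "") print conn[1]
        }
    ' | sort -u
}

# On AIX:
#   lspath -F status:name:parent:connection | list_target_wwpns
```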

LUN Serial Number

The LUN serial number is the most important identifier of any LUN. Even with duplicate data and PVIDs (e.g. a snapshot), the serial numbers will be unique! They are assigned by the storage controller when the LUN is created, and are guaranteed to be unique across all LUNs from that controller.

The LUN serial number is the only trustworthy method of LUN identification.

AIX's MPIO stores the serial number as part of the unique_id attribute of lsattr -El hdiskX:

% lsattr -El hdisk1
PCM             PCM/friend/fcpother                                 Path Control Module              False
PR_key_value    none                                                Persistant Reserve Key Value     True+
...
unique_id       33213600507293081006E280000000000000204214503IBMfcp Unique device identifier         False
ww_name         0x5005072930150ba8                                  FC World Wide Name               False

The serial number is also in the output of lspv -u:

% lspv -u
hdisk1          00cdb8105a9b1234                    rootvg          active      2C020220343123454ec010ad002ad84100000001nvme
hdisk3          none                                None                        332136005066810820069E00000000000025304214503IBMfcp                  f4993d47-0379-7984-98ac-1ccc6f034ac0
hdisk4          none                                None                        332136005066810820069E00000000000025404214503IBMfcp                  0bd599ab-426e-ebee-f8a0-5d4a80e1e9d8

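The example above has an NVMe drive and two SAN LUNs. These LUNs have a serial number in the fifth column, and an AIX-specific UUID in the sixth column. For interoperability, only rely on the fifth column serial number. Note that the fourth (state) column is blank for disks that are not in an active volume group, so a script splitting on whitespace will see the serial number as the fourth field on those rows.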

If the system is an older version of AIX (e.g. 5.3), or uses a non-IBM driver, the lsattr command may not list the unique id and lspv -u may not be available.

In that case, you can query the ODM for the value using odmget and grep for the value. Note that grep -p matches paragraphs, so the first grep matches all attributes for the hdisk, and the second grep matches the unique id.

% odmget CuAt | grep -p hdisk10 | grep -p uniq
CuAt:
        name = "hdisk10"
        attribute = "unique_id"
        value = "332136005012340810062A20000000000000404214503IBMfcp"
        type = "R"
        generic = "D"
        rep = "nl"
        nls_index = 79

AIX often adds prefix and suffix information to the LUN serial number depending on the manufacturer and connection type. The prefix is often the WWPN of the storage server, while the suffix will be related to the connection type. Generally the digits that distinguish one LUN from another sit in the middle of the id.

For example:

332136005066810820069E00000000000025404214503IBMfcp
     ^^^^^^^^^^^^^^^^^         ^^^^^^    ^^^^^^^^^^
     Server WWPN/WWNN          LUN ID    IBM suffix

The storage controller or server often has many ports, so this number may vary. The IBM suffix is fairly static for a given storage type.

The LUN ID of "0254" holds the distinguishing digits, and should match what the storage administrator sees on their console.

When requesting new LUNs, always have the storage administrator return a list of LUN serial numbers to confirm in lspv -u after scanning them with cfgmgr.
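That confirmation step can be scripted. The sketch below is an illustration under stated assumptions rather than a finished tool: check_expected_serials is a hypothetical helper, the expected serials are assumed to sit one per line in a plain text file, and a match means the serial appears as a substring of the unique ID (since AIX wraps it in a prefix and suffix).

```shell
# check_expected_serials EXPECTED_FILE
# Hypothetical helper: EXPECTED_FILE holds one serial per line. `lspv -u`
# style output is read from stdin. Prints "FOUND serial disk" or
# "MISSING serial" for each expected serial, in file order.
# Exit status: 0 all found, 1 at least one missing, 2 no serials were read.
check_expected_serials() {
    awk -v file="$1" '
        BEGIN {
            # Load the expected serials, upper-cased, skipping blanks and dups
            while ((getline line < file) > 0) {
                s = toupper(line)
                gsub(/[ \t\r]/, "", s)
                if (s != "" && !(s in found)) { found[s] = 0; order[++n] = s }
            }
            close(file)
        }
        {
            # The unique ID may be field 4 or 5, so scan from field 4 onward
            for (i = 4; i <= NF; i++)
                for (k = 1; k <= n; k++)
                    if (index(toupper($i), order[k]) > 0) {
                        found[order[k]] = 1
                        disk[order[k]] = $1
                    }
        }
        END {
            if (n == 0) { print "no expected serials read"; exit 2 }
            bad = 0
            for (k = 1; k <= n; k++) {
                s = order[k]
                if (found[s]) print "FOUND " s " " disk[s]
                else { print "MISSING " s; bad = 1 }
            }
            exit bad
        }
    '
}

# On AIX, after cfgmgr has discovered the new LUNs:
#   lspv -u | check_expected_serials /tmp/expected_serials.txt
```

The file path in the usage comment is only a placeholder. Any MISSING line means the request and the presented LUNs do not agree, and nothing should be written until that is resolved.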

LUN Size

The LUN size is another good data point to help identify a LUN. It can be hard to find before adding the LUN to a volume group.

Some LUNs will have a size_in_mb attribute in lsattr -El hdiskX, however it's not consistent across device types:

% lsattr -El hdisk0
PCM             PCM/friend/sisarray                                       Path Control Module           False
algorithm       fail_over                                                 Algorithm                     False
...
size_in_mb      571292                                                    Size in Megabytes             False

getconf is supported and works consistently across device types. It reports the size in megabytes:

% getconf DISK_SIZE /dev/hdiskX
571292

Previously the bootinfo command was often used to find disk sizes, but that was deprecated by IBM in 2018 [1].

Some proprietary drivers present the size information with other LUN details, but these are the supported AIX commands.
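Storage requests are usually made in gigabytes, so the megabyte figure needs converting before it can be compared with the request. A small sketch follows. The helper name mb_to_gb is hypothetical, and it assumes 1024 MB per GB, which may differ from how a given storage console rounds its sizes.

```shell
# mb_to_gb MB
# Hypothetical helper: converts a size in megabytes (as printed by
# `getconf DISK_SIZE`) to gigabytes with one decimal place,
# assuming 1024 MB per GB.
mb_to_gb() {
    awk -v mb="$1" 'BEGIN { printf "%.1f\n", mb / 1024 }'
}

# On AIX:
#   mb_to_gb "$(getconf DISK_SIZE /dev/hdisk3)"
```

For the 571292 MB disk shown earlier this works out to roughly 557.9 GB.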

Paths

It's important to confirm that the number of paths to a LUN meets the expectations of the environment. At minimum, two paths are required for redundancy. Path counts add up quickly when connecting to multiple storage controllers across multiple switches.

Note that on AIX a path is from one client WWPN to one server WWPN. There can be multiple paths traversing the same HBA port, physical cable, and SAN switch port.

Many drivers warn that path operations degrade beyond a large number of paths. I generally recommend 16 paths or fewer per LUN, though the optimum number varies by storage vendor and driver.

To check paths, use AIX's lspath:

% lspath
Enabled hdisk0 fscsi0
Enabled hdisk0 fscsi1
...

Paths are displayed with their state, the hdisk device, and the parent adapter. Path states can include:

Enabled

This path is up and functional.

Failed

This path is down, and no IO is passing. It could recover.

Missing

A missing path was previously known, and was in a Failed state when a reboot occurred. At startup it was marked Missing. Missing paths do not automatically recover, and must be cleared or re-detected with cfgmgr.
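Counting paths by eye gets tedious with many LUNs. The sketch below totals the paths per disk and how many of them are Enabled. summarize_paths is a hypothetical helper, and it assumes the default three-column lspath output shown above.

```shell
# summarize_paths
# Hypothetical helper: reads plain `lspath` output (status, disk, parent) on
# stdin and prints "disk total enabled" for each disk, in order of first
# appearance.
summarize_paths() {
    awk '
        NF >= 3 {
            if (!($2 in total)) order[++n] = $2   # remember first-seen order
            total[$2]++
            if ($1 == "Enabled") enabled[$2]++
        }
        END {
            for (i = 1; i <= n; i++) {
                d = order[i]
                # "+ 0" prints 0 rather than an empty string when no path
                # for this disk is Enabled
                print d, total[d], enabled[d] + 0
            }
        }
    '
}

# On AIX:
#   lspath | summarize_paths
```

A disk whose two numbers differ has Failed or Missing paths, and a total below the expected count points at a zoning or mapping problem.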

PVID

Finally, the AIX-specific Physical Volume Identifier (PVID) is useful for identifying LUNs. lspv -u shows all the PVIDs in its second column:

% lspv -u
hdisk1          00cdb8105a9b1234                    rootvg          active      2C020220343123454ec010ad002ad84100000001nvme
hdisk3          none                                None                        332136005066810820069E00000000000025304214503IBMfcp                  f4993d47-0379-7984-98ac-1ccc6f034ac0
hdisk4          none                                None                        332136005066810820069E00000000000025404214503IBMfcp                  0bd599ab-426e-ebee-f8a0-5d4a80e1e9d8

New LUNs should always have no PVID. Full stop!

If the LUN has a PVID, it may be in use on another system. Verify with the SAN administrator using the LUN serial number that the LUN is not mapped to any other system.

There are valid cases where a LUN may already have a PVID, like a SAN snapshot, HA cluster, or VIO server. However this document focuses on confirming new LUNs for a system.

Writing to a LUN with an existing PVID risks destroying data that another system depends on. Go back and double-check the configuration.
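A script can refuse to continue when a supposedly new LUN already carries a PVID. This is a minimal sketch: has_pvid is a hypothetical helper, and it assumes the lspv layout shown above, where the second column is either a PVID or the word "none".

```shell
# has_pvid DISK
# Hypothetical helper: reads `lspv` (or `lspv -u`) output on stdin.
# Exit status: 0 if DISK has a PVID, 1 if its PVID column is "none",
# 2 if DISK is not listed at all.
has_pvid() {
    awk -v disk="$1" '
        $1 == disk { seen = 1; if ($2 != "none") pvid = 1 }
        END {
            if (!seen) exit 2
            exit (pvid ? 0 : 1)
        }
    '
}

# On AIX, as a guard before extendvg or mkvg:
#   if lspv | has_pvid hdisk4; then
#       echo "hdisk4 already has a PVID, stopping" >&2
#       exit 1
#   fi
```

Only an exit status of 1 means the disk was found and has no PVID; the guard in the usage comment does not catch status 2, so a real script should also treat that as a reason to stop.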

Summary of recommendations

  • Always ask for a list of LUN serial numbers when making a new LUN request of SAN administrators.

  • Never rely on device names (e.g. hdisk12) in scripts or documentation. Use the serial number instead.

  • Always confirm new hdisk devices for SAN LUNs match the serial numbers expected before writing any data to them (e.g. with extendvg).

  • A new LUN should never have a PVID. If it does, there is a problem.