Showing posts with label NVMe. Show all posts

Tuesday, 15 January 2019

Custom Intel X299 Workstation: Intel VROC RAID 1 NVMe WinSat Disk Score

We just finished a custom build for a client of ours in the US.

The machine is extremely fast but quiet.


After kicking the tires a bit with Windows 10 Pro 64-bit and some software installs post burn-in, we get the following performance out of the Intel NVMe RAID 1 pair:

C:\Temp>winsat disk
Windows System Assessment Tool
> Running: Feature Enumeration ''
> Run Time 00:00:00.00
> Running: Storage Assessment '-ran -read -n 0'
> Run Time 00:00:00.77
> Running: Storage Assessment '-seq -read -n 0'
> Run Time 00:00:02.38
> Running: Storage Assessment '-seq -write -drive C:'
> Run Time 00:00:01.64
> Running: Storage Assessment '-flush -drive C: -seq'
> Run Time 00:00:00.45
> Running: Storage Assessment '-flush -drive C: -ran'
> Run Time 00:00:00.38
> Dshow Video Encode Time                      0.00000 s
> Dshow Video Decode Time                      0.00000 s
> Media Foundation Decode Time                 0.00000 s
> Disk  Random 16.0 Read                       1020.73 MB/s          8.8
> Disk  Sequential 64.0 Read                   3203.52 MB/s          9.3
> Disk  Sequential 64.0 Write                  1456.24 MB/s          8.8
> Average Read Time with Sequential Writes     0.090 ms          8.8
> Latency: 95th Percentile                     0.146 ms          8.9
> Latency: Maximum                             0.316 ms          8.9
> Average Read Time with Random Writes         0.058 ms          8.9
> Total Run Time 00:00:05.91

The machine is destined for a surveying company that's getting into high-end image and video work with drones.

All in all, we are very happy with the build and we're sure they will be too!

Philip Elder
Microsoft High Availability MVP
MPECS Inc.
Co-Author: SBS 2008 Blueprint Book
www.s2d.rocks !
Our Web Site
Our Cloud Service

Friday, 11 January 2019

Some Thoughts on the S2D Cache and the Upcoming Intel Optane DC Persistent Memory

Intel has a very thorough article that explains what happens when the workload data volume on a Storage Spaces Direct (S2D) Hyper-Converged Infrastructure (HCI) cluster starts to "spill over" from an NVMe/SSD cache to the HDD capacity drives.

Essentially, any workload data that needs to be shuffled over to the hard disk layer will suffer a performance hit and suffer it big time.

In a setup where we would have either NVMe PCIe Add-in Cards (AiCs) or U.2 2.5" drives for cache and SATA SSDs for capacity, the performance hit would not be as drastic, but it would still be felt depending on workload IOPS demands.

So, what do we do to make sure we don't shortchange ourselves on the cache?

We baseline our intended workloads using Performance Monitor (PerfMon).

Here is a previous post that has an outline of what we do along with links to quite a few other posts we've done on the topic: Hyper-V Virtualization 101: Hardware and Performance

We always try to have the right amount of cache in place not only for the workloads of today but also for the workloads of tomorrow, across the solution's lifetime.
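As a rough illustration of the baselining step, here is a minimal sketch (our own illustration, not a Microsoft tool) that takes a PerfMon CSV export of the Disk Transfers/sec counter and reports the peak, 95th percentile, and mean IOPS. The sample data and column layout are assumptions for the example:

```python
import csv
import io
import statistics

def summarize_iops(csv_text: str) -> dict:
    """Summarize a PerfMon CSV export of a Disk Transfers/sec counter.

    PerfMon CSV exports put a timestamp in column 0 and one column per
    counter; here we assume a single counter in column 1.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    samples = sorted(float(r[1]) for r in rows[1:] if r[1])
    # 95th percentile: the value below which ~95% of samples fall.
    idx = max(0, int(round(0.95 * len(samples))) - 1)
    return {
        "peak": max(samples),
        "p95": samples[idx],
        "mean": statistics.mean(samples),
    }

# Hypothetical five-sample export for illustration only.
export = (
    '"(PDH-CSV 4.0)","\\\\NODE1\\PhysicalDisk(_Total)\\Disk Transfers/sec"\n'
    '"01/11/2019 09:00:01","1200.5"\n'
    '"01/11/2019 09:00:02","3400.0"\n'
    '"01/11/2019 09:00:03","2100.7"\n'
    '"01/11/2019 09:00:04","5600.2"\n'
    '"01/11/2019 09:00:05","1800.3"\n'
)
print(summarize_iops(export))
```

Sizing the cache against the 95th percentile rather than the average keeps the everyday peaks on the fast tier without buying for a once-a-year spike.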

S2D Cache Tip

TIP: When looking to set up a S2D cluster, we suggest running with a higher count of smaller cache drives versus just two larger-capacity ones.

Why?

For one, we get a lot more bandwidth/performance out of three or four cache devices versus two.

Secondly, in a 24-drive 2U chassis, if we start off with four cache devices and lose one, we still maintain a decent cache-to-capacity ratio (1:6 with four versus 1:8 with three).
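The ratio arithmetic above can be sketched quickly (a toy calculation using the drive counts from this example, with 24 capacity drives behind the cache):

```python
def cache_ratio(cache_drives: int, capacity_drives: int) -> str:
    """Express the cache-to-capacity drive ratio as 1:N."""
    return f"1:{capacity_drives / cache_drives:g}"

print(cache_ratio(4, 24))  # start with four cache devices
print(cache_ratio(3, 24))  # after losing one of four
print(cache_ratio(1, 24))  # losing one of only two leaves a single cache device
```

Starting with only two cache devices, a single failure leaves every capacity drive bound to one cache device, which is exactly the situation the higher starting count avoids.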

Here are some starting points based on a 2U S2D node setup we would look at putting into production.

  • Example 1 - NVMe Cache and HDD Capacity
    • 4x 400GB NVMe PCIe AiC
    • 12x xTB HDD (some 2U platforms can do 16 3.5" drives)
  • Example 2 - SATA SSD Cache and Capacity
    • 4x 960GB Read/Write Endurance SATA SSD (Intel SSD D3-4610 as of this writing)
    • 20x 960GB Light Endurance SATA SSD (Intel SSD D3-4510 as of this writing)
  • Example 3 - Intel Optane AiC Cache and SATA SSD Capacity
    • 4x 375GB Intel Optane P4800X AiC
    • 24x 960GB Light Endurance SATA SSD (Intel SSD D3-4510 as of this writing)

One thing to keep in mind with a 2U server that has 12 front-facing 3.5" drives along with four or more internally mounted 3.5" drives is heat and available PCIe slots. The additional drives can also constrain which processors are able to be installed due to thermal restrictions.

Intel Optane DC Persistent Memory

We are gearing up for a lab refresh when Intel releases the "R" code Intel Server Systems R2xxxWF series platforms hopefully sometime this year.

That's the platform Microsoft used to set an IOPS record with S2D and Intel Optane DC persistent memory.

We have yet to see any type of compatibility matrix covering how/what/where Optane DC can be set up, but one should be coming soon!

It should be noted that these will probably be frightfully expensive, with the most value seen in online transaction processing setups where every microsecond counts.

TIP: Excellent NVMe PCIe AiC for lab setups that are Power Loss Protected: Intel SSD 750 Series


Intel SSD 750 Series Power Loss Protection: YES

These SSDs can be found on most auction sites with some being new and most being used. Always ask for an Intel SSD Toolbox snip of the drive's wear indicators to make sure there is enough life left in the unit for the thrashing it would get in a S2D lab! :D

Acronym Refresher

Yeah, gotta love 'em! Being dyslexic has its challenges with them too. ;)

  • IOPS: Input/Output Operations per Second
  • AiC: Add-in Card
  • PCIe: Peripheral Component Interconnect Express
  • NVMe: Non-Volatile Memory Express
  • SSD: Solid-State Drive
  • HDD: Hard Disk Drive
  • SATA: Serial ATA
  • Intel DC: Data Centre (US: Center)

Thanks for reading!

Philip Elder
Microsoft High Availability MVP
MPECS Inc.
Co-Author: SBS 2008 Blueprint Book
www.s2d.rocks !
Our Web Site
Our Cloud Service

Thursday, 10 January 2019

Server Storage: Never Use Solid-State Drives without Power Loss Protection (PLP)

Here's an article from a little while back with a very good explanation of why one should not use consumer grade SSDs anywhere near a server:

While the article points specifically to Storage Spaces Direct (S2D), it is also applicable to any server setup.

The impetus behind this post is pretty straightforward, via a forum we participate in:

  • IT Tech: I had a power loss on my S2D cluster and now one of my virtual disks is offline
  • IT Tech: That CSV hosted my lab VMs
  • Helper 1: Okay, run the following recovery steps that help ReFS get things back together
  • Us: What is the storage setup in the cluster nodes?
  • IT Tech: A mix of NVMe, SSD, and HDD
  • Us: Any consumer grade storage?
  • IT Tech: Yeah, the SSDs where the offline Cluster Storage Volume (CSV) is
  • Us: Mentions above article
  • IT Tech: That's not my problem
  • Helper 1: What were the results of the above?
  • IT Tech: It did not work :(
  • IT Tech: It's ReFS's fault! It's not ready for production!

The reality of the situation was that there was live data sitting in the volatile DRAM cache on those consumer grade SSDs that got lost when the power went out. :(

We're sure that most of us know what happens when even one bit gets flipped. Error Correction on memory is mandatory for servers for this very reason.

To lose an entire cache's worth of data across multiple drives is pretty much certain death for whatever sat on top of them.
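To make the failure mode concrete: an acknowledged write is only safe once it reaches stable media. On a PLP drive the capacitor-backed cache counts as stable; on a consumer drive it does not. A minimal sketch of a durable write in Python (the file name is arbitrary):

```python
import os
import tempfile

def durable_write(path: str, data: bytes) -> None:
    """Write data and force it out of volatile caches toward stable storage.

    os.fsync() asks the OS and the drive to flush. A drive with Power Loss
    Protection can safely acknowledge the flush from its capacitor-backed
    cache; a consumer drive holding the data in unprotected DRAM may still
    lose it on power failure even after reporting success.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's buffer to the OS
        os.fsync(f.fileno())  # push the OS cache toward the device

path = os.path.join(tempfile.gettempdir(), "plp-demo.bin")
durable_write(path, b"critical metadata")
print(open(path, "rb").read())
```

The application did everything right here; it is the drive lying about the flush that turns a power cut into a corrupted volume.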

Time to break out the backups and restore.

And, replace those consumer grade SSDs with Enterprise Class SSDs that have PLP!

Philip Elder
Microsoft High Availability MVP
MPECS Inc.
Co-Author: SBS 2008 Blueprint Book
www.s2d.rocks !
Our Web Site
Our Cloud Service

Tuesday, 23 January 2018

Storage Spaces Direct (S2D): Sizing the East-West Fabric & Thoughts on All-Flash

Lately we've been seeing some discussion around the amount of time required to resync a S2D node's storage after it has come back from a reboot for whatever reason.

Unlike a RAID controller where we can tweak rebuild priorities, S2D does not offer the ability to do so.

It is very much a good thing that the knobs and dials are not exposed for this process.

Why?

Because, there is a lot more going on under the hood than just the resync process.

While it does not happen as often anymore, there were times when someone would reach out about a performance problem after a disk had failed. After a quick look through the setup, the Rebuild Priority setting turned out to be the culprit: someone had tweaked it from its usual 30% of cycles to 50%, 60%, or even higher, thinking that the rebuild should be the priority.

S2D Resync Bottlenecks

There are two key bottleneck areas in a S2D setup when it comes to resync performance:
  1. East-West Fabric
    • 10GbE with or without RDMA?
    • Anything faster than 10GbE?
  2. Storage Layout
    • Those 7200 RPM capacity drives can only handle ~110 MB/s to ~120 MB/s sustained

The two culprits are not mutually exclusive; depending on the setup, they can play together to limit performance.
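To put numbers on those bottlenecks, here is a back-of-the-napkin estimate (our own arithmetic, not a Microsoft formula) of how long a resync could take when bounded by a single HDD's throughput versus a 10GbE link. The 10 TB figure is purely illustrative:

```python
def resync_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours to move data_tb terabytes at a sustained throughput in MB/s."""
    mb = data_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal, as drives are sold)
    return mb / throughput_mb_s / 3600

# 10 TB of out-of-date extents, bounded by one 7200 RPM capacity drive:
print(f"{resync_hours(10, 115):.1f} h at ~115 MB/s")
# The same data bounded by a single 10GbE link (~1250 MB/s line rate):
print(f"{resync_hours(10, 1250):.1f} h at 10GbE line rate")
```

In practice the resync is spread across many capacity drives and the fabric carries other traffic at the same time, so these are bounds, not predictions; they do show why the hard disk layer, not the network, is usually the long pole in a hybrid setup.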

The physical CPU setup may also come into play but that's for another blog post. ;)

S2D East-West Fabric to Node Count

Let's start with the fabric setup that the nodes use to communicate with each other and pass storage traffic along.

This is a rule of thumb that was originally born out of a conversation at an MVP Summit a number of years back with a Microsoft fellow who was in on the S2D project at the beginning. We were discussing our own Proof-of-Concept that we had put together based on a Mellanox 10GbE and 40GbE RoCE (RDMA over Converged Ethernet) fabric. Essentially, at four nodes a 40GbE RDMA fabric was _way_ too much bandwidth.

Here's the rule of thumb we use for our baseline East-West Fabric setups. Note that we always use dual-port NICs/HBAs.
  • Kepler-47 2-Node
    • Hybrid SSD+HDD Storage Layout with 2-Way Mirror
    • 10GbE RDMA direct connect via Mellanox ConnectX-4 LX
    • This leaves us the option to add one or two SX1012X Mellanox 10GbE switches when adding more Kepler-47 nodes
  • 2-4 Node 2U 24 2.5" or 12/16 3.5" Drives with Intel Xeon Scalable Processors
    • 2-Way Mirror: 2-Node Hybrid SSD+HDD Storage Layout
    • 3-Way Mirror: 3-Node Hybrid SSD+HDD Storage Layout
    • Mirror-Accelerated Parity (MAP): 4-Node Hybrid SSD+HDD Storage Layout
    • 2x Mellanox SX1012X 10GbE Switches
      • 10GbE RDMA direct connect via Mellanox ConnectX-4 LX
  • 4-7 Node 2U 24 2.5" or 12/16 3.5" Drives with Intel Xeon Scalable Processors
    • 4-7 Nodes: 3-Way Mirror: 4+ Node Hybrid SSD+HDD Storage Layout
    • 4+ Nodes: Mirror-Accelerated Parity (MAP): 4 Nodes Hybrid SSD+HDD Storage Layout
    • 4+ Nodes: Mirror-Accelerated Parity (MAP): All-Flash NVMe cache + SSD
    • 2x Mellanox Spectrum Switches with break-out cables
      • 25GbE RDMA direct connect via Mellanox ConnectX-4/5
      • 50GbE RDMA direct connect via Mellanox ConnectX-4/5
  • 8+ Node 2U 24 2.5" or 12/16 3.5" Drives with Intel Xeon Scalable Processors
    • 8+ Nodes: 3-Way Mirror: Hybrid SSD+HDD Storage Layout
    • 8+ Nodes: Mirror-Accelerated Parity (MAP): Hybrid SSD+HDD Storage Layout
    • 8+ Nodes: Mirror-Accelerated Parity (MAP): All-Flash NVMe cache + SSD
    • 2x Mellanox Spectrum Switches with break-out cables
      • 50GbE RDMA direct connect via Mellanox ConnectX-4/5
      • 100GbE RDMA direct connect via Mellanox ConnectX-4/5

Other than the Kepler-47 setup, we always have at least a pair of Mellanox ConnectX-4 NICs in each node for East-West traffic. It's our preference to separate the storage traffic from the rest.
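One way to sanity-check these fabric choices is to estimate the replication traffic a node's cache can generate under mirroring. A hedged sketch, with the per-drive write speed purely illustrative:

```python
def eastwest_gbps(cache_drives: int, drive_write_mb_s: float, mirror_copies: int) -> float:
    """Peak east-west bandwidth (Gb/s) generated when local cache writes
    are replicated to (mirror_copies - 1) other nodes."""
    remote_copies = mirror_copies - 1
    mb_s = cache_drives * drive_write_mb_s * remote_copies
    return mb_s * 8 / 1000  # MB/s -> Gb/s

# Four NVMe cache drives at an assumed ~1400 MB/s write each, 3-way mirror:
print(f"{eastwest_gbps(4, 1400, 3):.1f} Gb/s of replication traffic")
```

Under those assumptions a single node can generate nearly 90 Gb/s of replication traffic at full tilt, which is why the rule of thumb moves from 10GbE to 25GbE and beyond as node counts and cache device counts climb.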

All-Flash Setups

There's a lot of talk in the industry about all-flash.

It's supposed to solve the biggest bottleneck of them all: Storage!

The catch is, bottlenecks are moving targets.

Drop in an all-flash array of some sort and all of a sudden the storage-to-compute fabric becomes the target. Then, it's the NICs/HBAs on the storage _and_ compute nodes, and so on.

If you've ever changed a single coolant hose in an older high-miler car you'd see what I mean very quickly. ;)

IMNSHO, at this point in time, unless there is a very specific business case for all-flash and the fabric in place allows for all that bandwidth with virtually zero latency, all-flash is a waste of money.

One business case would be a cloud services vendor that wants to provide a high IOPS and vCPU solution to their clients. So long as the fabric between storage and compute can fully utilize that storage, and the market is there, the revenues generated should more than make up for the huge costs involved.

Using all-flash as a solution to a poorly written application or set of applications is questionable at best. But, sometimes, it is necessary as the software vendor has no plans to re-work their applications to run more efficiently on existing platforms.

Caveat: The current PCIe bus just can't handle it. Period.

A pair of 100Gb ports on one NIC/HBA can't be fully utilized due to the PCIe bus bandwidth limitation. Plus, we deploy with two NICs/HBAs for redundancy.

Even with the addition of more PCIe Gen 3 lanes in the new Intel Xeon Scalable Processor Family, we are still quite limited in the amount of data that can be moved about on the bus.
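The bus math behind that caveat, as a quick sketch (the Gen 3 per-lane rate and 128b/130b encoding are from the PCIe spec; protocol overhead is ignored):

```python
def pcie_gen3_gb_s(lanes: int) -> float:
    """Usable PCIe Gen 3 bandwidth in GB/s: 8 GT/s per lane with
    128b/130b encoding, ignoring packet/protocol overhead."""
    return lanes * 8 * (128 / 130) / 8  # GT/s -> GB/s per lane, times lanes

slot = pcie_gen3_gb_s(16)   # a full x16 slot
dual_100gbe = 2 * 100 / 8   # two 100Gb ports expressed in GB/s
print(f"x16 slot: ~{slot:.2f} GB/s, dual 100GbE: {dual_100gbe:.1f} GB/s")
```

A full x16 Gen 3 slot tops out around 15.75 GB/s while two 100Gb ports want 25 GB/s, so the NIC saturates the slot well before the ports are saturated.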

S2D Thoughts and PoCs

The Storage Spaces Direct (S2D) hyper-converged or SOFS-only solution set can be configured and tuned for a very specific set of client needs. That's one of its beauties.

Microsoft remains committed to S2D and its success. Microsoft Azure Stack is built on S2D, so their commitment is pretty clear.

So is ours!

Proof-of-Concept (PoC) Lab

  • S2D 4-Node for Hyper-Converged and SOFS Only
  • Hyper-V 2-Node for Compute to S2D SOFS

This is the newest addition to our S2D product PoC family: the Kepler-47 S2D 2-Node Cluster.

The Kepler-47 pictured is our first one. It's based on Dan Lovinger's concept we saw at Ignite Atlanta a few years ago. Components in this box were similar to Dan's setup.

Our second generation Kepler-47, the v2 PoC, is being built and tested now.

This new generation will have an Intel Server Board DBS1200SPLR with an E3-1270v6, 64GB ECC, Intel JBOD HBA I/O Module, TPM v2, and Intel RMM. The OS will be installed on a 32GB Transcend 2242 SATA SSD. Connectivity between the nodes will be Mellanox ConnectX-4 LX running at 10GbE with RDMA enabled.

Storage in Kepler-47 v2 would be a combination of one Intel DC P4600 Series PCIe NVMe drive for cache, two Intel DC S4600 Series SATA SSDs for the performance tier, and six HGST 6TB 7K6000 SAS or SATA HDDs for capacity. The PCIe NVMe drive will be optional due to its cost.

We already have one or two client/customer destinations for this small cluster setup.

Conclusion

Storage Spaces Direct (S2D) rocks!

We've invested _a lot_ of time and money in our Proof-of-Concepts (PoCs). We've done so because we believe the platform is the future for both on-premises and data centre based workloads.

Thanks for reading! :)

Philip Elder
Microsoft High Availability MVP
MPECS Inc.
Co-Author: SBS 2008 Blueprint Book
Our Web Site
Our Cloud Service