Thursday, 13 July 2017

Mellanox PPC SwitchX Update v3.6.4006

Mellanox has released a firmware update for their SwitchX switches: v3.6.4006.

We've already updated our two SX1012 switches to v3.6.3508 as per our blog post Mellanox Prep for RoCE RDMA. That means that we'll be able to upgrade without any intermediary steps as per the section Upgrade From Previous Releases.

When looking into the Release Notes for the new firmware version we see:


Note that in our case we are running ConnectX-3 Pro (MCX354A) adapters. So, we'll be keeping firmware 2.40.5030 on those NICs until such time as Mellanox lets us know that we are able to bump them up to 2.40.7000.


Looking in the Changes and New Features section, there doesn't seem to be anything specific to us. However, there are quite a few items listed for versions between v3.6.3508 and v3.6.4006!

There are a few items in the General Known Issues section that we need to be aware of.
  • Point 32: Statistics files are reset which means graphs get reset.
  • Point 49 indicates that a faulty cable may cause other ports to delay their "rise". 
  • Point 50 is important. 40GbE passive copper cables 5m in length may experience "rise" issues if connected to a third party 40GbE NIC.
  • Point 93: Break-out Cables
    • Odd ports might suffer from Tx drops even when global flow control is enabled.
      Set the egress pool to 8M using the following command:
      “pool ePool0 direction egress-mc size 8M type dynamic”.
  •  Point 128: QoS: ETS does not work on SN2100 switch system.
I suggest checking out the Bug Fixes section near the end of the document. ;)

Philip Elder
Microsoft High Availability MVP
MPECS Inc.
Co-Author: SBS 2008 Blueprint Book
Our Cloud Service
Twitter: @MPECSInc

Tuesday, 11 July 2017

ALPS “Touch Pad Diagnostics” Pop-Up Fix

One of our newly set up Windows 10 Enterprise 64-bit Toshiba Tecra Z50-C laptops was throwing the following pop-up seemingly at random:

image

ALPS Touch Pad Diagnostics

Collect Diagnostics
Press the button to collect the result to log file.

The pop-up got to be quite annoying very quickly.

Fix Steps

A quick search turned up the following:

1: Elevate a CMD
2: Copy and Paste the following into CMD:
reg add HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\SynTP\Parameters\Debug /v DumpKernel /d 00000000 /t REG_DWORD /f
3: Reboot
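
For anyone needing to push this to more than one laptop, here is a minimal Python sketch that builds the same `reg add` command as an argument list. The key, value name, and data come straight from the command above; the function names `build_reg_add` and `apply_fix` are our own for illustration. It defaults to a dry run and only executes on Windows, where it would still need an elevated prompt.

```python
import subprocess
import sys

# Registry key taken from the command in step 2 above.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\SynTP\Parameters\Debug"


def build_reg_add(key=KEY, value="DumpKernel", data="00000000"):
    """Return the argument list for the `reg add` command shown above."""
    return ["reg", "add", key, "/v", value, "/d", data, "/t", "REG_DWORD", "/f"]


def apply_fix(dry_run=True):
    """Print the command; run it only on Windows when dry_run is False."""
    argv = build_reg_add()
    print(" ".join(argv))
    if not dry_run and sys.platform == "win32":
        # Must be started from an elevated prompt, as in step 1.
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    apply_fix(dry_run=True)
```

A reboot is still required afterwards, as per step 3.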

Conclusion

Note that the above registry command is all on one line. Once the value was set in the registry the problem went away.

As always, make sure to test the fix first!


Monday, 10 July 2017

VELCRO Ties: It’s not easy being green!

This spring and now summer I’ve been spending a lot of time in the garden: getting things ready for planting, planting, and now keeping the weeds at bay and the plants from being thirsty!

Well, on one of the trips into one of our Canadian Tire big box hardware stores I stumbled across these:

image

Yeah, okay, so what, you might say?

Well, that 45' roll was $6 before taxes. $6!

Needless to say, I purchased a good number of rolls while I was there to use as wire ties in our Proof-of-Concept (PoC) systems plus others.

I personally don't have a problem with the colour green. ;)


Thursday, 22 June 2017

Disaster Preparedness: KVM/IP + USB Flash = Recovery. Here’s a Guide

One of the lessons the school of hard knocks taught us was to be ready for anything. _Anything_

One client space had such a messed up HVAC system that the return air system, which is in the suspended ceiling, didn’t work because most of the vent runs up there were not capped properly. We had quite a few catastrophic failures at that site, as the server closet’s temperatures were always high but especially so in the winter. Fortunately, they moved out of that space a few years ago.

Since we began delivering our standalone Hyper-V solutions, Storage Spaces Direct (S2D) Hyper-Converged or SOFS clusters, and Scale-Out File Server and Hyper-V clusters into data centres around the world, we have discovered that there is a distinct difference in data centre service quality. As always, “buyer beware” and “we get what we pay for”.
So, what does that mean?

It means that, whether on-premises (“premises” not “premise”), hybrid, or all-in the Cloud we need to be prepared.
For this post, let us focus in on being ready for a standalone system or a cluster node failure.

There are four very important keys to being ready:

  1. Systems warranties
    1. 4-Hour Response in a cluster setting
    2. Next Business Day (NBD) for others is okay
  2. KVM over IP (KVM/IP) is a MUST
    1. Intel RMM
    2. Dell iDRAC Enterprise
    3. iLO Advanced
    4. Others …
  3. Bootable USB Flash Drive (blog post)
    1. 16GB flash drive
    2. NTFS formatted ACTIVE
    3. Most current OS files set up and maintained
    4. Keep those .WIM files up to date (blog post)
  4. Either Managed UPS or PDU
    1. Gives us the ability to power cycle the server’s power supply or supplies.

In our cluster settings, we have our physical DC (blog post)  set up for management. We can use that as our platform to get to the KVM/IP to begin our repair and/or recover processes.

This is what the back of our (S2D) cluster looks like now:

image

The top flash drive is what we have been using for the last few years. A Kingston DTR3 series flash drive.

The bottom one is a Corsair VEGA. We have also been trying out the SanDisk Cruzer Fit that is even smaller than that!

The main reason for the change is to remove that fob sticking off the back or front of the server. In addition, the VEGA or Fit have such a small profile we can ship them plugged in to the server(s) and not worry about someone hitting them once the server is in production.

Here is a quick overview of what we do in the event of a problem:

  1. Log on to our platform management system
  2. Open the RMM web page
  3. Log on
  4. Check the baseboard management controller logs (BMC/IPMI logs page)
    1. If the logs indicate a hard-stop then we’re on to initiating a warranty replacement
    2. If it is in the OS somewhere then we can either fix or rebuild
  5. Rebuild
    1. Cluster: Clean up domain AD, DNS, DHCP, and Evict OLD Node from Cluster
    2. Reset or Reboot
    3. Function Key to Boot Menu
    4. Boot flash
    5. Install OS to OS partition/SSD
    6. Install and configure drivers
    7. Cluster: Join Domain
      1. Server name would be the same as downed node
      2. Update Kerberos Constrained Delegation
    8. Install and configure the Hyper-V and/or File Services Roles
    9. Set up networking
    10. Standalone: Import VMs
    11. Cluster: Join and Live Migrate on then off to test
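
The rebuild list mixes steps that apply only to a cluster node, only to a standalone host, or to both. As a rough sketch, the same list can be kept as data and filtered per deployment type so that nothing gets missed at 3 a.m. The step wording is condensed from the list above; the `REBUILD_STEPS` structure and the `checklist` function are our own naming for illustration only.

```python
# Rebuild steps condensed from the list above.
# Scope is "cluster", "standalone", or "both".
REBUILD_STEPS = [
    ("cluster", "Clean up domain AD, DNS, DHCP, and evict OLD node from cluster"),
    ("both", "Reset or reboot"),
    ("both", "Function key to boot menu"),
    ("both", "Boot flash"),
    ("both", "Install OS to OS partition/SSD"),
    ("both", "Install and configure drivers"),
    ("cluster", "Join domain (same server name, update Kerberos Constrained Delegation)"),
    ("both", "Install and configure the Hyper-V and/or File Services roles"),
    ("both", "Set up networking"),
    ("standalone", "Import VMs"),
    ("cluster", "Join cluster and Live Migrate on then off to test"),
]


def checklist(deployment):
    """Return the ordered steps that apply to 'cluster' or 'standalone'."""
    if deployment not in ("cluster", "standalone"):
        raise ValueError("deployment must be 'cluster' or 'standalone'")
    return [step for scope, step in REBUILD_STEPS if scope in ("both", deployment)]


if __name__ == "__main__":
    for number, step in enumerate(checklist("cluster"), start=1):
        print(f"{number}. {step}")
```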

Done.

The key reason for the RMM and flash drive? We just accomplished the above without having to leave the shop.

And, across the entire life of the solution if there is a hiccup to deal with we’re dealing with it immediately instead of having to travel to the client site, data centre, or third party site to begin the process.

One more point: This setup allows us to deploy solutions all across the world so long as we have an Internet connection at the server’s site.

There is absolutely no reason to deploy servers without an RMM or desktops/laptops/workstations without vPro. None. Nada. Zippo.

When it comes to getting in to the network, we can use the edge to VPN in or have an IP/MAC filtered firewall rule with RDP inbound to our management platform. One should _never_ open the firewall to a listening RDP port no matter what port it would be listening on.


Wednesday, 21 June 2017

S2D: Mellanox Prep – Connecting to the Switches via SSHv2

To update the Mellanox switch pair we can use the Web interface.

 image

Or, we could use the command line.

But, just what utility should we use to do that?

For us, we’ve been using Tera Term for the longest time now.

It’s a simple utility to use and best of all, it’s free!

Now, it took a _long_ while to figure out how to get in via SSHv2! Note that in our case the two switches we have in our Proof-of-Concept (PoC) setup were demo units we ended up purchasing.

During the first SSH connection the switch should prompt to run the Mellanox configuration wizard, which it did for us as seen below:

image

Here are the steps we took to get connected:

  • HOSTNAME: 14 Characters or less (NETBIOS)
  • IP Address: Static or DHCP Reserved
  • DNS: Set up DNS A Record for HOSTNAME at IP
  • Tera Term: Use challenge/response to log in
    • image
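
The hostname rule in the list above is easy to check before touching the switch. Here is a small Python sketch: the 14 character limit comes from the list, while the allowed character set (letters, digits, and hyphens, with no leading or trailing hyphen) is our assumption based on the usual host name conventions. The function name `is_valid_switch_hostname` is ours.

```python
import re

MAX_LEN = 14  # limit from the list above

# Assumption: letters, digits, and hyphens only, no leading/trailing hyphen.
_PATTERN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_valid_switch_hostname(name):
    """Return True if the name fits the length limit and character rule."""
    if not 1 <= len(name) <= MAX_LEN:
        return False
    return _PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # Hypothetical switch names for illustration.
    for candidate in ("MLNX-SW01", "mellanox-switch-01", "-sw01"):
        print(candidate, is_valid_switch_hostname(candidate))
```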

We’re now off to get our switches set up for RoCE (RDMA over Converged Ethernet).


Friday, 16 June 2017

S2D: Mellanox Prep for RoCE RDMA

We are in the process of setting up a pair of Mellanox SX1012B (40GbE) switches and CX354A ConnectX-3 Pro NICs (CX3) for a Storage Spaces Direct (S2D) four node cluster.

Setup:

  • (2) MSX1012B Switches
  • (2) NETGEAR 10GbE Switches
  • Intel Server System R2224WTTYS
    • (2) CX354A NICs per Node
    • (2) Intel X540-T2 10GbE NICs

The Mellanox gear will provide the East-West path between nodes while the NETGEAR/Intel 10GbE will provide for node access and the virtual switch setup.

Out of the box, the first thing to look at is … the Release Notes for the MLNX-OS version v3.6.3508 that is current as of this writing.

As we can see, there is a specific firmware level we need to have on our CX3 NICs:

image
The most current version of the CX3 firmware is 2.40.7000 which we had downloaded prior to reading the release notes. ;)
We will make sure to install 2.40.5030 once we are ready to do so:

image
Now, as far as the Mellanox switch OS goes (MLNX-OS), there may be a bit of a process needed depending on how old the current OS is on them!

image
image
2.5 Upgrade From Previous Releases
Older versions of MLNX-OS may require upgrading to one or more intermediate versions prior to upgrading to the latest. Missing an intermediate step may lead to errors. Please refer to Table 2 and Table 3 to identify the correct upgrade order.
Note that there are two types of switch operating systems depending on the underlying hardware. In our case, it is PPC as opposed to x86.

In our case, the current OS version is 3.4.0012, so our process will be:

  1. 3.4.0012 --> 3.4.2008
  2. 3.4.2008 --> 3.5.1016
  3. 3.5.1016 --> 3.6.2002
  4. 3.6.2002 --> 3.6.3004
  5. 3.6.3004 --> 3.6.3508
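
The hop list above can be expressed as a simple ordered chain. The following Python sketch returns the remaining hops from whatever version a switch is currently on. Note the assumption: `UPGRADE_CHAIN` only contains the versions from our own list above, not the full tables in the release notes, and the function name `upgrade_steps` is ours.

```python
# Ordered versions from our upgrade list above (PPC switches).
# Assumption: this is only our path, not the full release notes table.
UPGRADE_CHAIN = ["3.4.0012", "3.4.2008", "3.5.1016", "3.6.2002", "3.6.3004", "3.6.3508"]


def upgrade_steps(current, target, chain=UPGRADE_CHAIN):
    """Return the (from, to) hops needed to get from current to target."""
    start = chain.index(current)  # raises ValueError for an unknown version
    end = chain.index(target)
    if end < start:
        raise ValueError("target is older than the current version")
    return list(zip(chain[start:end], chain[start + 1:end + 1]))


if __name__ == "__main__":
    for source, destination in upgrade_steps("3.4.0012", "3.6.3508"):
        print(f"{source} --> {destination}")
```

A version that is not in the chain raises a `ValueError`, which is the cue to go back to the release notes rather than guess at a hop.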

Fortunately, the Mellanox download site makes picking and choosing the various downloads a simple process.

This is what the update process looks like:

image

image

One can expect to budget between 30 and 90 minutes per upgrade session.
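
To put a number on that for our five-hop path, here is a quick bit of arithmetic in Python. It assumes that one "upgrade session" means one hop in the list above and that the hops are done back to back on one switch; both are our assumptions for the estimate.

```python
# Assumption: one "upgrade session" equals one hop in our five-hop path.
STEPS = 5
MIN_PER_STEP = 30   # minutes, low end of the range above
MAX_PER_STEP = 90   # minutes, high end of the range above


def budget_hours(steps=STEPS, low=MIN_PER_STEP, high=MAX_PER_STEP):
    """Return the (best case, worst case) time budget in hours for one switch."""
    return (steps * low / 60, steps * high / 60)


if __name__ == "__main__":
    best, worst = budget_hours()
    print(f"Plan for {best} to {worst} hours per switch")
```

So the full path works out to somewhere between 2.5 and 7.5 hours per switch.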

NOTE: When both switch management consoles were opened in separate IE tabs one would fail out on the upload after a minute or so. Once we opened a separate Firefox browser session for one of the switches the upgrade moved seamlessly.

Once complete, we will move on to our preliminary settings scope for this project.


Monday, 15 May 2017

WannaCry Mitigation plus Windows XP and Server 2003 Patch

By now most of the world has heard about the WannaCry malware put together from purported NSA exploit "tools".

The simplest thing to do is to disable or remove SMBv1 on our networks: How to enable and disable SMBv1, SMBv2, and SMBv3 in Windows and Windows Server (Microsoft Support).

Dealing with SMBv1

On Windows 7:

First, we need the following put into a text file:

sc.exe config lanmanworkstation depend= bowser/mrxsmb20/nsi
sc.exe config mrxsmb10 start= disabled
pause
shutdown -r -t 0 -f

image

In Notepad click File then Save As and name exactly as follows:

"Windows7 SMBv1 DISABLE.BAT"

image

NOTE: The quotes " are necessary
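
If Notepad's Save As step is being repeated for a number of machines, the file can be generated instead. This is a minimal Python sketch that writes the same four lines to a file with the exact name used above. The function name `write_batch` and the target folder are our own choices, and the Windows-style line endings are an assumption about what Notepad would have produced.

```python
from pathlib import Path

BAT_NAME = "Windows7 SMBv1 DISABLE.BAT"

# The four lines from the text file above, unchanged.
BAT_LINES = [
    "sc.exe config lanmanworkstation depend= bowser/mrxsmb20/nsi",
    "sc.exe config mrxsmb10 start= disabled",
    "pause",
    "shutdown -r -t 0 -f",
]


def write_batch(folder="."):
    """Write the batch file into folder and return its path."""
    path = Path(folder) / BAT_NAME
    # newline="" stops Python translating the explicit CRLF line endings.
    with open(path, "w", encoding="ascii", newline="") as handle:
        handle.write("\r\n".join(BAT_LINES) + "\r\n")
    return path


if __name__ == "__main__":
    print(write_batch())
```

The resulting file still needs to be run as an administrator as described in the steps that follow.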

Right click on the resulting BATCH file and Run As Administrator:

image

An administrator's username and password will be required for this step. A local admin or domain account would work.

A status window will show:

image

NOTE: Windows 7 should show SUCCESS for both steps

As the message says, press any key to continue.

NOTE: The script automatically reboots the machine so make sure users save and close before running.

On Windows 10:

  1. Click Start and type PowerShell
  2. Right click on the result and Run as Administrator
  3. Disable-WindowsOptionalFeature -Online -FeatureName SMB1Protocol
    • You should see:
    •      image

That disables the problematic component in Windows.

Windows Server

Open an elevated PowerShell window:

Remove-WindowsFeature -Name FS-SMB1

image

Backup & Restore

For users who work almost exclusively from their own computer rather than from server or cloud based resources, and who have no local backup, it's important that they back up their machines daily! They should have at least three 2.5" USB3 fast disk drives in rotation.

We use ShadowProtect Desktop by StorageCraft to back up our clients' endpoints.

A critical component in the backup regime is an air-gap. Just as it is for the entire organization's server infrastructure.

Windows XP and Server 2003

Get the Security Updates ASAP and install them!

The files may be able to be set up to be delivered via your favourite patching mechanism. Please check that out to get these patches out to as many systems as is possible.

Windows Firewall

One mitigation step would be to set up a Group Policy object that denies File & Print (TCP 445) inbound from any system except those that need it, such as servers and/or domain controllers.
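
The logic of that rule is a simple allowlist. The Python sketch below only models the decision and is not how Group Policy or Windows Firewall is configured. The addresses in `ALLOWED_SOURCES` are hypothetical placeholders for the servers and domain controllers, and the function name `smb_inbound_allowed` is ours.

```python
import ipaddress

# Hypothetical addresses standing in for our servers and domain controllers.
ALLOWED_SOURCES = [
    ipaddress.ip_network("192.168.10.10/32"),
    ipaddress.ip_network("192.168.10.11/32"),
]


def smb_inbound_allowed(source_ip, allowed=ALLOWED_SOURCES):
    """Return True only if source_ip is on the allowlist for port 445 inbound."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed)


if __name__ == "__main__":
    for source in ("192.168.10.10", "192.168.10.50"):
        verdict = "allow" if smb_inbound_allowed(source) else "deny"
        print(f"445 inbound from {source}: {verdict}")
```

A neighbouring workstation such as the second address gets denied, which is what keeps a worm from hopping between desktops over SMB.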

Malware Mitigation

As always, the best form of mitigation is a well trained user. Patching the systems and training the human is the best methodology going.

As a small plug, our xD mail sanitation and continuity service flags and renders inert links that say one thing but point to another location. This has put link shortening services like Bit.Ly at a disadvantage but we're willing to pay that price to keep our users safe. Just ask us how!


Thursday, 27 April 2017

Surface Pro 4: Creator's Update Graphics Driver Issue

As an FYI, after updating to the Creator's Update Windows 10 version the graphics subsystem on the Surface Pro 4 seems to start behaving badly. This is especially true if connected to external monitors via a Surface Dock (Gen1 or Gen2).

An updated driver can be obtained here: Intel Iris 540 Driver for Windows 10

The SP4 had 15.44 while the download is 15.45 as of this writing!


Wednesday, 15 March 2017

Windows Server 2016 March 2017 Update: Full & Delta Available

We can download either the full March Cumulative Update or the Delta that is now available.

image

Delta Update Windows Server 2016

The update is quite critical for those of us that run clusters on Windows Server 2016.

  • Addresses issue which could cause ReFS metadata corruption
  • Several fixes for Enable-ClusterS2D cmdlet for setting up Storage Spaces Direct
  • Addresses issue with Update-ClusterFunctionalLevel cmdlet during rolling upgrades if any of the default resource types are not registered
  • Optimization to the ordering when draining a S2D node with Storage Maintenance Mode
  • Addresses servicing issue where the Cluster Service may not start automatically on the first reboot after applying an update
  • Improved the bandwidth of SSD/NVMe drives available to application workloads during S2D rebuild operations.
  • Addresses issue with all flash S2D systems with cache devices where there was unnecessary read data from both tiers that would degrade performance

A full list is here: March 14, 2017—KB4013429 (OS Build 14393.953)

The Delta Update can be used to update our .WIM files for our Windows Server 2016 flash drive based installers (Blog post How-To).
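
As a rough idea of what servicing a .WIM with an update package involves, here is a Python sketch that builds the usual DISM mount, add package, and unmount/commit commands as argument lists. This is not necessarily the procedure in our how-to post. All paths, the package file name, and the image index below are hypothetical placeholders, and the function name `wim_update_commands` is ours.

```python
def wim_update_commands(wim, mount_dir, package, index=1):
    """Return the three DISM commands to apply a package to one image in a .WIM."""
    return [
        # Mount the chosen image index from the .WIM into an empty folder.
        ["dism", "/Mount-Image", f"/ImageFile:{wim}", f"/Index:{index}",
         f"/MountDir:{mount_dir}"],
        # Apply the update package to the mounted (offline) image.
        ["dism", f"/Image:{mount_dir}", "/Add-Package", f"/PackagePath:{package}"],
        # Unmount and save the changes back into the .WIM.
        ["dism", "/Unmount-Image", f"/MountDir:{mount_dir}", "/Commit"],
    ]


if __name__ == "__main__":
    # Hypothetical paths and file names for illustration only.
    commands = wim_update_commands(
        wim=r"E:\sources\install.wim",
        mount_dir=r"C:\Mount",
        package=r"C:\Updates\update.msu",
        index=1,
    )
    for argv in commands:
        print(" ".join(argv))
```

The commands are only printed here; running them requires an elevated prompt on a Windows machine with the flash drive attached.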

Note that the last Cumulative Update took a good hour to run on our VMs and nodes. This one is sounding like it may take as long or longer depending on whether the Delta or full Cumulative Update gets installed.

Here’s a direct link to the Microsoft Update Page for KB4013429.

Happy Patching! ;)
