Bharat Suneja

Sunday, July 17, 2005


NetApp Data OnTap 7 and Flexvolumes

Posted by Bharat Suneja at 1:21 PM
Saturday (July 16th) was spent migrating an Exchange cluster from an older NetApp 840 to a new 920 filer. My personal experience with NetApp's Exchange solutions - from the older filers that used NetApp's VLD protocol to the newer iSCSI filers - has been a string of hit-or-miss affairs.

The Windows hosts (particularly apps like Exchange that insist on using "locally attached" disks and won't work with CIFS shares) use NetApp's SnapDrive snap-in that plugs into the Computer Management console (compmgmt.msc) and makes the host think the mounted volume/LUN is a local drive. Many times when you disconnect an iSCSI LUN (or a VLD "drive" for older filers), attempts to reconnect fail for a number of reasons. I've lived with it happily for what seems like a long while.

The Saturday upgrade went through the same reconnection issues and took several hours with a NetApp PS consultant on site and a NetApp tech support engineer on the phone. (The good thing was the consultant didn't have to go through the normal support queues that customers go through - as pleasant an experience as any support call can be :)

The fact that we did a whole bunch of mini-upgrades during this time (Windows Server 2003 SP1, Microsoft iSCSI initiator 1.06, SnapDrive 3.1X, et al.) didn't seem to get in the way, but it's not something I would necessarily recommend doing.

The high point of the day was watching the upgraded SnapManager (SME 3.1P1) - NetApp's solution for performing snapshot backups of the Exchange store - rip through the backups and verifications. SME performs a verification of the backed up stores using eseutil. The entire backup and verification for 2 storage groups that included 2 mailbox stores of about 40 Gigs and a smaller public folder store completed in under 19 minutes - compared to an hour and 3-7 minutes on the old filer. The verification operation was what took the most time, perhaps more than 18 minutes, and it can be run on a passive node in the cluster, or on any other server or desktop - you don't need an Exchange server to do it.

The performance boost mostly came from Data OnTap 7's FlexVolumes. With FlexVolumes, all the spindles (hard disks) in a filer are pooled into what's known as an "aggregate". You create volumes in the aggregate, and create LUNs (or "Qtrees" and LUNs - refer to NetApp documentation for an explanation of the terms) in the volume. As a result, data is distributed across all the spindles, giving a phenomenal performance boost compared to volumes sitting on far fewer spindles in a filer.
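To picture why more spindles help, here is a toy model in Python. It is only a sketch: it assumes blocks are spread round-robin across spindles, which is not how Data OnTap actually places data, and the spindle counts and function names are made up for illustration. It shows that the busiest disk has less work to do as the aggregate gets wider.

```python
# Toy model only: real Data OnTap block placement is far more involved.
# Round-robin placement and all numbers below are illustrative assumptions.

def stripe_blocks(num_blocks, num_spindles):
    """Assign blocks round-robin to spindles; return per-spindle block counts."""
    counts = [0] * num_spindles
    for block in range(num_blocks):
        counts[block % num_spindles] += 1
    return counts


def busiest_spindle_load(num_blocks, num_spindles):
    """Blocks the most heavily loaded spindle must serve (a rough proxy for time)."""
    return max(stripe_blocks(num_blocks, num_spindles))


if __name__ == "__main__":
    blocks = 10_000
    for spindles in (4, 14, 28):  # hypothetical spindle counts
        print(spindles, "spindles ->", busiest_spindle_load(blocks, spindles), "blocks on the busiest disk")
```

With the same amount of data, a volume on 4 spindles leaves each disk serving several times more blocks than one spread over 28.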

The other good part is you can expand/shrink the LUNs and volumes on the fly, without the need for taking the hosts/applications offline. In this case, Windows/Exchange hummed happily along as we increased the volume and LUN sizes on the fly.

Once we had tested the ability to reconnect to the old filer, the rest of the upgrade was quite trouble-free. The cluster nodes connected to the replicated (aka "SnapMirrored") LUNs on the new filer and the EVS came up without any hiccups.

Overall, a good display of NetApp's technology, professional services, and tech support in a little under 12 hours, and unhappy as I am with its barebones SnapManager for Exchange backup utility and SnapDrive, it can count on me for future business and referrals.


Monday, November 08, 2004

Want to upgrade NetApp SnapDrive 3.x to 3.1?

You'll need 3 hotfixes, 2 of which cannot be downloaded from microsoft.com. You need to call Microsoft Product Support Services for these:
1. KB838894 - An updated Storport storage driver (version 5.2.3790.173) is available for Windows Server 2003
2. KB831112 - [Call PSS] You cannot import a transportable shadow volume in Windows Server 2003
3. KB840281 - [Call PSS] Volume information is lost when you extend a partition by using the DiskPart tool and then move the volume in a Windows Server 2003 cluster

Most hotfixes that are available only from PSS are intended to correct specific issues, and should not be applied unless you are actually facing those issues. They come with the standard PSS disclaimer - these hotfixes have not been regression-tested, back up before you apply, et al.


Tuesday, August 31, 2004


Running 3-4 node cluster on iSCSI

Posted by Bharat Suneja at 6:34 PM
Clusters with more than 2 nodes using iSCSI storage are not officially supported by Microsoft or NetApp. Yet. Probably being tested in labs.

Added 3rd node to my cluster. Failed over Cluster, MSDTC and Exchange groups successfully!

Wolfpack (that's what the cluster is called..) is now truly a pack of wolves! (The nodes are called Wolf1, Wolf2, Wolf3.... :)


Friday, July 23, 2004


NetApp Volume Fails To Mount After Power Outage

Posted by Bharat Suneja at 6:54 PM
A power outage (are rolling blackouts back in California??) earlier in the morning, and the resulting mess. The new cluster (Windows 2003/Exchange 2003) with NetApp's iSCSI filer came up happily. The older one (that uses NetApp's older VLD protocol) failed.

Issues: L: drive (Storage Group 1, Logs) would not connect/map, so Exchange Group in cluster would not come online.

Disconnected drives, reconnected. Did not work.
NetApp tech support recommended upgrading to SnapDrive 2.1 (3.0 onwards is only for iSCSI and Fibre Channel - no support for the older VLD protocol). That also required a post-SP3 hotfix (Q816990) - only available through Microsoft PSS.

Recent experiences with NetApp and specific Microsoft hotfixes have resulted in long conference calls with PSS, so I decided to apply Windows 2000 SP4 instead.

Once this was done, the drive mappings were totally inconsistent - 4 out of 5 drives would connect, but how many mapped to drive letters varied from one attempt to the next: sometimes none, sometimes 1 (Q), sometimes 2 (Q and L, or Q and E), sometimes 4. On reboot, some would disappear completely. Completely random behaviour.

After more than half a day of troubleshooting, NetApp concluded - to our dismay - that it might have been caused by Spanning Tree Protocol (STP), used on switches to allow multiple redundant paths and avoid loops. Apparently, NetApp filers and STP don't work well together - it's documented by NetApp and they won't support it.

Can't really turn STP off on a network. The solution was to use PortFast on those particular ports (the ones with NetApp filers and Exchange cluster nodes connected). On Cisco switches:
set spantree portfast module/port enable
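
If several ports need the same treatment, the command lines can be generated rather than typed one by one. This is a minimal sketch: the module/port numbers are hypothetical placeholders, and the helper only formats the command shown above, it does not talk to a switch.

```python
def portfast_commands(ports):
    """Build one 'set spantree portfast' line per (module, port) pair."""
    return [f"set spantree portfast {module}/{port} enable" for module, port in ports]


# Hypothetical module/port pairs for the filer and cluster-node ports.
example_ports = [(3, 1), (3, 2), (3, 3)]

if __name__ == "__main__":
    for line in portfast_commands(example_ports):
        print(line)
```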

Another fix was to increase the NfsAdminRetryCount value to 7 in the Registry. Location: HKLM\SYSTEM\CurrentControlSet\Services\NAScsipt\Parameters
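
One way to apply the same registry change on every node is to generate a .reg file and import it on each one. The sketch below builds the file text in Python. It assumes NfsAdminRetryCount is a DWORD value, which the notes above do not state, so check the existing value's type in regedit before importing anything.

```python
# Assumption: NfsAdminRetryCount is a REG_DWORD. The key path is the one given above,
# with HKLM spelled out as HKEY_LOCAL_MACHINE as .reg files require.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NAScsipt\Parameters"


def reg_file_text(value_name, value, key=KEY):
    """Return .reg file text that sets a DWORD value under the given key."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("DWORD values must fit in 32 bits")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{key}]\n"
        f'"{value_name}"=dword:{value:08x}\n'
    )


if __name__ == "__main__":
    print(reg_file_text("NfsAdminRetryCount", 7))
```

Saving the printed text as a .reg file and keeping it with the cluster documentation also leaves a record of what was changed.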

Finally, we were down to 2 drives mapping consistently after reboots (Q and L), and 3 drives connecting but not mapping to drive letters. Needed to go to Cluster Admin and bring the drive resources online. This made them map, and they showed up in Windows Explorer.

The Exchange Virtual Server was up shortly before 7:30 PM.

The infamous "Crazy Friday Breakdowns" law at work again!


Wednesday, July 07, 2004

I've always had a love/hate relationship with NetApp's SnapManager for Exchange (SME) backup utility.

To restore a store, SME needs to take the Physical Disk resource offline. (It actually swaps it with the volume/LUN that has the backup.) When this happens, the Exchange System Attendant resource (and thereby the entire Exchange Virtual Server) goes down - because it depends on all the Physical Disk resources. Weird!

That means even if some disks and the stores residing on them are OK, all your users will suffer an outage when SME does a restore of a single store or a storage group.

I wish NetApp would find a better way to do this - or perhaps set up the cluster to not depend on the Physical Disk resources - or at least not on ALL the Physical Disk resources.

Alternatively, I'm looking at changing the System Attendant resource's dependencies - just try to remove the disks holding the Storage Group to be restored from the list.
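
The effect of those dependencies can be sketched with a small Python model. The resource names and layout below are hypothetical, and the function only models the "dependents go offline" rule. It says nothing about whether Exchange would actually tolerate the reduced dependency list.

```python
def offline_set(failed, depends_on):
    """Return every resource that goes offline when `failed` is taken offline.

    depends_on maps each resource to the resources it directly depends on.
    """
    offline = {failed}
    changed = True
    while changed:  # repeat until no new dependents are found
        changed = False
        for resource, deps in depends_on.items():
            if resource not in offline and offline & set(deps):
                offline.add(resource)
                changed = True
    return offline


# Hypothetical cluster group: the System Attendant depends on every disk.
depends_on = {
    "Disk E": [], "Disk L": [], "Disk Q": [],
    "Network Name": [],
    "System Attendant": ["Network Name", "Disk E", "Disk L", "Disk Q"],
    "Information Store": ["System Attendant"],
}

# Same group with Disk L removed from the System Attendant's dependency list.
reduced = dict(depends_on)
reduced["System Attendant"] = ["Network Name", "Disk E", "Disk Q"]

if __name__ == "__main__":
    print(sorted(offline_set("Disk L", depends_on)))
    print(sorted(offline_set("Disk L", reduced)))
```

In the first case taking Disk L offline drags the System Attendant and everything above it down. In the second, only the disk itself goes offline.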

From the SME docs:
In a Windows cluster, all resources that are directly or indirectly dependent on the LUN that is to be restored are taken offline, as well as the LUN itself. This means that the entire Exchange virtual server (all storage groups) is offline.

Do not attempt to manage any cluster resources while the restore is running.

If a cluster move group operation occurs during the restore (for example, if the node that owns the resources goes down) you must restart the SnapManager user interface and the restore.


Wednesday, June 30, 2004

When you try to install the NetApp VSS Provider (to connect iSCSI filers to Windows Server 2003), you need to install Hotfixes from KB833167 and KB824354.

You install the hotfixes. Then install the VSS Provider. And here's what you get -

Incorrect version of the following files was detected. Please install Hotfix Q833167.
Esent.dll version must be 5.2.3790.109 or newer
Ftdisk.sys version must be 5.2.3790.113 or newer

But you've already installed 833167!! You reinstall several times. No change. The VSS Provider installer keeps flashing the annoying message. PSS does not have a clue what's going on. Neither does NetApp.

The VSS Provider installer from NetApp specifically looks for the file versions and dates.

Hotfix 833167 should've updated ESENT.DLL and FTDISK.SYS. But Microsoft released an updated hotfix 833167 that did not contain those two files - because the files had not changed from the RTM version. Result: The system still had the RTM versions.
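
A version check of this kind can be sketched in a few lines of Python. This is a guess at the logic, not NetApp's actual installer code, and it ignores the date check. The minimum versions come from the error message above. The "installed" versions are an assumed RTM build number used only for illustration.

```python
def parse_version(text):
    """Turn '5.2.3790.109' into (5, 2, 3790, 109) so parts compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, required):
    """True if the installed file version is the required version or newer."""
    return parse_version(installed) >= parse_version(required)


# Minimum versions quoted in the installer's error message.
REQUIRED = {"Esent.dll": "5.2.3790.109", "Ftdisk.sys": "5.2.3790.113"}

# Hypothetical installed versions (assumed RTM build), not taken from the post.
installed = {"Esent.dll": "5.2.3790.0", "Ftdisk.sys": "5.2.3790.0"}

if __name__ == "__main__":
    for name, minimum in REQUIRED.items():
        status = "ok" if meets_minimum(installed[name], minimum) else "too old"
        print(name, installed[name], status)
```

A check like this rejects the RTM files no matter how many times the revised hotfix is reinstalled, which matches the behaviour described here.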

After a lot of digging, PSS was able to put together the files that shipped with the original update and make the hotfix available. Last time I heard, Microsoft was going to update that KB Article, and NetApp should've done something to change their installation routine to accept the RTM versions of ESENT.DLL and FTDISK.SYS.