Bharat Suneja

Thursday, September 24, 2009


Another Gmail Outage

Posted by Bharat Suneja at 9:57 AM
After a widespread outage earlier this month, Google's Gmail web-based email service is reporting yet another outage today— this time affecting only "a small subset of users". More from Stephen Shankland in Gmail outage hits 'small subset of users'.


Thursday, September 03, 2009


Gmail Outages And Cloud Availability

Posted by Bharat Suneja at 8:50 AM
Google's Gmail service had yet another widespread outage on Tuesday at 12:30 PM, which lasted between 100 minutes (according to Google) and 2½ hours (according to PC World). News of the outage spread like wildfire on social media networks, where it quickly earned the epithet of Gfail. A great day for Twitter and Facebook! Even by Google's own account, it was a "big deal".

Google's Ben Treynor, VP Engineering and Site Reliability Czar, apologized for the outage in a post on the Gmail blog, and explained the technical details of what caused it. I like his well-crafted response for the most part, and he does call it as it is ("Gmail's web interface had a widespread outage..."). But when your web interface is the primary or only interface used by most customers to access your service, for customers the service is down.

With each outage, I've reminded myself that in spite of the best efforts of system and service architects to build in as much high availability and as many quick recovery mechanisms as possible, outages do occur, just as they do in your on-premise systems and services, and that there's nothing to be alarmed about unless they form a pattern.

The latest outage, and the reported reason for it (a capacity miscalculation, according to News.com's Tom Krazit), makes me a little uncomfortable. I use Gmail (there, I said it... just as I use Hotmail/Live, Yahoo!, several flavors of Exchange Server, and other POP/IMAP-based messaging systems), but using it as the primary email system for a business, even for free, would be a difficult decision.

As organizations consider the move to the cloud, high availability is one of the many factors that must be carefully considered, and the potential of widespread loss of productivity must be factored in when calculating the cost savings. Additionally, if you rely on a cloud-based e-mail service, an outage like this also brings to a standstill the frantic e-mail and collaboration activity that goes on inside an organization that's dependent on e-mail. What adds to further loss of productivity is the fact that most users using a web-based e-mail service do not have a local copy of their data. Gmail, and other web-based e-mail providers do provide access to e-mail accounts using POP/IMAP e-mail clients, which allows you to download messages to your computer. But when was the last time you used a POP/IMAP client to access your web-based e-mail service?

To make the situation worse, if you depend on the same cloud-based service for your productivity apps such as word processing, spreadsheets, etc., you may as well have taken the day (or at least the few hours) off. Nik Cubrilovic reports in Gmail Now Really Down - Can I Get My Email Back Please (Update: Its Back) on TechCrunch.com:
I use Apps For Domain for everything - my contacts, my email, my todo list, my chat, my documents and more recently, my phone. As soon as it went down, I noticed in less than a second. I am now completely stuck, after a few months of being impressed by how I was able to run my entire life on Google.
Gmail is covered by the Google Apps SLA, which promises an uptime of 99.9%. Going by the proverbial "Nines of High Availability" calculation you've no doubt heard many times over in high availability presentations, three nines (or 99.9% uptime) allows approximately 8.76 hours of unplanned downtime in a year. At 100 minutes to 2½ hours, the latest outage alone consumed roughly a fifth to just under a third of that.
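
For anyone who wants to check the "nines" arithmetic, here is a minimal Python sketch. It assumes a 365-day year, and the function name is mine for illustration, not anything from the Google Apps SLA:

```python
# Sketch of the "nines" calculation: how much downtime a given
# availability percentage allows. Assumes a 365-day year (8,760 hours).
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted in the period."""
    return period_hours * (1 - availability_pct / 100)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% uptime allows about {hours:.2f} hours of downtime a year")
```

Three nines works out to about 8.76 hours a year, which is the figure quoted above.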

Gmail's Site Reliability Manager, Acacio Cruz, says in a Current Gmail Outage post on the official Google blog:
Obviously we’re never happy when outages occur, but we would like to stress that this is an unusual occurrence.
PC World's JR Raphael notes:
While "frequent" would probably be an exaggeration when it comes to describing Gmail outages, "unusual" might be missing the mark by a hair, too.
Raphael chronicles Gmail outages in Gmail Outage Marks Sixth Downtime in Eight Months.

The total downtime, as stated by Raphael in the above article, is approximately 71 hours or more! The three longest outages lasted 30 hours, 24 hours, and 15 hours respectively. Yesterday's outage was world-wide.

In Gmail's defense, given the highly distributed nature of web-based services that have a global reach and are likely spread over many data centers in different parts of the world, a single user wouldn't have been affected by all the outages. But it's an alarming number nevertheless. If you were affected by all the outages, 71 hours of downtime in eight months would translate to less than 99% availability (99.7% availability allows you a little over one day of downtime in a year), a figure most organizations wouldn't be comfortable with. On the flip side, Google would've rewarded you with 7 days of free service, "at no charge to customer" according to the Google Apps SLA, if you notify Google within 30 days from the time you become eligible for the credit.
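
The reverse calculation, from hours of downtime to an availability figure, is sketched below. It assumes 30-day months for the eight-month window and takes the 71-hour total at face value; the names are illustrative only:

```python
# Sketch: turn observed downtime into an availability percentage.
# Assumption: the eight-month window is approximated as 8 x 30 days.
EIGHT_MONTHS_HOURS = 8 * 30 * 24


def availability_pct(downtime_hours, period_hours):
    """Return availability as a percentage of the period."""
    return 100 * (1 - downtime_hours / period_hours)


if __name__ == "__main__":
    worst_case = availability_pct(71, EIGHT_MONTHS_HOURS)
    print(f"71 hours down in eight months: {worst_case:.2f}% availability")
```

Under these assumptions the worst case comes out at roughly 98.8%, which is below two nines.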

If your organization requires a higher SLA for its messaging system, and you've deployed high availability configurations to achieve higher uptime, this cloud clearly isn't for you.

How do Gmail's SLA and uptime stack up against your organization's internal SLA for e-mail? Will your users be satisfied with Gmail's report card?


Tuesday, July 01, 2008

New whitepapers have been released today on TechNet.

Whitepaper: Continuous Replication Deep Dive
- written by Ross Smith IV and Scott Schnoll

This whitepaper discusses the different components of Continuous Replication— used by LCR, CCR and SCR, how replication works, backups and log file truncation, what happens during scheduled and unscheduled outages, and how Continuous Replication compares with other replication solutions.

The whitepaper is available here.

Whitepaper: Planning for Large Mailboxes with Exchange Server 2007
- written by Tom Di Nardo

This whitepaper discusses planning and operational issues faced when dealing with large mailboxes, including planning storage, long database backup and online/offline maintenance times.

The whitepaper is available here.


Monday, March 17, 2008

Standby Continuous Replication (SCR) is a new High Availability feature in Exchange Server 2007 SP1. It uses Continuous Replication (also used by LCR and CCR) to replicate Storage Groups from a clustered or non-clustered mailbox server, known as a SCR source, to a clustered or non-clustered mailbox server, known as a SCR target.

SCR is managed using the Exchange shell - no management features exist in the EMC to configure or manage it.

Unlike LCR and CCR, which are designed to have a single copy of a Storage Group (consisting of an Exchange Store EDB + transaction logs & system files), SCR is designed to have many-to-one and one-to-many "replication relationships". (A SCR relationship or partnership - not formally defined terms, but simply used to explain the concept here - is SCR replication of a particular Storage Group from a SCR source server to a particular SCR target server).

A Storage Group from one SCR source can be replicated to multiple SCR target servers, and Storage Groups from one or more SCR source mailbox servers can be replicated to a single SCR target mailbox server.

By default, the Replication Service delays replaying 50 transaction logs to the SCR replica Database. Additionally, you can configure the following parameters to control how SCR replicas behave:
ReplayLagTime: specifies how long the Replication Service waits before replaying replicated transaction logs to the replica Database (EDB) on the target. Default: 1 day.
TruncationLagTime: sets a lag time for truncating log files on that replica. Provided the other requirements for log file truncation on the SCR replica are met, log files are not truncated until ReplayLagTime + TruncationLagTime has elapsed. Default: 0.
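
To illustrate how the two lag times combine, here is a small Python sketch. This is not how the Replication Service is implemented. It only models the timing rule described above, assumes the clock starts when a log file arrives on the SCR target, and leaves out the other truncation requirements and the 50-log replay lag. The function names are hypothetical:

```python
# Timing sketch only; not the Replication Service's actual logic.
# Assumption: lag is measured from when the log arrived on the SCR target.
from datetime import datetime, timedelta


def can_replay(arrived_at, now, replay_lag):
    """A log becomes eligible for replay once ReplayLagTime has elapsed."""
    return now - arrived_at >= replay_lag


def can_truncate(arrived_at, now, replay_lag, truncation_lag):
    """A log becomes eligible for truncation only after
    ReplayLagTime + TruncationLagTime has elapsed."""
    return now - arrived_at >= replay_lag + truncation_lag


if __name__ == "__main__":
    arrived = datetime(2008, 3, 17, 9, 0)
    replay_lag = timedelta(days=1)       # like -ReplayLagTime 1.00:00:00
    truncation_lag = timedelta(days=2)   # like -TruncationLagTime 2.00:00:00
    check_time = arrived + timedelta(days=2)
    print(can_replay(arrived, check_time, replay_lag))
    print(can_truncate(arrived, check_time, replay_lag, truncation_lag))
```

With a 1-day replay lag and a 2-day truncation lag, a log that arrived two days ago can be replayed but is not yet eligible for truncation.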

Why do I need the delay?

Replay lag gives you the protection of having a copy of your database from back in time. This back-in-time copy can be used to recover from logical corruption, pilot errors etc.

Additionally, if there is no delay, in the case of a lossy failover of the SCR source to a LCR or CCR replica, the (new source) Database will be behind its SCR target(s), requiring reseeding. Not something one would want to do for large Databases over WAN links (or even locally within the same datacenter). Delaying the last 50 transaction logs from being replayed to the SCR target avoids the need to reseed.

However, a large number of transaction logs not replayed to the Database means increased storage requirements for the SCR target, and also an increase in the time it takes to activate it in case of failure of the SCR source. Before it can be brought online, all the logs will need to be replayed.

To avoid this, you can set the ReplayLagTime to 0 (from the default of 1 day). Note, the replay will still lag behind by 50 transaction logs - a hard-coded limit enforced by SCR that cannot be changed. The TruncationLagTime can be set higher, so logs are replayed but not truncated. You can then take VSS snapshots of the target for the point-in-time copies.

Once set up using the Enable-StorageGroupCopy command, the ReplayLagTime and TruncationLagTime cannot be changed without disabling and re-enabling that SCR relationship for the Storage Group.

How can I see ReplayLagTime and TruncationLagTime? The following command shows the SCR targets a Storage Group is being replicated to:

Get-StorageGroup "SG Name" | fl

However, neither the above command, nor Get-StorageGroupCopyStatus show the lag times.

The parameters are returned as an array when you use the former (Get-StorageGroup) - only the name of the SCR target is displayed in the StandbyMachine property.

To see the lag times:

$sg = Get-StorageGroup "MyServer\MyStorageGroupName"

Here's what it looks like:

Figure 1: Displaying the Replay and Truncation lag time

Can I change ReplayLagTime and TruncationLagTime without reseeding the Database? You need to disable replication and re-enable it to add or modify the lag times:

Disable-StorageGroupCopy "Storage Group Name" -StandbyMachine "SCR Target Server"

When disabling SCR, you get prompted to delete all files in the replica folder on the SCR target. Skip that. Reseeding is not required if you do not delete the files:

WARNING: Storage group "DFMAILMAN.e12labs.com\dfmailman-sg1" has standby continuous replication (SCR) disabled. Manually delete all SCR target files from "C:\Exchange Server\Mailbox\First Storage Group" and "C:\Exchange Server\Mailbox\First Storage Group\Mailbox Database.edb" on server "mirror".

Now, let's enable SCR with the replay and truncation lag times:

Enable-StorageGroupCopy "Storage Group Name" -StandbyMachine "SCR Target Server" -ReplayLagTime 1.00:00:00 -TruncationLagTime 2.00:00:00

Once replication is enabled again, make sure to test replication status using:

Get-StorageGroupCopyStatus "SG Name" -StandbyMachine "SCR Target Server"


Saturday, August 18, 2007

Exchange Server 2007's Cluster Continuous Replication (CCR) clusters are not dependent on shared storage (when used with MNS quorum and a File Share Witness in Windows Server 2003). There are no protocol virtual server resources like SMTP, POP3, IMAP4, etc. — Exchange Server 2007's Clustered Mailbox Server (CMS) role is designed to be a Mailbox server. It cannot co-exist with any other Exchange 2007 server role.

This results in a greatly simplified cluster resource dependency model.

Note, Public Folder database support is available, with some limitations.

By default, the following resources are set up in a CCR cluster:
1. IP Address
2. Network Name (depends on: IP Address resource)
3. Microsoft Exchange System Attendant (depends on: Network Name resource)
4. Microsoft Exchange Information Store (depends on: Network Name resource)
5. Microsoft Exchange Database Instance (depends on: Microsoft Exchange Information Store resource)
Notice, there are no shared disk resources in the above list!
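
The dependency list above can be written down as a small graph. The Python sketch below (3.9 or later, for graphlib) derives one valid bring-online order from it. It is only an illustration of the dependency model, not something the cluster service exposes:

```python
# Sketch: the CCR resource dependencies listed above as a graph.
# Each key depends on the resources in its list.
from graphlib import TopologicalSorter

DEPENDS_ON = {
    "IP Address": [],
    "Network Name": ["IP Address"],
    "Microsoft Exchange System Attendant": ["Network Name"],
    "Microsoft Exchange Information Store": ["Network Name"],
    "Microsoft Exchange Database Instance": [
        "Microsoft Exchange Information Store"
    ],
}

# static_order() yields a resource only after everything it depends on.
online_order = list(TopologicalSorter(DEPENDS_ON).static_order())

if __name__ == "__main__":
    for position, resource in enumerate(online_order, start=1):
        print(position, resource)
```

Any order it prints has the IP Address first and the Database Instance after the Information Store; nothing in the graph is a disk.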

Figure 1: Exchange Server 2007 resources in a CCR Cluster

This makes adding a new Storage Group to a CCR cluster easier. Generally you use a new set of volumes for the new Storage Group's transaction logs and mailbox Store, but there are no additional disk resources to add to the CMS in Cluster Administrator.

To add a new Storage Group:
1. In the Exchange console, create a new Storage Group
2. Add a new Mailbox Database to the Storage Group

That's it! Fire up cluster administrator and you'll see a new Microsoft Exchange Database Instance resource created for the new Storage Group, with the right dependencies.
Use the following command in Exchange shell to verify the new Storage Group is being replicated to the passive node.

Get-StorageGroupCopyStatus "Second Storage Group" | Select SummaryCopyStatus,CCRTargetNode


Thursday, June 21, 2007


Some more details on SCR

Posted by Bharat Suneja at 11:24 AM
Besides Terry Myerson's post on the team blog providing some details about SP1 (read previous post "Exchange Server 2007 SP1: A bag of goodies!"), there weren't a lot of details about SCR available publicly until TechEd 2007 in Orlando earlier this month.

Looking at the search engine keywords used to reach this blog, and the number of questions on the Exchange chat on TechNet today (transcript will be posted on the TechNet site soon), I'm not surprised to see interest in Standby Continuous Replication (SCR) is quite high.

The Cluster Continuous Replication (CCR) functionality present in the shipping (RTM) version of Exchange Server 2007 is a great choice for high availability within a data center, and can be implemented across data centers as well, albeit with some limitations and considerations like placement of the File Share Witness [read previous post "CCR Over WAN: Failover and FSW questions answered"], extending the subnet across data centers, etc. The latter is a limitation of high-availability clustering as it's implemented in Windows Server 2003, and not an inherent limitation of Exchange or its Database Continuous Replication component that makes capabilities like LCR, CCR and SCR possible. It should be removed with the release of Windows Server 2008, aka "Longhorn Server".

However, SCR is what really provides the needed capability to survive a data center outage.

Paul Robichaux has some more details in Windows IT Pro's Exchange and Outlook UPDATE newsletter today, titled "SCR in the Spotlight" (the link should be available without a subscription for a short time).


Wednesday, May 16, 2007


Cluster Continuous Replication and Public Folders

Posted by Bharat Suneja at 12:17 PM
In previous versions of Exchange Server, Exchange Virtual Servers (EVSes) are not very different from standalone servers. Besides mailboxes, they can host protocol virtual servers (SMTP, IMAP4, POP3, HTTP/OWA), Public Folders, etc.

Exchange Server 2007's clustering model is simplified further to provide high availability for mailboxes. There is no protocol support - SMTP is the domain of Hub Transport servers, IMAP4, POP3 and HTTP (OWA, Outlook Anywhere or RPC over HTTP, Exchange ActiveSync) are the responsibility of Client Access Server role. Unlike standalone/non-clustered Exchange Server 2007 servers, Clustered Mailbox Servers (CMS - the Exchange 2007 term for EVS) do not co-exist with any other server role.

Clustered Mailbox Servers can host Public Folders, but there are some caveats. The Public Folder Store hosted by the CMS should be the only Public Folder Store in the Organization. If you have Public Folder Stores on other Exchange servers in the Organization, the Public Folder Store on a Clustered Mailbox Server will fail to mount in the case of an unscheduled failover, until the original server and all transaction logs for the Storage Group hosting the Public Folder Store are available.

This is documented in "Planning for Cluster Continuous Replication" in Exchange Server 2007 documentation.

Public Folders have their own high-availability mechanism built in, and it's been around for a long time: Public Folder replication. Clustered Mailbox Servers (using Cluster Continuous Replication) are not good candidates for Public Folder replication.


Tuesday, May 08, 2007

To determine whether a mailbox server is clustered or standalone, and if clustered - whether it's using Cluster Continuous Replication (CCR) or Single Copy Cluster (SCC), use the following command:

Get-MailboxServer | select name,ClusteredStorageType

The possible values:
1. NonShared = CCR cluster
2. Shared = SCC cluster
3. Disabled = standalone / non-clustered mailbox server


Thursday, April 26, 2007


CCR Over WAN: Failover and FSW questions answered

Posted by Bharat Suneja at 8:28 AM
Exchange Server 2007's Cluster Continuous Replication (CCR) feature provides a way to set-up geographically-dispersed clusters to protect against data center failure (aka "site failure"). Though the documentation provides plenty of detail on how to set up CCR clusters in a single data center - where both cluster nodes and the computer hosting the File Share Witness are in the same data center - the documentation on how to set this up across data centers has been skimpy, or even non-existent.

Matt Richoux's post on the Exchange team blog provides more detail on such topologies, placement of the File Share Witness, failover scenarios, and related issues. Read the post, titled "Placement of the File Share Witness (FSW) on a Geographically Dispersed CCR Cluster".

In a nutshell:
1) To facilitate CCR nodes across data centers, a CNAME record should be used to configure the FSW.
2) Failover to the CCR node in the remote data center is not automatic (however, as Matt points out, the FSW can be placed in a third data center to achieve automatic failovers).
3) Be aware of the split brain syndrome that may occur if the first/primary data center comes back up with the (formerly) active node and File Share Witness set to start up automatically.

A frequent question: is CCR a good solution for geo-dispersed clusters, particularly in the context of the manual steps required to fail over? It's too early to say, given that Service Pack 1 is bringing us Standby Continuous Replication (SCR), which is designed to work across data centers. However, in a lot of cases, automatic failover between data centers, generally located on the other end of a WAN link, is not desirable. You need an administrator to make the judgment whether an entire data center or site has failed, and whether a failover to another data center should be performed.

As Matt notes in the post, you can easily script the steps outlined in the post.

Another limitation of such deployments, perhaps till SCR arrives in SP1, is the fact that CCR clusters are limited to 2 nodes. It's not possible to have 2 nodes in the primary data center, and replicate to a 3rd node in a remote data center. This would provide the ability to fail over locally first, in case of a single node failure, and fail over to the remote data center in case of a data center failure.

Exchange Server 2007's Database Continuous Replication features provide answers to some of the most frequently asked questions by users in different forums - can I replicate an Exchange server or the Stores to another server (CCR does this), to another location (CCR and SCR), or to another disk (volume) on the same server (LCR). These features are some of the more important reasons to consider upgrading.


Sunday, October 01, 2006


HOW TO: Change IP addresses on a cluster

Posted by Bharat Suneja at 10:48 AM
A "Re-IP-ing" project around the corner where you need to change IP addresses on all hosts in a subnet? Or just a cluster? When you change the IP addresses on cluster nodes, cluster resources go offline.

With the IP addresses on nodes changed (while moving to different network/subnets), when you start Cluster Admin you can't connect to the cluster. The IP address resource in your Cluster Group still has the old IP, which means your cluster nodes and the cluster itself are on different networks!

To change the IP address on the cluster, start Cluster Administrator and connect to the node name or . (dot). Go to IP address resource | properties | Parameters and change the IP address (and subnet, if required). Make sure the correct network is selected from the Network drop-down. (In most such situations, it'll be the Public network that's impacted).

Now you can bring the Cluster Group online.

If you have any other resource groups (like an Exchange Virtual Server resource group for Exchange clusters), you may need to repeat the process for those groups.


Monday, October 31, 2005

Creating an additional SMTP Virtual Server in a non-clustered environment is pretty straightforward. The same task is a little more complicated in a clustered environment.

Before you begin:
You need to consider whether you want the new SMTP VS to listen on the same IP address but on a different port, or listen on the default SMTP port on a new IP address. Most likely you'll end up choosing the latter, but the former choice may be perfectly valid for certain requirements like setting up an authenticated SMTP VS (not referred to in an MX record).


Creating a new IP Address Resource
(you can skip this step if you're going to use the same IP address as the Default SMTP VS, but with a different port)

If you decide to create a new IP address, do that in Cluster Administrator.
1. In Cluster Administrator, select the group your Exchange Virtual Server (EVS) resides in (we'll call it "Exchange Group").
2. Right-click the group | New | Resource
3. In the New Resource dialog box, type in the Name (optional: type in a description) and select IP Address as the Resource Type
4. In the Dependencies dialog box, do not select any resource as a dependency.
5. In TCP/IP Address Parameters, type the IP address and subnet mask, and select the network on which you want to create this resource (the public network where users can access it, let's call it "Public" here)
6. Click Finish. Do not bring the resource online yet.

To make this IP address visible in ESM when creating an additional SMTP VS, you will need to make the Network Name resource for your EVS dependent on this IP Address. If you do not do this, it will not show up as a choice and your new SMTP VS will be forced to bind to the EVS IP address which has the default SMTP VS bound as well, creating a conflict.

Note: The following step will take your Exchange Virtual Server offline. Users will not be able to access the EVS till you bring it back online.
7. Right-click the Exchange Group | Take Offline.
8. Locate the Network Name resource for the EVS | Properties | Dependencies tab | click Modify
9. In Modify Dependencies, select IP Address 2 (the new IP Address resource) and click the right arrow (-->) to add it as a dependency for the Network Name
10. Click OK to exit.
11. Right-click the Exchange Group | Bring Online.

Now we're ready to create the SMTP Virtual Server.
1. Open Exchange System Manager console and locate your server (EVS)
2. Expand Protocols | SMTP
3. Right-click SMTP | New | SMTP Virtual Server
4. In the New SMTP Virtual Server wizard, type in a name for the new VS | click Next
5. Select the new IP address you created earlier (should be selected by default)
6. Click Finish. The new SMTP VS instance is created.
Note the SMTP VS icon looks slightly different than the running Default SMTP VS.

We're not ready to start this SMTP VS yet. The next step is to go back to Cluster Administrator and create a resource for it.

1. In Cluster Administrator, go to your Exchange group, right-click the group | New | Resource
2. Type in a name for the new resource - let's say "SMTP VS2 Instance"
3. Select "Microsoft Exchange SMTP Server Instance" as the Resource Type
4. In the Possible Owners dialog box click Next (all nodes that are possible owners for the EVS are selected by default)
5. In the Dependencies dialog box select Microsoft Exchange System Attendant | click Next
6. In the Virtual Server Instance dialog box select the new SMTP VS (selected by default) | click Finish.

The resource for the new SMTP VS is now created. To bring it online, select the new resource in Cluster Administrator, right-click | Bring Online.

You can now test the new SMTP VS.
1. Go to Exchange System Manager and expand Protocols | SMTP (hit refresh if it's already expanded). You will see the icon change. The SMTP VS has started.
2. From command prompt, telnet to the new SMTP VS on port 25.
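
If you'd rather script the telnet check, a rough Python equivalent is sketched below. It connects, reads the greeting, and looks for the 220 reply code an SMTP server sends when it is ready. The host name in the usage comment is a placeholder, not a server from this walkthrough:

```python
# Sketch of the "telnet to port 25" test: read the SMTP greeting banner.
import socket


def read_banner(host, port=25, timeout=5.0):
    """Connect to host:port and return the first chunk the server sends."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        return conn.recv(1024).decode("ascii", errors="replace").strip()


def smtp_ready(banner):
    """An SMTP server that is ready for mail greets with reply code 220."""
    return banner.startswith("220")


# Usage (hypothetical host name; replace with the new SMTP VS address):
#   banner = read_banner("smtpvs2.example.com")
#   print(banner, smtp_ready(banner))
```

A connection error or timeout here means the same thing a failed telnet does: nothing is listening on that IP address and port.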


Tuesday, October 25, 2005


Problem creating EVS (System Attendant Resource)

Posted by Bharat Suneja at 5:00 PM
When creating the System Attendant resource for a new EVS, you get the following error:

Invalid network address
win 32 Error code C00706ab

Creation of SA resource fails.

The cause: network card binding order. To fix it:

1. Go to Network Connections folder (command line: ncpa.cpl)
2. Go to Advanced menu | Advanced Settings.
3. In Adapters and Bindings tab, under Connections make sure the public interface for the node appears first in the list. Use the up/down arrows on the right to bring the interface to the top of the binding order and click OK to get out of the dialog box. [see screenshot]

Figure 1: The Public network interface should be the first one in the network binding order

Create the System Attendant resource.


Monday, July 11, 2005


Exchange 12 to axe Active/Active Clustering?

Posted by Bharat Suneja at 3:05 PM
Oliver Rist's Enterprise Windows column in InfoWorld talks about many facts/announcements related to the Exchange Server product roadmap - most of it discussed/announced at TechEd in Orlando and at other venues.

He also talks about Active/Active clustering being history with Exchange 12. I'm a little surprised - I thought Microsoft would perhaps work towards making Active/Active Clustering a more elegant High Availability solution - for those who do want to go that route despite all its pitfalls.

Read the complete column on InfoWorld.com: http://www.infoworld.com/article/05/07/07/28OPenterwin_1.html

Replicate the Store: One of the most exciting things for Exchange admins would perhaps be log shipping or synchronous replication of the Exchange store. That's the biggest single point of failure in clustered Exchange environments. Some NAS/SAN vendors have Exchange solutions in place that do replicate or clone the store using different methods, but it'd be nice to have Exchange (and Microsoft) support it natively without any third-party tools.


Saturday, February 05, 2005

The ClusApi.h file, which is part of the PlatformSDK should contain this...

ClusterResourceStateUnknown = -1,
ClusterResourceInherited = 0,
ClusterResourceInitializing = 1,
ClusterResourceOnline = 2,
ClusterResourceOffline = 3,
ClusterResourceFailed = 4,
ClusterResourcePending = 128,
ClusterResourceOnlinePending = 129,
ClusterResourceOfflinePending = 130
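
With those values, translating a numeric state (such as the State value WMI returns for a cluster resource) into a name is a simple lookup. Here is a minimal Python sketch built only from the list above; the helper name is mine:

```python
# Sketch: map numeric cluster resource states to the names listed above.
from enum import IntEnum


class ClusterResourceState(IntEnum):
    ClusterResourceStateUnknown = -1
    ClusterResourceInherited = 0
    ClusterResourceInitializing = 1
    ClusterResourceOnline = 2
    ClusterResourceOffline = 3
    ClusterResourceFailed = 4
    ClusterResourcePending = 128
    ClusterResourceOnlinePending = 129
    ClusterResourceOfflinePending = 130


def describe_state(value):
    """Return the state name, or a note if the value is not in the list."""
    try:
        return ClusterResourceState(value).name
    except ValueError:
        return f"Unrecognized state value: {value}"


if __name__ == "__main__":
    for value in (2, 129, 7):
        print(value, describe_state(value))
```

So 2 is Online and 129 is Online Pending, which is what you'd expect to see while a resource is being moved.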

Alain is (now) WMI Program Manager at Microsoft and author of:
1. Understanding Windows Management Instrumentation (WMI) Scripting and
2. Leveraging Windows Management Instrumentation (WMI) Scripting - 2 of the best references on WMI. He also writes for Windows Scripting Solutions newsletter (published by Penton Media, the publishers of Windows IT Pro & SQL Server magazines).


Friday, January 28, 2005


Scripting: ExchangeClusterResource class

Posted by Bharat Suneja at 1:40 PM
The WMI ExchangeClusterResource class has 5 properties:
1) Name: returns name of Exchange cluster resource
2) Type: specifies the cluster resource type (IP Address, network name, etc.)
3) Owner: specifies the cluster node that the resource is running on (changes with failover)
4) VirtualMachine: returns name of the Virtual Machine that owns the resource
5) State: Shows the current state of resource, returns a numerical value.

The value I get for all online resources is 2 - but I haven't been able to find any cross-reference that translates the numbers into something, except for the Windows Clustering section of the Platform SDK that mentions the following values (not-numerical) for the GetClusterResourceState function.

Return Code: Description
ClusterResourceInitializing: The resource is performing initialization.
ClusterResourceOnline: The resource is operational and functioning normally.
ClusterResourceOffline: The resource is not operational.
ClusterResourceFailed: The resource has failed.
ClusterResourcePending: The resource is in the process of coming online or going offline.
ClusterResourceOnlinePending: The resource is in the process of coming online.
ClusterResourceOfflinePending: The resource is in the process of going offline.
ClusterResourceStateUnknown: The operation was not successful. For more information about the error, call the Win32 function GetLastError.

Assuming these values map serially to the numerical values returned by the WMI class, 2 would mean the resource is operational and functioning normally.

However, when moving the MSDTC resource, the numerical value returned while the resources were moving was 129. That blows holes in the theory.


Tuesday, August 31, 2004


Running 3-4 node cluster on iSCSI

Posted by Bharat Suneja at 6:34 PM
Clusters with more than 2 nodes using iSCSI storage are not officially supported by Microsoft or NetApp. Yet. Probably being tested in labs.

Added 3rd node to my cluster. Failed over Cluster, MSDTC and Exchange groups successfully!

Wolfpack (that's what the cluster is called..) is now truly a pack of wolves! (The nodes are called Wolf1, Wolf2, Wolf3.... :)


Friday, July 09, 2004

You've just built a shiny new Windows Server 2003 cluster, installed Exchange Server 2003, created an Exchange Virtual Server (EVS) and tested MAPI, HTTP, Cluster Failover, et al - things look great!

User calls, can't access mail on new EVS using IMAP4.

When you create the Exchange Virtual Server by creating the System Attendant resource in Cluster Admin, the IMAP4 and POP3 servers are not created automatically (unlike Exchange 2000) because "Exchange Server 2003 is designed to comply with the Microsoft Trustworthy Computing initiative" according to KB818480 - which is all good.

And here's when you have a fleeting moment of monumental stupidity - YOU DON'T READ THE KNOWLEDGEBASE ARTICLE COMPLETELY! (contrary to public opinion, many KB articles are in fact quite accurate, provide the complete solution, and help you avoid making blunders).

And here's what happens when you don't do that... :

You go ahead and try to create the IMAP4 virtual server resource in Cluster Admin. IMAP4 virtual server created successfully. Wonderful! All that's left is right-click, bring resource online.

And DISASTER! (Don't try this on a production box.. :)

This brings down your entire EVS. Goes offline. You hope it's once, to initialize the IMAP4 virtual server perhaps. Comes back online. Great! Goes back offline. Oh noooo... ! Flip-flops between offline and online... mom, look what I did to the Exchange cluster! :)

Bottom line, the flip-flopping goes on. You can't delete the IMAP4 VS if it's in transition (Offline pending, Online). At the right time, just when it goes online and before any other resource goes offline, you hit delete and the resource is deleted. Then you investigate.

What's wrong? The IMAP4 service is set to DISABLED by default. The Application Event Log will show you 3 errors:

Event ID: 1009 | Source: MsExchangeCluster | Description: IMAP Virtual Server (MAILBOX): Failed to start the service 'IMAP4SVC' because it has been disabled. Check the Services manager to change its startup type.
Event ID: 1010 | Source: MSExchangeCluster | Description: IMAP Virtual Server (MAILBOX): Failed to start the service 'IMAP4SVC'.
Event ID: 1003 | Source: MSExchangeCluster | Description: IMAP Virtual Server (MAILBOX): Failed to bring the resource online.

Resolution: Set the IMAP4 service on all nodes to MANUAL (not AUTOMATIC).

Now bring back the Exchange Virtual Server online. Comes online. Stable.

Test IMAP4 access - telnet to port 143. It works!


Thursday, June 24, 2004


MSDTC Resource and Confusing KB 301600

Posted by Bharat Suneja at 11:58 AM
Microsoft KB 301600 talks about how to install the Microsoft Distributed Transaction Coordinator (MSDTC) resource on a Windows Server 2003 cluster.

STEP 8: In Dependencies, press and hold the CTRL key on the keyboard, select both the Physical Disk and Network Name that you created in step 2, and then click the Add button

But STEP 2 has nothing to do with creating a Network Name and Physical Disk resource!! Instead, it talks about the procedure for starting Cluster Administrator!!! :
2. Start Cluster Administrator. To do so:
a. Click Start, and then point to All Programs
b. In Administrative Tools, click Cluster Administrator.
What needs to be done:
Assuming you've already created the Cluster Group resources (IP, Network Name and Physical Disk for quorum), you already know how to create new Groups and Resources...
1. Create a separate cluster group for MSDTC - let's call it "MSDTC Group"
2. Create an IP Address resource for MSDTC, assign a unique static IP address, no dependencies
3. Create a Network Name resource for MSDTC - let's call it "MSDTC". Choose the IP Address as dependency
4. Make sure MSDTC is enabled for network (KB817064) by going to Add/Remove Programs | Windows Components | Application Server | check Enable Network DTC Access

Enabling network access for MSDTC

At the time of writing, KBA 301600 included a reference to KBA 817064 to enable network access for MSDTC. However, it's a security best practice to not enable network access, as noted in updated Exchange Server 2003 doc: How to Install the Microsoft Distributed Transaction Coordinator in a Windows Server 2003 Server Cluster.

5. Create a new resource of type: Distributed Transaction Coordinator, select Network Name & Physical Disk as dependencies
6. When done, right-click on the MSDTC Group and select Bring Online

You can install Exchange Server 2003 Ent only after configuring the MSDTC resource.

(Note: Since this item was posted, KB 301600 has undergone many revisions & editions. As of Oct 1, 2007, the KBA is on version 23.0. - Bharat)