This movement of data is subject to a large number of options configured at the publisher, distributor and subscriber, but for all the complexity it is surprisingly simple to set up, providing the DBA has a solid understanding of the underlying principles. Replication can be thought of in terms of the newspaper model - the publisher produces the newspapers and sends them to the distributor, who separates and sends them on to the subscribers e.
Data in the replication model moves in a similar way, with some context-specific differences. There are three main forms of replication; snapshot replication, which can be thought of as the best fit to the newspaper model - the whole newspaper is propagated each time.
Merge replication incorporates changes at both the publisher and the subscriber; and transactional replication is an incremental flow of data from the publisher to the subscriber much like a streaming news service, in this context. For further discussion on the intricacies of the different replication models, please consult Books Online.
However replication has many points of failure. This article will address two common causes of replication failure - when information in subscribers is not synchronized with information in the publication base tables and will show the reader where to look to find information on replication-specific issues.
It will cover latency, stalled agents, failed jobs, replication-related tables, 'gotchas' and along the way will provide further links to replication as a data-propagation technology. The focus of this article is transactional publication with pull subscriptions, so some information may vary depending on the model and topology you have chosen. Some familiarity with replication as a feature of SQL Server is assumed. An architecture like this has several benefits, not least of which is reducing locking and blocking on the principal database and allowing sub-optimal code i.
Below is a very basic architectural example where replication could be used. There are two types of replication subscriptions - push subscriptions, and pull subscriptions. Push subscriptions are initiated from the publisher - that is to say, the log reader agent on the publisher scan the transaction log of the database with publications containing articles for replication, determines which transactions are to be replicated log reader agent then the distribution agent sends the transactions to the distributor, which forwards those transactions onto the subscription database, where they are applied.
The subscriber does not actively query either the distribution database or the publisher; instead, it waits for inbound transactions. Pull subscriptions are somewhat different. The subscribers, through the distribution agent, will periodically question the distribution database for unapplied transactions, gather the transaction and apply it to the subscription database. The distribution database will then be updated to show those transactions have been applied.
The subscriber will not contact the publisher directly. This article is not intended to give an in-depth discussion of replication - there is plenty of literature on this subject and some extensive articles in Books Online , which you can find in the links below.
Instead, this article is intended to demonstrate some of the pitfalls of replication not, in the author's humble opinion, the most stable feature of the SQL Server platform , identifying typical symptoms for problem diagnosis, and suggesting solutions for some replication-related problems. For the remainder of this article, the context will be transactional replication, since arguably this is the more common kind of replication model found in the wild. Latency in this context simply refers to the delay between publisher and subscriber via distributor and taken in isolation with no other symptoms can be perfectly normal.
Latency can be caused by various factors, including geographical distance between nodes; network traffic or bandwidth constraints; transactional load on the publisher; and system resources. Poor latency can be planned for and incorporated into an architectural plan - normally the acceptable latency will be a function of business requirements, for example in the model above, the natural latency between the UK and Japan may be 3 minutes. The Japanese management team may require data that is a maximum of 1 hour old.
This will inform the architectural decision to allow for a variable latency of 20 minutes to include contingency of an additional 17 minutes. Therefore the wise DBA can configure alerting that informs him or her of latency problems when latency exceeds this period as it is likely that latency exceeding this threshold is indicative of a more serious problem.
Replication configuration allows the DBA to configure latency maximums, such that when the maximum is reached or exceeded, a new snapshot of each publication is required to be pushed to each subscriber to allow replication to continue.
Currently, Microsoft provides only one tool to monitor latency, and that is the ubiquitous 'Replication Monitor'. Opinions amongst DBAs on the use of this tool vary; unfortunately, while it has its uses, it is generally considered to be inaccurate, useful as a guideline only; and, more seriously, can occasionally affect the state of replication due to its intrusive methods of detection causing resource conflicts.
Various examples of this have been documented by Microsoft Support and commented on in other online forums. However, as a tool it does have some uses, including drilldown dialogs that enable the DBA to see the number of unapplied transactions and estimated time until synchronization ; an 'at-a-glance' view of latency across all configured publications; last start times and statuses of various agents and jobs; and an organized view of the publisher, distributor and subscribers in one menu.
Drill down using the menu on the left and the various tabs and options on the right to view the information you need. It is quite possible to measure latency yourself using T-SQL and a knowledge of tokenization.
By inserting 'tokens' into the replication process think putting a floating log into a fast-flowing river and measuring the time taken to flow between two predefined points , the DBA can work out the delay between one part of the replication process and another.
Automatically measuring these statistics allows the DBA to create procedures that will monitor the latency and alert automatically when problems are detected. Tracer tokens are both inserted and measured at the publisher, and the procedure is straightforward, using just 4 different stored procedures: Either the latency is due to factors beyond your control i.
Stalled Agents and Jobs, and Finding Further Diagnostic Information Often indicated by increased latency, it is not unusual for the agent processes governing replication to stall, and this can happen for different reasons.
For example, one cause in this author's recent experience was a temporarily-high network load which caused a timeout to occur when the replication account attempted to connect to the distributor from the publisher. This caused the distribution agent at the distributor to stall, and consequently the subscription agents relying on a particular publication also stalled since we were using a pull subscription model.
This non-movement of data increased latency for the affected publications. Consequent successful attempts by the publisher to connect to the distributor were successful likewise subscriber to distributor but these were 'queued' behind an erroneous and entirely misleading error message that the 'initial snapshot is not yet available'.
This particular situation was fixed by restarting the subscription agents and restarting the distribution agent in particular the clean-up job. This caused a temporarily high load as replication caught up with the queued transactions and soon replication was back in synchronization. Post-event root cause analysis highlighted how the replication setup had fallen like a set of dominoes at the slightest hint of instability at one end of the process.
Depending on the situation you face, you may wish to restart the agents, or reinitialize the publication. Reinitialization by using either an existing or new snapshot may be appropriate when latency indications show a long synchronization time or when you have suspicions that an agent may be incorrectly configured i. It is important to note that preparation of a snapshot will normally cause exclusive table locks to be taken out on the tables in the defined publications, which can cause connection timeouts or long query execution times propagated through the application layer as timeout errors and manifesting as increased helpdesk calls which may impact the business.
Where possible, an existing snapshot should be used dependent on age or reinitialization should take place during an out-of-hours or 'slow' period. You can find the history of the replication jobs by drilling down through SQL Server Agent and examining Job History particularly on the distributor. By looking at what replication jobs have started and indeed which are stalled you can identify which agents to restart or which publications to reinitialize.
You can view error messages relating to stalled or errored replication processes by checking the replication-specific tables on the distributor.
By joining on the columns of these tables, you can easily find information on errors relating to e. A database snapshot for purposes of replication is a complete view of the schema and contents of all articles for a particular publication.
Each publication has it's own snapshot. So, for example, here is the structure of a fictional publication. Database snapshots, on initialization, are created as sparse files containing the database pages that contain the articles defined for the publication. As transactions are completed on the base tables, the snapshots are updated to reflect the new information and the size of the snapshot grows. So in this context, re-snapshotting simply refers to the process of re-collecting all the data into an initial snapshot.
As warned earlier, this process can be resource-intensive and it is recommended this is completed during a quiet period the author speaks from experience! Initializing a publication is quite different. The initialization process will effectively rewrite the articles at the subscribers, overwriting the contents with the snapshot contents. Again, right-click on the subscription and hit 'Reinitialize'.
Some Notes Some miscellaneous points and summaries which may help you on the road to replication diagnosis: Become familiar with replication-related tables - those in Distribution, and those in MSDB on the publisher. In push subscriptions, the distribution agent is on the publisher. In pull subscriptions, it is on the subscriber.
Snapshotting data has an adverse impact on data availability of the base tables. Replication is immediate continuous data propagation, where log-shipping is replication with differences - the 'burst' model of data transmission, the lack of a distributor, and the ability to insert artificial delays e. Choose the best strategy for your needs. Next Steps I hope this article has been helpful and will encourage you to explore replication in greater detail. If you find inaccuracies or have any comments or questions, please leave a comment below and I'll get back to you as soon as possible.