Why Alerting Tools Often Fail Administrators

post date Tweet This
post comments No Comments

Got smoke?

I have smoke detectors in my house.

They are really good at telling me when there is smoke, or a fire, providing I keep fresh batteries in them of course.

They are really lousy at preventing smoke, or fires, from happening in the first place.

I was a production DBA for seven years with a large financial services firm. I used to think what I wanted was a smoke detector. So I built my own and eventually I bought as many as possible. What I eventually found out was that all the smoke detectors in the world were not helping me prevent the fires from happening.

I didn’t want to be on alert 24×7. Nobody wants that. But that is what I had with all of those smoke detectors. Bells and whistles all day long. I could never focus on fixing the problem, I was constantly being asked to react to the symptoms.

What I wanted was a tool that could help me prevent problems before they would happen. I needed a way to find the root cause of performance problems.

What I wanted (and eventually found) was Ignite8 from Confio.

I became a customer and liked the tool so much I came to work for the company.

If you are an administrator that is responsible for maintaining database performance and are thinking that a tool built around the same principles of a smoke detector is going to help you be more efficient in your job then I am here to tell you this:

You’re doing it wrong.

You want a tool that helps you find the root cause. You want a tool that helps you prevent problems. You want a tool that helps you identify performance bottlenecks in “four clicks or less”.

Well, you want those things if you want to sleep at night.

Otherwise go and buy yourself a smoke detector. When you are ready for a tool that can make a difference let us know. We’ll be here to help.

Collect and Alert on Deadlocks for MS SQL Server With Ignite

post date Tweet This
post comments No Comments

Ignite can collect a count of the number of deadlocks to be displayed as a resource graph. For Ignite version prior to 8.2, you can update two files in order to start collecting the number of deadlocks. The code below takes advantage of the system_health extended event session that is enabled by default when SQL is started and should work with SQL 2008 SP1 and later versions.

First, inside the <install dir> \Ignite PI\iwc\tomcat\ignite_config\idc\metrics\custom-metrics-queries.xml file add this entry:

<metricsQuery className="com.confio.idc.database.metrics.domain.queries.DeltaCalculationMetricsQuery">
 <sql>
 <![CDATA[
 select COUNT(XEventData.XEvent.value
('(data/value)[1]', 'varchar(max)')) AS cnt
 FROM
 (select CAST(target_data as xml) as TargetData
 from sys.dm_xe_session_targets st
 join sys.dm_xe_sessions s on s.address = st.event_session_address
 where name = 'system_health') AS Data
 CROSS APPLY TargetData.nodes
('//RingBufferTarget/event') AS XEventData (XEvent)
 where XEventData.XEvent.value
('@name', 'varchar(4000)') = 'xml_deadlock_report'
]]>
 </sql>
 <frequency>60</frequency>
 <queryTimeout>10</queryTimeout>
 <metricsList>
 <metric name="Deadlocks">
 <expression>args[0]</expression>
 </metric>
 </metricsList>
 <supportedDatabaseList>
 <supportedDatabase>
 <databaseType>SQL Server</databaseType>
 <minimumVersion>10.0.0</minimumVersion>
 </supportedDatabase>
 </supportedDatabaseList>
 </metricsQuery>

Then, in the <install dir>\Ignite PI\iwc\tomcat\ignite_config\idc\metrics\custom-metrics-properties_SQLServer.xml file add this entry:

<metricsProperties name="Deadlocks">
<displayName>Number of Deadlocks</displayName>
<description>Number of Deadlocks</description>
<category>Sessions</category>
<units>deadlocks</units>
<heartbeat>60</heartbeat>
<minY></minY>
<maxY></maxY>
</metricsProperties>

After editing those XML files you need to restart the Ignite service. You should then see a graph in the ‘Sessions’ tab on the Resources page:

Now you will be able to see the number of deadlocks over time inside of Ignite. To configure an alert for deadlocks you do the following.

Go to the ‘Alert’ page and create an alert of type ‘Custom SQL Alert – Single Numeric Return’. Name it ‘Deadlock Alert’ and choose an execution interval. Insert whatever notification text you want. An example of a notification would be: “You have a deadlock. If you have trace flag 1204 enabled check your error log for additional details.” [You only need to enable this trace flag if you desire to have a history of your deadlocks for analysis later. You do not need the trace flag if you only want to examine the deadlocks by querying the default trace as the deadlocks occur, keeping in mind that the information will not persist after a system restart.]

You can enable the trace flag globally by running the following command:

DBCC TRACEON (1204, -1)

Ideally you would enable this trace flag to be enabled upon startup. You do that using SQL Server Configuration Manager for the instance. Right click on the instance name and go to ‘Properties’, then the ‘Advanced’ tab:

Click on the drop down for the Startup Parameters option and edit by adding “;-T1204” at the end as follows:

Click OK. You will need to restart the instance for the parameter to take effect. If that is not possible at the moment it is fine as the the DBCC statement above will enable the trace flag immediately.

Back in the Ignite GUI choose the instance(s) that you want to execute this alert against. Then, enter the following code into the SQL Text window:

SELECT count(1)
FROM
(SELECT CAST(target_data as xml) as TargetData
FROM sys.dm_xe_session_targets st
JOIN sys.dm_xe_sessions s on s.address = st.event_session_address
WHERE name = 'system_health') AS Data
CROSS APPLY TargetData.nodes
('RingBufferTarget/event[@name="xml_deadlock_report"]') AS XEventData (XEvent)
WHERE DATEDIFF(mi, CAST(XEventData.XEvent.value
('@timestamp', 'datetime')as datetime), GETUTCDATE()) < #FREQUENCY#

Set the query to execute against the monitored instance. Next, configure your thresholds and notification methods as desired. Test the alert to verify it will run without error. Then click save. You’re done.

If you are not using the trace flag to store the deadlock details for later analysis then you just need to run the code in order to pull the deadlock details back from the default trace. Keep in mind that the details are reset (i.e., lost) after a service restart:

SELECT COUNT(XEventData.XEvent.value('(data/value)[1]', 'varchar(max)')) AS cnt
FROM
(SELECT CAST(target_data as xml) as TargetData
FROM sys.dm_xe_session_targets st
JOIN sys.dm_xe_sessions s on s.address = st.event_session_address
WHERE name = 'system_health') AS Data
CROSS APPLY TargetData.nodes
('//RingBufferTarget/event') AS XEventData (XEvent)
WHERE XEventData.XEvent.value
('@name', 'varchar(4000)') = 'xml_deadlock_report'

Independent Review of IgniteVM and AlarmVM

post date Tweet This
post comments No Comments

Blogger and VMware Certified Instructor, Eric Sloof @esloof, just did an independent review of IgniteVM and AlarmVM – complete with video demonstrations. See what the expert has to say!

http://www.ntpro.nl/blog/archives/1995-Cool-Tool-IgniteVM.html

IgniteVM Review – Yes It Is That Cool!

post date Tweet This
post comments No Comments

Eric Sloof (blog | @esloof) posted some videos of IgniteVM and AlarmVM today and we wanted to share his post with you here.

I won’t give you a transcript of the videos but will share with you some of the quotes I liked reading. Eric states:

IgniteVM is a 24/7 monitoring tool, based on industry best practice Performance Intelligence and Wait-Time analysis methodsAssessing database performance without visibility leaves database administrators (DBAs) guessing about the root cause, hampering their ability to take action to address their most important user-oriented problems.

When I speak to customers about IgniteVM I tell them how I would rather spend five minutes examining the virtual layer for any possible issues before ever spending one minute looking at a query plan.

So what tools are giving you insight to your virtual environment, and how is that tool able to tie performance back to specific SQL statements?

I only know of one that does that for MS SQL Server, Oracle, Sybase ASE, and DB2: IgniteVM.

But don’t take my word for it, go and download yourself a trial and see it for yourself.

Q&A from Monitoring Database Performance on VMware Webinar – March 1

post date Tweet This
post comments No Comments

Last week’s webinar had nearly 700 attendees! We had a lot of great questions from the audience. Below is just a sample. If you missed the event or want view it again,  here is the link to the recorded webcast, http://marketo.confio.com/OnDemand_MonitoringonVMware_March2012

Q: What would you recommend regarding the storage layout for the VM, i.e. should we configure 1 DB server per LUN?

A: Storage layouts will differ for each VM running a database, just like they do in the physical world. We configure one database per VM to provide more flexibility, but then configure the storage based on that database. Based on I/O and requirements you may share a LUN with many other VMs, e.g. a dev/test and other less critical VMs. Sometimes you may have many LUNs for one VM and no other VM will share those LUNs. So, the answer is that it depends on I/O workloads.

Q: You mentioned that if there are 1000+ concurrent users that you would not virtualize that environment. What is the reason for that?

A: The decision to virtualize an environment should not be based on the number of concurrent users alone. It should be based on the resource consumption of an environment, e.g. storage IOPS, CPU utilization, etc. However, in practicality, an environment with a large number of concurrent users would most likely also use a lot of resources. When beginning a virtualization project, start with the database environments with less resource consumption and build up to an environment like this. You will not only learn many things along the way to help make the project more successful, but you also lessen the risks dramatically.

Q: Is the MAC address the same on Fault Tolerance VMs?

A: Yes. When FT is configured for a VM, a shadow of that VM is created on another physical server. These instances are kept in “lockstep” with each other and access common storage with a single IP and MAC address. Here is a link to a document on the VMware website with information about how FT works: http://www.vmware.com/files/pdf/VMware-Fault-Tolerance-FT-DS-EN.pdf

Q: Can I use OEM (Oracle Enterprise Manager) to monitor my VMware environment?

A: OEM does provide a plug-in for monitoring VMware. However, I personally do not see enough advantages of using this plug-in over the free tools provided via the vSphere GUI on top of a vCenter server. When monitoring databases (Oracle, SQL Server, Sybase, DB2/LUW) running in a VMware environment, Confio’s IgniteVM provides a “single pane of glass” view from the database to the VM down through the physical ESX host and storage layers. Everything you need can be found on a single page which is very unique in the industry.

Q: Is swapping a feature that must be enabled in order to use?

A: No, hypervisor swapping is an inherent feature of VMware used as a last resort when a physical host is under severe memory pressure. Page sharing and ballooning are techniques used to reclaim memory before swapping because of the severe performance degradation that can occur. Here is a link to a document with more information about memory management: http://www.vmware.com/files/pdf/techpaper/vsp_41_perf_memory_mgmt.pdf

Q: Can you define ballooning and what it means?

A: The VM is running a guest O/S like Windows or Linux, and that guest O/S does not know it is running in a virtual environment. Because of that, if the physical host that contains the VMs is running low on memory, the guest O/S will not know this and not free memory up to help. Ballooning helps the guest O/S understand the hosts memory shortage. A balloon driver is installed as part of VMware Tools into the guest O/S and polls the ESX server to get a target balloon size. If the ESX host needs memory, it will post a target balloon size for the balloon driver making it inflate. The hypervisor will see the memory pages pinned by the balloon driver and reclaim that memory. A more thorough description can be found in the vsp_41_perf_memory_mgmt.pdf document reference above.

Q: If we use OS monitoring (CPU , I/O, Memory) on a physical host that has 5 VMs, what will be the accuracy of those OS reports?

A: I assume you have ESX/VMware 4.1 or lower installed and you have ESX running on top of Linux. In that case, the Linux O/S counters should be a good indication of memory and other resource pressure on the physical host. However, note that with ESX 5.0 and higher, the only option is to use ESXi installed directly on the physical host, so the Linux O/S is no longer present. Your O/S counters will no longer work.

Q: In our testing we have seen dropped pings during vMotion. Could you comment on that?

A: I do not have any specific experience with this, but searched for others having issues. It seems that this was more of an issue in earlier versions of VMware. However, when the vMotion is performed, it may take just a little time for everything to get into place, so I could see a failed ping or so occurring. I would not expect to see many occur though, so if that is happening, a call to VMware support may be in order.

Q: Can vSphere send scheduled reports and custom alerts?

A: To my knowledge, the vSphere GUI does not accommodate sending reports on a defined basis. However, many people on the internet have example of scripts that can probably accomplish what you need. There is the concept of a scheduled task in vSphere, but sending reports is not one of the things you can do.

Q: Can you recommend the paravirtualized disk controllers? You did not mention them.

A: Yes, if you have VMs that drive require a lot of I/O throughput, which databases are certainly candidates to do, you should use the paravirtual SCSI (PVSCSI) adapters. These are best suited for environment, especially SAN environments, where applications drive a very high amount of I/O. This is something I will include in my presentation in the future, so thank you for reminding me about it.

Q: We use the VM memory active perfmon counter and monitor for it in SCOM. Are the VM counters in perfmon good to use?

A: Yes, these are supplied by the VMware environment are ok to use. Standard perfmon counters can be incorrect and misleading, but these should be ok. However, if you only use the VM perfmon counters, you will be missing he physical host, storage and other metrics.

Q: As per VMware best practices for SQL Server DB’s they suggest use RDM with specific RAID?

A: I am not aware of a suggestion to use RDM (Raw Device Mapping) with a specific RAID type. My understanding is that any type of RAID can be used underneath a raw device, but I am not a storage expert by any means. The same proven practices you are currently using for SQL Server should also translate to the virtual environment.

Q: Does vMotion/DRS have any overhead on the VMs performance?

A: Not that I am aware of, and it probably only helps. If a physical host is running low on resources while another physical has plenty to spare, vMotion/DRS will help balance the workload for you.