Maximizing VMware Performance and CPU Utilization

In a previous post we discussed overcommitting VMware host memory – the same can be done with host CPU. As per the Performance Best Practices for VMware vSphere 6.0:

In most environments ESXi allows significant levels of CPU overcommitment (that is, running more vCPUs on a host than the total number of physical processor cores in that host) without impacting virtual machine performance. (P. 20, ESXi CPU Considerations)

This post will discuss calculating CPU resources, considerations in assigning resources to virtual machines (VMs), and which metrics to monitor to ensure CPU overcommitment does not affect VM performance.

Calculating available Host CPU resources

The number of physical cores (pCPU) available on a host is calculated as:

(# Processor Sockets) X (# Cores/Processor)  = # Physical Processors (pCPU)

If the cores use hyperthreading, the number of logical cores is calculated as:

(# pCPU) X (2 threads/physical processor) = # Virtual Processors (vCPU)

For example:

4 sockets X 6 cores/processor = 24 pCPU
24 pCPU X 2 threads/core = 48 vCPU

Please note that hyperthreading does not actually double the available pCPU. Hyperthreading works by providing a second execution thread to a processor core. When one thread is idle or waiting, the other thread can execute instructions. This can increase efficiency if there is enough CPU idle time to schedule two threads, but in practice the performance gain tops out at roughly 30% and is strongly application dependent.
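Expressed as a quick calculation, here is a minimal sketch in Python (socket and core counts are the example values above; the ~30% hyperthreading gain is an assumption taken from the note above, not a measured figure):

def host_cpu_capacity(sockets, cores_per_socket, hyperthreading=True, ht_gain=0.30):
    """Estimate host CPU capacity; ht_gain is the assumed hyperthreading benefit."""
    pcpu = sockets * cores_per_socket              # physical cores
    vcpu = pcpu * 2 if hyperthreading else pcpu    # logical processors
    effective = pcpu * (1 + ht_gain) if hyperthreading else pcpu
    return {"pCPU": pcpu, "vCPU": vcpu, "effective_pCPU": effective}

# Example from the text: 4 sockets x 6 cores/processor
print(host_cpu_capacity(4, 6))    # {'pCPU': 24, 'vCPU': 48, 'effective_pCPU': 31.2}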

Considerations in Assigning vCPUs to VMs

  1. Best Practices recommendations
    • Start with one vCPU per VM and increase as needed.
    • Do not assign more vCPUs than needed to a VM as this can unnecessarily limit resource availability for other VMs and increase CPU Ready wait time.
    • The exact amount of CPU overcommitment a VMware host can accommodate will depend on the VMs and the applications they are running. A general guide to the {allocated vCPUs}:{total vCPUs} ratio, based on the Best Practices recommendations, is (a quick ratio check is sketched after this list):

      • 1:1 to 3:1 is no problem
      • 3:1 to 5:1 may begin to cause performance degradation
      • 6:1 or greater is often going to cause a problem

  2. Non-Uniform Memory Architecture (NUMA)
    In a previous post on minimizing CPU latency with NUMA, we discussed performance degradation on multiprocessor VMs in which the number of vCPUs was greater than the number of physical cores in a NUMA node. Generally, try to keep multiprocessor VMs sized so that they fit within a NUMA node.
  3. Co-stop
    VMware schedules all the vCPUs in a VM at the same time. If all the allocated vCPUs are not available at the same time, then the VM will be in a state of “co-stop” until the host can co-schedule all vCPUs. In its simplest form, co-stop indicates the amount of time after the first vCPU is available until the remaining vCPUs are available for the VM to run. Sizing VMs to use the least number of vCPUs possible minimizes the time spent in co-stop waits.
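As a companion to the ratio guide in item 1, here is a minimal Python sketch that classifies a host's overcommitment level. The thresholds are taken directly from the list above; treat the output as a rough guide, since the safe level depends on the actual workloads:

def overcommit_ratio(allocated_vcpus, total_vcpus):
    """Ratio of vCPUs allocated across all VMs to vCPUs available on the host."""
    return allocated_vcpus / total_vcpus

def classify(ratio):
    # Thresholds from the Best Practices guide summarized above
    if ratio <= 3:
        return "1:1 to 3:1 - generally no problem"
    elif ratio < 6:
        return "3:1 to 5:1 - may begin to cause performance degradation"
    else:
        return "6:1 or greater - often going to cause a problem"

# Example: 120 vCPUs allocated on the 48 vCPU host calculated earlier
ratio = overcommit_ratio(120, 48)
print(f"{ratio:.1f}:1 -> {classify(ratio)}")    # 2.5:1 -> generally no problem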

 

Monitoring VMware CPU metrics

Monitor the following metrics to fine tune the number of vCPUs allocated per VM and to ensure that CPU overcommitment does not degrade performance:

  1. VM CPU Utilization
    Monitor CPU Utilization by the VM to determine if additional vCPUs are required or if too many have been allocated. CPU use can be monitored through VMware or through the VM’s operating system. Utilization should generally be <= 80% on average, and > 90% should trigger an alert, but this will vary depending on the applications running in the VM.
  2. VM CPU Ready
    VM CPU Ready is a measure of the time a VM has to wait for CPU resources from the host. VMware recommends that CPU Ready be less than 5% (see the conversion sketch after this list).


    Figure 1: Longitude Report of VMware showing high CPU Ready values.

  3. Co-Stop
    Applicable to VMs with multiple vCPUs, co-stop measures the amount of time after the first vCPU is available until the remaining vCPUs are available for the VM to run. A co-stop percentage that is persistently >3% is an indication that a right-sizing exercise may be in order.


    Figure 2: esxtop showing a VM with a high co-stop value.

  4. VMware host CPU Utilization
    Monitor CPU Utilization on the VM host to determine if CPU use by the VMs is approaching the maximum CPU capacity. As with CPU usage on VMs, CPU utilization at 80% – 90% should be considered a warning level, and >= 90% indicates that the CPUs are approaching an overloaded condition.


    Figure 3: Longitude Capacity Planner showing host CPU approaching capacity.
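A note on units for item 2: vCenter performance charts report CPU Ready as a summation in milliseconds rather than a percentage, so the value needs to be converted against the chart's sample interval. A minimal Python sketch of the usual conversion (assuming the 20-second real-time chart interval; use the appropriate interval for historical rollups):

def cpu_ready_percent(summation_ms, interval_s=20, num_vcpus=1):
    """Convert a vCenter CPU Ready summation (ms) to a percentage.
    Dividing by num_vcpus gives a per-vCPU average for multi-vCPU VMs."""
    return summation_ms / (interval_s * 1000) / num_vcpus * 100

# Example: 2,000 ms of ready time in a 20 s sample on a 2-vCPU VM
print(f"{cpu_ready_percent(2000, interval_s=20, num_vcpus=2):.1f}%")    # 5.0% - right at the recommended limit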

Summary: Overcommitting CPU allows you to maximize use of host CPU resources, but make sure to monitor overcommitted hosts for CPU use, CPU Ready, and Co-stop percentages. Avoid oversizing VMs with more vCPUs than needed. Consider NUMA architecture and the effect of co-stop waits when creating VMs with multiple vCPUs.

VMware Memory Management and Capacity Planning

VMware provides the ability to create virtual machines (VMs) that are provisioned with more memory than physically exists on their host servers. This is possible because VMware memory management is able to recover memory that is no longer in use by the VM’s guest operating system (OS). However, you can push memory overcommitment to the point where the hypervisor is unable to keep up, leading to potentially severe performance degradation on the VMs. With proper capacity planning you can estimate how much overcommitment is possible before risking performance problems.

Today’s blog will outline basic VMware memory terms, how and when VMware memory management initiates memory reclamation, and capacity planning best practices.

 
VMware memory terms

The following concepts are used in discussing memory on ESX hosts:

  • Capacity (Host level)
    The physical memory available on the host. (see Fig. 1)
     
  • Consumed Memory (Host level)
    Total memory in use on the ESX host, which includes memory used by the Service Console, the VMkernel, and vSphere services, plus the memory in use by all running VMs. (see Fig.1)

    Figure 1: vCenter with 37GB consumed host memory out of 64 GB physical capacity

  • Provisioned Memory (VM level)
    The amount of memory allocated to a VM plus the overhead needed to manage the VM. Provisioned memory is an upper limit – when a VM is powered on, it will only consume the memory that the OS requests, and the hypervisor will continue to grant additional memory requests made by the VM until the provisioned memory limit is reached.
     
  • Consumed Memory (VM level)
    Current level of memory consumption for a specified VM. (See Fig. 2)
     
  • Active Guest Memory (VM level)
    Hypervisor estimate of memory actively being used in the VM’s guest OS. The hypervisor does not communicate with the guest OS, so it does not know if any memory allocated to the VM is no longer needed. To gauge memory activity the hypervisor checks a random sample of the VM’s allocated memory and calculates the percent of the sample that is actively being accessed during the sampling period. (see Fig. 2)

    Figure 2: vSphere display of a VM’s Consumed Host and Active Guest Memory

  • mem.minFree
    A Host level minimum free memory threshold that is used to trigger memory reclamation. VMware initiates increasingly aggressive memory reclamation techniques as the free memory decreases further below the mem.minFree value.

     
    The mem.minFree value is calculated based on a sliding scale, in which 899 MB of memory is reserved for the first 28GB of physical memory, and 1% of memory is reserved for physical memory beyond 28GB. For example, for a 64 GB server:

    mem.minFree = 899 MB + (64 GB – 28 GB) × 1% = 899 MB + 369 MB = 1268 MB
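    A minimal Python sketch of this simplified sliding-scale formula (it mirrors the approximation described above, not ESXi's internal implementation):

    def mem_min_free_mb(physical_gb):
        """Approximate mem.minFree: 899 MB for the first 28 GB of RAM,
        plus 1% of physical memory beyond 28 GB."""
        extra_gb = max(physical_gb - 28, 0)
        return 899 + extra_gb * 1024 * 0.01    # 1% of the remainder, in MB

    print(round(mem_min_free_mb(64)))    # ~1268 MB, matching the 64 GB example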

 

VMware memory management: reclamation

Techniques
The following are VMware memory reclamation techniques, in order of severity:

  • Transparent Page Sharing (TPS) – VMware detects and de-duplicates identical memory pages. TPS begins by breaking up large memory pages (2MB) into smaller pages (4KB), and checks the smaller pages for duplicates. TPS is “transparent” to the VM’s guest operating system as it does not affect the amount of memory consumed by the VM. TPS is enabled within VMs; however, in VMware 6.0 it is disabled by default between VMs for security considerations.
     
     
  • Ballooning is a technique in which the hypervisor reclaims idle memory from a guest OS and returns it to the host. Ballooning works as follows:
     

    1. The hypervisor contacts a balloon driver installed on the guest OS as part of VMware tools.
       
    2. The hypervisor tells the balloon driver to request memory for a balloon process from the guest OS.
       
    3. The guest OS allocates memory to the balloon process. That memory is now unavailable for other processes on the guest OS.
       
    4. The balloon driver contacts the hypervisor with the details of the memory it has been allocated.
       
    5. The hypervisor removes the ballooned memory from the VM, lowering memory consumed by that VM.
       
    6. If memory problems are resolved on the host, memory can be returned to the VMs by “deflating” the memory used by the balloon and re-allocating the memory to the VM.

     
    From the guest OS perspective, the total memory has not been changed, but available memory has effectively decreased by the amount in use by the balloon process. Guest OS performance can be significantly degraded if the ballooning process reduces memory to the point where the guest OS needs to start paging.
     

  • Memory compression – the hypervisor looks for memory pages that it can compress and reduce by at least 50%.
     
  • Swapping – the hypervisor swaps memory pages to disk
     
Free Memory (% of mem.minFree)  State  Memory Reclamation Techniques
400%                            High   Break down large memory pages
100%                            Clear  Break down large memory pages + TPS
64%                             Soft   TPS + Ballooning
32%                             Hard   TPS + Memory Compression + Swapping
16%                             Low    Memory Compression + Swapping + Block VMs from allocating memory
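To make the thresholds concrete, here is a minimal Python sketch that maps a host's current free memory to the states in the table above (mem_min_free_mb is the helper sketched earlier; the threshold percentages come straight from the table):

def memory_state(free_mb, min_free_mb):
    """Return the host memory state based on free memory as a % of mem.minFree."""
    pct = free_mb / min_free_mb * 100
    if pct >= 400:
        return "High: break down large memory pages"
    elif pct >= 100:
        return "Clear: break down large pages + TPS"
    elif pct >= 64:
        return "Soft: TPS + ballooning"
    elif pct >= 32:
        return "Hard: TPS + memory compression + swapping"
    else:
        return "Low: compression + swapping + block new allocations"

# Example: 64 GB host (mem.minFree ~1268 MB) with 900 MB free
print(memory_state(900, 1268))    # Soft: TPS + ballooning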

 
Key Points:

  • TPS does not affect VM performance.
     
  • Monitor a VM’s guest OS for paging due to low memory during ballooning.
     
  • Memory compression and swapping can cause serious VM performance problems.
     
  • TPS and ballooning are relatively slow processes; if the host needs to reclaim memory quickly, it may resort to swapping.

     
    Capacity Planning Best Practices for Memory

    The goal of VMware memory management is to maximize memory use without starving your VMs of the memory they need to perform. To estimate memory requirements for capacity planning you need to look at both the Active Guest Memory metrics from VMware and the memory use metrics from the operating system.

    Figure 3: Longitude report of Windows memory used for an Exchange server

    Figure 4: Longitude VMware report of Active Guest Memory for an Exchange VM

    Figures 3 and 4 show memory use on an Exchange 2013 server running within a VM – Figure 3 is memory use from the Windows perspective (~7.5 GB), and Figure 4 is Active Guest Memory from the VM perspective (~1.5 GB). VMware Active Guest Memory underestimates the memory needed for Exchange. The best practice for underestimated Active Guest Memory is to allocate memory for the VM as specified by the application’s requirements, and to make sure that the VM will not lose memory to ballooning, either by running the VM on a host without memory overcommitment or by setting a memory reservation for the VM.

    Best practices for Capacity Planning for memory are:

    • Check memory use reports for the VM’s guest OS to gauge memory requirements, and defer to application recommendations for memory allocation.

    • Do not allocate more memory than your VM needs. Ideally, consumed memory for the VM should be close to the memory used by the guest OS, plus the overhead for running the VM.

    • If the Active Guest Memory estimate is significantly smaller than actual guest OS memory use, protect the VM from memory reclamation by either setting a memory reservation or running the VM on a host that has not been overcommitted.

    • If you overcommit memory, monitor Host Consumed memory. When Host Consumed is close to capacity, watch for signs of the Host Consumed value dropping (indicating TPS recovering memory) or unexpectedly low free memory on the guest OSes (indicating ballooning). A monitoring sketch follows this list.

    • While ballooning is occurring, monitor free memory and paging on guest OSes. Move VMs to hosts with more memory if ballooning causes performance degradation on the VMs.

    • If ballooning does not resolve low memory on the host, move or power down VMs before reaching the Hard memory state (32% of mem.minFree). Memory compression and swapping cause severe performance degradation.
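    The monitoring logic in the overcommitment bullets above can be sketched minimally in Python; the metric names and thresholds here are illustrative placeholders, not values from VMware:

    def check_host_memory(consumed_gb, capacity_gb, warn_pct=90.0):
        """Warn when host consumed memory approaches physical capacity."""
        used_pct = consumed_gb / capacity_gb * 100
        if used_pct >= warn_pct:
            return [f"Host consumed {used_pct:.0f}% of capacity - watch for TPS recovery and ballooning"]
        return []

    def check_guest_ballooning(guest_free_mb, baseline_free_mb, drop_pct=50.0):
        """Flag a guest whose free memory has fallen well below its baseline,
        a possible sign that the balloon driver is inflating."""
        if guest_free_mb < baseline_free_mb * (1 - drop_pct / 100):
            return ["Guest free memory unexpectedly low - possible ballooning; check guest paging"]
        return []

    print(check_host_memory(61, 64))           # host near capacity
    print(check_guest_ballooning(400, 2000))   # free memory well below baseline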

     
    Conclusion:

    Overcommitting memory can make the best use of your resources, but keep an eye on the host’s consumed memory and the effect of memory reclamation on your VMs. If you have VMs running memory sensitive applications, make sure you allocate enough memory for them and protect them from memory reclamation.

    Server Monitoring Best Practices

    IT is a pervasive presence in businesses, running everything from enterprise level scheduling applications to web based stores. Monitoring the health of the infrastructure that underlies a critical application helps to ensure that it is available when needed, and that it functions as designed. Missing a form submission in a scheduling application due to an overloaded database server results in missed deliveries. Missing a swipe in a smartphone application due to network congestion results in a lost sale. IT applications are the lifeblood of organizations and when the delivery of the information they produce is compromised the consequences can include losses in productivity, revenue, and business intelligence. Effective server monitoring is the key to keeping your IT applications available.

    One of the biggest challenges facing IT professionals is managing increasingly sophisticated and heterogeneous IT infrastructures. In the past, IT consisted of physical datacenters and servers, usually running a common OS. Management consisted of watching well known key performance indicators (KPIs) and logs.
    Fast forward to today. Applications can be dispersed across servers that are themselves dispersed across multiple locations. Servers can be virtualized either on local hosts or in the cloud. Each component has its own set of KPIs and logs, and IT professionals are tasked to do more with less. The IT profession has morphed from specialists handling a finite environment to generalists responsible for implementing and managing technology almost as fast as it is developed.

     
    Server Monitoring Best Practice 1: Monitor KPIs and Availability

    Knowing where to start monitoring can be challenging. The first line of defense for monitoring servers is watching for performance and availability issues that have an immediate effect. Critical early warning KPIs are CPU, Memory, Disk, and Network, but their importance and impact will vary depending on the platform. For example, high CPU on a physical server is addressed by looking for processes using more CPU than expected. High CPU on a virtual machine (VM) on a virtualization host with overcommitted CPU means not only looking for the process consuming the CPU on the VM, but also examining actual CPU utilization on the host to check if it is approaching the limit of available CPU and causing CPU Ready % to increase.

    Figure 1. Longitude KPI Overview Dashboard

     
    Keep the following in mind when monitoring KPIs:

    • Scheduled collection intervals should be frequent enough to pick up trends and minimize the effect of transitory spikes. Quickly changing metrics (CPU, Network Activity) should be sampled more frequently than metrics that change relatively slowly (memory, disk free space).
    • Start with default KPI thresholds and adjust on a per server basis. For example, database servers are often configured to consume as much memory as is available on a server, resulting in low free memory without an actual problem. Archiving KPI values allows you to create a baseline to adjust thresholds appropriately.
    • Use an overview dashboard that groups like servers together and allows you to drill down when a problem occurs for that group of servers.

    Monitoring server availability is also an integral part of server monitoring. The primary purpose of availability testing is to ensure that the services provided by the server are accessible. Availability tests include:

    • Request/response queries to the service provided by the server, e.g. accessing a web page or querying a database.
    • Verifying that services/daemons and/or processes are running.
    • Evaluating the output of diagnostic scripts.
    • Verifying that application ports are listening for requests (see the sketch after this list).
    • Pinging the server. Please note that pinging a device is the standard for up/down tests, but it runs the risk of missing problems on servers that respond to a ping request but whose OS or services are in a hung state.
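    A minimal Python sketch (standard library only) of the first two tests in the list above, a request/response check against a web page and a check that a port is listening; the hostnames, URL, and port are placeholders:

    import socket
    import urllib.error
    import urllib.request

    def port_listening(host, port, timeout=5.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def page_available(url, timeout=10.0):
        """Return True if the URL answers with an HTTP status below 400."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status < 400
        except (urllib.error.URLError, OSError):
            return False

    print(port_listening("db-server.example.com", 1433))   # SQL Server port
    print(page_available("https://www.example.com/"))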


    Figure 2. Longitude availability transactions.

     

    Server Monitoring Best Practice 2: Automate discovery of server infrastructure

    IT infrastructures encompass a mix of physical servers, VMs, virtualization hosts, and network devices, each of which has unique KPIs and availability metrics. VMs are especially volatile, as they can be spun up or down with little advance notice. The goals for automating infrastructure monitoring are:

    • Discover servers, VMs and network devices.
    • Monitor discovered devices with appropriate KPIs.
    • Monitor basic server availability with a ping.
    • Monitor new VMs as they are created.

     

    Server Monitoring Best Practice 3: Less is more for effective alerts

    We’ve explored what needs to be monitored; now let’s address what happens once a problem is detected. For alerts, the best practice is “less is more”:

    • Alert on critical problems with email or text messages (e.g. web site is down, or database is not responding).
    • Enable escalation on less severe persistent problems (e.g. long running SQL queries, or low disk space warnings).
    • If a problem can be resolved via automated intervention (i.e., a script), try that first, and then escalate if the problem still exists.
    • Smooth out transient spikes in volatile KPIs before alerting on a problem. For example, average three CPU collections taken at a 5-minute interval and evaluate the average against the KPI threshold (see the sketch after this list).
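    A minimal Python sketch of that smoothing rule: keep the last three samples and alert only when their average exceeds the threshold, so an isolated spike never pages anyone:

    from collections import deque

    class SmoothedAlert:
        """Alert only when the rolling average of the last n samples exceeds the threshold."""

        def __init__(self, threshold, samples=3):
            self.threshold = threshold
            self.window = deque(maxlen=samples)

        def add_sample(self, value):
            self.window.append(value)
            if len(self.window) < self.window.maxlen:
                return False                  # not enough samples yet
            return sum(self.window) / len(self.window) > self.threshold

    cpu_alert = SmoothedAlert(threshold=90.0)
    for sample in [95, 40, 35, 92, 94, 96]:   # e.g. CPU collections at 5-minute intervals
        print(sample, cpu_alert.add_sample(sample))
    # The isolated 95% spike never alerts; three consecutive high samples do.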

    Figure 3. Longitude automated problem correction.

     

    Server Monitoring Best Practice 4: Maintain Historical Context

    As the saying goes, “Those who fail to learn from history are doomed to repeat it”. For server monitoring, maintaining historical context is important not only for capacity planning, but also for recognizing problem patterns related to availability and performance. Does the problem recur? How often? When? And under what circumstance? Knowing the answers to these questions is critical to successfully understanding and mitigating outages.

    Figure 4. Longitude problem events vs. time of day

    Figure 4 displays a problem event summary report for the average event volume generated over the previous 30 days for a monitored vCenter console. The x-axis represents the hour of the day, and the y-axis represents the average event volume per hour. The persistent spikes during the 3:00 AM and 3:00 PM hours indicate a problem with a 12 hour recurrence that might have been missed when viewing events over a shorter timescale.

    Figure 5. Longitude historical problem event detail.

    The next step in investigating this pattern is to examine the details for the problems reported during the spikes in Figure 4. The report in Figure 5 shows a regular pattern related to excessive CPU usage on one VM. Further investigation would then examine the problem reports and archived KPI values for that VM.
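    The time-of-day pattern in Figure 4 is easy to reproduce from raw event data. A minimal Python sketch, assuming the events have been exported as (timestamp, message) pairs from your monitoring tool (the sample data here is illustrative):

    from collections import Counter
    from datetime import datetime

    def events_by_hour(events):
        """Count problem events per hour of day to expose recurring patterns,
        such as spikes at 03:00 and 15:00 that suggest a 12-hour cycle."""
        return Counter(ts.hour for ts, _ in events)

    events = [
        (datetime(2016, 5, 1, 3, 2), "High CPU on VM-12"),
        (datetime(2016, 5, 1, 15, 4), "High CPU on VM-12"),
        (datetime(2016, 5, 2, 3, 1), "High CPU on VM-12"),
    ]
    counts = events_by_hour(events)
    for hour in range(24):
        if counts[hour]:
            print(f"{hour:02d}:00  {'#' * counts[hour]}")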

    Conclusion:

    Operating an IT infrastructure that encompasses both physical and virtual resources complicates IT’s task of ensuring maximum availability for business critical applications. A cohesive server monitoring strategy is necessary to avoid outages that affect productivity and the bottom line.
    Server monitoring should:

    • Monitor appropriate KPIs in the context of server type and function
    • Monitor services provided by servers to identify issues before they impact the organization
    • Discover/automate IT monitoring wherever possible
    • Limit alerting to high severity and persistent issues and automate responses when possible
    • Maintain history and understand problem patterns

    If you would like to learn more about Longitude please visit heroix.com/longitude.

    Building a Splunk Map

    Visualizing incoming web traffic on a geographic map provides valuable insights for security monitoring, customer activity, and website traffic. Splunk® provides the ability to turn log data that contains IP addresses (e.g. firewall logs, web server logs) into a real-time activity map.

    To create a basic Splunk map, you simply specify which type of data you want to examine and how you want to drill down into that data. For example, if you have an Apache web server and want to display a map of web client requests counted by the city the request came from, you would use the following query:

    sourcetype=access_combined | iplocation clientip | geostats count by City

    The query works as follows:

    1. Data is restricted to only access_combined web log data, which tells Splunk to select all Apache format web logs.
    2. The name for the field containing the incoming IP address in Splunk’s access_combined CIM model is “clientip”. The iplocation command reads in the clientip for each record, looks up the geographic location for the IP address, and adds the following fields to each record:
      • Country
      • City
      • Region
      • Latitude (lat)
      • Longitude (lon)

       

    3. The geostats command then sorts the data into bins based on latitude and longitude, and plots the data on a map. The “count by City” argument for geostats is then used to populate the pie chart at each location.
    4. Hovering over a pie chart will display a pop-up showing the breakdown of traffic by City.

    There are a couple of refinements that can further enhance the value of the map:

    1. Eliminate internal traffic.
      You can use a wildcard or CIDR notation to specify ranges of IP addresses – for example, to eliminate traffic from internal 192.168.0.0/16 addresses, you could use either of the following:

      sourcetype=access_combined clientip != 192.168.0.0/16

      or

      sourcetype=access_combined clientip != 192.168.*

    2. Display an alternate value if City is blank.
      iplocation is not always able to assign names to the City, Region, or Country fields when it looks up an IP address. On the map, all points where the City is blank are grouped together under the name “VALUE”.
      To provide more accurate data, Splunk can use the eval command and the if function to copy the values from the Region or Country fields to the City field. For example:

      eval City= if( City = "", Region, City)

      will assign the value for Region to City if City is blank, and keep the current value if one exists. For this map we want to use Region if the value for City is blank, and use Country if both Region and City are blank. The full command to assign defaults for Country, Region and City is:

      eval Country = if( Country="", "N/A", Country),
      Region = if (Region="", Country, Region),
      City = if (City="", Region, City)

      For our map this can be shortened using nested if functions:

      eval City = if (City = "", if(Region = "", if(Country = "", "N/A", Country), Region), City)

    3. Display data for more cities in pie charts.
      In our default command, geostats is limited to keeping count of 10 cities, and all other cities will be grouped under the name “OTHER”. The “globallimit” argument can be used to change the number of cities geostats counts, with a value of 0 indicating that all cities should be displayed:

      geostats count by City globallimit=0

    Chaining these together yields:

    sourcetype=access_combined clientip != 192.168.0.0/16 |
    iplocation clientip |
    eval City = if (City = "", if(Region = "", if(Country = "", "N/A", Country), Region), City) |
    geostats count by City globallimit=0

    [Map with all city values resolved]

    This map is just one example of the many visualizations available with Splunk – if you have any questions about how to customize Splunk dashboards for your needs or would like to get started with a Splunk free trial, please contact us at splunk@heroix.com.

    Getting Started with a Splunk Trial

    To get the most out of a Splunk® trial you not only want to demonstrate Splunk’s value but you also want to configure Splunk for your environment so that you can quickly convert from trial mode to production mode. Consider the following when planning for your trial:

    1. What logs are you currently looking at?
      The initial use case for Splunk is often to streamline manual or scripted log parsing.  For example, tracking down customer activity at the request of customer support, or documenting application errors for developers who don’t have access to the production environment.  During your trial configure Splunk to automate the log parsing that is currently done manually and leverage Splunk’s reporting, alerting and dashboarding capabilities to provide a self-service portal to your end users.

    2. Are there additional logs that could speed up troubleshooting or improve security monitoring?
      While the initial use case may be to automate existing log parsing, Splunk’s ability to create a central location for all the machine data in your environment streamlines troubleshooting and provides a comprehensive overview of operations.  For example, Splunk can cross reference firewall logs and security logs in order to identify low and slow attacks that might otherwise not be detected.

      If you’re not already doing so, configure Splunk to index the following data inputs during your trial:

      • Windows Event Logs
      • Firewall Logs
      • Antivirus Logs
      • Syslogs
      • Web Server Logs

      Start by collecting data without filtering to determine data volume and to differentiate between useful information and noise.

    3. How much data are you indexing?
      In Splunk Enterprise, go to Settings → System → Licensing and click on the green Usage Report button.  The License Usage page will have tabs for Today and the Previous 30 days.  Click on Previous 30 days and in the Split by dropdown menu select Source type.   This breakdown of how much of each type of data you’re indexing can be used to plan the scale you will need for your production deployment.
      Keep in mind:

      • By default Splunk will collect all the data that exists for a data input on the first collection – so if you have several years of Apache logs, all of that data will be indexed by Splunk.  After that first collection you will get a more accurate reading for average daily volume by source, and you can observe the effects of filtering on collection volume.
      • It may take several minutes for Splunk to finish indexing older data, so clicking on the link to search data immediately after configuring the collection can return “No results found.”
      • Splunk’s trial version allows you to index 500MB per day, but you can contact a Splunk sales rep in order to get a trial license with an indexing limit more appropriate to your environment. If you index more data in a day than your license allows Splunk will display an alert.  If you exceed your licensing capacity more than 5 times in a 30 day period, you will need to contact your Splunk sales rep to have your license reset.  Splunk will continue to collect and index data but you won’t be able to search the data until the reset.

    4. Should you filter your data?
      Ideally, avoid filtering: there is always a possibility that you might later need information you’ve filtered out, and Splunk’s big data architecture can efficiently filter the data at query time.  However, licensing and disk constraints may sometimes require you to filter out data.  Splunk provides the ability to create whitelists or blacklists for collections, and the ability to truncate message data in Windows Event Log events.

    5. Where should I install Splunk during the trial?
      A Splunk Enterprise trial can use a single instance of Splunk without Universal Forwarders (Splunk collection agents).  In that scenario:

      • Installing Splunk on a VM will allow you to adjust resources as needed.
      • For Splunk to collect network data such as syslog or SNMP traps the servers or network devices producing the data need to be configured to send the data to Splunk.  Any firewalls between Splunk and the syslog/SNMP Trap producers should be configured to allow the data to pass through to the Splunk server.
      • Collecting Windows Event Log and Performance data is done via WMI and Splunk will use the permissions of its service account (Splunkd).  Set up the Splunkd service account to have at least local administrator privileges on the Windows servers being monitored.
      • File collections can be done using a UNC path.  If the Splunkd service account has admin privileges, it can map to administrative shares – e.g. \\web-server\c$\inetpub\logs\…\*.log

      Keep in mind that a Splunk installation scales using commodity based servers, so your initial Splunk deployment can be readily scaled up to meet production requirements.

    6. Where can you get help?
      Contact us at splunk@heroix.com  – we can answer any questions you have, work with you to configure your trial, and help you plan your production environment. For more information on Splunk visit  Heroix’s Splunk information pages.

    Compliance Regulations and IT Departments

    Compliance regulations often appear overly burdensome but they are also necessary.  Regulatory compliance is a sign that an organization is aware of its security obligations and implementing its best effort at protecting critical and confidential data.  The problem that faces IT departments is determining which of the array of compliance standards they should follow.

    The Successful SIEM and Log Management Strategies for Audit and Compliance White Paper presented by SANS lists some of the major regulations IT departments may encounter.  Appendix E provides basic details on the types of organization governed by each of the regulations, links to the relevant documentation, and an overview of what IT departments need to do to comply.  Some of the top regulatory concerns for IT departments are:

    HIPAA

    The Health Insurance Portability and Accountability Act (HIPAA) specifically applies to IT departments working in healthcare, medical records, insurance and any other medical-related business or environment, and is focused on preventing unauthorized access to patient information while still allowing authorized and medically necessary access.  For IT departments the regulations relate to the way information is created, stored, accessed, shared, and deleted.  The HIPAA site includes a list of links providing guidance on implementing IT security and privacy guarantees.

    FISMA

    The Federal Information Security Management Act (FISMA) applies to all government agencies and contractors interacting with government agencies.  FISMA’s overall goal is to implement cost-effective, risk-based, consistent, and auditable information security for all federal government information. Guidance on complying with FISMA is available from the National Institute of Standards and Technology (NIST) and includes Federal Information Processing Standards (FIPS) that cover cryptographic requirements.  An outline and checklist for FISMA reporting, certification and accreditation is also available in a SANS Research Paper, FISMA reporting and NIST guidelines.

    PCI

    Payment Card Industry Data Security Standards (PCI DSS) apply to all IT departments that handle any type of payment card information.  PCI compliance dictates how important financial information (like credit card numbers) can be collected, stored and transmitted.  Minimum requirements for compliance include maintaining a secure network, encrypting confidential cardholder information while at rest and in transit, segregating networks that handle cardholder information, and log retention.

    The PCI web site includes multiple Self-Assessment Questionnaires (SAQs).  The Understanding SAQs for PCI DSS PDF provides a guide to determine which of the SAQs is appropriate based on the type of PCI activity performed at the site.  PCI standards are updated frequently – the most recent version is PCI DSS 3.1, published April 2015.

    NERC

    The North American Electric Reliability Corporation (NERC) regulates the bulk power system in the United States, Canada, and Mexico’s Baja Peninsula in order to ensure its reliability and security.  NERC standards apply to companies that generate, provide or transmit energy.   NERC compliance requires IT teams to adhere to Critical Infrastructure Protection (CIP) reliability standards, with the documentation including detailed outlines of compliance measures.  Since organizations regulated by NERC are critical infrastructure components, CIP standards cover physical security in more detail than the other previously discussed regulations.

    Using Splunk® for Compliance

    Determining which compliance regulations apply to your IT department provides you with an outline of what information you need to collect, monitor and archive.  The next step is determining how to implement software to meet your specific requirements.  Ideally your software should be flexible enough to meet all your data collection needs, be able to correlate across multiple data sources and issue alerts, and then archive data to meet retention requirements.

    This is where Splunk comes in.  Splunk is ideally suited to meeting compliance needs:

    • Splunk ingests any type of machine data without pre-imposing a data format.  Splunk is able to ingest data even if the format is unknown or if it changes.
    • Splunk does not use a database, eliminating overhead and simplifying correlation across multiple formats.
    • Data is immediately searchable and can be correlated across multiple data types, providing a live view of your compliance posture.
    • Alerts and Dashboards can be preconfigured and provide the ability to drill down for root cause analysis.
    • Data can be archived automatically when it ages out and loaded back as needed.

    Splunk’s advanced features offer IT departments the ability to maintain regulatory compliance with HIPAA, FISMA and more.  Splunk’s flexibility and customizability allow IT departments to work smarter, not harder.

    Splunk for PCI Compliance

    PCI is an acronym that stands for “Payment Card Industry” and PCI Compliance is a specific and rigid set of requirements that all businesses that process credit or debit cards as a form of payment must follow.  The intent of PCI requirements is to ensure that sensitive personal information is protected and that there is an audit trail to investigate data breaches if security protocols fail.  For businesses this means archiving large quantities of network and security data and retaining that data for at least one year.

    Splunk is an affordable log collection, analysis, and archiving solution designed to help you maintain PCI compliance and protect your customers from the ramifications of insecurity in the digital age.

     

    The Importance of PCI Compliance

    If your business is found to be in violation of PCI compliance security standards you could be subject to fines by the company that you’re using as a credit card payment processor. Additionally, if your business is unfortunate enough to suffer a data breach where customer credit card information is actually stolen you could be hit with significantly larger fees from banks, credit card issuers and more.

    Failure to maintain PCI compliance is not a secret that you can keep.  Mandiant’s 2014 Threat Report noted:

    In each of the incidents we investigated, a third party — typically one of the major banks or card brands — had notified the retailers of the compromise. But in some instances, federal law enforcement notified the victims. The threat actors maintained access to the compromised systems for up to six months. (page 11)

    When a data breach occurs, the victims will investigate the unauthorized activity, and that investigation will be traced back to the compliance failure.  Maintaining compliance means not only earning and retaining the trust of your customers, but also avoiding the types of monetary fines and fees that could cripple even the strongest businesses in a way they may never recover from.

     

    Compliance Means Protection

    Failure to maintain PCI compliance doesn’t just mean that your business could be subject to catastrophic fines, although that is very much a concern. PCI compliance isn’t designed to be purely a “watchdog” – these rules are in place to protect your business and your customers from cyber security issues.

    According to the experts at Reuters, the average cost of a data breach in 2015 was $3.8 million. This was an increase from $3.5 million in 2014. This number only represents the cost to get your business back up and running again.  It does not take into account the goodwill or the revenue lost from customers who no longer trust your business with their important personal and financial information. Splunk is designed to help you maintain the trust and loyalty of your customers, all while acting as a premier performance monitoring and reporting solution at the same time.

     

    Automating Compliance

    PCI standards require log review and incident investigation when suspect entries are discovered.  Splunk’s data collection and log analysis tools can automate the repetitive search for known threats, providing dashboards and alerts that allow users to focus on investigation.

    Splunk includes features designed to help you maintain PCI compliance at all times.  Data indexers can be clustered to provide scalability, redundancy and high availability.  Splunk’s built-in search and analysis tools, along with user-customizable alerts and dashboards, provide the ability to automate log review and speed up incident investigation.  Splunk can be configured to automatically archive older data to long term storage to comply with PCI data retention requirements.

    Splunk also includes native tools that allow you to meet requirements of other governing bodies beyond the PCI compliance standards, including both FISMA and HIPAA, as well.

     

    Bypassing the Limitations of Traditional SIEMs with Splunk

    We have officially entered an age where Security Information and Event Management (SIEM) has never been more important, and never been more complex. Cyber security in general is at a crossroads, as it is increasingly common to read about yet another devastating breach that has affected one or more of the largest corporations on the planet. The data breach that affected Home Depot in 2014, for example, was expected to cost an estimated $28 million, which also equates to 0.01% of the company’s total sales during the same year. The highly publicized breach that struck Sony was estimated to ultimately cost around $35 million.

     

    The Role of SIEMs in Cyber Security

    Security Information and Event Management (SIEM) products are designed to play an important role in preventing these types of attacks altogether. Network events and security-related issues are analyzed in real-time, empowering IT professionals with the actionable information they need to identify patterns as they develop, spot attacks in their early stages and make any adjustments necessary to help stop a small problem before it becomes a much bigger (and more expensive) one down the road.

    SIEM products collect security-related information and other data from a wide range of sources, all of which is then loaded into a database that is scanned for known threats. While this model has worked well in the past, the complexity of threats has evolved dramatically in recent years, to the point where even this proactive measure isn’t necessarily enough to keep a company from making international headlines due to a catastrophic security incident.

     

    The Natural Limitations of the Traditional SIEM Model

    With a traditional SIEM model, the product in question will only collect data from a pre-defined source. This essentially means that if an attacker is savvy enough to know what a particular SIEM product will be looking for, they can work hard to cover their tracks so that the breach can go undetected for a much longer period of time.

    Another limitation of traditional SIEM products is that the databases that security-related information is loaded into require a specific format for processing. Getting all data into the correct format takes a large amount of time and energy. Depending on the size of an organization, this too can delay the detection of a security breach beyond the point of no return.

    Perhaps the biggest limitation of traditional SIEM products, however, is that they are only looking for threats that have previously been identified. If an attacker is breaking new ground, the SIEM will essentially be unable to detect it because it is “unaware” that this method exists in the first place. This can prove problematic as those with malicious intentions are always working to stay one step ahead of the countermeasures that businesses and organizations around the world work to deploy.

     

    Bypassing the Limitations of Traditional SIEMs with Splunk

    Splunk® is designed to elevate the natural capabilities of traditional SIEMs to the next level, offering all of the protection with none of the natural limitations. As an analytical tool, Splunk doesn’t just collect and analyze pre-determined data produced by a network – it analyzes all of the data, period. All raw data is indexed, at which point Splunk builds a schema from scratch to build the most comprehensive and flexible security profile possible given the specifics of the network that it’s actually working with.

    You don’t have to wait for data to be loaded into a database in the proper format, delaying the time it takes for threats to be identified. Known threats are detected immediately and unknown threats are uncovered as soon as the raw data is analyzed.

    The time has most certainly come for the traditional SIEM model to evolve every bit as aggressively as the techniques of the people who wish to do you harm in the digital realm. Splunk was designed from the ground up to be the most comprehensive answer to that call available on the market today.

    Troubleshooting Web Site Outages

    Your website will go down. It doesn’t matter how much redundancy you have built in – if AWS can crash, your servers can as well. Exactly how much you can fix on your own will depend on how much of your infrastructure is under your control and the exact nature of the problem. The following steps are a general troubleshooting outline:

    1. Write up documentation and configure monitoring:
      Document your site configuration and use the documentation as an outline for monitoring. Update the documentation and monitoring whenever your site is modified. The exact mix of what to monitor will depend on your configuration. In general, monitoring for a distributed website application can include:

      Component               Monitor
      Web servers             OS, web applications, web page availability
      Backend databases       OS, database applications
      DNS servers             Name resolution
      Mail servers            Exchange or SMTP availability
      Firewalls               Network traffic, internet availability
      Reverse proxy servers   Web pages, web applications
      Load balancers          Web activity on individual web servers
      Routers                 Network traffic
      All components          Ping

      Most performance metrics can be monitored from within your infrastructure, but web page availability should also be monitored from an external server that connects to your web pages over the internet. Building in a monitored segment that includes the internet will check for outages due to DNS errors or loss of internet connectivity.

    2. Verify alerts are not transient problems.

      Temporary glitches in resource availability or network connectivity can cause false alerts. Use the performance monitoring data you’ve collected on your servers to resolve recurring false alerts. For example, if you regularly get website outages overnight, check for backups that might use excessive server resources or too much bandwidth.

    3. Open your website in a browser and check for the following errors (a minimal triage sketch follows this list):
      • Server Not Found: Your site’s URL is not resolving to its IP address.
        • Troubleshoot this using nslookup to manually check name resolution.
        • If you host your own DNS servers, check that they are resolving correctly and that they aren’t being overwhelmed by a DNS DDoS attack.
        • If you use a DNS service, check whether it is experiencing an outage. If it is, you’re limited to checking for updates until the problem is resolved.
        • Keep in mind that once you or your provider fix DNS, it can take up to 24 hours before the change propagates, so set expectations (and possibly hosts files) accordingly.
      • Unable to Connect or Webpage is not available: The browser is not able to connect to a web server at the resolved IP address.
        • Can you connect to other devices on your intranet from the outside? Is RDP or VPN working? If you can get in from the outside and the internet connection is up, then the problem is internal to your site.
        • If the problem is your connection to the internet check your network equipment to see if anything has crashed or rebooted and hasn’t been reset to the correct configuration. If your equipment appears to be functioning properly then get in touch with your internet provider.
        • If the internet connection is not the problem, can you access the web page from the intranet? Check the firewall or any reverse proxy servers to see if they’re blocking the page.
        • If you can’t see the web pages from the inside, check that the web servers are up and the web services are running. If you’ve got a backup server, even if it is underpowered, bring that up to temporarily host the site while you finish troubleshooting the primary server.
      • Error 404: Page not found: The server is responding but the requested resource does not exist.
        If you have multiple servers hosting different pages on your website, check that the server hosting the requested page is up. For example, if you have a blog in a subdirectory of your web site (e.g. http://www.heroix.com/blog), verify that the blog server is running. When monitoring, check web page availability for one page on each separate web server.
      • Error 503: Server Unavailable: The server is unable to respond to the request.
        The server may be out of resources or overloaded with requests. Check resource usage on the web servers and check network traffic for possible DDoS attacks.
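    A minimal Python sketch (standard library only) of that triage order: check DNS resolution first, then a TCP connection, then the HTTP status, and report which of the failure modes above applies; the URL is a placeholder:

    import socket
    import urllib.error
    import urllib.request
    from urllib.parse import urlparse

    def triage(url):
        host = urlparse(url).hostname
        try:
            socket.gethostbyname(host)                         # DNS resolution
        except socket.gaierror:
            return "Server Not Found: DNS is not resolving - check your DNS servers or service"
        try:
            port = 443 if url.startswith("https") else 80
            socket.create_connection((host, port), timeout=10).close()
        except OSError:
            return "Unable to Connect: nothing answering - check connectivity, firewall, web services"
        try:
            with urllib.request.urlopen(url, timeout=15) as resp:
                return f"OK: HTTP {resp.status}"
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return "404 Not Found: check the server hosting that page"
            if err.code == 503:
                return "503 Unavailable: check server resources and traffic for a possible DDoS"
            return f"HTTP error {err.code}"

    print(triage("https://www.example.com/blog"))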

    As mentioned previously – this is a general outline for troubleshooting web site outages. You will need to adapt it to your servers and network configuration. If you’ve got any suggestions for additional browser errors or troubleshooting steps, please leave them in the comments.

    3 Simple Steps to Begin Monitoring Your Infrastructure

    As an IT infrastructure grows in size and complexity, implementing infrastructure monitoring becomes increasingly difficult.  Increasing scale makes it difficult to drill down from a distributed application performance issue to a resource constriction on one of the application’s back end servers.  A mix of different operating systems, network equipment, software versions, etc., adds further complexity, requiring multiple collection methods for the corresponding metrics and the ability to compare metrics across servers, applications, and your network.

    Fortunately, regardless of network size or complexity, the same 3 steps can be used to implement infrastructure monitoring:

    Monitor the operating systems 

    Monitoring the operating systems of all your servers provides a snapshot of resource availability.  Alerts for low free memory, high disk queue, or other resources provide the ability to proactively address bottlenecks before they affect application performance.  Grouping metrics together across multiple backend servers based on a common frontend application also provides a way to track application performance issues caused by bottlenecks that may be distributed across several servers.

    Data that has been collected for real time performance alerts can also be archived and used for capacity planning.  Capacity planning is typically used to extrapolate when additional resources will be needed based on long term usage trends.  However archived data can also be used to identify underutilized servers – this can be especially important in allocating resources to virtual machines (VMs) running on Hyper-V or VMware.

    Monitor connectivity

    The most basic tool for monitoring connectivity is a ping.  A ping failure can provide a quick alert for an offline device or network congestion severe enough to hang connections between devices.  However, pings are prone to intermittent failure due to transient network issues, resulting in false alerts.  To counter the problem of false alerts, only trigger email or page warnings if there are multiple consecutive failures.
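    A minimal Python sketch of that consecutive-failure rule, with a TCP connect standing in for an ICMP ping (raw ICMP needs elevated privileges or shelling out to the ping command); the hostname and port are placeholders:

    import socket

    def reachable(host, port=22, timeout=3.0):
        """Stand-in reachability test: TCP connect instead of ICMP ping."""
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except OSError:
            return False

    def should_alert(history, failures=3):
        """Alert only after N consecutive failed checks, ignoring one-off blips."""
        return len(history) >= failures and not any(history[-failures:])

    history = []
    for _ in range(3):                          # run once per polling interval
        history.append(reachable("app-server.example.com"))
    if should_alert(history):
        print("ALERT: host unreachable for 3 consecutive checks")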

    A ping is useful but it does have its limitations – it will tell you if a server is available on a network but it will not determine if a service is available on the server.  For example, a SQL server will respond to a ping even if the database is not running.  To monitor your ability to connect to resources across servers check that they are listening on their configured port and ideally that they are responding to requests.  For a SQL server, run a test SQL query transaction and check that the expected results are returned.  For a DNS server, check that a name is resolved correctly.  Map out the required connections between your servers and set up test transactions to verify that those resources are accessible.

    Monitor system logs

    System logs chronicle the activity on servers and network devices and can record everything from a successful logon to a bad block on a physical drive.  Log activity ranges in severity from low level Information or Audit Success messages to higher severity Warning, Error, Critical, and Audit Failure messages.  Unless you have a specific Informational or Audit Success event that needs to be monitored, avoid the lower severity events that comprise the bulk of log data and focus on the less frequent but more useful messages at Warning or higher severity.

    Collecting log data can provide both an early warning for a problem and also provide a forensic analysis of events that happened on a server or network device before a failure.  Unix and Linux systems and some network devices use Syslog to manage their logs.  The Syslog daemon can be configured to forward specified severity records to remote listeners.  The Windows equivalent to Syslog is the Windows Event Log which can be collected remotely through WMI.


     

    These 3 simple monitoring steps will provide an overview of your infrastructure’s health and provide tools that will map application performance issues to underlying resource constraints.  The next step after monitoring your infrastructure is application monitoring.