{"id":1125,"date":"2014-08-31T12:29:34","date_gmt":"2014-08-31T12:29:34","guid":{"rendered":"http:\/\/ixyzero.com\/blog\/?p=1125"},"modified":"2014-08-31T12:29:34","modified_gmt":"2014-08-31T12:29:34","slug":"10%e4%b8%aa%e5%ba%94%e8%af%a5%e7%9b%91%e6%8e%a7%e7%9a%84%e6%98%93%e5%bf%bd%e7%95%a5%e6%8c%87%e6%a0%87","status":"publish","type":"post","link":"https:\/\/ixyzero.com\/blog\/archives\/1125.html","title":{"rendered":"10 Metrics You Should Monitor [bak]"},"content":{"rendered":"<p>Today while browsing V2EX (guilty as charged; why am I aimlessly wandering the web again? How about setting a goal, or seriously reading a book to improve my professional skills, rather than idling online?), I came across a topic Livid shared under the DevOps node, \u201c<a href=\"http:\/\/www.v2ex.com\/t\/130938\" target=\"_blank\">10 metrics you should monitor, shared by the Bitly ops team<\/a>\u201d. I read through the original post and plan to translate it this evening when I have some spare time (assuming I don\u2019t find a girl to have dinner or chat with). For now, here is the <a title=\"10 Things We Forgot to Monitor\" href=\"http:\/\/word.bitly.com\/post\/74839060954\/ten-things-to-monitor\" target=\"_blank\">English original<\/a>:<\/p>\n<h1 style=\"color: #555555;\"><a style=\"font-weight: inherit; font-style: inherit; color: #555555;\" href=\"http:\/\/word.bitly.com\/post\/74839060954\/ten-things-to-monitor\" target=\"_blank\">10 Things We Forgot to Monitor<\/a><\/h1>\n<p style=\"color: #555555;\">There is always a set of standard metrics that are universally monitored (Disk Usage, Memory Usage, Load, 
Pings, etc). Beyond that, there are a lot of lessons that we\u2019ve learned from operating our production systems that have helped shape the breadth of monitoring that we perform at bitly.<\/p>\n<p style=\"color: #555555;\">One of my favorite all-time tweets is from\u00a0<a style=\"font-weight: inherit; font-style: inherit; color: #61b3de;\" href=\"http:\/\/bit.ly\/ZyuDJ0\" target=\"_blank\">@DevOps_Borat<\/a><\/p>\n<blockquote style=\"color: #919191;\">\n<p style=\"font-weight: inherit; font-style: inherit;\">&#8220;Law of Murphy for devops: if thing can able go wrong, is mean is already wrong but you not have Nagios alert of it yet.&#8221;<\/p>\n<\/blockquote>\n<p style=\"color: #555555;\">What follows is a small list of things we monitor at bitly that have grown out of those (sometimes painful!) experiences, and where possible little snippets of the stories behind those instances.<\/p>\n<h2 style=\"color: #555555;\">1 &#8211; Fork Rate<\/h2>\n<p style=\"color: #555555;\">We once had a problem where IPv6 was intentionally disabled on a box via\u00a0<code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">options ipv6 disable=1<\/code>\u00a0and\u00a0<code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">alias ipv6 off<\/code>\u00a0in\u00a0<code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">\/etc\/modprobe.conf<\/code>. This caused a large issue for us: each time a new curl object was created,\u00a0<code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">modprobe<\/code>\u00a0would spawn, checking\u00a0<code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">net-pf-10<\/code>\u00a0to evaluate IPv6 status. This fork bombed the box, and we eventually tracked it down by noticing that the process counter in\u00a0<code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">\/proc\/stat<\/code>\u00a0was increasing by several hundred a second. 
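<\/p>\n<p style=\"color: #555555;\">As a rough illustration (a minimal sketch, not the actual check_fork_rate.sh linked below), the counter can be sampled directly to estimate forks per second:<\/p>\n<pre style=\"color: #666666;\"><code style=\"font-weight: inherit; font-style: inherit;\"># delta of the kernel's 'processes' counter over one second = fork rate\nf1=$(awk '\/^processes\/ {print $2}' \/proc\/stat); sleep 1\nf2=$(awk '\/^processes\/ {print $2}' \/proc\/stat)\necho \"$((f2 - f1)) forks\/sec\"\n<\/code><\/pre>\n<p style=\"color: #555555;\">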
Normally you would only expect a fork rate of 1-10\/sec on a production box with steady traffic.<\/p>\n<p style=\"color: #555555;\"><a class=\"gist\" style=\"font-weight: inherit; font-style: inherit; color: #61b3de;\" href=\"https:\/\/gist.github.com\/jehiah\/8511258\" target=\"_blank\">check_fork_rate.sh<\/a><\/p>\n<h2 style=\"color: #555555;\">2 &#8211; Flow Control Packets<\/h2>\n<p style=\"color: #555555;\"><a style=\"font-weight: inherit; font-style: inherit; color: #61b3de;\" href=\"http:\/\/word.bitly.com\/post\/67486390974\/networking-traffic-control\" target=\"_blank\">TL;DR<\/a>: If your network configuration honors flow control packets and isn\u2019t configured to disable them, they can temporarily cause dropped traffic. (If this doesn\u2019t sound like an outage, you need your head checked.)<\/p>\n<pre style=\"color: #666666;\"><code style=\"font-weight: inherit; font-style: inherit;\">$ \/usr\/sbin\/ethtool -S eth0 | grep flow_control\nrx_flow_control_xon: 0\nrx_flow_control_xoff: 0\ntx_flow_control_xon: 0\ntx_flow_control_xoff: 0\n<\/code><\/pre>\n<p style=\"color: #555555;\">Note: Read\u00a0<a style=\"font-weight: inherit; font-style: inherit; color: #61b3de;\" href=\"http:\/\/monolight.cc\/2011\/08\/flow-control-flaw-in-broadcom-bcm5709-nics-and-bcm56xxx-switches\/\" target=\"_blank\">this<\/a>\u00a0to understand how these flow control frames can cascade to switch-wide loss of connectivity if you use certain Broadcom NICs. You should also trend these metrics on your switch gear. While you\u2019re at it, watch your dropped frames.<\/p>\n<h2 style=\"color: #555555;\">3 &#8211; Swap In\/Out Rate<\/h2>\n<p style=\"color: #555555;\">It\u2019s common to check for swap usage above a threshold, but even if you have a small quantity of memory swapped, it\u2019s actually the rate it\u2019s swapped in\/out that can impact performance, not the quantity. 
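<\/p>\n<p style=\"color: #555555;\">On Linux those rates show up as the <code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">pswpin<\/code>\/<code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">pswpout<\/code> counters in <code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">\/proc\/vmstat<\/code>; a minimal sketch (not the actual check_swap_paging_rate.sh linked below) is to diff two samples:<\/p>\n<pre style=\"color: #666666;\"><code style=\"font-weight: inherit; font-style: inherit;\"># pages swapped in since boot; the one-second delta is the paging rate\ns1=$(awk '\/^pswpin\/ {print $2}' \/proc\/vmstat); sleep 1\ns2=$(awk '\/^pswpin\/ {print $2}' \/proc\/vmstat)\necho \"$((s2 - s1)) pages swapped in\/sec\"\n<\/code><\/pre>\n<p style=\"color: #555555;\">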
This is a much more direct check for that state.<\/p>\n<p style=\"color: #555555;\"><a class=\"gist\" style=\"font-weight: inherit; font-style: inherit; color: #61b3de;\" href=\"https:\/\/gist.github.com\/jehiah\/8511306\" target=\"_blank\">check_swap_paging_rate.sh<\/a><\/p>\n<h2 style=\"color: #555555;\">4 &#8211; Server Boot Notification<\/h2>\n<p style=\"color: #555555;\">Unexpected reboots are part of life. Do you know when they happen on your hosts? Most people don\u2019t. We use a simple init script that triggers an ops email on system boot. This is valuable for communicating the provisioning of new servers, and helps capture state changes even if services handle the failure gracefully without alerting.<\/p>\n<p style=\"color: #555555;\"><a class=\"gist\" style=\"font-weight: inherit; font-style: inherit; color: #61b3de;\" href=\"https:\/\/gist.github.com\/jehiah\/8511374\" target=\"_blank\">notify.sh<\/a><\/p>\n<h2 style=\"color: #555555;\">5 &#8211; NTP Clock Offset<\/h2>\n<p style=\"color: #555555;\">If not monitored, yes, one of your servers is probably off. If you\u2019ve never thought about clock skew, you might not even be running\u00a0<code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">ntpd<\/code>\u00a0on your servers. Generally there are three things to check: 1) that\u00a0<code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">ntpd<\/code>\u00a0is running, 2) clock skew inside your datacenter, and 3) clock skew from your master time servers to an external source.<\/p>\n<p style=\"color: #555555;\">We use\u00a0<a style=\"font-weight: inherit; font-style: inherit; color: #61b3de;\" href=\"http:\/\/www.nagios-plugins.org\/doc\/man\/check_ntp_time.html\" target=\"_blank\">check_ntp_time<\/a>\u00a0for this check.<\/p>\n<h2 style=\"color: #555555;\">6 &#8211; DNS Resolutions<\/h2>\n<p style=\"color: #555555;\">Internal DNS &#8211; It\u2019s a hidden part of your infrastructure that you rely on more than you realize. 
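<\/p>\n<p style=\"color: #555555;\">A quick way to notice when it breaks is to exercise the resolver path from each host; a sketch (the name bitly.com is just a stand-in for a record you care about):<\/p>\n<pre style=\"color: #666666;\"><code style=\"font-weight: inherit; font-style: inherit;\"># resolve via this host's configured resolvers (getaddrinfo path)\ngetent ahosts bitly.com > \/dev\/null || echo \"resolution FAILED\"\n<\/code><\/pre>\n<p style=\"color: #555555;\">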
The things to check for are: 1) local resolution from each server, 2) if you run local DNS servers in your datacenter, resolution against them and the quantity of queries they receive, and 3) the availability of each upstream DNS resolver you use.<\/p>\n<p style=\"color: #555555;\">External DNS &#8211; It\u2019s good to verify your external domains resolve correctly against each of your published external nameservers. At bitly we also rely on several ccTLDs and we monitor those authoritative servers directly as well (yes, it\u2019s happened that all authoritative nameservers for a TLD have been offline).<\/p>\n<h2 style=\"color: #555555;\">7 &#8211; SSL Expiration<\/h2>\n<p style=\"color: #555555;\">It\u2019s the thing everyone forgets about because it happens so infrequently. The fix is easy: check it and get alerted with enough lead time to renew your SSL certificates.<\/p>\n<pre style=\"color: #666666;\"><code style=\"font-weight: inherit; font-style: inherit;\">define command{\n    command_name    check_ssl_expire\n    command_line    $USER1$\/check_http --ssl -C 14 -H $ARG1$\n}\ndefine service{\n    host_name               virtual\n    service_description     bitly_com_ssl_expiration\n    use                     generic-service\n    check_command           check_ssl_expire!bitly.com\n    contact_groups          email_only\n    normal_check_interval   720\n    retry_check_interval    10\n    notification_interval   720\n}\n<\/code><\/pre>\n<h2 style=\"color: #555555;\">8 &#8211; DELL OpenManage Server Administrator (OMSA)<\/h2>\n<p style=\"color: #555555;\">We run bitly split across two data centers: one is a managed environment with DELL hardware, and the second is Amazon EC2. For our DELL hardware it\u2019s important for us to monitor the outputs from OMSA. 
This alerts us to RAID status, failed disks (predictive or hard failures), RAM issues, power supply states, and more.<\/p>\n<h2 style=\"color: #555555;\">9 &#8211; Connection Limits<\/h2>\n<p style=\"color: #555555;\">You probably run things like memcached and mysql with connection limits, but do you monitor how close you are to those limits as you scale out application tiers?<\/p>\n<p style=\"color: #555555;\">Related to this is addressing the issue of processes running into file descriptor limits. We make a regular practice of running services with\u00a0<code style=\"font-weight: inherit; font-style: inherit; color: #666666;\">ulimit -n 65535<\/code>\u00a0in our run scripts to minimize this. We also set Nginx\u00a0<a style=\"font-weight: inherit; font-style: inherit; color: #61b3de;\" href=\"http:\/\/wiki.nginx.org\/CoreModule#worker_rlimit_nofile\" target=\"_blank\">worker_rlimit_nofile<\/a>.<\/p>\n<h2 style=\"color: #555555;\">10 &#8211; Load Balancer Status<\/h2>\n<p style=\"color: #555555;\">We configure our Load Balancers with a health check which we can easily force to fail in order to have any given server removed from rotation. We\u2019ve found it important to have visibility into the health check state, so we monitor and alert based on the same health check. 
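<\/p>\n<p style=\"color: #555555;\">In Nagios terms that can be the same style of check as the SSL example above; a sketch (the \/health path is a placeholder for whatever URL your balancer actually probes):<\/p>\n<pre style=\"color: #666666;\"><code style=\"font-weight: inherit; font-style: inherit;\">define command{\n    command_name    check_lb_health\n    command_line    $USER1$\/check_http -H $ARG1$ -u \/health -t 5\n}\n<\/code><\/pre>\n<p style=\"color: #555555;\">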
(If you use EC2 Load Balancers you can monitor the ELB state via the Amazon APIs.)<\/p>\n<h2 style=\"color: #555555;\">Various Other Things to Watch<\/h2>\n<p style=\"color: #555555;\">New entries written to Nginx error logs, service restarts (assuming you have something in place to auto-restart them on failure), NUMA stats, and new process core dumps (great if you run any C code).<\/p>\n<h2 style=\"color: #555555;\">EOL<\/h2>\n<p style=\"color: #555555;\">This scratches the surface of how we keep bitly stable, but if that\u2019s an itch you like scratching,\u00a0<a style=\"font-weight: inherit; font-style: inherit; color: #61b3de;\" href=\"http:\/\/bitly.com\/jobs\" target=\"_blank\">we\u2019re hiring.<\/a><\/p>\n<div class=\"postmeta\" style=\"color: #777777;\">by\u00a0<a style=\"font-weight: inherit; font-style: inherit; color: #61b3de;\" href=\"http:\/\/twitter.com\/jehiah\" target=\"_blank\">jehiah<\/a><\/div>\n<div class=\"postmeta\" style=\"color: #777777;\">#\u00a028 January 2014<\/div>\n<hr \/>\n<p>Hoping the translation comes out soon (whether I end up doing it myself or someone else does) (\u2606_\u2606)\/~~<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today while browsing V2EX (guilty as charged; why am I aimlessly wandering the web again? How about setting a goal, or seriously reading a book to improve my professional skills, rather than 
[&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[23,11,12],"tags":[30,321],"class_list":["post-1125","post","type-post","status-publish","format-standard","hentry","category-knowledgebase-2","category-linux","category-tools","tag-linux","tag-monitor"],"views":16517,"_links":{"self":[{"href":"https:\/\/ixyzero.com\/blog\/wp-json\/wp\/v2\/posts\/1125","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ixyzero.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ixyzero.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ixyzero.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/ixyzero.com\/blog\/wp-json\/wp\/v2\/comments?post=1125"}],"version-history":[{"count":0,"href":"https:\/\/ixyzero.com\/blog\/wp-json\/wp\/v2\/posts\/1125\/revisions"}],"wp:attachment":[{"href":"https:\/\/ixyzero.com\/blog\/wp-json\/wp\/v2\/media?parent=1125"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ixyzero.com\/blog\/wp-json\/wp\/v2\/categories?post=1125"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ixyzero.com\/blog\/wp-json\/wp\/v2\/tags?post=1125"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}