ChinaNetCloud Blog

Welcome to the ChinaNetCloud Blog, partly about the things we do, but more importantly about Operations, Clouds, technology, Linux, customers, sales, service, and other random topics we find interesting and want to share with you.
 

Content

 

Server Layers
Tech Choices - Best Configurations for the new Dell R420
 
 




 

Tech Choices - Why we use Nginx instead of Apache

Most of our customers come to us using Apache as their web server, especially in front of a PHP-based system using mod_php.  We always recommend they switch to Nginx and PHP-FPM, for scaling and performance reasons.
 
Apache is a great web server, very powerful, modular, and now the granddaddy of web serving.  Aside from bind and a few other tools, Apache has been the most-deployed open source system in the world, until recently running most of the world's websites.
 
But Apache is not perfect, and is no longer well-suited to large-scale systems.  Why?  Because of its process model, which is simple and flexible, but does not scale, especially when coupled to application code like PHP, which can be very memory-intensive.
 
A typical web app server has two parts.  The client connection part does the HTTP link to the browser and maintains long-lived TCP/IP connections, often for 1-2 minutes.  In big systems, it may need to carry thousands or tens of thousands of simultaneous connections.
 
This directly collides with Apache limits of about 500 processes and thus HTTP connections, and this is made worse by modern browsers opening up to six connections per host (up from two a few years ago).  So at more than 100 or so simultaneous users, Apache is full.
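
The arithmetic behind that estimate can be checked with a small Python sketch.  It uses only the two figures quoted above (about 500 processes, up to six connections per browser); the function name is made up for illustration.

```python
# Rough capacity estimate using the figures quoted in the text above.
APACHE_MAX_PROCESSES = 500      # approximate Apache process limit
CONNECTIONS_PER_BROWSER = 6     # connections a modern browser may open per host


def max_simultaneous_users(max_processes: int, conns_per_user: int) -> int:
    """Users that fit if each one holds conns_per_user connections open."""
    return max_processes // conns_per_user


if __name__ == "__main__":
    # With six connections per browser, only about 83 users fit.
    print(max_simultaneous_users(APACHE_MAX_PROCESSES, CONNECTIONS_PER_BROWSER))
    # With the older two connections per browser, the same server held 250.
    print(max_simultaneous_users(APACHE_MAX_PROCESSES, 2))
```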
 
The second part is the application processing part, which runs the code.  This is RAM and CPU intensive on most systems and thus must be limited in the number of processes, usually to about 10 per 1GB of RAM and 2 per CPU core, whichever is lower.  Thus a 4GB system with 16 cores should only have about 32 application processes: RAM would allow 40, but the CPU limit of 2 x 16 = 32 is the stricter one.
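
As a minimal sketch of this sizing rule, the Python function below takes the lower of the RAM-based and CPU-based limits.  The per-GB and per-core defaults are the rule-of-thumb numbers from the paragraph above; treating the two limits as "whichever is lower" is an assumption that matches the 4GB / 16 core example.

```python
def max_app_processes(ram_gb: int, cpu_cores: int,
                      per_gb: int = 10, per_core: int = 2) -> int:
    """Return the stricter of the RAM-based and CPU-based process limits."""
    ram_limit = ram_gb * per_gb        # about 10 processes per 1GB of RAM
    cpu_limit = cpu_cores * per_core   # about 2 processes per CPU core
    return min(ram_limit, cpu_limit)


if __name__ == "__main__":
    # The example from the text: 4GB of RAM and 16 cores.
    print(max_app_processes(ram_gb=4, cpu_cores=16))
```

A RAM-heavy, core-light box goes the other way: with 64GB and 4 cores the CPU limit of 8 processes applies.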
 
But the core problem is that Apache directly links the front-end client communications component to the backend application processing component.  As mentioned, the front-end part tends to be very long-lived, often several minutes, but the backend part tends to eat RAM and CPU.  There is no direct way to balance these on a large system, so they must be split.
 
There are two main ways to do this.  The first and easiest for an existing system is to put a Load Balancer or Nginx in front of Apache to handle the client connection part.  Load Balancers like HAProxy or Nginx can handle tens of thousands of simultaneous connections easily, and then let Apache really only function as the backend application part, with 32 or whatever processes.
 
The second and more common method is to replace Apache with Nginx and use PHP-FPM as the application server.  Like above, this splits the front-end client communications part from the backend application part.  Nginx handles the HTTP, while FPM runs the back end application, say with 32 processes.
 
There are still issues with these approaches, mostly in how load and off-server RPC calls are done (including MySQL queries).  Both of these issues are topics for another blog.
 
And using the Nginx-only approach can be a problem for applications that heavily use Apache features, especially rewrite rules, .htaccess, or optional modules like mod_security.  For these situations, putting Nginx in front of Apache is often the best method.
 
Generally, all new systems should be set up using Nginx and PHP-FPM.  This provides high-performance scaling and is the best choice for balancing users with RAM and CPU resources.  Existing systems can use Nginx or HAProxy in front to achieve the same effect, better serving users in today's modern Internet environment.
 

| Posted on August 29, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Server Layers

Broadly, we believe in avoiding problems, and the best way to do that is to focus on the following areas, where we design in the best possible starting point so that most potential issues never happen.  No system is perfect, and a rapidly-growing, ever-changing Internet system is always a challenge, but our strategy lets us all focus on the core issues that matter most to each specific customer and situation.
 
1. Hardware
 
You need to start with the best hardware selection and configuration, which takes experience and knowledge, as we see customers make mistakes all the time - just last week we saw a customer with a big investment but bad RAID cards, making the system too slow.  Easily fixed for 1500 RMB.  The same goes for CPU, RAM, disks, and other components.  We help customers make the best choices.
 
2. Architecture
 
Good architecture includes both software and operational structure, and we help with the operations part, making sure the customer has the right mix of Load Balancing, High-Availability, Scale, service separation, DB replication and scalability, video transcoding, storage, and flexibility for future changes and growth.  We also often help with questions of HTML and data Caching, search, RPC, NoSQL, Queue, and other difficult problems, at every scale and size.  We have seen almost every possible Architecture and are well aware of what works and what doesn't, and why.
 
3. Configuration
 
Each part of a system should be optimally configured, including the Hardware, Linux, Firewalls and Switches, Web Servers, PHP/Java, MySQL, and more.  We have world-class, often 5-10 page configurations that optimize Reliability, Performance, Scale, Security, and overall cost, based on the customer's needs, size, and situation.
 
4. Deep Monitoring
 
Running systems are complex and face dynamic operating environments, connectivity, applications, and other challenges.  By deeply monitoring all parts of the system, we can find and focus on the key parts of any given problem.  This is especially important on high-performance large-scale systems in various industries that often have unique and interesting problems.  We have solved many of these over the years by carefully analyzing our ever-expanding monitoring data and details, including custom components for critical parts of customers' infrastructure.
 
5. Tools
 
Every system has basic tools, but we are experts in all of the standard and most advanced Linux and service configuration, monitoring, and troubleshooting tools.  In many cases, we've added our own more powerful tools, based on our deep understanding of how these systems work, and particularly the unique needs of, and problems faced by large-scale dynamic systems.
 
6. Knowledge
 
All of the above items are enhanced by our deep knowledge and experience in all these areas, including how each part of the system works at a fundamental level, such as the hardware, Linux kernel and memory, networking at all layers, web, PHP, and application servers, and all aspects of database internals and operations.  These insights, combined with the above tools and monitoring, let us both develop the best configurations and solve difficult problems in real-time for our customers.
 
7. Seen it
 
As a service company working on hundreds of systems in every Internet industry, we've seen almost everything that can occur.  We do continually improve and learn new things, too, but our experience in all standard architectures, systems, and problems is very strong.  This lets us both recommend tried-and-true solutions and quickly solve important problems by leveraging lessons learned by and with other customers.
 
8. Processes
 
All of the above is tied together with our processes and procedures, which allow our team at all levels and roles to deliver world-class services to our wide range of customers in a scalable way.  These processes ensure consistent service, quality, and communications across our teams, 24 hours per day, and include our continuous tracking, management, QA, and follow-up that help ensure the best service levels.
 

| Posted on September 26, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |

 


 

Tech Choices - Why we use CentOS instead of Debian / Ubuntu

 
We run some of the world's largest Internet operations, so we are interested in reliability and stability - this is our Job #1.  For that we are a Linux shop; we only use Linux to power our customers' systems.  But which Linux distribution do we use ?  We use CentOS.  Why ?
 
As a large system operator, we need reliability and predictability over a large variety of systems over many years.  We need strong support by most of the world's software vendors and open source projects.  We need documentation, tools, and global resources for the most commonly used systems.
 
For all that, the RedHat / CentOS family of distributions is the way to go.  They provide all of the above with relatively few problems and are stable over many years, allowing us to provide world-class support on thousands of systems running every possible configuration, service, and application.
 
RedHat's Enterprise Linux (RHEL) is the gold standard of enterprise distributions, updated every five years or so, with an overriding focus on stability, predictability, and security.  Once a new major version is released, such as 5.x or the recent 6.x, all versions and code are frozen and only security or major bugs are fixed, typically by back-porting from newer versions of things.
 
CentOS is of course the open source version of the corresponding RHEL distribution, normally released soon after RedHat.  We use CentOS due to the very high cost of the standard supported RHEL version, about $800 per server, which is cost-prohibitive for many of our customers with dozens or hundreds of servers each.
 
There are two potential problems with the RHEL/CentOS systems.  The first is that once they freeze versions, they never change except for the security/bug fixes.  This is good for stability, but bad for many services like MySQL or PHP that are under heavy development and change a great deal over the five year distribution lifetime.  For example, RHEL/CentOS still uses MySQL 5.0 as its standard version, while 5.1 and 5.5 are now current versions.
 
Fortunately, this is easily fixed by splitting the repositories used by yum, such that core software such as the real RHEL/CentOS components including the kernel and all utilities still come from the distribution, but add-on software such as nginx, apache, php, Java, and MySQL come from newer sources such as Fedora or direct from vendors such as MySQL.  In our case, we have our own semi-mirror repos to handle all of this automatically.
 
The second potential problem is that CentOS releases can be delayed and lag behind RedHat, including for critical patches or fixes.  This was most obvious during the RHEL 6.x release cycle, but in our experience is not something to worry about and has never been an issue for us.
 
Many people ask us why not use Debian-based systems such as Debian or Ubuntu server.  We do support these if there is no other choice, but in our experience, they are not nearly as stable or trouble-free as RHEL/CentOS. 
 
We think this is in part due to their rapid development and much less testing / maturity of all the different versions and combinations.  Also, despite the popularity of these distributions, major vendors and projects are still primarily deployed on RHEL/CentOS systems where they can sell support to enterprise customers (as is certainly true for Oracle and MySQL).  Beyond that, we've had lots of kernel and stability issues on Debian-based systems, especially in our clouds.
 
To us, the only real reason to use Debian/Ubuntu is if they provide special functionality that is necessary for a system, especially if a newer kernel is needed for some driver or IO subsystem reasons such as ext4 or journal changes in recent kernels; all of which are now in RHEL/CentOS 6.x.
 
Of course, most developers who use Linux are on Ubuntu desktops, so it's understandable they'd like to use the same system for production, and there are some cool tools to use.  But overall, we still feel that RHEL/CentOS is a much better deployment platform, and having migrated dozens of customers to production CentOS without problems, we think this is still the way to go.
 
Overall, you'll be happier with a well-managed CentOS system.  We have thousands of servers on this platform and have about one server crash PER YEAR; it's so rare we hardly think about it, so pick a good platform and be happy, reliable, and fast.
 

| Posted on September 25, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Tech Choices - Best Configurations for the new Dell R420

 
Dell is our favorite server vendor, and they have just released their newest generation of high-performance servers, the Rx20 line including the R420, R520, R620, R720, and more.  This blog covers optimal configurations for our favorite server of choice, the R420.
 
These servers have dozens of options, each with many choices, so you can literally build thousands of different configurations.  This makes it very difficult for a casual buyer to make the best choices, and we often see customers who have bought either too little or too much of a server.
 
So, let's walk through how we think about server hardware configurations.  The best way to see this for a Dell Server is to use the website configuration page which gives you most (but not all) of the valid configurations, and helps you avoid mistakes.
 
Chassis - The choices are 2.5" or 3.5" drives.  Generally we prefer 3.5" as they have better performance and higher capacity, though you only get four of them.  Always choose hot swap if you can.  So, take 4x3.5" Hot Swap unless you really need 8 disks.
 
CPU - These are always changing and in this version there is massive variation in performance and price, up to 5X from top to bottom.  Our general rule is to choose the lowest of the mid-level CPUs, with hyperthreading and in this version, 6 cores.  So this means the E5-2420 or 2440 with 6 cores and good price/performance.  Get two of them.
 
RAM - These servers greatly expand on the R410 system which only had 8 slots of 8GB each.  Now we get 12 slots for up to 16GB DIMMs.  Remember this is a NUMA system and RAM goes in pairs if you have two CPUs, so we often put in 8x8GB for 64GB, leaving 4 slots free to add more 8GB or 16GB RAM if needed.  For a smaller system, use either 4x8GB or 2x16GB for a nice small 32GB system with great expandability.
 
RAID - This is easy.  For a web or other server with low IO, use the H310 card with no cache. For any DB, Xen/Cloud, Search, or other high I/O system, use the H710 with flash-backed cache.
 
Disks - This is more difficult and highly application-dependent.  Generally, for web and other standard applications, use two near-line SAS disks (which are really SATA with a better interface), starting with 2x1TB disks (only $50 more than 500GB).  For high-performance systems get 15K SAS disks, starting with 2x300GB, up to 600GB if you need them.  The best flexible layout is 2x1TB near-line SAS and 2x300GB SAS 15K - we use this for our clouds and bigger systems.
 
NIC - The built-in NIC has two ports and is fine for most uses.  If you really need an extra port for iSCSI or other use, then you can add a card for 2-4 more.
 
DVD - We still like to buy a DVD reader for disks, even though you can boot from USB.  We find that data centers and other IT staff still like to burn DVDs from ISOs for install, troubleshooting, etc. and this is an inexpensive way to save a server when there are issues.
 
Internal SD Storage - We are starting to like this, as a place to keep tools like boot diagnostics, etc.  It's pretty cheap, so 1-2GB seems useful.
 
DRAC - We always choose the DRAC7 Enterprise which has true remote console functionality (not just IPMI).  This lets us fully manage and troubleshoot boot and other issues remotely, as most of our servers are spread all over the world.
 
Power Supply - It used to be that power supplies often failed, so all servers have the option of dual redundant power, but we feel this is not necessary.  We've actually never had a PS fail in recent years, and the extra cost and cabling complexity is usually not worth it.  Get one PS and forget about this problem.
 
Overall, that's how we think about buying servers, though of course we can and do customize based on actual situations, especially for RAM and disk requirements.
 

| Posted on August 20, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Load Balancing - HAProxy vs. Nginx

 
Another, more important measure of scale is the number of simultaneous connections it can carry.  This is critical to large sites that need to carry lots of long-distance keep-alive connections.  We have HAProxy systems that carry up to 250,000 simultaneous connections, far more than Nginx or any other front end software system.
 
HAProxy's configuration system is very powerful, allowing many front-end listening 'servers' and then many back-end server pools, that can all be mixed and matched using various rules.  This is how large-scale sites with many listening IP/port/host combinations can flexibly move their loads around their backend servers as code changes, migrates, is upgraded, etc.
 
HAProxy can also rewrite URLs, much like Nginx.  It's not quite so powerful, but good enough for most uses, and can include various useful redirects to remove these little annoying settings from each backend.
 
The backend monitoring and control is very powerful, with complex configurations of what and how to monitor, how to determine availability, how fast to ramp up load, and so on.  It even has reserve and backup pools for use when primary pools are all off-line, which is very useful on big systems for failover, maintenance, etc.  Dynamic pool changes, removing servers, etc. can also be done via sockets (though we had to write tools to make it easy).
 
The best part of all this is the monitoring of HAProxy itself, which is provided through a very useful GUI via HTTP, and also via a socket which can query and control the system.  Each front and backend pool, and each server, has about 15 different data values and statuses, including request rates, connections, errors of various types, time at this status, max connections and rates, and more.
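
To give a feel for the socket interface, here is a minimal Python sketch that sends one command to the HAProxy stats socket and parses the CSV that the "show stat" command returns.  It is an illustration only, not the tooling mentioned above: the socket path in the usage note is an assumption (it depends on the "stats socket" line in your HAProxy configuration), and the helper names are made up for this example.

```python
import csv
import io
import socket


def haproxy_command(socket_path: str, command: str) -> str:
    """Send one command to the HAProxy stats socket and return the reply text."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(command.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:          # HAProxy closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


def parse_show_stat(text: str) -> list[dict[str, str]]:
    """Parse the CSV returned by 'show stat' into one dict per row."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    # The header line starts with "# "; strip it so the first column is "pxname".
    header = lines[0].lstrip("# ")
    reader = csv.DictReader(io.StringIO("\n".join([header] + lines[1:])))
    return [dict(row) for row in reader]
```

With a socket configured at a hypothetical /var/run/haproxy.sock, parse_show_stat(haproxy_command("/var/run/haproxy.sock", "show stat")) returns one row per frontend, backend, and server, each with fields such as pxname, svname, scur, and status.  Control commands such as "disable server" go through the same socket, but need an admin-level socket.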
 
HAProxy also has L4 LB functions, most useful for balancing MySQL Read Slaves, but can work on any TCP/IP connection, including XMPP, node.js, games, or other socket-based systems.
 
Finally, HAProxy has very powerful log facilities, both in terms of information and configuration, which can be fully customized.  The detailed status is most useful, where it tells you how each connection ended in great detail, which is very useful for troubleshooting.
 
So, what's wrong with HAProxy ?  Two things:  SSL and Multi-Core.
 
HAProxy does not do SSL.  So if you have to terminate encrypted traffic, you need a front-end to do this; we usually use Nginx for this.  This means that a typical LB server has Nginx listening on port 443 and then sending decrypted traffic to port 80 or 81 to HAProxy for real load balancing.  A bit complex, but works great and is very scalable.
 
Additionally, HAProxy, like Nginx, is a single-threaded event-driven system, which makes it fast and scalable, but limits its final scale to a single core.  Multi-core operation is supported and allows 2, 3, 4X and more performance, but the monitoring is per-process, which makes things more complex.  This is rarely a problem, though, as it's only needed at very high performance levels of tens of thousands of requests/second on hundreds of thousands of simultaneous connections.
 
Even with these issues, though, HAProxy is overwhelmingly the best Load Balancer available and should be used on nearly all new systems.
 

| Posted on August 17, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Sharing Files & Assets between Web Servers

 
Most of our customers have multiple web and application servers running their systems.  They do this for several reasons, including high-availability, scalability, and to split various functions and services.
 
The problem this creates is how to share a single copy of various assets such as images, code, media files, and other things between the various web servers ?  There are several ways, but some are more useful or safer than others, and each has advantages in different situations or data sizes.
 
The first rule for these systems is never place application code on a shared resource, such as NFS.  This creates performance and perhaps severe reliability problems on many levels, so don't do it.  You can rsync or maybe copy from NFS, but always execute a local copy.  Always.
 
But site images, user pictures, music, static content, etc. are all good to share if possible.
 
The first thing to think about is where the files come from and who updates them.  In most systems, all web servers need to read and write the files, such as user uploaded images.  But in other systems and for things like static web images, one server may be the writer (such as web1) and the others only sync from it, making things easier.
 
The simplest systems just need to share static files that can be written on web1 and read by web2, web3, etc.  The easiest way to do this is just rsync run by cron.  Very simple, very easy, and very fast for any sized system, even with thousands of files.  There are important options for rsync for this, but it's a good and reliable way as long as basic rules are followed.
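
As a rough sketch of the rsync-by-cron approach, the Python helper below builds and runs an rsync command that mirrors a directory from web1 to another server.  The host name and paths in the usage note are placeholders, and the options shown (-a, -z, --delete) are one reasonable choice rather than the full set of "important options" referred to above.

```python
import subprocess


def build_rsync_command(source_dir: str, dest_host: str, dest_dir: str,
                        delete: bool = True) -> list[str]:
    """Build an rsync command that mirrors source_dir onto dest_host:dest_dir."""
    cmd = ["rsync", "-az"]          # -a: archive mode, -z: compress in transit
    if delete:
        cmd.append("--delete")      # drop files on the reader that the writer removed
    # A trailing slash on the source copies the directory's contents,
    # not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", f"{dest_host}:{dest_dir}"]
    return cmd


def sync(source_dir: str, dest_host: str, dest_dir: str) -> None:
    """Run the sync once; meant to be called from a cron job on the writer."""
    subprocess.run(build_rsync_command(source_dir, dest_host, dest_dir), check=True)
```

For example, sync("/var/www/static", "web2", "/var/www/static/") run every few minutes from cron on web1 would keep web2 in step.  Be careful with --delete: if the source directory is ever empty by mistake, the readers will be emptied too.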
 
There are also more sophisticated sync processes, usually used for fast-changing files, such as user uploads.  A user uploads their image and wants to see it immediately, which is hard to do with rsync run from cron.  There are special tools for this, such as those built on inotify, that can sync files almost instantly but are complex to use.
 
Another common way is to use NFS to share files.  We generally don't like this because NFS can be hard to manage, has security issues, and most importantly creates a single point of failure in your system.  If the NFS server has a problem, all web servers are down.  Further, NFS has some kernel issues that can crash systems forcing reboots in some cases.   NFS is useful and powerful, but we try to avoid it unless it's really necessary - we just don't like web servers loading from NFS in real time.
 
Larger and more complex systems can use clustered file systems like GlusterFS, which we've started to use.  These allow an NFS-like read/write of files, using user-space tools or FUSE file systems that are easier than NFS to use, but can be complex to set up.  They can scale and also remove the single point of failure in many cases, since the data is spread between servers so any crash doesn't affect other servers.
 
Finally, many sites are no longer using any system at all; instead, they are pushing all their assets to a common shared storage system like Amazon S3 or AliYun OSS.  This removes the need for large disk storage or any sync process.  They just upload all images, etc. to the Cloud and serve from there - simple, easy, and usually economical.
 
There are also other ways, but these are the most common that we see and recommend to solve your file sharing needs.
 

| Posted on August 8, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Backups should be off-site

 
We have lots of customers who have good backups, but they are all in the same data center, and often on the same server as the main data.  This is dangerous and not recommended because many bad things can happen to data centers, and such a loss would usually mean the end of the business.
 
Modern data centers are great places, but bad things happen, including and especially fire, but also earthquakes, water leaks, and other problems, especially in older urban buildings.  Also, theft sometimes happens, where a server might walk out the door with your data.
 
Also, if a hacker gets on your server, he might erase or damage your data, and then delete your backups.  This can mean the end of your company.
 
So while on-site/on-server data is good for fast recovery in the event of accidental deletion or DB corruption, it cannot protect you from real and possibly fatal data loss.  So you need to get a copy of your data off-site, ideally once per week.  How to do that ?
 
Usually, for small systems, it's easy and obvious: just compress the data and push it somewhere else via sftp, rsync, etc.  That 'somewhere' can be another data center with good bandwidth, or your office if you have a fast link.  Increasingly popular are on-line storage systems like S3 (AliYun OSS in China), which provide unlimited storage (at a price), flexibility, and ease of use.
 
However, often backups are much too large to push off-site every day or week.  For these situations, there are different solutions we help customers with, such as splitting up data and sending some each day of the week, such as images, videos, user data, history data, etc.
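
One way to picture the "some each day" idea is a simple weekday rotation.  The Python sketch below is only an illustration: the data set names are taken from the examples above, and which day each one is sent on is an arbitrary assumption.

```python
import datetime
from typing import Optional

# Hypothetical rotation: one data set per weekday (0 = Monday).
WEEKLY_ROTATION = {
    0: "images",
    1: "videos",
    2: "user_data",
    3: "history_data",
}


def dataset_for(day: datetime.date) -> Optional[str]:
    """Return the data set to push off-site on this day, or None for a rest day."""
    return WEEKLY_ROTATION.get(day.weekday())


if __name__ == "__main__":
    today = datetime.date.today()
    print(today, dataset_for(today))
```

A nightly backup job would call dataset_for() and push only that data set, so each one still gets off-site once per week.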
 
Other methods include doing incremental backups with monthly or weekly baselines, or using various technologies to synchronize or replicate data of different types.  Each system, from images to databases to log files, usually requires a different set of tools or thinking.  Sometimes we'll just take the key user and financial data and skip the rest, or find an off-line method such as physical tape or disk drive, which may be the only way with many TB of data.
 
It's best if the server being backed up can push its data to a remote location, but can't delete or otherwise change it after writing.  This prevents hackers from deleting backups, and also avoids any problem with backup system bugs or other issues from damaging data.
 
One popular place to store data is in your development system, since you should use close-to-production data for testing.  But you should scrub or delete personal info from production data before loading it in a dev/test system to protect data privacy and from data theft.  Thus, best practice is to push data to your office, then pull a copy, scrub it, and load in test systems.
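
A scrub step can be as simple as replacing personal fields with placeholders before loading.  The Python sketch below assumes a user row with hypothetical field names (id, email, phone, name); a real system would have its own schema and probably more fields to cover.

```python
def scrub_user(row: dict) -> dict:
    """Return a copy of a user row with personal fields replaced by placeholders."""
    clean = dict(row)                                   # leave the original untouched
    clean["email"] = f"user{row['id']}@example.invalid"
    clean["phone"] = "00000000000"
    clean["name"] = f"Test User {row['id']}"
    return clean


if __name__ == "__main__":
    production_row = {"id": 7, "email": "bob@real-domain.com",
                      "phone": "13800000000", "name": "Bob", "orders": 3}
    print(scrub_user(production_row))
```

Non-personal fields such as the order count are kept, so the test data still behaves like production data.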
 
Regardless, ANY DATA taken off-site should be encrypted.  We cannot emphasize this enough, and there have been too many data losses or thefts of private data in recent years to forget this key point.  Use any type of encryption you want, and make sure it's fast, given the large data size, but always encrypt.
 
Of course, encryption is useless if you forget the password/key or type it wrong, so always test your decryption periodically, such as monthly or quarterly.  This is easily done on the target location, for example decrypting / decompressing, and reading a tar file header - this can be very fast as it only needs the first few KB of the file, regardless of how large it is.
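
The tar header check can be sketched in a few lines of Python.  This assumes the backup is a gzip-compressed tar file and that the stream passed in has already been decrypted by whatever tool you use; the function name is made up for this example.

```python
import tarfile
from typing import BinaryIO


def first_tar_member(stream: BinaryIO) -> str:
    """Return the name of the first entry in a gzip-compressed tar stream."""
    # Mode "r|gz" reads the stream sequentially, so only the start of the
    # file needs to be decrypted and decompressed to reach the first header.
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        member = archive.next()
        if member is None:
            raise ValueError("archive is empty")
        return member.name
```

If the key is wrong, the decrypted bytes will not be valid gzip data and tarfile raises an error instead of returning a name, which is exactly the signal you want from a periodic check.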
 
Finally, be sure to test your backups.  A backup is only useful if you can find it, decrypt it, and it works.  The only way to know is to periodically, at least quarterly, run the whole recovery process on a test or virtual server.
 
With all of this, you can sleep at night knowing your data is safe and secure.
 

| Posted on August 1, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Security - MySQL Serious Password Security Bug, Many Systems at Risk

 
A very serious bug in the MySQL password system was recently discovered that allows anyone to easily access a DB server as root.  This is probably the most serious MySQL or DB bug in years, though fortunately only some systems are affected.
 
The standard MySQL.org official builds that ChinaNetCloud uses are NOT vulnerable and thus okay, but it seems that most Ubuntu, Debian, Fedora, and many other distribution builds are affected.  Red Hat and MySQL/Oracle official builds are okay.  The core bug actually only appears with a compiler optimization used in some builds (SSE by GCC).
 
To make this worse, many websites do not properly protect their databases.  We see many, many systems where the DB server has a public IP, with no real firewall, and poor user control - a recent Internet survey found 1.75 million of these.  If even 10% are vulnerable versions, then hundreds of thousands of DBs are suddenly hackable in a few seconds. 
 
Check your system today or call us for assistance, as we can tell you which versions are affected and have test tools/code you can run to check.
 
Also, this shows why the ChinaNetCloud best-practices approach to security is the best way, using layers of security to protect against exactly this type of problem.  We use physical and host firewalls, we use private IPs if possible, we strictly limit DB user host sources, and limit user privileges as much as possible (for example we only enable the root user on the localhost Unix socket, so any network attack like this is useless).  This all helps ensure that if there is some serious problem in one layer or system, the other protections prohibit or limit the damage.
 
We also never use source versions which are much harder to upgrade and fix; we believe the best practice is to use the official widely-used and well-known versions / builds so that bugs are found and fixed quickly.
 
We can help improve your security with free audits; if you are interested, contact us or call your Sales Manager today.
 

| Posted on July 26, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

User Password Protection

 
There have been several high-profile hacker attacks on password systems recently in the news, including Yihaodian in China and LinkedIn in the U.S., both of which resulted in the publishing of user/password pairs.  This is unnecessary and unfortunate as it reduces overall consumer confidence in the Internet in general and e-commerce in particular.  At ChinaNetCloud, we pride ourselves in ultra-strong security and in helping our customers continually improve their systems - we also have a free system security audit we can do for you to see how well you are doing.

Yihaodian was most foolish, apparently storing passwords in plain text in their system.  This is a despicable practice that should never be used, ever, for anything.  The reasons are obvious, as anyone with access to the DB, including hackers, programmers, and many others, can see, use, and sell the passwords.  Plus, many people re-use passwords in many systems, so if you know one password that user Bob uses on one system, you can try it on all others, too, including corporate, code, finance, health, and other sensitive systems.

LinkedIn was much stronger, storing passwords hashed with MD5 or SHA-1, which is a very good and standard method for password protection, but in modern times not good enough.  The reason, known for decades, is that there are two key ways to leak passwords.  The first is that the same passwords will hash to the same value, so if my password's hash is "34AH8CD" and yours is the same, we know our passwords are the same and I can log into your account.  The second problem is that common hashes can be precomputed in something called a Rainbow Table, such as md5("a"), md5("b"), md5("test"), etc. for millions of common passwords - then all the hacker has to do is compare the results in the password DB with the Rainbow Table - this is common and is what was used partly for the LinkedIn release.
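
Both weaknesses are easy to demonstrate.  The Python sketch below uses a plain precomputed lookup table, which is a simplification: a true rainbow table uses hash chains to save space, but the effect on unsalted hashes is the same.  The user names and password list are made up for the example.

```python
import hashlib


def md5_hex(password: str) -> str:
    """Unsalted MD5 of a password, as a hex string (shown only to illustrate the weakness)."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def build_lookup_table(common_passwords: list[str]) -> dict[str, str]:
    """Precompute hash -> password for a list of guesses."""
    return {md5_hex(p): p for p in common_passwords}


def crack(stolen_hashes: dict[str, str], table: dict[str, str]) -> dict[str, str]:
    """Return user -> recovered password for every stolen hash found in the table."""
    return {user: table[h] for user, h in stolen_hashes.items() if h in table}


if __name__ == "__main__":
    table = build_lookup_table(["a", "b", "test"])
    stolen = {"bob": md5_hex("test"), "alice": md5_hex("test")}
    print(stolen["bob"] == stolen["alice"])   # same password, same hash
    print(crack(stolen, table))               # both recovered by simple lookup
```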

How to prevent this ?  In a word, salt.  Not salt as in salt and pepper, but extra random data added to the password before hashing.  This way two identical passwords will have different hashes.  Unix has had this in /etc/passwd forever with a 12-bit salt, but now 48-128 bits are common, often in combination with key-stretching using multi-round hashing.
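
A minimal sketch of salting plus key-stretching, using PBKDF2 from the Python standard library.  The choice of PBKDF2-HMAC-SHA256, the 128-bit salt, and the iteration count are illustrative assumptions, not recommendations from this post; pick parameters to suit your own hardware and policy.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000   # illustrative work factor for the multi-round hashing


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived_key) for a new password; store both with the user."""
    salt = os.urandom(16)   # 128 bits of random salt, unique per password
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, key


def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Re-derive the key with the stored salt and compare in constant time."""
    key = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(key, expected_key)
```

Because each password gets its own random salt, two users with the password "test" end up with different stored values, and a precomputed table of common hashes is of no use.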

All new and upgraded systems should use proper salting and hashing to create secure passwords, to protect their users, systems, and data.  And you should follow other good practices on non-password data such as mobile phone numbers, email, transaction histories (also all stolen from Yihaodian) by limiting server access, and purging all such data from DBs used for developing and testing.  Plus, always, always encrypt your backups before they leave the server.

References:
Paper on good password storage:  http://www.aspheute.com/english/20040105.asp

| Posted on July 25, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Tech Item - Backups should be off-site

 
Most of our new customers already have backups.  Most of them work, at least part of the time.  And the customers generally feel good about their backups (even if they've never tested them).  

But most backups are only stored locally, not off-site, which creates a huge risk for the data and thus the business.  All your important data should also be kept somewhere else, or else a fire, flood, or business dispute may remove access to your data.  In China, various regulatory and government issues may also limit access to your systems (or all systems in a particular IDC).  So you need copies of your data elsewhere.

This creates two questions - where and how to transfer and store the data ?  These are difficult questions that depend greatly on your situation and especially data size.  It also depends on where you are, since outside of China we often use Amazon S3 for storage, but this type of storage is not available yet in China (but hopefully by the end of 2012).

For small customers, we recommend simple transfer of the backup files to your office for storage and maybe dev/test.  For larger customers, we try to find an off-site location such as other servers, Amazon, etc.  We also offer backup services which move data off your servers and usually to other data centers around China or Asia, though this system is limited in size to a few GB.

Moving small amounts of data is easy, via sftp, rsync, and other simple methods, but larger data is a problem.  Up to tens of GB can be moved with these methods, with more care, or with backup systems like Bacula.  Larger data sets pose special problems and often require hardware like tape systems, data sync systems like incremental rsync (good for video, images, etc. but not DB), and other methods.  Some commercial tools can also be useful in this area.  All require serious discussion and planning, which we do regularly for our customers.

Best practice is of course to do a test restore on a dev/test system every week or month, and then run some simple data integrity checks, which makes sure the whole process is working well.

Note that if you push production data to dev/test or office use, best practice is to scrub the data to remove sensitive information like passwords, email addresses, phone numbers, etc. so the data can't be stolen and sold by developers or others.  This is often done as part of a formal dev/test data load and scrub operation to both reduce the data volume and remove sensitive data before being used.

Also, a key part of off-site backups is security.  There are many stories of data getting lost or stolen while off-site, either on tapes/disks or just files on a remote file server (or laptop).  For this reason, we always encrypt any backup file before it leaves the server.  Be careful not to lose the encryption key, and be sure to test a restore.

In summary, off-site backups are very important to your business, especially in terms of surviving a disaster of various types.  Any backups are better than none, of course, and best practice is off-site daily, encrypted, tested, for dev use, scrubbed.

| Posted on July 20, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Tech Issue - Using SSH keys vs. passwords

 
In the SSH world, almost everyone agrees that keys are better than passwords, are more secure, and are more modern.
 
I don't agree.  While keys can be better, they have serious risks that are not well-understood, and I argue are less safe than well-managed passwords.
 
The general argument is that keys are better because most people use weak passwords or share passwords between systems so any compromise then endangers many systems simultaneously.  And since keys can have passphrases, they can be made even safer since they include the best of both - the key and a password.
 
This is true, but it is not the reality in most well-managed environments, and it misunderstands where the true risks come from, and at what step of the process.
 
For example, we use hard, random passwords for everything and never share any password between any two systems, services, etc.  And importantly, this is done at system setup time using tools and procedures - a key issue is that we build security into the system at setup, the first time, and mostly automatically.  Since passwords are rarely changed, the system is secure long-term.
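To make "hard, random, never shared" concrete, here is a small Python sketch of generating one password per system at setup time.  The alphabet, the default length, and the function names are assumptions for illustration, not our actual tooling.

```python
import secrets
import string

# Assumed character set; adjust to whatever your systems accept.
ALPHABET = string.ascii_letters + string.digits + "!@#$%^&*-_=+"


def random_password(length: int = 24) -> str:
    """Build a password from a cryptographically secure random source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def passwords_for(systems):
    """Generate a separate password for every system so none is ever shared."""
    return {name: random_password() for name in systems}
```

The result would normally go straight into a password manager such as KeePass rather than being printed or e-mailed.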
 
Keys are totally different.  Yes, a good system will generate a good key and not share it, though managing so many keys long-term is much, much harder than managing passwords.  They are hard to move (especially through ssh gateways), to store, to name and organize, to share, etc.  This leads to shortcuts and creates risk. 
 
To be secure, keys need passphrases, but rarely have them and even with them, busy engineers will often remove them to make their life easier, and at every use there is opportunity to remove that passphrase or copy the key to another location, reducing security forever.
 
More seriously, the key files are all that is needed to provide access and they can be stolen without the user's knowledge, usually from their machine where they sit unprotected and in clear text.  In theory users can put passphrases on keys, but they rarely do and here is where the risk lies - if I can break into your computer or ssh gateway, I get free access to all your systems, without you knowing and in a way that would be impossible with passwords.  This is the core problem I have with keys - all your servers are only as secure as your weakest client / engineer machine, which is usually a personal laptop or smart phone, home computer, etc. infested with all kinds of spyware, viruses, etc., all of which can read your keys !
 
Good passwords, by contrast, are kept in a tool like KeePass and are almost impossible to steal, especially in bulk.
 
The only way to solve this is to require passphrases, but there is no way to do this, and busy engineers will often remove the passphrase to make life easier.  In theory the SSH protocol could be enhanced to report the key type and if a passphrase was used, but since this is checked client-side, the server can't trust it; there is simply no way to enforce this, and thus most keys will remain unprotected and insecure.
 
Further, for large-scale systems, when we have 5,000 servers, which is easier to deal with, keys or passwords ?  Sure, LDAP is a solution, but only for certain systems, and we can't always impose LDAP on customers, so we have to manage passwords or keys, and passwords are far easier.  There are many password systems of various types and uses, but few key systems - KeePass can handle key files or cut/paste, but not in any real or useful large-scale way.
 
In the end, it comes down to which is easier to manage correctly, keys with passphrases or complex random passwords.  For me, passwords win.  And if you use keys, please add a passphrase today.
 

| Posted on July 12, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Tech Items - Load balancing (HAProxy vs. Nginx)

 
All of our larger customers have multiple web servers for their front-end systems, and all of those have some type of load balancing.  While some use DNS or LVS, most use nginx as their load balancer (LB).  However, we think that HAProxy is a much better and more powerful LB than nginx and should be used for most systems that want to scale, are complex, or need good control.

Most people have heard of HAProxy, and some know it has the same architecture as nginx, being a single-threaded event-driven system that can scale to 100,000-200,000 simultaneous connections and 100,000 requests/second on big systems.  More importantly, HAProxy is very powerful and flexible, with a wide variety of front-end, back-end, and standby pools, very flexible re-write rules and checking, and more.

It also has very powerful and flexible logging, including how every connection or request was started and ended, at what HTTP phase, and by whom - this can really help troubleshooting.  In addition, the real-time API allows engineers to dynamically add/remove servers from the pools, which is needed for maintenance, testing, etc. (though we have built special tools to make this easier).

For us, one of the most useful parts of HAProxy is its very sophisticated monitoring, including a very nice GUI that we can access in a browser.  This lets us see the status and statistics of all pools and servers, including errors, connection, request rates, check info, and much more.  We can use this directly for real-time monitoring, and also pull the data via an API to feed our monitoring system.
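As a sketch of feeding that data to a monitoring system: the HAProxy stats page can return CSV when ";csv" is appended to its URL, with a header line that starts with "# ".  The Python below parses such output and lists servers that are not up.  The sample text is made up and trimmed to a handful of the real columns, and the function names are hypothetical.

```python
import csv
import io

# Hypothetical sample, trimmed to a few of the many columns HAProxy reports.
SAMPLE_CSV = """# pxname,svname,scur,stot,status
web,FRONTEND,12,3400,OPEN
app,app1,5,1700,UP
app,app2,0,900,DOWN
app,BACKEND,5,2600,UP
"""


def parse_stats(csv_text):
    """Turn HAProxy stats CSV into a list of dicts keyed by column name."""
    text = csv_text.lstrip()
    if text.startswith("# "):
        text = text[2:]  # drop the comment marker in front of the header
    return list(csv.DictReader(io.StringIO(text)))


def servers_not_up(rows):
    """Return (pool, server) pairs whose status does not start with UP."""
    problems = []
    for row in rows:
        if row["svname"] in ("FRONTEND", "BACKEND"):
            continue  # summary rows, not real servers
        if not row["status"].startswith("UP"):
            problems.append((row["pxname"], row["svname"]))
    return problems
```

In real use the text would be fetched from the stats URL on a schedule and the result pushed into whatever alerting you already run.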

On the other hand, nginx has almost none of these features and is very simplistic, especially in monitoring and control.  There is no way to know what servers are okay, there are no stats on connection rates or other info that makes the system useful for troubleshooting, monitoring, or control.  Nginx is simple and works, but is not well-suited for large or complex systems.

The one thing HAProxy cannot do well is SSL, which is not directly supported.  The easiest way around this is to use nginx to handle SSL connections on port 443 and then forward the un-encrypted connections to port 80 (or a different port if to be balanced separately).  This is a bit complex, but not too bad and works well, though some work is needed to get the client IP passed all the way through the system to the real application servers.

Overall, HAProxy is the best choice for large-scale load balancing of real systems, especially when they are often changing, have many pools, complex needs, and good monitoring with control.  Nginx is not a bad choice, but HAProxy is much better.

 

| Posted on June 6, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Tech Items - Linux NIC Interrupts Overloading Single CPUs


The Linux kernel has come a long way in terms of performance in the last few years and especially in the 2.6/3.x kernel line.  However, at very high IO rates, especially for the network, interrupt handling can become a problem.
We've seen this on high-performance systems saturating one or more 1Gbps NIC and also in VMs with lots of small packets, with a recent overload at about 10,000 packets per second.

The reasons are clear:  in the simplest modes, the kernel processes each packet via a hardware interrupt from the NIC.  But as the packet rate rises, these interrupts overload the SINGLE CPU that handles them.  This single CPU concept is very important and poorly-understood by sysadmins.  On a common 4-16 core system, an overloaded core is hard to see, since overall CPU utilization is 6-25% and the server looks normal.  But the system will run very poorly, dropping packets with no warning, no dmesg log items, and with nothing appearing to be wrong.
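The 6-25% figure is just arithmetic: one fully busy core out of N shows up as 100/N percent overall.  A tiny Python sketch, with a hypothetical function name, makes that explicit.

```python
def overall_cpu_percent(cores: int, saturated_cores: int = 1) -> float:
    """Overall utilization when only some cores are 100% busy and the rest idle."""
    return 100.0 * saturated_cores / cores
```

With 4 cores this gives 25% and with 16 cores 6.25%, which is why a box with one saturated core still looks nearly idle on a summary graph.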

But if you look in top in multi-cpu mode (run top and press 1) at the %si item (System Interrupt) or in mpstat irq item (mpstat -P ALL 1) you can see this - on a very busy system it's clear that interrupts are high, and with advanced mpstat usage you can see which CPU and driver is the problem.

You need a newer version of mpstat that can run -I mode.  Then to see irq load, run this:

mpstat -I SUM -P ALL 1

Anything over 5,000/second is a lot.  10,000-20,000/second is extremely high.

To find out what driver/item is creating the load, run mpstat -I CPU -P ALL 1

This output is hard to read but you need to trace the right column to see which IRQ is causing the load, such as 15, 19, 995, etc.  You can also specify just the CPU you want to make the display simpler, such as "mpstat -I CPU -P 3 1" for CPU #3 - note that top, htop, and mpstat may number CPUs differently (starting at 0 or 1; both top and mpstat use 0, 1, 2 but htop uses 1, 2, 3).

Once you know the IRQ number, then look at the interrupt table, via "cat /proc/interrupts" and find the number from mpstat's load - then you can see the driver using that IRQ.  That file will also show you the # of interrupts so you can see which is loading the system.
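For a quick look without reading the table by eye, a short script can total the per-CPU counters for each IRQ.  The Python sketch below parses text in the /proc/interrupts layout; the sample data and the function names are made up for illustration.

```python
# Hypothetical sample in the /proc/interrupts layout.
SAMPLE = """
           CPU0       CPU1       CPU2       CPU3
  0:         46          0          0          0   IO-APIC-edge      timer
 19:    1200345         12          0          0   IO-APIC-fasteoi   eth0
NMI:          0          0          0          0   Non-maskable interrupts
ERR:          0
"""


def parse_interrupts(text):
    """Map IRQ label -> (per-CPU counts, description)."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row lists CPU0, CPU1, ...
    table = {}
    for line in lines[1:]:
        parts = line.split()
        if not parts:
            continue
        irq = parts[0].rstrip(":")
        counts = []
        for token in parts[1:1 + ncpu]:
            if not token.isdigit():
                break  # some rows (ERR, MIS) have fewer counters
            counts.append(int(token))
        description = " ".join(parts[1 + len(counts):])
        table[irq] = (counts, description)
    return table


def busiest(table, top=3):
    """Return the IRQs with the highest total counts, largest first."""
    ranked = sorted(table.items(), key=lambda kv: sum(kv[1][0]), reverse=True)
    return [(irq, sum(counts), desc) for irq, (counts, desc) in ranked[:top]]
```

Keep in mind the counters are totals since boot, so to get a rate you would read the file twice a few seconds apart and subtract.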

Okay, so it's probably the NIC.  What to do ?

First, make sure you are running irqbalance, which is a nice daemon that will automatically spread your IRQs across CPUs.  This is very important on a busy system, especially with two NIC cards, since by default CPU 0 will handle all interrupts, and can obviously get easily overloaded.  irqbalance spreads these around to lower the load.  For maximum performance, you can manually balance these to spread across sockets and hyperthread-shared cores, but this is usually not worth the trouble.

But even after spreading the IRQ around, a single NIC can overload a single CPU core.  So then what ?  This depends on your NIC and driver, but generally there are two helpful choices.

The first is multiple NIC queues, such as some Intel NICs have.  If they have four queues, these can have different interrupts and thus be handled by four CPU cores, spreading the load.  Usually the driver should do this automatically, and you can check via mpstat.

The other and often more important driver option is IRQ coalescing.  This powerful function allows the NIC to buffer several packets before raising the IRQ, saving a massive amount of time and load on the system.  For example, if the NIC buffers 10 packets per interrupt, the interrupt rate drops by 90% and the CPU load falls by almost as much.  This function is usually controlled by the ethtool utility, using the -c (show) and -C (set) options, but some drivers need this set at driver load time; see your documentation on how to set this.  Some cards, like the Intel NIC we worked with, also have automatic modes that try to do the best thing based on load.
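The coalescing math is simple enough to sketch in Python.  The function names are hypothetical, and the model ignores the fixed per-interrupt cost, which is why real CPU savings come in a little under the interrupt reduction.

```python
def interrupts_per_second(packets_per_second: float, packets_per_interrupt: int) -> float:
    """Interrupt rate when the NIC raises one IRQ per batch of packets."""
    return packets_per_second / packets_per_interrupt


def reduction_percent(packets_per_interrupt: int) -> float:
    """Percent fewer interrupts compared with one interrupt per packet."""
    return 100 * (packets_per_interrupt - 1) / packets_per_interrupt
```

At 10,000 packets per second, batching 10 packets per interrupt brings the rate down to 1,000 interrupts per second, a 90% reduction.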

Finally, some drivers such as we saw on our VMs, just don't support multiple queues or coalescing.  In this case, once the CPU is busy, that's the limit of your performance until you can change NICs or drivers.

This is a complex area that's not well-known, but a few good techniques can really improve performance on very busy systems.  Also, a little extra monitoring can help find and diagnose these hard-to-see problems.
 

| Posted on April 25, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Tech Items - More NUMA Fun in Virtual Machines


Turns out that using NUMA for large-memory processes is not so easy, as we discovered when we tried to update our MySQL init scripts to world-class levels.  As we've written before, current best practice is to use numactl to set full interleaved mode for NUMA systems for large memory processes such as MySQL, MongoDB, Memcached, and Java systems.  This prevents individual memory nodes from mysteriously running low on RAM and swapping even when there is plenty of RAM free on other nodes.

But we found that in at least some versions of CentOS 5.x for Xen VMs, NUMA is not compiled in and thus not supported.  This seems to be due to bugs that can prevent the kernel from booting in certain circumstances, though these are unclear.  As a result, there is no NUMA support and thus numactl is useless, but also there should be no swapping problems, even on NUMA hardware.

We are looking at enabling this for VMs with a different kernel, though this may not make sense, since then we are using numactl to set interleaved mode which basically turns NUMA off (at least for the big process, not for the kernel and other processes, which probably benefit from NUMA aware locality).

Another option we have not yet considered is to turn off NUMA support at boot time, which in theory would also solve the problem and run in a pseudo-interleaved mode.  Interesting idea.
 
| Posted on April 24, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |
 

 

 
Most of our customers run MySQL.  Most of them have backups.  Most of those are wrong.  Most are never tested.

How and why does this happen ?  Many ways, so let's look at the good and bad ways to backup MySQL databases.

Slaves - Many people think that backing up the Slave DB is a good and sensible idea.  Even if they backed up the Slave correctly (which they don't, see below), this is still a bad idea.  Why ?  Because the slave data is often not the same as the Master DB.  Why ?  Many reasons, including MySQL bugs, but also due to some statements not being replicated correctly (ever see those warnings in your logs?).  Or replication stopped due to another error like deadlock or timeout and was incorrectly restarted or skipped, or there are poorly-understood skip or do configurations at some point in the system's history (maybe years ago) that cause data de-synchronization.

Basically, you should not trust that the slave is correct and thus avoid backing up the slave unless the Master DB is very busy or for other reasons cannot handle the performance or locking issues related to backup.  Even in this case, you should use a tool like Percona's replication checker to verify replication to know the slave's data is probably accurate.

Locking - Most customers come to us using MyISAM tables, which are very hard to backup (and are a bad idea, see other blogs).  The only good way to backup MyISAM is to fully lock the database for the entire backup period, often many minutes or hours, effectively killing the website for a long time.  This can be fixed by using InnoDB, but some customers try another way, to use mysqlbackup in non-locking mode, or backing up the files with or without a snapshot, etc.  These are all useless as the data will be corrupt or not consistent.

Mixed MyISAM/InnoDB - Many customers have both MyISAM and InnoDB tables but try to back up using standard single-transaction methods for InnoDB, which result in bad MyISAM data (but they don't know until they test/do a restore).  Even worse, many customers don't even know they have MyISAM tables, but get them when developers create tables without checking the engine; in MySQL 5.1 and earlier, the default is MyISAM.  We have special monitoring to detect this situation, but most customers don't know about this and are backing up inconsistent data.

Right Way - The right way to do it is with careful planning and selection of options, mostly based on whether or not there are MyISAM tables.  First, always backup the master if possible.  Second, if all InnoDB, use single transaction mode or LVM snapshots and check to make sure there are no MyISAM tables.  Third, if there are MyISAM tables, use Percona tools to validate a slave and back that up if needed.
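The decision above can be sketched in a few lines of Python.  The information_schema query is standard MySQL; the function name and the returned strings are hypothetical, and a real script would run the query through your usual MySQL client library and feed the rows in.

```python
# Standard MySQL query to list MyISAM tables outside the system schemas.
MYISAM_QUERY = """
SELECT table_schema, table_name
FROM information_schema.tables
WHERE engine = 'MyISAM'
  AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
"""


def choose_backup_plan(myisam_tables):
    """Pick a backup approach from the list of (schema, table) MyISAM rows."""
    if not myisam_tables:
        # All InnoDB: a consistent, non-locking backup of the master is possible.
        return "master: single-transaction dump or LVM snapshot"
    # MyISAM present: a consistent backup needs a full lock, so validate a
    # slave with the Percona tools first and take the locking backup there.
    return "validated slave: locking backup"
```

Running the query as part of regular monitoring also catches the case where a developer quietly adds a new MyISAM table.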

Backing up MySQL is not easy to do perfectly, as it takes good knowledge, tools, and monitoring.  We work on this problem every day with our customers to ensure their data is safe and reliable.
 

| Posted on April 23, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |
 

 

Tech Items - MySQL Sub Slaves to Which Master?


Interesting customer question this week on the best Master for the 2nd, 3rd, etc. Slaves in a system.  We've always used the primary Master as Master for all Slaves as simple and best practice, but now I'm wondering.  I'm thinking that making the 2nd and other slaves of the main Slave would make any Slave promotion to Master much easier since the secondary slaves won't have to be re-pointed, which is messy and difficult, especially with various failure modes at 3am, even with MMM or other tools.

On one hand these slaves would keep all their CHANGE MASTER TO settings, their log positions, everything.  This greatly simplifies the very stressful Master failover situation, so that all that has to happen is a simple read-write and write IP promotion after what is often lots of troubleshooting of the real Master.  In addition, this would really simplify any automated HA DB failover upon Master failure.

On the other hand, this introduces a secondary Single Point of Failure (SPoF), the main Slave.  In a typical multi-slave system, any Slave can be promoted to Master by simply declaring it so, making it read-write, and pointing all other slaves at it.  And any slave failure is easily handled by just dropping the dead slave from the application DB pool.  But if the main Slave is the sub-slave Master and it fails, then all other slaves would have to be re-pointed to the Master or another Slave.  In this scenario, essentially all Slaves would fail together since all updates pass through a SPoF, the main Slave.  Most reads would continue and the app would stay up (unlike a Master failure scenario), but lack of updates would be a severe problem rather quickly.
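The trade-off is easy to see with a toy model.  The Python sketch below describes each topology as "who replicates from whom" and works out which slaves stop receiving updates when a node dies; the server names and the function name are made up for illustration.

```python
# Each slave maps to the server it replicates from (hypothetical names).
FLAT = {"slave1": "master", "slave2": "master", "slave3": "master"}
CHAINED = {"slave1": "master", "slave2": "slave1", "slave3": "slave1"}


def stale_nodes(replicates_from, failed):
    """Return the surviving slaves that stop getting updates when `failed` dies."""
    stale = set()
    changed = True
    while changed:  # repeat until no further slave is cut off upstream
        changed = False
        for node, source in replicates_from.items():
            if node == failed or node in stale:
                continue
            if source == failed or source in stale:
                stale.add(node)
                changed = True
    return stale
```

Losing slave1 in the flat layout strands nobody, while losing it in the chained layout strands slave2 and slave3; losing the Master strands every slave in both layouts until one is promoted.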

I'm not sure which is better or worse, but it's an interesting question to ponder.  For now we'll keep our simple single Master to all Slaves structure, but keep thinking about how to improve and simplify that.

Another related issue is virtual IP addresses in any multi-DB system.  We have not been ideal at this, in that we use real IPs for all DBs, even though this makes failovers and promotions, especially of R/W split systems, much more difficult.  But using real IPs also makes the system much simpler to setup and manage, and given that failovers are very rare, has worked quite well.  Yet another issue to think about as we work to design and build the world's best and fastest systems.

| Posted on April 10, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |

 

Tech Items - MySQL Slaves are not good Backup


We have many customers who back up their Slave databases and feel that this is enough for them.  We disagree, as do the experts, because Slaves often do not match their Masters.  How can this happen ?  Many different ways, and while MySQL has gotten better over the years, Slaves are still very often quite different from their Masters and often have corrupted or inaccurate data, so you should not rely on them for good backups (of course any backup is much better than none!)

MySQL Replication is a wonder of simplicity and general reliability, but this simplicity along with MySQL's rather loose definitions of data integrity creates many situations where the Master and Slaves can be different.  We even see customers where the Slaves are different from the Master and from each other, sometimes every week.  There are so many ways this can go wrong, from duplicate keys and non-deterministic procedure execution to common LIMIT issues and more.  This is especially true if you have warnings in your DB logs on replication.

So, we usually recommend always backing up the Master, so you get good data, always.  We use advanced methods, often tied to InnoDB so there is no locking, and try to limit the performance impact by using high-performance hardware and DB configurations.  This guarantees good backups, accurate data, and good performance.

There are a few good tools for this that we are starting to use for our customers to fix the sync issues, including and especially the old Maatkit system, now maintained and re-branded by Percona, the world's top MySQL consulting company.  We use these tools to scan Slaves to find differences and are testing additional tools that can fix and re-sync the data, especially on large systems where a Slave re-sync is not practical, or where Replication has errors so often that we need to fix them every week.

With these tools, customers can in theory use the Slaves for backups, though we are not yet doing this, as we continue to test and evaluate these complex and powerful tools.  By the summer of 2012 we should have these tools running on most systems with periodic sync reports and enhanced DB backup options that include the Slaves and good backup guarantees (though for financial and other critical data, we'll usually recommend the Master).
 

| Posted on April 5, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |


 

Amazon's EC2 Cloud System

 
We are experts on Amazon's EC2 cloud system, running many customers across various EC2 regions, including EU, US, Japan, and Singapore (waiting for our first customer to use Brazil).  We often get asked about various EC2 features, so here is a quick summary of Amazon AWS thinking.  Also, we are working on getting better pricing and long-term options for Chinese users to make Amazon even more attractive and affordable.

EC2 - The core offering is cloud servers, which are extremely popular and useful, especially because Amazon recently both lowered the price and added long-needed new instances.  We are especially excited about 64-bit small and medium instances which give our customers a whole new way and price point for using Amazon.  Previously the only 64-bit system smaller than the US$250/month Large instance was a Micro which was cheap but nearly useless for real work.  The old Small instance was 32-bit and thus not easily compatible with scale-up or the mainstream 64-bit systems and tools.  But the new Small and Medium instances are both 64-bit and very well-priced at about $64 and $125/month for non-reserved pricing.

S3 - The oldest and easiest to use AWS feature is S3, the Simple Storage Service, for storing images, backups, and most anything.  S3 is pretty good though can be a bit slow and expensive for large data.  Best practice for customers is to use it for shared upload or static objects, especially things like uploaded images on SMS sites.  Generally you can take the image on your server, re-size it, and push to S3, storing the URL in your DB.  Then your page links to the S3 image.  This is great because your several web servers do not need NFS, rsync, or other complex file / image sync systems, none of which scale well.  And S3 is great for backups, as we push our customers' encrypted backups there in most cases for unlimited storage and retention as needed.

RDS - The Relational Database Service, or MySQL in the cloud.  This is an interesting service and simple to use, but we don't recommend it for high-performance use due to reports of problems and scaling issues.  In particular, reports indicate the system does not cache data, instead using EBS I/O for all queries.  Also, it's not really tunable for various situations, nor is replication easy to use.  For now, large-scale customers should use their own MySQL instances on standard EC2 instances with EBS storage.

CloudFront CDN - This is Amazon's CDN which seems to work reasonably well across the US and EU, with some service in Asia, mostly from HK.  However, service inside China is limited and probably cannot be relied upon for consistent service.  For this reason, we recommend that customers with ICP licenses use a PRC domestic CDN like ChinaNetCenter for best performance inside China.

EBS - The Elastic Block Store or the basic iSCSI disk storage system for EC2, and broadly the best cloud storage system available.  Even though performance is not always consistent, it's extremely flexible and easy to use with dynamic mounting, fairly easy re-sizing, snapshots and more.  EBS is simply very nice and easy to use and generally quite fast, though one has to watch it.

ELB - For load balancing there is the Elastic Load Balancer.  The ELB is a very simple load balancer and we use it mostly to handle multi-Availability Zone failovers, since there is no other good and fast way to do this (other than ugly and slow API EIP transfers).  As a load balancer, ELB is neither very good nor consistent, but is necessary for real HA and also SSL termination.  We then usually use HAProxy in LB instances behind it for real load balancing, where we set rewrites, filtering, logging, etc.  So any real system includes ELB and two HAProxy instances.

We'll write more about other features such as Route53, Email Sending, and security in future blogs or newsletters.
 
| Posted on April 1, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |
 

 

Tech DBs - Why use only InnoDB on MySQL
 

Most of our customers come to us with systems running MySQL.  This is good.  Most are using MyISAM inside MySQL.  This is bad.  Almost all systems should use InnoDB only for their data; it's as simple as that, with a couple of exceptions discussed below.

MySQL has two common storage / DB engines:  MyISAM and InnoDB.  Up to and including version 5.1, MyISAM is the default and the most common, but is also the oldest system that is not being updated or improved now.  It is also slow in many cases and has the bad habit of corrupting data on system crash, plus has no transactions, referential integrity (rarely-used on the Internet), and other advanced features.  

This is mostly because MyISAM only has table-level locking, not row-level like all other modern systems.  This means when a user / client does something, the whole table is locked (could be millions of rows) and everyone else waits.  As you can imagine, this does not scale very well to large systems.  More recently there are exceptions for SQL INSERTs and a few other things, but performance in real systems is still quite poor on MyISAM.

In addition, it has no transaction log / journal, so it just writes data to the Linux file cache and hopes it eventually gets on disk.  If the system ever crashes and loses some of that data, MyISAM will often not start or complain you need to fix the tables; it has limited methods to recover data and often loses things.  Also, MyISAM is difficult to backup correctly, usually requiring a full system lock of all data during the backup, which often means the website is down or not usable for 15, 30, 60 or more minutes per day.

By contrast, InnoDB is a much more modern system and is under heavy development, with several options in the InnoDB family, from the base engine to plug-in versions to advanced enhanced versions such as XtraDB and others.  Everyone is working on scaling InnoDB for better CPU and IO performance, better backups and locking, improved statistics and debugging, and more.  It is where all the action is.

Additionally, as a system, InnoDB supports several critical features, most importantly a transaction log and row-level locking.  The log allows for real DB transactions but more importantly for data crash recovery and roll-back.  This allows for much better data protection while maintaining higher performance due to the way InnoDB does IO.  And row-level locking provides much higher concurrent performance in most cases, since users only lock data they are writing, and reads almost never block at all.  Recent performance tests show massively better performance for InnoDB over MyISAM, especially under heavy load.

Generally, Innodb is much faster, much safer, much more powerful, and continually getting better.  You should always use it for all your systems unless there is a specific reason not to.  

What are those specific reasons you might not use InnoDB ?  There are at least two, the first being count(*) behavior.  On MyISAM a SELECT count(*) with no WHERE is very fast.  InnoDB must actually count the rows and will be slow.  Of course, usually count(*) is not a very good way to program (first use an indexed column to be faster, like count(id)) and is rarely a good idea without a WHERE clause.

Second, MyISAM allows full text search in regular columns, which InnoDB does not.  Normally real search is done outside of MySQL in systems like Lucene or Solr (or Sphinx in new MySQL Versions), but some smaller sites do search in MySQL; in this case you probably need to use MyISAM for the text table.

Overall, and to put it simply, use InnoDB.  Every day.  For every table.  Only minor exceptions that rarely apply.

Very good reference article from Tag1:  http://tag1consulting.com/MySQL_Engines_MyISAM_vs_InnoDB
 
| Posted on March 26, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |
 

 

Goldman Sachs 10 Rules to Follow

 
The last blog talked about the Goldman Executive Departure and loss of focus on the customers' needs and priorities.  So while we are saying bad things about Goldman, we should also revisit their 10 Commandments of Business, written in 1970 and still true today, for them and for us:

1.  Don’t waste your time going after business you don’t really want.
2.  The boss usually decides - not the assistant treasurer. Do you know the boss?
3.  It is just as easy to get a first-rate piece of business as a second-rate one.
4.  You never learn anything when you’re talking.
5.  The client’s objective is more important than yours.
6.  The respect of one person is worth more than an acquaintance with 100 people.
7.  When there’s business to be found, go out and get it!
8.  Important people like to deal with other important people. Are you one?
9.  There’s nothing worse than an unhappy client.
10. If you get the business, it’s up to you to see that it’s well-handled.
 
| Posted on March 21, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |

 

Always Do What is Best for the Customer
 

Our company culture is to always do what is best for the customer.  Always.  Even if it is not the best thing for us, because what is best for the customer is always best for us, long-term.

Today, an executive of Goldman Sachs, the world's top investment bank, quit because the company had changed over his 12 years there: it no longer puts the customer first and does things only for bank profits, often things that are bad for the customer.  Below is his article in the NY Times, read today by most of the world's top businessmen and women, about why he's leaving a company that for 143 years has helped customers.

There are good lessons here, about always serving the customer.  For us, this means choosing the right servers, services, etc. for the customer, even if it means less money for us.  Always do what is right for the customer.  Even if that is fewer servers or no servers at all.

We are always thinking of how we can do the right things for them, especially when we are choosing servers or services and spending their money.  Remember that we are usually spending the customers' money and that the #1 thing we sell is trust.  Trust that we will do the right things, securely, quickly and professionally, but also that our advice is correct and worthy of their trust, not just best for us.  This is where the guy below feels that Goldman Sachs has major problems.  Our teams must work together to keep this in mind and always do what is right for the customer.
https://www.nytimes.com/2012/03/14/opinion/why-i-am-leaving-goldman-sachs.html?hp=&pagewanted=all
 
| Posted on March 14, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |
 

 

Tech Troubleshooting - HAProxy Performance and Load


Interesting troubleshooting session with HAProxy (http://haproxy.1wt.eu/) yesterday on a high-load customer.  HAProxy is a very high-performance and very flexible system, handling up to 100,000 requests/second and 100,000 connections, that we use everywhere (and much, much better than nginx or LVS for this).  It's very hard to run systems at this performance level, at hundreds of millions of requests/day and such massive connection levels; it takes specialized Linux kernel tuning along with careful monitoring and management.

The problem was that at 150,000-200,000 concurrent connections but less than 5,000 requests/second, the system was becoming very slow, taking several seconds to respond.  Standard troubleshooting of this situation found nothing strange: we checked overall CPU, memory, sockets, kernel messages, iptables connection tracker limits, and TCP memory and pressure.  Since this is a VM, we also checked the underlying Xen Dom0 system, since CPU and other limits there can also affect VM load performance.

But a look at per-process CPU shows that HAProxy is using 95%+ of a CPU, a clear sign of process CPU overload.  HAProxy is normally a single-process, event-driven system, which is how it (and Nginx, which has the same architecture) achieves such high performance.  But if that single process has more load than a single CPU can handle, we are dead, or at least very slow - I'm actually amazed that at 100% the system could function at all, still at 3,000-5,000 requests/second and 200,000 connections.

Why the CPU load ?  We don't know, since the actual request rate was not that high.  We think it's due to the number of connections and the overhead in managing that list, even though in theory the kernel should be doing that using epoll(), but the system part of CPU was quite high which is probably all the socket selection work.

One question is why so many connections, and the answer is the long keep-alive time, which defaults to two minutes.  So 2,000 new connections per second and over 100 seconds average connection time gets you 200,000 connections, simple as that.  We are shortening the timeout, which normally reduces the user experience, but in this case the application is not that sensitive to this, so we feel safe reducing it to 60 seconds or less, even down to 15 seconds, and in extreme cases just turning keep-alive off (since for this customer we are only seeing about 1.2 requests/connection, much lower than typical websites).
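The connection arithmetic above is an instance of Little's law (steady-state concurrency equals arrival rate times average time in the system).  A minimal sketch, using only the numbers from this post; the function name is ours and purely illustrative:

```python
def concurrent_connections(new_per_second, avg_lifetime_seconds):
    """Little's law: steady-state open connections = arrival rate x average lifetime."""
    return new_per_second * avg_lifetime_seconds


if __name__ == "__main__":
    # 2,000 new connections/second, at the connection lifetimes discussed above.
    for lifetime in (100, 60, 15):
        total = concurrent_connections(2000, lifetime)
        print(f"{lifetime:>3}s average lifetime: about {total:,} open connections")
```

This only estimates how the connection count should scale as the timeout shrinks; it says nothing about how much CPU each open connection costs.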

The overall solution for now was to enable multiple processes for HAProxy, which is a difficult feature to use since the statistics and monitoring are then per process and randomly switched between them, so debugging and monitoring data are much less useful.  But the per-process CPU dropped to 50% and the server was suddenly very happy at 120,000 connections.  We will run this way for a while and see how it goes and also work to get back to single process mode for better monitoring unless we can find a better per-process method.

Overall, this is all in a day's work for managing large sites and advanced high-performance load-balancing systems running at the edge of what is possible.  And we learn something new every day, though fortunately we can then share that knowledge with you and across our customer base.
 
| Posted on March 6, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |
 

 

Tech Annoyances - Linux NUMA Issues
 

Linux memory management is very advanced and powerful, including and supporting sophisticated NUMA memory tracking, management, and optimization very early on.  NUMA is Non-Uniform Memory Access, which in today's modern servers means that each CPU has its own directly-connected RAM.  The system may have 32GB of RAM, but each CPU directly has 16GB, and if it wants to access the other 16GB, must go through the other CPU, which is slower (non-uniform).  This used to be rare in PC Servers, but the new Intel Nehalem chips (ELX-5xxx) only operate this way, with three RAM channels per CPU.

Linux knows all of this and tracks RAM use by CPU and application.  It tries to schedule each process to run on the CPU that owns most of the RAM that process is using, which should improve performance.  All of this is invisible to the user and sysadmin, and in fact, most engineers have never even heard of this technology, even though all new servers use it.

So what is the problem ?  Swapping.  For years sysadmins have reported mysterious swapping, especially with large-RAM monolithic processes like databases and Java servers.  Recent discussions and tools have shown that NUMA is a key problem.  Why ?  Because newly created processes default to allocating all/most of their RAM on a single CPU, then using some from other CPUs.  Thus on a 16GB machine with 8GB per CPU, and a 12GB MySQL process, we'll find MySQL uses all 8GB of the first CPU and 4GB from the second.  

Why is this a problem ?  Because the kernel needs RAM, too, and does not well-balance across the NUMA system, especially when one CPU's RAM is full.  Instead, it will swap out some RAM on the first CPU, even when there are many GB of free RAM on the second CPU.  And of course this swapping is evil, freezing the whole DB system during swapping and killing the website.
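To make the 16GB example concrete, here is a toy model of the two placement styles.  It is only arithmetic under simplifying assumptions (it ignores the kernel's own RAM, file cache, and everything else the real allocator does), and the function names are made up for illustration:

```python
def fill_first(process_gb, node_gb, nodes=2):
    """Fill one CPU's RAM completely before spilling onto the next CPU."""
    used = []
    remaining = process_gb
    for _ in range(nodes):
        take = min(remaining, node_gb)
        used.append(take)
        remaining -= take
    return used


def interleaved(process_gb, nodes=2):
    """Spread the process evenly across all CPUs' RAM."""
    return [process_gb / nodes] * nodes


def free_per_node(used, node_gb):
    """RAM left on each CPU for the kernel and everything else."""
    return [node_gb - u for u in used]


if __name__ == "__main__":
    # The example above: 8GB per CPU, one 12GB MySQL process.
    for name, used in (("fill-first", fill_first(12, 8)),
                       ("interleaved", interleaved(12))):
        print(name, "used:", used, "free:", free_per_node(used, 8))
```

In the fill-first case the first CPU has nothing free while the second still has 4GB, which is the situation where swapping starts; with interleaving each CPU keeps 2GB free.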

How to fix this ?  For now, the only way is to start the big RAM process using a special command, 'numactl', in interleaved mode.  This will split the RAM use evenly between the CPUs and avoid all of these problems, though in theory the system may be a little slower due to cross-CPU RAM access.  But this is still a good and recommended method.  It's also annoying, because you must use numactl every time you start such processes, which means modifying init scripts and otherwise changing how things start (plus making sure the numactl package is installed, which is not so common).
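As a rough sketch of what such a start-script change could look like, the helper below prefixes a command with `numactl --interleave=all` when numactl is installed and otherwise leaves the command alone.  The helper name and the mysqld path are assumptions for illustration, not taken from any real init script:

```python
import shutil


def interleaved_command(argv, numactl_path=None):
    """Return argv prefixed with 'numactl --interleave=all' if numactl is available."""
    numactl = numactl_path or shutil.which("numactl")
    if numactl is None:
        # numactl is not installed: fall back to starting the process normally.
        return list(argv)
    return [numactl, "--interleave=all", *argv]


if __name__ == "__main__":
    # Hypothetical daemon path; a real start script would pass its own command.
    cmd = interleaved_command(["/usr/sbin/mysqld", "--user=mysql"])
    print("would start:", " ".join(cmd))
    # To actually launch it: subprocess.run(cmd, check=True)
```

Falling back silently is a design choice for the sketch; on a production DB server you may prefer to fail loudly when numactl is missing.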

What Linux really needs is a default NUMA policy for this.  There are sophisticated policies for controlling RAM use, binding, etc. but these are all per process, and there is no way to have a default.  The kernel should really have a sysctl for this, allowing sysadmins to set the default for new processes, perhaps with filters such as applying it only to large-RAM processes.  This would go a long way to avoiding unneeded swap and other related problems since we could just set policy to interleaved and run normally.
| Posted on February 22, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |

 

Tech Annoyances - Linux Swapping Fun

 

Linux is a great operating system.  Period.  Probably one of the best ever built and still the fastest evolving, especially given that it already runs everything from cell phones to supercomputers, and a new version is released about every 8 weeks, sometimes with major internal changes.  But it still has annoyances and issues that make life difficult for Internet Operations, where we live.  We'll occasionally blog about these issues.


One is swapping.  Not swapping as a concept, which everyone supports, but how it's handled, monitored, and managed on Linux.

First is swappiness.  This is the kernel sysctl that determines how the system should trade application RAM for file cache.  The default on most distributions is 60, which is absurdly high.  We always set this to zero as servers should never swap, period.  Swapping is evil on many levels, but in practical terms, any swapping process is frozen during swap, which is deadly to modern multi-threaded systems like MySQL or Java, which are simply dead for 1, 10, 100 seconds while we swap.  If there is really no RAM free, then there is no choice, but the swappiness of 60 makes this happen far earlier than it should, often with gigabytes of RAM still free.  
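A small sketch for checking this setting on a box.  The sysctl is `vm.swappiness`, readable at `/proc/sys/vm/swappiness`; the helper names and the policy check are ours, and the target of zero simply reflects the recommendation in this post:

```python
from pathlib import Path

SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def parse_swappiness(text):
    """Parse the contents of /proc/sys/vm/swappiness into an integer."""
    return int(text.strip())


def is_too_high(value, wanted=0):
    """True if the current value is above the wanted server setting."""
    return value > wanted


if __name__ == "__main__":
    try:
        current = parse_swappiness(Path(SWAPPINESS_PATH).read_text())
    except OSError:
        print("cannot read", SWAPPINESS_PATH, "(not a Linux system?)")
    else:
        note = "too high for a server" if is_too_high(current) else "ok"
        print(f"vm.swappiness = {current} ({note})")
```

Changing the value is done as root with `sysctl -w vm.swappiness=0`, plus an entry in /etc/sysctl.conf to make it survive a reboot.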

Maybe this is useful on a desktop to push out old apps and free up cache, though even there we'd suggest a setting of 20-30 would be better; servers should be zero, pure and simple.  One might argue that some swappiness is okay to push out unused code, but such code is not that common on real servers and is often quite small, especially with today's RAM sizes (we always use 2-64GB).

Second is swap monitoring.  We sometimes have low RAM issues that force some swap, but it's then very hard to tell what is using the swap.  Why do we care ?  A good reason is that we just want to know.  But more importantly, even when RAM use is lowered, things will stay in swap, often a lot of things, and this creates alerts and worry for us if RAM runs low again (though in theory swapcache tells us how much real danger there is, but that's an advanced topic for most engineers).  We can restart the offending services during a maintenance window to remove swap, if we know which services - this is obvious in single use machines like DB servers, but many customers have many services on one bigger machine.  A tool to tell us what apps are really using the swap would be useful.
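A partial sketch of such a tool, assuming a kernel that reports a `VmSwap:` line in `/proc/<pid>/status`.  That is an assumption: newer kernels have it, but the older 2.6.18 kernels on RH/CentOS 5.x do not, so there this simply reports nothing.  The function names are ours:

```python
import os


def parse_vmswap_kb(status_text):
    """Return the VmSwap value in kB from a /proc/<pid>/status dump, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmSwap:"):
            return int(line.split()[1])
    return None


def parse_name(status_text):
    """Return the process name from the 'Name:' line, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(None, 1)[1].strip()
    return None


def swap_by_process(proc_root="/proc"):
    """List (swap_kb, pid, name) for processes with swap in use, largest first."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                text = handle.read()
        except OSError:
            continue  # process exited, or we lack permission to read it
        swap_kb = parse_vmswap_kb(text)
        if swap_kb:
            results.append((swap_kb, int(entry), parse_name(text)))
    return sorted(results, reverse=True)


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for swap_kb, pid, name in swap_by_process()[:10]:
            print(f"{swap_kb:>10} kB  pid {pid:<7} {name}")
```

Run it as root to see every process; as a normal user some status files may not be readable and are skipped.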

Third is swap release.  As noted above, we get some cases of past swap use, which could be days, weeks, or months ago when we used the swap, but it's not needed now.  Some or most may even be in the swap cache and can be instantly released, but there is no method to do that.  So the first useful function would be to immediately drop the swap cache, just as we can drop the file caches.  The second would be to force things to be swapped in to reduce swap use and make us all feel more comfortable.

Fourth is random mysterious swapping.  Linux is very sophisticated, but still swaps for no obvious reason, and no one seems to know why.  Recently some of this has been discovered to be due to NUMA issues (see future blog) but other causes are still not well-understood, especially on older kernels (before 2.6.3x, certainly including RH/CentOS 5.x which are 2.6.18).  Even today we always allocate some swap because the kernel just seems unhappy without it, and in some cases goes crazy with kswapd and other things taking 100% CPU for no reason; we see this in larger RAM servers on EC2, for example, especially under load.  Newer kernels are better, but more progress is needed.
 
| Posted on February 15, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |
 

 

Having Everyone's Problems


We run large-scale Internet systems. All large-scale Internet systems have problems, so any big Internet company has a team that deals with the problems in their own system. Many problems are common, but many of these problems are different for different companies, different technologies, and different industries, e.g. video systems are very different from mobile game systems.

But we run systems in every industry, which means we have everyone's problems. That is good in many ways since many of the problems are the same, such as hardware, IDCs, Linux, MySQL, PHP, Apache, HA, security, performance, etc. So we can share our deep knowledge and lessons learned from one customer with the others, and everyone benefits. We bring world-class best practices to all of these areas.

But it's also difficult because many of our customers' problems are not the same, and we have to handle every type of problem on every type of system, including uncommon languages (Python, Ruby, Perl), search engines (Solr, Sphinx), caches/queues (Redis, MQ), NoSQL (MongoDB and others), video encoding, sharding, hardware, firewalls, replication, batch systems, continuous integration and automation (Hudson, Puppet), many custom issues such as MMO game engines, and much more.

For these issues we have to learn about the systems, the problems, and how to monitor, manage, troubleshoot, document, and tune all these things. This makes us the global experts in everything about the Internet, but also makes life interesting at 3am when one of these things breaks. In the end, we have to know everything about everything . . .
 
| Posted on February 2, 2012 | by Steve Mushero, Co-Founder and CEO | Leave a Comment |
 

 

Welcome to the ChinaNetCloud Blog


Hi !  This is the shiny new ChinaNetCloud Blog, partly about the things we do, but more importantly about Operations, Clouds, technology, Linux, customers, sales, service, and other random topics we find interesting and want to share with you.  Later we may also split this into a few blogs, such as on business, on tech and tools, on service, etc.

Steve Mushero, our co-founder and CEO, will initially write most of the blogs, but other guest-writers will also help from time to time, including team engineers, managers, sales, and others who have interesting ideas to contribute.

As a reminder, our main business is architecting, designing, building, and especially operating large-scale Internet servers and systems.  So we outsource and operate customers' backend servers and databases.  We take care of everything, running the systems 24x7, with deep monitoring, troubleshooting, backups, and more, focused on performance, reliability, and saving the customer money.  This is what we do, here, there, and everywhere.

We also partner with and recommend a variety of 3rd party services such as IDCs, CDNs, hardware, web monitoring, email delivery, and more.  And our business ecosystem includes a wide variety of others such as investors, web development, digital advertising, data analytics, content, accounting, legal, business setup, and much more - we are always happy to connect people together.  Let us know how we can help connect you . . .

Today we mostly serve the Chinese market, and mostly for Chinese Internet companies, though we also have lots of global customers who are coming into China and ask us to run their systems.  We also increasingly are taking Chinese companies out into the world, running systems that serve Asian or American users, often using Amazon EC2 in Singapore, Japan, or California.

Later we will expand to take over the world since our model, service, and business actually scale and fit the needs of any Internet company, anywhere.  We will probably open an office in Singapore later this year to serve SE Asia, India, and beyond.  Then on to the Middle East, Latin America, the U.S., Europe, and eventually Africa - our vision is to Run All the World's Internet Servers - there are about 10 million so we have a lot of work to do !

Stay tuned for interesting blogs in this space.

| Posted on January 17, 2012 | by ChinaNetCloud Team | Leave a Comment |