March 20, 2007

trying to load balance

A few people have been expressing some concern about "unexplained" pings or IDS triggers from Mozilla servers. I spoke with our IT team and learned that this is happening because we're trying to improve our load balancing to give users better service on our websites.

As I understand it, there are two basic ways to improve service to different geographic areas. The first is to keep a table of IP addresses that correspond to different geographic areas, so that when the server sees an IP from, say, France, it can connect that user to our co-location facility in Europe rather than the one in the US. This, theoretically, can give the user a faster connection and better service. The second mechanism is that when a user connects to us, we can probe their nameserver and use the measured response to determine which data center to send them to. The second method can, theoretically, provide even better service because it doesn't rely on static data but actually measures things right then and there to determine which of our co-lo centers can best serve that user.
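To make the first approach a bit more concrete, here is a minimal Python sketch of a static lookup table. Everything in it is an assumption for illustration: the netblocks are reserved documentation ranges, and the site names and the `pick_site_static` function are made up. It is not how Mozilla's load balancers are actually configured.

```python
import ipaddress

# Hypothetical table: documentation-only netblocks mapped to made-up site names.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "europe"),
    (ipaddress.ip_network("198.51.100.0/24"), "us"),
]

# Assumed fallback for addresses that are not in the table.
DEFAULT_SITE = "us"


def pick_site_static(client_ip: str) -> str:
    """Return the site whose netblock contains client_ip, else the default."""
    addr = ipaddress.ip_address(client_ip)
    for network, site in GEO_TABLE:
        if addr in network:
            return site
    return DEFAULT_SITE
```

The weakness is visible right in the table: it only knows what someone typed into it, which is why the second, measurement-based method can do better.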

Now, I'm not an IT guy, but that's how I understand things. Here's the actual response that our IT team is sending to people who have expressed to us some concern about this system:

Mozilla is using proximity based load balancers that send probes from each site to the nameserver that looked up the address to a Mozilla web property (like www.mozilla.com) to dynamically determine which Mozilla data center is closest.

What you're seeing is a result of those probes and in no way represents any compromised host. I am working with Citrix to find a better way to reduce the frequency of probes.

If you'd prefer to be statically assigned to a particular datacenter, please send me a list of netblocks in your network and which datacenter is closest (traceroute to 63.245.209.4 or 63.245.213.4).

The Mozilla IT team aren't doing anything untoward here. They're just trying to ensure that people visiting Mozilla get the best possible service. They do recognize that this is causing some users to be concerned so they're working on alternative solutions.

For those of you who prefer something more definitive than my non-tech explanation of how this works, here's a technical description:

When a client's LDNS accesses the GSLB site for the first time, the RTT information is not yet available to the system. In such cases, the GSLB VIP selects a site using the Round Robin method and directs the client to this site. The system then starts calculating the RTT between the site and the LDNS. Similarly, the systems deployed on the other participating sites begin to calculate the RTT between the LDNS and their own GSLB site. Periodically, each system participating in GSLB will report its RTT to the other participating systems. When the DNS query is sent the next time, the system selects the best site using the network metrics.
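The selection logic described above can be sketched roughly as follows. This is only an illustration of the idea (round robin until RTT data exists, then lowest RTT wins); the `SiteSelector` class, its method names, and the site labels are hypothetical and are not taken from the Citrix product.

```python
import itertools


class SiteSelector:
    """Toy model: round robin for unknown nameservers, lowest RTT otherwise."""

    def __init__(self, sites):
        self.sites = list(sites)
        self._round_robin = itertools.cycle(self.sites)
        # Maps (ldns_ip, site) -> last reported round trip time in milliseconds.
        self.rtt_ms = {}

    def record_rtt(self, ldns_ip, site, rtt_ms):
        """Store an RTT measurement reported by one of the sites."""
        self.rtt_ms[(ldns_ip, site)] = rtt_ms

    def select(self, ldns_ip):
        """Pick a site for a DNS query coming from the given nameserver."""
        known = {
            site: self.rtt_ms[(ldns_ip, site)]
            for site in self.sites
            if (ldns_ip, site) in self.rtt_ms
        }
        if not known:
            # No measurements yet for this nameserver: fall back to round robin.
            return next(self._round_robin)
        # Otherwise choose the site with the smallest measured RTT.
        return min(known, key=known.get)
```

For example, a nameserver seen for the first time gets whichever site is next in the rotation. Once each site has reported an RTT for it, later queries from that nameserver go to the site with the lowest number.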

The system uses different mechanisms such as ICMP echo Request/Reply (PING), TCP, and UDP (DNS) to probe the Round Trip Time (RTT) metrics between the LDNS and the sites participating in the GSLB domain. First, a PING probe is performed to obtain the RTT. If the PING probe fails, the DNS probe is performed to calculate the RTT. If the DNS probe also fails, the TCP probe is performed.

Note: The system performs UDP probing on port 53 and TCP probing on port 80.
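The probe order described above (PING first, then DNS, then TCP) amounts to a simple fallback chain. Here is a hedged sketch of that chain. The `measure_rtt` and `tcp_probe` names are my own, the probes are passed in as plain functions so the ordering logic is easy to see, and using TCP connect time as a stand-in for RTT is an assumption rather than a description of what the real system measures.

```python
import socket
import time
from typing import Callable, Optional, Sequence, Tuple

# A probe takes a nameserver IP and returns an RTT in ms, or None on failure.
Probe = Callable[[str], Optional[float]]


def tcp_probe(ldns_ip: str, port: int = 80, timeout: float = 2.0) -> Optional[float]:
    """Approximate RTT as the time needed to open a TCP connection."""
    start = time.perf_counter()
    try:
        with socket.create_connection((ldns_ip, port), timeout=timeout):
            pass
    except OSError:
        return None  # Refused, unreachable, or timed out.
    return (time.perf_counter() - start) * 1000.0


def measure_rtt(
    ldns_ip: str, probes: Sequence[Tuple[str, Probe]]
) -> Optional[Tuple[str, float]]:
    """Try each probe in order and return (probe name, RTT) for the first success."""
    for name, probe in probes:
        rtt = probe(ldns_ip)
        if rtt is not None:
            return name, rtt
    return None  # Every probe failed, so no RTT is available.
```

In a real setup the list would be something like ping, then a DNS probe over UDP port 53, then `tcp_probe` on port 80. A nameserver behind a firewall that drops ICMP would simply fall through to the next probe, which is why an IDS might see DNS or TCP traffic as well as pings.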

Let me know if there's any more info I can provide here.

Posted by asa at 11:27 AM

 
