trying to load balance

A few people have expressed concern about "unexplained" pings or IDS triggers from Mozilla servers. I spoke with our IT team and learned that this is happening because we're trying to improve our load balancing to give users better service on our websites.

As I understand it, there are two basic ways to improve service to different geographic areas. The first is to keep a table of IP addresses that correspond to different geographic areas, so that when the server sees an IP from, say, France, it can connect that user to our co-location facility in Europe rather than the one in the US. This, theoretically, can give the user a faster connection and better service. The second is that when a user connects to us, we can query their nameserver and use that information to determine which data center to send them to. The second method can, theoretically, provide even better service because it doesn't rely on static data but measures the connection right then and there to determine which of our co-lo centers can best serve that user.
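To make the first (static table) approach concrete, here is a minimal sketch in Python. It is not Mozilla's configuration: the netblocks are documentation ranges, and the names `STATIC_TABLE`, `DEFAULT_SITE`, and `static_site_for` are made up for illustration.

```python
import ipaddress

# Hypothetical static table: netblock -> data center name.
# These netblocks are documentation ranges, not real assignments.
STATIC_TABLE = {
    ipaddress.ip_network("192.0.2.0/24"): "amsterdam",
    ipaddress.ip_network("198.51.100.0/24"): "san_jose",
}
DEFAULT_SITE = "san_jose"  # assumed fallback when no netblock matches


def static_site_for(ip: str) -> str:
    """Return the data center assigned to the netblock containing ip."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in STATIC_TABLE if addr in net]
    if not matches:
        return DEFAULT_SITE
    # If several netblocks match, prefer the most specific one.
    best = max(matches, key=lambda net: net.prefixlen)
    return STATIC_TABLE[best]


print(static_site_for("192.0.2.10"))
print(static_site_for("203.0.113.5"))
```

The answer depends only on the table, so it never changes with actual network conditions. That is the limitation the second, measured approach is meant to address.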

Now, I'm not an IT guy, but that's how I understand things. Here's the actual response that our IT team is sending to people who have expressed to us some concern about this system:

Mozilla is using proximity-based load balancers that send probes from each site to the nameserver that looked up the address of a Mozilla web property (like www.mozilla.com) to dynamically determine which Mozilla data center is closest.

What you're seeing is a result of those probes and in no way represents any compromised host. I am working with Citrix to find a better way to reduce the frequency of probes.

If you'd prefer to be statically assigned to a particular datacenter, please send me a list of netblocks in your network and which datacenter is closest (traceroute to 63.245.209.4 or 63.245.213.4).

The Mozilla IT team aren't doing anything untoward here. They're just trying to ensure that people visiting Mozilla get the best possible service. They do recognize that this is causing some users to be concerned so they're working on alternative solutions.

For those of you who prefer something more definitive than my non-tech explanation of how this works, here's a technical description:

When a client's LDNS accesses the GSLB site for the first time, the RTT information is not available with the system. In such cases, GSLB VIP selects a site using the Round Robin method and directs the client to this site. The system then starts calculating the RTT between the site and the LDNS. Similarly, the system deployed on the participating site begins to calculate the RTT between the LDNS and the GSLB site. Periodically, the system participating in GSLB will report the RTT to other participating systems. When the DNS query is sent the next time, the system selects the best site using the network metrics.
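The selection logic described above can be sketched roughly as follows. This illustrates the idea only and is not the load balancer's actual implementation: the class `SiteSelector`, its methods, the site names, and the RTT values are all made up.

```python
from itertools import cycle


class SiteSelector:
    """Round robin for unknown nameservers, lowest RTT once measured."""

    def __init__(self, sites):
        self.sites = list(sites)
        self._round_robin = cycle(self.sites)
        # Measured round trip times: {ldns_ip: {site: rtt_ms}}
        self.rtt_ms = {}

    def record_rtt(self, ldns, site, rtt_ms):
        """Store a probe result reported by one of the sites."""
        self.rtt_ms.setdefault(ldns, {})[site] = rtt_ms

    def select(self, ldns):
        """Pick a site for a DNS query arriving from this nameserver."""
        known = self.rtt_ms.get(ldns)
        if not known:
            # No RTT data yet: fall back to round robin.
            return next(self._round_robin)
        # Otherwise choose the site with the lowest measured RTT.
        return min(known, key=known.get)


selector = SiteSelector(["san_jose", "amsterdam"])
first = selector.select("192.0.2.53")           # no data yet, round robin
selector.record_rtt("192.0.2.53", "san_jose", 80.0)   # invented values
selector.record_rtt("192.0.2.53", "amsterdam", 76.0)
later = selector.select("192.0.2.53")           # now picks the lowest RTT
print(first, later)
```

Note that the key is the nameserver's address, not the end user's, which is why the probes show up at the nameserver.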

The system uses different mechanisms such as ICMP echo Request/Reply (PING), TCP, and UDP (DNS) to probe the Round Trip Time (RTT) metrics between the LDNS and the sites participating in the GSLB domain. First, a PING probe is performed to obtain the RTT. If the PING probe fails, the DNS probe is performed to calculate the RTT. If the DNS probe also fails, the TCP probe is performed.

Note: The system performs UDP probing on port 53 and TCP probing on port 80.
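The probe order (PING first, then DNS, then TCP) amounts to a simple fallback chain. Here is a hedged sketch of that chain, assuming each probe is a function that returns an RTT in milliseconds or `None` on failure. The names `measure_rtt` and `tcp_probe` are made up, and `tcp_probe` only times a TCP connect to port 80 as a rough stand-in for whatever the load balancer really measures.

```python
import socket
import time
from typing import Callable, Optional, Sequence, Tuple

# A probe takes a host and returns the RTT in ms, or None if it failed.
Probe = Callable[[str], Optional[float]]


def tcp_probe(host: str, port: int = 80, timeout: float = 1.0) -> Optional[float]:
    """Approximate the RTT by timing a TCP connect (assumed stand-in)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def measure_rtt(
    ldns: str, probes: Sequence[Tuple[str, Probe]]
) -> Optional[Tuple[str, float]]:
    """Try each probe in order and return the first one that succeeds."""
    for name, probe in probes:
        rtt = probe(ldns)
        if rtt is not None:
            return name, rtt
    return None  # every probe failed, so no RTT is known


# Stub probes so the example runs without sending any traffic:
# the ping is "blocked", the DNS probe "answers" in 12.5 ms.
probes = [
    ("ping", lambda host: None),
    ("dns", lambda host: 12.5),
    ("tcp", tcp_probe),
]
result = measure_rtt("192.0.2.53", probes)
print(result)
```

This also shows why a nameserver that drops ICMP may still see traffic on port 53 or port 80: a blocked ping just moves the system on to the next probe type.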

Let me know if there's any more info I can provide here.

reactions, thoughts, comments, etc.

Yeah, that makes sense. While I can understand that people get concerned when their IDS starts whining (I sure would) I hope you helped explain things here.

I like the dynamic probing they do: this will make sure people dynamically get the best connection even if their geographically closest node happens to be slow at that moment (due to network congestion in between or whatever).

I didn't know about this. I don't like it. They should find an alternative ASAP.

I don't think what they're doing is the best way... couldn't you just get the same result from a local IP database? If it's to do congestion control on overloaded servers, just have the servers tell each other how loaded they are.

It's more so to deal with network congestion between end users and the end site. The Netscalers already communicate site load metrics between themselves. And really, the purpose was more to get better response times for different geo regions than to balance server load.

A really good example is New York - often the closest, or shortest trip time site is Amsterdam and NOT San Jose. Static maps I've looked at assume that US addresses should stay in the US (it's a 4ms difference but Amsterdam is still quicker).

Static maps also don't deal with networks that have been allocated by ARIN (so probably US based) and have been broken up into smaller networks and announced from different parts of the world, or that have subnets of a larger supernet announced from different locations for TE purposes or whatever.

This is a common technique (for dynamic proximity) but I don't think it's used much in the wild. It's something we've been experimenting with since the beginning of the year.

I also think we're hitting some sort of issue with Netscaler that's causing it to be too aggressive and I've been working with Citrix since last week to resolve it.

My suggestions:

Having a good reverse DNS entry for the IP address doing the probe would really help (the current one, ns01.nllb.nl.mozilla.com. (for the Netherlands, I presume), isn't the best it could be, but OK).

Put up a webpage on the IP address doing the probe with an explanation.