Cache DNS issues.

William Muriithi

25 Nov 2014

4:08 p.m.

Morning,

I have a caching DNS server running on a Postfix-based mail server. The BIND server does come up before Postfix, but DNS queries fail until the DNS master comes up.

Don't slave servers keep their data persistently? I mean, is it normal for a slave to fail to resolve queries on startup if the master is down?

Thanks in advance,

William


James Knott

25 Nov

4:19 p.m.

On 11/25/2014 11:08 AM, William Muriithi wrote:

> Morning,
>
> I have a caching DNS server running on a Postfix-based mail server. The BIND server does come up before Postfix, but DNS queries fail until the DNS master comes up.
>
> Don't slave servers keep their data persistently? I mean, is it normal for a slave to fail to resolve queries on startup if the master is down?

What's your cache time? A cache will typically hold the data for a short time, to keep stale data from propagating. It might be as short as a few minutes. A cache certainly won't survive a reboot.

D. Hugh Redelmeier

5:08 p.m.

| From: James Knott <james.knott@rogers.com>

| What's your cache time? A cache will typically hold the data for a
| short time, to keep stale data from propagating. It might be as short
| as a few minutes. A cache certainly won't survive a reboot.

I don't know about local policy, but each DNS record has a TTL which is supposed to limit the time it is cached. I imagine that these are generally honoured by a caching server to avoid overloading the other servers.

James Knott

5:21 p.m.

On 11/25/2014 12:08 PM, D. Hugh Redelmeier wrote:

> | From: James Knott <james.knott@rogers.com>
>
> | What's your cache time? A cache will typically hold the data for a
> | short time, to keep stale data from propagating. It might be as short
> | as a few minutes. A cache certainly won't survive a reboot.
>
> I don't know about local policy, but each DNS record has a TTL which is supposed to limit the time it is cached. I imagine that these are generally honoured by a caching server to avoid overloading the other servers.

The original TTL was 24 hours. But that's a bit long these days. It's a balance between "overloading" and stale addresses. Upper level servers tend to have long times, local ones shorter. Regardless, I'd expect the cache to be empty on a system that's just been started. Cache as cache can. ;-)

D. Hugh Redelmeier

10:35 p.m.

| From: James Knott <james.knott@rogers.com>

| The original TTL was 24 hours.

How do you know that? Do you know the queries William Muriithi's machine is making? Do you know what DNS servers it is querying?

In a response to a DNS query, each record has its own TTL. That TTL depends on what the authoritative server set it to be initially and how intermediate servers (if any) modified it (usually because it has been aging in a cache). You can see this with dig(1).
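The "aging in a cache" behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CachedRecord` class and its field names are hypothetical and are not taken from BIND or any other resolver. It shows why the TTL reported by a cache shrinks over time and eventually reaches zero.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    """One cached DNS record (hypothetical structure, for illustration only)."""
    name: str
    rtype: str
    value: str
    original_ttl: int   # seconds, as received from the upstream server
    stored_at: float    # timestamp (seconds) when the cache stored the record

    def remaining_ttl(self, now: float) -> int:
        """TTL the cache would report at time `now`, never below zero."""
        age = int(now - self.stored_at)
        return max(0, self.original_ttl - age)

    def is_expired(self, now: float) -> bool:
        """A record with no TTL left must be fetched again from upstream."""
        return self.remaining_ttl(now) == 0


# A record with a one-hour TTL, stored at time 0.
rec = CachedRecord("example.com.", "A", "192.168.230.241", 3600, stored_at=0.0)
print(rec.remaining_ttl(now=600.0))   # 3000: ten minutes have been used up
print(rec.is_expired(now=3600.0))     # True: the full hour has passed
```

Running dig twice against a caching server a few seconds apart should show the same countdown in the TTL column.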

William Muriithi

26 Nov

12:09 a.m.

Hugh,

Thank you. My problem seems to be a TTL issue. The TTL for my records is one hour and this system was down longer than that due to a power outage. The AXFR dump from the slave looks as below.

; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.23.rc1.el6_5.1 <<>> example.com axfr
;; global options: +cmd
example.com. 3600 IN SOA srvyyzdc01.example.local. hostmaster.example.com. 2014111803 900 600 86400 3600
example.com. 3600 IN TXT "v=spf1 mx -all"
example.com. 3600 IN MX 10 smtp1.example.com.
example.com. 3600 IN NS srvyyzdc01.example.local.
example.com. 3600 IN NS srvyyzdc02.example.local.
example.com. 3600 IN A 192.168.230.241
aeo.example.com. 3600 IN A 192.168.230.131

Original Message
From: D. Hugh Redelmeier
Sent: Tuesday, November 25, 2014 5:35 PM
To: GTALUG Talk
Reply To: D. Hugh Redelmeier
Subject: Re: [GTALUG] Cache DNS issues.

| From: James Knott <james.knott@rogers.com>

| The original TTL was 24 hours.

How do you know that? Do you know the queries William Muriithi's machine is making? Do you know what DNS servers it is querying?

In a response to a DNS query, each record has its own TTL. That TTL depends on what the authoritative server set it to be initially and how intermediate servers (if any) modified it (usually because it has been aging in a cache). You can see this with dig(1).

---
GTALUG Talk Mailing List - talk@gtalug.org
http://gtalug.org/mailman/listinfo/talk

D. Hugh Redelmeier

3:39 a.m.

| From: William Muriithi <william.muriithi@gmail.com>

| Thank you. My problem seems to be a TTL issue. The TTL for my records is
| one hour and this system was down longer than that due to a power outage.

Sorry if my message was confusing. I don't think that setting a longer TTL is the solution to your problem. In fact, I don't know the right solution. Not really my area of expertise.

Setting a longer TTL will paper over the problem. If the outage is short enough, and the remaining TTLs are long enough, you will not have a problem. But all TTLs count down, so it is probably only luck if you never crash with a short remaining TTL. Adjusting TTL way high will adjust the probabilities, but still leaves a vulnerability.

You didn't really tell us everything relevant about your problem so I'm guessing at a few things.

Is your DNS master server on the same machine or a different one? If it is on the same machine, perhaps you can delay the startup of postfix until after the master server is up.

If the server is on another machine, I don't know of an off-the-shelf solution for all cases. That doesn't mean that there isn't one.

If you control the master server, you could run a local slave DNS server on the postfix machine. That is probably the best and cleanest solution. Zone transfers don't have to happen in real time. This assumes that you only really care about queries for names in that zone.

If you don't control the master server:

A normal (not DNSSEC) way of deciding that there is no domain with the given name is to give up after a query receives no answer after a timeout. Just telling postfix to have more patience might work but has other problems.

[UNTESTED] Perhaps before starting postfix, you could do a query of a better-be-a-sure-thing domain name with a really long timeout.

    dig @master.server known-name.ca +time=300

might do the trick (a 5 minute patience). This might fit in the init script.

Of course this is all a little improper. If the server goes down without the machine running postfix going down, you have the same old problem.

William Muriithi

4:16 a.m.

Hugh,

> Sorry if my message was confusing. I don't think that setting a longer TTL is the solution to your problem. In fact, I don't know the right solution. Not really my area of expertise.
>
> Setting a longer TTL will paper over the problem. If the outage is short enough, and the remaining TTLs are long enough, you will not have a problem. But all TTLs count down, so it is probably only luck if you never crash with a short remaining TTL. Adjusting TTL way high will adjust the probabilities, but still leaves a vulnerability.

You have a point: unless I make it a full day, at some point it will happen again. This is in Markham, and power outages are like a weekly affair. Pretty disappointed by the power supply up there.

> You didn't really tell us everything relevant about your problem so I'm guessing at a few things.
>
> Is your DNS master server on the same machine or a different one? If it is on the same machine, perhaps you can delay the startup of postfix until after the master server is up.

Nope. The master and the slave live on different hardware. Actually, the master is an Active Directory server while the slave is a BIND server. The purpose of the BIND server is to ensure DNS resolves fast enough, as I do a lot of DNS-related checks to keep spam from coming through.

> If the server is on another machine, I don't know of an off-the-shelf solution for all cases. That doesn't mean that there isn't one.

Me neither. I am wondering whether removing the MX server from the set of VMs that start automatically would solve the problem.

> If you control the master server, you could run a local slave DNS server on the postfix machine. That is probably the best and cleanest solution. Zone transfers don't have to happen in real time. This assumes that you only really care about queries for names in that zone.

Actually, the BIND running on the MX server is a slave. Sorry for mixing up cache and slave. Shouldn't both stop serving the zone in question if the TTL has expired? If that's not the case, I guess I am lost as to why it's happening, since you seem to imply it shouldn't happen with slaves.

> If you don't control the master server:
>
> A normal (not DNSSEC) way of deciding that there is no domain with the given name is to give up after a query receives no answer after a timeout. Just telling postfix to have more patience might work but has other problems.
>
> [UNTESTED] Perhaps before starting postfix, you could do a query of a better-be-a-sure-thing domain name with a really long timeout. dig @master.server known-name.ca +time=300 might do the trick (a 5 minute patience). This might fit in the init script.

Hmm, good idea. I could chkconfig postfix off and set up a small script to check DNS and then bring up postfix once it resolves successfully. Possibly call it every 15 minutes. Will investigate that idea tomorrow.
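A minimal sketch of such a check script, in Python, might look like the following. Everything here is an assumption for illustration: the function name `wait_for_dns`, the attempt count, and the delay are hypothetical, and `socket.gethostbyname` goes through the system resolver (whatever /etc/resolv.conf points at) rather than querying the master directly as the dig command would.

```python
import socket
import time


def wait_for_dns(name, resolve=socket.gethostbyname,
                 attempts=20, delay=15.0, sleep=time.sleep):
    """Return True as soon as `name` resolves, False after `attempts` failures.

    `resolve` and `sleep` are parameters so the loop can be exercised
    without a real DNS server or real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            resolve(name)
            return True
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            if attempt < attempts:
                sleep(delay)
    return False


# Hypothetical use from a boot-time script, with postfix disabled via chkconfig:
#
#     import subprocess
#     if wait_for_dns("smtp1.example.com"):
#         subprocess.run(["service", "postfix", "start"], check=True)
```

Polling inside one script run, rather than re-running it from cron every 15 minutes, keeps the delay before postfix starts closer to the moment DNS actually comes back. As noted in the thread, this only covers startup; it does nothing if the master goes away later.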

> Of course this is all a little improper. If the server goes down without the machine running postfix going down, you have the same old problem.

William


D. Hugh Redelmeier

5:35 a.m.

| From: William Muriithi <william.muriithi@gmail.com>

| Actually, the BIND running on the MX server is a slave. Sorry for
| mixing up cache and slave. Shouldn't both stop serving the zone in
| question if the TTL has expired? If that's not the case, I guess I am
| lost as to why it's happening, since you seem to imply
| it shouldn't happen with slaves.

[checking with my 2001 edition of "DNS and BIND"]

The slave can, but need not, have a file that saves the zone across reboots. Clearly you want this.

Zone transfers are initiated by the slave if it reboots without such a file. They are also initiated if the lifetimes specified in the SOA record have expired (refresh time, retry time, expire time, negative caching time). NOT the same as ordinary record TTL.

(Irrelevant: zone transfers are also prompted by the master, via NOTIFY, when the zone data changes.)

Of course this is about BIND 9 in 2001.

So it looks as if a few tweaks should get you something reasonable.
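To make the SOA timers concrete, here is a small Python sketch that pulls them out of the SOA line in William's AXFR dump. The helper name `parse_soa_rdata` and the `SoaTimers` type are hypothetical; the field order (serial, refresh, retry, expire, minimum) follows the SOA record layout in RFC 1035.

```python
from typing import NamedTuple


class SoaTimers(NamedTuple):
    serial: int
    refresh: int   # seconds between the slave's checks of the master's serial
    retry: int     # seconds to wait before retrying a failed check
    expire: int    # seconds without contact after which the slave stops serving the zone
    minimum: int   # seconds; used as the negative-caching TTL


def parse_soa_rdata(rdata: str) -> SoaTimers:
    """Parse 'mname rname serial refresh retry expire minimum'."""
    fields = rdata.split()
    if len(fields) != 7:
        raise ValueError("expected 7 SOA fields, got %d" % len(fields))
    return SoaTimers(*(int(f) for f in fields[2:]))


# RDATA copied from the example.com SOA record earlier in the thread.
soa = parse_soa_rdata(
    "srvyyzdc01.example.local. hostmaster.example.com. "
    "2014111803 900 600 86400 3600"
)
print(soa.refresh, soa.retry, soa.expire)   # 900 600 86400
```

With these values the slave checks the master every 15 minutes, and the expire timer is 24 hours, which is separate from the one-hour TTL on the individual records. Whether the expire timer or a missing zone file was the actual cause here is not established in the thread.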

James Knott

1:35 a.m.

On 11/25/2014 05:35 PM, D. Hugh Redelmeier wrote:

> | The original TTL was 24 hours.
>
> How do you know that? Do you know the queries William Muriithi's machine is making? Do you know what DNS servers it is querying?

I was referring to earlier days of the Internet, not his system.

"The units used are seconds. An older common TTL value for DNS was 86400 seconds, which is 24 hours. A TTL value of 86400 would mean that, if a DNS record was changed on the authoritative nameserver, DNS servers around the world could still be showing the old value from their cache for up to 24 hours after the change."

https://en.wikipedia.org/wiki/Time_to_live

As that article mentions, with that TTL, it could take up to a full 24 hours for a change to be reflected everywhere.
