server questions - - help needed

Greetings
I am quite new at running a server so hopefully the question isn't too out there.
My server has been operational for about a year and I am working on a number of different projects on it. Twice now (this last Friday and 5 weeks earlier) I came into the office to find that the server had somehow been taken down and had rebooted itself (process set up in the BIOS), but as it doesn't quite complete the boot process, I have to hit a key to tell it to continue and then finally to log in to read Debian (stable).
So I am trying to determine what may have caused the system to reboot. Whilst I have my suspicions, I want to figure out exactly what is happening to cause this kind of behavior. AIUI servers should be able to run happily for years without issues (barring hardware problems), so I want that kind of reliability. Where in /var/log will I find the most clues as to the events that led up to this 'reboot'?
Thanks in advance!!
Dee

On Sun, Jun 03, 2018 at 02:47:13PM -0500, o1bigtenor via talk wrote:
Greetings
I am quite new at running a server so hopefully the question isn't too out there.
My server has been operational for about a year and I am working on a number of different projects on it. Twice now (this last friday and 5 weeks earlier) I came into the office to find that the server has somehow been taken down and has rebooted itself (process setup in the bios) but as it doesn't quite complete the boot process, I have to hit a key to tell it to continue and then finally to log in to read Debian (stable).
Who does the stopping? BIOS or Linux kernel? I ask because, my machine always stops at BIOS prompt when power comes back. I don't know why. I set the BIOS to "power off" when power comes back, so it should stay turned off, but it doesn't.
So I am trying to determine what may have caused the system to do a reboot, whilst I have my suspicions I want to figure out exactly what is happening to cause this kind of behavior. AIUI servers should be able to run happily for years without issues (barring hardware problems) so I want that kind of reliability. Where in /var/log will I be finding the most clues as to the events that lead up to this 'reboot'?
/var/log/messages /var/log/syslog /var/log/debug
Thanks in advance!!
Dee --- Talk Mailing List talk@gtalug.org https://gtalug.org/mailman/listinfo/talk
-- William Park <opengeometry@yahoo.ca>
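One way to act on the log-file suggestion above: the kernel prints a "Linux version" line at the start of every boot, so the entries just before that line in /var/log/syslog are the last things logged before the machine went down. What follows is only a minimal Python sketch, assuming the usual rsyslog line layout; the function name `lines_before_boot` is invented for illustration, and after an abrupt power loss there may be nothing useful to find.

```python
from collections import deque


def lines_before_boot(lines, context=5):
    """Return, for each boot marker, the `context` lines logged just before it.

    A boot marker is a kernel line containing "Linux version", which the
    kernel prints once at the start of every boot.
    """
    recent = deque(maxlen=context)  # rolling window of the most recent lines
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if "kernel:" in line and "Linux version" in line:
            # Everything still in the window was logged before this boot.
            results.append(list(recent))
            recent.clear()
        recent.append(line)
    return results
```

To try it on a real machine, pass an open file such as `open("/var/log/syslog", errors="replace")`; older boots may sit in rotated files like syslog.1. A marker at the very top of a file yields an empty list, since nothing precedes it.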

On Sun, Jun 3, 2018 at 5:54 PM, William Park via talk <talk@gtalug.org> wrote:
On Sun, Jun 03, 2018 at 02:47:13PM -0500, o1bigtenor via talk wrote:
Greetings
I am quite new at running a server so hopefully the question isn't too out there.
My server has been operational for about a year and I am working on a number of different projects on it. Twice now (this last friday and 5 weeks earlier) I came into the office to find that the server has somehow been taken down and has rebooted itself (process setup in the bios) but as it doesn't quite complete the boot process, I have to hit a key to tell it to continue and then finally to log in to read Debian (stable).
Who does the stopping? BIOS or Linux kernel?
Bios - - - need to hit F2 (IIRC) to kick the bios in the pants and then I can get to the os prompt a little later.
I ask because, my machine always stops at BIOS prompt when power comes back. I don't know why. I set the BIOS to "power off" when power comes back, so it should stay turned off, but it doesn't.
So I am trying to determine what may have caused the system to do a reboot, whilst I have my suspicions I want to figure out exactly what is happening to cause this kind of behavior. AIUI servers should be able to run happily for years without issues (barring hardware problems) so I want that kind of reliability. Where in /var/log will I be finding the most clues as to the events that lead up to this 'reboot'?
/var/log/messages /var/log/syslog /var/log/debug
Thanks - - - that last one was quite useful. Dee

On 03/06/18 10:57 PM, o1bigtenor via talk wrote:
On Sun, Jun 3, 2018 at 5:54 PM, William Park via talk <talk@gtalug.org> wrote:
On Sun, Jun 03, 2018 at 02:47:13PM -0500, o1bigtenor via talk wrote:
Greetings
I am quite new at running a server so hopefully the question isn't too out there.
My server has been operational for about a year and I am working on a number of different projects on it. Twice now (this last friday and 5 weeks earlier) I came into the office to find that the server has somehow been taken down and has rebooted itself (process setup in the bios) but as it doesn't quite complete the boot process, I have to hit a key to tell it to continue and then finally to log in to read Debian (stable).
Who does the stopping? BIOS or Linux kernel?
Bios - - - need to hit F2 (IIRC) to kick the bios in the pants and then I can get to the os prompt a little later.
Google for "Server stuck at F1 or F2 prompt". Depending on your vendor, you will get quite a bit of information on probable causes and diagnostic processes.
I had to debug a work Dell with that a while ago, and sometime in my Copious Spare Time Intel server (;-))
--dave
--
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
davecb@spamcop.net           |                      -- Mark Twain

On Mon, Jun 4, 2018 at 7:50 AM, David Collier-Brown via talk <talk@gtalug.org> wrote:
On 03/06/18 10:57 PM, o1bigtenor via talk wrote:
On Sun, Jun 3, 2018 at 5:54 PM, William Park via talk <talk@gtalug.org> wrote:
On Sun, Jun 03, 2018 at 02:47:13PM -0500, o1bigtenor via talk wrote:
Greetings
I am quite new at running a server so hopefully the question isn't too out there.
My server has been operational for about a year and I am working on a number of different projects on it. Twice now (this last friday and 5 weeks earlier) I came into the office to find that the server has somehow been taken down and has rebooted itself (process setup in the bios) but as it doesn't quite complete the boot process, I have to hit a key to tell it to continue and then finally to log in to read Debian (stable).
Who does the stopping? BIOS or Linux kernel?
Bios - - - need to hit F2 (IIRC) to kick the bios in the pants and then I can get to the os prompt a little later.
Google for "Server stuck at F1 or F2 prompt". Depending on your vendor, you will get quite a bit of information on probable causes and diagnostic processes. I had to debug a work Dell with that a while ago, and sometime in my Copious Spare Time Intel server (;-))
This beast has been set up in the expectation that a tech is available 24-7. Oh well - - - it is easy to see that management has not ever had these demands placed upon them so they think it's quite 'normal' to ask for this.
Thanks for the tips!
Dee

On Sun, Jun 3, 2018 at 5:54 PM, William Park via talk <talk@gtalug.org> wrote:
On Sun, Jun 03, 2018 at 02:47:13PM -0500, o1bigtenor via talk wrote:
Greetings
I am quite new at running a server so hopefully the question isn't too out there.
My server has been operational for about a year and I am working on a number of different projects on it. Twice now (this last friday and 5 weeks earlier) I came into the office to find that the server has somehow been taken down and has rebooted itself (process setup in the bios) but as it doesn't quite complete the boot process, I have to hit a key to tell it to continue and then finally to log in to read Debian (stable).
Who does the stopping? BIOS or Linux kernel?
I ask because, my machine always stops at BIOS prompt when power comes back. I don't know why. I set the BIOS to "power off" when power comes back, so it should stay turned off, but it doesn't.
Mine is set for 'power on' because I am working on further inputs where the server gets to oversee and possibly manage a bunch of things so I need to have reliability AND uptime.
So I am trying to determine what may have caused the system to do a reboot, whilst I have my suspicions I want to figure out exactly what is happening to cause this kind of behavior. AIUI servers should be able to run happily for years without issues (barring hardware problems) so I want that kind of reliability. Where in /var/log will I be finding the most clues as to the events that lead up to this 'reboot'?
/var/log/messages /var/log/syslog /var/log/debug
Thanks for the tips!

On Sun, Jun 3, 2018 at 3:47 PM, o1bigtenor via talk <talk@gtalug.org> wrote:
[snip] So I am trying to determine what may have caused the system to do a reboot, whilst I have my suspicions I want to figure out exactly what is happening to cause this kind of behavior.
Do you have a UPS on that machine? If not, did you have a power interruption that was brief enough that it would not have reset the clocks at your place but long enough to cause the server to reboot?
Has that machine been spontaneously rebooting recently? That's usually an indication of a hardware problem. I've had machines reboot spontaneously because:
* the CPU fan and heat sink were too clogged with dust to work effectively,
* the thermal paste had deteriorated and was no longer effective,
* a power supply fan bearing had seized so the power supply's thermal protection kicked in to prevent damage to the components,
* the hard disk drive was defective.
This is by no means an exhaustive list. What do you suspect?
By the way, I don't understand why long up times are considered to be some sort of badge of honour. If you're doing regular updates even with very conservative distributions, like CentOS or Debian stable, you're going to have to reboot your server due to kernel updates at least every few months.
Regards,
Clifford Ilkay
+1 647-778-8696
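Several of the causes listed above come down to overheating. On Linux systems that expose thermal zones through sysfs, the current temperatures can be read from /sys/class/thermal/thermal_zone*/temp, which reports millidegrees Celsius. The Python sketch below is only an illustration under that assumption: the function names and the 80 degree threshold are made up, many servers report temperatures only through their management controller or lm-sensors, and a current reading cannot explain a past reboot unless it is logged regularly (for example from cron).

```python
import glob
import os


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    temps = {}
    pattern = os.path.join(base, "thermal_zone*", "temp")
    for temp_file in sorted(glob.glob(pattern)):
        try:
            with open(temp_file) as fh:
                millidegrees = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # zone unreadable or reported a non-number; skip it
        zone = os.path.basename(os.path.dirname(temp_file))
        temps[zone] = millidegrees / 1000.0
    return temps


def too_hot(temps, limit_c=80.0):
    """Return the zones at or above limit_c (the default is an arbitrary guess)."""
    return {zone: t for zone, t in temps.items() if t >= limit_c}
```

Calling `too_hot(read_thermal_zones())` returns an empty dict when nothing is over the limit or when the machine exposes no thermal zones at all, so an empty result is not proof that cooling is fine.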

On 03/06/18 20:05, Clifford Ilkay via talk wrote:
By the way, I don't understand why long up times are considered to be some sort of badge of honour. If you're doing regular updates even with very conservative distributions, like CentOS or Debian stable, you're going to have to reboot your server due to kernel updates at least every few months.
There are a few kernel hot fix tools out there to address this.
Canonical offer canonical-livepatch: https://www.ubuntu.com/server/livepatch
SuSE has kGraft: https://www.suse.com/products/live-patching/
Red Hat develops kpatch: https://access.redhat.com/articles/2475321 - I'm not sure how they distribute patches.
Oracle bought ksplice: http://ksplice.oracle.com/
Shameless self-promotion - I think ours is the easiest to set up - snap install, livepatch enable and you're all set. That and you get 3 tokens free whereas all the other offerings seem to require paid subscriptions. You can get a $0 ksplice license for a single desktop system I think, but other than that, Oracle seem to only support their own Linux with it now.
None of these helped with spectre/meltdown but for any other patches that I've seen, patches just happen. These tools give more flexibility in terms of planning infrastructure reboots while keeping systems stable and secure. I highly recommend running one!
Cheers, Jamon

| From: o1bigtenor via talk <talk@gtalug.org>

| My server has been operational for about a year and I am working on a
| number of different projects on it. Twice now (this last friday and 5
| weeks earlier) I came into the office to find that the server has somehow
| been taken down and has rebooted itself (process setup in the bios)
| but as it doesn't quite complete the boot process, I have to hit a key
| to tell it to continue and then finally to log in to read Debian
| (stable).
|
| So I am trying to determine what may have caused the system to do a
| reboot,

Often a crash prevents logging. Clearly logging would have to happen after the crash, something that isn't easy when the system has crashed. But there is some hope.

Do you have a working UPS? I don't, and I lose power a few times a year. That knocks out my computers (and clocks everywhere).

Aside: all device classes evolve to have enough intelligence to have clocks that need setting, and then evolve to be networked to set their own clocks. The timing of these steps is not fixed.

Can you believe that I grew up with phones that had no clock?

The first small computers I used had no clocks. The big ones did so that IBM could charge for the time that they were used (eg. one used to rent machines and have to pay overtime if they worked more than one shift). CP/M's file system didn't have timestamps (they were added long after I moved on). MS-DOS stupidly used local time for timestamps, even though UNIX got it right (used UTC) before MS-DOS.

| AIUI servers should be
| able to run happily for years without issues (barring hardware
| problems) so I want that kind of reliability. Where in /var/log will I
| be finding the most clues as to the events that lead up to this
| 'reboot'?

Not being a Debian user, I don't know which files are most useful. If you are using systemd you might find that journalctl is the command you need.

You could look at them all (you can skip the ones which haven't changed recently).

I don't know why your system stops at the POST page. Could it be that your HDD doesn't spin up quickly enough for the normal boot logic? I have one server that hangs because the EFI System Partition's filesystem gets corrupted during a crash (oops). I think that the problem is that the OS leaves /boot/efi mounted most of the time (that's dumb) so the filesystem gets marked as "dirty" and the firmware doesn't like that.
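For the journalctl suggestion above, the entries of most interest are the final ones from the boot before the current one. The Python sketch below simply wraps that query; the function names are invented for illustration. It assumes systemd is in use and that the journal is stored persistently (for example because /var/log/journal exists), which may not be the default on every installation. Without persistent storage, journalctl has nothing from earlier boots to show.

```python
import subprocess


def previous_boot_cmd(lines=50, boot=-1):
    """Build a journalctl command showing the last `lines` entries of a boot.

    boot=-1 selects the boot before the current one; 0 is the current boot.
    """
    return ["journalctl", f"--boot={boot}", f"--lines={lines}", "--no-pager"]


def show_previous_boot(lines=50):
    """Run the command and return its text output, or the error text if empty."""
    result = subprocess.run(previous_boot_cmd(lines), capture_output=True,
                            text=True, check=False)
    return result.stdout or result.stderr
```

Running `journalctl --list-boots` first shows which boots the journal actually holds, which is a quick way to tell whether persistent storage is active.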

On Sun, Jun 3, 2018 at 7:18 PM, D. Hugh Redelmeier via talk <talk@gtalug.org> wrote:
| From: o1bigtenor via talk <talk@gtalug.org>
| My server has been operational for about a year and I am working on a
| number of different projects on it. Twice now (this last friday and 5
| weeks earlier) I came into the office to find that the server has somehow
| been taken down and has rebooted itself (process setup in the bios)
| but as it doesn't quite complete the boot process, I have to hit a key
| to tell it to continue and then finally to log in to read Debian
| (stable).
|
| So I am trying to determine what may have caused the system to do a
| reboot,
Often a crash prevents logging. Clearly logging would have to happen after the crash, something that isn't easy when the system has crashed. But there is some hope.
Using suggestions offered I think I have been able to pinpoint the issue.
Do you have a working UPS? I don't, and I lose power a few times a year. That knocks out my computers (and clocks everywhere).
Aside: all device classes evolve to have enough intelligence to have clocks that need setting, and then evolve to be networked to set their own clocks. The timing of these steps is not fixed.
Can you believe that I grew up with phones that had no clock?
The first small computers I used had no clocks. The big ones did so that IBM could charge for the time that they were used (eg. one used to rent machines and have to pay overtime if they worked more than one shift). CP/M's file system didn't have timestamps (they were added long after I moved on). MS-DOS stupidly used local time for timestamps, even though UNIX got it right (used UTC) before MS-DOS.
| AIUI servers should be
| able to run happily for years without issues (barring hardware
| problems) so I want that kind of reliability. Where in /var/log will I
| be finding the most clues as to the events that lead up to this
| 'reboot'?
Not being a debian user, I don't know which files are most useful. If you are using systemd you might find that journalctl is the command you need.
You could look at them all (you can skip the ones which haven't changed recently).
I don't know why your system stops at the POST page. Could it be that your HDD doesn't spin up quickly enough for the normal boot logic?
Dell has some kind of goofy BIOS stuff so that one needs to choose one of 2 options and then the UEFI stuff happens and then the reboot works. The waiting for input is not at issue here (the system has always been this way - - - grin!)
I have one server that hangs because the EFI System Partition's filesystem gets corrupted during a crash (oops). I think that the problem is that the OS leaves /boot/efi mounted most of the time (that's dumb) so the filesystem gets marked as "dirty" and the firmware doesn't like that.
Thanks for the ideas! Dee

On 03/06/18 15:47, o1bigtenor via talk wrote:
So I am trying to determine what may have caused the system to do a reboot, whilst I have my suspicions I want to figure out exactly what is happening to cause this kind of behavior. AIUI servers should be able to run happily for years without issues (barring hardware problems) so I want that kind of reliability. Where in /var/log will I be finding the most clues as to the events that lead up to this 'reboot'?
Most servers from the big vendors will have an out of band (aka lights out) management interface. Tools like freeIPMI let you control the physical host - like remote serial console, chassis power control etc.
Does yours have this feature? Usually hardware issues show up in a log there - things like power supply issues, CPU overheat conditions etc.
If you don't have one, I highly recommend looking into whether your server supports an add-on out of band management card.
Cheers, Jamon

On Mon, Jun 4, 2018 at 7:13 AM, Jamon Camisso via talk <talk@gtalug.org> wrote:
On 03/06/18 15:47, o1bigtenor via talk wrote:
So I am trying to determine what may have caused the system to do a reboot, whilst I have my suspicions I want to figure out exactly what is happening to cause this kind of behavior. AIUI servers should be able to run happily for years without issues (barring hardware problems) so I want that kind of reliability. Where in /var/log will I be finding the most clues as to the events that lead up to this 'reboot'?
Most servers from the big vendors will have an out of band (aka lights out) management interface. Tools like freeIPMI let you control the physical host - like remote serial console, chassis power control etc.
Does yours have this feature? Usually hardware issues show up in a log there - things like power supply issues, CPU overheat conditions etc.
If you don't have one, I highly recommend looking into whether your server supports an add-on out of band management card
I think it does support such but I'm not sure I want to pay for another add-on at this point. The actual main issue isn't this part of things. Thanks for the ideas though! Dee