Windows 10 Reboot Loop – CUDA & Alienware

Late last year I was trying to do some machine learning work on my brand new Alienware 15 R4 gaming laptop. I had bought the laptop in order to have something portable with sufficient performance to actually do convolutional neural network (CNN) training and inference “on the road”. The GTX 1060 in the laptop is just as powerful as my home desktop machine, and should run TensorFlow and Keras well. I had the setup working on the desktop already, and copied the code over to the laptop. When I tried to run the code the first time, I got some rather strange errors that I finally figured out meant that I was missing the CUDA toolkit. I downloaded CUDA version 10 and installed it, and the machine rebooted into the Windows 10 automatic repair mode.

I tried a few rounds of repair and reboot, but nothing helped. I went to the advanced options and tried rolling back to a previous system restore point – but that failed:

Failing to restore to a previous system restore point, error 0x80070003 (whatever that means)

Following advice from various forum threads, I tried “bootrec.exe” in a few ways, to no effect. At this point, I was considering wiping the machine and starting over. Clearly, something was really broken in my Windows installation. I was not willing to give up quite yet, however… reading various blogs and discussion threads long into the night, an idea started to form.

On the SRT Trail

The next morning, I went through the whole repair loop again, ending up with a message like this (except that in my case the path pointed to D:):

Windows 10 repair screen, just like the one I hit

The key when this message shows up is to actually look at the contents of the log file (\windows\system32\logfiles\Srt\SrtTrail.txt, on either C: or D:) to see which problems it lists. To do this, I clicked through the Windows startup repair screen to advanced options, eventually ending up at the screen where you can launch CMD.EXE.

At this point, I found it a bit disconcerting that my C: boot drive had been mapped to D:, and the D: data drive to C:. It gave me a bad feeling, but there was nothing to be done about it. Looking at the contents of the srttrail file, I did find something rather interesting:

Screenshot of the srttrail.txt file that revealed the problem. To see the file in Notepad like this, just enter the name of the file as a command in Windows CMD, and it will open in Notepad. Even in the limited repair environment.

The reference to bootres.dll being broken is apparently not really relevant. From what I could find on the Internet, Windows complaining about that file really means that something else is broken. Thus, if you find an srttrail.txt that does not mention anything else, you should go back to the start and ask Windows 10 to try to repair itself again – this could well reveal another layer of errors. Sometimes, Windows seems to stop the automatic repair process early and only report issues with bootres.dll.

The file mentioned first, nvpciflt.sys, is a part of the Nvidia drivers. Thus, it seemed rather clear that the act of installing the CUDA tools was the cause of my machine misbehaving. It was a software misconfiguration or failed installation, rather than a hardware issue.
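As a rough illustration of the reading strategy above (take note of every file the log flags, but do not treat bootres.dll as the real culprit), here is a small Python sketch. It would have to run on another machine against a copy of the log, since the repair environment has no Python. The line format "Boot critical file ... is corrupt" is an assumption about how SrtTrail.txt words its findings, the sample text is made up for the example, and suspect_files is a hypothetical helper name.

```python
import re

# ASSUMPTION: the log reports problems as "Boot critical file <path> is corrupt".
# Adjust the pattern if your SrtTrail.txt uses different wording.
CORRUPT_FILE = re.compile(r"Boot critical file (.+?) is corrupt", re.IGNORECASE)


def suspect_files(log_text, ignore=("bootres.dll",)):
    """Return the flagged file paths, skipping names that are usually red herrings."""
    found = []
    for match in CORRUPT_FILE.finditer(log_text):
        path = match.group(1).strip()
        # Compare on the bare file name, case-insensitively, as Windows does.
        name = path.replace("/", "\\").rsplit("\\", 1)[-1].lower()
        if name not in ignore and path not in found:
            found.append(path)
    return found


# Hypothetical excerpt, written for this example (not copied from a real log).
sample = (
    "Boot critical file d:\\windows\\system32\\drivers\\nvpciflt.sys is corrupt.\n"
    "Boot critical file d:\\boot\\resources\\custom\\bootres.dll is corrupt.\n"
)

for path in suspect_files(sample):
    print(path)
```

If the list comes back empty because only bootres.dll was flagged, that matches the case described above where another round of automatic repair is needed to surface the real problem.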

Fix by Delete

I took a chance, figuring that the software setup was kind of shot anyway, and deleted the nvpciflt.sys file. After this, my machine did indeed boot up again and got into Windows! I then completely uninstalled the CUDA toolkit and all other Nvidia software components using Windows Apps and Features, to get to a clean restart.

All the pieces of Nvidia drivers and tools currently on my laptop; I uninstalled and re-installed all of them to resolve the problem.

Rebooting after a total clean-up of the Nvidia drivers still gave me a working system using some kind of fall-back driver, but the options for sleep and hibernation were gone from the Windows Start menu. The GPU was also gone from the Alienware Command Center application.

The Alienware Command Center, with a working GPU driver.

The Right Driver

I then proceeded to download and install the latest Nvidia drivers, figuring that the problem was that somehow I had installed CUDA on top of an incompatible driver. However, the Nvidia driver installer claimed that it could not find any compatible hardware… interesting. The drivers worked fine for the GTX 1060 and GTX 1050 cards in my home-built machines, but not for this laptop. The Alienware laptops require a driver from Dell rather than the generic Nvidia drivers, since the laptop hardware is just a little customized – as I understand it, it has something to do with the ability of the Alienware machine to use an external expansion box with an extra GPU.

This driver version problem was the reason why my CUDA installation broke my Windows installation – CUDA 10 required a driver version that was newer than what Dell had provided. Thus, the result of installing CUDA 10 on top of my 398.89 Nvidia driver was a broken driver stack. Using CUDA 9.2 instead worked, since it supported the old driver.

Lessons Learned

Don’t expect a laptop with a built-in Nvidia graphics chip to be supported by the latest drivers from Nvidia in the same way that a desktop with a PCIe-attached graphics card is. Thankfully, it seems that the Nvidia graphics driver installer checks for this – the strange error I got over and over again was actually correct, and it kept my machine in working order. Getting driver updates only via the manufacturer is an aspect of buying a gaming laptop that I did not think about. It is January 2019 as I write this, and the latest driver from Dell was released back in September of 2018. Not very impressive.

If you want to use CUDA for something, check the version of the driver currently on your machine before trying to install CUDA tools, since it seems the CUDA tools do not check version compatibility sufficiently to avoid broken installations.
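A minimal Python sketch of such a pre-install check follows. The minimum driver versions in the table are what I recall from Nvidia's CUDA release notes for Windows and should be verified there before relying on them. The function names are made up for this example. The query uses the standard nvidia-smi options --query-gpu=driver_version and --format=csv,noheader, and returns None if nvidia-smi is not on the PATH (which can happen on Windows).

```python
import shutil
import subprocess

# ASSUMPTION: minimum Windows driver versions per CUDA release, as recalled from
# Nvidia's CUDA release notes. Verify against the official table before use.
MIN_WINDOWS_DRIVER = {
    "9.2": (398, 26),
    "10.0": (411, 31),
}


def parse_version(text):
    """Turn a driver version string such as '398.89' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports(driver_version, cuda_release):
    """True if the driver is at least the minimum listed for that CUDA release."""
    return parse_version(driver_version) >= MIN_WINDOWS_DRIVER[cuda_release]


def installed_driver_version():
    """Ask nvidia-smi for the installed driver version; None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    return lines[0].strip()


# The situation from this post: Dell's 398.89 driver against CUDA 10 and CUDA 9.2.
print(driver_supports("398.89", "10.0"))  # False
print(driver_supports("398.89", "9.2"))   # True
```

On a machine with a working driver, something like driver_supports(installed_driver_version(), "10.0") would give the answer before the CUDA installer is ever started.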

If Windows auto-repair fails, there is no reason to panic. If you can find and delete the files that SRT finds broken, you have a fair chance of getting your system back to working order.

10 thoughts on “Windows 10 Reboot Loop – CUDA & Alienware”

  1. Hi,

    I am facing the same issue.
    Could you let me know which drivers one is supposed to install from Dell after the system reboots? The ‘sleep’ and ‘hibernate’ options are missing from the start menu for me as well. Also, I am unable to change my display resolution and brightness.

    Thanks.

  2. If I recall correctly, I just went to the Dell support page for the machine and downloaded all the drivers they had on offer, after making sure to remove anything called “Nvidia” from the system with Add/Remove programs in Windows. It seems it is mostly the graphics driver that creates issues; the other drivers were not damaged, nor as important.

  3. Unfortunately, I have no precise record of what I did beyond the above, and the machine has been updated many times in the meantime so I have no way to find any specific files from the time of the problems.

  4. Hi Jakob!

    There are 2 different types of drivers. Standard desktop drivers will not work on mobile GPUs. If you go to the website https://www.nvidia.com/Download/index.aspx?lang=en-us# you can see there is a “Windows Driver Type” option where you can choose between standard drivers and DCH drivers. The DCH driver is probably the one you need, and it is the one I personally use on my laptop. If you do not want to install the drivers manually like that, you can install GeForce Experience and it can automatically find the correct driver package for you; however, you will have to create an account for that. After you have the updated drivers, you can install the compatible CUDA toolkit and you should have no problems. On a side note, it would also be beneficial to disable the Intel integrated card, as that can sometimes make working with CUDA a lot simpler. Give this a try and let me know.
    Good luck!

  5. I was lucky that I keep regular backups of my computer, and was able to roll back to a system restore point from a couple of weeks back. I installed CUDA today as well, which broke my system and put me in this boot loop.
    Good to know there are alternative solutions; sucks that this is a problem, though. Going to tell my friends not to install TensorFlow without having a new system restore point.
