Late last year I was trying to do some machine learning work on my brand new Alienware 15 R4 gaming laptop. I had bought the laptop in order to have something portable with sufficient performance to actually do convolutional neural network (CNN) training and inference “on the road”. The GTX 1060 in the laptop is just as powerful as my home desktop machine, and should run TensorFlow and Keras well. I had the setup working on the desktop already, and copied the code over to the laptop. When I tried to run the code the first time, I got some rather strange errors that I finally figured out meant that I was missing the CUDA toolkit. I downloaded CUDA version 10 and installed it, and the machine rebooted into the Windows 10 automatic repair mode.
I tried a few rounds of repair and reboot, but nothing helped. I went to the advanced options and tried rolling back to a previous system restore point, but that failed too.
Following advice from online threads, I tried “bootrec.exe” in a few ways, to no effect. At this point, I was considering wiping the machine and starting over. Clearly, something was really broken in my Windows installation. I was not willing to give up quite yet, however… reading various blogs and discussion threads long into the night, an idea started to form.
On the SRT Trail
The next morning, I went through the whole repair loop again, ending up with a message that pointed to a log file (in my case, the path pointed to D:).
The key when this message shows up is to actually look at the contents of the log file (C|D:\windows\system32\logfiles\Srt\SrtTrail.txt) to see which problems it lists. To do this, I clicked through the Windows startup repair screen to advanced options, eventually ending up with the screen where you can launch a CMD.EXE.
At this point, I thought it was a bit disconcerting that my C: boot drive had been put at D:, and the D: data drive was at C:. It gave me a bad feeling, but there was nothing to be done about it. Looking at the contents of the srttrail file, I did find something rather interesting.
The reference to bootres.dll being broken is apparently not really relevant. From what I could find on the Internet, Windows complaining about that file really means that something else is broken. Thus, if you find an srttrail.txt that does not mention anything else, you should go back to the start and ask Windows 10 to try to repair itself again, since this could well reveal another layer of errors. Sometimes Windows seems to stop the automatic repair process early and report only issues with bootres.dll.
The file mentioned first, nvpciflt.sys, is a part of the Nvidia drivers. Thus, it seemed rather clear that the act of installing the CUDA tools was the cause of my machine misbehaving. It was a software misconfiguration or failed installation, rather than a hardware issue.
Fix by Delete
I took a chance, figuring that the software setup was kind of shot anyway, and deleted the nvpciflt.sys file. After this, my machine did indeed boot up again and got into Windows! I then completely uninstalled the CUDA toolkit and all other Nvidia software components using Windows Apps and Features, to get to a clean restart.
Rebooting after a total clean-up of the Nvidia drivers still gave me a working system using some kind of fall-back driver, but the options for sleep and hibernation were gone from the Windows Start menu. The GPU was also gone from the Alienware Command Center application.
The Right Driver
I then proceeded to download and install the latest Nvidia drivers, figuring that the problem was that somehow I had installed CUDA on top of an incompatible driver. However, the Nvidia driver installer claimed that it could not find any compatible hardware… interesting. The drivers worked fine for the GTX 1060 and GTX 1050 cards in home-built machines, but not for this laptop. The Alienware laptops require a driver from Dell rather than the generic Nvidia drivers, since the laptop hardware is just a little customized. As I understand it, this has something to do with the ability of the Alienware machine to use an external expansion box with an extra GPU.
This driver version problem was the reason why my CUDA installation broke my Windows installation – CUDA 10 required a driver version that was newer than what Dell had provided. Thus, the result of installing CUDA 10 on top of my 398.89 Nvidia driver was a broken driver stack. Using CUDA 9.2 instead worked, since it supported the old driver.
Don’t expect a laptop with an integrated graphics card to be supported by the latest drivers from Nvidia in the same way that a desktop with a PCIe-attached graphics card is. Thankfully, it seems that the Nvidia graphics driver installer checks for this right now. That strange error I got over and over again was actually correct, and it was keeping my machine in working order. Getting driver updates only via the manufacturer is an aspect of buying a gaming laptop that I did not think about. It is January 2019 as I write this, and the latest driver from Dell was released back in September of 2018. Not very impressive.
If you want to use CUDA for something, check the version of the driver currently on your machine before trying to install CUDA tools, since it seems the CUDA tools do not check version compatibility sufficiently to avoid broken installations.
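A check like this can be scripted before running any CUDA installer. The following is only a minimal sketch in Python, under some assumptions: the minimum driver versions in the table are the Windows values as I recall them from Nvidia's CUDA release notes and should be verified there, `nvidia-smi` is assumed to be installed with the driver and reachable on the PATH, and the function names are made up for this example.

```python
import subprocess

# Minimum Windows driver version per CUDA toolkit. ASSUMPTION: values as
# recalled from Nvidia's CUDA release notes; verify before relying on them.
MIN_WINDOWS_DRIVER = {
    "9.2": "398.26",
    "10.0": "411.31",
}


def parse_version(text):
    """Turn '398.89' into (398, 89) so that versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_supports(installed, required):
    """True if the installed driver is at least the required version."""
    return parse_version(installed) >= parse_version(required)


def supported_toolkits(installed):
    """List the CUDA versions from the table that the driver is new enough for."""
    return [
        cuda
        for cuda in sorted(MIN_WINDOWS_DRIVER, key=parse_version)
        if driver_supports(installed, MIN_WINDOWS_DRIVER[cuda])
    ]


def installed_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU it reports.

    ASSUMPTION: nvidia-smi is on the PATH; on some Windows installs it is not.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0].strip()
```

With the table above, `supported_toolkits("398.89")` gives `["9.2"]`, which matches what I saw: the Dell-provided 398.89 driver was fine for CUDA 9.2 but too old for CUDA 10. On a real machine, `supported_toolkits(installed_driver_version())` would do the same check against the driver that is actually installed.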
If Windows auto-repair fails, there is no reason to panic. If you can find and delete the files that SRT finds broken, you have a fair chance of getting your system back to working order.