It Must be the Antivirus

I recently got access to a sparkling new demo machine for VLAB, based on an AMD 9950X3D processor coupled with fast RAM and a blazing SSD. This is a perfect machine for running virtual platforms, with arguably the best microarchitecture available and a huge L3 cache (which is known to benefit code like simulations). It is installed in a box that allows sustained 5GHz+ clocks. Nice. But when I started to run VLAB, things did not seem right. It took a long time to start a new simulation and some things just seemed “off”. What could be wrong?

Symptoms

Some quick investigations showed that running benchmark code on a VP seemed reasonably fast. Compared to my laptop (which in itself is a fairly fast Intel Core Ultra 7 155H, a high-performance version of Meteor Lake), it was up to twice as fast!

However, starting a new simulation run took an awfully long time. Some simple time measurements scripted with Python indicated something like 20 seconds. That is a long time to wait for a simulation to start, especially when said simulation might not even take more than that to complete.
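Something along these lines would do for such a measurement. This is a minimal sketch and not the script actually used: the function name `time_command` is made up for illustration, and the command being timed is a stand-in (a Python interpreter that exits immediately), since the real simulator command line is not shown here.

```python
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd several times and return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the output; only the elapsed time is of interest here.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in command (assumption): replace with a simulator run that starts and exits at once.
placeholder_cmd = [sys.executable, "-c", "pass"]
for i, seconds in enumerate(time_command(placeholder_cmd), start=1):
    print(f"run {i}: {seconds:.2f} s")
```

Timing a run that does no real work isolates the startup cost, and repeating it a few times shows whether the delay is constant (as a fixed timeout would be) or varies with caching.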

Hypothesis: Antivirus

What could be causing that? One obvious answer is “antivirus”. Especially since we had seen issues on our own personal machines when adding new security software recently. Enterprise security does come with a price in overhead, but how bad can it be? Could it be this bad?

Using sysinternals to look into the files opened during a VLAB run did show the antivirus software getting loaded. It also seemed that the slowness came during the time when the simulator was opening and loading executables and disk images – which makes it reasonable to assume that there is some scanning going on, and this would explain why starting is slow but running decently fast.

Testing the Hypothesis

I was honestly a bit sceptical about the antivirus hypothesis. I haven’t really seen antivirus being a serious performance drag for a long time. I recall the bad old times when disks were slow and processor cycles precious, and you might set aside hours of machine time to scan a drive for nastiness. On-demand file scanning was a serious annoyance in the 1990s and early 2000s. But today? Should not matter. Disks are fast, processors have plenty of spare capacity, and memory is plentiful.

But it is worth checking. It seemed easy to test the hypothesis: turn off the antivirus and rerun the test. How hard can it be?

It is never that Easy

However, this is a corporate machine with an antivirus setup managed by IT. The next day, I had a very helpful agent on the line, and we felt this should be super easy. We started by doing the obvious thing according to the manual: use remote management to disable the antivirus and reboot the machine to make sure it was not loaded. Just disabling it might still leave some trace of it.

The demo machine is located in an office down in continental Europe, so I did all my work via Windows remote desktop. I could ask the machine to reboot remotely, but when the machine did not come back to RDP I needed some help. I asked the team with physical access to the machine for help, and it turned out it had refused to boot. It was necessary to use Windows recovery to roll back the state a bit to make it boot at all.

This was already two hours spent.

After another hour the machine had rebooted and was at least running. However, the antivirus software was definitely running and not disabled. It was also not in contact with the management console, so it could not be turned off either. And it had backed down a version. Extraordinarily weird. To add insult to injury, it could not be uninstalled as it protected itself against such operations and my user was not privileged enough.

Reboot again!

As an aside, watching a reboot over RDP is pretty interesting. Or not. The machine would come up and become available for RDP login pretty quickly (before the point where you would get the Windows login screen). I would then be looking at a black screen with a white spinner saying “Patientez…” (the machine was configured in French as its basic setup). This could go on for tens of minutes after each reboot operation. Lots of patience was needed.  

Rebooting a few times finally brought the antivirus back in contact with management so that it could be disabled. After a final glorious reboot (which was pretty quick to be honest), we arrived at a state with the antivirus disabled.

The result: same startup time as before. Absolutely no change after half a day of wrestling with machine management.

It wasn’t the antivirus.

New Hypothesis

What else happens when a program is starting? It is not just reading files, but also checking licenses… so maybe that is what is taking time? Switching to a local license file instead of using a network license reduced the startup time to very close to zero.

Bingo!

Why was a network license check slow, though? Couldn’t that also be the result of enterprise data protection software checking the network accesses? I.e., not antivirus but close?

Actually no. It was a badly configured license server list that accidentally contained some bad server names. The local software simply goes through the list of servers in order, and if one server does not reply, there is a ten-second timeout before it tries the next. Two bad servers give you 20 seconds. This was found by showing the list to some colleagues who immediately spotted the problem – “those are not the servers you should be looking for.”
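The arithmetic is easy to model. The sketch below is only an illustration of the in-order walk with a fixed timeout described above, not the license client's real logic: the function `checkout_delay`, the host names, and the way responsiveness is decided are all made up for the example.

```python
def checkout_delay(servers, is_responsive, timeout_s=10.0):
    """Walk the server list in order, as described above.

    Each server that does not reply costs one full timeout before the next
    one is tried. Returns (seconds waited, server that answered or None).
    """
    waited = 0.0
    for server in servers:
        if is_responsive(server):
            return waited, server
        waited += timeout_s
    return waited, None


# Hypothetical server list: two bad names ahead of the one that works.
servers = ["bad-name-1.example", "bad-name-2.example", "license.example"]
working = {"license.example"}

delay, server = checkout_delay(servers, working.__contains__)
print(f"waited {delay:.0f} s before reaching {server}")
# prints "waited 20 s before reaching license.example"
```

In this model the delay grows by one full timeout for every bad entry ahead of the first working server, so putting a known good server first, or removing the bad names, takes the wait to zero.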

Problem solved.

Summary

What did we learn here? At first this seemed like another story of me chasing the wrong lead while trying to fix a computing problem. Given the time it took to disable the antivirus it could be considered a case of tunnel vision or fixation. But I don’t think that is necessarily the case.

My original idea here was to make a quick check to prove or disprove the antivirus hypothesis. That still seems reasonable. In hindsight, it is clear that checking the license checkout would have been very easy and would have found the problem in minutes instead of hours or days.

The core of the story is that once we started working with the antivirus software, there was no going back. The initial disabling resulted in a machine that would not boot. Getting it back to a working state was necessary and had to take priority. No other experiments could really be considered valid or even performed until it was back to normal.

Still, there is a lesson here. It is a good idea to stop and enumerate possible causes of a problem before starting to experiment to prove or disprove any particular hypothesis. That makes it more likely that you will try the really easy stuff first, which in this case would have avoided a day of sysadmin work.

A few days later, we had another issue where we also started to suspect the security software. Thankfully, this time around disabling and enabling went smoothly. And it wasn’t the antivirus this time either.

My base assumption remains that antivirus today is pretty good at not causing problems. If it does, it tends to be rather catastrophic.
