Go to the source: NetworkManager

I ran into some behavior on my system that was not optimal. I was able to rectify the situation without going into the source code and changing anything, but I still figured I'd do that anyway, because when a piece of Open Source Software is behaving in a way that is unexpected or undesired, it is probably our responsibility to do so, or some shit.

This blog is a representation of both my documentation of solving the issue and some deeper exploration, looking for the actual code that created the behavior, which I may or may not find by the time I get bored and decide to publish this.

I have a statically defined local 10/100 ethernet network in the 10.0.0.0/8 subnet. It has a few Linux boxen hooked up to a little switch that I use to experiment.

One of the systems (the distribution is irrelevant to this discussion) runs NetworkManager, because it has a wifi card and NetworkManager makes connecting to my home's wifi network through that interface easy. Don't judge me.

At this time, I don't have name resolution set up, not even with /etc/hosts, and am just using statically assigned IPs set by a Bourne shell script, fu9arnet.sh. It's not hard to remember which one is 10.0.0.1, 10.0.0.2, 10.0.0.3, etc.

So, on the machine in question, I ran my script to set its IP on that link, and after some number of seconds, the IP would be dropped and the interface would be down. The reason is that NetworkManager polls its list of controlled interfaces and resolves any differences between that list and what it finds. The simple fix was as easy as taking this entry and deleting it from the list.
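To make the "polls its list and resolves differences" idea concrete, here is a minimal sketch in Python. It is a toy model, not NetworkManager's actual code or data structures: the `reconcile` function, the dictionaries, and the interface name `eth0` are all made up for illustration.

```python
# Toy model only: these names and structures are hypothetical,
# not anything taken from the NetworkManager source.

def reconcile(managed, live):
    """Force every managed interface back to the state the manager expects.

    managed: interface name -> desired state (the manager's list)
    live:    interface name -> actual state right now
    Returns the names of the interfaces that had to be changed.
    """
    changed = []
    for name, desired in managed.items():
        if live.get(name) != desired:
            live[name] = dict(desired)  # overwrite whatever was set by hand
            changed.append(name)
    return changed


# The manager's list says the wired interface should be down with no address.
managed = {"eth0": {"up": False, "addr": None}}

# My script just brought it up with a static address.
live = {"eth0": {"up": True, "addr": "10.0.0.2"}}
print(reconcile(managed, live))  # ['eth0'] - the manual setup is undone

# The fix: delete the entry from the list, then set the address again.
del managed["eth0"]
live["eth0"] = {"up": True, "addr": "10.0.0.2"}
print(reconcile(managed, live))  # [] - nothing is touched any more
```

The point of the model is that an interface on the list is something the manager feels entitled to correct, and an interface off the list is left alone, which matches what I saw once the entry was gone.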

So that solved the issue, and if I wasn't stupid, I would have stopped there, not started looking into the NetworkManager source.

Well, that was easy. The first image is the function that monitors the links and makes sure everything is working the way NetworkManager expects. The second is a snip out of NetworkManager's main function that puts nm_link_state_monitor, the first function snipped there, into the GLib main event loop with a 5000ms value for its timeout. At least that's the code that explains why I had enough time to log in with ssh from my other machine and then have it suddenly stop responding.
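As a back-of-the-envelope check on that 5000ms value, here is a small Python sketch of an idealized repeating timer. It does not use GLib at all, and `ticks` and `survival_ms` are hypothetical helpers. It assumes the timer fires exactly every 5000ms and that the monitor undoes a manual change on the very next tick, which a real main loop does not guarantee.

```python
INTERVAL_MS = 5000  # the 5000ms value from the snippet


def ticks(interval_ms, until_ms):
    """Times (ms) at which an idealized repeating timer fires,
    starting one full interval after it is registered."""
    return list(range(interval_ms, until_ms + 1, interval_ms))


def survival_ms(change_ms, interval_ms=INTERVAL_MS):
    """How long a manual change made at change_ms lasts before the
    next tick comes along and undoes it (idealized model)."""
    next_tick = (change_ms // interval_ms + 1) * interval_ms
    return next_tick - change_ms


print(ticks(INTERVAL_MS, 16000))  # [5000, 10000, 15000]
print(survival_ms(1200))          # 3800
print(survival_ms(5001))          # 4999
```

Under those assumptions a hand-set address lasts anywhere from just over 0 up to 5000ms, depending on where in the cycle the script happens to run.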

I'm satisfied. My question was about how that loop worked, and what it tied into to set that timer. It seemed to take longer than the 5000ms + 2s, because I was able to turn around and log in from my other machine before it dropped, but it's not like I timed it with a watch, or with anything but a gut feeling.

Would You Like to Know More?

hal (hardware abstraction layer)

dbus