Einstein makes a good point. You rarely solve a problem by doing the same troubleshooting methods and expecting it to work. I think doing the same thing over and over and expecting a different result is the actual definition of insanity.
A large part of my career has been about troubleshooting. There’s always a problem somewhere that needs my attention. I.T. in general, for me, is about problem solving.
Whether it’s solving a design problem you encounter during implementation, or an operations issue that falls outside what’s been documented, problems come in many shapes and sizes. So too must your problem-solving skills.
Some things make sense. You don’t really solve these. You pattern match the problem to an existing solution. And you can stay sane doing these.
For the hairier problems I’ve battled over the years, I find the following methods have helped me out a lot.
First thing I want to do when someone brings me their issue is see if I can replicate it.
What that means is: I want to know exactly what you were doing, and how you were doing it, when the “thing” did the “other thing”.
Some people send screenshots. Which is nice. But I prefer to see it live and direct. Why? Because I need to verify everything this person was doing, or might have been doing, or was actually probably doing when the system didn’t work for them.
If I can get the problem in my hands, see it with my own eyes. Taste it. Feel it. In my opinion I have a better chance of understanding the actual problem.
For example, some people will tell me they ran the exact Linux command I gave them and got an error. Then I’ll go and find out they were running it from the wrong server. Or sending it to the wrong server. Or didn’t copy and paste the command correctly.
If I can’t replicate it and have to go on hearsay, it can send me on a wild goose chase.
A really useful way of getting a deeper understanding of the problem, the system, or a possible solution - for me - is to build a miniature version of it.
What do I mean by this? If possible (and with today’s technology it is, and easier than ever) I will use virtualisation and container technology to build a mini version of what I’m looking at.
For example, if it’s an n-tier application design with a web server, an application server and a database, I will spin up 3 virtual servers with Vagrant, install an nginx server on one, a Tomcat Java application on another and a MySQL database on the third.
All on my work laptop.
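As a rough sketch, that three-tier “mini world” might look something like the multi-machine Vagrantfile below. Everything here is illustrative - the box name, IP addresses and package names are my assumptions, not the author’s actual setup:

```ruby
# Hypothetical multi-machine Vagrantfile: one VM per tier of the n-tier design.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/jammy64"   # assumed base box

  config.vm.define "web" do |web|
    web.vm.network "private_network", ip: "192.168.56.10"
    web.vm.provision "shell", inline: "apt-get update && apt-get install -y nginx"
  end

  config.vm.define "app" do |app|
    app.vm.network "private_network", ip: "192.168.56.11"
    app.vm.provision "shell", inline: "apt-get update && apt-get install -y tomcat9"
  end

  config.vm.define "db" do |db|
    db.vm.network "private_network", ip: "192.168.56.12"
    db.vm.provision "shell", inline: "apt-get update && apt-get install -y mysql-server"
  end
end
```

`vagrant up` then brings all three machines up on the laptop, and `vagrant destroy -f` annihilates the lot so you can start over.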
What this provides me is the freedom to do anything I need - from non-destructive tests to complete annihilation of this model, in a scaled-down way. This helps me understand the design a bit more. Feel its functionality, limitations and risks.
Spinning up these mini “worlds” as I call them has been my new favourite thing to do on my work laptop - both for solving problems and for just trying things out.
Simulating the issue helps when it’s time to move onto the next part.
I like doing evidence based investigations. Once I’ve confirmed the errors are legit I usually want to do a few more tests. This isn’t because I don’t believe what I’m seeing, I just like to confirm that it’s broken in the way that I’m assuming it is right at this moment.
For example, if I’m trying to connect to a web server on port 80 and it’s refusing connections - is the web server down? Or is the network not letting me connect to port 80? They’re two very different things that look the same.
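One way to tell those two apart from the client side - a minimal sketch using Python’s standard library, where the host and port are placeholders for whatever you’re testing:

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connect attempt: open, refused, filtered or unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"        # something is listening on that port
    except ConnectionRefusedError:
        return "refused"         # host answered, nothing listening: service is down
    except (socket.timeout, TimeoutError):
        return "filtered"        # no answer at all: likely a firewall dropping packets
    except OSError:
        return "unreachable"     # e.g. no route to host
```

A “refused” answer points the finger at the web server; “filtered” points it at the network.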
I like to think of this as evidence gathering. I’ll grab logs of things to get error messages, or warnings to build my case both for myself, and for whoever I need to talk with to get it fixed or describe possible knock-on issues.
Get logs, run a few tests and capture output. You need to be able to say what you tried, what you found, and your interpretation of those results. This is something that’s important on project work. This is your domain and it’s not enough to just fetch and submit raw data. You need to process it, and give your best expert opinion on what’s going on.
Compiling this information is both good for your own troubleshooting and valuable data for the project to make decisions based on.
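As a small sketch of that compiling step - here’s one hedged way to boil a log down into countable evidence. The log format and level names are my assumptions; adjust the pattern to whatever your system actually emits:

```python
import re
from collections import Counter

# Assumed log levels - change to match your system's actual format.
LEVEL_RE = re.compile(r"\b(ERROR|WARN(?:ING)?|CRIT(?:ICAL)?)\b")

def summarise(log_lines):
    """Count error/warning lines and keep the first example of each level."""
    counts = Counter()
    samples = {}
    for line in log_lines:
        m = LEVEL_RE.search(line)
        if m:
            level = m.group(1)
            counts[level] += 1
            samples.setdefault(level, line.strip())  # first occurrence as evidence
    return counts, samples
```

“47 ERRORs, all the same message, starting at 14:02” is a much stronger opening line than a raw log dump.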
Now, those are the more sane methods. Everything can be rationalised and thought through logically.
But what are the methods when nothing’s working and it doesn’t make sense?
It’s gone a little mad…
When everything has gone bananas, the system is doing crazy things. It won’t start up. It falls over when you give it the expected instructions. I try to bring a little sanity back into the picture.
I try and strip it all the way back to basics.
I try to prove what I know.
Whitebox testing is a software development concept.
I’m not a software developer.
When I say “whitebox” I mean it’s a system you can see the insides of. You have control of the system e.g. an SSH server, a router, a web server. You can change the settings. You know what input it’s expecting. And you know what the expected output is.
If I ping a network address, I expect the address to reply. Seriously, you strip it all the way back to what you know. For example, I know the address is up. So it should reply.
Okay so it replies. Can I connect to a port? Well, it’s my ftp server so port 21 should be open. Try to connect to that port. Login to the server, turn the FTP service off. Try to connect to port 21 again. It should not let you connect.
So now you know that part of the configuration is working as you expect it to. And so you move through each setting. From the client side. Changing on the server side. Confirming from the client side.
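That toggle-and-confirm loop can be simulated end to end. This is a sketch, not the real FTP scenario - a throwaway TCP listener stands in for the FTP service (real FTP on port 21 would need root, so the OS picks a free port instead):

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Client-side check: does anything accept a TCP connection here?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Server side: stand up a listener (our stand-in for the FTP service).
service = socket.socket()
service.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
service.listen(1)
port = service.getsockname()[1]

print(can_connect("127.0.0.1", port))   # service on -> True
service.close()                          # "turn the FTP service off"
print(can_connect("127.0.0.1", port))   # service off -> False
```

Service on: the client connects. Service off: it doesn’t. One setting changed, one confirmation from the client side - then on to the next setting.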
It’s config-by-config confirmation for dummies. Or at least that’s what it feels like.
So what about if you don’t have server side access? And can’t get in there and confirm or change settings?
You blackbox it.
Guess what? Black boxing is also a software developer thing.
And guess also what?
I’m still not a software developer so here’s my application of blackboxing.
From its definition, blackboxing is testing an application’s functionality without seeing the insides. You give the system an input. You expect something back. But you don’t have the luxury of manipulating the settings to make it so.
Black boxing (for me) is a logic game. You throw all manner of crap at the system and see how it responds. Badly formed HTTP requests. Random GET, POST requests. Scanning for ports and sending requests at them.
I basically send it nonsense until it makes sense. Sure, that sounds crazy. But you just need one thing to make sense. No matter how small. You just need the system to do one thing that you understand.
And then build on that.
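Here’s a toy version of that throw-crap-at-it approach, aimed at a local stand-in server so there’s something safe to poke. Everything here is illustrative - point this kind of thing only at your own test targets, never at a system you don’t own:

```python
import http.server
import socket
import threading

# Stand-in black box: a local HTTP server we are allowed to poke at.
# (It has no handlers defined, so we genuinely don't know how it will respond.)
server = http.server.HTTPServer(("127.0.0.1", 0), http.server.BaseHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def poke(raw: bytes) -> str:
    """Send raw bytes at the box, return the first line of whatever comes back."""
    with socket.create_connection(("127.0.0.1", port), timeout=2.0) as s:
        s.sendall(raw)
        return s.recv(1024).split(b"\r\n")[0].decode(errors="replace")

# Well-formed, unknown method, and outright nonsense - watch how each is classified.
for raw in [b"GET / HTTP/1.1\r\nHost: x\r\n\r\n",   # well-formed request
            b"BREW /coffee HTTP/1.1\r\n\r\n",        # unknown method
            b"complete nonsense\r\n\r\n"]:           # not HTTP at all
    print(poke(raw))
# The daemon server thread exits with the process.
```

The status lines that come back are the “one thing that makes sense”: the box distinguishes malformed requests from unsupported ones, which already tells you something about what’s inside.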
Real world example
Sure this is a super simple example but it’s illustrative.
Today I deployed a docker nginx container to a docker swarm that spanned two hosts. I tried the API endpoint. Looked good.
500 Internal Server Error!?
Refreshed again. Looked fine.
I pumped the ‘refresh’ button like crazy to generate a bunch of requests and watched the logs go.
Only one of the logfiles was throwing the 500.
The logs said it was stuck in some internal redirect loop.
I looked at the nginx config. It was set to try a few endpoints if anything was missing.
There’s the clue. Something was missing. First clue was it was failing about 50% of the time.
Putting this together: I had one node which was missing some endpoint files.
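For illustration only - this is the generic nginx pattern that produces that behaviour, not the actual config from that day. A `try_files` fallback like the one below loops internally when the fallback target itself is missing on a node, and nginx gives up with a 500:

```nginx
location / {
    # If the requested file is missing, fall back to /index.html.
    # But if /index.html is itself missing on this node, the fallback
    # redirects to itself: nginx logs "rewrite or internal redirection
    # cycle" and returns a 500 to the client.
    try_files $uri $uri/ /index.html;
}
```

With the swarm load-balancing across two nodes and only one node missing the files, roughly every other request hit the broken node - hence the ~50% failure rate.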
You’re always going to need problem solving in an I.T. role. Sometimes they’re easy enough. Most times though, they’re not.
So to help apply some methods that have been beneficial for me:
- Try to replicate the problem first so you can see it for yourself.
- Next try to simulate it somewhere you can control everything.
- Prove it’s a problem, what could be causing it, and how to fix it.
- When it starts getting crazy - try whiteboxing…
- When it’s lost the plot - try blackboxing.
You’re not going to win them all and that’s okay. Just do the best you can with what you know!