renewip: when the router keeps disconnecting

June 15th, 2008

So we now all have broadband connections and everything is great, right? Well, not quite. Some providers have better services than others. My connection seems rather fragile at times and tends to die about once every three or four days. When that happens, no amount of resetting the equipment helps to get it working again. It's an upstream issue that I have no control over.

But there is another problem. Once the cable modem starts working again, the router (which receives an IP address from my provider, and serves LAN and wifi locally) doesn't seem to notice and doesn't automatically re-establish a connection. I'm not really sure what it does. It's a black box with a web interface, and there's a button in there to renew the connection, which sometimes works. But what is really happening, who knows. There seems to be a weird timing problem to the whole thing: if I kill the power for both the modem and the router and they both come back at the same time, it generally works. However, if the modem takes longer to negotiate a link, the router ends up disconnected and apparently doesn't try to reconnect on its own, so I've been stuck rebooting the two a few times until the timing is right. Resetting them separately for some reason doesn't seem to work.

So what can be done about it? Well, the router does have that stupid web interface, so it's possible to make those clicks automatically if we're disconnected. Python's urllib makes this very easy to do. First we log in with router_login, which submits a form with POST. Then we check the state of the internet connection with check_router_state, which just reads out the relevant information from the page. And if it's disconnected we run renew_router_connection to submit another form (ie. simulating the button click on the web page).

Testing connectivity

Testing whether the router has a connection to the provider is only half the story, because broadband connections sometimes have connectivity problems of their own. Even if you can get a connection, the provider sometimes has problems on their network, meaning your connection doesn't work anyway.

So I came up with a test to see how well the connection is working. It's an optimistic test: first we assume we have a fully functional connection and ping yahoo.com. It doesn't matter what host we use here, just some internet host that is known to be reliable and "always" available. For this to work these conditions must be met:

  1. We have to reach the gateway of the subnet where our broadband IP address lives.
  2. We have to reach the provider's nameserver (known as dns1 in the code) to look up the host "yahoo.com".
  3. We have to reach yahoo.com (we have their IP address now).

So first we ping yahoo.com. If that fails, it could be because dns lookup failed. So we ping the provider's nameserver. If that fails, the provider's internal routing is probably screwed up, so we ping the gateway. And if that fails too then we know that although we have an IP address, the connection is dead (or very unstable).

#!/usr/bin/env python
#
# Author: Martin Matusiak <numerodix@gmail.com>
# Licensed under the GNU General Public License, version 3.

import os
import re
import sys
import time
import urllib

ip_factory = "192.168.2.1"
password = ""

inet_host = "yahoo.com"


def write(s):
    sys.stdout.write(s)
    sys.stdout.flush()

def grep(needle, haystack):
    if needle and haystack:
        m = re.search(needle, haystack)
        if m and m.groups(): return m.groups()[0]

def invoke(cmd):
    (sin, sout) = os.popen2(cmd)
    return sout.read()

def ping(host):
    cmd = 'ping -c1 -n -w2 ' + host + ' 2>&1'
    res = invoke(cmd)
    v = grep("rtt min/avg/max/mdev = [0-9.]+/([0-9.]+)/[0-9.]+/[0-9.]+ ms", res)
    if v: return int(float(v))

def find_lan_gateway():
    cmd = "route -n"
    res = invoke(cmd)
    v = grep("[0-9.]+\s+([0-9.]+)\s+[0-9.]+\s+UG", res)
    if v: return v

def load_url(url, params=None):
    data = None
    if params: data = urllib.urlencode(params)
    f = urllib.urlopen(url, data)
    return f.read()


def router_login():
    form = {"page": "login", "pws": password}
    load_url("http://%s/login.htm" % ip, form)

def check_router_state():
    state = { "conn": None, "gateway": None, "dns1": None }
    router_login()
    s = load_url("http://%s/js/js_status_main.htm" % ip)
    if s:
        v = grep("var bWanConnected=([0-9]);", s)
        if v == "1": state['conn'] = True
        elif v == "0": state['conn'] = False
        if state['conn']:
            g = grep('writit\("([0-9.]+)","GATEWAY"\);', s)
            if g and g != "0.0.0.0": state['gateway'] = g
            g = grep('writit\("([0-9.]+)","DNSIP"\);', s)
            if g and g != "0.0.0.0": state['dns1'] = g
    return state
    
def renew_router_connection():
    router_login()
    form = {"page": "status_main", "button": "dhcprenew"}
    s = load_url("http://%s/status_main.htm" % ip, form)
    return s



ip = find_lan_gateway()
if not ip:
    ip = ip_factory
    write("LAN gateway detection failed, using factory ip %s for router\n" % ip_factory)
else:
    write("Router ip: %s\n" % ip)

while True:
    try:
        router = check_router_state()
        t = time.strftime("%H:%M:%S", time.localtime())
        if router['conn']:
            
            hosts = [(inet_host, inet_host),
                ("dns1", router['dns1']), ("gateway", router['gateway'])]
            connectivity = ""
            write("[%s] Connected  " % t)
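            for (name, host) in hosts:
                delay = None
                if host:  # the router may not have reported this address
                    delay = ping(host)
                if delay is not None:  # a delay rounded down to 0 ms is still a reply
                    write("(%s: %s) " % (name, delay))
                    break
                else:
                    write("(%s !!) " % name)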

            write("\n")
        else:
            write("[%s] NOT CONNECTED, attempting reconnect\n" % t)
            renew_router_connection()
    except Exception, e:
        cls = grep("<type 'exceptions.(.*)'", str(e.__class__))
        write("%s: %s\n" % (cls, e))
    time.sleep(3)
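
The fallback order described under "Testing connectivity" can also be written as a small standalone function. This is only a sketch, written for Python 3 rather than the Python 2 of the script above. The names diagnose, ping_fn and fake_ping are made up for illustration, and the address 10.0.0.1 is a placeholder. ping_fn stands in for the ping function so the logic can be tried without a network, and hosts the router did not report are skipped instead of being passed to ping.

```python
def diagnose(hosts, ping_fn):
    """Ping hosts in order, stopping at the first one that answers.

    hosts   -- list of (name, address) pairs, most distant host first
    ping_fn -- hypothetical callable returning a delay in ms, or None on failure
    Returns a list of (name, delay) pairs; delay is None for each failure.
    """
    results = []
    for name, address in hosts:
        if not address:
            # The router did not report this address, so there is nothing to ping.
            results.append((name, None))
            continue
        delay = ping_fn(address)
        results.append((name, delay))
        if delay is not None:
            break  # this host answers, so the hops closer to us work too
    return results


def fake_ping(address):
    # Stand-in for a real ping: only the gateway answers in this example.
    return 12 if address == "10.0.0.1" else None


hosts = [("yahoo.com", "yahoo.com"), ("dns1", None), ("gateway", "10.0.0.1")]
report = diagnose(hosts, fake_ping)
print(report)
```

In this made-up run the first two checks fail and only the gateway replies, which is the "provider's internal routing is probably screwed up" case.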

Léon

June 15th, 2008

What a weird frickin movie. Picture a superhero comic book without the wholesome moral values and you're getting close. It's so strange to imagine that someone would have written and directed this, and just when the actors thought they were way off the mark would have said "yes, yes, that's exactly what I want". That someone is Luc Besson, who's gone considerably more Hollywood since.

So basically we have a timid, illiterate hitman who lives mostly on milk and cookies. Contrary to that whole ninja school of combat, he's not one of those "my body is my temple" types. He has a sort of rugged fitness which is kept in check by doing sit-ups every morning (and the constant milk, calcium mhm-hm). He's probably not very fast on his feet, cause at no point is there any running. His main gimmick is hanging from the ceiling, so that when the bad guys come into the room they don't see him (a poor man's ninja if you will). Oh, and he's the best hitman in town, sublime when on the job (less so off of it).

Not much is known about his past, but apparently he came to America as a poor, helpless immigrant, taken pity on by a generous Italian restaurant owner. All these facts are stretching poor Jean Reno's acting skills to the limit. Reno has a thick French accent with no vocal skills to get around it. How the hell do you claim he's Italian?

Besson makes no effort to justify Leon's career choice. It's not because he grew up in a war-torn country, because his parents were killed or because he read too many comic books, he was just poor. And killing seemed as good as anything else, eh? Then again, he does put on those dark sunglasses when he clocks in, no doubt there is a deep and heart-wrenching ethical conflict there, but it goes unarticulated. (Personally I would suggest his superego needs a small tune-up.)

And there is a girl. Dad does coke, family gets nailed by bad guys, same old, same old, yadda yadda yadda. Mathilda takes refuge at Leon's, teaches the big bear (or shall I say pig, that's his favorite fluffy pet) to read, he teaches her about guns, the usual story. If you've read this far just to find out if the "I love you"s are forthcoming, they are.

If you like the idea of Reno as a hitman and you want to see him in a far stronger part, check out "Ronin", it's quite good in more ways than one.

our climate control sucks

June 13th, 2008

We are so preoccupied with weather in our society. Even though we spend most of the day inside buildings, people will actually say that a day is good or bad just based on weather. "Nice day today, eh?" Apparently those little intervals we spend traveling between the house and the office, the office and the market, the market and home, are disproportionately important to our well being in contrast to all those hours we spend on the inside. And we pay so much attention to weather and climate that it can actually determine how we feel about the day as a whole.

And yet we pay so little attention to the climate on the inside. Isn't that a paradox?

When you go into a factory and look at some of their big machinery, they have these gauges on them that show you all sorts of information about the conditions in various critical parts of the system. It's fairly important to know that the temperature is such, the pressure is in some acceptable range, the concentration of some chemical doesn't exceed this; either because the machinery itself can't handle it (eg. nuclear reactor), or because the product on the inside will get ruined if you don't keep these factors under control.

We do this for our products, but we don't do it for ourselves. It's plain to see that the climate in our rooms is more important to our well being than the weather outside, since that's where we spend most of our time. And yet there's no weather forecast for this. We don't know anything about the climate in our homes. We complain about the climate in certain parts of the world, "oh that place is horrible to live in", and just the same there are buildings with an internal climate that is just as unbearable.

And then we talk about education, and health, and productivity. Does anyone see a problem here? Do you think you can be productive at your job if you're standing in the rain, freezing your ass off? No one would expect that from you. And yet you go into the office, where it's too hot, the air is stale because the ventilation stinks, it's noisy, there's so much ambient light that you have to squint to look at the monitor, the chair doesn't have proper support for your back, and the desk is so small your elbows are hanging off the edge of it (less common now with lcd monitors). And this isn't supposed to affect your productivity at all, right?

I cannot begin to quantify the number of days or half days that were ruined for me because the inside climate was bad. I used to hate summer, which brought a large number of sunny days while I was sitting in school. Half the time when the sun was up it was either in my eyes or producing glare on the blackboard, either of which meant I had to sit there squinting. Even if the curtains were drawn the sun obviously moves across the sky, so soon enough they wouldn't be in the right place anymore.

And then people say things like "boy, kids are so frail these days. They don't get enough exercise." Yes, that's part of it, no doubt. The other part is spending their days in rooms with a bad climate and non-existent ergonomics. And I know, because I was getting enough exercise, and that didn't magically eliminate the problems of climate.

***

So where do we start? We need to figure out what kind of climate we're living in. When someone is getting a headache from spending 2 hours in a room with so much ambient light that they can't comfortably see, we need to go from "there's something wrong with you" to "this climate sucks, let's fix it". The first step towards fixing is knowing what the problem is. Right now we don't know a damn thing. The only thing we have is thermometers. Imagine if the workers at a nuclear power plant only had one of those hand held thermometers and the guy was trying to "hold it close enough" to the opening so he could get a decent reading on it. That's where we are now.

We need to figure out what the relevant environmental factors are and how to measure them. Don't expect to have an ideal climate out of that, it could turn out to be expensive. But how do we know what it's going to cost since we know nothing? Step one is to be able to measure properties of the climate that impact us. Step two is to figure out how various people are affected by these properties, and by which ones. Step three is to connect these two bits of information to the extent we are able and willing to make the effort.

Climate control right now is an art. There are people who have figured out how to tune the climate, "do a little bit of this. Okay, a little more. There, good." But it's an art, inexact and experience based, full of "maybe this will help". We need to make it not a science, but a commodity. Just as you know that the temperature in your refrigerator is supposed to be between 0 and 4 degrees, we should be able to say the same about our home climate. "My ambient light is x on average, y at peak, I need to fix it." And then teach it in schools, right along with "you should eat this, not that". It's just as important.

the art of fail

June 10th, 2008

I wrote a guest blog for Rami, which I submitted to reddit. The same exact story was submitted on reddit by someone else, in the same category, mere hours after I did.

why is my system slow?

June 9th, 2008

Computer users expect their systems to work well at all times, but unfortunately this isn't always the case. If your system becomes slow, there certainly is something you can do about it. This article will help you understand what's happening on the system, whether it's the computer in front of you or a system you're accessing remotely.

When we say the system is slow we mean that it isn't responding to our input in a reasonable time, or is taking too long to complete a task. This can happen when there is *another* program using too many system resources, starving *your* program of resources, causing it to run slowly.

There are three common ways in which this can happen, and all three of these scenarios can be equally crippling and put your system into a state where it seems to be frozen. None of these situations are harmful to your system (ie. they make it slow until the problem is resolved, but they don't damage anything).

  1. A program is monopolizing the cpu.
    A program is using all of the cpu cycles, blocking access to the cpu to other programs. This may be intentional (programs that do heavy processing) or accidental (programs get stuck repeating something over and over).
  2. You're nearly out of physical memory.
    You are either running too many programs, or programs that use too much memory. Your physical memory is almost entirely exhausted, and the running programs are using the harddisk as fallback memory, which is very slow.
  3. A program is doing heavy I/O.
    You may be copying a large file, for instance. The program that is doing the copying is requesting lots of data from the harddrive, and while the harddrive is busy serving it, every other program that needs something from the disk has to wait its turn.

The impact of both cpu heavy and I/O heavy programs can be mitigated by tuning the kernel to be more responsive. If you are running a kernel supplied by one of the major distributions (Ubuntu, Fedora etc), then it's already finely tuned for your system, but even so you may still run into these problems sometimes, on your own system or some other one.

1. Cpu bound programs

How it happens

The most common cycle for a program is to 1) accept some input, 2) do some work, 3) give some output. And this sequence is repeated for as long as the program is running. Typically, the work that has to be done takes a very short time compared to the time spent waiting for input, which gives all the other running programs a chance to use the cpu in the meantime.

If a program instead takes no input and gives no output and only does work all the time, then there is much less time in which the cpu is free for other programs to use. This will make the whole system very slow, because all programs have to wait a long time to get their turn.

Demonstration

It's very easy to simulate this scenario. Here is an example. This program is a loop, which checks the condition (which is always true) and then performs the action in the loop (running the command true, which does nothing). No input, no output.

$ while true; do true; done

Hit Ctrl+C when you've had enough.

How to detect it

The easiest way to check for a cpu bound program is to use top. See if there is a program that's using almost 100% of the cpu. To be sure that it doesn't just occasionally spike, leave it running for a while (or hit <space> to refresh the display a few times).

$ top
Cpu0  : 98.2%us,  1.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu1  :  1.4%us,  0.5%sy,  0.0%ni, 98.2%id,  0.0%wa,  0.0%hi,  0.0%si

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26210 alex      20   0 19428 2368 1492 R   99  0.1   1:14.27 bash
 4075 alex      20   0  606m 195m  28m S    0  6.5  23:19.58 firefox
 6337 root      20   0  505m 116m 6716 S    0  3.9  34:30.25 Xorg

Here we see that cpu0 is idle (free) 0% of the time, which means it is as busy as it possibly can be. 98.2% is due to user programs (ie. the program we just demonstrated). And when we look in the list of programs, we see that bash (the shell in which we ran our one-liner) is using 99% of the cpu.
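
If you want to read those per-cpu numbers from a script rather than by eye, the line can be picked apart with a regular expression. This is a rough sketch in Python 3 that assumes the exact line format shown above (other versions of top print it differently); parse_cpu_line is a name made up for this example.

```python
import re


def parse_cpu_line(line):
    """Turn a line like 'Cpu0  : 98.2%us,  1.8%sy, ...' into a label and a dict."""
    label, _, rest = line.partition(":")
    fields = {}
    for value, key in re.findall(r"([0-9.]+)%([a-z]+)", rest):
        fields[key] = float(value)
    return label.strip(), fields


label, cpu = parse_cpu_line(
    "Cpu0  : 98.2%us,  1.8%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si"
)
busy = 100.0 - cpu["id"]  # everything that is not idle time
print(label, "busy:", busy, "user:", cpu["us"])
```

For the sample line this reports cpu0 as 100% busy, 98.2% of it in user programs.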

What you can do about it

There are two cases of cpu bound programs - the intentional and the accidental.

If you're running a program that does a lot of work on purpose, for instance a video encoding task, then the more cpu it uses the quicker it will finish, so using a lot of cpu is good. But if it's making your whole system slow, then you can lower the priority at which the program gets access to the cpu, so that it only uses as much cpu as the other programs leave available. To do this, use the renice command:

$ renice 19 26210

The first number you give to renice describes how "nice" you are being to other programs, on a scale from -20 (very selfish) to 19 (very nice). The other number is the process id (pid) of the program, which is listed by top above. Doing this will probably still make bash use almost 100% of the cpu, but not at the expense of the other programs.
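
The same thing can be done from a script. Here is a minimal sketch in Python 3 (os.setpriority needs Python 3.3 or later on a Unix system). The helper names clamp_nice and renice are made up for this example. On Linux the usable nice values run from -20 to 19, so anything outside that range is clamped first.

```python
import os


def clamp_nice(value):
    """Limit a requested nice value to the range Linux accepts (-20 to 19)."""
    return max(-20, min(19, value))


def renice(pid, value):
    """Set the nice value of process `pid`, roughly what the renice command does."""
    os.setpriority(os.PRIO_PROCESS, pid, clamp_nice(value))
    return os.getpriority(os.PRIO_PROCESS, pid)


print(clamp_nice(20))
print(clamp_nice(-50))
```

Calling renice(26210, 19) would then make the process from the top listing as nice as possible. As with the command itself, making a program less nice than it already is generally requires root.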

On the other hand, if the program isn't supposed to be using this much cpu, then it's either a bug in the program (certain versions of firefox used to spike to 100% cpu) or it's just heavier than the cpu can handle. You can still renice the program, but this will make your system more responsive at the expense of the program (so for instance, firefox may become unusable). The last resort is to kill it:

$ kill 26210

On multi-cpu systems this is less of a problem, because a program that runs in a single thread can only use one cpu at a time, which leaves the other cpus to serve all the other programs and keep your system responsive.

2. Physical memory is almost full

How it happens

There are two types of memory on your system: physical (RAM) and virtual (swap). Physical memory is relatively small and very quick to access, while virtual memory is just part of your harddisk being used as extra memory (very slow to access). As long as all the running programs can store their work in physical memory, everything is fine. (This is why it's good to have a lot of it.)

But once you fill all of the physical memory, the operating system will start moving some of the work into swap (onto the harddisk) to make space for new programs. You probably won't notice that this is happening. But when you switch from one program to another, and the second program has its work in swap, this work now has to be moved back into memory, and some of the stuff currently in memory has to be moved out to swap. This will definitely be noticeable and will make your system slow until it's finished.

The effect of this situation is that your system will feel normal for some of the time (when using the same program), and then very unresponsive from time to time (when switching between programs).

Demonstration

This effect is best demonstrated with a desktop program. Start the gimp and create a canvas so large that it exceeds your available physical memory. For instance, try a canvas 10,000x10,000 pixels (gimp will tell you how much memory it needs to create it). It will probably take a while to create the canvas, so just let it finish. (In order to make room for this image in memory, other programs are being moved into swap, this is called swapping.) Then do some painting on the canvas. Now switch back to another program (firefox, for instance). You should now sense that your system is slow to respond, but this is temporary for as long as it takes to restore firefox into memory.

How to detect it

It's a good idea to know how much memory your system uses under normal conditions, that way you can keep an eye on things. The command free -m will tell you about the state of your memory:

$ free -m
             total       used       free     shared    buffers     cached
Mem:          3015       1640       1375          0          9        124
-/+ buffers/cache:       1505       1509
Swap:         2878          0       2878

Here we see that we have 3015mb of physical memory, half of which is free. We also have almost as much swap memory, but none of that is in use.

After we create our huge canvas with the gimp we can run free again and see what has changed.

$ free -m
             total       used       free     shared    buffers     cached
Mem:          3015       2993         22          0          2        957
-/+ buffers/cache:       2033        982
Swap:         2878        819       2059

We're now using 819mb of swap memory, so clearly we've exceeded the capacity of physical memory.
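
To keep an eye on this automatically, the output of free -m can be parsed by a script. The following Python 3 sketch assumes the column layout shown above (newer versions of free print different columns); parse_free is a name made up for this example, and the sample text is the second listing from this article.

```python
def parse_free(output):
    """Extract total/used/free (in MB) for the Mem: and Swap: rows of `free -m`."""
    rows = {}
    for line in output.splitlines():
        parts = line.split()
        if parts and parts[0] in ("Mem:", "Swap:"):
            total, used, free = (int(x) for x in parts[1:4])
            rows[parts[0].rstrip(":")] = {"total": total, "used": used, "free": free}
    return rows


sample = """\
             total       used       free     shared    buffers     cached
Mem:          3015       2993         22          0          2        957
-/+ buffers/cache:       2033        982
Swap:         2878        819       2059
"""

mem = parse_free(sample)
swapping = mem["Swap"]["used"] > 0  # any swap in use at all
print(mem["Swap"], "swap in use:", swapping)
```

In practice you would feed it the real output of the command and compare the swap figure against whatever is normal for your system.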

While swapping takes place top will also show that there is a lot of I/O activity taking place. (Press Shift+M to sort the program listing by memory use.)

$ top
Cpu(s):  4.0%us,  1.6%sy,  0.0%ni, 51.0%id, 42.1%wa,  0.6%hi,  0.8%si
Mem:   3088224k total,  3043496k used,    44728k free,     3548k buffers
Swap:  2947888k total,   998812k used,  1949076k free,   957956k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2579 alex      20   0 1593m 1.4g  13m S    0 47.0   0:22.26 gimp

Here we see that user and system activity adds up to only 6%, the cpu is idle 51% of the time, but spending 42% of the time waiting for I/O operations (ie. harddisk activity). And among the programs, the gimp alone is using 1.4gb of memory.

What you can do about it

If you notice that your system becomes unresponsive when switching between programs, you have a pretty good idea that it's because of swapping. There is no way to make swapping faster, so what you should do is less swapping. Keep an eye on how much memory your system is using, and get an idea of the memory use of various programs (the big ones). When you notice heavy swapping, the easiest way to fix it is to shut down the program that's using the most memory. When you do this you free up physical memory. The data in swap will not automatically be moved into memory (because this is expensive), but you should notice that your system is performing normally again.

3. IO bound programs

How it happens

Input/Output (I/O, also written io) is an umbrella term for *everything* that happens on your system that does not involve the cpu, the memory or the video card (gpu). When talking about performance, io usually means the harddrive, because that's what your system uses most heavily (and therefore what we spend the most time waiting for), but it can also refer to your network card, your cdrom drive, your keyboard etc.

A program running on the cpu, which does a lot of io (such as reading/writing large files), will spend a lot of time waiting for this io to complete. While it waits the cpu itself is not doing any work, but every other program that needs the harddrive has to queue up behind it. The effect is that the whole system may become consistently unresponsive until the heavy io is completed.

Demonstration

We can demonstrate the effect of heavy io by reading and writing a lot of data to the harddrive. Here we find the device that your root partition is on (probably /dev/sda1) and then read 5gb from it, writing it to a file /tmp/dummy (you may want to check that you have enough free space).

$ device=`mount | grep " / " | awk '{ print $1 }'`
$ sudo dd if=$device of=/tmp/dummy bs=5120 count=1048576

This should take around 10 minutes, so you can see how your system behaves while this is happening.
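
The numbers in the dd command can be checked with a little arithmetic, shown here in Python 3. The 10 minute duration is just the estimate from above, so the transfer rate that falls out of it is only a ballpark figure.

```python
block_size = 5120        # bytes per block (bs=5120)
count = 1048576          # number of blocks (count=1048576)

total_bytes = block_size * count
total_gib = total_bytes / 1024 ** 3

# Assumption: the copy takes the 10 minutes estimated above.
seconds = 10 * 60
rate_mib = total_bytes / 1024 ** 2 / seconds

print(total_bytes, total_gib, round(rate_mib, 1))
```

So the command moves 5368709120 bytes, which is exactly 5gb, at roughly 8.5mb per second if it does take 10 minutes.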

How to detect it

We can detect heavy io with top.

$ top
Cpu0  : 13.7%us,  6.5%sy,  0.0%ni,  0.0%id, 79.0%wa,  0.8%hi,  0.0%si
Cpu1  :  4.7%us,  3.9%sy,  0.0%ni, 73.2%id, 17.3%wa,  0.8%hi,  0.0%si

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30956 root      20   0 10388 1812  640 D    5  0.1   0:02.76 dd

Here we see that cpu0 is spending 79% of its time waiting for io. In the list of programs we see the program dd that we ran. It's only using 5% cpu, which seems to conflict with the number 79%, but then we see it has status D, which means it is waiting for io. The two numbers measure different things: dd is only doing actual work on the cpu 5% of the time, while the 79% is time during which the cpu had nothing to run because io was still outstanding.

Most io is harddrive io, but we can see if this is the case with the tool atop, which is similar to top.

$ atop
CPU | sys     14% | user     21% | irq       2% | idle     39% | wait    125% |
cpu | sys     10% | user     11% | irq       2% | idle      0% | cpu000 w 78% |
cpu | sys      4% | user     10% | irq       0% | idle     38% | cpu001 w 48% |
DSK |         sda | busy     98% | read    1371 | write   1011 | avio    4 ms |

  PID  SYSCPU  USRCPU  VGROW  RGROW  RDDSK  WRDSK  ST EXC S  CPU CMD     1/4
30956   0.60s   0.00s     0K     0K 113.4M 114.0M  **   * D   6% dd

Here we again see that the program dd has status D (io wait), and cpu0 is spending 78% waiting for io. In addition, we see that the harddrive sda (which is the one we are reading and writing to, /dev/sda) is busy 98% of the time. So we know that it's the harddrive that cpu0 is waiting on 78% of the time.

What you can do about it

If your system becomes unresponsive because of io, it is because the harddrive is not being shared among the programs in a way that allows them all to stay responsive. So the answer is to prioritize certain programs over others. ionice is the io counterpart to nice.

$ ionice -p30956 -n7

Here we are telling ionice first the process id of the program, and then the io priority it should have, on a scale from 0 (highest priority) to 7 (lowest).