
(usagi-users 03459) Showstopper for transport mode IPSec in Linux kernel?



I sent the mail below to this list some weeks ago, and I am a little
surprised that there has been no response.  The problem is actually quite
serious, and it is specific to the Linux kernel implementation of IPsec -
I have talked with people who claim that the problem is not present in
*bsd kernels, for instance.

Is usagi-users the wrong list for this kind of discussion?

best regards

Peder Chr. Nørgaard        	Senior System Developer, M. Sc.
Ericsson Denmark A/S, Telebit Division
Skanderborgvej 232         	tel: +45 30 91 84 31
DK-8260 Viby J, Denmark         fax: +45 89 38 51 01
        e-mail: Peder.Chr.Norgaard@xxxxxxxxxxxx
(old e-mail 2000-2003: Peder.C.Norgaard@xxxxxxxxxxxxxxx)
         (old e-mail 1992-2000: pcn@xxxxxxx)

---------- Forwarded message ----------
Date: Fri, 17 Jun 2005 13:22:48 +0200 (CEST)
From: Peder Chr. Norgaard <Peder.Chr.Norgaard@xxxxxxxxxxxx>
To: usagi-users@xxxxxxxxxxxxxx
Subject: "connect" fails with EAGAIN with Racoon-style transport mode IPSec

As the saying goes, is this a bug or a feature?  I don't really know, but
I would like some opinions.

My problem is that when I use transport mode security with the ipsec-tools
on a 2.6.12-rc6 kernel (presumably also on older 2.6 kernels), the first
attempt at communication with an IPsec peer fails consistently.  Second
and later attempts usually succeed.  Closer investigation discloses that
the error code is EAGAIN, and inspection of the code shows that the error
comes from the call to xfrm_lookup in function tcp_v6_connect in
net/ipv6/tcp_ipv6.c.  I have not checked, but I assume that similar code
is found in the code for UDP, SCTP etc.; I have checked IPv4 and the
problem is present there, too.

Now EAGAIN means "resource temporarily unavailable" and it can safely be
argued that in this case a resource is most certainly temporarily
unavailable - the connect is matching a security policy that currently has
no associated security association, and as the racoon daemon is somehow
told about this, the security association is in place a second or so
later.  So it can be argued that this is not formally an error.

But as a practical matter, how many "connect" calls in network
applications are coded to react to EAGAIN with a loop that waits a second
or so and then tries again?  Answer: virtually none.  So the problem is
pushed to the user level - the user sees what looks like a fatal
communication error and is often not even told the actual error code.
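
To make concrete what such a retry loop would look like, here is a minimal
sketch in Python.  It is not taken from any real application; the function
name connect_with_retry, the make_socket factory and the attempts/delay
parameters are all invented for this illustration, and the one-second
delay is only a guess based on the SA setup time mentioned above.

```python
import errno
import socket
import time


def connect_with_retry(make_socket, address, attempts=5, delay=1.0,
                       sleep=time.sleep):
    """Connect to address, retrying only when connect() fails with EAGAIN.

    make_socket is a hypothetical zero-argument factory that returns a
    fresh, unconnected socket for every attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        sock = make_socket()
        try:
            sock.connect(address)
            return sock
        except OSError as exc:
            # Do not reuse a socket after a failed connect; close it.
            sock.close()
            # Any other error, or EAGAIN on the last attempt, is fatal.
            if exc.errno != errno.EAGAIN or attempt == attempts:
                raise
            # Assumption: the key daemon needs about a second to set up
            # the security association, so wait before trying again.
            sleep(delay)


if __name__ == "__main__":
    # Example call; the address is a documentation placeholder.
    def tcp6():
        return socket.socket(socket.AF_INET6, socket.SOCK_STREAM)

    try:
        conn = connect_with_retry(tcp6, ("2001:db8::1", 80), attempts=3)
        conn.close()
    except OSError as exc:
        print("connect failed:", exc)
```

The point of the sketch is that every single application would need to
carry code like this around each of its connect calls.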

So in practice this looks to me almost like a showstopper for deployment
of IPsec.  For myself, I have had to suspend deployment in a test network
in my project until I have a solution for this - for my users will not
accept this behaviour, and I don't blame them.

Any opinions?  Do we just have to live with this, and start recoding all
connect calls in all internet applications?  Or can a work-around be found
in the kernel?  Queueing the connect until the SA is in place would be one
solution.  Another would be simply to drop the initial TCP PDU in this
case - if the case can be detected precisely enough - so that the TCP
logic will think that the packet was lost in transit and resend it a
little later.

Thank you in advance for any responses.

best regards

Peder Chr. Nørgaard