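[Network Programming] TCP connect() keeps returning EINVAL after a few SYNs
I recently hit a network problem: a client thread calling connect() sent a few SYNs and then stopped sending them, and every connect() call after that returned EINVAL.
I traced it with strace: the first argument of connect(), the socket fd, never changed, the address and port were correct, and the third argument, len, was obtained with sizeof, so it could not be the problem.
Fortunately the problem was fairly easy to reproduce.
Adding print statements step by step showed that the EINVAL was returned from __inet_stream_connect: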
https://elixir.bootlin.com/linux/v5.15.178/source/net/ipv4/af_inet.c#L649
    switch (sock->state) {
    default:
        err = -EINVAL; /* later connect() calls keep returning -22 here, and no SYN is sent */
        goto out;
    case SS_CONNECTED:
        err = -EISCONN;
        goto out;
    case SS_CONNECTING:
        if (inet_sk(sk)->defer_connect)
            err = is_sendmsg ? -EINPROGRESS : -EISCONN;
        else
            err = -EALREADY;
        /* Fall out of switch with err, set for this state */
        break;
    case SS_UNCONNECTED:
        err = -EISCONN;
        if (sk->sk_state != TCP_CLOSE)
            goto out;

        if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) {
            err = sk->sk_prot->pre_connect(sk, uaddr, addr_len);
            if (err)
                goto out;
        }
        ... ...
        err = sk->sk_prot->connect(sk, uaddr, addr_len);
        if (err < 0)
            goto out;

        sock->state = SS_CONNECTING;
        ... ...

    /* Connection was closed by RST, timeout, ICMP error
     * or another process disconnected us.
     */
    if (sk->sk_state == TCP_CLOSE)
        goto sock_error;

    /* sk->sk_err may be not zero now, if RECVERR was ordered by user
     * and error was received after socket entered established state.
     * Hence, it is handled normally after connect() return successfully.
     */

    sock->state = SS_CONNECTED;
    err = 0;
out:
    return err;

sock_error:
    err = sock_error(sk) ? : -ECONNABORTED;
    sock->state = SS_UNCONNECTED;
    if (sk->sk_prot->disconnect(sk, flags))
        sock->state = SS_DISCONNECTING; /* this is the key: after the last SYN times out, disconnect fails and the sock state is set to SS_DISCONNECTING */
    goto out;
}
More prints were added to find out why sk->sk_prot->disconnect fails. The return value is -EBUSY, and it comes from here:
https://elixir.bootlin.com/linux/v5.15.178/source/net/ipv4/tcp.c#L2989
int tcp_disconnect(struct sock *sk, int flags)
{
    ... ...
    /* Deny disconnect if other threads are blocked in sk_wait_event()
     * or inet_wait_for_connect().
     */
    if (sk->sk_wait_pending)
        return -EBUSY; /* this is where the error is returned */
So sk_wait_pending must be non-zero. The next step is to look at where sk_wait_pending is modified:
https://elixir.bootlin.com/linux/v5.15.178/source/include/net/sock.h#L1128
#define sk_wait_event(__sk, __timeo, __condition, __wait)       \
    ({  int __rc;                                               \
        __sk->sk_wait_pending++;                                \
        release_sock(__sk);                                     \
        __rc = __condition;                                     \
        if (!__rc) {                                            \
            *(__timeo) = wait_woken(__wait,                     \
                                    TASK_INTERRUPTIBLE,         \
                                    *(__timeo));                \
        }                                                       \
        sched_annotate_sleep();                                 \
        lock_sock(__sk);                                        \
        __sk->sk_wait_pending--;                                \
        __rc = __condition;                                     \
        __rc;                                                   \
    })
sk_wait_event, in turn, is called from sk_stream_wait_connect:
https://elixir.bootlin.com/linux/v5.15.178/source/net/core/stream.c#L75
/**
 * sk_stream_wait_connect - Wait for a socket to get into the connected state
 * @sk: sock to wait on
 * @timeo_p: for how long to wait
 *
 * Must be called with the socket locked.
 */
int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
{
    DEFINE_WAIT_FUNC(wait, woken_wake_function);
    struct task_struct *tsk = current;
    int done;

    do {
        int err = sock_error(sk);
        if (err)
            return err;
        if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV))
            return -EPIPE;
        if (!*timeo_p)
            return -EAGAIN;
        if (signal_pending(tsk))
            return sock_intr_errno(*timeo_p);

        add_wait_queue(sk_sleep(sk), &wait);
        sk->sk_write_pending++;
        done = sk_wait_event(sk, timeo_p,
                             !READ_ONCE(sk->sk_err) &&
                             !((1 << READ_ONCE(sk->sk_state)) &
                               ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)), &wait);
        remove_wait_queue(sk_sleep(sk), &wait);
        sk->sk_write_pending--;
    } while (!done);
    return 0;
}
EXPORT_SYMBOL(sk_stream_wait_connect);
sk_stream_wait_connect is called on the TCP send path, when send is issued on a socket whose connection is not yet established.
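The prints showed that the connect thread and the send thread were operating on the same socket fd at the same time. Root cause: after the connect thread has sent a few SYNs, the connection attempt fails with a timeout and the kernel calls disconnect. If the send thread happens to be inside the wait-for-connect path at that moment, disconnect fails with EBUSY and the sock state is set to SS_DISCONNECTING. From then on every connect system call returns EINVAL immediately and no SYN is sent.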
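To make the sequence easier to follow, here is a small userspace toy model of it. This is only a sketch, not kernel code: the names model_sock, model_disconnect and model_connect_timeout are made up for illustration, the state machine is reduced to the three states that matter here, and every connect attempt is assumed to time out.

```c
#include <errno.h>

/* Toy stand-ins for the kernel's socket states (hypothetical, simplified). */
enum model_state { M_UNCONNECTED, M_CONNECTING, M_DISCONNECTING };

struct model_sock {
    enum model_state state; /* models sock->state */
    int wait_pending;       /* models sk->sk_wait_pending */
    int syn_sent;           /* how many attempts would have sent a SYN */
};

/* Models the sk_wait_pending check at the top of tcp_disconnect(). */
int model_disconnect(const struct model_sock *s)
{
    return s->wait_pending ? -EBUSY : 0;
}

/* Models one blocking connect() whose SYN retries all time out. */
int model_connect_timeout(struct model_sock *s)
{
    if (s->state != M_UNCONNECTED)
        return -EINVAL;         /* the "default:" branch: no SYN is sent */

    s->syn_sent++;              /* sk_prot->connect() would send the SYN */
    s->state = M_CONNECTING;

    /* Retries exhausted: the kernel takes the sock_error path. */
    s->state = M_UNCONNECTED;
    if (model_disconnect(s))
        s->state = M_DISCONNECTING; /* stuck: later connects get -EINVAL */
    return -ETIMEDOUT;
}
```

With wait_pending at 0, repeated calls keep returning -ETIMEDOUT and syn_sent keeps growing. If wait_pending is non-zero during one timeout (the send thread is waiting), that call still returns -ETIMEDOUT, but every later call returns -EINVAL and syn_sent no longer changes, which matches what strace and the packet capture showed.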
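The fix is to pass MSG_DONTWAIT in the flags argument of send, so the send thread never enters the wait-for-connect path and gets an error right away if the socket is not connected. With that change, every connect call from the connect thread sends a SYN again.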
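A minimal sketch of what the send side could look like on Linux. The helper names send_nowait and send_error_means_not_ready are hypothetical, and adding MSG_NOSIGNAL is my own assumption rather than part of the fix described above: it is there so that a send on a socket that is not connected reports an error instead of raising SIGPIPE.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: send without blocking in the kernel.
 * MSG_DONTWAIT makes the send timeout zero, so the kernel does not sleep
 * waiting for the connection to complete.
 * Returns the number of bytes sent, or -1 with errno set.
 */
ssize_t send_nowait(int fd, const void *buf, size_t len)
{
    return send(fd, buf, len, MSG_DONTWAIT | MSG_NOSIGNAL);
}

/*
 * Hypothetical classification of errno values the send thread may see while
 * the connect thread is still trying: returns 1 when the caller should just
 * try again later. Note that EPIPE can also mean the peer closed an
 * established connection, so real code may need to tell these cases apart.
 */
int send_error_means_not_ready(int err)
{
    return err == EAGAIN || err == EPIPE || err == ENOTCONN;
}
```

The send thread would call send_nowait(fd, buf, len) and, on -1, check send_error_means_not_ready(errno) before backing off and retrying, instead of sleeping inside the kernel while the connect thread is still retrying.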