We all experienced calls getting self disconnected after 5-10 seconds – usually disconnected by the callee side via a BYE request – but a BYE which was not triggered by the party behind the phone, but by the SIP stack/layer itself.This is one of the most common issues we get in SIP and one of the most annoying in the same time. But why it happens ?
Getting to the missing ACK
Such a decision to auto-terminate the call (beyond the end-user will and control) indicates an error in the SIP call setup. And because the call was somehow partially established (as both end-points were able to exchange media), we need to focus on the signalling that takes place after the 200 OK reply (when the call is accepted by the callee). So, what do we have between the 200 OK reply and the full call setup ? Well, it is the ACK requests – the caller acknowledgement for the received 200 OK.And according to the RFC3261, any SIP device not receiving the ACK to its final 2xx reply has to disconnect the call by issuing a standard BYE request.So, whenever you experience such 10 seconds disconnected calls, first thing to do is to do a SIP capture/trace and to check if the callee end-device is actually getting an ACK. It is very, very import to check for ACK at the level of the callee end-device, and not at the level of caller of intermediary SIP proxies – the ACK may get lost anywhere on the path from caller to callee.
Tracing the lost ACK
In order to understand how and where the ACK gets lost, we need first to understand how the ACK is routed from caller to the callee’s end-device. Without getting into all the details, the ACK is routed back to callee based on the Record-Route and Contact headers received into the 200 OK reply. So, if the ACK is mis-routed, it is mainly because of wrong information in the 2oo OK.The Record-Route headers (in the 200 OK) are less to blame, as they are inserted by the visited proxies and not changed by anyone else. Assuming that you do not have some really special scenarios with SIP proxies behind NATs, we can simply discard the possibility of having faulty Record-Routes.So, the primary suspect is the Contact header in the 200 OK – this header is inserted by the callee’s end-device and it can be altered by any proxy in the middle – so there are any many opportunities to get corrupted. And this mainly happens due to wrong handling of NAT presence on end-user side – yes, that’s it, a NATed callee device.
No NAT handling
If the proxy does not properly handle NATed callee device, it will propagate into the _200 OK_reply the private IP of the callee. And of course, this IP will be unusable when comes to routing back the ACK to the callee – the proxy will have the “impossible” mission to route to a private IP :). So, the ACK will get lost and call will get disconnected.
The correct handling and flow has to be like this:
Excessive NAT handling
While handling NATed end-points is good, you have to be careful not to over do it. If you see a private IP in the Contact header you should not automatically replace it with the source IP of the SIP packet. Or you should not do it for any incoming reply (like “let’s do it all the time, just to be sure”).
In a more complex scenarios where a call may visit multiple SIP proxies, the proxies may loose valuable routing information by doing excessive NAT traversal handling. Like in the scenario below, ProxyA is over doing it, by applying the NAT traversal logic also for calls coming from a proxy (ProxyB) and not only for replies coming from an end-point. By doing this, the IP coordinates of the callee will be lost from Contact header, as ProxyA has no direct visibility to callee (in terms of IP).
In such a case, with OpenSIPS, you will have to review your logic in the onreply route and to be sure you perform fix_nated_contact() for the 200 OK only if the reply comes from an end-point and not from another proxy.
SIP is complicated and you have to pay attention to all the details, if you want to get it to work. Focusing only on routing the INVITE requests is not sufficient.If you come across disconnected calls:
- get a SIP capture/trace and see if the ACK gets to the callee end-point
- if not, check the Contact header in the 200 OK – it must point all the time to the callee end-point (a public IP)
- if not, check the NAT traversal logic you have in the onreply routes – be sure you do the Contact fixing only when it is needed.
Shortly, be moderate, not too few and not too much …when comes to NAT handling