Tech Sujhav : Transport Failure Detection

Transport Failure Detection

This post explains transport failure detection in general terms. How a failure is detected depends on how the network is deployed, so I will walk through it with an example.

Example

Suppose there are two nodes, Node-1 and Node-2. A peer connection is already established between them and they are exchanging messages on it. Node-1 sends a message, MESSAGE-X, to Node-2 and does not receive a response. How long should Node-1 wait for the response (say 10 ms)? Should it retry (say yes)? How many times should it retry (say twice)? All of these are deployment specific.
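The deployment-specific settings in this example can be sketched as a small retry policy. This is only an illustration: `RetryPolicy`, `send_with_retry`, and the `send` callback are hypothetical names, and the values (10 ms, 2 retries) are the ones assumed in the example, not protocol defaults.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RetryPolicy:
    # Values assumed in the example; real deployments choose their own.
    response_timeout_s: float = 0.010  # wait 10 ms for the answer
    max_retries: int = 2               # retransmit at most twice


def send_with_retry(send: Callable[[bytes, float], Optional[bytes]],
                    message: bytes,
                    policy: RetryPolicy) -> Optional[bytes]:
    """Send `message`, retrying per `policy`. Returns the answer or None.

    `send(message, timeout)` is a hypothetical transport callback that
    returns the answer bytes, or None if nothing arrived within `timeout`.
    """
    attempts = 1 + policy.max_retries  # first transmission plus retries
    for _ in range(attempts):
        answer = send(message, policy.response_timeout_s)
        if answer is not None:
            return answer
    return None  # no answer at all: the caller moves on to the watchdog check
```

With the default policy, an unanswered MESSAGE-X is transmitted three times in total before the function gives up.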

Once all the deployment-specific conditions have been exhausted, Node-1 checks whether the network connection is broken. To do this, Node-1 sends a DWR (Device-Watchdog-Request) to Node-2. If it does not receive a DWA (Device-Watchdog-Answer) within a specific period, it retries the DWR, for up to 3 attempts in total (including the first DWR). If no DWA is received for any of the DWRs, Node-1 treats the situation as a connection failure and sends subsequent messages to the secondary peer.
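A minimal sketch of this probe, following the description in this post (three DWR attempts including the first). The names `peer_transport_alive` and `send_dwr` are hypothetical, and the attempt count is the one used in the example; RFC 3539 specifies the watchdog in terms of a Tw timer, so real stacks may behave differently.

```python
from typing import Callable

MAX_DWR_ATTEMPTS = 3  # includes the first DWR, as in the example


def peer_transport_alive(send_dwr: Callable[[], bool],
                         max_attempts: int = MAX_DWR_ATTEMPTS) -> bool:
    """Return True as soon as any DWR is answered by a DWA.

    `send_dwr()` is a hypothetical callback: it sends one DWR and returns
    True if a DWA arrived within the deployment's watchdog timeout.
    """
    for _ in range(max_attempts):
        if send_dwr():
            return True
    return False  # no DWA for any DWR: treat as a connection failure
```

A caller would switch to the secondary peer only when this function returns False.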

If Node-1 receives a DWA carrying an error, that does not mean the connection has failed: the DWA arrived over the very transport connection Node-1 was checking. A DWA with an error, such as DIAMETER_TOO_BUSY or any other error code, only informs Node-1 of the status of Node-2.
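The distinction can be expressed in a few lines. This is a sketch with hypothetical function names; the two Result-Code values are the standard base-protocol ones (2001 for DIAMETER_SUCCESS, 3004 for DIAMETER_TOO_BUSY).

```python
from typing import Optional

DIAMETER_SUCCESS = 2001
DIAMETER_TOO_BUSY = 3004


def transport_is_up(dwa_result_code: Optional[int]) -> bool:
    """Any DWA, even one carrying an error, proves the transport works.

    `dwa_result_code` is None when no DWA was received at all.
    """
    return dwa_result_code is not None


def peer_reports_busy(dwa_result_code: Optional[int]) -> bool:
    """True when the peer answered but said it is too busy."""
    return dwa_result_code == DIAMETER_TOO_BUSY
```

Only a missing DWA counts towards transport failure; an error DWA is information about the peer, not about the link.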

Failover

The process of detecting a transport connection failure with a peer and forwarding all pending messages to the secondary peer node (alternate node) is known as failover.
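A rough sketch of the forwarding step, assuming the pending messages are already-encoded Diameter requests held in a queue. `fail_over`, `mark_retransmitted`, and `send_to_secondary` are hypothetical names. The sketch also sets the T (potentially re-transmitted) flag, which the base protocol defines for requests resent after a link failover; the flags octet is the fifth byte of the Diameter header.

```python
from collections import deque
from typing import Callable, Deque

T_FLAG = 0x10  # "potentially re-transmitted" bit in the command flags


def mark_retransmitted(message: bytes) -> bytes:
    """Return a copy of an encoded request with the T flag set."""
    flags = message[4] | T_FLAG  # byte 4 of the header holds the flags
    return message[:4] + bytes([flags]) + message[5:]


def fail_over(pending: Deque[bytes],
              send_to_secondary: Callable[[bytes], None]) -> int:
    """Forward every pending request, in order, to the secondary peer.

    Returns the number of messages forwarded; `pending` is left empty.
    """
    forwarded = 0
    while pending:
        send_to_secondary(mark_retransmitted(pending.popleft()))
        forwarded += 1
    return forwarded
```

Draining the queue in order keeps the original request sequence towards the alternate node.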

AVP Structure of DWR and DWA

Device-Watchdog-Request

<DWR> ::= < Diameter Header: 280, REQ >
          { Origin-Host }
          { Origin-Realm }
          [ Origin-State-Id ]
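To make the layout concrete, here is a hedged sketch that encodes a DWR by hand. The function names are hypothetical; the numeric values are the standard ones (command code 280, Origin-Host 264, Origin-Realm 296, Origin-State-Id 278). It sets the M bit on every AVP and uses Application-Id 0, and it does not handle vendor-specific AVPs.

```python
import struct
from typing import Optional

CMD_DEVICE_WATCHDOG = 280
FLAG_REQUEST = 0x80          # R bit in the command flags
AVP_FLAG_MANDATORY = 0x40    # M bit in the AVP flags
AVP_ORIGIN_HOST = 264
AVP_ORIGIN_REALM = 296
AVP_ORIGIN_STATE_ID = 278


def encode_avp(code: int, data: bytes) -> bytes:
    """Encode one non-vendor AVP: 8-byte header, data, zero padding."""
    length = 8 + len(data)  # the AVP Length field excludes the padding
    header = struct.pack("!IB3s", code, AVP_FLAG_MANDATORY,
                         length.to_bytes(3, "big"))
    padding = b"\x00" * (-len(data) % 4)  # align to a 4-byte boundary
    return header + data + padding


def encode_dwr(origin_host: str, origin_realm: str,
               hop_by_hop: int, end_to_end: int,
               origin_state_id: Optional[int] = None) -> bytes:
    """Build a Device-Watchdog-Request as raw bytes."""
    avps = encode_avp(AVP_ORIGIN_HOST, origin_host.encode("ascii"))
    avps += encode_avp(AVP_ORIGIN_REALM, origin_realm.encode("ascii"))
    if origin_state_id is not None:  # optional AVP
        avps += encode_avp(AVP_ORIGIN_STATE_ID,
                           struct.pack("!I", origin_state_id))
    total = 20 + len(avps)  # the Diameter header is 20 bytes
    header = (bytes([1]) + total.to_bytes(3, "big")           # version, length
              + bytes([FLAG_REQUEST])
              + CMD_DEVICE_WATCHDOG.to_bytes(3, "big")
              + struct.pack("!III", 0, hop_by_hop, end_to_end))  # app id 0
    return header + avps
```

For example, `encode_dwr("node1.example.com", "example.com", 1, 2, 5)` yields a message whose header carries command code 280 with the request bit set.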

Device-Watchdog-Answer

<DWA> ::= < Diameter Header: 280 >
          { Result-Code }
          { Origin-Host }
          { Origin-Realm }
          [ Error-Message ]
        * [ Failed-AVP ]
          [ Original-State-Id ] = [ Origin-State-Id ]
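On the receiving side, the Result-Code can be pulled out of a DWA with a small AVP walker. This is a sketch under simplifying assumptions: `parse_avps` and `dwa_result_code` are hypothetical names, the message is assumed to be complete and well formed, and grouped AVPs such as Failed-AVP are returned as raw bytes without being unpacked. Result-Code is AVP code 268.

```python
import struct
from typing import Dict, List

AVP_RESULT_CODE = 268
HEADER_LEN = 20  # fixed Diameter header size


def parse_avps(message: bytes) -> Dict[int, List[bytes]]:
    """Map AVP code -> list of raw data fields found in `message`."""
    avps: Dict[int, List[bytes]] = {}
    offset = HEADER_LEN
    while offset + 8 <= len(message):
        code, flags = struct.unpack_from("!IB", message, offset)
        length = int.from_bytes(message[offset + 5:offset + 8], "big")
        if length < 8:
            raise ValueError("malformed AVP length")
        header_len = 12 if flags & 0x80 else 8  # V bit adds a Vendor-Id
        data = message[offset + header_len:offset + length]
        avps.setdefault(code, []).append(data)
        offset += length + (-length % 4)  # skip padding to a 4-byte boundary
    return avps


def dwa_result_code(message: bytes) -> int:
    """Return the Result-Code carried by an encoded DWA."""
    (value,) = struct.unpack("!I", parse_avps(message)[AVP_RESULT_CODE][0])
    return value
```

The returned code can then be compared against values such as 2001 (success) or 3004 (too busy).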

AVP Description

Failed-AVP: a grouped AVP that provides debugging information when a request is rejected or an error occurs during processing, for example when an AVP is not supported.

Error-Message: describes the error in human-readable form.

Original-State-Id: a misprint in RFC 3588 (corrected in RFC 6733). It is actually Origin-State-Id.

Origin-State-Id: used to infer the state of a peer. Whenever a node loses its state, for instance because of a reboot, it increases the value so that the other node learns that its peer's state has changed and that all previous sessions are no longer valid. The value is typically kept in non-volatile memory so that it survives a restart.

Origin-State-Id increases monotonically each time the node restarts with loss of state. Each communicating node stores the last value received from its peer so that it can detect such a change. Answers are matched to their requests by the Hop-by-Hop Identifier, not by this AVP.
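The bookkeeping on the receiving node can be sketched as follows. `PeerStateTracker` is a hypothetical helper, not part of any Diameter stack; it assumes peers are identified by their Origin-Host and simply ignores values lower than the last one seen.

```python
from typing import Dict


class PeerStateTracker:
    """Remember the last Origin-State-Id seen from each peer."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, int] = {}

    def peer_restarted(self, origin_host: str, origin_state_id: int) -> bool:
        """Return True if `origin_state_id` is higher than the stored value.

        The first value seen from a peer is only recorded; it does not
        count as a restart.
        """
        previous = self._last_seen.get(origin_host)
        if previous is None or origin_state_id > previous:
            self._last_seen[origin_host] = origin_state_id
            return previous is not None
        return False
```

When `peer_restarted` returns True, the node would treat the sessions it held with that peer before the restart as lost.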

Your comments, suggestions, and questions are always welcome. I will try to clarify doubts to the best of my knowledge, so feel free to ask.

78 comments:

Rishi, February 7, 2012 at 9:50 AM
Hi Vinay,

Thanks for this article. I have a query, though.

When no Origin-State-Id is sent in the DWR, what Origin-State-Id value should we expect in the DWA message?

I'm facing an issue where an Origin-State-Id with invalid AVP bits is received in the DWA when no Origin-State-Id is sent in the DWR. The error is shown below:

#### <> <> <> <1322126427209>
180.20.100.90
origin.com
N/A
2001

Regards,
Rishi
Nicolas, April 24, 2012 at 6:00 PM
Hi Vinay,
Thank you for the article.
Let's take peer 1, configured to send a DWR every 30 seconds if no traffic is detected.
Peer 2 is configured the same way.
I'd like to verify something:

At T0, peer 1 sends a DWR
At T0+30, peer 2 sends a DWR
At T0+60, peer 1 sends a DWR

Do you think the DWR is considered traffic, so that peer 1, on receiving the DWR at T0+30, would wait another 30 seconds before sending its second DWR, that is, at T0+60?

Thank you
Nicolas.
Unknown, June 19, 2012 at 4:44 PM
Hello Vinay,

Thanks for the nice article. Let's say there is an x-request message and the sender is waiting for the y-answer message. How long will the device wait for the answer? Is it application specific or session specific (depending on the particular session, say an IP-CAN session for Gx)?
kamal, April 15, 2013 at 10:48 PM
Hi Vinay,

Does the watchdog timer need to be enabled separately, or are DWR/DWA triggered by default?
Unknown, May 3, 2013 at 12:49 PM
Hi,

The first DWR got a DWA, but immediately after receiving the DWA the client sends a second DWR, and after that we get the error "SCTP: ABORT: User Initiated Abort". Is the issue with the DWR timer value or with the association?

Unknown, May 31, 2013 at 10:48 PM
Hi,

What if Node-1 does not send the DWR? It only sends the CER and receives the CEA, and that is all.
Unknown, June 1, 2013 at 2:26 AM
I have a problem: Node-1 does not send the DWR. Does anyone know what is happening? Node-1 sends the CER and receives the CEA, but that's all; the connection is not established.
VV, July 29, 2013 at 5:25 PM
A question on transport failure detection in Diameter.
Say I have a Diameter peer connection established and my watchdog timer is 30 seconds.
Now suppose I do an ifconfig down on the IP interface over which the peer connection is established.
How long will it take my local Diameter layer to detect that the IP interface has gone down? Will this be immediate, or will it have to go through the watchdog procedure?

thanks,
Vijaya
Vijay, August 11, 2013 at 5:49 PM
If Origin-State-Id is sent in the CER with value 0, is it mandatory to send Origin-State-Id set to 0 in the CEA message?
Unknown, October 7, 2013 at 7:21 PM
Hi Vinay, I have a couple of questions regarding transport failure.

Let's say, as per your example, we have Node 1 and Node 2 connected and exchanging messages.

If I understand RFC 3539 correctly, the Tw timer is reset (with jitter) for every answer message. So, as you say, in the busy hour the DWR is never sent.

So let's say Node 1 has sent a CCR to Node 2 and the response timeout (10 ms in your example) expires. Node 1 checks whether it should retry ("yes" and twice, as per your example), so we would see two more attempts completed before Node 1 stops retrying the request. Each retry would reset the Tw timer.

A couple of things I need some help with:
- I'm not sure I understand why the DWR would be initiated after 3 failures (as per local config). I assume this is because Tw is reset on answers and not on requests, so although more requests may be sent, the lack of answers means that Tw will expire.
- How does the Credit Control Tx timer overlay onto the base response timeout? For example, if Tx was 5 ms and we set the Credit Control application to terminate, so that no further attempts are made, does this override the base config?
- Lean hour vs. busy hour: RFC 3539 suggests that in a busy hour it may take 2Tw to fail over. I assume this is because only a DWR/DWA failure can be used to infer that the peer is down?

Kind regards Jim

Vijay, November 14, 2013 at 4:36 PM
How is the DWR exchange different from the SCTP HEARTBEAT mechanism? A Diameter node using SCTP as its transport layer will detect transport failures anyway through the HEARTBEAT messages exchanged between the two SCTP endpoints, so why is there still a need to exchange DWR/DWA messages to detect transport failures?
Vijay, November 26, 2013 at 6:16 PM
I have a query regarding the content of the Failed-AVP AVP to be encoded when a Diameter node returns the DIAMETER_MISSING_AVP error. The RFC says the following:
7.1.5. Permanent Failures
DIAMETER_MISSING_AVP 5005
The request did not contain an AVP that is required by the Command
Code definition. If this value is sent in the Result-Code AVP, a
Failed-AVP AVP SHOULD be included in the message. The Failed-AVP
AVP MUST contain an example of the missing AVP complete with the
Vendor-Id if applicable. The value field of the missing AVP
should be of correct minimum length and contain zeroes.

7.5. Failed-AVP AVP
……
A Diameter message SHOULD contain one Failed-AVP AVP, containing the
entire AVP that could not be processed successfully. If the failure
reason is omission of a required AVP, an AVP with the missing AVP
code, the missing Vendor-Id, and a zero-filled payload of the minimum
required length for the omitted AVP will be added.

I am confused about the value to be encoded as defined in the two sections above (one says the value field should contain zeroes and the other says it should be a zero-filled payload).
What is the expected result? Should the value field be left empty, or should it be encoded with the value "00", which is one byte, followed by the padding bytes?

Sam, January 27, 2014 at 7:54 PM
Hello,

In the example above, suppose there is an underlying transport link failure between Node-1 and Node-2, but Node-1 does not yet regard Node-2 as a suspect Diameter peer because Tw has not expired; the DWR/DWA exchange that would conclude that Node-2 is suspect and that the transport link has failed has not taken place either.

Questions:

1) I believe the Tx timer in Node-1 keeps expiring and Node-1 keeps sending the CCR to Node-2, setting the T-bit on each retransmission, until its configured number of retransmissions is reached?

2) If Tw expires during this window and Node-1 starts sending DWRs towards Node-2 before it has exhausted its configured number of CCR retransmissions, can Node-1 send the CCR and the DWR towards Node-2 simultaneously?

Thanks.

Sam
vijay, February 4, 2014 at 11:08 PM
Hi all,
Can anyone help with this?

1) Have you ever used the Seagull tool as a client for driving an Sy call flow?
When I use Seagull as a client, my requirement is to set a timeout. During that time a DWR is received from the server by the Seagull client, and Seagull responds with a DWA. After that, the server sends a subsequent DWR, but Seagull never sends a DWA.

Has anyone faced this problem? Kindly provide a solution.

2) When no traffic is exchanged between two nodes within 30 min, DWR and DWA are initiated. Is this time configurable on both server and client?

Point 2 is applicable to 3GPP standards. Can we configure the DWR/DWA time on both the client and the server side?

Please correct me if I am wrong.

Thanks in advance
Devesh Prakash, April 8, 2014 at 11:44 AM
Hi Team-Diameter,

I have two questions.

1. If a connection to the Diameter server is already established and we try to open a second connection to the server using the same client identity, how will the server react?

2. If the new Origin-State-Id in the CER is greater than the older Origin-State-Id, will the server clear any old socket with the same Diameter client (if there is one, where the server uses the watchdog mechanism to determine the connection state but the watchdog timer has not yet expired)?
Hari, June 10, 2014 at 12:05 PM
Hi Team,

I am getting the "DIAMETER_LOGOUT" error.

Could anyone please let me know what the reason might be?

Regards,
Harish
Unknown, June 26, 2014 at 9:42 PM
Hi Team,

I have a scenario where Node A sent a Capabilities-Exchange-Request and Node B sent a Capabilities-Exchange-Answer with the Diameter success result code. After 29.79 sec, Node B initiates a watchdog request, but Node A does not send any response to it.
Then, after 30.28 sec, Node A initiates an SCTP abort with the error cause User-Initiated Abort.

User Initiated Abort (12)

Cause of error
--------------

This error cause MAY be included in ABORT chunks which are send
because of an upper layer request. The upper layer can specify
an Upper Layer Abort Reason which is transported by SCTP
transparently and MAY be delivered to the upper layer protocol
at the peer.

Now the questions :)
1. Why did Node A send the SCTP abort (user-initiated)? Is it because the upper layer, i.e. Diameter, did not receive the watchdog request, so Diameter asked SCTP to initiate the abort?
2. What could be the reason for Diameter to ask SCTP to initiate the abort (is it a transport layer failure detected by Diameter)?
3. After a successful capabilities exchange request and answer, which node initiates the watchdog request if there is no Diameter traffic?

Thanks in advance .
Regards
Victor
Unknown, July 15, 2014 at 7:30 PM
Hi Team,

Thanks for your reply. I agree with you that there is some problem with the transport layer.
Yes, there is an SCTP heartbeat message sent from Node A to which Node B did not respond.

The SCTP messages exchanged between the two nodes are:

Node A                                Node B
INIT ------------------------------->
<------------------------------- INIT_ACK
COOKIE_ECHO ------------------------>
<----------------------------- COOKIE_ACK
(after this, Diameter is established)
CER -------------------------------->
<------------------------------------ CEA

SCTP heartbeat --------------------->
<------------------------------------ DWR
SCTP abort ------------------------->

So, from the above, because Node B did not respond to the SCTP heartbeat message, Node A sends the SCTP abort message.
Just one last question :) Why did Node A not respond to the DWR? Is it because of the transport layer, i.e. Node A did not receive the DWR message, and could the same be the reason Node B did not respond to the heartbeat message?

Am I right? Kindly let me know your views too.

Thanks and regards
Victor
Unknown, January 23, 2015 at 9:41 AM
Hi,
Since DWR/DWA are not part of the load, is it possible for a Diameter peer to combine a DWA with another application message?
I do see such combined messages in the same packet.

Thank you
Unknown, February 4, 2015 at 10:33 PM
Thank you for your reply.
It's a case where, for example, I see the ULR message and within the same packet I also see the DWR/DWA, so it appears (in Wireshark) as follows:
DIAMETER 970 cmd=3GPP-Update-Location Answer(316) flags=-P-- appl=3GPP S6a/S6d(16777251) h2h=1326b498 e2e=1326b498 | cmd=Device-Watchdog Answer(280) flags=---- appl=Diameter Common Messages(0) h2h=2987f1 e2e=2987f1 |

Is this normal?
Naseem Rahman, March 11, 2015 at 9:39 PM
Hi, what happens when you have a DRA in place? NODE1 sends messages to NODE2 through the DRA, so if NODE2 is down, NODE1 has no idea about NODE2?
Unknown, September 9, 2015 at 4:19 PM
As per RFC 3539 section 3.4.1:

Suppose there are two nodes, Node A and Node B. Node A has detected inactivity (no request/response received for Tw) and has initiated a DWR to Node B. Suppose Node A has not received anything in another Tw (so 2 Tw have elapsed). What should the behaviour of Node A be?
1. Node A should fail over the traffic towards the secondary node (if available).
2. Node A should fail over the traffic, initiate a DWR again, and not break the transport connection (with primary Node B).
3. Node A should fail over the traffic and directly break the transport connection (and then try reconnecting to this node).

I presume that this behaviour is the same for TCP and SCTP.
Unknown, December 23, 2015 at 9:35 AM
Hi,
Suppose Node A sends a DWR to Node B, receives nothing within another Tw, and sets the pending flag.
After that, Node A sends another DWR and receives no DWA within another Tw, but it continuously receives CCR messages during that Tw.

In this case, according to the RFC implementation, Tw is continuously reset but the pending flag stays set, so failover will happen some time later.
I think the pending flag should be reset when non-DWA messages are received. What do you think?
Unknown, December 24, 2015 at 5:38 AM
Yes, I know that.
But I supposed that the first DWR failed and a second DWR was sent and not answered with a DWA, while other messages were still being received at that time.
It is a hypothetical case.
I am curious about the logic of this RFC implementation.
I am working in a telecom company.
Unknown, December 24, 2015 at 6:21 PM
Thank you very much for the fast response.
I didn't mean to argue; I am just curious and want to understand vendor-specific implementations. I am just wondering about the RFC failover algorithm.
Here is a hypothetical situation.

This is just a hypothetical situation for verifying the failover algorithm based on the RFC (not a real situation).

If two DWRs fail, then failover occurs.

Node-1 Node-2
----------------------------

pending flag=0
(no load on link)

DWR_1 ------------->
<---x(fail)--- DWA_1

pending flag=1
(Sudden load applied, e.g. continuous CCR or CCA incoming)
(continuous timer reset, so no second DWR_2 is triggered,
but the pending flag is still set to 1)

(( 1 month later ))

pending flag=1
(no load on link)

DWR_2 ------------->
<---x(fail)--- DWA_2

pending flag=2
(failover occurred because the pending flag was already set
1 month earlier)

The problem is that just one DWR failure causes failover, because of the pending flag that was set 1 month earlier.

Pankaj Goel, June 28, 2016 at 1:08 PM
I understand the significance of Origin-State-Id in the CER, but I do not understand how it is handled if it is sent in a DWR. What is the significance of sending it in a DWR (or, in fact, in any other application message such as a CCR)?
Unknown, July 31, 2017 at 5:58 PM
Hello,

My query concerns the standard: RFC 6733, page 66, mentions that when a transport failure is detected, the DWR message MUST NOT be sent to an alternate peer. Could you please elaborate on this?
Golan, July 31, 2018 at 3:29 PM
Hello Team-Diameter,
I have a simple question.
Can a Diameter server, such as a credit control server, send a DWR to the client?
The client team said their node cannot accept a DWR from the server. Is that true?
I looked for an answer in the RFC documents, but I did not find any reference for that.
Best regards
Golan
Unknown, April 18, 2019 at 11:20 PM
Hi,

I am getting an error during the exchange of Diameter messages. I'm acting as a Diameter server. The CER/CEA exchange happens properly, and so does DWR/DWA, but in between I get the error "connection reset by peer". What could be the reason behind this?
Mahammad, May 28, 2019 at 8:02 PM
Nice article.
How is a transport layer failure detected in the case of a DWR timeout?
Pankaj pandey, February 29, 2020 at 3:25 AM
Can you please explain, with examples, the 3 causes in the DPR that are configurable in a DRA/DSC? What are the possible reasons for each?
Shuvhashis Paul, May 10, 2020 at 10:03 AM
Hi,

Consider that Client-A has two links to Client-B, each with a different realm. If one of the links to Client-B is disconnected, Client-A sends the same request/update message with the same Session-Id over the redundant link or realm.

Questions:
1. Will Client-B treat it as a new request, since the Hop-by-Hop Identifier changes for the other realm while the Session-Id stays the same?

2. How will Client-B identify a request/update as unique? Does it use only the Diameter Session-Id, or a combination of Session-Id and Hop-by-Hop Identifier?
Unknown, December 15, 2021 at 11:33 PM
What will be the issue if the new Origin-State-Id is lower than the current one?

Thanks,
Robin
