XT9 Link Aggregation

Mbrossart
Level 9

I’m trying to get Link Aggregation to work between an XT9 router and a NETGEAR GS110EMX switch.  Both devices support 802.3ad.

  • On the XT9, in LAN | Switch Control, I enable Bonding/Link Aggregation.
  • On the GS110EMX:
    • Under LAG | LAG Membership, I select the ports to aggregate.
    • Under LAG | LAG Configuration, I enable the LAG ID I just created.
  • I cable the switch ports defined under LAG Membership to the router LAN ports 2 & 3 per the ASUS instructions.

After the router reboots:

  • The switch reports the LAG is up.
  • Switch monitors show traffic is being passed on both ports of the LAG.
  • Link lights on both ports of the LAG flutter as expected.

However, my XT9 nodes no longer backhaul over Ethernet.  They backhauled over Ethernet just fine before I enabled LAG.  Sometimes one node will backhaul over Ethernet but never both.

The instructions I find for ASUS LAG always refer to connecting a NAS device.  Does ASUS LAG not work with network switches?  Are there special routing considerations?  Has anyone else done this?

1 ACCEPTED SOLUTION

Accepted Solutions

Mbrossart
Level 9

Thank you both so much for your time, attention and input.  I think I may have it solved, and I think the root cause was bad documentation.  When I set up LAG, my XT9 tells me to plug into ports 2 & 3.  I am now plugged into ports 1 & 2, and everything is running fine.  Until this configuration, I always either got the link lights pacing back and forth or I had to turn off loop prevention, and I still got a whole lot of anomalous routing issues.  In this configuration, I can re-enable loop prevention and the switch doesn’t complain.  I think plugging into 2 & 3 is what was causing the loop in the first place.

I did make a few changes too, but testing these out on ports 2 & 3 didn’t solve the problem.

  • Changed the LACP System Priority on my switch to 1 based on what I was seeing in your readout, Murph_9000.
  • Changed the LACP Priority on my aggregated ports on my switch to 1 based on what I was seeing in the readout.
  • Changed the timeout to Short on my aggregated ports on my switch based on your input, Murph_9000.
  • Changed the ports used on my XT9 to 1 & 2 as stated above.
  • Changed the uplink priority of my nodes to my Ethernet link.  Even after the other changes, one of my nodes still flipped to WiFi, but once I changed the preference to Ethernet, it flipped right back easily.  I don’t know if this is an indication that there’s still an issue with the Ethernet route on this node.  I’d be surprised; it’s a 2.5Gb backhaul.

For comparison, here’s my new readout.

mbrossart@ffw:/tmp/home/root# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): count
System priority: 65535
System MAC address: a0:36:bc:62:be:e0
Active Aggregator Info:
Aggregator ID: 3
Number of ports: 1
Actor Key: 9
Partner Key: 1
Partner Mac Address: e0:46:ee:10:25:5f

Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: a0:36:bc:62:be:e0
Slave queue ID: 0
Aggregator ID: 3
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: a0:36:bc:62:be:e0
port key: 9
port priority: 255
port number: 1
port state: 63
details partner lacp pdu:
system priority: 1
system mac address: e0:46:ee:10:25:5f
oper key: 1
port priority: 1
port number: 8
port state: 63

Slave Interface: eth3
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 0
Permanent HW addr: a0:36:bc:62:be:e0
Slave queue ID: 0
Aggregator ID: 4
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 1
details actor lacp pdu:
system priority: 65535
system mac address: a0:36:bc:62:be:e0
port key: 0
port priority: 255
port number: 2
port state: 71
details partner lacp pdu:
system priority: 65535
system mac address: 00:00:00:00:00:00
oper key: 1
port priority: 255
port number: 1
port state: 1
mbrossart@ffw:/tmp/home/root#
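
For anyone reading a readout like this one, here is a rough Python sketch (an illustration only, not an ASUS or kernel tool) that lists which slave interfaces report the same Aggregator ID as the active aggregator.  The function name active_members and the trimmed SAMPLE text are my own assumptions; the field labels are copied from the readout above, where eth2 reports Aggregator ID 3 and eth3 reports Aggregator ID 4 with MII Status down.

```python
def active_members(readout):
    """Return (active aggregator ID, interfaces that joined it)."""
    active_id = None
    current = None  # None while still in the bond-level header
    members = []
    for raw in readout.splitlines():
        key, _, value = raw.strip().partition(":")
        value = value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "Aggregator ID" and value.isdigit():
            if current is None:
                # First Aggregator ID appears under "Active Aggregator Info"
                active_id = int(value)
            elif int(value) == active_id:
                members.append(current)
    return active_id, members


# Trimmed copy of the relevant fields from the readout above.
SAMPLE = """\
Active Aggregator Info:
Aggregator ID: 3
Number of ports: 1

Slave Interface: eth2
MII Status: up
Aggregator ID: 3

Slave Interface: eth3
MII Status: down
Aggregator ID: 4
"""

if __name__ == "__main__":
    print(active_members(SAMPLE))
```

The router itself may not have Python, so the simplest use is to paste the output of cat /proc/net/bonding/bond0 into a file on a PC and pass its contents to the function.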


13 REPLIES

Murph_9000
Level 14

I'm happily running a LACP LAG from a GT-AX6000 on the 9.0.0.6.102 beta to a Cisco CBS350 switch.  I don't particularly stress the link, and I'm not running AiMesh, but it seems to work well.

I believe LACP is mandatory for ASUSWRT link aggregation, but it might be optional on your switch, so make sure it's enabled for your LAG.  Some switches default to LACP disabled, for old-style static LAG, rather than 802.3ad LACP dynamic LAG.  If ASUSWRT does not receive LACPDUs on a member port, I don't think it will activate that port.

I'm not sure about the XT9, but there's not an easy way to just see the full status of LACP on the GT-AX6000 in the web GUI.  You can check the overall status on the command line as follows:

root@GT-AX6000:/tmp/home/root# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): count
System priority: 65535
System MAC address: 04:42:1a:xx:xx:xx
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 1
Partner Key: 1007
Partner Mac Address: 34:b8:83:xx:xx:xx

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 04:42:1a:xx:xx:xx
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 04:42:1a:xx:xx:xx
port key: 1
port priority: 255
port number: 1
port state: 63
details partner lacp pdu:
system priority: 1
system mac address: 34:b8:83:xx:xx:xx
oper key: 1007
port priority: 1
port number: 9
port state: 63

Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 04:42:1a:xx:xx:xx
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: 04:42:1a:xx:xx:xx
port key: 1
port priority: 255
port number: 2
port state: 63
details partner lacp pdu:
system priority: 1
system mac address: 34:b8:83:xx:xx:xx
oper key: 1007
port priority: 1
port number: 10
port state: 63
root@GT-AX6000:/tmp/home/root#

You can check more briefly whether the ports are active in the bond, along with some other basic status, on the command line as follows (substitute your own interfaces for eth1 and eth2):

root@GT-AX6000:/tmp/home/root# cat /sys/class/net/bond0/bonding/all_slaves_active
1
root@GT-AX6000:/tmp/home/root# cat /sys/class/net/bond0/bonding/mode
802.3ad 4
root@GT-AX6000:/tmp/home/root# cat /sys/class/net/bond0/bonding/slaves
eth1 eth2
root@GT-AX6000:/tmp/home/root# cat /sys/class/net/bond0/lower_eth1/bonding_slave/state
active
root@GT-AX6000:/tmp/home/root# cat /sys/class/net/bond0/lower_eth2/bonding_slave/state
active
root@GT-AX6000:/tmp/home/root#
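
If you would rather not eyeball the long /proc/net/bonding/bond0 readout, a small Python sketch like the one below can pull out the per-interface fields discussed in this thread.  This is only an illustration under my own assumptions: the function name summarize_bond and the trimmed SAMPLE text are made up for the example, and the parser relies on the field labels printed by the v3.7.1 bonding driver shown above.

```python
def summarize_bond(readout):
    """Map each slave interface to its MII status and LACP port states."""
    slaves = {}
    current = None  # interface currently being parsed
    side = None     # "actor" or "partner" PDU section
    for raw in readout.splitlines():
        line = raw.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
            slaves[current] = {}
            side = None
        elif current is None:
            continue  # bond-level header, not a slave section
        elif line.startswith("MII Status:"):
            slaves[current]["mii"] = line.split(":", 1)[1].strip()
        elif line.startswith("details actor lacp pdu"):
            side = "actor"
        elif line.startswith("details partner lacp pdu"):
            side = "partner"
        elif line.startswith("port state:") and side:
            slaves[current][side + "_state"] = int(line.split(":", 1)[1])
    return slaves


# Trimmed example text using the same labels as the readouts in this thread.
SAMPLE = """\
MII Status: up

Slave Interface: eth2
MII Status: up
details actor lacp pdu:
port state: 63
details partner lacp pdu:
port state: 61
"""

if __name__ == "__main__":
    for name, info in summarize_bond(SAMPLE).items():
        print(name, info)
```

Run it on a PC against a saved copy of the readout; the bond-level "MII Status" line is skipped on purpose so that only the per-slave values are reported.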

 

Hey @Murph_9000, just checking your solution.  You don’t have to bypass or tweak loop protection in any way, do you?  Any static routes on your router?  Do you have any cool analysis tricks to check for loops?

jzchen
Level 16

I don’t have these products in my household I’m afraid…

There’s a fairly recent firmware available for the XT9 as well as for your switch, have you updated to the latest? 3.0.0.4.388.23012 and 1.0.2.7 respectively.

It isn’t clear how you’ve connected the XT9 nodes, so I’m gonna assume you’ve connected them to ports 9 and 10 of the switch to take advantage of the 2.5G speed.  If you haven’t yet:  Have all the cables connected as you wish.  Forget all the nodes and hard reset them.  Re-add the nodes; hopefully they are found with an Ethernet port symbol, meaning through the switch.  Don’t be surprised if, for a couple of days, they drop connection.  (I have had this happen to my nodes.  AiMesh is searching for the best backhaul option.  I leave Ethernet backhaul mode OFF and all uplink priority settings on AUTO, per default.)  If you find a node does not come back online, try manually turning that node OFF and ON again via its power switch.  (My RPs do not have a switch, so I unplug them and then plug them back in.  That also works for my RT-AC68U, which is perched about 10 - 12 ft up.  I’d need a ladder to reach it, so I just unplug the extension cord down at the UPS.)

If they aren’t found then connect to LAN 1 on the XT9 router and try searching again.  (This should work.  Then reconnect it as preferred).

I hope I didn’t miss any steps and that this makes sense.

Mbrossart
Level 9

Thanks for your input.  I’ll have to try some commands in search of more details.  I have a few more data points.  The plot thickens, but I’m still unsuccessful with my LAG.

I’m treating my GS110EMX as a core switch.  Yes, my nodes are plugged into ports 9 & 10.  I’ve aggregated ports 7 & 8 to link to the router.  The router, and all of the nodes for that matter, are running the current firmware.  I’ve updated the switch to the current version and the issue persists.

  • I’ve noticed that the link lights on the switch don’t continuously flicker like I would expect.  They periodically blink slowly back and forth.  There’s a link light on each side of the port.  One side lights for about a second, then the other one and back and forth.  Then they start flickering again.  Then they rock back and forth.  I don’t know what this means other than something is amiss.
  • I got my hands on a GS108T.  It behaves a little differently than the GS110EMX, but it also supports 802.3ad.  When I try to connect it to the XT9 in LAG, the ports flicker a bit and then turn off, almost as if they tripped loop detection or some other fault condition, but I can find no errors or loops reported for those ports.  I also connected the two switches with LAG, and it works perfectly.
  • I chatted with tech support, and they recommended promoting one of my nodes to the router, which I did.  So now I have a different XT9 as the router, and both nodes have been reset.  I tried to immediately connect the router via LAG, but I saw the same symptoms.  In fact, the link lights slowly rocking back and forth is even more frequent than before.  Nonetheless, I tried to rejoin the nodes to the new router.  Not only were they discovered via Wi-Fi despite being cabled, they would not add.  I had to take LAG out of the equation to join them.

I talked to tech support again today.  They’ve asked for some additional information and logs.  They’re escalating my case.  I’ll see if I can get some more data via the command line.  I’ve only used the GUI.  How do you access the command line?

I know SSH is a way that’s very popular, and I found some instructions:

https://www.tomshardware.com/how-to/use-ssh-connect-to-remote-computer

I haven’t tried…

Okay, figuring a few more things out:

  • For all intents and purposes, it looks like LAG is actually working.  It looks like my switches are struggling with loop prevention.  When I turn it off, my switches look much happier.  The GS110EMX link lights don’t rock back and forth, and the GS108T doesn’t shut the ports off.  I read an article that says that ASUS’ AiMesh, as it checks its various backhaul methods, creates periodic loops that wreak havoc with managed switches with loop detection.
  • Unfortunately, just turning off loop prevention doesn’t solve the problem entirely.  The article I read, though, claims that loop prevention causes the auto backhaul setting to incorrectly choose WiFi over Ethernet, even without LAG.  The article was a couple of years old and speculated about ASUS fixing this issue.  So, I’ve developed a hypothesis that they did fix it, but there’s an aspect of the bug in LAG that didn’t get addressed.

Anyway, if it’s of interest, here are the results of my cat queries…

mbrossart@ffw:/tmp/home/root# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: fast
Min links: 0
Aggregator selection policy (ad_select): count
System priority: 65535
System MAC address: a0:36:bc:62:be:e0
Active Aggregator Info:
Aggregator ID: 3
Number of ports: 2
Actor Key: 9
Partner Key: 1
Partner Mac Address: e0:46:ee:10:25:5f

Slave Interface: eth2
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: a0:36:bc:62:be:e0
Slave queue ID: 0
Aggregator ID: 3
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
system priority: 65535
system mac address: a0:36:bc:62:be:e0
port key: 9
port priority: 255
port number: 1
port state: 63
details partner lacp pdu:
system priority: 32768
system mac address: e0:46:ee:10:25:5f
oper key: 1
port priority: 128
port number: 6
port state: 61

Slave Interface: eth3
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: a0:36:bc:62:be:e0
Slave queue ID: 0
Aggregator ID: 3
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 1
Partner Churned Count: 2
details actor lacp pdu:
system priority: 65535
system mac address: a0:36:bc:62:be:e0
port key: 9
port priority: 255
port number: 2
port state: 63
details partner lacp pdu:
system priority: 32768
system mac address: e0:46:ee:10:25:5f
oper key: 1
port priority: 128
port number: 7
port state: 61
mbrossart@ffw:/tmp/home/root# cat /sys/class/net/bond0/bonding/all_slaves_active
1
mbrossart@ffw:/tmp/home/root# cat /sys/class/net/bond0/bonding/mode
802.3ad 4
mbrossart@ffw:/tmp/home/root# cat /sys/class/net/bond0/bonding/slaves
eth2 eth3
mbrossart@ffw:/tmp/home/root# cat /sys/class/net/bond0/lower_eth2/bonding_slave/state
active
mbrossart@ffw:/tmp/home/root# cat /sys/class/net/bond0/lower_eth3/bonding_slave/state
active
mbrossart@ffw:/tmp/home/root#

Is LAG Type set to Static or LACP in the switch settings?  Default is Static, and I think it should be LACP (p. 146 of the GS108Tv3 manual).

That “63” is good, that “61” is concerning.  Per:

https://hareshkhandelwal.blog/2022/07/28/lets-understand-lacp-state-machine-using-linux-bond/

That’s all I’ve got for now….

State 61 for the member ports indicates LACP is active, but there's a mismatch in the config.  That's 0x02 unset, which is "timeout".  When that bit is set, the port will aggressively timeout on loss of LACPDUs, for the purpose of detecting link failure.  On my Cisco switch, it's the difference between "lacp timeout short" (timeout bit set) and "lacp timeout long" (timeout bit unset) in the interface config for the member ports.  In long timeout mode, it can take a couple of minutes for LACP to detect link failure, but just seconds in short mode.

Both ends of the LAG should be configured the same to avoid problems.  ASUSWRT defaults to short timeout (0x02 set), and I don't think it has an official way to change that, so the switch should be set to fast/short timeout.
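
To make the 63 versus 61 comparison concrete, here is a minimal Python sketch that decodes the "port state" number from the bonding readout into named flags.  The bit names follow the LACP actor/partner state octet from 802.3ad; the function names decode_port_state and missing_vs are just illustrative choices, not part of any ASUS or Linux tool.

```python
# One (bit, name) pair per flag in the LACP state octet.
LACP_STATE_BITS = [
    (0x01, "activity"),         # set = active LACP, unset = passive
    (0x02, "timeout"),          # set = short timeout, unset = long timeout
    (0x04, "aggregation"),
    (0x08, "synchronization"),
    (0x10, "collecting"),
    (0x20, "distributing"),
    (0x40, "defaulted"),
    (0x80, "expired"),
]


def decode_port_state(state):
    """Return the names of the flags that are set in a port state value."""
    return [name for bit, name in LACP_STATE_BITS if state & bit]


def missing_vs(state, reference=63):
    """Return flags set in the reference value but unset in state."""
    return [name for bit, name in LACP_STATE_BITS
            if reference & bit and not state & bit]


if __name__ == "__main__":
    for value in (63, 61):
        print(value, decode_port_state(value))
    print("61 is missing:", missing_vs(61))
```

Comparing 61 against 63 this way leaves only the timeout flag, which matches the 0x02 bit described above.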
