Enhancement #3289

MultiWAN: remove static routes for checkip

Added by Giacomo Sanchietti almost 3 years ago. Updated almost 3 years ago.

Status: CLOSED
Start date:
Priority: Normal
Due date:
Assignee: -
% Done: 100%
Category: <multiple packages>
Target version: v6.7
Resolution:
NEEDINFO: No

Description

Currently the user is forced to choose one "ping IP" (AKA check IP) for each provider.
While it works, this setup has a drawback: the ping IP can't be reached when a link goes down, so the system needs to create a static route for each check IP.

The basic idea is to have a global check IP, simplifying the whole implementation. This choice will allow us to:
  • delete the code that tries to auto-detect the right check IP
  • delete the static routes
We need to:
  • stop using shorewall disable in the lsm down script
  • adjust only the "balance" routing table on link status change

Reference: http://community.nethserver.org/t/new-configuration-for-the-multi-wan-monitoring/1829

Associated revisions

Revision 4b641111
Added by Giacomo Sanchietti almost 3 years ago

conf: use checkip prop from firewall key. Refs #3289

Revision f6d0790b
Added by Giacomo Sanchietti almost 3 years ago

Interface configuration: remove provider static routes. Refs #3289

Revision 1431c5e6
Added by Giacomo Sanchietti almost 3 years ago

Remove provider static routes. Refs #3289

Revision 669f35a8
Added by Giacomo Sanchietti almost 3 years ago

wan-update event: re-add ip rule for provider. Refs #3289

Revision 3b7d0cd8
Added by Giacomo Sanchietti almost 3 years ago

WebUI: move check ip inside common options. Refs #3289

Revision 44582c2a
Added by Giacomo Sanchietti almost 3 years ago

DB: remove old checkip, add new CheckIP prop. Refs #3289

Revision e5ddfd52
Added by Giacomo Sanchietti almost 3 years ago

Revert "lsm: avoid useless restarts" Refs #3289

This reverts commit b74e6cdc57f251b037f6e0506dc7d2fe623b8042.

Revision 1f4c9569
Added by Giacomo Sanchietti almost 3 years ago

conf: use new props, move event script to groups. Refs #3289

New properties:
  • MaxPercentPacketLoss
  • MaxNumberPacketLoss
  • PingInterval
  • NotifyWan

Revision a2f97315
Added by Giacomo Sanchietti almost 3 years ago

spec: requires lsm >= 0.190 for groups. Refs #3289

Revision 4429a616
Added by Giacomo Sanchietti almost 3 years ago

lsm.conf: lower debug level. Refs #3289

Revision c1ac5d7c
Added by Giacomo Sanchietti almost 3 years ago

DB: add props for link status monitor. Refs #3289

Revision 5e7c6bc3
Added by Giacomo Sanchietti almost 3 years ago

WAN update action: change logic for groups. Refs #3289

Revision 9999cd5d
Added by Giacomo Sanchietti almost 3 years ago

Web UI: add LSM options. Refs #3289

Revision 01036b03
Added by Giacomo Sanchietti almost 3 years ago

lsm.conf: force group status. Refs #3289

Revision 33883553
Added by Giacomo Sanchietti almost 3 years ago

WAN notify: update mail text. Refs #3289

Revision 9f1aa9e3
Added by Giacomo Sanchietti almost 3 years ago

Web UI: improve inline help and labels. Refs #3289

Revision bbf52d5d
Added by Filippo Carletti almost 3 years ago

Minor corrections to inline help. Refs #3289

Revision fc25d0e0
Added by Giacomo Sanchietti almost 3 years ago

createlinks: add static-routes-save to nethserver-firewall-base-update. Refs #3289

Revision a6183571
Added by Giacomo Sanchietti almost 3 years ago

lsm.conf: fix missing new line. Refs #3289

Revision ce6751e0
Added by Filippo Carletti over 2 years ago

multiwan: latency measure with lsm. Refs #3289

Revision e079cbe3
Added by Filippo Carletti over 2 years ago

multiwan: remove latency measure with ping. Refs #3289

Revision fbe1acb8
Added by Filippo Carletti over 2 years ago

multiwan: remove latency measure with ping. Refs #3289

History

#1 Updated by Giacomo Sanchietti almost 3 years ago

  • Status changed from NEW to TRIAGED
  • Target version set to v6.7
  • % Done changed from 0 to 20

#2 Updated by Giacomo Sanchietti almost 3 years ago

  • Status changed from TRIAGED to ON_DEV
  • Assignee set to Giacomo Sanchietti
  • % Done changed from 20 to 30

#3 Updated by Giacomo Sanchietti almost 3 years ago

  • Subject changed from MultiWAN: remove routes for checkip to MultiWAN: remove static routes for checkip

#4 Updated by Giacomo Sanchietti almost 3 years ago

  • Status changed from ON_DEV to MODIFIED
  • % Done changed from 30 to 60
Also implemented:
  • multiple check ip support (requires LSM 0.190)
  • notification mail on provider status change
  • configurable sensitivity of LSM

#5 Updated by Giacomo Sanchietti almost 3 years ago

  • Status changed from MODIFIED to ON_QA
  • Assignee deleted (Giacomo Sanchietti)
  • % Done changed from 60 to 70
Packages in nethserver-testing:
  • nethserver-base-2.9.2-1.2.gf6d0790.ns6.noarch.rpm
  • nethserver-lsm-1.0.2-1.6.g3388355.ns6.noarch.rpm
  • lsm-0.190-1.ns6.x86_64.rpm
  • nethserver-firewall-base-ui-2.8.0-1.16.g9f1aa9e.ns6.noarch.rpm
  • nethserver-firewall-base-2.8.0-1.16.g9f1aa9e.ns6.noarch.rpm

NOTE: some test cases require knowledge of advanced ip routing commands

Test case 1
  • Upgrade an already configured machine with at least 2 providers
  • Check the checkip address is deleted from all configured providers (db networks show)
  • Check the multi wan configuration is still working
  • Check that all check IPs are still reachable
  • Inspect ip rules and ip routes
Test case 2
  • After test case 1
  • Try to put a provider in down state by blocking the traffic on the router (or detaching the cable)
  • Check the hosts behind the firewall can still reach Internet
  • Re-enable traffic for the provider
  • Check the provider comes up and both links are used to access Internet
  • Inspect ip rules and ip routes
Test case 3
  • Change LSM parameters
  • Check parameters are applied to LSM
  • To verify the state of LSM use: pkill -SIGUSR1 lsm and see /var/log/messages
Test case 4
  • Enable mail notification and set From and To fields
  • Try to force a down/up state on a provider
  • Check the mail is sent

#6 Updated by Adam P almost 3 years ago

  • Assignee set to Adam P

#7 Updated by Adam P almost 3 years ago

  • Status changed from ON_QA to TRIAGED
  • Assignee deleted (Adam P)
  • % Done changed from 70 to 20

System and Package Version installed
ESXi 5.1 VM - Clean install of Nethserver 6.7 fully updated - 3 eth
Packages installed: Basic firewall, Bandwidth monitor, DNS and DHCP server, Intrusion Prevention System, VPN, Web filter, Web proxy, Web server

Test Original Problem
Set up WAN1 with a check IP of 8.8.4.4 and WAN2 with a check IP of 4.2.2.2. Simulated a WAN failure and confirmed that the static routes existed. Check IPs were not reachable when an interface was down.

Install Updated Package
The following commands installed all mentioned updated packages:
yum --enablerepo=nethserver-testing update nethserver-base
yum --enablerepo=nethserver-testing update nethserver-lsm
yum --enablerepo=nethserver-testing update nethserver-firewall-base-ui
Packages:
nethserver-base-2.9.2-1.2.gf6d0790.ns6.noarch.rpm
nethserver-lsm-1.0.2-1.6.g3388355.ns6.noarch.rpm
lsm-0.190-1.ns6.x86_64.rpm
nethserver-firewall-base-ui-2.8.0-1.16.g9f1aa9e.ns6.noarch.rpm
nethserver-firewall-base-2.8.0-1.16.g9f1aa9e.ns6.noarch.rpm

Test Results after update
Test case 1
*Upgrade an already configured machine with at least 2 providers
*Check the checkip address is deleted from all configured providers (db networks show)
-Confirmed. Check IPs and field were removed from each provider
*Check the multi wan configuration is still working
-Multi wan appears to still work. WAN connectivity to W7 and Ubuntu workstations fails over after simulated outage/failure. Static routes from old multi wan configuration are still static.
*Check that all check IPs are still reachable
-All check IPs are reachable when both WANs are up. When one goes down, even after failover, they're not reachable.
*Inspect ip rules and ip routes

Test case 2
*After test case 1
*Try to put a provider in down state by blocking the traffic on the router (or detaching the cable)
-disconnected virtual nic in vsphere.
*Check the hosts behind the firewall can still reach Internet
-hosts behind NS can reach internet after about 20 seconds
*Re-enable traffic for the provider
-reconnected virtual nic in vsphere
*Check the provider comes up and both links are used to access Internet
-both links are used and eventually all traffic fails over to the higher priority nic.
*Inspect ip rules and ip routes

Test case 3
*Change LSM parameters
-changed check ip
*Check parameters are applied to LSM
-check ip was changed. appeared to apply.
*To verify the state of LSM use: pkill -SIGUSR1 lsm and see /var/log/messages

Test case 4
*Enable mail notification and set From and To fields
-enabled email notification field and set from and to fields to email addresses in my domain
*Try to force a down/up state on a provider
-disconnected the virtual nic a couple times - no email and no logs in spam filter appliance. how does this send emails? would options for smtp settings be helpful?
*Check the mail is sent

Verified or Reopen
Reopen

Note
I also noticed that the default check IP of 8.8.8.8 is not accessible when the primary internet connection is taken offline. After it is reconnected, all traffic fails back and communication with the check IP is possible again. I changed the check IP to 4.2.2.3 and still experienced the same behavior with 8.8.8.8 as well as with 4.2.2.3, but on WAN2. After further testing, every IP I ping through one WAN becomes a static route through that WAN connection; it's unreachable once the WAN the static route went through is down. Tried 'ip ro flush cache' to no avail.
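
A possible way to check which WAN a given address is currently routed through is `ip route get`. The sketch below only parses that command's output and assumes the usual `... dev <iface> ...` layout; the helper name `route_dev` is hypothetical and not part of any NethServer package.

```shell
# route_dev: read `ip route get <addr>` output on stdin and print the
# outgoing interface, i.e. the word that follows "dev".
# Hypothetical helper, shown only as an illustration.
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Example on a live system (requires iproute2):
#   ip route get 8.8.8.8 | route_dev
```

Running it once with both links up and once with a link down shows whether the address is still pinned to the failed interface.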

#8 Updated by Filippo Carletti almost 3 years ago

Adam P wrote:

Static routes from old multi wan configuration are still static.

Static routes should have disappeared.
Could you please try with

service network restart

to see if those routes really disappear?

ip ro

Some tests will fail if the static routes are still present. Please repeat all tests. Thank you.

how does this send emails? would options for smtp settings be helpful?

mail command is used, it should work out of the box.

A test could be:
  1. set lsm debug (debug=10 in /etc/lsm/lsm.conf)
  2. restart lsm (this is the actual command to restart lsm :-))
  3. see verbose output in /var/log/messages (look for forkexec)
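
The steps above could be scripted roughly as follows. This is a minimal sketch: it assumes lsm.conf already contains a `debug=<n>` line and that GNU sed is available; the function names are hypothetical.

```shell
# set_lsm_debug LEVEL [FILE]: rewrite the existing "debug=<n>" line in place.
# Assumes such a line is already present; FILE defaults to /etc/lsm/lsm.conf.
set_lsm_debug() {
    level="$1"
    conf="${2:-/etc/lsm/lsm.conf}"
    sed -i "s/^\([[:space:]]*\)debug=.*/\1debug=${level}/" "$conf"
}

# lsm_forkexec_lines [FILE]: print only the log lines that mention forkexec.
# FILE defaults to /var/log/messages.
lsm_forkexec_lines() {
    grep 'forkexec' "${1:-/var/log/messages}"
}
```

After `set_lsm_debug 10`, restart lsm as described in step 2, trigger a link change, then run `lsm_forkexec_lines`.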

#9 Updated by Adam P almost 3 years ago

I currently only have one check ip specified: 4.2.2.3

With eth2 disconnected, I ran 'service network restart'. It took down all internet access until I rebooted the NS. After reboot, internet was back up so I ran 'ip ro' and got the following results:

ip ro
8.8.4.4 via 192.168.99.1 dev eth0
4.2.2.2 via 192.168.4.1 dev eth2

Those are the static routes that were defined before I upgraded to the test rpm. It appears they were not cleared after upgrading for some reason.
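
For anyone repeating this check, leftover host routes like the two above can be picked out of `ip ro` output with a small filter. This is only a sketch; `host_routes` is a hypothetical helper name, and it matches only lines whose destination is a bare IPv4 address followed by `via`.

```shell
# host_routes: read `ip route` output on stdin and print
# "<destination> <gateway> <device>" for each single-host route.
# Network routes (with a /prefix) and the default route are skipped.
host_routes() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $2 == "via" { print $1, $3, $5 }'
}

# Example on a live system:
#   ip route | host_routes
```

An empty result suggests the old per-provider check IP routes are gone.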

As far as mail, does the mail command resolve MX records and send mail directly? That could be an issue with SPF and PTR records not matching. That may be why the email never made it through. My spam appliance may have refused the smtp session.

Edit: I started receiving emails but found that they're only sent from the eth2 IP. I assume there's a route problem there too.

#10 Updated by Filippo Carletti almost 3 years ago

Adam P wrote:

With eth2 disconnected, I ran 'service network restart'. It took down all internet access until I rebooted the NS.

I can't explain this behavior. If you could share (even privately) your full configuration and logs, I may be able to figure out what happened.

Those are the static routes that were defined before I upgraded to the test rpm. It appears they were not cleared after upgrading for some reason.

Did those static routes go away after reboot? Do you find /etc/sysconfig/network-scripts/route-*?
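
A quick way to answer that question is to dump every route-* file with its name. The following is a sketch only; `list_route_files` is a hypothetical helper, and the directory defaults to the path mentioned above.

```shell
# list_route_files [DIR]: print each line of every route-* file in DIR,
# prefixed with the file name. Prints nothing if no such file exists.
list_route_files() {
    dir="${1:-/etc/sysconfig/network-scripts}"
    for f in "$dir"/route-*; do
        [ -e "$f" ] || continue          # glob did not match anything
        while IFS= read -r line || [ -n "$line" ]; do
            printf '%s: %s\n' "${f##*/}" "$line"
        done < "$f"
    done
}
```

Calling `list_route_files` with no argument inspects the system directory; any output means persistent static routes are still configured.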

As far as mail, does the mail command resolve MX records and send mail directly?

mail uses the local mail system, i.e. postfix.

Edit: I started receiving emails but found that they're only sent from the eth2 IP. I assume there's a route problem there too.

I think that postfix can choose every available ip.

#11 Updated by Adam P almost 3 years ago

Filippo Carletti wrote:

Did those static routes go away after reboot? Do you find /etc/sysconfig/network-scripts/route-*?

They did not. I have rebooted NS several times. I did find the route files. one named route-eth0 and one named route-eth2. Both contain one line (the routes stated above).

I think that postfix can choose every available ip.

It didn't seem to work that way. It was only using eth0. When eth0 goes down, I did not get alert emails. When eth2 goes down, I did receive alerts.

I'll set up remote access to this NS instance and PM it to you.

#12 Updated by Filippo Carletti almost 3 years ago

Adam P wrote:

Filippo Carletti wrote:

Did those static routes go away after reboot? Do you find /etc/sysconfig/network-scripts/route-*?

They did not. I have rebooted NS several times. I did find the route files. one named route-eth0 and one named route-eth2. Both contain one line (the routes stated above).

We need a signal-event interface-update to clean the old static routes.

#13 Updated by Giacomo Sanchietti almost 3 years ago

  • Status changed from TRIAGED to ON_DEV
  • Assignee set to Giacomo Sanchietti
  • % Done changed from 20 to 30

#14 Updated by Giacomo Sanchietti almost 3 years ago

Filippo Carletti wrote:

We need a signal-event interface-update to clean the old static routes.

We should only need to link the static-routes-save action inside the nethserver-firewall-base-update event.

#15 Updated by Giacomo Sanchietti almost 3 years ago

  • Status changed from ON_DEV to MODIFIED
  • % Done changed from 30 to 60

#16 Updated by Giacomo Sanchietti almost 3 years ago

  • Status changed from MODIFIED to ON_QA
  • Assignee deleted (Giacomo Sanchietti)
  • % Done changed from 60 to 70
Packages in nethserver-testing:
  • nethserver-firewall-base-2.8.0-1.18.gfc25d0e.ns6.noarch.rpm
  • nethserver-firewall-base-ui-2.8.0-1.18.gfc25d0e.ns6.noarch.rpm
  • nethserver-base-2.9.2-1.2.gf6d0790.ns6.noarch.rpm
  • nethserver-lsm-1.0.2-1.6.g3388355.ns6.noarch.rpm
  • nethserver-lsm-1.0.2-1.7.ga618357.ns6.noarch.rpm
  • lsm-0.190-1.ns6.x86_64.rpm

Repeat the test case.
Please note that nethserver-base must be installed before nethserver-firewall-base to get rid of old static routes:

yum --enablerepo=nethserver-testing update nethserver-base nethserver-firewall-base-* nethserver-lsm lsm

#17 Updated by Adam P almost 3 years ago

  • Assignee set to Adam P

#18 Updated by Adam P almost 3 years ago

  • Status changed from ON_QA to VERIFIED
  • Assignee deleted (Adam P)
  • % Done changed from 70 to 90

Confirmed. Static routes are gone and MultiWAN functions as expected.

#19 Updated by Giacomo Sanchietti almost 3 years ago

  • Status changed from VERIFIED to CLOSED
  • % Done changed from 90 to 100
Released in nethserver-updates:
  • nethserver-lsm-1.1.0-1.ns6.noarch.rpm
  • lsm-0.190-1.ns6.x86_64.rpm
  • nethserver-base-2.9.3-1.ns6.noarch.rpm
  • nethserver-firewall-base-ui-2.9.0-1.ns6.noarch.rpm
  • nethserver-firewall-base-2.9.0-1.ns6.noarch.rpm
