this post was submitted on 01 Jul 2023
10 points (100.0% liked)

Network Engineering

All things enterprise network engineering, design, and architecture.

Rules

  1. No low effort posts
  2. No home networking topics
  3. No memes


I am interested in how you go about identifying a bottleneck within a network.

In my case, I've got 2 locations, one in the UK and one in Germany. Hardware is Fortigates for FW/routing, and the switches are Cisco/HPE. The locations are connected through an IPsec VPN over the internet, and all internet connections have a bandwidth of at least 100 Mbps.

The problem occurs as soon as one client in the UK tries to download data via SSH from a server in Germany. The max download speed is 10 Mbps, and for the duration of the download the whole UK location has problems accessing resources in Germany through the VPN (Citrix, Exchange, Sharepoint, etc).

I've changed some information for privacy reasons, but I'd be interested in your first steps for tackling such a problem. Do you have some kind of runbook that you follow? What are common errors that you encounter? (Independent of my case too, just in general.)

EDIT: Current list

  • packet capture on client and server to check for packet loss, latency, etc. - if packets are dropped, check intermediate devices
  • check utilization of intermediate devices (CPU, RAM, etc)
  • check throughput with different tools (iperf3, nc, etc) and protocols (TCP, UDP, etc) and compare
  • check if traffic shapers/QoS are in place
  • check ports on intermediate devices for speed/duplex mismatches
  • MTU/MSS mismatch
  • is the internet connection affected too, or just traffic through the VPN?
  • IPsec configuration
  • temporarily turn off the security functions of the FW and check if the problem is still reproducible
  • traceroute from A to B, any latency spikes?
  • check RTT, RWND, MSS/MTU, TTL via pcap, on the transferring client itself and on a reference client, both with and without an active data transfer

Prob not related but noteworthy:

  • check I/O of server and client

I'll keep this list updated and appreciate further tips.
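
For the RTT/RWND item in the list, one rough sanity check is that a single TCP connection can't move more than one receive window per round trip. A minimal Python sketch of that bound; the 64 KiB window and 30 ms RTT are made-up example values, not measurements from this network, and the function names are just for illustration:

```python
def window_limited_mbps(rwnd_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-flow TCP throughput: one window per round trip."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return rwnd_bytes * 8 / (rtt_ms / 1000) / 1_000_000


def window_needed_bytes(target_mbps: float, rtt_ms: float) -> int:
    """Bandwidth-delay product: window needed to fill target_mbps at this RTT."""
    return int(target_mbps * 1_000_000 / 8 * rtt_ms / 1000)


if __name__ == "__main__":
    # Hypothetical values: 64 KiB window (no window scaling), 30 ms RTT.
    print(f"{window_limited_mbps(65535, 30):.1f} Mbps ceiling for one flow")
    print(f"{window_needed_bytes(100, 30)} bytes of window needed for 100 Mbps")
```

If the RWND seen in the pcap is far below the second number, the endpoints (or something rewriting the window in between) would be worth a closer look before blaming the path.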


Update: I had to postpone the session and will do the stress test on Monday or Tuesday evening. I'll update you as soon as I have the results.


Update 2: So, I'll try to keep it short.

The first iperf3 run over TCP (UK < DE) with the same FW rules let me reproduce the problem. Max speed was 10 Mbps, and DE < UK was even slower, down to 1-2 Mbps. The pattern of the test suggests an unreliable connection (short bursts of up to 30 Mbps, then 0, and so on). Traceroute shows the same hops in both directions and no latency spikes, all good.

BUT ICMP and iperf3 over UDP runs show a packet loss of at least 10% and up to 30% in both directions! Multiple speed tests to endpoints over the internet (UK > Internet) showed a download of 80 Mbps and an upload of about 30 Mbps, which indicates a problem with the IPsec tunnel.
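
To put the loss numbers in context: the Mathis et al. approximation bounds steady-state TCP throughput by roughly (MSS / RTT) * (C / sqrt(loss)), with C = sqrt(3/2). A rough Python sketch follows. The 1400-byte MSS and 30 ms RTT are assumptions (neither is stated in this post), and the model was derived for small loss rates, so at 10-30% loss treat the result as an order-of-magnitude hint only:

```python
import math

MATHIS_C = math.sqrt(3 / 2)  # ~1.22, constant from the Mathis et al. model


def mathis_ceiling_mbps(mss_bytes: int, rtt_ms: float, loss_rate: float) -> float:
    """Approximate TCP throughput ceiling in Mbps for a given random loss rate."""
    if not 0 < loss_rate < 1:
        raise ValueError("loss_rate must be between 0 and 1 (exclusive)")
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    segments_per_second = 1 / (rtt_ms / 1000)
    bits_per_second = mss_bytes * 8 * segments_per_second * MATHIS_C / math.sqrt(loss_rate)
    return bits_per_second / 1_000_000


if __name__ == "__main__":
    # Assumed MSS and RTT; loss rates taken from the UDP/ICMP results above,
    # plus a low-loss reference point for comparison.
    for loss in (0.0001, 0.10, 0.30):
        print(f"loss {loss:.2%}: ~{mathis_ceiling_mbps(1400, 30, loss):.1f} Mbps")
```

Whatever the exact RTT is, loss in the double digits is enough on its own to explain TCP getting stuck at a fraction of a 100 Mbps line.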

Some smaller things we've tried without any positive effect:

  • routing changes
  • disabling all security features for affected rule set
  • removed traffic shaper
  • Port speed/duplex negotiations are looking good
  • and some other things that I already forgot

Things we prepared:

  • We have opened tickets with our ISPs so they can check things on their side > waiting for response
  • Set up smokeping to ping all provider/public/gw/IPsec endpoint/host IPs and see where packets could be dropped (server located in DE)
  • Planned a new session with a Fortigate expert to look in-depth into the IPsec configuration.

Need to do:

  • look through all packet captures (takes some time)
  • MSS/MTU mismatches / DF flags
  • further iperf3 tests with smaller/larger packet sizes
  • double-check IPsec configuration
  • QoS on switches

I wish I had more time. I'll keep you updated


Update 3: Most likely the last big update.

So, the actual infrastructure is a little bit more complex than I've described in this post, so nobody could have suggested tips for this case.

We think that we have found the problem, but we couldn't implement the fix yet since it requires some downtime, and I was on a business trip. We've got multiple locations in the UK that are connected through a third party (MPLS), which is also where their internet breakout points are. We've now got multiple IPsec tunnels that terminate on the same FW in Germany. The problem is that the third-party FW uses the same IP AND port for all of those IPsec tunnels, which most likely causes all the issues. In short: only use one tunnel or change the GW on the German side.
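
A quick way to spot this kind of overlap in a tunnel inventory is to group tunnels by remote IP, remote port and local gateway, and flag any group with more than one member. A small Python sketch; the tunnel names, the documentation-range addresses and the `Tunnel` structure are all hypothetical and not taken from our actual config:

```python
from collections import defaultdict
from typing import NamedTuple


class Tunnel(NamedTuple):
    name: str
    remote_ip: str    # public IP the peer's traffic arrives from
    remote_port: int  # e.g. 500 for IKE, 4500 for NAT-T
    local_gw: str     # gateway IP the tunnel terminates on


def find_peer_collisions(tunnels: list[Tunnel]) -> dict[tuple[str, int, str], list[str]]:
    """Return groups of tunnels that share remote IP, remote port and local gateway."""
    groups: dict[tuple[str, int, str], list[str]] = defaultdict(list)
    for tunnel in tunnels:
        groups[(tunnel.remote_ip, tunnel.remote_port, tunnel.local_gw)].append(tunnel.name)
    return {key: names for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical inventory: two UK sites leave through the same breakout IP.
    inventory = [
        Tunnel("uk-site-a", "203.0.113.10", 4500, "198.51.100.1"),
        Tunnel("uk-site-b", "203.0.113.10", 4500, "198.51.100.1"),
        Tunnel("uk-site-c", "203.0.113.10", 4500, "198.51.100.2"),
    ]
    for key, names in find_peer_collisions(inventory).items():
        print(f"{key}: {', '.join(names)}")
```

In this made-up inventory, uk-site-c does not collide because it terminates on a different local gateway, which mirrors the "change the GW on the German side" option.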

Don't ask me why, please! - It is a cluster fuck, and the goal is to fix it in the future. One site had a large flat /16 network not long ago.

I might share a final update when we get the fix implemented.

top 21 comments
[–] [email protected] 4 points 1 year ago (1 children)

Performance problems are the hardest problems to solve unfortunately. I've got more thoughts to add to this, but have to get to some commitments today. I'll add more detail either tonight or tomorrow @[email protected]

[–] [email protected] 3 points 1 year ago

Would like to hear your thoughts and no stress. Won't work on the weekend anyway.

[–] [email protected] 3 points 1 year ago (1 children)

Sounds like firewall fuckery. Something doing too much DPI on some interesting looking packets that can't cope with the volume, choking and dropping what it can't handle. Check packet loss while this is happening and look real close at the Fortigates. Could also be MTU fun with IPsec in the mix.
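
On the MTU point, here is a back-of-the-envelope Python sketch of how much room is left inside an ESP tunnel. It assumes IPv4, ESP tunnel mode, AES-CBC (16-byte block and IV) and HMAC-SHA1-96 (12-byte ICV); none of that is stated in the post, so swap in the values from the actual phase 2 proposal:

```python
def max_inner_packet(path_mtu: int = 1500, nat_t: bool = False,
                     block: int = 16, iv: int = 16, icv: int = 12) -> int:
    """Largest inner IP packet that fits in one ESP tunnel-mode packet.

    Defaults are assumptions: IPv4 outer header, AES-CBC, HMAC-SHA1-96.
    """
    overhead = 20 + 8 + iv + icv      # outer IPv4 + ESP header (SPI, sequence) + IV + ICV
    if nat_t:
        overhead += 8                 # UDP encapsulation (NAT traversal)
    encrypted_room = path_mtu - overhead
    encrypted_room -= encrypted_room % block  # ciphertext must be a whole number of blocks
    return encrypted_room - 2         # ESP trailer: pad length + next header bytes


def tcp_mss_for(inner_packet: int) -> int:
    """TCP MSS for an inner IPv4 packet without IP or TCP options."""
    return inner_packet - 40          # 20 bytes IPv4 + 20 bytes TCP


if __name__ == "__main__":
    for nat_t in (False, True):
        inner = max_inner_packet(nat_t=nat_t)
        print(f"NAT-T={nat_t}: inner packet {inner}, TCP MSS {tcp_mss_for(inner)}")
```

Anything larger than that inner size has to be fragmented or dropped (if DF is set), which is where MSS clamping on the tunnel interface usually comes in.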

[–] [email protected] 1 points 1 year ago

Could be, for sure. I could disable the security profile for some tests and check if it happens with it turned off. Good points, thank you.

[–] [email protected] 2 points 1 year ago (1 children)
[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (1 children)

Thank you for the ping and the update!

Looks like you're on the right path to chasing the gremlins out. I'm glad iperf3 was helpful to you. It has helped me out tremendously many times.

For the record, you can always ping me anytime. I'm here to help and Lemmy notifications don't work half the time. But direct mentions always work.

Please keep me in the loop with further updates. At this time, nothing further to add from me. You're doing the right things.

[–] [email protected] 1 points 1 year ago

Yeah, notifications are really unreliable here. I've got another window for more stress tests today. Going to post an update later or tomorrow. The focus is on MTU/MSS.

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago) (1 children)

@[email protected] Apologies for the delay. I've been very tired lately. I'm going to most likely repeat some of the things others have mentioned and what you've already noted, but this would be my t/s process. (NOTE: all tests should be run on the endpoints, not network infra)

  1. Traceroute from UK -> Germany and Germany -> UK. Look for latency spikes. The reason I say do both directions is that sometimes there are weird pathing issues that only show up in the opposite direction.

  2. iperf3 from UK -> Germany and Germany -> UK.

  • 2a. Clear counters on switches/routers/firewalls.
  • 2b. During an extended iperf test, look for interface errors, CPU usage on the devices in path.
  • 2c. This is tedious and will take time, but you're dealing with gremlins.
  3. tcpdump on both sides during a transaction. Check for re-xmits and window scaling problems. Most likely not the endpoints, but something to rule out.

  4. Monitor Fortigate logs during all of this

  5. Set up test boxes in UK and Germany that are exempt from IPsec tunnels and test throughput again (this should be a clear indicator that the firewalls are fucked if this is good)

  6. All else fails, open a TAC case with Fortigate.
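
For the iperf3 step, a small Python wrapper can run both directions and both protocols and boil the JSON output (`-J`) down to a few numbers. This is only a sketch: the JSON field names (`end.sum_received`, `end.sum_sent.retransmits`, `end.sum.lost_percent`, `intervals[].sum`) are what I understand iperf3 to emit and may differ between versions, and the server address in the usage note is a placeholder:

```python
import json
import subprocess


def iperf3_cmd(server: str, reverse: bool = False, udp: bool = False,
               seconds: int = 30, udp_rate: str = "50M") -> list[str]:
    """Build an iperf3 client command line with JSON output."""
    cmd = ["iperf3", "-c", server, "-t", str(seconds), "-J"]
    if reverse:
        cmd.append("-R")               # server sends, client receives
    if udp:
        cmd += ["-u", "-b", udp_rate]  # UDP needs an explicit target rate
    return cmd


def summarize(report: dict) -> dict:
    """Pull headline numbers out of a parsed iperf3 JSON report."""
    end = report["end"]
    udp_sum = end.get("sum")
    if udp_sum is not None and "lost_percent" in udp_sum:
        return {"protocol": "udp",
                "mbps": udp_sum["bits_per_second"] / 1e6,
                "lost_percent": udp_sum["lost_percent"]}
    return {"protocol": "tcp",
            "mbps": end["sum_received"]["bits_per_second"] / 1e6,
            "retransmits": end.get("sum_sent", {}).get("retransmits")}


def stalled_intervals(report: dict, floor_mbps: float = 1.0) -> int:
    """Count reporting intervals where throughput collapsed below floor_mbps."""
    return sum(1 for interval in report.get("intervals", [])
               if interval["sum"]["bits_per_second"] / 1e6 < floor_mbps)


def run_test(server: str, **options) -> dict:
    """Run one iperf3 test and return the parsed JSON report."""
    result = subprocess.run(iperf3_cmd(server, **options),
                            capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

Usage would be something like `summarize(run_test("192.0.2.10", reverse=True))` for the opposite direction and `summarize(run_test("192.0.2.10", udp=True))` for the loss figure, with `stalled_intervals` as a crude check for a "burst, then zero" pattern.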

[–] [email protected] 1 points 1 year ago (1 children)

No worries, thank you for your input!

  1. What logging/debugging would you activate for that case? - Not too familiar with Fortigate yet and would appreciate some tips, IF you are familiar with those.
  2. The IPsec tunnel is the only connection between these locations, so it is rather difficult. But I get what you mean and will check if there is another option.

Good points!

[–] [email protected] 2 points 1 year ago (1 children)

Not sure on the logging. I’m a data center guy and would rather see firewalls in the trash lol. They usually just cause problems.

For the WAN, surely there is some way you can reach those sites over the general internet. You have ISP connections.

Are you sharing BGP to the ISP? Maybe make a couple of 1:1 NATs with test boxes not in prod so that you can quickly test pathing outside of the tunnel.

[–] [email protected] 1 points 1 year ago (1 children)

> Not sure on the logging. I’m a data center guy and would rather see firewalls in the trash lol. They usually just cause problems.

Haha - I'd like to disagree, but you are right.

> For the WAN, surely there is some way you can reach those sites over the general internet. You have ISP connections.

I for sure could do it, but it is not that easy to expose a server to the internet. There would be multiple departments involved and I need to get permission. And yeah, even with IP whitelisting. I guess that will be my last resort.

Still waiting for the test clients. Probably going to shift some hours into the weekend so I don't disturb daily business.

[–] [email protected] 1 points 1 year ago (1 children)

Totally understand the security and CAB process. It’s a royal PITA when it comes to troubleshooting.

Mind keeping me in the loop with your findings? I’ll help as much as I can.

[–] [email protected] 2 points 1 year ago

Will do. I'll most likely update the original post and ping you. I've added a per-IP traffic shaper to limit the bandwidth, so this one user won't be able to slow down the whole location, and I am about to prepare the troubleshooting session for the weekend.

[–] [email protected] 2 points 1 year ago (1 children)

Might be too simple but does a traceroute show pretty standard latency all the way down the line?

[–] [email protected] 2 points 1 year ago (1 children)

I am certain that we block ICMP on multiple FWs in between. I could allow it temporarily and check. Good suggestion.

[–] [email protected] 4 points 1 year ago (1 children)

Blocking ICMP entirely is a recipe for weird stuff happening. There's some ICMP worth blocking - redirects, etc - but turning it off entirely A) makes debugging stuff a nightmare and B) can break some things entirely e.g. MTU probing.
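
Once echo is allowed, the usable path MTU can be found by hand with a DF-bit ping sweep. A Python sketch that binary-searches the largest payload that gets through; `df_ping` assumes the Linux iputils `ping` flags (`-M do`, `-s`, `-c`, `-W`), and the target address in the usage note is a placeholder:

```python
import subprocess
from typing import Callable


def df_ping(host: str, payload: int) -> bool:
    """Send one echo request of the given payload size with DF set (Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "1", "-s", str(payload), host],
        capture_output=True,
    )
    return result.returncode == 0


def largest_payload(probe: Callable[[int], bool], low: int = 0, high: int = 1472) -> int:
    """Binary-search the largest payload for which probe(size) succeeds; -1 if none does."""
    if not probe(low):
        return -1
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid fits, look for something larger
        else:
            high = mid - 1   # mid is too big (or the probe was lost)
    return low
```

Usage would be `largest_payload(lambda size: df_ping("192.0.2.10", size)) + 28`, where the 28 bytes are the IPv4 and ICMP headers. On a path that is already dropping packets, a single lost probe looks like "too big", so each size is worth retrying a few times before trusting the result.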

[–] [email protected] 1 points 1 year ago

You are right. Still an active policy that we have to work on.

[–] [email protected] 1 points 1 year ago (1 children)

This comment is in no way related to your post.

I saw that one of the community rules is "more fuck u/spez comments", so I'm trying that out.

fuck u/spez

[–] [email protected] 2 points 1 year ago (1 children)
[–] [email protected] 2 points 1 year ago