this post was submitted on 15 Mar 2024
12 points (87.5% liked)

networking

2826 readers
3 users here now

Community for discussing enterprise networks and the ensuing chaos that comes after inheriting or building one.

founded 1 year ago
MODERATORS
 

Hey there, I've been on a networking journey that has, over a few years, taken me from simple unmanaged networking, to managed networking, to advanced VLAN management. It's all been self taught, but mostly successful. However, I've gotten myself into a bit of a pickle and I'm hitting a wall in troubleshooting. Apologies for the length of the post, however I want to provide as much detail as possible.

High level, I have several /16 vlans for things. VLAN 99 is networking, 2, is servers, 4 is clients, 6 is wireguard clients, and there are some others. They're all 10.99.0.0/16 with a gateway at 10.99.1.254, etc.

I have had a very old Netgear Layer3 switch for some time. I've replaced it with a Brocade ICX6610, mostly so I can move my storage infrastructure to 10G fiber (I have a small hypervisor cluster). I had done a ton of preparatory work to configure the new L3 switch so that it could just be dropped in place of the old one; this was MOSTLY successful...

...However, in doing that I broke the connection to my opnsense firewall and sort of had to redo that piece from scratch. During my planning, I didn't realize some of the config changes I'd made would require changes on the firewall, and after the cut over I was locked out of the firewall. This is all my fault; that's the piece of this I understand the least, and I had followed dodgy guides when getting it to initially work. I have a backup in xml format, but even having that I'm realizing what I had been doing didn't make sense. Previously, I had a firewall interface on all of my vlans and the trunk going to it was carrying all the VLANS. Now, I set this up with only 2 vlans going to the firewall, the networking vlan and the wireguard vlan, as it seems to make more sense with my understanding of how Layer 3 routing works. All routing should happen on the Brocade L3 switch. The firewall itself has 4 physical ports, 1 going to my comcast gateway, and 2 in an LACP lagg going to my L3 switch. (I have a single interface right now going to the L3 switch separately for troubleshooting, removing the LACP lag as a complexity source).

So, in recovering this, I had to get into the firewall at the console and re-define the interfaces and IP's. I got this to work, but at this point I had tons of connection problems which I didn't understand fully. I have found some of opnsense's configuration to be a bit obfuscating, which I think is making my learning more difficult. The following were put in place:

  • The "LAN" interface was given a static 10.99.1.40/16 IP, and an upstream gateway was defined at 10.99.1.254.
  • The "WAN" interface was given DHCP, and is up and works

Once I recovered the connection to the web interface I had to make the following changes:

  • Under the "Firewall" sidebar, under "Aliases", I defined each of my VLANS/Subnets with a CIDR notation and a name.
  • Under the "Firewall" sidebar, under "NAT" and then under "Outbound" I switched the mode to "hybrid" and added a rule for each of my vlans on the "LAN" interface, with the "Source" being the aliases defined above, and the target (NAT Address) being the "WAN address"
  • Under the "Firewall" sidebar, under "NAT" and then under "Port Forward" I added some port forward rules.
  • While it's outside the scope of my immediate troubleshooting, I had a working WireGuard setup. I have an interface defined for it on that VLAN, and a second gateway defined at 10.6.1.254. It's all set up according to the opnsense documentation, and I can connect from the WAN and can access any resources on the LAN.

So onto the problem...I can access the internet from almost all of my LAN clients. I can access LAN clients via the port forward rules from the WAN. The firewall itself CANNOT access the WAN; for example, I can't check for updates. I can access the firewall web interface from anywhere on the LAN, I can ssh to the firewall from anywhere on the LAN, but once I'm ssh'd in, I can't ping back to the client I'm connecting from. The firewall CAN ping things like 8.8.8.8, but as my DNS resolver is on the LAN, DNS queries from the firewall fail. I believe in a related note, my WireGuard clients can access anything on the LAN, but cannot connect to anything on the WAN.

I believe this has to do with outbound routes from the firewall, but any time I mess with it I end up locking myself out and having to reset interfaces from the console. I tried defining some static routes in "System" -> "Routes" -> "Configuration" but that isn't working. I'm kind of stumped and have been looking at it so long that I don't think more reading and configuring is going to help me anymore. I'll post some screenshots of rules and routes as well (you'll be able to see various things enabled/disabled for experimentation), but I'm kind of in over my head and need some help.

you are viewing a single comment's thread
view the rest of the comments
[–] [email protected] 1 points 8 months ago

Ok, lot to go over. The /16 thing is just history; before I started this, I actually had a full /16 for my whole house as I thought I'd have hundreds of IoT devices one day, and used that third octet as a logical separator. I've kind of got that stuck in my head, so when I moved to a 10. system, I made the vlans/subnets the 2nd octet because I have so much IP memory of that third and fourth octet. It's unnecessary, but tbh I know most of my IP's by heart, and I went into this trying to drive complexity up a bit to further my learning. I don't think necessarily changing them to /24 would solve the problem, because the complexity wouldn't really collapse much. It's things like the 3 network is for our minecraft servers/services, and 10.3.2.* represents the main one, 10.3.3.* represents the one my son runs, etc. It's just muscle memory at this point. The L3 stuff is mostly good I think, I'm mostly concerned about the firewall.

I know that opnsense can be the L3 device, but 1) part of my learning in this was to use kind of raw cli switch commands and not some web UI, and 2) I had the original L3 device before adding the opnsense box (I used to have a comcast modem as the upstream from the L3, now the L3 has a 0.0.0.0/0 route to opnsense, and that should upstream to the comcast device). I have a full VM dedicated as my DHCP/DNS device running bind9 and isc-dhcp-server which has been maintained for over 10 years; I'm not looking to offload that to another device and it works flawlessly on the lan (with an IP helper on each vlan).

I am definitely confused how it does gateways. My understanding is, in opnsense, gateways are the part of the route definition, so you define the opnsense gateways to point to the gateways on the L3 device, they're not on the opnsense box itself. When you add an interface, you select the default gateway for that interface from a dropdown, consisting of the gateways you defined elsewhere. Where I get goofed up and lock myself out is when I change the "upstream" checkbox or mess with the priority. I don't know how it selects one or the other as "active" either. I've iterated on that a lot; the further I get, the more it feels like the obfuscation of opnsense is adding to my complexity rather than reducing it.

It seems the only thing having routing problems are packets essentially originating from an interface on the opnsense; things on the LAN reach the WAN, things on the WAN reach the LAN, but wireguard clients terminating at the opnsense box can't hit the WAN, and the opnsense box can't hit the LAN (despite passing traffic).