Hey there, I've been on a networking journey that has, over a few years, taken me from simple unmanaged networking, to managed networking, to advanced VLAN management. It's all been self-taught, but mostly successful. However, I've gotten myself into a bit of a pickle and I'm hitting a wall in troubleshooting. Apologies for the length of the post, but I want to provide as much detail as possible.
At a high level, I have several /16 VLANs for things. VLAN 99 is networking, 2 is servers, 4 is clients, 6 is WireGuard clients, and there are some others. They all follow the same pattern: VLAN 99 is 10.99.0.0/16 with a gateway at 10.99.1.254, and so on.
I have had a very old Netgear Layer3 switch for some time. I've replaced it with a Brocade ICX6610, mostly so I can move my storage infrastructure to 10G fiber (I have a small hypervisor cluster). I had done a ton of preparatory work to configure the new L3 switch so that it could just be dropped in place of the old one; this was MOSTLY successful...
...However, in doing that I broke the connection to my OPNsense firewall and more or less had to redo that piece from scratch. During my planning, I didn't realize some of the config changes I'd made would require changes on the firewall, and after the cutover I was locked out of the firewall. This is all my fault; that's the piece of this I understand the least, and I had followed dodgy guides when initially getting it to work. I have a backup in XML format, but even with that I'm realizing that what I had been doing didn't make sense. Previously, I had a firewall interface on each of my VLANs, and the trunk going to it was carrying all the VLANs. Now, I've set this up with only 2 VLANs going to the firewall, the networking VLAN and the WireGuard VLAN, as that seems to make more sense with my understanding of how Layer 3 routing works. All routing should happen on the Brocade L3 switch. The firewall itself has 4 physical ports, 1 going to my Comcast gateway, and 2 in an LACP lagg going to my L3 switch. (Right now I have a single interface going to the L3 switch separately for troubleshooting, which removes the LACP lagg as a source of complexity.)
So, in recovering this, I had to get into the firewall at the console and redefine the interfaces and IPs. I got this to work, but at this point I had tons of connection problems which I didn't fully understand. I have found some of OPNsense's configuration to be a bit obfuscated, which I think is making my learning more difficult. The following were put in place:
- The "LAN" interface was given a static 10.99.1.40/16 IP, and an upstream gateway was defined at 10.99.1.254.
- The "WAN" interface was given DHCP, and is up and works
Once I recovered the connection to the web interface I had to make the following changes:
- Under the "Firewall" sidebar, under "Aliases", I defined each of my VLANs/subnets with a CIDR notation and a name.
- Under the "Firewall" sidebar, under "NAT" and then under "Outbound", I switched the mode to "hybrid" and added a rule for each of my VLANs on the "LAN" interface, with the "Source" being the aliases defined above, and the target (NAT Address) being the "WAN address".
- Under the "Firewall" sidebar, under "NAT" and then under "Port Forward" I added some port forward rules.
- While it's outside the scope of my immediate troubleshooting, I had a working WireGuard setup. I have an interface defined for it on that VLAN, and a second gateway defined at 10.6.1.254. It's all set up according to the OPNsense documentation, and I can connect from the WAN and can access any resources on the LAN.
So, onto the problem... I can access the internet from almost all of my LAN clients. I can access LAN clients via the port forward rules from the WAN. The firewall itself CANNOT access the WAN; for example, I can't check for updates. I can access the firewall web interface from anywhere on the LAN, and I can ssh to the firewall from anywhere on the LAN, but once I'm ssh'd in, I can't ping back to the client I'm connecting from. The firewall CAN ping things like 8.8.8.8, but as my DNS resolver is on the LAN, DNS queries from the firewall fail. On what I believe is a related note, my WireGuard clients can access anything on the LAN, but cannot connect to anything on the WAN.
I believe this has to do with outbound routes from the firewall, but any time I mess with it I end up locking myself out and having to reset interfaces from the console. I tried defining some static routes in "System" -> "Routes" -> "Configuration" but that isn't working. I'm kind of stumped and have been looking at it so long that I don't think more reading and configuring is going to help me anymore. I'll post some screenshots of rules and routes as well (you'll be able to see various things enabled/disabled for experimentation), but I'm kind of in over my head and need some help.
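One way to reason about the symptoms above is to model the firewall's routing table as a longest-prefix-match lookup. This is only a sketch under assumptions, not a dump of the real table: it assumes VLAN 4 is 10.4.0.0/16 (following the 10.<vlan>.0.0/16 pattern), and the WAN gateway 192.0.2.1 and the client 10.4.0.10 are made-up placeholders.

```python
import ipaddress

def route(cidr, hop):
    """Build a (network, next hop description) routing entry."""
    return (ipaddress.ip_network(cidr), hop)

def lookup(routes, dest):
    """Return the next hop of the most specific route that contains dest."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, hop) for net, hop in routes if addr in net]
    # Longest prefix wins, as in a normal routing table.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

# Hypothetical firewall table: a default route out the WAN plus the
# directly connected networking VLAN, and nothing else.
routes = [
    route("0.0.0.0/0", "WAN via 192.0.2.1 (placeholder)"),
    route("10.99.0.0/16", "LAN, directly connected"),
]

client = "10.4.0.10"  # hypothetical client on VLAN 4
before = lookup(routes, client)

# Add a static route for the VLAN 4 subnet via the L3 switch.
routes.append(route("10.4.0.0/16", "LAN via 10.99.1.254"))
after = lookup(routes, client)

print("before:", before)
print("after: ", after)
```

If the real table looks like the first version, traffic that the firewall itself originates toward 10.4.x.x (pings, DNS queries to a resolver on another VLAN) would match only the default route and leave via the WAN, while 8.8.8.8 would still work.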
Hmm... /16 subnets on a home network seem insane.
Especially with gateways near - but not at - the start of the subnets.
How many clients per VLAN are you running?
Would boring /24 subnets help?
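For a sense of scale, here's a quick sketch using Python's ipaddress module. The /16 and the gateway address are from your post; the /24 (10.99.1.0/24) is just a hypothetical pick of mine that would still contain the existing gateway.

```python
import ipaddress

vlan16 = ipaddress.ip_network("10.99.0.0/16")   # from the post
vlan24 = ipaddress.ip_network("10.99.1.0/24")   # hypothetical alternative
gateway = ipaddress.ip_address("10.99.1.254")   # from the post

def usable_hosts(net):
    # Exclude the network and broadcast addresses.
    return net.num_addresses - 2

hosts16 = usable_hosts(vlan16)
hosts24 = usable_hosts(vlan24)

# How far into the /16 the gateway sits, counted from the network address.
offset = int(gateway) - int(vlan16.network_address)

print(hosts16, hosts24, offset, gateway in vlan24)
```

So each /16 has room for 65534 hosts versus 254 in a /24, and the gateway sits 510 addresses into the /16.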
I've also never played around with an L3 switch. Routing on a switch seems like a budget hack for when the actual L3 routing device isn't powerful enough, or for when the L3 side isn't complex (i.e. switching between subnets of IPs, although I imagine that's more the territory of hardware-accelerated BGP devices). Seems like an easy way to tie yourself in knots and accidentally allow access (or block access) when you shouldn't.
But I've never had a router/firewall that can't keep up with my demands. Then again, I've never had more than 1 Gbps WAN, and internal networking doesn't need as much processing to keep up.
My only guess from my limited knowledge of l3 switches is...
Can 1 VLAN access another VLAN? If so, what's its route?
Are there asymmetrical rules that aren't stateful? I don't know if an L3 switch tracks the state of cross-subnet/VLAN connections, allowing packets to return.
As for why your firewall cannot ping out, it sounds like it has an issue with its upstream gateway, doesn't know its next hop, or you are not letting traffic out of the firewall itself.
Have you tried some wireshark/tcpdump captures? Can you mirror a port on your switch to help debug?
Honestly, I don't know why you don't do router-on-a-stick.
Have OPNsense run the VLANs over 1 physical port (or a 2-port lagg), have the switch distribute the VLANs, and let OPNsense handle L3.
When you set up OPNsense, have the initial config use an unused port for LAN. If you ever lock yourself out, use that as emergency access.
1 port for WAN, and 1 port (or a 2-port lagg) for the local trunk.
As for config issues I can see...
OPNsense gateways are its upstream gateways.
I don't know why you would have RFC1918 addresses set as gateways.
OPNsense's upstream gateway is normally provided by PPPoE/DHCP from your ISP. That's where it sends unknown packets... unless you have a static route for 0.0.0.0/0 set to the ISP's provided gateway.
I feel like you are misunderstanding gateways. Any gateway for a subnet would be set on its DHCP server, to tell clients where they should send unknown packets. OPNsense doesn't care about that, the clients do. The clients then know where to send their non-local packets: their DHCP (or statically) assigned gateway, normally OPNsense's static IP for the VLAN. Considering you have an L3 switch, I imagine it wants to act as the gateway for its known VLANs so it can do local L3 things, and it would forward non-local packets to its assigned gateway: OPNsense's static IP for the VLAN. OPNsense then gets the packet and forwards it to ITS gateway, which would be your ISP (likely from PPPoE or DHCP).
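To make that chain concrete, here's a rough sketch in Python. Everything in it is an assumption for illustration: the device names, the subnets (I'm guessing VLAN 2 and 4 are 10.2.0.0/16 and 10.4.0.0/16), and the idea that each device simply hands anything it isn't directly connected to on to the next device.

```python
import ipaddress

# Hypothetical default-gateway chain: (device, directly connected subnets).
# Each device passes packets it cannot deliver locally to the next one.
CHAIN = [
    ("client", ["10.4.0.0/16"]),
    ("L3 switch", ["10.2.0.0/16", "10.4.0.0/16", "10.99.0.0/16"]),
    ("OPNsense", ["10.99.0.0/16"]),
]

def trace(dest):
    """List the devices a packet from the client passes through to reach dest."""
    addr = ipaddress.ip_address(dest)
    path = []
    for name, connected in CHAIN:
        path.append(name)
        if any(addr in ipaddress.ip_network(c) for c in connected):
            return path  # this device is directly connected to dest
    # Nobody on the inside knows the destination: it leaves via the ISP.
    path.append("ISP gateway")
    return path

print(trace("10.4.0.20"))  # same VLAN, never leaves L2
print(trace("10.2.0.5"))   # other VLAN, routed by the L3 switch
print(trace("8.8.8.8"))    # internet, goes all the way up the chain
```

The point is that each device only needs one answer to "where do I send what I don't know about", and OPNsense's answer should be the ISP.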
Honestly, I don't know where to start with this.
Maybe it's because I've never done anything as complex as this, or because I've never complicated things this much.
I'd suggest you draw up your requirements, and think about redesigning towards those.
How many VLANs? How many clients per VLAN? Max bandwidth requirements? Can high-bandwidth connections be solved by multi-homing a service, so L2 deals with it? What's the actual throughput of your firewall? And so on.
Edit: sorry for the wall of text. I honestly didn't know what to concentrate on. Dropping an L3 switch into an odd home network just explodes the possibilities.