jolly-london-20127

12/19/2022, 7:48 PM
@prehistoric-horse-401 @big-dress-87524 We are trying to track down the issue with interface shutting down and provide a hotfix, are you able to provide more details?

big-dress-87524

12/19/2022, 7:54 PM
I can dig into this after work today and see if there are any logs or other details that might be helpful - all I can tell you offhand atm is that I’m seeing this on Linux with the normal netclient (using the normal kernel WireGuard etc)

prehistoric-horse-401

12/19/2022, 8:08 PM
I sadly had to ditch netmaker for now, as I'm heavily relying on private networks. But I remember that the Windows client did not receive a handshake packet back from the server node. With netmaker shut down on the server, the connection did persist and I could reach every other node. But as soon as the docker stack was started again, the Windows client lost connection after around a minute. After the shutdown the client reconnected, and the loop starts again. Adding the Windows client as an external device results in no lost connection, but this is expected I guess.

jolly-london-20127

12/19/2022, 8:32 PM
the issue is on Linux as well?
Thanks for this info. So the issue is only while the Netmaker server was running, and when you shut it down, the connection did not have a problem?

big-dress-87524

12/19/2022, 8:35 PM
Yeah, I’ve only noticed this on a Linux host running the newest netclient version (tho tbh I only noticed it because network manager sent a notification every time the WireGuard tunnel was disconnected or reconnected)

jolly-london-20127

12/19/2022, 8:39 PM
Interesting...okay, so once the tunnel is lost, does it come back on its own or do you need to run a pull?

prehistoric-horse-401

12/19/2022, 8:42 PM
Yes that is correct

big-dress-87524

12/20/2022, 1:05 AM
yeah, like I said, it seemed to work (well enough for light web browsing at least) which is why I didn't report it at first - what it does is every 30s-1m or so, it disconnects and then immediately reconnects
I'm running netmaker server v0.16.3 atm. Every time it bounces, the netclient instance logs the below:
Dec 19 20:07:21 workstation netclient[720]: [netclient] 2022-12-19 20:07:21 [mqpublish.go-52] checkin(): checkin with server(s) for all networks
Dec 19 20:07:51 workstation netclient[720]: [netclient] 2022-12-19 20:07:51 [mqpublish.go-252] publish(): could not connect to broker at broker.<netmaker-domain>:443
Dec 19 20:07:51 workstation netclient[720]: [netclient] 2022-12-19 20:07:51 [mqpublish.go-149] Hello(): Network: <network-name> error publishing ping, connection timeout
Dec 19 20:07:51 workstation netclient[720]: [netclient] 2022-12-19 20:07:51 [mqpublish.go-150] Hello(): running pull on <network-name> to reconnect
Dec 19 20:07:56 workstation netclient[720]: [netclient] 2022-12-19 20:07:56 [common.go-162] InitWireguard(): waiting for interface...
Dec 19 20:07:56 workstation netclient[720]: [netclient] 2022-12-19 20:07:56 [common.go-190] InitWireguard(): interface ready - netclient.. ENGAGE
and on the server side, around the time of a bounce, it logged the following (this might just be a coincidence, as it's sometimes logging similar messages when this particular client didn't bounce):
mq             | 1671498660: New connection from <internal ip address>:56910 on port 8883.
mq             | 1671498660: Client <unknown> disconnected due to protocol error.
now that I'm looking at logs, I'm almost wondering if it's some sort of misconfiguration that feivel and I have that was only made apparent by the newest netclient
I bet this is related to the switch back to websockets with the MQ broker, as I haven't updated the server yet, so haven't moved over MQ either. I think I might have time to go through the upgrade tomorrow, so I’ll see if that helps.
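
The netclient log lines quoted above follow a regular shape, so the bounce cadence can be measured instead of estimated. A minimal Python sketch, assuming journal lines in exactly the format pasted in this thread; `bounce_times` and `intervals` are hypothetical helper names, not part of netclient:

```python
import re
from datetime import datetime

# Matches the netclient portion of a journal line, e.g.
# "[netclient] 2022-12-19 20:07:51 [mqpublish.go-150] Hello(): running pull on ..."
LINE_RE = re.compile(
    r"\[netclient\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[[^\]]+\] (\w+)\(\): (.*)"
)


def bounce_times(lines):
    """Return the timestamps of lines where Hello() falls back to a pull."""
    times = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group(2) == "Hello" and "running pull" in match.group(3):
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def intervals(times):
    """Return the seconds elapsed between consecutive bounce timestamps."""
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
```

Feeding it the output of something like `journalctl -u netclient` (the unit name is an assumption) gives a list of gaps in seconds, which makes it easy to see whether the bounces line up with a fixed check-in timer.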

prehistoric-horse-401

12/20/2022, 9:11 AM
I suspect too that this is related to websockets. I've tried upgrading to 0.17.0 (with traefik) but to no avail. It seems like my clients can't connect to the broker. At least only the admin user connects to mq after startup.
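
One way to check whether the broker hostname answers a WebSocket upgrade through the reverse proxy is to send the handshake by hand. A minimal Python sketch, assuming the broker is published over TLS on port 443, at the root path, and accepts the `mqtt` subprotocol; the function names are hypothetical, and the host you pass in is your own broker domain:

```python
import base64
import os
import socket
import ssl


def build_upgrade_request(host: str, path: str = "/") -> bytes:
    """Build a WebSocket upgrade request that asks for the 'mqtt' subprotocol."""
    key = base64.b64encode(os.urandom(16)).decode()
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "Sec-WebSocket-Protocol: mqtt",
        "",
        "",
    ]
    return "\r\n".join(lines).encode()


def is_upgrade_accepted(response: bytes) -> bool:
    """True if the HTTP status line reports 101 Switching Protocols."""
    parts = response.split(b"\r\n", 1)[0].split()
    return len(parts) >= 2 and parts[1] == b"101"


def probe(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Open a TLS connection and report whether the upgrade was accepted."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_upgrade_request(host))
            return is_upgrade_accepted(tls.recv(1024))
```

A `101` only shows that the proxy routed the request to something that speaks WebSocket; it says nothing about MQTT authentication. Any other status (or a timeout) would point at the proxy routing or the broker listener rather than at netclient.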

big-dress-87524

12/20/2022, 12:49 PM
I'm also using traefik... interesting
update: I seem to have gotten this working by changing the Traefik config, as well as updating the MQ config to use websockets according to the github repo. The reference configuration seems to work, but since I prefer to let traefik handle TLS termination, I'm using the configuration below, which also seems to work. The downside of this success is that now all the other nodes not yet on the 0.17.0 agent are having issues, which I guess is expected with the MQ protocol change.
labels:
      - traefik.enable=true
      - traefik.http.routers.mqtt.rule=Host(`broker.${BASE_DOMAIN}`)
      - traefik.http.routers.mqtt.tls.certresolver=http
      - traefik.http.services.mqtt.loadbalancer.server.port=1883
      - traefik.http.routers.mqtt.entrypoints=websecure
      - traefik.http.middlewares.sslheader.headers.customrequestheaders.X-Forwarded-Proto=https
      - traefik.http.routers.mqtt.middlewares=sslheader
weirdly, even though the node is no longer bouncing and according to its own logs seems to be connecting to the MQ just fine, the node health displayed in the UI continues to be "error", even after >10m

jolly-london-20127

12/20/2022, 1:13 PM
do you have ICMP enabled on the server? The node must be able to ping the server to get a "healthy" status

big-dress-87524

12/20/2022, 1:17 PM
the client is able to ping the server's netmaker address
I did just notice that when I manually run a pull on the client side I get an error on the server side, but not sure if that's expected or even related - maybe that's part of the issue 🤔 Weirdly, I only see the error on the client side when I run the pull with full verbosity:
sudo netclient pull --vvvv
[netclient] 2022-12-20 08:13:13 [commands.go-87] Pull(): No network selected. Running Pull for all networks. 
[netclient] 2022-12-20 08:13:43 [commands.go-108] Pull(): error pulling network config for network:  homelab 
 Post "https://api.<netmaker-host>:443/api/nodes/adm/<network>/authenticate": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 
[netclient] 2022-12-20 08:13:43 [commands.go-116] Pull(): reset network all and peer configs
server side error:
2022-12-20 13:14:26 failed to send DynSec command [[{createRole   homelab [{publishClientReceive update/<network>/# -1 true} {publishClientReceive peers/<network>/# -1 true} {subscribePattern # -1 true} {unsubscribePattern # -1 true}]  Network wide role with Acls for nodes  [] []} {createClient <UUID/key>  []  <client-hostname>  [] [{node -1} {<network> -1}]}]]: connect timeout
update: after upgrading the netmaker server to 0.17.0 from 0.16.3, I don't get this error anymore either, and all nodes are showing as healthy again. I was led to believe that wasn't required by the reference docker-compose linked in the github release, but I guess it was.

jolly-london-20127

12/20/2022, 1:44 PM
So the fix was the added labels for traefik?

prehistoric-horse-401

12/20/2022, 2:41 PM
Upgrading to 0.17.0 resulted in a totally broken netmaker setup for me. I could not get it to work using traefik. ICMP was enabled, and I even disabled the firewall to test things.

big-dress-87524

12/20/2022, 2:43 PM
Once I upgraded netmaker and updated the MQ config and traefik config, it all seems to work okay. I can try to test later to see if the suggested traefik config actually works too, but the above config is definitely working for me atm.