MetalLB’s purpose is to attract traffic directed to the LoadBalancer IP to the cluster nodes. Once the traffic lands on a node, MetalLB’s responsibility is finished and the rest should be handled by the cluster’s CNI.
Because of that, being able to reach the LoadBalancer IP from one of the nodes doesn’t prove that MetalLB is working (not even partially): it only proves that the CNI is working.
Also, please be aware that pinging the service IP won’t work: you must access the service itself to verify it. As obvious as it may sound, also check that your application is behaving correctly.
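For example, for an HTTP service the check could look like this (the IP and port are illustrative):

```bash
# Access the service itself; pinging 192.168.1.240 is not a meaningful test
curl -v http://192.168.1.240:80/
```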
MetalLB is composed of two components:

- `controller` is in charge of assigning IPs to the services
- `speaker`s are in charge of announcing the services via L2 or BGP

If we want to understand why a service is not getting an IP, the component to check is the controller.
On the other hand, if we want to understand why a service with an assigned IP is not being advertised, the speakers are the components to check.
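A sketch of fetching each component’s logs, assuming the stock manifests (namespace `metallb-system`, a `controller` Deployment and a `speaker` DaemonSet):

```bash
# Why is the service not getting an IP? Check the controller.
kubectl logs -n metallb-system deploy/controller

# Why is an assigned IP not being advertised? Check the speakers.
# (kubectl logs on a DaemonSet picks a single pod; use a label selector to target others)
kubectl logs -n metallb-system ds/speaker -c speaker
```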
In order to work properly, MetalLB must be fed with a valid configuration.
MetalLB’s configuration is the composition of multiple resources, such as `IPAddressPools`, `L2Advertisements`, `BGPAdvertisements` and so on. The configuration is validated as a whole, and because of that it doesn’t make sense to say that a single piece of the configuration is not valid.
MetalLB’s behavior with regard to an invalid configuration is to mark it as stale and keep working with the last valid one.
Each component validates the part of the configuration that is relevant to its function, so it might happen that the controller accepts a configuration that the speaker marks as not valid.
There are two ways to see if a configuration is not valid:

- a `failed to parse the configuration` error log, plus other insights about the failure
- the `metallb_k8s_client_config_stale_bool` metric on Prometheus, which tells if the given component is running on a stale (obsolete) configuration

Note: the fact that the logs contain an invalid-configuration log does not necessarily mean that the last loaded configuration is not valid.
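A sketch of checking both signals, assuming the stock namespace and that Prometheus scrapes MetalLB’s metrics:

```bash
# 1) Look for parsing failures in the logs (controller shown; same applies to the speakers)
kubectl logs -n metallb-system deploy/controller | grep "failed to parse the configuration"

# 2) PromQL: components currently running on a stale configuration
#    metallb_k8s_client_config_stale_bool == 1
```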
The controller performs the IP allocation to the services and it logs any possible issue.
Things that may cause the assignment not to work include:

- no `IPAddressPool` compatible with the requirements of the service
- the IPs of the compatible pools being exhausted
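A sketch of where to look, assuming the stock `metallb-system` namespace:

```bash
# The controller logs the reason why an IP could not be assigned
kubectl logs -n metallb-system deploy/controller

# The service itself receives a warning event (e.g. AllocationFailed) on failure
kubectl describe svc <service-name>
```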
Each speaker publishes an `announcing from node "xxx" with protocol "bgp"` event associated with the service it is announcing.
In case of L2, only one speaker will announce the service, while in case of BGP multiple speakers will announce the service from multiple nodes.
A `kubectl describe svc <service-name>` will show the events related to the service, and with them the speaker(s) that are announcing it.
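For example, the output could contain an event like the following (the service name, node and ages are illustrative):

```
$ kubectl describe svc nginx
...
Events:
  Type    Reason        Age   From             Message
  ----    ------        ----  ----             -------
  Normal  nodeAssigned  5m    metallb-speaker  announcing from node "worker-0" with protocol "l2"
```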
A given speaker won’t advertise the service if:

- there is no `L2Advertisement` matching the service’s IP and the node
- the service has `externalTrafficPolicy=local` and there are no running endpoints on the speaker’s node

In order to have MetalLB advertise via L2, an `L2Advertisement` instance must be created (see the example below). This is different from the original MetalLB configuration, so please follow the docs and ensure you created one.
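A minimal instance looks like this (`first-pool` is an illustrative pool name; omitting `ipAddressPools` applies the advertisement to all pools):

```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example
  namespace: metallb-system
spec:
  ipAddressPools:
  - first-pool
```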
Running `arping <loadbalancer ip>` from a host on the same L2 subnet will show which MAC address is associated with the LoadBalancer IP by MetalLB:
```
$ arping -I ens3 192.168.1.240
ARPING 192.168.1.240 from 192.168.1.35 ens3
Unicast reply from 192.168.1.240 [FA:16:3E:5A:39:4C] 1.077ms
Unicast reply from 192.168.1.240 [FA:16:3E:5A:39:4C] 1.321ms
Unicast reply from 192.168.1.240 [FA:16:3E:5A:39:4C] 0.883ms
Unicast reply from 192.168.1.240 [FA:16:3E:5A:39:4C] 0.968ms
^C
Sent 4 probes (1 broadcast(s))
Received 4 response(s)
```
By design, MetalLB replies with the MAC address of the interface it received the ARP request from.
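To map the returned MAC back to a node, the nodes’ interfaces can be checked against the `arping` output (a sketch; the MAC is the one from the example above):

```bash
# Run on each candidate node: does a local interface own the advertised MAC?
ip -br link | grep -i 'fa:16:3e:5a:39:4c'
```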
`tcpdump` can be used to see if the ARP requests land on the node:
```
$ tcpdump -n -i ens3 arp src host 192.168.1.240
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens3, link-type EN10MB (Ethernet), capture size 262144 bytes
17:04:40.667263 ARP, Reply 192.168.1.240 is-at fa:16:3e:5a:39:4c, length 46
17:04:41.667485 ARP, Reply 192.168.1.240 is-at fa:16:3e:5a:39:4c, length 46
17:04:42.667572 ARP, Reply 192.168.1.240 is-at fa:16:3e:5a:39:4c, length 46
17:04:43.667545 ARP, Reply 192.168.1.240 is-at fa:16:3e:5a:39:4c, length 46
^C
4 packets captured
6 packets received by filter
0 packets dropped by kernel
```
If no replies are received, it might be that the ARP requests are blocked somewhere; anti-MAC-spoofing mechanisms are a pretty common reason for that.
In order to understand if this is the case, run tcpdump on the node where the speaker elected to announce the service is running and see if the ARP requests are making it through. At the same time, check on the client host whether the ARP replies are coming back.
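A sketch of the two captures (the interface names and the IP are illustrative):

```bash
# On the node elected to announce the service: do the ARP requests arrive?
tcpdump -n -i ens3 arp dst host 192.168.1.240

# On the client host: do the replies come back?
tcpdump -n -i ens3 arp src host 192.168.1.240
```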
Additionally, the speaker produces a `got ARP request for service IP, sending response` debug log whenever it receives an ARP request.
If multiple MACs are returned for the same LoadBalancer IP, this might be because, for example, another host on the network is wrongly configured with the same IP, or because stale ARP cache entries survive a failover from one node to another.
If the L2 interface selector is used but there are no compatible interfaces on the elected node, MetalLB will produce an event on the service, visible when doing `kubectl describe svc <service-name>`.
In order to have MetalLB advertise via BGP, a BGPAdvertisement instance must be created.
Advertising via BGP means announcing the nodes as the next hop for the route.
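As with L2, a minimal instance can be as small as this (the pool name is illustrative):

```yaml
apiVersion: metallb.io/v1beta1
kind: BGPAdvertisement
metadata:
  name: example
  namespace: metallb-system
spec:
  ipAddressPools:
  - first-pool
```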
Among the speakers that are supposed to advertise the IP, pick one and check if the BGP session with the router is established and if the service’s IP is being advertised over it.
The status of the session can be seen via the `metallb_bgp_session_up` metric.
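For example, assuming the metrics are scraped by Prometheus:

```
# PromQL: BGP sessions that are currently down
metallb_bgp_session_up == 0
```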
The information can also be found in the logs of the speaker container of the speaker pod, which will produce logs like `BGP session established` or `BGP session down`. It will also log `failed to send BGP update` in case of advertisement failure.
The FRR container in the speaker pod can be queried in order to understand the status of the session / advertisements.
Useful commands are:
- `vtysh show running-conf` to see the current FRR configuration
- `vtysh show bgp neigh <neigh-ip>` to see the status of the session; `Established` is the value corresponding to a healthy BGP session
- `vtysh show ipv4 / ipv6` to see the status of the advertisements
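These can be run without opening a shell in the pod (the pod name and peer IP are illustrative):

```bash
kubectl exec -n metallb-system speaker-7xzmp -c frr -- vtysh -c "show bgp neighbor 172.30.0.3"
```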
The FRR configuration that the speaker produces might be invalid. When this is the case, the speaker container will produce a `reload error` log. Also, the logs of the `reloader` container might show if the configuration file was invalid.
Things to check are that the session parameters (for example the peers’ addresses, the ASNs and, if set, the passwords) match on both ends.
If the configuration and the logs look fine on MetalLB’s side, another thing to check is the routing table on the routers corresponding to the `BGPPeer`s.
Networking is complex and there are multiple places where it might fail. In order to narrow down the issue, tcpdump can be used at each hop of the path: on the client, on the node, and then on the pod’s interface.
If the traffic doesn’t reach the node, it might be a network infrastructure or a MetalLB issue. If the traffic reaches the node but not the pod, this is likely to be a CNI issue.
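For example, to check whether the service traffic reaches the node at all (the interface, IP and port are illustrative):

```bash
# On the node: is the client's traffic to the service arriving?
tcpdump -n -i ens3 host 192.168.1.240 and tcp port 80
```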
Sometimes the LoadBalancer service works intermittently. This might be because of multiple factors: for example, the traffic being spread across multiple nodes where only some of them work correctly.
Sometimes, narrowing down the scope helps: with many nodes and many endpoints it is hard to find the right place to dump the traffic.
One way to make the triaging simpler is to limit the advertisement to one single node (using the node selector in the BGPAdvertisement / L2Advertisement) and to limit the service’s endpoints to one single pod.
By doing this, the path of the traffic will be clear, and it will be easy to understand whether the issue is caused only by a subset of the nodes, or whether it happens only when the pod lives on a different node than the announcing one, for example.
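A sketch of such a triaging configuration, assuming illustrative pool and node names (`BGPAdvertisement` accepts the same `nodeSelectors` field):

```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: triage
  namespace: metallb-system
spec:
  ipAddressPools:
  - first-pool                           # the pool serving the service under test
  nodeSelectors:
  - matchLabels:
      kubernetes.io/hostname: worker-0   # announce from this node only
```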
If, after following the suggestions in this guide, a MetalLB bug is the primary suspect, please file a bug report with the information described below.
In order to provide a meaningful bug report, information about which phase of advertising a service of type LoadBalancer is failing is greatly appreciated. Some good examples are “the service is not getting an IP” or “the IP is assigned but not announced via L2”. Some bad examples are “MetalLB doesn’t work” or “I can’t reach my application”.
Additionally, the following info must be provided: the MetalLB version in use, how it was deployed, the configuration applied to the cluster, and the logs of the controller and of all the speakers.
Setting the loglevel to debug will allow MetalLB to provide more information.
Both the helm chart and the MetalLB operator provide a convenient way to set the loglevel to debug:
- the Helm chart provides the `.controller.loglevel` and `.speaker.loglevel` fields
- the MetalLB operator exposes a corresponding log level field in the MetalLB custom resource

When deploying manually via the raw manifests, the log level parameter of the speaker / controller containers must be set manually.
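For example, with the Helm chart (a sketch: the release and repository names are illustrative, and the value names above should be checked against your chart version):

```bash
helm upgrade --install metallb metallb/metallb -n metallb-system \
  --set controller.loglevel=debug \
  --set speaker.loglevel=debug
```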
A convenience script that fetches all the required information can be found here.
Additionally, the status of the service and of the endpoints must be provided:

- `kubectl get endpointslices -l kubernetes.io/service-name=<my-service> -o yaml`
- `kubectl get svc <my-service> -o yaml`