NETGRAPH(4) BSD Kernel Interfaces Manual NETGRAPH(4)
NAME
netgraph -- graph based kernel networking subsystem
DESCRIPTION
The netgraph system provides a uniform and modular system for the implementation of kernel objects which perform various networking func-
tions. The objects, known as nodes, can be arranged into arbitrarily complicated graphs. Nodes have hooks which are used to connect two
nodes together, forming the edges in the graph. Nodes communicate along the edges to process data, implement protocols, etc.
The aim of netgraph is to supplement rather than replace the existing kernel networking infrastructure. It provides:
o A flexible way of combining protocol and link level drivers.
o A modular way to implement new protocols.
o A common framework for kernel entities to inter-communicate.
o A reasonably fast, kernel-based implementation.
Nodes and Types
The most fundamental concept in netgraph is that of a node. All nodes implement a number of predefined methods which allow them to interact
with other nodes in a well defined manner.
Each node has a type, which is a static property of the node determined at node creation time. A node's type is described by a unique ASCII
type name. The type implies what the node does and how it may be connected to other nodes.
In object-oriented language, types are classes, and nodes are instances of their respective class. All node types are subclasses of the
generic node type, and hence inherit certain common functionality and capabilities (e.g., the ability to have an ASCII name).
Nodes may be assigned a globally unique ASCII name which can be used to refer to the node. The name must not contain the characters '.' or
':', and is limited to NG_NODESIZ characters (including the terminating NUL character).
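The naming rule above can be sketched as a small userland check. This is an illustrative model rather than kernel code: the helper name ng_name_ok and the constant MODEL_NODESIZ are hypothetical, the value 32 is an assumption standing in for NG_NODESIZ (the authoritative definition is in the netgraph headers), and treating an empty string as invalid is a choice made for this sketch only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed limit for this sketch; the real NG_NODESIZ comes from the headers. */
#define MODEL_NODESIZ 32

/*
 * Hypothetical helper: return true if `name` could serve as a node name,
 * i.e. it is non-empty, contains neither '.' nor ':', and fits in
 * MODEL_NODESIZ bytes together with its terminating NUL.
 */
static bool
ng_name_ok(const char *name)
{
    size_t len;

    if (name == NULL)
        return (false);
    len = strlen(name);
    if (len == 0 || len + 1 > MODEL_NODESIZ)
        return (false);
    return (strpbrk(name, ".:") == NULL);
}
```

The same test applies to hook names if the limit is replaced by the one for hooks.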
Each node instance has a unique ID number which is expressed as a 32-bit hexadecimal value. This value may be used to refer to a node when
there is no ASCII name assigned to it.
Hooks
Nodes are connected to other nodes by connecting a pair of hooks, one from each node. Data flows bidirectionally between nodes along con-
nected pairs of hooks. A node may have as many hooks as it needs, and may assign whatever meaning it wants to a hook.
Hooks have these properties:
o A hook has an ASCII name which is unique among all hooks on that node (other hooks on other nodes may have the same name). The name must
not contain the characters '.' or ':', and is limited to NG_HOOKSIZ characters (including the terminating NUL character).
o A hook is always connected to another hook. That is, hooks are created at the time they are connected, and breaking an edge by removing
either hook destroys both hooks.
o A hook can be set into a state where incoming packets are always queued by the input queueing system, rather than being delivered
directly. This can be used when the data is sent from an interrupt handler, and processing must be quick so as not to block other inter-
rupts.
o A hook may supply overriding receive data and receive message functions, which should be used for data and messages received through that
hook in preference to the general node-wide methods.
A node may decide to assign special meaning to some hooks. For example, connecting to the hook named debug might trigger the node to start
sending debugging information to that hook.
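The rule that hooks exist only as connected pairs can be modelled in userland as follows. This is a sketch under stated assumptions: struct model_hook, model_connect() and model_disconnect() are hypothetical names, the 32-byte name buffer stands in for NG_HOOKSIZ, and none of this reflects the layout of the kernel's own hook structure.

```c
#include <stdlib.h>
#include <string.h>

/* Userland model of a hook; not the kernel's hook structure. */
struct model_hook {
    char name[32];              /* assumed size, standing in for NG_HOOKSIZ */
    struct model_hook *peer;    /* never NULL while the hook exists */
};

/*
 * Create both ends of an edge at once. Returns 0 on success, or -1 on
 * allocation failure, in which case neither hook exists. Names longer
 * than the buffer are silently truncated in this sketch.
 */
static int
model_connect(const char *name_a, const char *name_b,
    struct model_hook **a, struct model_hook **b)
{
    struct model_hook *ha = calloc(1, sizeof(*ha));
    struct model_hook *hb = calloc(1, sizeof(*hb));

    if (ha == NULL || hb == NULL) {
        free(ha);
        free(hb);
        return (-1);
    }
    strncpy(ha->name, name_a, sizeof(ha->name) - 1);
    strncpy(hb->name, name_b, sizeof(hb->name) - 1);
    ha->peer = hb;
    hb->peer = ha;
    *a = ha;
    *b = hb;
    return (0);
}

/* Breaking the edge from either end destroys both hooks. */
static void
model_disconnect(struct model_hook *h)
{
    struct model_hook *peer = h->peer;

    free(peer);
    free(h);
}
```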
Data Flow
Two types of information flow between nodes: data messages and control messages. Data messages are passed in mbuf chains along the edges in
the graph, one edge at a time. The first mbuf in a chain must have the M_PKTHDR flag set. Each node decides how to handle data received
through one of its hooks.
Along with data, nodes can also receive control messages. There are generic and type-specific control messages. Control messages have a
common header format, followed by type-specific data, and are binary structures for efficiency. However, node types may also support conver-
sion of the type-specific data between binary and ASCII formats, for debugging and human interface purposes (see the NGM_ASCII2BINARY and
NGM_BINARY2ASCII generic control messages below). Nodes are not required to support these conversions.
There are three ways to address a control message. If there is a sequence of edges connecting the two nodes, the message may be ``source
routed'' by specifying the corresponding sequence of ASCII hook names as the destination address for the message (relative addressing). If
the destination is adjacent to the source, then the source node may simply specify (as a pointer in the code) the hook across which the mes-
sage should be sent. Otherwise, the recipient node's global ASCII name (or equivalent ID-based name) is used as the destination address for
the message (absolute addressing). The two types of ASCII addressing may be combined, by specifying an absolute start node and a sequence of
hooks. Only the ASCII addressing modes are available to control programs outside the kernel; use of direct pointers is limited to kernel
modules.
Messages often represent commands that are followed by a reply message in the reverse direction. To facilitate this, the recipient of a con-
trol message is supplied with a ``return address'' that is suitable for addressing a reply.
Each control message contains a 32-bit value, called a ``typecookie'', indicating the type of the message, i.e., how to interpret it. Typically,
each type defines a unique typecookie for the messages that it understands. However, a node may choose to recognize and implement more
than one type of message.
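As an illustration of how a node might recognize more than one type of message, the following userland sketch dispatches first on the typecookie and then on the command number. Everything in it is hypothetical: struct model_msg is a reduced stand-in for the real message header, and the cookie and command values are invented for the example.

```c
#include <errno.h>
#include <stdint.h>

/* Reduced model of a control message header; not the real layout. */
struct model_msg {
    uint32_t typecookie;    /* which command set this message belongs to */
    uint32_t cmd;           /* command number within that set */
};

/* Invented cookies and commands, for illustration only. */
#define COOKIE_GENERIC  0x1001u
#define COOKIE_MYTYPE   0x2002u
enum { GEN_CMD_PING = 1 };
enum { MY_CMD_GET_STATS = 1, MY_CMD_CLR_STATS = 2 };

/*
 * The same command number means different things under different
 * cookies, so the cookie is examined first. Returns 0 if the message
 * was understood and EINVAL otherwise.
 */
static int
model_rcvmsg(const struct model_msg *msg, int *stats)
{
    switch (msg->typecookie) {
    case COOKIE_GENERIC:
        return (msg->cmd == GEN_CMD_PING ? 0 : EINVAL);
    case COOKIE_MYTYPE:
        switch (msg->cmd) {
        case MY_CMD_GET_STATS:
            return (0);
        case MY_CMD_CLR_STATS:
            *stats = 0;
            return (0);
        default:
            return (EINVAL);
        }
    default:
        return (EINVAL);
    }
}
```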
If a message is delivered to an address that implies that it arrived at that node through a particular hook (as opposed to having been
directly addressed using its ID or global name) then that hook is identified to the receiving node. This allows a message to be re-routed or
passed on, should a node decide that this is required, in much the same way that data packets are passed around between nodes. The base
system defines a set of standard messages for flow control and link management purposes, which are usually passed around in this manner.
Flow control messages usually travel in the opposite direction to the data to which they pertain.
Netgraph is (Usually) Functional
In order to minimize latency, most netgraph operations are functional. That is, data and control messages are delivered by making function
calls rather than by using queues and mailboxes. For example, if node A wishes to send a data mbuf to neighboring node B, it calls the
generic netgraph data delivery function. This function in turn locates node B and calls B's ``receive data'' method. There are exceptions
to this.
Each node has an input queue, and some operations can be considered to be writers in that they alter the state of the node. Obviously, in an
SMP world it would be bad if the state of a node were changed while another data packet were transiting the node. For this purpose, the
input queue implements reader/writer semantics: while a writer is in the node, all other requests are queued, and while readers are in
the node, a writer and any packets following it are queued. In the case where there is no reason to queue the data, the input method is
called directly, as mentioned above.
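The admission rule described above can be modelled as a single decision function. This is a single-threaded userland sketch with hypothetical names (struct model_inq, model_admit()); it ignores locking and atomicity entirely and is not a description of how the kernel implements its input queue.

```c
#include <stdbool.h>

/* Userland model of a node's input queue state. */
struct model_inq {
    int  readers;           /* requests currently inside as readers */
    bool writer_active;     /* a writer is currently inside the node */
    int  queued;            /* requests waiting in the input queue */
};

enum model_verdict { MODEL_DIRECT, MODEL_QUEUED };

/*
 * Decide whether a new request may run immediately.
 *  - Anything arriving behind queued work is queued, to preserve order.
 *  - While a writer is inside the node, everything is queued.
 *  - A writer arriving while readers are inside is queued.
 * Otherwise the request is delivered directly.
 */
static enum model_verdict
model_admit(struct model_inq *q, bool is_writer)
{
    if (q->queued > 0 || q->writer_active ||
        (is_writer && q->readers > 0)) {
        q->queued++;
        return (MODEL_QUEUED);
    }
    if (is_writer)
        q->writer_active = true;
    else
        q->readers++;
    return (MODEL_DIRECT);
}
```

With one reader already inside, a writer is queued, and so is a reader that arrives after that writer, which matches the ``a writer and any following packets'' wording.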
A node may declare that all requests should be considered as writers, or that requests coming in over a particular hook should be considered
to be a writer, or even that packets leaving or entering across a particular hook should always be queued, rather than delivered directly
(often useful for interrupt routines that want to get back to the hardware quickly). By default, all control messages are considered
to be writers unless specifically declared to be readers in their definition. (See NGM_READONLY in <ng_message.h>.)
While this mode of operation results in good performance, it has a few implications for node developers:
o Whenever a node delivers a data or control message, the node may need to allow for the possibility of receiving a returning message
before the original delivery function call returns.
o Netgraph provides internal synchronization between nodes. Data always enters a ``graph'' at an edge node. An edge node is a node that
interfaces between netgraph and some other part of the system. Examples of ``edge nodes'' include device drivers, the socket, ether,
tty, and ksocket node types. In these edge nodes, the calling thread directly executes code in the node, and from that code calls upon
the netgraph framework to deliver data across some edge in the graph. From an execution point of view, the calling thread will execute
the netgraph framework methods, and if it can acquire a lock to do so, the input methods of the next node. This continues until either
the data is discarded or queued for some device or system entity, or the thread is unable to acquire a lock on the next node. In that
case, the data is queued for the node, and execution rewinds back to the original calling entity. The queued data will be picked up and
processed by either the current holder of the lock when they have completed their operations, or by a special netgraph thread that is
activated when there are such items queued.
o It is possible for an infinite loop to occur if the graph contains cycles.
So far, these issues have not proven problematic in practice.
Interaction with Other Parts of the Kernel
A node may have a hidden interaction with other components of the kernel outside of the netgraph subsystem, such as device hardware, kernel
protocol stacks, etc. In fact, one of the benefits of netgraph is the ability to join disparate kernel networking entities together in a
consistent communication framework.
An example is the socket node type which is both a netgraph node and a socket(2) in the protocol family PF_NETGRAPH. Socket nodes allow user
processes to participate in netgraph. Other nodes communicate with socket nodes using the usual methods, and the node hides the fact that it
is also passing information to and from a cooperating user process.
Another example is a device driver that presents a node interface to the hardware.
Node Methods
Nodes are notified of the following actions via function calls to the following node methods, and may accept or reject that action (by
returning the appropriate error code):
Creation of a new node
The constructor for the type is called. If creation of a new node is allowed, the constructor method may allocate any special resources it
needs. For nodes that correspond to hardware, this is typically done during the device attach routine. Often a global ASCII name corre-
sponding to the device name is assigned here as well.
Creation of a new hook
The hook is created and tentatively linked to the node, and the node is told about the name that will be used to describe this hook. The
node sets up any special data structures it needs, or may reject the connection, based on the name of the hook.
Successful connection of two hooks
After both ends have accepted their hooks, and the links have been made, the nodes get a chance to find out who their peer is across the
link, and can then decide to reject the connection. Tear-down is automatic. This is also the time at which a node may decide whether to
set a particular hook (or its peer) into the queueing mode.
Destruction of a hook
The node is notified of a broken connection. The node may consider some hooks to be critical to operation and others to be expendable:
the disconnection of one hook may be an acceptable event while for another it may effect a total shutdown for the node.
Preshutdown of a node
This method is called before real shutdown, which is discussed below. While in this method, the node is fully operational and can send a
``goodbye'' message to its peers, or it can exclude itself from the chain and reconnect its peers together, like the ng_tee(4) node type
does.
Shutdown of a node
This method allows a node to clean up and to ensure that any actions that need to be performed at this time are taken. The method is
called by the generic (i.e., superclass) node destructor which will get rid of the generic components of the node. Some nodes (usually
associated with a piece of hardware) may be persistent in that a shutdown breaks all edges and resets the node, but does not remove it.
In this case, the shutdown method should not free its resources, but rather, clean up and then call the NG_NODE_REVIVE() macro to signal
the generic code that the shutdown is aborted. In the case where the shutdown is started by the node itself due to hardware removal or
unloading (via ng_rmnode_self()), it should set the NGF_REALLY_DIE flag to signal to its own shutdown method that it is not to persist.
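The difference between an ordinary shutdown of a persistent node and a final one can be modelled as follows. This is a userland sketch with hypothetical names: the really_die field stands in for the NGF_REALLY_DIE flag, setting valid back to true stands in for NG_NODE_REVIVE(), and no real netgraph interface is called.

```c
#include <stdbool.h>

/* Userland model of the state relevant to shutdown. */
struct model_pnode {
    bool persistent;    /* node type wants to survive ordinary shutdowns */
    bool really_die;    /* models NGF_REALLY_DIE */
    bool valid;         /* node is usable */
    int  nhooks;        /* number of connected hooks */
};

/* Returns true if the node survives the shutdown. */
static bool
model_shutdown(struct model_pnode *n)
{
    n->nhooks = 0;              /* every edge is broken either way */
    if (n->persistent && !n->really_die) {
        n->valid = true;        /* stands in for NG_NODE_REVIVE() */
        return (true);
    }
    n->valid = false;           /* resources would be released here */
    return (false);
}
```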
Sending and Receiving Data
Two other methods are also supported by all nodes:
Receive data message
A netgraph queueable request item, usually referred to as an item, is received by this function. The item contains a pointer to an mbuf.
The node is notified on which hook the item has arrived, and can use this information in its processing decision. The receiving node
must always NG_FREE_M() the mbuf chain on completion or error, or pass it on to another node (or kernel module) which will then be
responsible for freeing it. Similarly, the item must be freed if it is not to be passed on to another node, by using the NG_FREE_ITEM()
macro. If the item still holds references to mbufs at the time of freeing then they will also be appropriately freed. Therefore, if
there is any chance that the mbuf will be changed or freed separately from the item, it is very important that it be retrieved using the
NGI_GET_M() macro that also removes the reference within the item. (Or multiple frees of the same object will occur.)
If it is only required to examine the contents of the mbufs, then it is possible to use the NGI_M() macro to both read and rewrite the mbuf
pointer inside the item.
If a developer needs to pass any meta information along with the mbuf chain, the mbuf_tags(9) framework should be used. Note that the old
netgraph-specific meta-data format is now obsolete.
The receiving node may decide to defer the data by queueing it in the netgraph NETISR system (see below). It achieves this by setting
the HK_QUEUE flag in the flags word of the hook on which that data will arrive. The infrastructure will respect that bit and queue the
data for delivery at a later time, rather than deliver it directly. A node may decide to set the bit on the peer node, so that its own
output packets are queued.
The node may elect to nominate a different receive data function for data received on a particular hook, to simplify coding. It uses the
NG_HOOK_SET_RCVDATA(hook, fn) macro to do this. The function receives the same arguments as the node-wide method, but it receives all
(and only) packets from that hook.
Receive control message
This method is called when a control message is addressed to the node. As with the received data, an item is received, with a pointer to
the control message. The message can be examined using the NGI_MSG() macro, or completely extracted from the item using the
NGI_GET_MSG() macro, which also removes the reference within the item. If the item still holds a reference to the message when it is freed
(using the NG_FREE_ITEM() macro), then the message will also be freed appropriately. If the reference has been removed, the node must
free the message itself using the NG_FREE_MSG() macro. A return address is always supplied, giving the address of the node that origi-
nated the message so a reply message can be sent anytime later. The return address is retrieved from the item using the NGI_RETADDR()
macro and is of type ng_ID_t. All control messages and replies are allocated with the malloc(9) type M_NETGRAPH_MSG; however, it is more
convenient to use the NG_MKMESSAGE() and NG_MKRESPONSE() macros to allocate and fill out a message. Messages must be freed using the
NG_FREE_MSG() macro.
If the message was delivered via a specific hook, that hook will also be made known, which allows the use of such things as flow-control
messages, and status change messages, where the node may want to forward the message out a hook other than the one on which it arrived.
The node may elect to nominate a different receive message function for messages received on a particular hook, to simplify coding. It
uses the NG_HOOK_SET_RCVMSG(hook, fn) macro to do this. The function receives the same arguments as the node-wide method, but it receives
all (and only) messages from that hook.
Much use has been made of reference counts, so that nodes being freed of all references are automatically freed, and this behaviour has been
tested and debugged to present a consistent and trustworthy framework for the ``type module'' writer to use.
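The ownership rule for items and mbufs given under ``Receive data message'' can be modelled in userland as follows. The types and functions here are hypothetical stand-ins: model_get_m() mimics the documented effect of NGI_GET_M() (take the mbuf and clear the item's reference) and model_free_item() mimics the documented effect of NG_FREE_ITEM() (free the item and whatever it still references). The sketch only shows why taking the mbuf this way avoids a double free.

```c
#include <stdlib.h>

/* Userland stand-ins for an mbuf chain and a netgraph item. */
struct model_mbuf { int len; };
struct model_item { struct model_mbuf *m; };

/* Models NGI_GET_M(): hand over the mbuf and clear the item's reference. */
static struct model_mbuf *
model_get_m(struct model_item *item)
{
    struct model_mbuf *m = item->m;

    item->m = NULL;
    return (m);
}

/* Models NG_FREE_ITEM(): free the item and any mbuf it still references. */
static void
model_free_item(struct model_item *item)
{
    free(item->m);      /* free(NULL) is a no-op */
    free(item);
}
```

After model_get_m() the caller owns the mbuf and must free it or pass it on; freeing the item afterwards leaves that mbuf untouched.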
Addressing
The netgraph framework provides an unambiguous and simple to use method of specifically addressing any single node in the graph. The naming
of a node is independent of its type, in that another node, or external component need not know anything about the node's type in order to
address it so as to send it a generic message type. Node and hook names should be chosen so as to make addresses meaningful.
Addresses are either absolute or relative. An absolute address begins with a node name or ID, followed by a colon, followed by a sequence of
hook names separated by periods. This addresses the node reached by starting at the named node and following the specified sequence of
hooks. A relative address includes only the sequence of hook names, implicitly starting hook traversal at the local node.
There are a couple of special possibilities for the node name. The name '.' (referred to as '.:') always refers to the local node. Also,
nodes that have no global name may be addressed by their ID numbers, by enclosing the hexadecimal representation of the ID number within
square brackets. Here are some examples of valid netgraph addresses:
.:
[3f]:
foo:
.:hook1
foo:hook1.hook2
[d80]:hook1
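The address forms listed above can be taken apart with a few lines of string handling. The following userland sketch uses hypothetical names (struct model_addr, model_parse_addr()) and arbitrary buffer sizes; it only splits an address into its node part and hook path, and does not check that an ID consists of hexadecimal digits or that hook names are well formed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Result of splitting a netgraph address. */
struct model_addr {
    bool absolute;      /* true if a node part (before ':') was given */
    bool by_id;         /* true if the node part had the form [hex] */
    char node[64];      /* node name, or the digits of the ID */
    char path[128];     /* dot-separated hook names, possibly empty */
};

/* Split `addr` into node and hook path. Returns 0, or -1 if malformed. */
static int
model_parse_addr(const char *addr, struct model_addr *out)
{
    const char *colon = strchr(addr, ':');
    const char *path = addr;
    size_t nlen;

    memset(out, 0, sizeof(*out));
    if (colon != NULL) {
        nlen = (size_t)(colon - addr);
        if (nlen == 0 || nlen >= sizeof(out->node))
            return (-1);
        out->absolute = true;
        if (addr[0] == '[') {
            if (nlen < 3 || addr[nlen - 1] != ']')
                return (-1);
            out->by_id = true;
            memcpy(out->node, addr + 1, nlen - 2);  /* strip brackets */
        } else {
            memcpy(out->node, addr, nlen);
        }
        path = colon + 1;
    }
    if (strlen(path) >= sizeof(out->path))
        return (-1);
    strcpy(out->path, path);
    return (0);
}
```

An address without a colon is treated as relative, so its node part is left empty and the whole string becomes the hook path.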
The following set of nodes might be created for a site with a single physical frame relay line having two active logical DLCI channels, with
RFC 1490 frames on DLCI 16 and PPP frames over DLCI 20:
     [type SYNC ]                  [type FRAME]                 [type RFC1490]
     [ "Frame1" ](uplink)<-->(data)[<un-named>](dlci16)<-->(mux)[<un-named>  ]
     [    A     ]                  [    B     ](dlci20)<---+    [     C      ]
                                                           |
                                                           |      [ type PPP ]
                                                           +>(mux)[<un-named>]
                                                                  [    D     ]
One could always send a control message to node C from anywhere by using the name ``Frame1:uplink.dlci16''. In this case, node C would also
be notified that the message reached it via its hook mux. Similarly, ``Frame1:uplink.dlci20'' could reliably be used to reach node D, and
node A could refer to node B as ``.:uplink'', or simply ``uplink''. Conversely, B can refer to A as ``data''. The address ``mux.data''
could be used by both nodes C and D to address a message to node A.
Note that this is only for control messages. In each of these cases, where a relative addressing mode is used, the recipient is notified of
the hook on which the message arrived, as well as the originating node. This allows the option of hop-by-hop distribution of messages and
state information. Data messages are only routed one hop at a time, by specifying the departing hook, with each node making the next routing
decision. So when B receives a frame on hook data, it decodes the frame relay header to determine the DLCI, and then forwards the unwrapped
frame to either C or D.
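The per-hop decision made by node B can be sketched in userland. The source does not say which header format the frames carry, so this example assumes the common two-octet frame relay address field, in which the first octet holds the upper six DLCI bits and the second octet holds the lower four. The function names are hypothetical, and the hook names are taken from the example graph above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Extract the 10-bit DLCI from an assumed two-octet address field.
 * Octet 0 carries the upper six bits in bits 7..2; octet 1 carries the
 * lower four bits in bits 7..4.
 */
static unsigned
model_dlci(const uint8_t hdr[2])
{
    return (((unsigned)(hdr[0] >> 2) << 4) | (unsigned)(hdr[1] >> 4));
}

/* Pick the outgoing hook name for a frame, or NULL to drop it. */
static const char *
model_route(const uint8_t *frame, size_t len)
{
    if (len < 2)
        return (NULL);
    switch (model_dlci(frame)) {
    case 16:
        return ("dlci16");      /* towards node C */
    case 20:
        return ("dlci20");      /* towards node D */
    default:
        return (NULL);          /* no logical channel configured */
    }
}
```

Only the departing hook is chosen here; what happens to the frame next is decided by whichever node sits on the other end of that hook.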
In a similar way, flow control messages may be routed in the reverse direction to outgoing data. For example, a ``buffer nearly full''
message from ``Frame1:'' would be passed to node B which might decide to send similar messages to both nodes C and D. The nodes would use
direct hook pointer addressing to route the messages. The message may have travelled from ``Frame1:'' to B as a synchronous reply, saving
time and cycles.
Netgraph Structures
Structures are defined in <netgraph/netgraph.h> (for kernel structures only of interest to nodes) and <netgraph/ng_message.h> (for message
definitions also of interest to user programs).
The two basic object types that are of interest to node authors are nodes and hooks. These two objects have the following properties that
are also of interest to the node writers.
struct ng_node
Node authors should always use the following typedef to declare their pointers, and should never actually declare the structure.
typedef struct ng_node *node_p;
The following properties are associated with a node, and can be accessed in the following manner:
Validity
A driver or interrupt routine may want to check whether the node is still valid. It is assumed that the caller holds a reference on
the node so it will not have been freed, however it may have been disabled or otherwise shut down. Using the NG_NODE_IS_VALID(node)
macro will return this state. Eventually it should be almost impossible for code to run in an invalid node but at this time that
work has not been completed.
Node ID (ng_ID_t)
This property can be retrieved using the macro NG_NODE_ID(node).
Node name
Optional globally unique name, NUL terminated string. If there is a value in here, it is the name of the node.
if (NG_NODE_NAME(node)[0] != '