/*
 * Copyright (c) 2011-2012 The DragonFly Project.  All rights reserved.
 *
 * This code is derived from software contributed to The DragonFly Project
 * by Matthew Dillon <dillon@dragonflybsd.org>
 * by Venkatesh Srinivas <vsrinivas@dragonflybsd.org>
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in
 *    the documentation and/or other materials provided with the
 *    distribution.
 * 3. Neither the name of The DragonFly Project nor the names of its
 *    contributors may be used to endorse or promote products derived
 *    from this software without specific, prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
 * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE
 * COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
 * INCIDENTAL, SPECIAL, EXEMPLARY OR CONSEQUENTIAL DAMAGES (INCLUDING,
 * BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
 * AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
 * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 */

#ifndef VFS_HAMMER2_NETWORK_H_
#define VFS_HAMMER2_NETWORK_H_

#ifndef _VFS_HAMMER2_DISK_H_
#include "hammer2_disk.h"
#endif

/*
 * Mesh network protocol structures.
 *
 * The mesh is constructed from point-to-point streaming links with varying
 * levels of interconnectedness, forming a graph.  When a link is established
 * link id #0 is reserved for link-level communications.  This link is used
 * for authentication, registration, ping, further link id negotiations,
 * spanning tree, and so on.
 *
 * The spanning tree forms a weighted shortest-path-first graph amongst
 * those nodes with sufficient administrative rights to relay between
 * registrations.  Each link maintains a full reachability set, aggregates
 * it, and retransmits via the shortest path.  However, leaf nodes (even leaf
 * nodes with multiple connections) can opt not to be part of the spanning
 * tree and typically (due to administrative rights) their registrations
 * are not reported to other leafs.
 *
 * All message responses follow the SAME PATH that the original message
 * followed, but in reverse.  This is an absolute requirement since messages
 * expecting replies record persistent state at each hop.
 *
 * Message state is handled by the CREATE, DELETE, REPLY, and ABORT
 * flags.  Message state is typically recorded at the end points and
 * at each hop until a DELETE is received from both sides.
 *
 * One-way messages such as those used by spanning tree commands are not
 * recorded.  These are sent with no flags set.  Aborts and replies are not
 * possible.
 *
 * A normal message with no persistent state is sent with CREATE|DELETE and
 * the response is returned with REPLY|CREATE|DELETE.  A normal command can
 * be aborted by sending an ABORT message to the msgid that is in progress.
 * An ABORT sent by the originator must still wait for the reply from the
 * target and, since we've already sent the DELETE with our CREATE|DELETE,
 * may also cross the REPLY|CREATE|DELETE message in the opposite direction.
 * In this situation the message state has been destroyed on the target and
 * the target ignores the ABORT (because CREATE is not set, and
 * differentiated from one-way messages because ABORT is set).
 *
 * A command which has persistent state must maintain a persistent message,
 * for example a lock or cache state request.  A persistent message is
 * initiated with just CREATE and the initial response is returned with
 * REPLY|CREATE.  Successive messages are sent with no flags and responses
 * with just REPLY.  The DELETE flag acts like a half-close (and degenerately
 * works the same as it does for normal messages).  This flag can be set
 * in the initial command or any successive command, and in the initial reply
 * or in any successive reply.  The recorded message state is destroyed when
 * both sides have sent a DELETE.
 *
 * Aborts for persistent messages work in the same fashion as they do for
 * normal messages, except that the target can also initiate an ABORT
 * by using ABORT|REPLY.  The target has one restriction, however: it cannot
 * send an ABORT with the CREATE flag set (i.e. as the initial reply),
 * because if the originator reuses the msgid the originator would not
 * then be able to determine that the ABORT is associated with the previous
 * session and not the new session.
 *
 * If a link failure occurs any active or persistent messages will be
 * auto-replied to the originator, and auto-aborted to the target.
 *
 * Additional features:
 *
 * ABORT+CREATE - This may be used to make a non-blocking request.
 *                The target receives the normal command and is free
 *                to ignore the ABORT flag, but may use it as an
 *                indication that a non-blocking request is being
 *                made.  The target must still reply to the message,
 *                of course.  Works for normal and persistent messages
 *                but does NOT work for one-way messages (because
 *                ABORT alone without recorded msgid state has to be
 *                ignored).
 *
 * ABORT        - ABORT messages are allowed to bypass input queues.
 *                Normal ABORTs are sent without the DELETE flag,
 *                even for normal messages which had already set the
 *                DELETE flag in the initial message.  This allows
 *                the normal DELETE half-close operation to proceed,
 *                so an ABORT is basically advisory and the originator
 *                must still wait for a reply.  Aborts are also
 *                advisory when sent by targets.
 *
 *                ABORT messages cannot be used with one-way messages
 *                as this would cause such messages to be ignored.
 *
 * ABORT+DELETE - This is a special form of ABORT that allows the
 *                recorded message state on the sender and on all
 *                hops the message is relayed through to be destroyed
 *                on the fly, as if a two-way DELETE had occurred.
 *                It will cause an auto-reply or auto-abort to be
 *                issued as if the link had been lost, but allows
 *                the link to remain live.
 *
 *                This form is basically like a socket close(),
 *                where you aren't just sending an EOF but you are
 *                completely aborting the request in both directions.
 *
 *                This form cannot be used with CREATE as that could
 *                generate a false reply if msgid is reused and
 *                crosses the abort over the wire.
 *
 *                ABORT messages cannot be used with one-way messages
 *                as this would cause such messages to be ignored.
 *
 * SUBSTREAMS   - Persistent messages coupled with the fact that
 *                all commands and responses run through a single
 *                chain of relays over reliable streams allows one
 *                to treat persistent message updates as a data
 *                stream and use the DELETE flag or an ABORT to
 *                indicate EOF.
 *
 *
 *                  NEGOTIATION OF {source} AND {target}
 *
 * In this discussion 'originator' describes the original sender of a message
 * and not the relays in between, while 'sender' describes the last relay.
 * The two mean the same thing only when the originator IS the last relay.
 *
 * The {source} field is sender-localized.  The sender assigns this field
 * based on which connection the message originally came from.  The initial
 * message as sent by the originator sets source=0.  This also means that a
 * leaf connection will always send messages with source=0.
 *
 * The {source} field must be re-localized at each hop, since messages
 * coming from multiple connections to a node will use conflicting
 * {source} values.  This can lead to linkid exhaustion which is discussed
 * a few paragraphs down.
 *
 * The {target} field is sender-allocated.  Messages sent to {target} are
 * preceded by a FORGE message to {target} which associates a registration
 * with {target}, or UNFORGE to delete the association.
 *
 * The msgid field is 32 bits (remember some messages have long-lived
 * persistent state so this is important!).  One-way messages always use
 *
 * Because {source} must be re-localized at each hop it is possible to run
 * out of link identifiers.  At the same time we want to allow millions of
 * client/leaf connections, and 'millions' is a lot bigger than 65535.
 *
 * We also have a problem with the persistent message state...  If a single
 * client's vnode cache has a million vnodes that can represent a million
 * persistent cache states.  Multiply by a million clients and ... oops!
 *
 * To solve these problems leafs connect into protocol-aggregators rather
 * than directly to the cluster.  The linkid and core message protocols only
 * occur within the cluster and not by the leafs.  A leaf can still connect
 * to multiple aggregators for redundancy if it desires but may have to
 * pick and choose which inodes go where since acquiring a cache state lock
 * over one connection will cause conflicts to be invalidated on the other.
 * In other words, there are limitations to this approach.
 *
 * A protocol aggregator takes any number of connections and aggregates
 * the operations down to a single linkid.  For example, this means that
 * the protocol aggregator is responsible for maintaining all the cache
 * state and performing crunches to reduce the overall amount of state
 * down to something the cluster core can handle.
 */

/*
 * All message headers are 32-byte aligned and sized (all command and
 * response structures must be 32-byte aligned), and all transports must
 * support message headers up to HAMMER2_MSGHDR_MAX.  The msg structure
 * can handle up to 8160 bytes but to keep things fairly clean we limit
 * message headers to 2048 bytes.
 *
 * Any in-band data is padded to a 32-byte alignment and placed directly
 * after the extended header (after the higher-level cmd/rep structure).
 * The actual unaligned size of the in-band data is encoded in the aux_bytes
 * field in this case.  Maximum data sizes are negotiated during registration.
 *
 * Use of out-of-band data must be negotiated.  In this case bit 31 of
 * aux_bytes will be set and the remaining bits will contain information
 * specific to the out-of-band transfer (such as DMA channel, slot, etc).
 *
 * (must be 32 bytes exactly to match the alignment requirement and to
 * support pad records in shared-memory FIFO schemes)
 */
struct hammer2_msg_hdr {
	uint16_t	magic;		/* sanity, synchronization, endian */
	uint16_t	icrc1;		/* base header crc (from &salt on) */
	uint32_t	salt;		/* random salt helps crypto/replay */

	uint16_t	source;		/* source linkid */
	uint16_t	target;		/* target linkid */
	uint32_t	msgid;		/* message id */

	uint32_t	cmd;		/* flags | cmd | hdr_size / 32 */
	uint16_t	error;		/* error field */

	uint16_t	icrc2;		/* extended header crc (after base) */
	uint16_t	aux_bytes;	/* aux data descriptor or size / 32 */
	uint32_t	aux_icrc;	/* aux data iscsi crc */
};

typedef struct hammer2_msg_hdr hammer2_msg_hdr_t;

#define HAMMER2_MSGHDR_MAGIC		0x4832
#define HAMMER2_MSGHDR_MAGIC_REV	0x3248
#define HAMMER2_MSGHDR_CRCOFF		offsetof(hammer2_msg_hdr_t, salt)
#define HAMMER2_MSGHDR_CRCBYTES		(sizeof(hammer2_msg_hdr_t) -	\
					 HAMMER2_MSGHDR_CRCOFF)
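
/*
 * Illustrative sketch only, not part of the protocol definition.  The magic
 * field is described as serving sanity, synchronization, and endian
 * detection, and HAMMER2_MSGHDR_MAGIC_REV is the byte-swapped magic.  One
 * plausible receiver-side check, assuming a swapped magic means the peer
 * uses the opposite byte order, is shown here.  All ex_* names are
 * hypothetical.
 */
```c
#include <stdint.h>

/* Copied from the definitions above so the sketch compiles on its own. */
#define HAMMER2_MSGHDR_MAGIC		0x4832
#define HAMMER2_MSGHDR_MAGIC_REV	0x3248

/* Hypothetical result codes, not part of the protocol. */
enum ex_magic_result { EX_MAGIC_NATIVE, EX_MAGIC_SWAPPED, EX_MAGIC_BAD };

/* Classify the magic field as read in host byte order. */
static inline enum ex_magic_result
ex_check_magic(uint16_t magic)
{
	if (magic == HAMMER2_MSGHDR_MAGIC)
		return EX_MAGIC_NATIVE;
	if (magic == HAMMER2_MSGHDR_MAGIC_REV)
		return EX_MAGIC_SWAPPED;
	return EX_MAGIC_BAD;
}

/* Byte-swap helpers a receiver could apply to the remaining fields. */
static inline uint16_t
ex_bswap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

static inline uint32_t
ex_bswap32(uint32_t v)
{
	return ((v & 0x000000FFU) << 24) | ((v & 0x0000FF00U) << 8) |
	       ((v & 0x00FF0000U) >> 8) | ((v & 0xFF000000U) >> 24);
}
```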

/*
 * Administrative protocol limits.
 */
#define HAMMER2_MSGHDR_MAX		2048	/* msg struct max is 8192-32 */
#define HAMMER2_MSGAUX_MAX		65536	/* msg struct max is 2MB-32 */
#define HAMMER2_MSGBUF_SIZE		(HAMMER2_MSGHDR_MAX * 4)
#define HAMMER2_MSGBUF_MASK		(HAMMER2_MSGBUF_SIZE - 1)

/*
 * The message (cmd) field also encodes various flags and the total size
 * of the message header.  This allows the protocol processors to validate
 * persistency and structural settings for every command simply by
 * switch()ing on the (cmd) field.
 */
#define HAMMER2_MSGF_CREATE		0x80000000U	/* msg start */
#define HAMMER2_MSGF_DELETE		0x40000000U	/* msg end */
#define HAMMER2_MSGF_REPLY		0x20000000U	/* reply path */
#define HAMMER2_MSGF_ABORT		0x10000000U	/* abort req */
#define HAMMER2_MSGF_AUXOOB		0x08000000U	/* aux-data is OOB */
#define HAMMER2_MSGF_FLAG2		0x04000000U
#define HAMMER2_MSGF_FLAG1		0x02000000U
#define HAMMER2_MSGF_FLAG0		0x01000000U
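
/*
 * Illustrative sketch only.  The header comment says message state is
 * recorded from CREATE until a DELETE has been seen from both sides, and
 * that one-way messages (no flags) and stray ABORTs carry no recorded
 * state.  The tracker below is one way an end point or relay might model
 * that lifecycle.  struct ex_msg_state and ex_msg_track() are hypothetical
 * names, and ABORT handling (including ABORT+DELETE) is deliberately left
 * out.
 */
```c
#include <stdbool.h>
#include <stdint.h>

/* Copied from the flag definitions above so the sketch compiles alone. */
#define HAMMER2_MSGF_CREATE		0x80000000U
#define HAMMER2_MSGF_DELETE		0x40000000U
#define HAMMER2_MSGF_REPLY		0x20000000U
#define HAMMER2_MSGF_ABORT		0x10000000U

/* Hypothetical per-msgid record. */
struct ex_msg_state {
	bool	active;		/* CREATE seen, state is recorded */
	bool	cmd_deleted;	/* originator side has sent DELETE */
	bool	rep_deleted;	/* target side has sent DELETE */
};

/*
 * Run one message's flags through the record.  Returns true if recorded
 * state exists after the message has been processed.
 */
static inline bool
ex_msg_track(struct ex_msg_state *st, uint32_t cmd)
{
	if (!st->active) {
		/* Only a command carrying CREATE starts a transaction. */
		if ((cmd & HAMMER2_MSGF_CREATE) == 0 ||
		    (cmd & HAMMER2_MSGF_REPLY) != 0)
			return false;
		st->active = true;
		st->cmd_deleted = false;
		st->rep_deleted = false;
	}
	if (cmd & HAMMER2_MSGF_DELETE) {
		/* DELETE is a half-close for the direction that sent it. */
		if (cmd & HAMMER2_MSGF_REPLY)
			st->rep_deleted = true;
		else
			st->cmd_deleted = true;
	}
	if (st->cmd_deleted && st->rep_deleted)
		st->active = false;	/* both sides closed */
	return st->active;
}
```
/*
 * With this model a normal CREATE|DELETE command keeps state only until
 * the REPLY|CREATE|DELETE arrives, while a persistent message stays
 * recorded across any number of flagless updates.
 */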

#define HAMMER2_MSGF_FLAGS		0xFF000000U	/* all flags */
#define HAMMER2_MSGF_PROTOS		0x00F00000U	/* all protos */
#define HAMMER2_MSGF_CMDS		0x000FFF00U	/* all cmds */
#define HAMMER2_MSGF_SIZE		0x000000FFU	/* N*32 */
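
/*
 * Illustrative sketch only.  Given the four masks above, a received cmd
 * word can be split into its flag bits, protocol, command number, and
 * header size.  The shift of 8 for the command number is inferred from the
 * HAMMER2_MSGF_CMDS mask, and the multiply by 32 from the "N*32" size
 * encoding.  struct ex_cmd_fields and ex_cmd_decode() are hypothetical.
 */
```c
#include <stdint.h>

/* Copied from the mask definitions above so the sketch compiles alone. */
#define HAMMER2_MSGF_FLAGS		0xFF000000U
#define HAMMER2_MSGF_PROTOS		0x00F00000U
#define HAMMER2_MSGF_CMDS		0x000FFF00U
#define HAMMER2_MSGF_SIZE		0x000000FFU

/* Hypothetical decoded view of the cmd word. */
struct ex_cmd_fields {
	uint32_t	flags;		/* flag bits, left in place */
	uint32_t	proto;		/* protocol bits, left in place */
	uint32_t	cmd;		/* command number, shifted down */
	uint32_t	hdr_bytes;	/* header size in bytes */
};

static inline struct ex_cmd_fields
ex_cmd_decode(uint32_t cmd)
{
	struct ex_cmd_fields f;

	f.flags = cmd & HAMMER2_MSGF_FLAGS;
	f.proto = cmd & HAMMER2_MSGF_PROTOS;
	f.cmd = (cmd & HAMMER2_MSGF_CMDS) >> 8;
	f.hdr_bytes = (cmd & HAMMER2_MSGF_SIZE) * 32;
	return f;
}
```
/*
 * The largest encodable header is 255 * 32 = 8160 bytes, which matches the
 * limit quoted for the msg structure.
 */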

#define HAMMER2_MSGF_CMDSWMASK		(HAMMER2_MSGF_CMDS |	\
					 HAMMER2_MSGF_SIZE |	\
					 HAMMER2_MSGF_PROTOS |	\

#define HAMMER2_MSG_PROTO_LNK		0x00000000U
#define HAMMER2_MSG_PROTO_DBG		0x00100000U
#define HAMMER2_MSG_PROTO_CAC		0x00200000U
#define HAMMER2_MSG_PROTO_QRM		0x00300000U
#define HAMMER2_MSG_PROTO_BLK		0x00400000U
#define HAMMER2_MSG_PROTO_VOP		0x00500000U

/*
 * Message command constructors, sans flags
 */
#define HAMMER2_MSG_ALIGN		32
#define HAMMER2_MSG_ALIGNMASK		(HAMMER2_MSG_ALIGN - 1)
#define HAMMER2_MSG_DOALIGN(bytes)	(((bytes) + HAMMER2_MSG_ALIGNMASK) & \
					 ~HAMMER2_MSG_ALIGNMASK)
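
/*
 * Illustrative sketch only.  HAMMER2_MSG_DOALIGN rounds a byte count up to
 * the next multiple of 32, which is how in-band data ends up padded to the
 * header alignment.  The same arithmetic is written here as functions with
 * hypothetical ex_* names so the rounding and the resulting pad length can
 * be checked in isolation.
 */
```c
#include <stddef.h>

#define EX_MSG_ALIGN	32	/* mirrors HAMMER2_MSG_ALIGN */

/* Round bytes up to the next multiple of EX_MSG_ALIGN. */
static inline size_t
ex_msg_doalign(size_t bytes)
{
	return (bytes + (EX_MSG_ALIGN - 1)) & ~(size_t)(EX_MSG_ALIGN - 1);
}

/* Number of pad bytes that follow `bytes` of unaligned payload. */
static inline size_t
ex_msg_padding(size_t bytes)
{
	return ex_msg_doalign(bytes) - bytes;
}
```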
#define HAMMER2_MSG_HDR_ENCODE(elm)	((sizeof(struct elm) +		\
					  HAMMER2_MSG_ALIGNMASK) /	\

#define HAMMER2_MSG_LNK(cmd, elm)	(HAMMER2_MSG_PROTO_LNK |	\
					 HAMMER2_MSG_HDR_ENCODE(elm))

#define HAMMER2_MSG_DBG(cmd, elm)	(HAMMER2_MSG_PROTO_DBG |	\
					 HAMMER2_MSG_HDR_ENCODE(elm))

#define HAMMER2_MSG_CAC(cmd, elm)	(HAMMER2_MSG_PROTO_CAC |	\
					 HAMMER2_MSG_HDR_ENCODE(elm))

#define HAMMER2_MSG_QRM(cmd, elm)	(HAMMER2_MSG_PROTO_QRM |	\
					 HAMMER2_MSG_HDR_ENCODE(elm))

#define HAMMER2_MSG_BLK(cmd, elm)	(HAMMER2_MSG_PROTO_BLK |	\
					 HAMMER2_MSG_HDR_ENCODE(elm))

#define HAMMER2_MSG_VOP(cmd, elm)	(HAMMER2_MSG_PROTO_VOP |	\
					 HAMMER2_MSG_HDR_ENCODE(elm))

/*
 * Link layer ops basically talk to just the other side of a direct
 * connection.
 *
 * PAD   - One-way message on link-0, ignored by target.  Used to
 *         pad message buffers on shared-memory transports.  Not
 *         typically used with TCP.
 *
 * AUTHn - Authenticate the connection, negotiate administrative
 *         rights & encryption, protocol class, etc.  Only PAD and
 *         AUTH messages (not even PING) are accepted until
 *         authentication is complete.  This message also identifies
 *
 * PING  - One-way message on link-0, keep-alive, run by both sides
 *         typically 1/sec on idle link, link is lost after 10 seconds
 *
 * HSPAN - One-way message on link-0, host-spanning tree message.
 *         Connection and authentication status is propagated using
 *         these messages on a per-connection basis.  Works like SPAN
 *         but is only used for general status.  See the hammer2
 *
 * SPAN  - One-way message on link-0, spanning tree message adds,
 *         drops, or updates a remote registration.  Sent by both
 *         sides, delta changes only.  Visibility into remote
 *         registrations may be limited and received registrations
 *         may be filtered depending on administrative controls.
 *
 *         A multiply-connected node maintains SPAN information on
 *         each link independently and then retransmits an aggregation
 *         of the shortest-weighted path for each registration to
 *         all links when a received change adjusts the path.
 *
 *         The leaf protocol also uses this to make a PFS available
 *         to the cluster (e.g. on-mount).
 */
#define HAMMER2_LNK_PAD		HAMMER2_MSG_LNK(0x000, hammer2_msg_hdr)
#define HAMMER2_LNK_PING	HAMMER2_MSG_LNK(0x001, hammer2_msg_hdr)
#define HAMMER2_LNK_AUTH	HAMMER2_MSG_LNK(0x010, hammer2_lnk_auth)
#define HAMMER2_LNK_HSPAN	HAMMER2_MSG_LNK(0x011, hammer2_lnk_hspan)
#define HAMMER2_LNK_SPAN	HAMMER2_MSG_LNK(0x012, hammer2_lnk_span)
#define HAMMER2_LNK_ERROR	HAMMER2_MSG_LNK(0xFFF, hammer2_msg_hdr)
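
/*
 * Illustrative sketch only.  The SPAN description says a multiply-connected
 * node keeps SPAN information per link and retransmits the
 * shortest-weighted path for each registration.  The lookup below shows
 * the selection step under simple assumptions: struct ex_span, its fields,
 * and ex_span_best() are hypothetical and do not reflect the layout of
 * hammer2_lnk_span, which is not defined in this part of the file.
 */
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one advertisement received on one link. */
struct ex_span {
	uint32_t	reg_id;		/* registration being advertised */
	uint16_t	linkid;		/* link the advertisement came in on */
	uint32_t	weight;		/* advertised path weight */
};

/*
 * Return the index of the lowest-weight entry for reg_id, or -1 if the
 * registration is not reachable.  Ties go to the earliest entry.
 */
static inline int
ex_span_best(const struct ex_span *tab, size_t n, uint32_t reg_id)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; ++i) {
		if (tab[i].reg_id != reg_id)
			continue;
		if (best < 0 || tab[i].weight < tab[(size_t)best].weight)
			best = (int)i;
	}
	return best;
}
```
/*
 * A relay could compare the result before and after applying a received
 * delta and retransmit only when the best entry changed.
 */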

/*
 * Debug layer ops operate on any link
 *
 * SHELL - Persist stream, access the debug shell on the target
 *         registration.  Multiple shells can be operational.
 */
#define HAMMER2_DBG_SHELL	HAMMER2_MSG_DBG(0x001, hammer2_dbg_shell)

struct hammer2_dbg_shell {
	hammer2_msg_hdr_t	head;
};

typedef struct hammer2_dbg_shell hammer2_dbg_shell_t;

/*
 * Cache layer ops operate on any link, link-0 may be used when the
 * directly connected target is the desired registration.
 *
 * LOCK - Persist state, blockable, abortable.
 *
 *        Obtain cache state (MODIFIED, EXCLUSIVE, SHARED, or INVAL)
 *        in any of four domains (TREE, INUM, ATTR, DIRENT) for a
 *        particular key relative to cache state already owned.
 *
 *        TREE   - Affects entire sub-tree at the specified element
 *                 and will cause existing cache state owned by
 *                 other nodes to be adjusted such that the request
 *                 can be granted.
 *
 *        INUM   - Only affects inode creation/deletion of an existing
 *                 element or a new element, by inumber and/or name.
 *                 Typically can be held for very long periods of time
 *                 (think the vnode cache), directly relates to
 *                 hammer2_chain structures representing inodes.
 *
 *        ATTR   - Only affects an inode's attributes, such as
 *                 ownership, modes, etc.  Used for lookups, chdir,
 *                 open, etc.  mtime has no effect.
 *
 *        DIRENT - Only affects an inode's attributes plus the
 *                 attributes or names related to any directory entry
 *                 directly under this inode (non-recursively).  Can
 *                 be retained for medium periods of time when doing
 *
 *        This function may block and can be aborted.  You may be
 *        granted cache state that is more broad than the state you
 *        requested (e.g. a different set of domains and/or an element
 *        at a higher layer in the tree).  When quorum operations
 *        are used you may have to reconcile these grants to the
 *        lowest common denominator.
 *
 *        In order to grant your request either you or the target
 *        (or both) may have to obtain a quorum agreement.  Deadlock
 *        resolution may be required.  When doing it yourself you
 *        will typically maintain an active message to each master
 *        node in the system.  You can only grant the cache state
 *        when a quorum of nodes agree.
 *
 *        The cache state includes transaction id information which
 *        can be used to resolve data requests.
 */
#define HAMMER2_CAC_LOCK	HAMMER2_MSG_CAC(0x001, hammer2_cac_lock)

/*
 * Quorum layer ops operate on any link, link-0 may be used when the
 * directly connected target is the desired registration.
 *
 * COMMIT - Persist state, blockable, abortable
 *
 *          Issue a COMMIT in two phases.  A quorum must acknowledge
 *          the operation to proceed to phase-2.  Message-update to
 *          proceed to phase-2.
 */
#define HAMMER2_QRM_COMMIT	HAMMER2_MSG_QRM(0x001, hammer2_qrm_commit)

/*
 * General message errors
 *
 *	0x00 - 0x1F	Local iocomm errors
 *	0x20 - 0x2F	Global errors
 */
#define HAMMER2_MSG_ERR_UNKNOWN		0x20

union hammer2_any {
	char			buf[HAMMER2_MSGHDR_MAX];
	hammer2_msg_hdr_t	head;
};

typedef union hammer2_any hammer2_any_t;

#endif