Internet Engineering Task Force R. Perlman
INTERNET DRAFT Sun Microsystems
C-Y Lee
Nortel
A. Ballardie
Research Consultant
J. Crowcroft
UCL
August 1998
A Design for Simple, Low-Overhead Multicast
<draft-perlman-simple-multicast-00.txt>
Status of This Memo
This document is an Internet Draft, Internet Drafts are working
documents of the Internet Engineering Task Force (IETF), its Areas,
and its Working Groups. Note that other groups may also distribute
working documents as Internet Drafts.
Internet Drafts are draft documents valid for a maximum of six
months. Internet Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use Internet
Drafts as reference material, or to cite them other than as a
``working draft'' or ``work in progress.''
Please check the I-D abstract listing contained in each Internet
Draft directory to learn the current status of this or any other
Internet Draft.
Abstract
This paper describes how a lot of the complexity and overhead of
multicast can go away if a slightly different approach is taken.
Approaches like this have been proposed in the past, but the design
has not been carried through to completion. This paper describes the
approach and then compares it with other approaches.
1.0 Introduction
The basic idea is that a multicast group is created by generating:
- a name, suitable for lookup by humans,
- a multicast address (i.e., a class D IP multicast address)
Expires February 1999 [Page 1]
Internet Draft A Design for Simple Multicast August 1998
- a distinguished node known as the "core"
Endnodes look up the group through some sort of directory, e.g., DNS
or SDR. The look-up is based on name, but the information retrieved
is the multicast address and the core address.
Endnodes then join the group by sending a special join message
towards the core, creating state in the routers along the path. The
result is a tree of shortest paths from the core to each member,
where only the routers along the tree need to have any state about
the group.
1.1 Very simplified Review of PIM-SM and BGMP
Originally there was the idea, presented in CBT, of forming a
multicast group by choosing a distinguished node, the "core", and
having all members join by sending special join messages towards the
core. The routers along the path keep state about which ports are in
the group. If a router along the path of the join already has state
about that group the join does not proceed further. Instead that
router just "grafts" the new limb onto the tree. The result is a tree
of shortest paths from the core, with only the routers along the path
knowing anything about that group.
Later modifications included:
- "optimizing" the tree by forming a tree for each sender. The way
this is done, each node could independently decide whether the volume
of traffic from a particular source was worth joining a per-source
tree. The result was that there were two possible trees for traffic
from a particular source for group M, the shared tree and the source
tree. To prevent loops, the shared tree had to be unidirectional,
i.e., to send to the shared tree, you'd have to unicast to the core.
- the assumption is that routers have to be able to figure out,
solely based on the multicast address, who the core was. This
resulted in a protocol whereby "core-capable" routers continuously
advertise themselves, all routers keep track of the current set of
live core-capable routers, and there is some sort of hash to map a
multicast address to one of the set of core-capable routers. This
advertisement protocol is confined to within a domain.
- Because the core-capable advertisements (mercifully) do not
travel through the entire Internet, but are instead confined to a
domain, there needed to be a protocol whereby routers could know the
mapping of multicast address to core, even in other domains. The
proposal (BGMP) was to have each domain acquire (somehow, a current
area for research) a set of addresses. These addresses would be
advertised through the interdomain routing protocol. Therefore they
had to be allocated in blocks so as to attempt not to place too much
of a burden on the interdomain routing protocol.
Expires February 1999 [Page 2]
Internet Draft A Design for Simple Multicast August 1998
We use the term "aggregatable" to mean that the routing protocol can
summarize large numbers of addresses in a small amount of space.
2.0 The Design
The design we propose is most similar in spirit to the original idea
of CBT, because it does not have:
- per source trees
- the necessity for routers to know the mapping between multicast
addresses and cores
2.1 Address Allocation
The only constraint on the multicast addresses in our proposal is
that they be unique in time and space. There is no need for the
addresses to be aggregatable, since they will not be advertised in
routing protocols. It is far simpler to allocate addresses, and you
can do with a smaller number of them, if there are no further
constraints on the addresses than that they be unique in time (and
space).
The way to allocate addresses is to have a bunch of address
allocation servers sprinkled throughout the Internet, each with a
block of addresses. Anyone that wants to form a group finds any one
of the address allocation servers, asks for an address and gives an
amount of time, say 1 day. After 1 day (plus handwave for clock skew,
etc.,) that address can be reallocated to someone else.
There can be a hierarchy of these servers. At the top, there might be
10 of them, each with 1/10 of the address space. Anyone could ask one
of them for a single address, or for a block of addresses. If you
want to be a server handing out addresses, you can request a block of
addresses, and then clients can request individual addresses from
you. If you run out, you can ask for more.
Having a hierarchy with lots of servers, rather than a few servers
with millions of addresses each, has the following advantages:
- no performance bottlenecks
- it's likely a server will be nearby
- servers don't have to keep track of millions of individual
addresses, each one with a different timer as to when it can be re-
allocated.
Aggregation wastes addresses, since if you don't overestimate, you
will run out and have to acquire more blocks, which will cause
fragmentation of the space (more intervals to have to advertise).
Since in our scheme the addresses will not get advertised in any
routing protocol, there is no necessity to have them have any
Expires February 1999 [Page 3]
Internet Draft A Design for Simple Multicast August 1998
topological significance or any other constraint that would waste the
addresses. Therefore there will effectively be more addresses, and
less likelihood they will "run out".
If we are worried about running out even though in our proposal they
do not need to be aggregatable, then we could reserve some to be
limited in scope, say within a domain. If you wanted to create a
group where you know that all members are in that domain, you'd get
one of the "local" addresses. Since the same block of local addresses
can be used simultaneously within different domains, this makes a lot
more addresses. Routers at the domain boundary should set up filters
to make sure that packets with local addresses don't "leak" out. This
concept is known as "administratively scoped addresses".
2.2 Creating the Multicast Group
There should be a "multicast group creation" utility.
To create a group you pick a name that humans will recognize the
group by. You input that into the multicast group creation utility,
along with information about how long the group will live, and a
logical choice for the core address. If no core is specified, the
default is to have the node that created the group be the core. The
utility finds an address allocation server, gets an address, and then
registers the group and core address into the directory.
2.3 Joining a Group
To join a group, you browse the directory to find the appropriate
name, and then tell a "multicast join" utility that you'd like to
join that group. The utility looks up the address and core address.
Then it sends a special control packet, known as a "join" message,
from your node. This message gets sent in the same direction as a
unicast message to the core address would be sent, but it creates
state in the routers along the path, to know which ports are in the
group. If a router receives a join for multicast address M, and it
already has state for M, then it merely adds that port to its set of
ports for M, and does not forward the join further.
The result is a tree of shortest paths from the core to each member.
Each router on the tree has a database of (M, {ports}) that tells it,
for group M, which ports are in the tree.
2.4 Transmitting to multicast group M
Expires February 1999 [Page 4]
Internet Draft A Design for Simple Multicast August 1998
If you are a member of the group, you simply transmit an IP packet
with destination address M. A router that receives a packet with
multicast address M checks its multicast database. If it knows about
M, it checks if the port it received it on is in its database. If
not, it drops the packet. If so, it forwards the packet onto all the
other ports listed in its database for M. The result is a bi-
directional tree.
If you are not a member of the group but want to transmit to the
group, you unicast to the core.
2.5 Interdomain Groups
Everything described above works just as well in the interdomain
case. You join a group by sending a join message to the core
address. No matter what domain the core is in, it has an IP address
and IP unicast routing already knows how to reach it.
2.6 Backbones
There might be concern about a backbone ISP that might need to keep
state about billions of groups. It is extremely unlikely that so many
groups would exist, especially over such a wide geographic area.
However, if there really is that concern, the solution is tunnels
between boundary routers in the backbone. A tunnel can be thought of
as a point-to-point link between the two routers. Traffic is sent
across the tunnel by adding an additional IP header, with the
destination address in the outer header being the address of the
other tunnel endpoint.
The simplest strategy is to assume a full mesh of tunnels between
every pair of boundary routers in that backbone. So for instance, if
there are 100 boundary routers for that backbone, each would maintain
99 tunnels, to each of the 99 other backbone boundary routers. Each
of those are treated as a "port".
When receiving a join, backbone boundary router R1 checks its routing
database to see which backbone boundary router leads to the core
address, say R7. It then forwards the join along the tunnel it
maintains between itself and R7. In its multicast group state, it
makes an entry for the multicast address M, and the set of ports
(including tunnel) included in the tree for M.
In this way the interior nodes of the backbone do not keep any state
about individual groups.
It is not necessary to maintain all the tunnels. They can be formed
on an as-needed basis. In fact, there is no reason for a tunnel to be
Expires February 1999 [Page 5]
Internet Draft A Design for Simple Multicast August 1998
"maintained" at all. If R1 determines that multicast M should be
forwarded on a tunnel to R7, all it has to do is remember "R7", and
when it receives a packet with destination address M, it tunnels it
to R7.
2.7 Per-Source Trees
Don't bother. The most logical thing to optimize is the overhead to
the network to deliver a packet through the tree. The ideal tree is a
minimum weight spanning tree, but no proposals have attempted to try
to calculate a minimum weight spanning tree. It is most likely too
difficult to do so, though a simple algorithm for creating a minimum
weight or near-minimum weight tree would certainly be a welcome
advance in IP multicast.
A per source tree does not use less overhead to deliver the message
than the tree rooted at the core. The only thing more optimal about a
source tree than the shared core tree is the delay from the time the
source transmits to the time everyone receives the message. It is
unlikely that an application that is so delay sensitive that it would
work with a per-source tree but not with the shared core tree would
work at all if there are members located any significant distance
away from the source.
If there were an application that happens to be in a topology where
all members are close enough if per-source trees are used, but the
members are not close enough if the shared tree is used, then it is
certainly possible to explicitly form multiple trees in that case.
But this is an extremely rare scenario, and we should not require the
network to keep n times as much state for all groups, with a mind-
bogglingly complex protocol for switching from shared to per-source
trees, for the convenience of the very rare case.
2.8 Detecting Failure of the tree
This can be done with a simple scheme of keep-alive timers sent when
there has been no traffic for some amount of time. That way failures
can be detected and corrected when they occur.
If a router stops receiving keep-alives from the port away from the
core, it removes that port from the tree. If it stops receiving
keep-alives from the port towards the core, it rejoins the tree. If
a member stops receiving keep-alives, it rejoins the tree.
2.9 Detecting failures of the core
Ideally, the node selected as the core should be robust or a
redundant server may be installed as the hardware backup core.
Expires February 1999 [Page 6]
Internet Draft A Design for Simple Multicast August 1998
To deal with core failures there is no one best way because it
depends on engineering tradeoffs:
- speed of recovery after a core failure
- state in routers
- bandwidth use
To give an example of tradeoffs, consider an extreme example of an
application for which every packet must be delivered without delay.
Such an application cannot tolerate waiting for unicast routing to
notice a link failure and reroute, or react to lack of receipt of
keep-alive messages. In this case, the ideal solution is to
consciously create multiple groups and transmit every message on all
of the trees. This uses a lot of bandwidth since all data is sent on
multiple trees, but there is no other way to accommodate an
application that must have that degree of availability.
For applications that can tolerate some amount of time after failure
to rebuild a new tree, there can be a backup group created, and nodes
would join the backup group if the core for the original group is
unreachable.
In the case where it is reasonable to have state in the routers, the
backup group could be formed proactively, but not used for data
unless the first group fails. The routers would have to maintain
state about two groups, and perhaps incur a modest bandwidth overhead
for keepalives to maintain the health of the backup tree, but there
would be no need to send all data on both trees as it would in the
first example application.
Decisions about whether to create multiple groups proactively can be
made on a per-application basis.
3.0 FAQs and answers
3.1 What if the core is an endnode? Endnodes can't forward packets.
A tree is a tree. It doesn't matter who the core is. Once the tree is
formed, the traffic pattern is the same no matter which node is the
core. If an endnode has only a single link to the network, it will
never forward traffic. If it receives traffic on that link, it does
not have any "other" links to forward the traffic to, so the single-
link core just receives or generates multicast like any other endnode
would do.
3.2 How should the core be chosen for an optimal tree?
Expires February 1999 [Page 7]
Internet Draft A Design for Simple Multicast August 1998
If the core is a member of the group, the tree will be reasonably
optimal, and certainly from the viewpoint of how expensive it will be
for the network to deliver the packet, the core tree is as likely to
be good as any per-source tree.
3.3 IP is already deployed in the endnodes. How can we change all those
kernels?
There is no need to change the kernels. This can be deployed as an
application layer process which sends the special IP packet which is
the join message. There is no need to modify IGMP. As a matter of
fact, IGMP is not needed for this proposal.
IGMP is still useful, as is, for dense mode multicast, and this
proposal does not require removing IGMP or modifying IGMP. So there
are no difficult kernel modifications to the IP stack as a result of
this proposal.
3.4 Won't BGMP allow policy control, somehow?
There is the belief that since BGP allows routing policies, that BGMP
will somehow allow policies, though what exactly it is supposed to be
doing and how it would do it is not exactly well thought out at this
point.
We claim that each border router can be individually configured with
policies which it can individually enforce, and accomplish anything
that might have been accomplished with BGMP. This can be done with
the same sorts of packet filters that are already done in firewalls.
4.0 Security/Policy considerations
This section discusses various problems we might be trying to solve,
and proposed solutions.
4.1 Hiding data from unauthorized receivers
A motivation, perhaps, is to limit delivery to those receivers that
have "paid their bill".
One method of doing this is to try to ensure that the routing
infrastructure does not deliver data to unauthorized receivers. This
is not the right way to do this. The right way to hide data from the
unauthorized receivers is to encrypt the data, making sure that only
authorized receivers are given the key. Then there is no reason to
have the routing infrastructure attempt to prevent unauthorized nodes
from receiving the transmitted data, since it will do them no good
(it will be encrypted).
Expires February 1999 [Page 8]
Internet Draft A Design for Simple Multicast August 1998
There has recently been work on key distribution for multicast, and
there are schemes for efficiently changing the shared group key
periodically, or when a new member joins (if it's not allowed to see
previous data), or when a member leaves (if it is not allowed to see
data after it leaves). These schemes are interesting, but rather than
describing them here, for the purpose of this paper we can assume
that distributing a shared secret key to all authorized receivers is
a solved problem.
The basic idea is that there is a group moderator that a member has
to check in with, first authenticating and then proving somehow that
he is authorized to join the group. Then the group moderator gives
that new member the key.
Making key changes very efficient is accomplished, in the recent work
on multicast key distribution, by having a hierarchy of keys, and
giving each member log(n) keys, so that changing the key only
involves log(n) work rather than having to individually contact each
member and give them a new group key.
4.2 Preventing unauthorized transmitters from cluttering the group with
data
The basic solution is to have authorized transmitters "sign" the
packet (need not be a public key signature), and have various filter
points reject unauthorized packets.
There are three possible scenarios:
a) all authorized group members are trusted to transmit, or at least
not to transmit when they should not.
In this case, the single shared group secret key will suffice. Each
packet can include a MIC (message integrity code), for instance, a
keyed MD with the group secret. Anyone that knows the group secret
can verify the MIC and discard the packet. If routers don't check the
MIC, then receivers will receive the spam, but will be able to
recognize it as bogus and discard it. If some selected routers are
given the shared key (say the firewall at the entrance to your
domain), then it can discard bogus packets before they clutter the
bandwidth of your domain.
b) there are "receiver-only" members that are not trusted to refrain
from transmitting, but they need to be able to recognize bogus
packets injected by unauthorized transmitters
In this case, we can't base the MIC on a secret key known to the
receivers. Instead we'd use public key technology. All the members,
Expires February 1999 [Page 9]
Internet Draft A Design for Simple Multicast August 1998
and any routers that will be doing filtering, are given the
"authorized transmitter public key". All authorized transmitters are
given the authorized transmitter private key. They digitally sign
packets, and receivers and selected routers can recognize and discard
bogus packets.
c) there are authorized transmitters, but you don't want them to
impersonate each other
In this case, we can't have them all use the same private key.
Instead, we'd still have a single group public key that would need to
be distributed to all members and selected routers, but each
authorized transmitter would use an individual private key to sign
packets. There would be a group moderator that would know the group
private key. It would authorize a node to be a transmitter by signing
a certificate authorizing that transmitter's key to transmit packets
to this group.
In all of these cases, packets can be filtered, either at the
receivers, or earlier. If routers do not have the bandwidth to check
every packet for digital signatures, then they can "spot check", and
only start paying a lot of attention to a particular multicast
group's messages if it starts seeing bogus packets. Also, if there
are multiple routers they can share the load, and there are several
variants of this:
- each router checks a random percentage of packets. In this way,
most bogus packets will have been discarded before they reach the
receivers
- the routers along the path coordinate as to which packets each
will handle. For instance, there can be a simple hash function, and
each of the k routers on the path takes the ones that hash to one of
the values mod k.
- at the entrance to the domain, a bit in the packet is flipped
indicating the packet has not been verified. Then any router in the
domain that has spare cycles can check packets that have the bit set,
and either discard the packet (if it's bogus) or clear the bit so
other routers (and receivers) need not waste cycles testing the
packet. (assuming we're trying to protect ourselves from people
outside our domain, so we trust all the nodes within the domain).
5.0 Overhead Compared to other schemes
Address allocation is much simpler. There is no need for addresses to
be aggregatable. There is no reason to have BGMP to pass multicast
Expires February 1999 [Page 10]
Internet Draft A Design for Simple Multicast August 1998
address blocks around in the interdomain protocol. The only routing
is based on the core address, which is a unicast IP address, which IP
already needs to be able to route towards. Passing around multicast
address ranges through BGMP uses bandwidth, CPU, and memory in the
routers, not to mention pages of specifications, implementation
effort, and more things to read and understand. With our proposal
none of that is necessary.
There is no reason, as in PIM-SM, to have core-capable routers
advertise themselves, and to have some sort of hash scheme so that
every router can determine, based on the multicast address, which of
the core-capable routers would be the core for that multicast
address. It is expensive to have all core-capable routers advertising
all the time. It is very complex to have an algorithm whereby a
router from that set is chosen in an identical manner by all routers,
and to hope that the same core will be chosen at all times by all
routers even if the set of core capable routers is different when one
makes a decision vs. another router making the decision.
In PIM-SM, the core is basically a router chosen at random, and there
is no reason to believe it will be conveniently located near the
other members of the group. So therefore the shared tree is likely to
be very suboptimal, leading to the desire for per-source trees. In
this scheme, since the core will be a member of the group or a node
consciously chosen by whoever was creating the group, the shared tree
is likely to be as good a tree as any of the per-source trees,
certainly according to the metric of total cost to the network of
delivering the data.
Using a single tree per group saves the network k times as much
state, assuming on the average, k transmitters per group.
Another possible scheme is Dave Cheriton's Express model, where every
group is a combination of multicast address and source address. The
multicast address, in other words, is unique to the source. Then
address allocation becomes even simpler than in our scheme. However,
the problem with this model is that for something like a conference
call every member needs to know all the possible transmitters so they
can join a separate group for each possible transmitter. If a new
member joins the group, somehow all the other members have to be
alerted in order to join the new tree. Also, this model requires the
network to keep k times as much state, i.e., a separate tree for
every possible transmitter.
6.0 Summary
Although this proposal is similar to CBT and PIM-SM, it differs
because:
Expires February 1999 [Page 11]
Internet Draft A Design for Simple Multicast August 1998
- routers do not need to figure out which core goes with which
multicast address
- the core is a member of the group, or explicitly chosen for that
group, so the shared tree will be fairly good
- use a single shared tree per group, though extra groups can be
formed in the rare cases where a shared tree is not good enough
(i.e., don't bother with dynamically formed per-source trees)
- the shared tree can be bi-directional (and therefore more
efficient)
- no need for a protocol for core-capable routers to advertise
- no need for an algorithm that will hash a multicast address to
the choice among the set of core-capable routers
- no need to pass multicast address ranges in the interdomain
routing protocol
7. Acknowledgments
We would like to thank Brad Cain and Ross Callon for their comments.
Tony Ballardie would like to thank British Telecom Plc, and 3Com
Corporation for assisting with funding his work.
References
[STATIC_MCAST] M. Ohta, J. Crowcroft
Static Multicast, Internet-Draft, March 1998
[DNS_RP] DNS Based RP Placement scheme
Dino Farinacci's presentation in the MBONED WG,
40th IETF Meeting
[Cheriton] IDMR Mailing List discussion
[IGMP] Cain, Deering, Thyagarajan
Internet Group Management Protocol, Version 3,
Internet-Draft, November 1998
[CBT] Ballardie, Cain, Zhang, Core Based Tree Multicast
Routing, Internet-Draft, March 1998
[PIMSM] Estrin, Farinacci, Helmy, Thaler, Deering, Handley,
Jacobson, Liu, Sharma, and Wei.
Protocol independent multicast-sparse mode (PIM- SM)
Specification, RFC-2117, June 1997
[BGMP] Thaler, Estrin, Meyers
Expires February 1999 [Page 12]
Internet Draft A Design for Simple Multicast August 1998
Border Gateway Multicast Protocol Specification,
Internet-Draft, March 1998
[MASC] Estrin, Handley, Kumar, Thaler
Multicast Address Set Clain Protocol
Internet-Draft, November 1997
Authors' Addresses
Radia Perlman
Sun Microsystems Laboratories
2 Elizabeth Drive
Chelmsford, MA 01824
Radia.Perlman@sun.com
Cheng-Yin Lee
Nortel (Northern Telecom), Ltd.
PO Box 3511, Station C
Ottawa, ON K1Y 4H7, Canada
leecy@nortel.ca
Tony Ballardie
Research Consultant
aballardie@acm.org
Department of Computer Science
University College London
Gower Street
London, WC1E 6BT
UK
J.Crowcroft@cs.ucl.ac.uk
Expires February 1999 [Page 13]