Friday, 18th May 2012.

Posted on Tuesday, 26th July 2005 by sean

Route aggregation happens where multiple routes are rolled up into a series of shorter prefixes (i.e. four /24s become a /22). The idea is to reduce the number of routes that need to be processed. Often, the specifics don’t matter. If I’m advertising the four /24s as above and they all go to the same place, why does the whole Internet need to know about all four of them when a single /22 will describe the whole lot?
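As a rough illustration of the roll-up described above, here is a minimal Python sketch using the standard library's ipaddress module. The 10.20.x.0 prefixes are made up for the example and do not come from the lab network.

```python
import ipaddress

# Four contiguous /24s (illustrative addresses, not from the lab network).
specifics = [ipaddress.ip_network(f"10.20.{i}.0/24") for i in range(4, 8)]

# collapse_addresses() merges adjacent networks into the fewest
# prefixes that cover exactly the same address space.
aggregates = list(ipaddress.collapse_addresses(specifics))

print(aggregates)  # [IPv4Network('10.20.4.0/22')]
```

Note that this only works out to a single /22 because the first /24 sits on a /22 boundary; four arbitrary /24s would collapse into more than one prefix.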

This also leads to some global routing stability. If one of the /24s went down, I don’t have to tell the rest of the Internet since the /22 covers three other networks that are working. Yes, any packets destined to the downed network will return an error, but it’s a small price to pay to reduce the amount of flapping on the Internet.

While getting the size of the global route tables down is a great idea, sites that are multihomed face challenges with route aggregation, which means that BGP has to handle the exceptions.

To illustrate the last point, consider this example network. It’s similar to what we’ve been following along with except that I split AS1 into two separate ASNs for later examples.

Imagine that AS1 advertised a /24 prefix (1.1.42.0/24) to AS2 and AS3, and that AS2 aggregated this to a /16 (1.1.0.0/16) before sending it to AS3. Now, further imagine that both AS2 and AS3 were transit providers to the rest of the Internet, and advertised their prefixes out.

The rest of the Internet would see a /24 via AS3, and a /16 via AS2. Following the longest match rule, all packets to anything in 1.1.42.0/24 would go via AS3 even though R3 in AS1 indicated that both links were available.
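The longest match rule can be sketched in a few lines of Python. This is only a toy lookup over the two prefixes from the example; the "via AS2" and "via AS3" labels are placeholders rather than real next hops.

```python
import ipaddress

# The two routes the rest of the Internet sees in the example above.
routes = {
    ipaddress.ip_network("1.1.0.0/16"): "via AS2",
    ipaddress.ip_network("1.1.42.0/24"): "via AS3",
}

def lookup(destination):
    """Return the entry for the longest prefix covering destination, or None."""
    address = ipaddress.ip_address(destination)
    candidates = [net for net in routes if address in net]
    if not candidates:
        return None
    # The most specific (longest) prefix wins, regardless of who advertised it.
    best = max(candidates, key=lambda net: net.prefixlen)
    return routes[best]

print(lookup("1.1.42.10"))  # via AS3
print(lookup("1.1.7.1"))    # via AS2
```

Anything inside 1.1.42.0/24 always follows the /24, which is why AS2 never sees that traffic unless it also advertises the /24.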

The solution is that AS2 must also advertise the /24 in addition to the /16. It is not necessary to advertise all the component routes, just the ones that need to stand alone on the Internet. This is called “punching a hole”, and is often necessary for sites to be properly multihomed. This comes up in many cases where an AS got its address space from a provider rather than obtaining provider independent space from ARIN. In this example, it is possible that AS1 started off with address space from AS2 before becoming multihomed.

In the example above, R3 has several networks in the 10/8 space it can aggregate to R0 in AS4:

r0#show ip bgp 10.0.0.0/8 longer-prefixes
BGP table version is 7, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.1.1.0/24      10.50.0.2                              0 1 2 i
*> 10.2.2.0/24      10.50.0.2                              0 1 3 i
*> 10.3.3.0/24      10.50.0.2                0             0 1 i

Using “aggregate-address” under the BGP configuration on R3, we can force it to only advertise a summary route:

r3(config-router)#aggregate-address 10.0.0.0 255.0.0.0 summary-only 

r0#show ip bgp 10.0.0.0/8 longer-prefixes
BGP table version is 11, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.0.0.0         10.50.0.2                              0 1 i

R3 doesn’t discriminate when it sends out the summary, so R2 and R1 also see it:

r2#show ip bgp 10.0.0.0/8 longer-prefixes
BGP table version is 14, local router ID is 10.2.2.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.0.0.0         192.168.3.5                            0 1 i
*> 10.1.1.0/24      192.168.3.9              0    200      0 2 i
*> 10.2.2.0/24      0.0.0.0                  0         32768 i

Interestingly enough, R3 shows that it is suppressing the more specific routes, and has created a route to Null0 representing the aggregate (remember, anything more specific that exists will be in R3’s routing table, and won’t hit the Null0 bit bucket):

r3#show ip bgp
BGP table version is 16, local router ID is 10.3.3.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.0.0.0         0.0.0.0                            32768 i
s> 10.1.1.0/24      192.168.3.2              0             0 2 i
s> 10.2.2.0/24      192.168.3.6              0             0 3 i
s                   192.168.3.2                            0 2 3 i
s> 10.3.3.0/24      0.0.0.0                  0         32768 i
r3#show ip route
...
     10.0.0.0/8 is variably subnetted, 5 subnets, 2 masks
B       10.0.0.0/8 [200/0] via 0.0.0.0, 00:44:14, Null0
B       10.2.2.0/24 [20/0] via 192.168.3.6, 02:37:15
B       10.1.1.0/24 [20/0] via 192.168.3.2, 02:42:55
C       10.3.3.0/24 is directly connected, Loopback5
C       10.50.0.0/24 is directly connected, Ethernet0

In this case it’s fine because the intended goal is met, but if anything in the 10/8 space existed elsewhere we might find some bad routing.

R0 knows this is an aggregate:

r0#show ip bgp 10.0.0.0
BGP routing table entry for 10.0.0.0/8, version 11
Paths: (1 available, best #1, table Default-IP-Routing-Table)
  Not advertised to any peer
  1, (aggregated by 1 10.3.3.1)
    10.50.0.2 from 10.50.0.2 (10.3.3.1)
      Origin IGP, localpref 100, valid, external, atomic-aggregate, best

The problem, though, is that the AS-PATH to the aggregate doesn’t include any of the component ASNs. Since this is used for loop avoidance, routing loops can be introduced in larger systems.

The as-set option to the aggregate-address command carries the component ASNs forward in the AS-PATH of the aggregate:

r3(config)#router bgp 1
r3(config-router)#no  aggregate-address 10.0.0.0 255.0.0.0 summary-only
r3(config-router)# aggregate-address 10.0.0.0 255.0.0.0 summary-only as-set
r0#show ip bgp 10.0.0.0 255.0.0.0 longer-prefixes
...
   Network          Next Hop            Metric LocPrf Weight Path
*> 10.0.0.0         10.50.0.2                              0 1 {2,3} i
r0#show ip bgp 10.0.0.0
BGP routing table entry for 10.0.0.0/8, version 19
Paths: (1 available, best #1, table Default-IP-Routing-Table)
Flag: 0x820
  Not advertised to any peer
  1 {2,3}, (aggregated by 1 10.3.3.1)
    10.50.0.2 from 10.50.0.2 (10.3.3.1)
      Origin IGP, localpref 100, valid, external, best

The {2,3} shows that these ASes contributed to the aggregate. atomic-aggregate has also disappeared, since the path information is now included.
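To see why the missing ASNs matter for loop avoidance, here is a small Python sketch of the check a receiving router performs on the AS-PATH. The list-and-frozenset representation of the path is my own shorthand for the example, not how any router stores it.

```python
def accepts(local_asn, as_path):
    """Return False if local_asn already appears in the path (a loop).

    as_path is a list whose items are either an int (an ordinary
    AS-PATH entry) or a frozenset of ints (an AS_SET such as {2,3}).
    """
    for segment in as_path:
        members = segment if isinstance(segment, frozenset) else {segment}
        if local_asn in members:
            return False
    return True

plain_aggregate = [1]                      # path "1" from summary-only
as_set_aggregate = [1, frozenset({2, 3})]  # path "1 {2,3}" with as-set

print(accepts(2, plain_aggregate))   # True: AS2 cannot tell it contributed
print(accepts(2, as_set_aggregate))  # False: AS2 sees itself and drops the route
```

Without as-set, an AS that contributed a component route will happily accept the aggregate back; with as-set, the ordinary loop check catches it.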

The last thing I’ll try today is leaving out summary-only and putting in as-set:

r3#show ip bgp 10.0.0.0 255.0.0.0 longer-prefixes
BGP table version is 5, local router ID is 10.3.3.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.0.0.0         0.0.0.0                            32768 {2,3} i
*> 10.1.1.0/24      192.168.3.2              0             0 2 i
*> 10.2.2.0/24      192.168.3.6              0             0 3 i
*                   192.168.3.2                            0 2 3 i
*> 10.3.3.0/24      0.0.0.0                  0         32768 i
r0#show ip bgp 10.0.0.0 255.0.0.0 longer-prefixes
BGP table version is 30, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.0.0.0         10.50.0.2                              0 1 {2,3} i
*> 10.1.1.0/24      10.50.0.2                              0 1 2 i
*> 10.2.2.0/24      10.50.0.2                              0 1 3 i
*> 10.3.3.0/24      10.50.0.2                0             0 1 i

As expected, we see an aggregate and the specific routes.

Next up, controlling the behaviour of the specific routes to filter what goes out, and controlling the makeup of that aggregate.

Posted in Routing | Comments (0)

Posted on Friday, 22nd July 2005 by sean

BGP receives path updates from its peers, and then makes a decision as to which route is best. This is the route that is potentially transferred to the IP routing table and advertised to peers.

The synchronization rule also plays a part in this decision, and is investigated more closely below.

Cisco has the BGP Best Path Selection Algorithm well explained, so what follows here is only a summary.

I look at the decision process as having three phases:

  1. Sanity checks
  2. Policy checks
  3. Tie Breakers

Sanity Checks

These checks make sure that the route is a valid route.

Firstly, an iBGP route must be synchronized if synchronization is enabled. To be synchronized, the route (or a more specific component of it) must be known through an IGP on the router.

In the example, R3 and R0 are in AS1. R0 is advertising 10.0.0.0/24 into iBGP. With synchronization on R3, the route is seen in BGP but not moved to the routing table:

r3#show ip bgp 10.0.0.0 255.255.255.0 longer-prefixes
BGP table version is 5, local router ID is 10.3.3.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
* i10.0.0.0/24      10.50.0.1                0    100      0 i

r3#show ip route 10.0.0.0 255.255.255.0
% Subnet not in table

and therefore is not sent to peers:

r1#show ip route 10.0.0.0 255.255.255.0
% Subnet not in table

We could turn off synchronization with “no synchronization”, run an IGP between R0 and R3, or use a static:

r3(config)#ip route 10.0.0.0 255.255.255.0 10.50.0.1
r1#show ip route 10.0.0.0 255.255.255.0
Routing entry for 10.0.0.0/24
  Known via "bgp 2", distance 20, metric 0
  Tag 1, type external
  Last update from 192.168.3.1 00:00:02 ago
  Routing Descriptor Blocks:
  * 192.168.3.1, from 192.168.3.1, 00:00:02 ago
      Route metric is 0, traffic share count is 1
      AS Hops 1

In the real world, synchronization is usually disabled, and that has been the default since 12.2(8)T.

Next, the NEXT_HOP must be accessible. This is just common sense, since there is no point in sending packets to a next hop you can’t reach.

Cisco’s document lists a few more sanity checks, but these two are the important ones.
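The two sanity checks can be sketched roughly as follows. This is a simplification under my own assumptions: the function name and parameters are invented for the example, and the static route is lumped in with the IGP routes since that is how the lab satisfied the rule.

```python
import ipaddress

def passes_sanity_checks(prefix, next_hop, learned_via_ibgp,
                         synchronization, igp_routes, routing_table):
    """Rough model of the synchronization and NEXT_HOP checks."""
    network = ipaddress.ip_network(prefix)

    # Synchronization: an iBGP route needs the same prefix, or a more
    # specific piece of it, to be known outside of BGP.
    if learned_via_ibgp and synchronization:
        covered = any(ipaddress.ip_network(r).subnet_of(network)
                      for r in igp_routes)
        if not covered:
            return False

    # NEXT_HOP must be reachable through something in the routing table.
    hop = ipaddress.ip_address(next_hop)
    return any(hop in ipaddress.ip_network(r) for r in routing_table)

connected = ["10.3.3.0/24", "10.50.0.0/24"]

# R3 with synchronization on and no IGP or static: route is not used.
print(passes_sanity_checks("10.0.0.0/24", "10.50.0.1", True, True,
                           [], connected))
# After adding the static route for 10.0.0.0/24: route is usable.
print(passes_sanity_checks("10.0.0.0/24", "10.50.0.1", True, True,
                           ["10.0.0.0/24"], connected))
```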

Policy Checks

These are the checks that follow your routing policy. Note that the “shortest path” check is left to the end, letting you make manual interventions at many places.
Using the Border Gateway Protocol for Interdomain Routing describes this in more detail.

The following rules are used to decide between routes

  1. Highest weight. Remember weight is Cisco proprietary, and not transmitted as part of the route, so it is a decision local to the router made through a route map.
  2. Highest local preference. Same idea as weight, but this is carried within the AS
  3. Prefer local routes. To quote Cisco,

    Prefer the path that was locally originated via a network or aggregate BGP subcommand, or through redistribution from an IGP. Local paths sourced by network or redistribute commands are preferred over local aggregates sourced by the aggregate-address command

  4. Shortest AS-PATH length. Even though the number of AS hops is smaller, this doesn’t necessarily mean it’ll be faster
  5. Lowest origin type (IGP < EGP < incomplete)
  6. Lowest multi-exit discriminator (MED), AKA BGP metric. This is used to tell a neighbouring AS which path to take if there are multiple ones between the two ASes. Practically, providers let you set a community that alters their local preference, making the decision earlier in the process.
  7. eBGP over iBGP. AD is 20 vs 200
  8. Lowest IGP metric to NEXT_HOP

Tie Breakers

At this point, the candidate routes are similar, so tie breaker rules come into play.

  1. Oldest route. This minimizes flapping
  2. Lowest routerid of the peer advertising the route
  3. Lowest neighbour address
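One way to picture the policy checks and tie breakers together is as a single sort key, where the first field that differs decides the winner. The sketch below is a simplification and not Cisco's implementation: the Path fields are names I made up, MED is compared unconditionally (the real rule only compares it between paths from the same neighbouring AS), and the oldest-route rule is applied to every path.

```python
import ipaddress
from dataclasses import dataclass

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}

@dataclass(frozen=True)
class Path:
    as_path: tuple
    router_id: str
    neighbor: str
    weight: int = 0
    local_pref: int = 100
    locally_originated: bool = False
    origin: str = "igp"
    med: int = 0
    is_ebgp: bool = True
    igp_metric: int = 0
    age_seconds: int = 0

def preference_key(path):
    """Smaller tuples are better; "highest wins" fields are negated."""
    return (
        -path.weight,                 # highest weight
        -path.local_pref,             # highest local preference
        not path.locally_originated,  # prefer local routes
        len(path.as_path),            # shortest AS-PATH
        ORIGIN_RANK[path.origin],     # lowest origin type
        path.med,                     # lowest MED (simplified)
        not path.is_ebgp,             # eBGP over iBGP
        path.igp_metric,              # lowest IGP metric to NEXT_HOP
        -path.age_seconds,            # oldest route
        int(ipaddress.ip_address(path.router_id)),  # lowest router id
        int(ipaddress.ip_address(path.neighbor)),   # lowest neighbour address
    )

def best_path(paths):
    return min(paths, key=preference_key)

# Two paths to the same prefix with everything equal except AS-PATH length.
longer = Path(as_path=(1, 3), router_id="10.3.3.1", neighbor="192.168.3.1")
shorter = Path(as_path=(3,), router_id="10.2.2.1", neighbor="192.168.3.10")
print(best_path([longer, shorter]).as_path)  # (3,)
```

Giving the longer path a higher weight or local preference flips the result, since those fields are compared before AS-PATH length.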

Examples

This decision is made based on AS-Path length:

r1#show ip bgp 10.2.2.0
BGP routing table entry for 10.2.2.0/24, version 20
Paths: (2 available, best #2, table Default-IP-Routing-Table)
  Advertised to non peer-group peers:
  192.168.3.1
  1 3
    192.168.3.1 from 192.168.3.1 (10.3.3.1)
      Origin IGP, localpref 100, valid, external
      Community: no-export
  3
    192.168.3.10 from 192.168.3.10 (10.2.2.1)
      Origin IGP, metric 0, localpref 100, valid, external, best

We can alter this with either weight or local preference. I looked at this one when going over path attributes

Posted in Routing | Comments (1)

Posted on Wednesday, 20th July 2005 by sean

Networking is all about understanding the behaviour of the underlying protocols. Many of these protocols are deterministic, meaning they will always produce a predictable result for a given set of inputs.

This is achieved by using predictable values, such as MAC and IP addresses, and reproducible values such as cost and hop count. These numbers are compared to others to produce a decision.

For the exam (and in real life, of course), you have to remember how these numbers are compared. Is it the highest or the lowest? Here’s a summary.

Spanning Tree

Lowest Bridge ID on the network is the root bridge. The Bridge ID is the concatenation of a 16 bit priority followed by the MAC address. Note that a priority of 0 does not remove a device from the election; it is simply the lowest, and therefore best, possible priority.

Lowest cost to the root bridge is marked as the root port. Lowest port priority is used to break ties.
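The root bridge comparison can be sketched as a comparison of (priority, MAC) pairs. The switch names, priorities and MAC addresses below are invented for the example.

```python
def bridge_id(priority, mac):
    """Bridge ID as a comparable tuple: priority first, then the MAC."""
    return (priority, int(mac.replace(":", ""), 16))

def elect_root(bridges):
    """bridges maps a name to (priority, mac); the lowest Bridge ID wins."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))

# Hypothetical switches: sw3 wins on priority alone.
switches = {
    "sw1": (32768, "00:1a:2b:00:00:03"),
    "sw2": (32768, "00:1a:2b:00:00:01"),
    "sw3": (4096, "00:1a:2b:00:00:09"),
}
print(elect_root(switches))  # sw3
```

With equal priorities the MAC address decides, so removing sw3 from the example would make sw2 the root.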

General

Lowest administrative distance is used to select from identical prefixes (network & netmask) coming from different routing protocols. Also see Administrative Distance

OSPF

The router id is set as the highest IP address on any loopback interface. If no loopback interfaces exist, then the highest IP address on the router is used.

When electing the DR and BDR, the highest priority wins the coveted position of DR. The highest router id is used to break ties. The runner up wins BDR. A priority of 0 removes the router from the election. (Remember there is one election per segment, so one router could be the DR on one segment and the BDR, or neither, on another.)
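A simplified sketch of the comparison on a single segment follows. It assumes all routers come up at the same time; real OSPF does not pre-empt an existing DR, which this toy version ignores. The router ids and priorities are made up.

```python
import ipaddress

def elect_dr_bdr(routers):
    """routers maps router id -> priority on this segment.

    Returns (dr, bdr). Priority 0 routers are ineligible; the highest
    priority wins, with the highest router id breaking ties.
    """
    eligible = [
        (priority, int(ipaddress.ip_address(router_id)), router_id)
        for router_id, priority in routers.items()
        if priority > 0
    ]
    ranked = [router_id for _, _, router_id in sorted(eligible, reverse=True)]
    dr = ranked[0] if ranked else None
    bdr = ranked[1] if len(ranked) > 1 else None
    return dr, bdr

segment = {"10.0.0.1": 5, "10.1.1.1": 1, "10.2.2.1": 1, "10.3.3.1": 0}
print(elect_dr_bdr(segment))  # ('10.0.0.1', '10.2.2.1')
```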

Lowest cost is used to figure out the route. Cost is inversely proportional to bandwidth, so a higher bandwidth pipe will have a lower cost.
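As a worked example of the inverse relationship, the sketch below assumes the Cisco default reference bandwidth of 100 Mb/s, which is not stated in this post. The function name is mine.

```python
REFERENCE_BANDWIDTH_BPS = 100_000_000  # assumed default of 100 Mb/s

def ospf_cost(interface_bandwidth_bps):
    """Cost is the reference bandwidth divided by the interface bandwidth,
    truncated to an integer and never lower than 1."""
    return max(1, REFERENCE_BANDWIDTH_BPS // interface_bandwidth_bps)

print(ospf_cost(1_544_000))      # T1: 64
print(ospf_cost(10_000_000))     # 10 Mb/s Ethernet: 10
print(ospf_cost(1_000_000_000))  # Gigabit: 1
```

The floor of 1 is why, with this reference value, everything at 100 Mb/s and above looks the same to OSPF.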

In the Exstart phase of OSPF, the highest router id is used to determine the master.

Also see OSPF Neighbor States

IGRP/EIGRP

Lowest metric to a destination becomes the successor route.

BGP

The router id is set as the highest IP address on any loopback interface. If no loopback interfaces exist, then the highest IP address on the router is used. (Same as OSPF)

The selection process for a route then is:

Highest weight
Highest local preference
Locally originated routes are given preference
Shortest AS-PATH length
Lowest origin type (IGP < EGP < incomplete)
Lowest Multiple Exit Discriminator (AKA metric), only if the neighbouring AS is the same
eBGP over iBGP
Lowest IGP metric to the next hop
Oldest route (ie which did the router hear first)
Lowest router id
Lowest neighbour address

See BGP Best Path Selection Algorithm

RIP

The route with the lowest hop count is chosen as the route.

IS-IS

Highest priority, followed by highest MAC address, becomes the Designated IS (DIS), which represents the pseudonode

Added Aug 16, 2006

Multicast

PIM elects the designated router as the one with the highest IP address on the segment
IGMP elects the designated querier as the one with the lowest IP address on the segment

HSRP/VRRP/GLBP

HSRP uses the highest priority followed by the highest IP address to choose the active router. The runner up gets the standby position, the rest are listeners
VRRP first uses the owner of the virtual IP address, followed by the highest priority.
GLBP elects the Active Virtual Gateway as the one with the highest priority followed by the highest IP address.

Posted in General | Comments (0)

Posted on Sunday, 3rd July 2005 by sean

CCNPRecertification.com, and some other sites, have run off my cable modem since inception. The server started to become bogged down after time, so I thought it time to pay for some off site hosting.

Dedicated servers were in the $100US/month range, and the $5/mo cheapie hosting deals didn’t have the features I needed. Virtual Private Servers (VPS) caught my eye. The idea here is that the provider has one physical machine, which is split up into several virtual machines (much like VMWare). You get root on your own VPS and are isolated from other people, but don’t have to pay for your own box.

Whenever you get into a shared services arrangement you want to know how many people you’re sharing with. It’s quite possible to load several hundred people on a server and get pitiful performance all around. So I searched around for VPS providers and reviews of them.

Unfortunately I couldn’t find any decent VPS reviews (a big reason I’m writing this one). However I did manage to find a handful of people offering the service. Some had prices that were too good to be true ($10/month), which led me to think that they’d be loading the heck out of the servers. A couple refused to answer questions about their service (ie how much memory is in your server, how many people are on it). My favourite was Dot5 Hosting, who responded:

virtual server is also a sort of shared server and we do not disclose the specs of virtual server. And we assure that
you get equal resource of your share of server resources

I’m sure I’ll get my share, but what is that?

Finally, I came across www.GoDaddy.com , which is also the place I’ve been registering domain names for the past several years. The specs on the servers are currently:

  • Dual Intel Xeon 2.4GHz w/HyperThreading
  • 12GB DDR RAM (shared among all virtual machines)
  • 5 x 146GB Ultra 320 SCSI Hard Drives, RAID-5 (shared)

In addition, the server is limited to 60 people.

$35US/mo gets you 10G disk and 500GB transfer. Though you can pay a year in advance and get a 20% discount, I figured I’d give it a try for a couple of months and see how it goes.

After signing up, I was able to access my system through the control panel on GoDaddy.com. Total time from me submitting the form to logging in to my server was around 4 hours. Had I waited for the emails to prompt me for the next steps, though, it would have taken around a day.

The system itself runs RedHat 9, which is somewhat dated. However, the kernel is a specially compiled one for the VPS. It seems the VPS software is Virtuozzo, which looks fairly advanced. It’s possible to seamlessly migrate instances from one server to another and reallocate resources between the servers. Pretty slick when compared to VMWare and such.

Upon logging into the system, it looks just like a regular dedicated server except for some devices:

# mount
/dev/vzfs on / type reiserfs (rw,usrquota,grpquota)
# ifconfig
venet0    Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:127.0.0.1  P-t-P:127.0.0.1  Bcast:0.0.0.0  Mask:255.255.255.0
          UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
          RX packets:4393338 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4632432 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:540461382 (515.4 Mb)  TX bytes:1799679322 (1716.3 Mb)

venet0:0  Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:x.x.x.188  P-t-P:x.x.x.188  Bcast:x.x.x.188  Mask:255.255.255.255
          UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
          RX packets:4393338 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4632432 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:540461382 (515.4 Mb)  TX bytes:1799679322 (1716.3 Mb)

venet0:1  Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:x.x.x.204  P-t-P:x.x.x.204  Bcast:x.x.x.204  Mask:255.255.255.255
          UP BROADCAST POINTOPOINT RUNNING NOARP  MTU:1500  Metric:1
          RX packets:4393338 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4632432 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:540461382 (515.4 Mb)  TX bytes:1799679322 (1716.3 Mb)

You might note I have two IP addresses… After signing up and logging in to your server, you can request more addresses through the control panel at GoDaddy. I’m not sure how many you can get, I got one extra and there is still a link saying “ask for more IP addresses”.

In addition, I’ve got access to some 6 gigs of memory:

# free
             total       used       free     shared    buffers     cached
Mem:       6196656    6183684      12972          0    2203932     700208
-/+ buffers/cache:    3279544    2917112
Swap:     12586916    5256936    7329980

So, after a month I’m pretty happy. Performance is good, I’ve seen no downtime, and I’m satisfied with the price. That said, there are a few things that aren’t perfect.

  • The Plesk control panel is a pain to remove. I wanted to remove it because I don’t need it, and it was causing some dependency problems. The problem was that when I tried to remove the “psa” package, it tried to delete some rows from the MySQL database and some files seemed to be missing. I ended up downloading the Plesk RPMs, extracting the .sql file, running that, and removing the packages. Perhaps it was the order that I did things, maybe not.
  • You must use their relay server for email. I didn’t immediately notice this, and had wiped out their qmail install in favour of sendmail.
  • They do relay checks on you. I was seeing email from “emailsupport AT godaddy DOT com” being bounced due to relay, so I allowed relaying for the godaddy domain. A few hours later connections to their relay server were denied for me. After stopping the relaying, my access was restored in a few hours.
  • The Red Hat 9 is a bit of a pain due to its age. However, I set up yum at Fedora Legacy, who still support it (I’m downloading the recent PHP errata as I type this). The only trick is that you have to exclude both the kernel and iptables packages with exclude=kernel*,iptables in /etc/yum.conf.
  • The kernel has statically compiled modules in, and the list of iptables modules is limited. This means that some of the advanced iptables rules I’m used to (TCP state matching) weren’t available

In summary, I’d recommend GoDaddy’s Virtual Private Servers, and will continue to subscribe to mine.

BTW, if you decide to buy anything from GoDaddy, I’d appreciate you clicking through the link below which gives me a small cut and helps support this site.


www.godaddy.com

Posted in General | Comments (5)

Posted on Friday, 10th June 2005 by sean

Continuing on from the previous article, BGP Path Attributes, I’ll cover the remaining popular path attributes.

Multi Exit Discriminator

The MED is an optional non-transitive attribute, which is the metric of BGP. Though other protocols rely almost entirely on the metric to calculate routes, the BGP metric is only used when there are two exit points from one AS into another AS. In my example there are no places where this occurs. If AS2 and AS3 were combined into AS2, then MED could be used to influence how to get into AS1.

r3#show ip bgp
BGP table version is 5, local router ID is 10.3.3.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.1.1.0/24      192.168.3.2              0             0 2 i
*> 10.2.2.0/24      192.168.3.6              0             0 3 i
*                   192.168.3.2                            0 2 3 i
*> 10.3.3.0/24      0.0.0.0                  0         32768 i

Here, the MED (metric) is all zero. To set it on r3:

router bgp 1
 network 10.3.3.0 mask 255.255.255.0 route-map set-metric
route-map set-metric permit 10
 set metric 42

Here, the network is advertised after being passed through a route map, which sets the metric to 42. Indeed, we can see after a clear ip bgp * soft that it is the case:

r3#show ip bgp neighbors  192.168.3.2 advertised-routes
BGP table version is 6, local router ID is 10.3.3.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.2.2.0/24      192.168.3.6              0             0 3 i
*> 10.3.3.0/24      0.0.0.0                 42         32768 i

And on r1:

r1#show ip bgp
BGP table version is 40, local router ID is 10.1.1.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.3.3.0/24      192.168.3.1             42             0 1 i
...

If R1 had multiple routes to 10.3.3.0/24 through its own AS, then MED might be used to choose the route. Lowest MED wins.

Cisco has more detail on MED: How BGP Routers Use the Multi-Exit Discriminator for Best Path Selection

Just as a final point on MED: By setting MED you are attempting to influence the decision of how your neighbouring AS sends traffic to you (assuming you have two entry points), but the final decision remains with them.
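The condition attached to MED is easy to lose track of, so here is a small sketch of just that step. The dictionary keys are names I chose for the example, and the function deliberately returns None when MED has nothing to say.

```python
def prefer_by_med(path_a, path_b):
    """Return the path MED prefers, or None if MED does not decide.

    Each path is a dict with "neighbor_as" and "med" keys. MED is only
    compared when both paths come from the same neighbouring AS.
    """
    if path_a["neighbor_as"] != path_b["neighbor_as"]:
        return None
    if path_a["med"] == path_b["med"]:
        return None
    return min(path_a, path_b, key=lambda path: path["med"])

link_one = {"name": "link one", "neighbor_as": 1, "med": 42}
link_two = {"name": "link two", "neighbor_as": 1, "med": 10}
other_as = {"name": "other AS", "neighbor_as": 3, "med": 0}

print(prefer_by_med(link_one, link_two)["name"])  # link two
print(prefer_by_med(link_one, other_as))          # None
```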

Local Preference

This is a well-known discretionary attribute that is only sent within an AS, never to eBGP peers. If routers within an AS have multiple paths to a destination, local preference can be used to break the tie. Since it’s a preference, a higher preference wins.

In the example, R2 has multiple ways to get to R3:

r2>show ip bgp 10.3.3.0
BGP routing table entry for 10.3.3.0/24, version 55
Paths: (2 available, best #2, table Default-IP-Routing-Table)
  Advertised to non peer-group peers:
  192.168.3.9
  2 1
    192.168.3.9 from 192.168.3.9 (10.1.1.1)
      Origin IGP, localpref 100, valid, external
  1
    192.168.3.5 from 192.168.3.5 (10.3.3.1)
      Origin IGP, metric 0, localpref 100, valid, external, best

Without getting into how BGP makes decisions, the AS-path of “1” via R3 is better than going “2 1” via R1. However, local preference trumps AS-path, so if I want to use R1’s path to AS1, I’ll need to have R2 set local preference to something higher than the default of 100 for R1’s routes. (Note there are many ways to do it: per route, per neighbour, etc. Since I did the metric on a per route basis, I’ll do this one on a per neighbour basis.)

router bgp 3
 neighbor 192.168.3.9 remote-as 2
 neighbor 192.168.3.9 route-map setlocalpref in
ip as-path access-list 1 permit _2_
ip as-path access-list 2 permit .*
!
route-map setlocalpref permit 10
 match as-path 1
 set local-preference 200
!
route-map setlocalpref permit 20
 match as-path 2

This code applies a route map to routes coming in from r1. Anything passing through AS 2 (as-path acl 1) gets local pref set to 200 in clause 10. Any other routes (which there shouldn’t be, since this map is only applied to one peer) are matched (so they aren’t dropped), but no changes are made.
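To get a feel for what the two as-path access lists match, the underscore can be approximated in Python. This is only an emulation under my own assumptions: it treats the underscore as "start of string, end of string, space, comma, brace or parenthesis", and the helper name is invented.

```python
import re

# Rough stand-in for the underscore in an IOS as-path regular expression.
DELIMITER = r"(?:^|$|[ ,{}()])"

def as_path_matches(ios_pattern, as_path):
    """Return True if the (approximated) IOS pattern matches the AS-PATH."""
    python_pattern = ios_pattern.replace("_", DELIMITER)
    return re.search(python_pattern, as_path) is not None

print(as_path_matches("_2_", "2 1"))   # True: AS 2 is in the path
print(as_path_matches("_2_", "12 1"))  # False: AS 12 is not AS 2
print(as_path_matches(".*", "1"))      # True: matches any path
```

The delimiters are what stop _2_ from matching ASNs that merely contain the digit 2.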

For more information on BGP regular expressions, see Using Regular Expressions in BGP

Now,

r2#show ip bgp 10.3.3.0
BGP routing table entry for 10.3.3.0/24, version 7
Paths: (2 available, best #2, table Default-IP-Routing-Table)
  Advertised to non peer-group peers:
  192.168.3.5
  1
    192.168.3.5 from 192.168.3.5 (10.3.3.1)
      Origin IGP, metric 42, localpref 100, valid, external
  2 1
    192.168.3.9 from 192.168.3.9 (10.1.1.1)
      Origin IGP, localpref 200, valid, external, best

So even though the AS-path is longer, the route via r1 was chosen because of a higher local preference. Had we done this on R3, R0 would also see the local preference, though it should only see one route since R3 will only pass along its best route.

Local pref and MED do much the same thing (LP on a more global scale since it works across different ingress ASes). Often providers will allow you to tag your routes with a particular community (see below) that causes them to set local preference. As such, you’ll rarely use MED with an ISP.

Weight

This one really isn’t an attribute per se, since it is Cisco proprietary and never sent outside the local router. Other than this, it is the same as local preference. Highest weight wins. Weight is checked before local preference, so this attribute lets you override local pref on a per router basis.

By default, routes learned from another router have a weight of 0, and a route originated locally has a weight of 32768.

Atomic Aggregate and Aggregator

I’m going to have to come back to this another time after talking about route aggregation. Like other protocols, BGP lets an administrator aggregate routes. However, depending on how routes are aggregated, both the aggregate and some specific routes can be sent out at the same time! These attributes add information about how the route was aggregated.

If you can’t wait, check out Understanding Route Aggregation in BGP.

Communities

Communities are an addition to the original BGPv4 that let an administrator tag routes with extra information that can be used downstream. RFC 1997 defines a community as:

A community is a group of destinations which share some common property.

For example, a provider might tag all its customer routes (ie direct peers) to let the other customers filter (ie to only accept customer routes)

Similarly, a customer may tag routes according to the provider’s policy to set local preference on a route, or to mark it as non exportable.

There are four predefined communities: NO_EXPORT, NO_ADVERTISE, NO_EXPORT_SUBCONFED, and INTERNET. The first three tell the peer not to send a route outside the confederation, the local router, and the peer’s own (sub-)AS respectively. INTERNET routes are the default, meaning they can be exported anywhere.
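The effect of the three restrictive communities can be sketched as a single decision about whether a route may be sent to a given peer. The peer type labels ("ibgp", "confed-ebgp", "ebgp") are my own shorthand, and an AS that is not part of a confederation is treated as a confederation of one.

```python
def may_advertise(communities, peer_type):
    """Return True if a route carrying these communities may go to the peer.

    peer_type is "ibgp" (same AS), "confed-ebgp" (another sub-AS in the
    same confederation) or "ebgp" (a truly external peer).
    """
    if "NO_ADVERTISE" in communities:
        return False  # never leaves this router
    if "NO_EXPORT_SUBCONFED" in communities:
        return peer_type == "ibgp"  # stays inside the local (sub-)AS
    if "NO_EXPORT" in communities:
        return peer_type in ("ibgp", "confed-ebgp")  # stays inside the confederation
    return True  # no restriction, i.e. the INTERNET default

print(may_advertise({"NO_EXPORT"}, "ebgp"))  # False
print(may_advertise({"NO_EXPORT"}, "ibgp"))  # True
print(may_advertise(set(), "ebgp"))          # True
```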

A real world example of how communities are used can come from the Allstream RADB entry:

$ whois AS15290@whois.radb.net
...
remarks:    ***********************************************************
            CUSTOMER COMMUNITIES:
            Allstream Corp customers may choose to affect our
            local preference from their routers by setting their
            routes with the following BGP communities:
            Community   Definition               Description
            ---------   ----------               -----------
            (default)   Local preference = 100   customer
            15290:100   Local preference = 100   customer primary
            15290:90    Local preference = 90    customer back-up
            15290:30    Local preference = 30    customer fall-back
            If a multi-homed customer sets the community to
            15290:30,   Allstream Corp will prefer an announcement
            from a peer over directly connected customer
            ---------
            Customers may also choose to control their advertisements
            by setting thier routes with following BGP communities:
            15290:333 Don't send to Transit (default)
            15290:333 15290:100  Don't send to Transit LP=100
            15290:333 15290:90   Don't send to Transit LP=90
            15290:444 Send only to BGP downstream (default)
            15290:444 15290:100  Send only to BGP downstream LP=100
            15290:444 15290:90   Send only to BGP downstream LP=90
            -----------------------------------------------------------
            PUBLIC/PRIVATE PEERS:
            None (Default) Local pref=80
            15290:80 Set local pref=80 on AS15290
            15290:75 Set local pref=75 on AS15290
            -----------------------------------------------------------
            FILTERING POLICIES:
            - We do NOT acknowledge MEDs from any:
            - Public Peer
            - Private Peer
            - Downstream Customer
            - Allstream Corp filters RFC1918 and other such
            advertisements with prefixes longer than /24
            - Peering requests/infos should be sent to:
            peering@allstream.com
            - Public Route-Server is:
            route-server.east.allstream.com
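To make the customer table above concrete, here is a rough sketch of how a provider might turn those communities into a local preference. The numbers come straight from the RADB entry; the function name and the rule that the first recognized community wins are my assumptions, not something Allstream documents.

```python
# Values copied from the "CUSTOMER COMMUNITIES" part of the RADB entry.
CUSTOMER_LOCAL_PREF = {
    "15290:100": 100,  # customer primary
    "15290:90": 90,    # customer back-up
    "15290:30": 30,    # customer fall-back
}
DEFAULT_CUSTOMER_LOCAL_PREF = 100
DEFAULT_PEER_LOCAL_PREF = 80  # "None (Default) Local pref=80"


def customer_local_pref(communities):
    """Pick the local preference for a customer route.

    Assumption: the first community found in the table wins; anything
    else (such as 15290:333) is ignored here.
    """
    for community in communities:
        if community in CUSTOMER_LOCAL_PREF:
            return CUSTOMER_LOCAL_PREF[community]
    return DEFAULT_CUSTOMER_LOCAL_PREF
```

A route tagged 15290:30 ends up at 30, which is below the peer default of 80. That is why the entry says a peer’s announcement will be preferred over the directly connected customer in that case.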

Just as a quick example, R3 will tag its local route with NO_EXPORT, and we’ll see that R1 and R2 won’t learn about the route through the other peer.

Before:

r2#show ip bgp
...
   Network          Next Hop            Metric LocPrf Weight Path
*  10.3.3.0/24      192.168.3.5             42             0 1 i
*>                  192.168.3.9                   200      0 2 1 i

R3 config:

router bgp 1
 no synchronization
 bgp log-neighbor-changes
 network 10.3.3.0 mask 255.255.255.0
 neighbor 10.50.0.1 remote-as 1
 neighbor 10.50.0.1 next-hop-self
 neighbor 192.168.3.2 remote-as 2
 neighbor 192.168.3.2 send-community
 neighbor 192.168.3.2 route-map noexport out
 neighbor 192.168.3.6 remote-as 3
 neighbor 192.168.3.6 send-community
 neighbor 192.168.3.6 route-map noexport out
 no auto-summary
!
route-map noexport permit 10
 set community no-export

Back on R2:

r2#show ip bgp 10.3.3.0
BGP routing table entry for 10.3.3.0/24, version 14
Paths: (1 available, best #1, table Default-IP-Routing-Table, not advertised to EBGP peer)
Flag: 0x208
  Not advertised to any peer
  1
    192.168.3.5 from 192.168.3.5 (10.3.3.1)
      Origin IGP, metric 0, localpref 100, valid, external, best
      Community: no-export

(r1 has the same thing except the peer’s IP address is different).

So, we can see that r1 and r2 didn’t tell each other about the routes they know from AS1 because R3 tagged the routes with no-export.

Posted in Routing | Comments (0)

Posted on Wednesday, 8th June 2005 by sean

BGP Path attributes describe a prefix being advertised. A previous article mentioned AS-Path, which is an ordered list of the autonomous systems to the destination network. There are many more attributes that describe a route, letting downstream people make informed decisions about how to route a packet.

Path attributes are sent in an UPDATE message, and fall into one of four classes:

  • Well-known mandatory
  • Well-known discretionary
  • Optional transitive
  • Optional non-transitive

A further flag, “partial”, is used to denote an attribute that has been attached by someone downstream (ie not the originator).

Well-known attributes must be understood by all BGP-4 speakers and passed along to other peers (though the RFC permits modifying the contents). A mandatory attribute must be present in an update; a discretionary one doesn’t have to be.

Optional attributes, by contrast, don’t have to be supported. These types of attributes can be either transitive or non-transitive. If a BGP speaker does not understand an optional attribute, it passes the attribute along to its peers if it is transitive, or removes it if it is non-transitive. When it passes along a transitive attribute it doesn’t understand, it sets the partial bit if it wasn’t already set.
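These classes are carried in a one-byte flags field at the front of each attribute. The sketch below uses the bit positions as I read them in the BGP-4 RFC (optional, transitive, partial, extended length, from the top bit down). Note that the flags can’t tell mandatory from discretionary; that depends on the attribute type. The function names are made up for this example.

```python
# Flag bits in the attribute flags octet (high-order bit first).
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20
EXTENDED_LENGTH = 0x10  # length field is two octets instead of one


def describe_flags(flags):
    """Name the class of an attribute from its flags octet."""
    if not flags & OPTIONAL:
        kind = "well-known"
    elif flags & TRANSITIVE:
        kind = "optional transitive"
    else:
        kind = "optional non-transitive"
    if flags & PARTIAL:
        kind += ", partial"
    return kind


def forward_unrecognized(flags):
    """Handle an optional attribute this speaker does not understand.

    Returns the flags to send onward, or None if the attribute is dropped.
    """
    if not flags & OPTIONAL:
        raise ValueError("well-known attributes must always be understood")
    if flags & TRANSITIVE:
        return flags | PARTIAL  # pass it along, marked as partial
    return None  # non-transitive: quietly removed
```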

So, let’s go through some of the major path attributes, using the same example network from before:

Origin

Well known, mandatory: The origin of the route tells us if the route is internal to the originating AS. The legal values are:

  • i – Internal/IGP
  • e – External/EGP
  • ? – incomplete

Note that this attribute refers to the originating router, which is the router that advertised the route into BGP, and not the router receiving the route. That is, most routes will be internal, since they will have been advertised by the normal means of a network statement:

r2:
router bgp 3
 network 10.2.2.0 mask 255.255.255.0

On r0, we see this network as:

r0#show ip bgp 10.2.2.0
BGP routing table entry for 10.2.2.0/24, version 3
Paths: (1 available, best #1, table Default-IP-Routing-Table)
  Not advertised to any peer
  3
    192.168.3.6 from 10.50.0.2 (10.3.3.1)
      Origin IGP, metric 0, localpref 100, valid, internal, best

If this is changed to

r2:
router bgp 3
 redistribute connected

we see

r0#show ip bgp 10.2.2.0
BGP routing table entry for 10.2.2.0/24, version 12
Paths: (1 available, best #1, table Default-IP-Routing-Table)
  Not advertised to any peer
  3
    192.168.3.6 from 10.50.0.2 (10.3.3.1)
      Origin incomplete, metric 0, localpref 100, valid, internal, best

Since the route has been redistributed from another routing protocol (ie connected) before getting into BGP, the origin is considered incomplete.

Origin is used in the BGP decision process, which we’ll look at later.
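As a small preview, the three origin values are numbered 0 to 2 on the wire, and when everything earlier in the decision process ties, the lowest one wins. The sketch below only shows that ordering; the names are my own.

```python
from enum import IntEnum


class Origin(IntEnum):
    """ORIGIN attribute values; lower is preferred."""
    IGP = 0
    EGP = 1
    INCOMPLETE = 2


# The codes shown in the Path column of "show ip bgp".
ORIGIN_CODES = {Origin.IGP: "i", Origin.EGP: "e", Origin.INCOMPLETE: "?"}


def preferred_origin(a, b):
    """Return whichever origin wins when all earlier tie-breakers are equal."""
    return min(a, b)
```

So a route brought in with a network statement (IGP) beats the same route brought in by redistribute connected (incomplete), all else being equal.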

AS-Path

Well known, mandatory: This is the advertised path for a route.

r3#show ip bgp
BGP table version is 5, local router ID is 10.3.3.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*> 10.1.1.0/24      192.168.3.2              0             0 2 i
*> 10.2.2.0/24      192.168.3.6              0             0 3 i
*                   192.168.3.2                            0 2 3 i
*> 10.3.3.0/24      0.0.0.0                  0         32768 i

The final column, Path, shows the AS-Path for the route.

Each EBGP peer prepends their own AS to each route before sending it out. This has two implications:

  1. The AS-Path, when read left to right, is the path the packet will take from the sender to the receiver, with the destination AS being the rightmost value.
  2. Since the AS is prepended at the EBGP peer, the AS-path of a route within an AS won’t contain the AS itself.

This last one is more interesting. For example, look at the AS-path for routes from the perspective of r0:

r0>show ip b
   Network          Next Hop            Metric LocPrf Weight Path
*>i10.1.1.0/24      192.168.3.2              0    100      0 2 i
*>i10.2.2.0/24      192.168.3.6              0    100      0 3 i
*>i10.3.3.0/24      10.50.0.2                0    100      0 i

The path from AS1 to 10.1.1.0 is via AS2 (r1). However, the path to 10.3.3.0 has an empty AS-path, because it’s local to the AS. Indeed, looking at the detailed information on the route shows this:

r0>show ip bgp 10.3.3.0
BGP routing table entry for 10.3.3.0/24, version 5
Paths: (1 available, best #1, table Default-IP-Routing-Table)
  Not advertised to any peer
  Local
    10.50.0.2 from 10.50.0.2 (10.3.3.1)
      Origin IGP, metric 0, localpref 100, valid, internal, best

Next Hop

This well known mandatory attribute seems simple, but there are a few rules governing its use. For a route within the AS (iBGP route), it is the internal router originating the route. For eBGP routes, it is the address of the external peer.

Back to our example, anything coming from AS 2 will have R1’s address as the next hop. From r0 (AS1):

r0>show ip bgp regexp _2$
BGP table version is 12, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i10.1.1.0/24      192.168.3.2              0    100      0 2 i

This is on r0, yet the next hop is R1’s address in AS2! Therefore, all internal routers must have a route to every external peer’s address, or the next hop will be unreachable and the route won’t be placed into the IP routing table. Indeed, r0 has such a route:

r0>show ip route 192.168.3.2
Routing entry for 192.168.3.0/24
  Known via "static", distance 1, metric 0
  Routing Descriptor Blocks:
  * 10.50.0.2
      Route metric is 0, traffic share count is 1

Take that away, and no external routes will be shown, unless the next hop is changed to something reachable.
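The check the router is doing can be sketched in a few lines: a BGP route is only usable if some entry in the routing table covers its next hop. The function name is made up, and the tables passed in are just lists of prefixes for illustration.

```python
import ipaddress


def next_hop_reachable(next_hop, routing_table):
    """True if any prefix in routing_table covers the next hop address."""
    address = ipaddress.ip_address(next_hop)
    return any(
        address in ipaddress.ip_network(prefix) for prefix in routing_table
    )


# With the static route from r0 above, the eBGP next hop resolves.
with_static = next_hop_reachable("192.168.3.2", ["192.168.3.0/24"])

# Take the static route away and the BGP route can no longer be used.
without_static = next_hop_reachable("192.168.3.2", [])
```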

If we change the r0-r3 relationship somewhat on r3:

router bgp 1
  neighbor 10.50.0.1 next-hop-self

this tells r3 to change the next hop attribute to be itself, rather than the EBGP peer:

r0>show ip bgp regexp _2$
BGP table version is 21, local router ID is 10.0.0.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete

   Network          Next Hop            Metric LocPrf Weight Path
*>i10.1.1.0/24      10.50.0.2                0    100      0 2 i

To be continued in the next article…


Posted on Tuesday, 31st May 2005 by sean

Before I can get to BGP Path Attributes, I realized I have to cover the mechanics of BGP. The BGP 4 RFC covers it in detail, and Riverstone has good coverage, but I’ll summarize it below.

BGP State Machine

BGP goes through six states to establish an adjacency.

Idle – incoming connections are refused, and the system gets ready to start speaking BGP. After this is done (by way of a Start event), move to Connect.
Connect – a connection is made to the peer. Send a BGP OPEN message, and go to OpenSent.
Active – a connection comes in from a peer. Send a BGP OPEN message, and go to OpenSent.
OpenSent – Wait for an OPEN message from the peer. When received, send a KEEPALIVE and go to OpenConfirm.
OpenConfirm – Wait for the KEEPALIVE from the peer, then move to Established.
Established – Bidirectional communication is established. Start sending UPDATE and KEEPALIVE messages as required.
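The six states above can be written down as a small transition table. This is a simplified sketch, not the full state machine from the RFC: the event names are invented, the Connect to Active fallback on a failed connection is taken from the RFC rather than from the summary above, and unexpected events are simply ignored here where a real speaker would fall back to Idle.

```python
# (current state, event) -> next state, following the summary above.
TRANSITIONS = {
    ("Idle", "start"): "Connect",
    ("Connect", "tcp_established"): "OpenSent",   # OPEN is sent here
    ("Connect", "tcp_failed"): "Active",          # fall back to listening
    ("Active", "tcp_established"): "OpenSent",    # OPEN is sent here
    ("OpenSent", "open_received"): "OpenConfirm", # KEEPALIVE is sent here
    ("OpenConfirm", "keepalive_received"): "Established",
}


def next_state(state, event):
    """Return the state reached from `state` when `event` happens."""
    if event == "notification":
        return "Idle"  # any error tears the session down
    # Simplification: events that don't apply leave the state unchanged.
    return TRANSITIONS.get((state, event), state)


def run(events, state="Idle"):
    """Feed a sequence of events through the machine, starting from Idle."""
    for event in events:
        state = next_state(state, event)
    return state
```

Feeding it start, tcp_established, open_received and keepalive_received in order walks a session all the way to Established.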

BGP Message Types

There are four types of packets that are part of the base BGPv4 protocol:

OPEN – As above, used to negotiate the parameters of a session (ie timers) during initialization
UPDATE – Used to advertise new routes, or withdraw old routes. The packet looks like:

      +-----------------------------------------------------+
      |   Withdrawn Routes Length (2 octets)                |
      +-----------------------------------------------------+
      |  Withdrawn Routes (variable)                        |
      +-----------------------------------------------------+
      |   Total Path Attribute Length (2 octets)            |
      +-----------------------------------------------------+
      |    Path Attributes (variable)                       |
      +-----------------------------------------------------+
      |   Network Layer Reachability Information (variable) |
      +-----------------------------------------------------+

Since the list of added and removed routes is variable, it is possible (and encouraged) to have multiple routes in one packet if they share common path attributes. For example, if two routes had an AS-PATH of “1 2”, they could both go in the NLRI field. If a third route had an AS-PATH of “3 1 2”, then it would need a separate packet.
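The packing rule is easy to sketch: bucket the routes by their attributes, and each bucket becomes one UPDATE. In this rough example only the AS-PATH is compared, where a real speaker would compare the whole attribute set, and the prefixes are made up.

```python
from collections import defaultdict


def group_into_updates(routes):
    """Group (prefix, as_path) pairs so each AS-PATH gets one UPDATE.

    Returns a list of (as_path, [prefixes]) in first-seen order.
    """
    updates = defaultdict(list)
    for prefix, as_path in routes:
        updates[tuple(as_path)].append(prefix)
    return list(updates.items())


# Hypothetical prefixes: two share the path "1 2", one has "3 1 2".
routes = [
    ("10.1.0.0/16", (1, 2)),
    ("10.2.0.0/16", (1, 2)),
    ("10.3.0.0/16", (3, 1, 2)),
]
updates = group_into_updates(routes)
```

Here the three routes collapse into two UPDATE packets: one carrying both prefixes with the path 1 2, and a second for the route with the path 3 1 2.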

KEEPALIVE – Sent every 1/3 of the HoldTime to indicate the peer is still alive

NOTIFICATION – Indicate an error condition, after which both sides close the connection and move back to the Idle state.

BGP Timers

Several timers are in use to control the intervals of BGP events.

HoldTime – If a router has not received an UPDATE or KEEPALIVE message from its peer within this time, the peer is considered dead. The default is 180 seconds on Cisco, though the actual value is negotiated in the OPEN packets (the smaller of the two is used).
KeepAlive – Set to 1/3 of HoldTime. A peer should send a KEEPALIVE message if it hasn’t sent an UPDATE in that interval.
ConnectRetry – 120 seconds by default, it controls how often an idle router will try to connect to its peer.
MinASOriginationInterval – Used to space out UPDATE messages dealing with routes coming from the speaker’s AS
MinRouteAdvertisementInterval – Any route will not be advertised twice within this interval. The RFC states that it will be difficult to keep a timer for every route, so it proposes some alternatives.
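The HoldTime and KeepAlive relationship boils down to two lines of arithmetic. A minimal sketch, with a made-up function name and ignoring special cases such as a hold time of zero:

```python
def negotiate_timers(local_hold, peer_hold):
    """Return (hold_time, keepalive) in seconds for a session.

    Each side proposes a hold time in its OPEN; the smaller one is used,
    and keepalives go out every third of that.
    """
    hold_time = min(local_hold, peer_hold)
    keepalive = hold_time // 3
    return hold_time, keepalive
```

With the Cisco default of 180 on one side and a peer asking for 90, both ends settle on a hold time of 90 seconds and send keepalives every 30.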


Posted on Monday, 30th May 2005 by sean

Every BGP autonomous system (AS) is identified by a 16 bit unsigned integer. An AS originates route advertisements consisting of a prefix and various attributes. This article will look at the AS numbers, which will then lead in to path attributes.

Like IP address space, the issuing of AS numbers is controlled by a registry, in this case, the American Registry for Internet Numbers (ARIN). They have a series of policies governing the issuing of the numbers. There are also private AS numbers, from 64512 (0xFC00) to 65535 (0xFFFF). Your provider will have to strip these off, since they’re not allowed on the Internet.

In general, to get an AS number, you have to be multihomed. If not, you have to prove your routing policy is different than that of your peer.

Once you’ve filled out the application form and paid your fee, you get your number.

That’s the easy part. Now you have to use your AS number to advertise routes to your peers. These routes will be distributed over the Internet, and based on the path attributes, people will choose their preferred path. Take for example the following network:

asdiagram.jpg

AS1 advertises a network to both AS2 and AS3. AS2 and AS3 exchange routes, so they know they can go direct to AS1, or via the other guy. AS2 and AS3 announce to AS4 they know how to get to AS1 by their preferred route.

BGP is policy based. This means that in the situation above, AS2 can choose to send all traffic to AS1 via AS3, rather than direct. However, AS1 makes its own decision on how to return. AS2 can try to influence AS1, but has no absolute control.

Likewise, if both AS2 and AS3 advertise to AS4 that they know how to get to AS1, AS4 has the choice of which one it uses. There are no guarantees, however, that the path doesn’t go AS4->AS3->AS2->AS1.

This list of AS numbers that describe a path to the destination AS is known as the AS-Path attribute. As each AS advertises a prefix to its neighbour, it prepends its AS number to the list. For example, if AS1 were advertising 10/8, it would send the AS-Path of “1” to AS2 and AS3. AS2 would then send “2 1” to AS3. AS3 would send “3 2 1” to both AS1 and AS4. AS1 would see the 1 in the AS-Path and know that path would cause a routing loop, so it would discard the route. AS4, however, would see “3 2 1” from AS3 and “2 1” from AS2, and likely choose the latter (more on this when we go through the BGP decision algorithm).
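The prepend and loop check can be sketched in a few lines. The function names are my own, and picking the shortest path is only one step of the real decision algorithm, shown here because it is the tie-breaker the example relies on.

```python
def advertise(as_path, own_as):
    """Prepend our AS number before sending a route to an eBGP neighbour."""
    return (own_as,) + tuple(as_path)


def accept(as_path, own_as):
    """Reject any route whose AS-Path already contains our own AS (a loop)."""
    return own_as not in as_path


from_as1 = advertise((), 1)         # AS1 originates the prefix
from_as2 = advertise(from_as1, 2)   # what AS2 passes on
from_as3 = advertise(from_as2, 3)   # what AS3 passes on after hearing it from AS2

# AS4 hears the prefix from both AS3 and AS2 and prefers the shorter path.
best_for_as4 = min([from_as3, from_as2], key=len)
```

AS1 rejects from_as3 because its own number is already in the path, while AS4 accepts both candidates and ends up with the path learned from AS2.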

Next up, BGP Path attributes. I had intended to do that in this article but it was taking forever, so I’ll close this one up and start fresh.


Posted on Saturday, 16th April 2005 by sean

BGP, the Border Gateway Protocol, is an exterior routing protocol. This means that it is used between entities, unlike OSPF and EIGRP, which are used within an entity. In routing language, an entity is an Autonomous System, or an AS.

The BSCI exam has several items relating to BGP.

  1. Describe the features and operation of BGP
  2. Explain how BGP policy-based routing functions within an autonomous system
  3. Explain the use of redistribution between BGP and Interior Gateway Protocols (IGPs)
  4. Given a set of network requirements, identify the steps to configure a BGP environment and verify proper operation (within described guidelines) of your routers
  5. Describe the scalability problems associated with internal BGP

In this article I’ll work on #1, and maybe touch on #2. #3 will require several articles, and #4 will require a few more as we talk about route reflectors and confederations.

The lab I’m using at the moment is:

Interior protocols like OSPF, RIP, and EIGRP are great for a company, but they don’t scale to the size of the Internet. Furthermore, their path selection is based purely on finding the quickest path to a destination, whereas on the Internet other factors come into play (ignoring the incredible difficulty of finding the “fastest” path). For example, you may have two connections to different ISPs: the first is a flat rate service, and the second is a lower capacity, pay per megabyte service. In this case, you’ll likely want to use the latter only if the former goes down.

BGP lets you get away from having to know the details of the intermediate ASes. If the path from you to your customer is via Sprint, you really don’t care how Sprint handles the packet, as long as they send it closer to your destination. As such, BGP hides the internal workings of an AS.

To further classify BGP, it is an advanced distance vector protocol like EIGRP in that it starts off like regular distance vector protocols, but brings in features from link state protocols. It chooses routes based on the attributes of the routes, including the path to the destination. In fact, there are 13 different ways a route can be chosen:
BGP Best Path Selection Algorithm

(more on this in another article)

One of the things that is important to understand is when to use BGP vs using other means. If you only have one connection to one ISP, then a static route is a much better way. Similarly, if you have two connections to the same ISP and only want a failover link, static routes or a simple routing protocol are probably better. BGP is useful when:

  • You have multiple connections to different ISPs, either as a transit (passing traffic between ISPs) or as a non transit (basic multihomed site, all traffic either originates or terminates in your AS)
  • You need control over how traffic enters or leaves your AS

In addition, you’ll need a bigger router, and the smarts to run BGP. Sadly, some ISPs don’t filter their customer’s routes, so it’s possible that a misconfiguration on your router could have you routing someone else’s traffic or worse!

While studying BGP, it helps to be able to run some of the commands. If you don’t have a lab, or a BGP setup to play with yourself, you can do a lot of the show commands on a route server, such as route-server.ip.att.net (just telnet to it).

Checking it now, I see there are about 157,000 routes on the Internet. The CIDR report is a great source of statistics on the historical growth of the global tables. For example, here’s a look at the number of Internet routes from mid 1998 to the present:

Some quick facts about BGP, before I go:

  • If two peers are in the same AS, they run internal BGP (iBGP). If they are in different ASes, they are speaking external BGP (eBGP). There are different rules for each of them.
  • BGP runs on TCP port 179, meaning you have to be able to reach your neighbour before you can play. This usually isn’t a problem, since peers are usually directly connected in eBGP.
  • Since TCP is used, the BGP connection is reliable. At the beginning, a full table is exchanged, and then only updates. There are also keepalives that go over the channel. If the channel goes down, the peers are considered to be down.

Next up, a closer look at ASes and the attributes making up BGP advertisements.


Posted on Friday, 1st April 2005 by sean

I’ve done a bit of work on NAT lately:

I’m just going to finish up with the use of a route-map in NAT instead of the traditional access-list.

As an administrative note, while adding new contact information on the lower right (Skype and AIM), I noticed that my email address had the wrong domain. Silly me. If you emailed me before and I didn’t get back to you, my sincere apologies.

Again, using the following network for testing:

The first situation is using route-maps to allow multiple access-lists to be used.

access-list 101 permit tcp any any
access-list 102 permit icmp any any
!
route-map natmebaby permit 10
 match ip address 101
!
route-map natmebaby permit 20
 match ip address 102

This route-map permits tcp and icmp, but in separate clauses. Now, to create a pool and do the NAT:

ip nat pool mypool 192.168.1.64 192.168.1.69 prefix-length 24
ip nat inside source route-map natmebaby pool mypool

This creates a pool from 192.168.1.64 to .69, and NATs anything matching the natmebaby route-map (TCP and ICMP traffic).
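The way the route-map is walked can be sketched roughly as follows: clauses are tried in sequence order, the first one whose access-list matches decides, and anything that matches nothing falls through to an implicit deny. The data structures here are simplified stand-ins that only look at the protocol, which is all access-lists 101 and 102 care about.

```python
# Simplified stand-ins for access-lists 101 and 102 (protocol only).
ACLS = {101: {"tcp"}, 102: {"icmp"}}

# (sequence number, action, access-list) for route-map natmebaby.
ROUTE_MAP = [(10, "permit", 101), (20, "permit", 102)]


def route_map_permits(protocol):
    """True if a packet of this protocol is selected for NAT."""
    for _sequence, action, acl in sorted(ROUTE_MAP):
        if protocol in ACLS[acl]:
            return action == "permit"
    return False  # implicit deny when no clause matches
```

TCP is caught by clause 10 and ICMP by clause 20, while something like UDP matches neither clause and so is not translated.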

From r1, I initiated a telnet connection to Bob on port 22, and also pinged it. On R0 I then saw:

r1#
r0#show ip nat trans
Pro Inside global      Inside local       Outside local      Outside global
tcp 192.168.1.64:11000 192.168.15.2:11000 192.168.1.10:22    192.168.1.10:22
icmp 192.168.1.64:7720 192.168.15.2:7720  192.168.1.10:7720  192.168.1.10:7720
icmp 192.168.1.64:7721 192.168.15.2:7721  192.168.1.10:7721  192.168.1.10:7721
icmp 192.168.1.64:7722 192.168.15.2:7722  192.168.1.10:7722  192.168.1.10:7722
icmp 192.168.1.64:7723 192.168.15.2:7723  192.168.1.10:7723  192.168.1.10:7723
icmp 192.168.1.64:7724 192.168.15.2:7724  192.168.1.10:7724  192.168.1.10:7724

Trying it the traditional way:

access-list 103 permit ip any any
no ip nat inside source route-map natmebaby pool mypool
ip nat inside source list 103 pool mypool
... telnet and ping from r1 ...
r0#show ip nat translations
Pro Inside global      Inside local       Outside local      Outside global
--- 192.168.1.65       192.168.15.2       ---                ---

Here, the traditional method only creates one translation entry for all connections, while using a route-map gives more detail.

Since the route-map method generates multiple translation entries, it can be used when there are multiple outside interfaces. NAT Support for Multiple Pools Using Route Maps describes the problem well. If you have two NAT pools and are using multiple ip nat inside source list commands, the first translation will stick, and if another flow uses the other interface, the wrong address will be used.

Route-maps ensure that the source address to translate to is chosen on a per flow basis rather than once.

Another interesting use is route-maps and static NAT, where the set ip next-hop action is used in the route-map. For now, I won’t lab it up, but it’s an interesting read.

