
Routing Transcript

Chaired By:
Ignas Bagdonas, Sebastian Becker, Ben Cartwright-Cox
Session:
Routing
Date:
Time:
(UTC +0100)
Room:
Main Room

RIPE 92,

Main room

20 May 2026

11 a.m.

BEN CARTWRIGHT‑COX: Hello everybody. We'll be getting started in just a moment. Welcome to the Routing Working Group. First of all, meet your chairs: Ignas couldn't be here today, but we have me and Sebastian. We have three talks on the agenda today. We have two talks from the RIPE NCC around RPKI, and we have another talk from Job Snijders about RPKI. So, welcome to the RPKI Working Group.

Please rate the talks as we go along, because it's important to get feedback; it helps us build a better programme for next time. You will need to log into RIPE, which means logging in with 2FA. Do that now, and when the talks finish, submit feedback. We love to see the feedback. The presenters also love to see the feedback.

I think that's it.

SEBASTIAN BECKER: Maybe one more little thing. At the next meeting we are starting the selection or re‑election procedure for the Working Group Chairs of the Routing Working Group. So, anyone who wants to stand up and be a Routing Working Group co‑Chair, please come to us.

BEN CARTWRIGHT‑COX: What we'd recommend in this case is that you join the mailing list, because that is where the announcements will be made. Other than that, I think we're all good. The first talk is on non‑functional delegated RPKI CAs, from Tim and Bart.

BART BAKKER: Good morning, my name is Bart, I work at the RIPE NCC in the RPKI team. And today I am here to do Tim's talk, apparently. Either we need to swap slides or Tim needs to get up here. I guess it's you ‑‑ no, I guess it's me, actually.

I am here to talk about this. The other one is in about 20 minutes, I suppose.

RPKI delegated CAs, and the accepted policy, 847. I hope the clicker will do its thing.

STENOGRAPHER: Don't you just love technology!


BART BAKKER: I am here to talk about the implementation of this policy. The policy text is on the slide; it's about revocation of non‑functional CAs, and the goal is to clean up the RPKI ecosystem. We see that roughly one third of our delegated CAs are persistently non‑functional, and this slows down validators: on every validation round they must fetch from all the CAs, many of the repositories are unavailable, many of them time out, and this adds a lot of seconds to the validation and synchronisation process. What we will do is, 90 days after this policy comes into effect, start revoking the persistently non‑functional delegated RPKI CAs.

First, let's talk a bit about what a delegated CA is.

In short, it's a CA to which an RIR has delegated resources, as opposed to hosted RPKI. A delegated certificate authority has its own keys: they hold their own private key, they host their own repositories, they host their own publication server. They run their own CAs.

This is how the setup works. You click to create a delegated CA; obviously you read the terms and conditions carefully and you accept them. Then you do the key exchange. The key exchange according to RFC 8183 is an exchange of public keys. So you essentially upload the public key of your CA, your Krill, your local CA; then you download the key of the server and upload that to the local CA. That's the entire setup process. These keys are used both in the synchronisation with the parent for provisioning and to sign and publish RPKI objects to your own repository.

After setup, your CA will start connecting at regular intervals to the parent, which is the RIR, essentially to synchronise the resources with the parent. If there are any resource updates, the certificates get updated and the CA can decide what to do; if any objects are affected, they can deal with that case.

Regularly, they publish objects to their publication server following RFC 8181. This can be their own publication server, which they then need to host. This one must be available to the Internet at all times. So it must be up, well, at least most of the time, so that all the validators out there can constantly fetch the data from the repository and complete their validation passes. If you host your own repository, make sure to follow the publication server best current practices document from the IETF.

An alternative to this is that you can use publication as a service. Then the RIPE NCC RPKI team will do most of the heavy lifting for you. We will make sure that the publication server is available all of the time to validators out there, and you only have to run your local Krill. That doesn't need to be so highly available.

Both these protocols can break at points and they require monitoring. If one of them breaks, a CA can become non‑functional. If a repository breaks for a long time, relying parties can no longer synchronise objects; if provisioning breaks, at some point CAs may end up over‑claiming, and it becomes a mess. That is what we refer to as non‑functional CAs.

Now that we know that all these things can fail, let's quickly talk about an incident that happened recently.

On May 4, the identity certificate of the RIPE NCC parent CA expired. This was a certificate that was valid for ten years; we put it in production on May 4, 2016, you can do the math. Automated renewal had been overlooked. So we had to implement that quickly.

Because we stopped doing any transfers involving delegated CAs, what broke really was the provisioning protocol. There was nothing to synchronise because there were no resource updates, so the only impact was on the availability of the provisioning protocol. All the CA certificates remained published and all the RPKI objects of the delegated CAs remained valid, so there was no incident affecting the safety of the RPKI.
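
An expiry like the one described above is the kind of event a plain date check can flag well in advance. A minimal sketch, assuming a hypothetical `needs_renewal` helper and a 90‑day warning window (neither comes from the talk); the expiry date is taken as ten years after the May 4, 2016 production date mentioned above:

```python
from datetime import date

def days_until_expiry(not_after: date, today: date) -> int:
    """Whole days left before the certificate's notAfter date."""
    return (not_after - today).days

def needs_renewal(not_after: date, today: date, warn_days: int = 90) -> bool:
    """True once the certificate is inside the (assumed) warning window."""
    return days_until_expiry(not_after, today) <= warn_days

# Assumed expiry: ten years after going into production on 4 May 2016.
not_after = date(2026, 5, 4)
print(needs_renewal(not_after, date(2026, 3, 1)))  # True: inside the window
print(needs_renewal(not_after, date(2025, 5, 4)))  # False: a year to go
```

Running such a check on a schedule and alerting on `True` is one simple way to avoid depending on anyone remembering a ten‑year‑old deadline.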

So then, on to our implementation of this CA monitoring.

We started to monitor delegated CAs. For all of our delegated CAs we started tracking the provisioning status, which is the synchronisation status: not just whether it was successful, but also when we last saw them and, if unsuccessful, we aim to have a reason why it was unsuccessful. There are many ways in which they can fail, for example timing issues, and we store that in order to help operators debug this later.

We also monitor whether there is a valid manifest in the RPKI. We produce metrics from that; essentially the metric becomes whether this particular CA is functional or not. We store them in a time series database, so that over time we can track what happens with the CAs. This allows us not just to have a boolean at a 90‑day interval and see that this CA has been non‑functional for 90 days. It helps us to see how CAs operate a bit better.
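
The "non‑functional for 90 days" decision can be sketched over such a time series. This is an illustration only: the sample layout (timestamp, functional flag), the function name and the handling of CAs that were never functional are assumptions, not the RIPE NCC implementation.

```python
from datetime import datetime, timedelta

def persistently_non_functional(samples, now, threshold=timedelta(days=90)):
    """samples: list of (timestamp, functional) pairs for one CA.

    Assumption: a CA counts as persistently non-functional when its most
    recent functional sample (or, if it was never functional, its first
    sample) is at least `threshold` old.
    """
    if not samples:
        return False  # nothing observed yet, so nothing to conclude
    first_seen = min(ts for ts, _ in samples)
    last_ok = max((ts for ts, ok in samples if ok), default=first_seen)
    return now - last_ok >= threshold

now = datetime(2026, 9, 10)
broken = [(datetime(2026, 5, 1), True), (datetime(2026, 5, 2), False),
          (datetime(2026, 9, 9), False)]
healthy = [(datetime(2026, 5, 1), False), (datetime(2026, 9, 9), True)]
print(persistently_non_functional(broken, now))   # True
print(persistently_non_functional(healthy, now))  # False
```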

And this is how that works. RPKI core is our trust anchor software; we load the active delegated CAs from that. We fetch the validated RPKI cache in CCR format from rpki‑client. We produce metrics from that and we store them in a time series database.

I mentioned CCR. Canonical Cache Representation, that's what it stands for. It's a new format to represent the validated RPKI state. It contains manifests, VRPs, ASPAs, trust anchors and router keys, or any of these really, at least one set. We use only the manifests from the CCR file. It's a type‑length‑value format, encoded in DER, much like the objects in the RPKI. It's a binary format, it's compact and it compresses very well, as Mr. Snijders presented earlier this meeting in the MAT Working Group; if you haven't seen it, it might be nice to watch that.
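
To make "type‑length‑value, encoded in DER" concrete, here is a minimal sketch of reading one DER element header. It is generic DER handling with single‑byte tags only, not a CCR parser; the CCR structure itself is defined in the draft and is not reproduced here, and `read_tlv` is a hypothetical helper name.

```python
def read_tlv(data: bytes, offset: int = 0):
    """Return (tag, value, next_offset) for one DER element.

    Handles single-byte tags and both DER length forms:
    short form (< 0x80) and long form (0x80 | number of length bytes).
    """
    tag = data[offset]
    first = data[offset + 1]
    if first < 0x80:                      # short form: the byte is the length
        length, header = first, 2
    else:                                 # long form: length follows in n bytes
        n = first & 0x7F
        length = int.from_bytes(data[offset + 2:offset + 2 + n], "big")
        header = 2 + n
    start = offset + header
    return tag, data[start:start + length], start + length

# SEQUENCE { INTEGER 5, OCTET STRING ff }
blob = bytes.fromhex("3006020105" "0401ff")
tag, body, _ = read_tlv(blob)
print(hex(tag), len(body))        # 0x30 6
print(read_tlv(body, 0)[:2])      # (2, b'\x05')
print(read_tlv(body, 3)[:2])      # (4, b'\xff')
```

Because every element carries its own length, a reader can skip the parts it does not need, which fits the use described above of taking only the manifests out of a larger file.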

It's great for archiving. Currently there are two implementations, one in rpki‑client by OpenBSD and one in the RIPE NCC RPKI Java library. Interoperability has been tested, and the draft, version 05, is up for Working Group last call in the IETF SIDROPS Working Group.

So why CCR? Rather than building our own validator specifically for these delegated CA manifests, we like to leverage the validated RPKI state from existing validators. CCR will be a standardised format. We hope that more validators will implement it so we can compare between validators, and we save ourselves from having to implement our own validation passes, with all the consequences of that.

So we do match delegated CA certificates with manifest instances in the CCR archive. And all this information from CCR helps us to provide more information to operators, as I will show you in a little bit.

The other end of this is revocation. So after we know which CAs are persistently non‑functional, we need to revoke them. Although we love automating things, we decided to do manual revocation, at least to start with. First and foremost, because we don't expect a huge burden on the team: we have a workload of about 125 CAs that are currently persistently non‑functional. After that initial batch of about 125 CAs, we don't expect this to be a burden on the team on a daily basis. If it is, we'll reconsider automation. Another reason not to automate this from the start is that we have our query ready and we think we know how to select the CAs, but we really want to make sure before we start automatically revoking CAs and then find out that our query was somewhat wrong or that we had a bug in some area, because revoking still‑functional CAs is really what we want to prevent.

We do have automation in the area where we get automated alerts as a team, not just when a CA becomes persistently non‑functional for 90 days, we also can get alerts and see from history when CAs become non‑functional, when they become functional again so that we can reach out to operators and work with them in order to keep their CAs functional and operational.

This is some of the future work. We did implement everything up to here; future work includes reaching out more to operators via e‑mail and showing information in the RPKI dashboard. We'd like to alert early and prevent a CA from being non‑functional for 90 days if we can. Our goal is to have all CAs operational in our ecosystem.

You see here on the slide, in the metrics, why we use the time series database. We don't just have this boolean at 90 days; we can see the state of CAs changing over time. Some CA becomes unavailable or non‑functional at some point and we see that a day later it becomes functional again. This is an availability issue at the CA. That's fine. If this happens a lot we may reach out to you; if it is your publication server that is failing a lot, we may recommend that you consider using publication as a service, for example, to have a more highly available publication server.
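
Spotting a CA that repeatedly goes non‑functional and recovers amounts to extracting outage intervals from the time series. A minimal sketch, assuming samples are (timestamp, functional) pairs sorted by time; the function name and data layout are illustrative, not the RIPE NCC tooling:

```python
def outages(samples):
    """Return (start, end) intervals during which a CA was non-functional.

    samples: list of (timestamp, functional) pairs, sorted by timestamp.
    `end` is the first timestamp at which the CA was functional again,
    or None if the outage is still ongoing at the last sample.
    """
    result = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                    # outage begins
        elif ok and start is not None:
            result.append((start, ts))    # outage ends
            start = None
    if start is not None:
        result.append((start, None))      # still broken at the end
    return result

# Day numbers used as timestamps for brevity.
series = [(1, True), (2, False), (3, True), (4, True), (5, False), (6, False)]
print(outages(series))  # [(2, 3), (5, None)]
```

Counting the closed intervals per CA gives a rough "fails a lot" signal of the kind that might prompt reaching out to an operator.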

So we aim to work well with the delegated CA operators in order to make the ecosystem better.

We will also show some information in the RPKI dashboard, including some basic CA information and the last provisioning message: when it was, the status of it and, if it was unsuccessful, the reason why. We will also show some details about the current manifest, as in the this‑update, the next‑update and the manifest number, and we will provide some more contextual support, for example, if your provisioning breaks, a link to documentation on what you can consider doing about it.

So, if you are a CA operator and you have ideas of what you need in order to make your CA more functional, feel free to reach out and let's have a chat in the hallway at some point.

With that, thank you for listening.


(Applause) I think we have time for questions.

SEBASTIAN BECKER: Anyone on the mic?

JOB SNIJDERS: Are the rumours true that if your CA is revoked through this policy mechanism that you have to buy cake for others?

BART BAKKER: I haven't read it in the policy, but rumour is true, yeah.

SEBASTIAN BECKER: Nothing in the queue.

TIM BRUIJNZEELS: Just an additional comment. The policy required a change in the terms and conditions; they have been published and should come into effect on the 8th of June. So that's also the moment that we can start counting the 90 days. Just for your info: there will be a bit of a delay still before we can actively revoke the non‑functional delegated CAs, because we know we need to have the terms and conditions in place before we do that.
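
For reference, the 90‑day count from the effective date works out as follows. The year 2026 is taken from the session date at the top of the transcript, and the sketch assumes the count starts on 8 June itself:

```python
from datetime import date, timedelta

effective = date(2026, 6, 8)          # terms and conditions come into effect
earliest_revocation = effective + timedelta(days=90)
print(earliest_revocation)            # 2026-09-06
```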

BART BAKKER: After June 8th there will be 90 days, and then we can start revoking.

AUDIENCE SPEAKER: Thomas Strikes, data nerd. I presume Job and others are monitoring the latencies and overall availability for validation so that we can track some metrics once this hits and then I guess Job will have a talk to do at Sofia.

JOB SNIJDERS: I was actually hoping Bart could do the talk. We have some intuition that if a CA breaks, it usually breaks forever; that sort of seems to be the pattern. But I have not done the actual data analysis to understand what the break pattern is, and whether people recreate CAs after they are revoked because the revocation serves as a reminder.

AUDIENCE SPEAKER: Maybe you both do the talk.

BART BAKKER: We'll figure something out.

(Applause)

SEBASTIAN BECKER: Then Tim for six months of ASPA.

TIM BRUIJNZEELS: Hi everyone. I am Tim, I work for the RIPE NCC.

I wanted to talk to you about ASPA again for those who weren't here the last time and the previous time before that.

And maybe new for some others, hopefully.

So, six months of ASPA: the disclaimer there is that this is in the RIPE NCC RPKI dashboard, because ASPA has been around as a draft for longer than that, of course.

First, what is ASPA? I cannot assume that everybody knows what it is yet, and then, you know, it's going to be a rather abstract talk if I don't go into this.

Bear with me. I will try to touch on some of the corner cases that people have been asking about as well, so maybe some of the questions that came up earlier might be answered by some of these slides.

So, taking it from the top. ASPAs are AS Provider Authorisation objects. They are a lot like ROAs, except that they are not signed by prefix holders but by the holder of an AS number, who can then declare, as a customer AS, who their provider ASes are in BGP. Provider AS here does not necessarily mean a provider in a business context, but a provider in the context of the BGP path.

Now, ASPA can be used to verify paths. There are two algorithms, so two versions of this. The simplest one is: what if you get updates and paths on a customer session, and how do you validate those? In that case, you expect these paths to be valid from customers to providers. The path is analysed and each hop from one AS to the next is looked at. It can be provider, in which case there was an ASPA object issued by the customer AS saying the next one is a provider. It could be not provider, which means they made an object but it did not include this next AS as their provider. Or there could be no attestation, which means they didn't create an object.

The path would be considered invalid if there is any occurrence of not provider. So no attestation is fine, and that's crucial, because it needs to support incremental deployment of course. Also, if things became invalid whenever there was an issue with the RPKI itself or with your validation chain etc., that could have a very big impact.

So, like with ROAs, the idea is that it should fail open at least as much as possible in the sense that if you don't validate, then you don't, but if you know this is actually wrong, then you use that information to say, to reject it.

So that's the simple case, and graphically that looks like this. As was pointed out to me, I should apologise here to the rightful holders of AS1, 2, 3, etc. No harm intended, but the documentation numbers are quite long, and AS A, B, C didn't work for me. So sorry if this is you.

In this example, we see an announcement coming up from AS4 up to AS1. We see this entire path, but AS4 here has said AS5 is my only provider, in this case typically they might have more but for the example they have one and that's not AS3, so this is wrong.

Important to realise though: don't assume malice. This might be just a mistake. It could be a leak, right? It could be that AS3 and AS4 are peers and it wasn't supposed to go to AS2, but it did. In any case it doesn't matter in terms of verification of the path, because it will be flagged as something that is wrong.
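
The simple customer‑session case described above can be sketched in a few lines. This is an illustration under stated assumptions: ASPAs are modelled as a plain dictionary from customer ASN to a set of provider ASNs, the path is ordered origin first, and the names `classify_hop` and `verify_upstream` are made up for the example. It omits details of the draft such as AS_SET handling and prepend collapsing.

```python
def classify_hop(aspas, customer, provider):
    """Classify one customer -> provider hop from the customer's ASPA."""
    providers = aspas.get(customer)
    if providers is None:
        return "no attestation"          # the customer published no ASPA
    return "provider" if provider in providers else "not provider"

def verify_upstream(aspas, path):
    """path: ASNs ordered origin first, sending neighbour last."""
    hops = [classify_hop(aspas, a, b) for a, b in zip(path, path[1:])]
    if "not provider" in hops:
        return "invalid"                 # someone attested, and it disagrees
    if "no attestation" in hops:
        return "unknown"                 # fail open: information is missing
    return "valid"                       # every hop is positively attested

# The slide example: AS4 says AS5 is its only provider, yet the path
# goes AS4 -> AS3 -> AS2 -> AS1.
print(verify_upstream({4: {5}}, [4, 3, 2, 1]))   # invalid
print(verify_upstream({}, [4, 3, 2, 1]))         # unknown
```

Note that a single ASPA, from AS4 alone, is enough to reject the path; nobody else on the path has to have signed anything.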

For other routes it's a bit more complicated; you need to wrap your head around a little dance where something becoming invalid or not provider is actually a feature, not a bug. But I'll get to that. Conceptually, it's valley‑free routing, which may be familiar to quite a lot of people here. But in any case, what is the idea here?


You would have an up‑ramp of an announcement coming from an origin, from a customer to their provider to their providers and so on; that can be a short chain or it can be a very long chain. And then at some point it reaches an apex and then, either through a shared common provider or a peering relationship, it would go downhill again, let's say, on the other side, where you have the inverse, so it then goes from provider to their customers.

And valleys would be where you have something in the middle, where you have maybe three hops in the middle, or something goes down again and then up again. Those are considered harmful because there are some negative effects: latency, congestion, cost might be an issue, and data security, because packets go where they were not supposed to go. So we'd like to avoid that.

How does ASPA find valleys?

So, to find invalids, which is the most important objective here, it tries to get the longest possible up‑ramps and down‑ramps and see if they match up, essentially. You start looking from one end of the path, the origin, to look at the up‑ramp, and you just keep going and ask: is there a not provider? If there is no not provider, it's okay, you keep going. If nobody created an ASPA in the whole chain you make it all the way to the end, which is fine. But if people did issue an object and you find a moment where the next hop is no longer a provider, then that's where that chain terminates. Then you do the same from the other end. The expectation is that provider to customer, inverted, would be customer to provider from the other end, right? So you do this from the other end as well and then you match them up. I keep saying "not invalid", confusingly, because not invalid does not necessarily mean valid, but I'll get to that later on; the path would be considered invalid if there is a gap in the middle.

So, overlapping ramps from both ends would be fine; meeting at an AS adjacency is fine, because that is assumed to be peering; meeting in a common AS is also fine. But anything longer in between is considered a problem and it's considered invalid.
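
The ramp matching described above can be sketched as follows. This is a simplified reading of the procedure, not the normative algorithm from the draft: ASPAs are a dictionary from customer ASN to a set of provider ASNs, the path is the received AS path ordered origin first, and `hop`, `ramp` and `verify_downstream` are names invented for the example. AS_SET handling and prepend collapsing are left out.

```python
def hop(aspas, customer, provider):
    """Classify one customer -> provider hop from the customer's ASPA."""
    providers = aspas.get(customer)
    if providers is None:
        return "no attestation"
    return "provider" if provider in providers else "not provider"

def ramp(aspas, path, stop_on):
    """Number of ASes on the ramp that starts at path[0].

    The ramp ends at the first hop whose classification is in stop_on.
    """
    for i in range(len(path) - 1):
        if hop(aspas, path[i], path[i + 1]) in stop_on:
            return i + 1
    return len(path)

def verify_downstream(aspas, path):
    n = len(path)
    down = path[::-1]                     # walk the down-ramp from the far end
    broken = {"not provider"}
    unproven = {"not provider", "no attestation"}
    # Longest possible ramps: a gap between them means a valley.
    if ramp(aspas, path, broken) + ramp(aspas, down, broken) < n:
        return "invalid"
    # Ramps made only of attested hops: if they cover the path, it is valid.
    if ramp(aspas, path, unproven) + ramp(aspas, down, unproven) < n:
        return "unknown"
    return "valid"

aspas = {10: {20}, 20: {21}, 30: {50}, 40: {41}}
print(verify_downstream(aspas, [10, 20, 30]))      # valid (ramps meet at an adjacency)
print(verify_downstream(aspas, [10, 20, 30, 40]))  # invalid (gap in the middle)
print(verify_downstream({}, [10, 20, 30]))         # unknown (nothing attested)
```

The two sums mirror the distinction made in the talk: the first uses the longest ramps that are not contradicted by any ASPA and decides "invalid", the second uses only ramps that are positively attested and decides between "valid" and "unknown".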

Here is an example. So AS4 bubbles up this announcement. Here you just keep going; AS3 here has the challenge of validating. They can see that from AS4 to 2 is fine, to 1 is fine. But from 1 to me, actually, in this case, is no longer fine, because AS1 said their provider is AS6, just another random AS here. That's where it ends. From the other end of the path, they can go out to AS1 but no further. But these ramps combined cover the path.

So this is not invalid.

This is another example where we look at peering, so this is where the distance between the two ramps is actually one adjacent hop. Again not flagged as invalid. And I'll get to this in a bit but then it can be either unknown or actually valid. Valid would mean that every single AS in the path essentially created an ASPA object so you have a positive statement that it's complete and valid. Whereas unknown leaves room for like there is some information missing so you assume it to be okay. Invalid would be we know this to be wrong. So that's why I keep saying here, it's not invalid, it could be unknown, it could be valid, but crucially, it's not invalid because that's when you want to drop things.

This is a question that has come up before. Does the length of the ramps matter? In a word, no. But combined they do need to cover the path.

Here is an example of a leak. In this case AS2 has AS1 as their provider, not AS3. But then AS3 sends it back up to AS1. AS1... yeah, so it goes back up. I am trying to work through my own example now, but the thing is: if you look at it from one end you get to AS2, and if you look at it from the other end you get to AS1, so there is a gap in the middle; therefore this is invalid.

Now, a complicated case, which the draft calls complex peering relationships, is that in some cases peers are not always just peers, but can also be providers in some contexts. If that is the case, then you have to add them to your ASPA object as a provider, because it may happen. So in this case, let's say AS4 is sometimes a provider to AS3; AS3 needs to add them to their ASPA object. In this case AS7 is validating, and they say, well, this all looks possible, so they accept. But now of course AS4 can propagate any prefix, maybe ones they were not supposed to. That would all be fine for ASPA, because it just looks at the path and not at the prefix, and thinks it looks okay.

If this is the case, then the Only to Customer (OTC) attribute in RFC 9234 can be of help, because that would allow you to say at a prefix level that this prefix is supposed to go to your customers only. So there is an additional mechanism that can help here.
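
For readers unfamiliar with OTC, the per‑route rules in RFC 9234 can be sketched roughly like this. It is a simplified model, assuming the neighbour's role (as seen from the local AS) is already known from configuration or BGP Role negotiation; the function names and the use of `None` for "attribute absent" are choices made for the example, and the egress rule for adding OTC is left out.

```python
def ingress(otc, neighbor_role, neighbor_asn):
    """Return (accept, otc) for a route received from a neighbour.

    otc: the received OTC value, or None if the attribute is absent.
    neighbor_role: 'provider', 'customer', 'peer', 'rs' or 'rs-client'.
    """
    if otc is not None:
        if neighbor_role in ("customer", "rs-client"):
            return False, otc             # leak: OTC route coming back up
        if neighbor_role == "peer" and otc != neighbor_asn:
            return False, otc             # leak: peer passing on a marked route
        return True, otc
    if neighbor_role in ("provider", "peer", "rs"):
        return True, neighbor_asn         # mark it: goes to customers only
    return True, None

def may_export(otc, neighbor_role):
    """A route carrying OTC must not go to providers, peers or route servers."""
    return not (otc is not None and neighbor_role in ("provider", "peer", "rs"))

accept, otc = ingress(None, "provider", 64500)
print(accept, otc)                        # True 64500
print(may_export(otc, "customer"))        # True
print(may_export(otc, "peer"))            # False
```

Unlike ASPA, this travels with each route, which is why it can catch the case above where the AS path itself looks plausible.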

So with that, I'd like to look at the core of the presentation, I would say. So what have we learned in the last six months or what do we see?


Uptake: this is on the signing side, so we have looked at how many ASPA objects there are in the world. Before July 2025 the line is quite condensed, but there were some, mainly in Krill, because it had been supporting ASPA for a while through the CLI. After July, the line goes up a bit more, because that's when ASPA was added to the UI there. November, well, that's clearly visible: at the end of November we enabled this in the RPKI dashboard of the RIPE NCC, and ARIN also enabled this in January. So now we see quite a large number of ASPA objects appearing in the world, and we're quite happy that we see this uptake going quite fast. I think we're at 3 point something percent of ASNs registered with the RIPE NCC that have a corresponding ASPA object at this moment. That is much quicker than I thought it would be, but that's a good thing, because to me that says that people like it and want to use it.

When we look at validation, I just have a question mark for you. I know that validation is done sometimes; we even do it on our network, and maybe if Ondrej is around he might comment on that. I know there are other places that do validation, or at least inspect their paths using ASPA, but I don't know of public efforts where statistics are published, for example: do we know who is validating, how people are validating, etc.? I think that would be a very interesting subject to look at in the coming years, and if you have any ideas about this or you want to do beacons and whatnot, I would be very interested to have a talk about that. But for now I have a question mark for you.

All right. That leads me to another thing. So the curious case of AS0. We like AS0. We like to use it to mean things. Because AS0 is not actually a real AS that is allowed in the path in BGP. It's also used in ROAs as you might know. In ASPAs we also use it.

So in an ASPA object you can have AS0 as the only provider, to indicate that your network actually does not have any providers. The primary use case here would be for Tier 1 networks that have no providers themselves, like they are at the pinnacle, but there are some other use cases where it might be used as well: a route server might use it because they don't want to appear in a path at all, and maybe this is something they can use to flag that ‑‑ well, if you see us, it's probably wrong.

But the primary use case, as I said, is for Tier 1 networks.

Now, going back to the previous example, let's say AS1 is actually a Tier 1 here. What they would have issued is an ASPA object that says AS0 is our only provider, we are provider free. The rest is still the same as the previous example, so I won't run through the detail again. It's just to say you don't have to make up some random AS as your provider.
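
The effect of an AS0‑only ASPA follows directly from the hop classification: since AS0 can never be the next AS in a real path, every hop "up" from that AS comes out as not provider. A minimal sketch, with ASPAs modelled as a dictionary from customer ASN to a set of provider ASNs (an assumption made for illustration):

```python
def classify_hop(aspas, customer, provider):
    """Classify one customer -> provider hop from the customer's ASPA."""
    providers = aspas.get(customer)
    if providers is None:
        return "no attestation"
    return "provider" if provider in providers else "not provider"

aspas = {1: {0}}                 # AS1 declares AS0 as its only provider
for neighbour in (2, 3, 6):
    print(neighbour, classify_hop(aspas, 1, neighbour))   # always "not provider"
```

So any up‑ramp necessarily terminates at AS1, which is the behaviour wanted for a provider‑free network.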

Same here; this is where something goes through two Tier 1s by accident. A takeaway here is also that if Tier 1s sign these objects, you don't need many of them to do this to already flag these issues, right? So fairly quickly, if there is enough uptake by Tier 1s in signing, this can become useful to detect these kinds of leaks.

AS0 in the wild. We do see them in the wild and I have a plot. So here, we see just a fairly low number before we enabled it in the RPKI dashboard and then since then it's gone up quite dramatically.

There are also some in other regions of course, but mostly it's RIPE NCC and then ARIN. And that got me thinking: can we use this? Can we use the appearance of AS0 ASPA objects as a signal? Can we use this to learn something? So, with great help from this person here, who pointed me at a file that contains essentially all the routes seen by various RIS route collectors, I got that information, I did an export, and then I did an analysis where I just peeled out the unique paths, ignoring the prefixes for now, because I made some poor choices in my implementation and it just didn't fit in the memory of my computer, but essentially I was interested in the paths mostly. So what I did is, I looked at all the paths that go up to an AS that said AS0 is our only provider, so presumably a Tier 1 network, or at least something where the up‑ramp would end. Then I took the bits that I had, just the unique ones, grouped them and analysed them to figure out whether they are actually valid, invalid or unknown. Unknown could be partially covered or not covered at all; the draft doesn't differentiate, the verification outcome is either invalid, valid or unknown. In my case I was interested to know whether a path is covered at all, because I'd like to see where uptake is going in terms of how this matches up to what we see.
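
The path extraction step described above might look roughly like this. It is a guess at the general shape, not the code used for the talk: input paths are assumed to be in BGP order (neighbour first, origin last), `provider_free` is the set of ASNs with an AS0‑only ASPA, and both function names are hypothetical.

```python
def collapse_prepends(path):
    """Drop consecutive duplicate ASNs caused by prepending."""
    out = []
    for asn in path:
        if not out or out[-1] != asn:
            out.append(asn)
    return out

def unique_up_ramps(as_paths, provider_free):
    """Unique path segments from the origin up to the first provider-free AS.

    as_paths: iterable of AS paths in BGP order (neighbour first, origin last).
    Paths that never reach a provider-free AS are skipped.
    """
    ramps = set()
    for bgp_path in as_paths:
        path = collapse_prepends(bgp_path)[::-1]      # origin first
        for i, asn in enumerate(path):
            if asn in provider_free:
                ramps.add(tuple(path[:i + 1]))        # keep only the up-ramp
                break
    return ramps

routes = [[3, 1, 2, 4, 4], [5, 1, 2, 4], [3, 2, 4]]
print(unique_up_ramps(routes, provider_free={1}))     # {(4, 2, 1)}
```

Storing only the truncated, deduplicated tuples rather than every route is also what keeps the working set small enough to fit in memory.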

And that works out like this.

So, I ended up with about 700,000 paths from 45 million routes. Of course there is a lot of overlap in those routes; that's why it didn't fit in memory.

Not covered is 93.7%. But covered is of course interesting. Valid, which means fully covered by ASPA objects with all customer‑to‑provider links in place, is 0.2%. Invalid is bigger, and that might throw you off: 0.5%. But actually, unknown in this case is kind of valid: it's partially covered and, in as far as it's covered, it's valid.

Still invalid is too high, I believe. Higher than I hoped. And that brings me back to a question that we ended with last time which is:

We had a discussion at the mic about, you know, should we try to help people and suggest, you know, providers that people maybe should add to the ASPA objects? And there were two answers, and they were both overwhelmingly, no and yes. No, because it will include false positives, it will not include providers that we do not see. But yes, because people will forget and it becomes problematic when people start validating because if there are too many things that would be rejected and people start validating, if you do this then suddenly it's on you because you are dropping all these routes, rather than on the people who should be maintaining and fixing their attestations, right.

So that's a bit of an issue; that was an issue with ROAs as well.

And looking at the numbers: invalid is 0.5%, but that is out of everything, and only part of that is actually covered. So it's something like, I don't know, I didn't do the maths actually, 8 or 9% invalid of the things that are actually covered. I think that's probably too high.
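
The estimate can be checked from the figures quoted in the talk. The only assumption in this sketch is that the four categories add up to 100%, so that the partially covered share is whatever remains:

```python
not_covered = 93.7   # % of unique paths with no ASPA coverage (from the talk)
valid = 0.2          # % fully covered and valid (from the talk)
invalid = 0.5        # % invalid (from the talk)

covered = 100 - not_covered                  # paths with at least some coverage
partially_covered = covered - valid - invalid  # inferred remainder
invalid_of_covered = invalid / covered * 100

print(round(covered, 1))             # 6.3
print(round(partially_covered, 1))   # 5.6
print(round(invalid_of_covered, 1))  # 7.9
```

So roughly 8% of the paths that have any coverage are invalid, in line with the "8 or 9%" mentioned from memory.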

So I think we need to do something. A question to this group would be whether the approach that I have used here in this analysis would help; that's another thing. And of course, if we do this, then we would definitely do it in a way that we ask: did you forget these or not? But it's always up to the AS holder to really verify, because we have to be very clear that we may not see everything. We may even see things that should not be there. So it's going to be a challenge to present that in a nice way, and we have to ask people to help with that. Based on the information I got, I think we should do something, but please speak up if you feel differently, or even if you agree.

Next steps, then I come close to the end of this.

Again, that question: do we suggest providers? So let's have that talk. Validation measurements? It would be very interesting to work on that, or if you work on that, I'd be very happy to learn about it.

One thing to mention, the other RIRs are also committed to support ASPA in the course of this year, so by the end of the year you should see ASPA appear also in other RIRs beyond ARIN and us.

That's it. I would like to go to questions and comments.

(Applause)

AUDIENCE SPEAKER: I have one addition to AS0, I have observed a bit of hijacking recently from people putting in just regular ROAs from legacy space, so, when I try to nudge my customers to ASPA sign, I also told them all the unused ASNs I put AS0 so nobody can easily hijack it so this will poison your data a bit.

TIM BRUIJNZEELS: Then I should not see them in the path, or should I?

AUDIENCE SPEAKER: No.

AUDIENCE SPEAKER: Mick O'Donovan. If you go back to slide 12, just to discuss the setup around the providers that may be providers at some point in time.

So, Max came to INEX through the oldest programme, had a training session on ASPA earlier in the year, and there was quite a lengthy discussion about this. It's probably more common than not in the research and education sphere, where we meet another NREN that potentially could become a provider at some point in time through either a semi‑automated or a manual process, but it's not something that's already active and ready to be used. And the question I have is maybe philosophical, but when do you actually add AS4 as your provider in this instance? Is it always there, ready to be used, or is it only when you need it to be used, just as the path of using transit via AS4 is actually going to become available to you?

TIM BRUIJNZEELS: Yeah, I guess that depends on your local situation. I mean you want it out there before you actually use it. So if you know it can happen any time then you should always have that. Also if you have a backup provider that you don't normally use, you need to have them actually.

AUDIENCE SPEAKER: Then I guess, to counter that, if it's always there, and you actually don't want it, maybe there could be a hijack by virtue of the fact that it's there and not yet in use.

TIM BRUIJNZEELS: Yes. ASPA just talks about essentially the policy side of it, so if they say that can be my provider, then that's okay. It doesn't protect against hijacks where people spoof the path. For that you would have to look at something like BGPsec. It does make it harder, however, to make a plausible path, but in this case they could use that, yeah. It depends a bit on how many people make statements. If many people made statements it becomes harder. Yeah, it remains open.

BEN CARTWRIGHT‑COX: Question: Networks who also use AS0 ASPA when they are not connected yet to an upstream to protect their ASN from misuse while setting up their network, the invalids could be this kind of abuse or leaks. What would be your advice to use as ASPA for networks that are not yet in use?

TIM BRUIJNZEELS: Oh, I am not sure if I am qualified to answer that, to be honest.

BEN CARTWRIGHT‑COX: The answer is AS0, right?

TIM BRUIJNZEELS: Well, I think... if you don't use it you don't have a provider, right, so it wouldn't be a lie. I think it doesn't hurt to say it. If it's best practice to do so, I don't know.

AUDIENCE SPEAKER: I would just like to say that I am operating AS4492, which is ASPA invalid on purpose, since January I think, so it's probably part of your invalids but it's ‑‑ I am just saying it as a public service announcement in case anyone wants to do any measurements: this AS should not appear. Like, if you see it, ASPA validation did not work. So it's IPv4 and IPv6, if you want to use it, any researchers, any operators. Thank you.

AUDIENCE SPEAKER: Maria, developer, I'd like to thank you for proving my point from Monday, because you got it wrong. This slide, you don't validate, the ASN does not validate the path which begins with 7, it validates the path which begins with 2, because the validation happens on ingress, which means you have not yet prepended. This is what I wanted to put into the draft, but it was rejected because it would be too performance sensitive. So, this is not to blame you, the draft is so ‑‑ the draft is missing key places where you could anchor. The same thing, you basically proved my point when describing what is valid and invalid and what is unknown, because you actually went over the algorithm which is in the draft, which is actually backwards, and if you explained it from the other side, it would be easier, but it misleads you. So thank you.

TIM BRUIJNZEELS: All right. Thank you.

BEN CARTWRIGHT‑COX: Alexander Azimov: Great talk, I wonder if you have analysed if there are any common patterns in the invalid paths?

TIM BRUIJNZEELS: Not yet. I didn't have time. I did have ‑‑ I also had a bug, by the way, which I discovered this week, so thank you for that, Maria. No, I had a bug in my implementation where, because people can repeat their AS for traffic engineering purposes, that was all showing up as invalid for me, but it wasn't actually.

But I haven't been able to look at the rest of the data yet.

AUDIENCE SPEAKER: Tom Strickx, operator of RPKI observability platform, I guess that's the only thing we do.

Two things. One, we will start advertising ASPA invalid and put it on at BGP sooner rather than later, and hopefully we'll have some statistics that we can start talking about. Second thing, that's more to Mick's question about what if you have AS4 and you are not actively using it? One of the mitigations that you can use, if you have an active peering session and they are not a provider at that point in time, is using the OTC (Only to Customer) attribute on your prefixes. It's not a tit for tat kind of same thing but it's close enough that it will at least prevent some of the leaks that could occur, so that could be a recommendation: if you want to do that as an NREN, set OTC on the prefixes that you are advertising to your peer, put them in your ASPA, and you are fine. The second they become an actual transit, you remove it from your prefixes and then you'll be fine.

AUDIENCE SPEAKER: Ondrej Filip. I have a kind of small warning about quick adoption, because I have a realtime example, or real life example, from this meeting. We are early adopters, and one of the old peering partners was exporting our route without letting us know, and unfortunately this path was the best path for the upstream provider of this meeting, so this was exported to the route server here, and it was invalid of course, so this was just deleted. So I just would like to warn you about starting validation on certain networks. I suggest it in ASPA, but we should be careful in starting validation in some networks; it would fix things like the RIPE NCC stuff for quick reaction. I added that network to our ASPA and then it was fixed, but that's something we should keep in mind, that it's not so easy to start validating everywhere; we should start from the core parts of the network, for instance.

TIM BRUIJNZEELS: Yeah. I can see that. But there are two sides to this, of course. I think, yes, of course you are right. You need to be careful, definitely early adopters need to be careful, but it's also a data quality issue. So, looking back at how things went with ROAs, I was there when that happened, and in the early days, ROAs didn't really match what was going on in BGP, it was like 50% really far off. Then we used this, what I suggested, like, essentially a suggestion engine in the interface, which helped people catch a lot of issues. But it doesn't get you all the way there. It got us to about 90%, but then when people really started to do route origin validation and things that were invalid were dropped, that was also a very strong signal for people to update their ROAs or remove them altogether, and in most cases they just updated them.

So I think there will be a degree of that when people start validating, when those ASPA objects were not set up properly, and I am not saying it's your mistake, because the mistake can be elsewhere. But then, yeah, then those things will come out.

AUDIENCE SPEAKER: But this was not a mistake, we couldn't do better, we just know that someone was exporting the route. Don't start validation first, ask your upstream first, that's my advice, you know.

TIM BRUIJNZEELS: I would say it's prudent also if providers warn their customers first and all that. Yeah.

AUDIENCE SPEAKER: We are trying to push all our AS cone to ASPA, and as we get closer to having more validation and filtering based on ASPA, we are missing one of our downstreams, AS12654. Are you having any plans on adding ASPA to that?

TIM BRUIJNZEELS: Come again.

AUDIENCE SPEAKER: Routing Information Service doesn't do ASPA yet.

TIM BRUIJNZEELS: Right. Yes.

AUDIENCE SPEAKER: Adding ASPA beacons to that so we get more validation.

AUDIENCE SPEAKER: RIPE NCC. We really would like to do ASPA beacons. For that we would need a separate LIR to have a smaller blast radius, and as the RIPE NCC it's kind of difficult to get an LIR, but we want to go for that.

TIM BRUIJNZEELS: Yeah, so there is an issue around people accessing the beacon configuration, RPKI configuration used for beacons and also having access to the production network, that potentially causing other issues and compliance questions and... yeah.

AUDIENCE SPEAKER: Job Snijders, Calgary Internet Exchange. If you go to slide 15. The Calgary Internet Exchange, where I volunteer, has been doing ASPA verification on the route servers for three years now, and this is a practice I recommend other Internet exchanges to copy, because nowadays OpenBGPD and BIRD support ASPA, so you can do this as an Internet Exchange operator.

The Internet Exchange has, I don't know, like 100 peers, 250 gigs of traffic, about 175,000 routes on the route server, and of those, 470, or about 0.2%, were ASPA invalid. Which I think is a super low noise number, and we're never going to get to zero, because the whole point of ASPA is to filter out invalid routes, and there is constantly some form of leakage from unintended consequences of configurations that we want to block. So, the system is working, is my conclusion, because we are blocking a modest amount of routes.

So, yeah, there is some uptake, there are some benefits, so I think, to Ondrej's point that there was no mistake: yes, there was a mistake, there was a leak, and ASPA blocked it; the system is working as intended. And I think what he meant by stub networks is single-homed networks, is that correct? Okay, that is a violation of RIPE policy, it must be multihomed, wait a second. I am kidding.

If you are single-homed, I consider that an exception of sorts, and at that point, yes, defer to your upstream, because you outsource your routing decision‑making to your upstream; you might as well defer filtering to your upstream if you are single-homed. If you are multihomed, you have an upstream and, say, an IX connection or a peer, you know, you have more than one eBGP session, you are multihomed, then it makes sense to do ASPA filtering, because maybe one path is a leak and it's blocked by ASPA and other paths are clear. And these considerations, we went through the same mechanics with the deployment of universal ROA validation in 2020. At the time, there were like 5,000 invalid routes in the ecosystem and people were fearful, like, okay, that's a really big number, are we maybe tossing the baby out with the bath water or something? And it took the community some effort to understand, like, well, the 5,000 number is big, but actually the traffic flowing towards those invalid paths is so minuscule, and the advantages of not accepting mis-originations are so large compared to missing misconfigured destinations, that eventually the community agreed to the trade‑off, like, yes, we want to block invalids.

So, I think with ASPA we'll have similar conversations, and people will be like, ah, you have 470 invalid paths on your route server, that's a problem. Like, I don't know, we're kind of leak free, that's also nice to have. So ‑‑ and in any case, to your point, Ondrej, if we start at, say, Internet exchanges, the route servers, they, by definition, have partial Internet tables that they distribute to their clients. So if they distribute a few fewer routes, with the advantage of also not distributing leaks, that's a great start. So I think you also have a hat with an Internet Exchange on it, what are your deployment plans?

ONDREJ FILIP: Again, I am not against ASPA. I am really into the gist of it, I just warned about the single-homed network, which is by the way this one we are on now ‑‑ but I am fine with that and, yeah, the exchange point has plans to support ASPA soon, so yeah.

BEN CARTWRIGHT‑COX: A follow‑up point. He says pretty much echoing what Job said the previous comment about filtering it's done in multihome networks, it's common to have a default coming from your transit provider in that situation, because you are basically outsourcing all your decisions to that provider anyway so you may as well stick a default.

SEBASTIAN BECKER: So thank you.

(Applause)

And now we go to weird things in RPKI from Job.

JOB SNIJDERS: Good morning everyone. My morning started great. I went on a run, succeeded, and then ate some yogurt with granola and dropped it on the carpet, and now I am standing here.

I want to talk about weird things in the RPKI. And this is somewhat related to routing but it's also not entirely. We're talking about quirky aspects of data synchronisation or logically consistent states of how the information that is encapsulated in RPKI... yes, and then in turn influences the routers.

This talk is based on data collected by RPKI views. And this night I was working with word art, and I am really proud of this beautiful image to reflect the weirdness of what is to come.

What is RPKI views? Quick recap. It's about 14 nodes distributed over the globe. They are in various locations but I am still looking for collectors in Africa, South America, Russia, so if you want to contribute to the project, e‑mail me.

What RPKI views does.

It runs a validation process that discovers the current certification authorities of the RPKI and their subordinate products like ROAs, and it validates the discovered products and then it stores the results. And normal ISPs would then send the results to the BGP routers, but what I do is I send the results to a storage bucket somewhere for later inspection.

So, all these 40 nodes run the process in a loop, upload the results to a central location, and on a daily basis, the central location deduplicates, sorts the results and produces an RPKI spool archive.

This results in a pretty complete image of the RPKI. I was able to cross-reference this archive with another collection mechanism that doesn't do cryptographic validation, and on that I based some confidence, this is not LLM confidence but like Job Snijders' confidence, that more than 99% of all events that transpire in the RPKI are captured, and nearly a hundred percent of the operationally relevant states in the RPKI. So this is the all-seeing eye of Sauron, but for the RPKI.

Latest statistics. From January this year I collected almost 25 million RPKI objects and about a quarter million, like, snapshots of the RPKI. That's roughly every few seconds an image is produced of the RPKI.

So, let's highlight aspects. The production is on a daily basis. So every day, I compact yesterday's data. This makes analysis quite easy. One day, February 12, I looked at what was captured that day and this is the output on my screen. I really like garbled text. And the astute reader will see something really weird in the previous slide, right. Let's highlight what is weird.

So the RPKI spool for February 12 should contain the data that was new on February 12. But we see here, in the timestamps in the tar file, and those timestamps are reflections of timestamps internal to the RPKI objects, that a bunch of these objects originate from 2025, almost a year earlier. I am like, what is happening here? I was expecting objects that were created on February 12 in the February 12 archive. This violates some assumptions about causality.

So this stood out to me like, is my collection broken? Am I somehow you know wrinkling through a Time Machine? Can I make money with this? The answer was no. But it turned out there was a software bug somewhere.

So let's investigate.

I looked up one of these objects and it is an object produced by the ARIN hosted CA system. And there are various fields in this binary encoded object that are worth highlighting. The size: 2,149 bytes. The file name, which is, C 601, blah, blah, blah. The signing time, a moment in time, pinned to January 2026, so that's like a good two and a half weeks earlier than the archive should contain data for. And then the contents of that ROA.

And I discovered that what I had discovered in February was the same named, same sized, same signing time object, but it was a different ROA with a different hash and slightly different content. So, if we flip back and forth between these two images, what happened was that a ROA IP address payload entry was removed. And as a consequence, the end-entity certificate's RFC 3779 listing of the resources contained within the ROA flipped from a CIDR prefix, /22, to a range. And there are all kinds of rules of how ROAs must be constructed, but the net effect is that by removing an entry in one particular field, in another field a few bytes were added, and that results in a same sized ROA, and if you then combine that with recycling the CMS signing time, which is a form of backdating, ARIN if you are in the room.

So you are like, fine Job, who cares! I will tell you why we should care.

Now, let's look at a formal description of the RSYNC algorithm. What RSYNC does is it connects to the remote RSYNC server and it asks: what have you got on offer for me today? And client and server, for efficiency reasons, will work through a list where the primary key of the files is the file name, the file size and the last modified timestamp on the file system. And that means that client and server only need to do a stat syscall instead of an open syscall, and the stat is way cheaper than the open, because the argument is that if all three of these parameters are equal, then the contents can be assumed to be equal.

But as I just demonstrated, we now have a case where the three parameters are equal but the contents are not equal at all. Like the top line, the hash identifier, is the SHA-256 of the object contents, and it changes between these two versions of the object, so it's not equal, it should have moved. But RSYNC is like, this is fine, we're going to skip transferring that file.
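
The quick-check heuristic described here can be illustrated with a minimal Python sketch. It is not rsync's actual code: the file name, payload bytes and timestamp are made-up assumptions, and the comparison only mimics the (name, size, modification time) key that the talk describes.

```python
import hashlib
import os
import tempfile


def quick_check_key(path):
    """Return the (name, size, mtime) triple a stat()-only comparison would see."""
    st = os.stat(path)
    return (os.path.basename(path), st.st_size, int(st.st_mtime))


def sha256_of(path):
    """Hash the file contents; this requires opening and reading the file."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def demo():
    """Build two same-named, same-sized, same-mtime files with different contents."""
    with tempfile.TemporaryDirectory() as server, tempfile.TemporaryDirectory() as client:
        name = "example.roa"  # hypothetical file name
        old_copy = os.path.join(client, name)
        new_copy = os.path.join(server, name)
        with open(old_copy, "wb") as fh:
            fh.write(b"version A: 192.0.2.0/24")  # stand-in payload, not a real ROA
        with open(new_copy, "wb") as fh:
            fh.write(b"version B: 192.0.2.0-25")  # same length, different bytes
        stamp = 1768435200  # arbitrary identical modification time for both copies
        os.utime(old_copy, (stamp, stamp))
        os.utime(new_copy, (stamp, stamp))
        same_key = quick_check_key(old_copy) == quick_check_key(new_copy)
        same_content = sha256_of(old_copy) == sha256_of(new_copy)
        return same_key, same_content


if __name__ == "__main__":
    print(demo())
```

In this sketch the two copies share the same stat-level key while their hashes differ, which is the situation in which a transfer based only on that key would be skipped.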

Now, what then happens is what RPKI nerds call a failed fetch. So, on the ARIN CA system a new manifest appears, the validators obtain the new manifest, the new manifest references the ROA, but that new version of the ROA was not fetched because the RSYNC algorithm is like, no, it's fine, we're not going to move data that's the same, and then the validator essentially does what in distributed systems would be called a "dirty read", because it recognises there is an internally inconsistent state: I need a ROA that is not here, it's not here because it's not transferred, because the RSYNC algorithm is like, there is no need to transfer it, and what then happens is that the validator is like, okay, well, this package of information is incomplete, I will keep using the old manifest with the old set of ROAs until that expires.
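
The fallback behaviour described here can be sketched as follows. This is a simplified illustration under stated assumptions, not the logic of any particular validator: manifests are modelled as plain dictionaries mapping file names to SHA-256 hex digests, and the function and variable names are hypothetical.

```python
import hashlib


def sha256_hex(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def pick_publication_point_state(new_manifest, fetched_files, cached_state):
    """Accept the new manifest only if every listed file matches its hash.

    new_manifest:  dict of file name -> expected SHA-256 hex digest
    fetched_files: dict of file name -> bytes present in the local copy
    cached_state:  the last consistent (manifest, files) pair
    Returns (state_to_use, reason).
    """
    for name, expected in new_manifest.items():
        data = fetched_files.get(name)
        if data is None or sha256_hex(data) != expected:
            # Internally inconsistent: keep using the previous consistent state.
            return cached_state, "failed fetch: " + name
    return (new_manifest, fetched_files), "ok"


if __name__ == "__main__":
    old_roa = b"old ROA bytes"
    new_roa = b"new ROA bytes"
    old_state = ({"a.roa": sha256_hex(old_roa)}, {"a.roa": old_roa})
    new_manifest = {"a.roa": sha256_hex(new_roa)}
    # The transfer skipped a.roa, so the local copy still holds the old bytes.
    print(pick_publication_point_state(new_manifest, {"a.roa": old_roa}, old_state)[1])
```

The sketch returns the cached state whenever a referenced file is missing or has the wrong hash, which mirrors the "keep using the old manifest until it expires" outcome described in the talk.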

Now, that means that the manifest needs to time‑out or expire before the spine starts moving again. And it depends a bit on the RIR and the CA system what the exact time parameters are but this could take between eight hours or maybe two weeks, really depends on some variables.

The jury is still out on what the optimal timing parameters are for stuff like manifests and CRLs because shorter lifetimes means better protection but longer lifetimes causes less transfer so it's an engineering trade‑off.

Long story short, you end up with a situation where you updated a ROA and it doesn't reliably transfer or synchronise into the wider ecosystem. So you then have the super weird situation that you changed a ROA and there is no change in some ISP somewhere that was unlucky enough to sort of hit the perfect alignment towards the unfortunate situation that the ROA didn't transfer. And that's insanely hard to debug, there must be somebody somewhere that was just flabbergasted: I changed my ROA but nothing ‑‑ it's not working, what did I do wrong? It's through no fault of their own, because they influenced the ARIN web UI, but it's the ARIN system that creates the digital data, signs it and publishes it, so to some degree it's out of their control.

Okay, do we get rid of RPKI because of this type of race condition? I can assure you that is not the correct answer. A race condition like this is exceedingly rare, because the majority of validators will use RRDP as their primary synchronisation mechanism and this particular race condition cannot exist in RRDP.

And like, the chance of you modifying a ROA and it remaining the same size, are also pretty small. So it's really like you need to, it's the corner case of a corner case before you arrive in a situation where you would ‑‑ the ROA doesn't propagate.

And this particular bug, I only found it with the ARIN system. They acknowledged the bug, indicated they would deploy a fix. I think it's deployed but I am waiting for confirmation, so, okay, that should be the end of it. Interesting race condition, it was resolved.

But it does show that there can be some surprises in the RPKI. Like, the RPKI is a distributed database and that brings us to distributed database problems, and they often are hard to debug. And how do you go about debugging something like one out of 1,000 validators does not have my new ROA? You probably wouldn't even know that that is the case, because you don't have access to all of those validators to analyse their state. So that's one takeaway. Like, you can do everything perfectly in your API ROA provisioning software or your process using the web UI, you can configure the ROAs correctly, you can correctly time the modifications of your ROA configurations, like get the start or end of project, and still have a situation where things don't work as expected in exceedingly rare circumstances.

Another takeaway: the timestamps embedded in the RPKI objects could be nonsensical. They are a purported time at which the object was signed, but it's an assertion made by the signer and it is not something that correlates to your clock on the wall. In fact there is no universal clock in the RPKI. It is a distributed database. So, when you analyse RPKI data, the CMS signing times or the thisUpdates or notBefores, or whatnot, they are all interesting parameters, but you have to realise that they are maybe logically consistent in the context of that certification authority, or they are nonsensical. So you have to establish your sequence of events through other means, and that means you have to look at the validation rules of the RPKI, and that means, like, a manifest must go up and to the right, it must increase its serial number, the manifest number, in a monotonic fashion; the structure of CAs can be modelled as a monotonic semi‑lattice. This is all a little bit more work, but if you do that type of work in your analysis, then you can start seeing the really quirky things.
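
One way to picture ordering events by manifest number rather than by signer-asserted timestamps is the small sketch below. The record layout, CA label and values are illustrative assumptions, not the RPKI views data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestObservation:
    ca: str               # hypothetical label for the issuing CA
    manifest_number: int  # expected to increase with every new manifest
    this_update: str      # signer-asserted time, kept only for reference


def order_events(observations):
    """Order per CA by manifest number, ignoring signer-asserted times."""
    return sorted(observations, key=lambda o: (o.ca, o.manifest_number))


def regressions(observations_in_arrival_order):
    """Flag observations whose manifest number fails to increase for that CA."""
    highest = {}
    flagged = []
    for obs in observations_in_arrival_order:
        previous = highest.get(obs.ca)
        if previous is not None and obs.manifest_number <= previous:
            flagged.append(obs)
        else:
            highest[obs.ca] = obs.manifest_number
    return flagged


if __name__ == "__main__":
    seen = [
        ManifestObservation("ca1", 10, "2026-02-12T00:00:00Z"),
        # A later manifest carrying an earlier asserted time still sorts second.
        ManifestObservation("ca1", 11, "2026-01-25T00:00:00Z"),
    ]
    print([o.manifest_number for o in order_events(seen)])
```

The design choice is that the asserted time never participates in the sort key, so a backdated object cannot reorder the reconstructed sequence.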

Another takeaway in all of this is: as a CA, do not reuse file names that were previously used for previous versions of the objects. It is a disadvantage if your ROAs, throughout time, do not use a new file name for each new issuance.

Because if we go back to the situation, the subject information line with the light pink‑ish highlight, it is just entirely unnecessary for the system to reuse that particular file name across multiple issuances or multiple versions of the ROA. And, you know, we're in Europe so it's safe for me to say this, but in this regard the RIPE NCC system is much better than the ARIN system. The RIPE NCC system will use different file names for different issuances, right?

ARIN should copy that from RIPE. Yes.

Now, so reusing file names is just unnecessary, there is no benefit to it, it makes debugging harder and it can hamper propagation if there is some kind of issue with, like, the timestamps on the RSYNC server. So again you have the moon and two planets aligning, and it happens once every hundred years, but when it happens, you say, I wish that did not happen.
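
A minimal sketch of giving every issuance a fresh, high-entropy file name follows. The naming scheme is a hypothetical illustration, not the scheme used by the RIPE NCC or any other CA system.

```python
import secrets


def fresh_object_name(extension="roa"):
    """Return a new random file name for each issuance (hypothetical scheme)."""
    # 16 random bytes rendered as 32 hexadecimal characters.
    return secrets.token_hex(16) + "." + extension


if __name__ == "__main__":
    first = fresh_object_name()
    second = fresh_object_name()
    print(first)
    print(second)
```

With names like these, a new version of an object never shares its file name with the previous version, so a comparison keyed on name, size and modification time cannot mistake one for the other.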

And weird things happen in the RPKI. Like, this was a surprise to me and I think a surprise to many other people as well, like, oh, we did not foresee that happening. And it is important that we, as a community, remain vigilant and actively search for, like, are race conditions possible in the new standards that we're developing? Can we find them? If we can find them, are there reliable mechanisms that we can document to prevent them, or is this a risk we have to accept and...? But we have to investigate quirky stuff like this even if it's rare. Because, if one out of a million times something happens but that one time ‑‑ like, imagine you created a ROA, you fill in the origin ASN, you hit submit, and like right after you hit submit, same with e‑mails, right, you spot the typo. You are like, oh, that 5 should have been a 6. And you see traffic towards your prefix decrease, because now you have made your prefix RPKI route origin validation invalid and the big providers are rejecting your routes.

And that ROA propagated like light speed. It rode the happy path and it propagated to the global Internet in a span of just a few minutes. You are like, that is real, how do I explain this to my boss? You know what, I will quickly rectify the typo, I will change that back to a 5, and you hit submit and, for whatever reason, because you changed that one parameter, the file size of your thing didn't change, and then you hit the unlikely situation where the modification timestamps also match, and it means that your second ROA, which is supposed to fix the error you introduced in your first ROA, is not prop ‑‑ it may take hours in some scenarios to propagate. So I think RPKI propagation is sort of a, like, P 99 propagates in 15 minutes, one five, twenty minutes, something like that. But there is also a small non‑zero percentage where propagation is just slower for all kinds of reasons, and this could be one of them. I guess I am saying don't make mistakes, don't write bugs in your software. Don't make typographic errors. It's simple.

We're nearing the end of my presentation. And I am just going to ask myself some questions and answer them.

Is this the only weird thing you have seen? No. In the APNIC system, I found something else. It is not backdating of new objects, but it is recirculation of older objects. So you deleted an object a few months ago, you created a new object, you delete that new object, and then for a few seconds, the old object reappears and then disappears. I have no mental concept of how you would programme a system to make that happen and we're still investigating why that is. It doesn't seem to have operational impact but it is pretty funny, I think.

But, yeah, I don't have a lot of time today to explain all the weird things I have seen in recent years of research.

Another valid question will be like: Okay, Job, RSYNC is stupid, it's old, don't we have RRDP to prevent this type of problem? Answer would be, correct. RRDP would not exhibit this particular type of problem. But luckily enough it introduces new classes of problems and new types of weird internal inconsistencies. So, it's, choosing between two evils is too strong of a word, but like RSYNC has upsides, it performs way better in low bandwidth circumstances. But it has downsides like I described just now, like the integrity of the synchronisation depends on a heuristic that is cheap but not 100% reliable. And then RRDP has better performance in the happy path, but it has some weird thundering herd corner cases that can cause RIRs to become congested.

So, I think the RPKI community is working to improve upon these protocols, active research is being conducted. But there is not yet a silver bullet for the ultimate reliable propagation of RPKI but we'll eventually get there surely.


(Applause)

SEBASTIAN BECKER: Are there any more questions besides the ones Job pointed to himself? No. There is, there was one question ‑‑ not in the Q&A but from Shane Kerr.

BEN CARTWRIGHT‑COX: Does RPKI not use the actual wall clock time for expirations of signatures and things like that?

JOB SNIJDERS: The RP ‑‑ so the question is, is the validator not using its local time to ascertain whether the certificates are valid or not? And the validator is using the local time to understand: is this manifest still valid, or is this ROA still valid. But all manifests and all ROAs are distributed with a lifetime of, say, between eight hours and two weeks. So there is a long window for things to not have yet expired from the perspective of the validator.

AUDIENCE SPEAKER: Tom Strickx, Cloudflare. I know there is a BCP at the IETF on the publication server side of things. Do we need to do some updates on the IETF side to just kind of give you additional sticks to whack operators like ARIN with, like, could you please not do shit like that, because as you said, it's not constructive or helpful in any way or form.

JOB SNIJDERS: Yeah, that's an insightful question. The Working Group has undertaken an effort to produce a really long document on what the authority lessons are in this, oh you can shoot yourself in this way, you can hang yourself in that way, wow. And this document encapsulates, like, our current understanding of how to prevent certain race conditions, so it has been written out: use unique file names for new issuances. But migrating from your current system to a new strategy is non‑trivial and this is a production grade system, so the likes of ARIN, or for instance Krill, which also reuses file names, are faced with a challenge, like, okay, we have a deployed system running, you know, helping secure the Internet, so we need to figure out, like, a reliable migration path towards the new strategy, and that, it takes time. So I discovered this problem, I think, in early April, and it had transpired in February, so, you know, the machinery is running but these are not quick processes. So, I wouldn't be surprised if the likes of ARIN or Krill could adopt a new scheme maybe next year or who knows, and in the meantime, they rely on ensuring that the heuristics of timestamps are correct. And of course the cool thing with documents is you can write them, but, you know, you also have to encourage people to read them. And when you read a document, it's not always clear why we should use short high entropy file names. And I think, because for like 15 years it was written down, like, use this file name strategy, and everybody is like, okay, seems cool, but does anybody know why? Now I am like, oh, I understand why, it's about the entropy, that is the characteristic of the file naming scheme that is the attractive property. But that means it took me like 12 years to realise why the practice even existed. If it takes me that long to understand it, I am surely not the only one.

So in this sense, the RPKI is really a collaborative effort and a joint learning with very, very much based on progressive insight or hindsight.

AUDIENCE SPEAKER: I just want to say that like generally speaking backdating signatures is a very big no no, especially doing it so far. So, I would suggest that software implement checks to ensure that they don't accidentally backdate in the signature systems too far. Yeah, because that's ‑‑ I can't think of any good reason to do that.

JOB SNIJDERS: You are not wrong but also not on the mark, because there is a delay between producing the object and discovering it and validating it. So if I start ‑‑ so imagine you today create a ROA, it's May 20th. I start a validator two weeks from now with an empty cache, that validator instance will download the snapshots and synchronise the data and it will discover the ROA that you created today. So for that validator instance, two weeks from now, the ROA you created today is going to be two weeks old. And this is why, like, clocks in the RPKI are sort of a psycho thing, because you cannot just say it's not realtime. It is, there is an offline protocol where events can be smeared out over multiple months. For instance the RIPE NCC top level manifest is re-signed every three months, so, yeah, it's ‑‑ you know, at some point in a cycle it's going to appear very fresh, less than 24 hours old, and then as time goes by it becomes older and older, but it's not backdated. So, differentiating backdating versus, oh, this was just an old object newly fetched because my cache is empty, is very hard, and this is why RPKI views is so useful, because you have all these validators slurping in data and deduplicating it, and then you can recognise situations like this, because in a single instance recognising backdating without prior state is impossible.
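
The role of prior state can be sketched in a few lines. This is an illustration only: the observation tuples are an assumed layout, not the RPKI views archive format, and the check simply looks for a file name and signing time that reappear with a different content hash.

```python
def find_suspect_reissues(observations):
    """Report objects re-seen under the same name and signing time but a new hash.

    observations: iterable of (file_name, signing_time, sha256_hex) tuples in
    the order they were collected. A fresh validator with an empty cache has
    no earlier entries to compare against, so it cannot make this distinction.
    """
    first_seen = {}
    suspects = []
    for name, signing_time, digest in observations:
        key = (name, signing_time)
        earlier = first_seen.setdefault(key, digest)
        if earlier != digest:
            suspects.append((name, signing_time, earlier, digest))
    return suspects


if __name__ == "__main__":
    history = [
        ("x.roa", "2026-01-25T10:00:00Z", "aaaa"),  # made-up values
        ("x.roa", "2026-01-25T10:00:00Z", "bbbb"),  # same name and time, new hash
    ]
    print(find_suspect_reissues(history))
```

An old object that is merely fetched late keeps its original hash and is not flagged here; only a changed object that recycles an earlier signing time is.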

SEBASTIAN BECKER: I think we are done, we are closing the RPKI Working Group for today. Big applause to all our speakers please.

(Applause)

We go to the closing notes.

We have two announcements to make.

I was asked for, that there will be an IPv6 at once Kahoot quiz in the side room at 13:30. So half past one. Join for a chance to win exclusive IPv6 T‑shirt if you want.

BEN CARTWRIGHT‑COX: Then I guess the only thing left to say is please rate the talks and if you are interested in potentially running for the next Working Group Chair, join the mailing list, we'll make an announcement probably in the next, I mean, hope, weeks but, you know, join the mailing list please.

I think that's it. Let's go for lunch.

(Lunch break)
