MAT Working Group
Thursday, 25 November 2021
At 4:00 (UTC +1)
BRIAN TRAMMEL: Hello everyone, welcome to MAT WG. Actually, probably because of the way that noise cancellation works, let me go and put on some headphones for a moment. So I am Brian, this is Nina; for those of you who have not been to a meeting, we are your humble and friendly Chairs. For the first time in a while we have a slightly relaxed agenda; the last few meetings we tried to pack slightly too much in. Today we have three presentations, plus time for any other business or open mic if anyone wants to come up and say anything about measurement, analysis and tooling. We will start off with George Michaelson, who is speaking about economy-level views of Honeynet data, followed by Robert Kisteleki and Massimo. I understand Massimo is currently in transit and working to get to a computer. If he is there and set up in time we will be hearing from Massimo in person; if not, we will be playing a prerecorded version of his talk followed by a Q&A, as he will be here at the end of the meeting. I know we usually give the tools update at the end but we have shifted this up for Massimo's schedule. With that, I think we can go ahead and get started and call George Michaelson to the microphone.
GEORGE MICHAELSON: Hello, can you hear me?
BRIAN TRAMMEL: We can, good morning, thank you for joining us at, what is it, 2.00 in the morning there now.
GEORGE MICHAELSON: One in the morning.
BRIAN TRAMMEL: That's perfectly fine then.
GEORGE MICHAELSON: So my slides are coming through okay?
BRIAN TRAMMEL: Yes, they are.
GEORGE MICHAELSON: Hello everyone. I am George Michaelson from APNIC, and I am going to be talking today about a product we have in the information services group at APNIC called DASH, I am really going to be talking about some data in DASH which is about, at this time, bad traffic information, and it's a pack that addresses a straw man question, the classic problem for a speaker, what shall I speak about? I know, I will raise a straw man. So, this is a pack talking about the straw man question: How much bad traffic should I expect to see?
So, bad traffic is a term that we are using in APNIC, in our DASH service, DASH stands for the dashboard for AS number health and it's a service for AS holders to tell them about the things that we think might be a problem in their service announcement going to the world.
Right now, it's focused on traffic measurements from the worldwide honeypot network which is called Honeynet, but we are actually working to bring in related routing information services and to provide alerting mechanisms. The pane here shows you a view of our understanding of how much bad traffic is being sourced from your network. I want to stress, this isn't a reporting service about things you receive; this is a reporting service about things you emit. And it includes information about which prefixes you are originating that have been seen and over what time frame, and we show what we think your contribution to the bad traffic problem is compared to your economy and region.
Now, we have had requests to include more detail, the specific view of your IP and the time and the port, the reason being that increasingly people are behind CGN and they need this to back trace through the logs in their own network. We have decided to work on that and also on an alerting mechanism to reduce the time people have to spend coming to the web service, and to integrate more health checks that don't relate to badness. Well, they do, because they are likely to be indicating problems with your BGP misalignment with IRR and RPKI, but they are not bad traffic in the classic sense. So I want to be clear, the site is a general health site, but currently what it talks about is bad traffic, and I did say, in the title, I was going to say what bad traffic is.
So, bad traffic, we believe, is fundamentally about subverted systems. But what it really is, is the presentation of stuff seen by networks around the world that we are all seeing from time to time: things like SSH login attempts, or probes at port 80 and port 443 to see whether you are running an old version of Apache with known vulnerabilities, or DNS queries looking for open resolvers. So for our purposes we have asked the Honeynet to provide us with only information about TCP sessions, and we have done this because we want to discount unidirectional packet flow; we want evidence that there's a SYN/ACK flow so that we can understand that this is highly likely to be a real reflection of origination coming from your BGP. Of course it is theoretically possible that this is a hijacked prefix and that what they are doing is seeing two‑way flows within a restricted horizon that trusts this origination and path, and it's not actually you who is doing it, and we haven't taken that into account in this analysis. Really, I think that's quite infrequent. It's not to say there are no hijacks in the network, but for most people it's not the experience they have; they do have the experience of having people inside their customer network who potentially are sourcing probes out into the world. Nonetheless, that would have to be kept in mind.
So, the Honeynet. The Honeynet is a really lovely cooperative project and it's running toy systems, virtual systems in jails, to gather information, and there's a large number of these being run in a cooperative venture worldwide. There's actually quite a lot going on here, because the bad people now know that honeypots exist, so they have included meta tests to try and find out if they are in a jail. It's classic information asymmetry. Anyway, read the URL on the slide here because that explains much better what it is. We get data from them in a JSON format, so we take this data and we use the BGP record of the day and the delegated extended file to add basic information that we can then use to collate things by originator and custodian and registry and economy, and perform aggregations. Now, we do also know that the inference of economy here is a bit of a conjecture and it has to be seen as an error bar on the data that I'm going to present here.
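The enrichment step just described can be pictured roughly as follows. This is only a minimal sketch, not APNIC's actual pipeline: the field name `src_ip`, the two lookup tables and the AS numbers are all hypothetical, and a real system would build the tables from the day's BGP dump and the delegated extended file.

```python
import ipaddress

# Hypothetical lookup tables (assumptions for illustration only).
# Prefix -> origin AS, as it might be derived from a BGP table dump.
BGP_ORIGINS = {
    "192.0.2.0/24": 64500,
    "198.51.100.0/24": 64501,
    "198.51.0.0/16": 64502,
}
# Origin AS -> economy code, as it might be derived from a delegated file.
AS_ECONOMY = {64500: "AU", 64501: "NL", 64502: "NL"}


def origin_as(ip, table=BGP_ORIGINS):
    """Return the origin AS of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix, asn in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None


def enrich(event):
    """Add origin AS and economy to one honeypot event (a dict parsed from JSON)."""
    asn = origin_as(event["src_ip"])
    return {**event, "origin_as": asn, "economy": AS_ECONOMY.get(asn)}
```

Events whose source address matches no prefix come back with `origin_as` and `economy` set to `None`, which is one way the "error bar" on economy inference would show up in later aggregations.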
So, the straw man. How much bad traffic should I see? Well, we are starting from a hypothesis that traffic distributes evenly and badness distributes evenly across the surface of ISPs: every ISP everywhere is equally likely to have bad actors. So, we'd expect to see a distribution of bad traffic which followed approximately the distribution of Internet users around the globe, which means it would be a function of your size in users. And it just so happens, we have access to a model that talks about the size of different networks.
This model is a source of information that comes from outside the RIR system, it's the basic world broadband population statistics and these are being collated online by the originator for economic and Internet planning. Now, of course, it's highly approximate; there is a very nice example when China at the APNIC declared that year they had found 200 million extra online users, so we do realise the model itself is a potential source of error, but the model really isn't too bad. It follows population statistics quite well. It doesn't tend to exceed the local population data so we think that's good because population data is managed really carefully, and secondly, we do discuss the implications of this model with people in the community and broadly speaking, we don't get negative feedback. The community understands the limits of this kind of data.
So, the model exists because labs is using it in a mechanism to do weighting. The individual experiments in the ad campaign beneath labs don't distribute evenly in the world, and that's because advert presentations are very skewed: they are shown where people are currently online and active, it's a function of time of day and it's a function of the context of use. But the tests themselves, although they may not randomly distribute, are valid tests of an effect, and so it's interesting to look at rebalancing this and at aggregations, and labs uses the weights as a rebalancing mechanism to adjust sample counts so they fall into line with world population and distribution of users according to this information.
If it was possible to get ground truth about the distribution of AS relativities within an economy, I think labs would be really happy to do that. But there isn't information that's available that provides that neutral point ‑‑ source of truth about what happens within an economy. So at this stage, we are only really talking about adjusting numbers based on economy.
So, we in this experiment here I am talking about, we don't use weights to adjust things. We are using it to understand the expected volume of the things we should see, the straw man, how does that vary against the real data seen? Well, here is the results. Now, I promise a short talk, that's the end of the talk, we are all done here. Thank you, goodbye and I am going back to sleep. Okay. So, more seriously, what actually is going on here?
So, let's dig down to understand the data a bit better. The data is a time series and spans February 2020 to October 2021, and each row is an ISO 3166 economy from origin AS. The cells are indicating a rough model of the amount of badness relative to the amount we expected, so it's the ratio of what we saw versus what we thought we'd see. Grey means no data; green means you actually did better, and I haven't bothered grading you. So I have decided to bring things down and simply say you are good, but there is variance in how good people are. Orange to yellow is the initial range of around 2 to 10 times worse than expected, and red means a lot worse than we expected, up to two or three or four hundred times worse.
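As a rough illustration of how a cell in this kind of table could be computed, here is a minimal sketch. It assumes the expected count is simply an economy's share of world users applied to the total bad traffic seen, and it assumes that ratios between 1 and 2, which the talk does not place in a band, fall into the orange-to-yellow range. The function names are hypothetical.

```python
def expected_count(total_bad, economy_users, world_users):
    """Bad traffic expected if badness followed user population evenly."""
    return total_bad * economy_users / world_users


def badness_ratio(observed, expected):
    """Ratio of what was seen to what was expected; None means no data."""
    if observed is None or not expected:
        return None
    return observed / expected


def colour(ratio):
    """Map a ratio to the colour bands described in the talk."""
    if ratio is None:
        return "grey"            # no data
    if ratio <= 1:
        return "green"           # at or below expectation
    if ratio <= 10:
        return "orange-yellow"   # assumption: covers everything from >1 up to 10x
    return "red"                 # a lot worse than expected
```

For example, an economy with 5% of world users that emits 30% of the observed bad traffic has a ratio of 6 and would land in the orange-to-yellow band under these assumptions.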
So let's have a look at this in summary. Just viewed as a field, this field is green, which means most economies worldwide arrive at or below their expected ratio. And although you can't see it in this visualisation, the grey is mostly small economies. So, in a bad month, we do actually see non‑green states but these tend to be patchy, so I have flagged here some instances of intermittent problems and you can see that they really don't persist forever; it shows that there's something going on, something happens but then it's resolved. Whereas there are also economies that have a persisting problem that goes on for extended periods of time.
Now so that straw man, are things evenly distributed by population count? And it's completely exploded, it simply isn't true because that image showed you things don't distribute evenly.
Okay. Let's take a look at some individual economies. Now the principle in DASH is we don't name and shame, but this is a RIPE meeting and what's a RIPE meeting for if you can't talk about your friends, so I'm going to take the opportunity to mention the worst economy in this model is the Netherlands. Now, my suspicion is, and I have no proof, that this is because the CERT Honeynet community are using hosts in the Netherlands to do Cima Nmap probes around the world and do detailed scanning into people's hosts. I don't actually believe that the Netherlands authorities really are sitting on a persisting problem this bad but nonetheless they stood out in the crowd.
So let's have a look at how things distribute by intensity and scale.
This is a sort over the data to highlight the highest average hit rate by economy that we saw and you will see the Netherlands have floated up to the top, they are used to floating on top of all that water. You will also possibly notice your own economy here. You might see Hungary or Ireland or Finland or Bulgaria or Russia and it would be fascinating to know the stories that lie behind why these economies have a problem. My personal experience from talking to people in the communities that do this kind of analysis in Vietnam and Taiwan and Korea is that the CERTs take the origination of bad traffic inside their economy extremely seriously, and when we brought this to their attention, they generally were aware of it and they are interested in this display and they have gone back inside and pushed back on the problem, because although they have a defensive posture about bad traffic coming to their economy, they do have an internal problem and they want to fix it.
Okay. It's CDF of about 80% of the problem, the meta problem here is that it's also a CDF of about 17% of the population of users worldwide. It's the classic 80/20 problem. So let's have a look at real world scale.
This is the distribution by user population and you can see it's a completely different situation. If we take the real heavyweight large economies by scale, they are mostly green but there are still problems inside this community. So, when I say CDF, I mean the cumulative distribution function, the mechanism of how to add together all of the samples to arrive at a sense of how the curve is settling out in time towards the maximum countable things seen. Now, in this case, this is actually 40% of the problem, but it's coming from 78% of the population by CDF. So, 80 plus 40, that doesn't add together; well, these are two different sample sets and they don't actually linearly add, some of the heavy hitter economies do actually lie inside the set of both. So, summary: DASH as a service is helping AS delegates to understand bad traffic and we wrote it to be an information source for the AS holders to understand this. If we take the aggregate, we are really quite confident that the distribution of bad traffic is not even around the world, the problem is focal, let's ignore the Netherlands, and it's a problem in scale of about 30 economies, which means potentially it could be fixed because it's really quite small by economic distribution.
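To make the cumulative figures concrete, here is a minimal sketch of how a cumulative share of the problem and a cumulative share of users can be computed over a sorted list of economies. The economy codes and counts are invented purely for illustration and are not the DASH data.

```python
from itertools import accumulate


def cumulative_shares(values):
    """Running total of values, expressed as fractions of the overall total."""
    total = sum(values)
    return [running / total for running in accumulate(values)]


# Hypothetical rows: (economy, bad-traffic count, users in millions).
rows = [("AA", 500, 10), ("BB", 300, 20), ("CC", 150, 70), ("DD", 50, 100)]

# Sort by bad-traffic count, heaviest hitters first, then accumulate both columns.
rows.sort(key=lambda row: row[1], reverse=True)
problem_cdf = cumulative_shares([row[1] for row in rows])
user_cdf = cumulative_shares([row[2] for row in rows])
```

In this invented data the first two economies account for 80% of the problem but only 15% of the users. Sorting the same rows by user count instead gives a different pair of curves, which is why the two percentages quoted in the talk come from different sample sets and do not add.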
I think the economies do understand the problem and are looking at it but perhaps there is something we can do. Thank you for listening and I'm happy to talk about this if we have time. The problem here is that DASH is a service for logged‑in entities at this time. You can't see the data directly because we restricted access to the AS holders in our community but we are interested in bringing this data to the wider world and I would love to know if you think that would be useful. I will update the pack to include the full economy list so that you can all see all of the data we have in this talk. Thank you very much.
BRIAN TRAMMEL: Thank you, George. We will go quickly to questions. Brian.
BRIAN NISBET: Hi, Brian Nisbet from HEAnet. And I think, George, you have already answered my first question, which is: If someone who lived in an economy which turned up on a slide was interested in talking to their national CERT and other people about such things, what is the data that one could show them in relation to that? So you said you are going to add that to some of the packs, but given DASH is login only, is there more that I could turn around and show them or talk to them about?
GEORGE MICHAELSON: Well, that's actually a very interesting question. The problem here is that in fact the CERTs can see this information by going into their own access to the Honeynet system. What they are going to be shown there is the non‑aggregate view of the list of IPs, and so it's about digging through data that is in the public eye. What we have constructed is a reduction over this, and that question, is this potentially useful to people, was kind of floating out there. If we can come up with a way to take the aggregates and say these are acceptable to share, and we could provide plots that are finer‑grained in date and time, they'd appear in a service we have called REX, which is the Internet Explorer, it's the statistical basis of reporting on things seen in the internet as distributions but also use of IPv6, and potentially we could include this data there.
BRIAN NISBET: Cool. I will have a word with them anyway and see if they are already aware of this and things like that.
GEORGE MICHAELSON: I am sure the Irish CERT is aware of this problem, Brian.
BRIAN NISBET: It has been very distracted for the last six months.
GEORGE MICHAELSON: Can I say if I floated you to the top three it is an intermittent problem within Ireland, it's not a persistent problem, you are not Dutch.
BRIAN NISBET: Tall and all as I am, I am not Dutch.
BRIAN TRAMMEL: I think we are directly at time. I have a couple of questions but I will follow up with you off‑line, this was a fascinating talk, thank you so much for bringing it to us.
GEORGE MICHAELSON: Thank you.
BRIAN TRAMMEL: Next up, Robert Kisteleki from RIPE NCC with the tools update. Welcome back, Robert, this time with slides.
ROBERT KISTELEKI: Hopefully so, yes. All right. Welcome everyone to RIPE 83 in the famous city of localhost, country localdomain probably. This is your usual tools update from the RIPE NCC from your usual Robert.
First of all, things around the research that we do. As most of you probably know we do research activities, try to make sense of the data of Internet events for the benefit of the community and all of you basically. So here are a bunch of articles that we presented on RIPE Labs recently, you can go there at any time you want to and read them in full detail. And I am going to highlight some of them: latency into your network and the Facebook event that we probably all know. We also have some guest posts on Labs, which is highly useful. You probably know that we also produce some kind of country reports or regional reports every now and again; there was one about Mediterranean Europe and some others as well. One highlight is about the latency into your network, so this is the work of Emile, Augustine and Jasper trying to use the BigQuery data platform that was introduced relatively recently. We pool all of our Atlas data into Google BigQuery for the benefit of all of you and of course also our researchers; you can go there and dig in, which is exactly what Emile and his friends did, and they produced a lot of interesting maps and visualisations. So basically from all the probes we tried to see how far you are from those probes in aggregate, median and average and so on. And Emile delivered a presentation on Monday at the lightning talk, I recommend you watch that, it's a really good introduction, or you can go and read the article as well.
Sample images, these are all the probes and the visualisation of how far they are from the Google network and also from Amazon Cloud.
We also published a different article about the event that we saw about Facebook, you probably know that Facebook disappeared from the Net for a couple of hours the other day and initially people had lots of different ideas about what was really going on, whether that was DNS or not, but what we did here is basically, through the tools and through the data that we publish, we tried to explain what we saw and tried to show people how the tools can be used in order to figure this out yourself if you want to, in particular we used BGPlay which can visualise how the announcements happened in realtime when the event happened, so if you are interested in what BGPlay can do for you and what kind of data is behind it and how is that useful to monitor your own network or actually watch the events around your own network I recommend you check out this article.
Okay. Happenings around RIPE Atlas. What is most interesting, this is just highlights from the last half a year or so, we see a growing population of software probes which is really good, about a year ago we had 1,000 of them up and running, now it's approaching more than ‑‑ around 2000, I hope by the end of the year we are going to reach 2000. Which is really nice to see.
We are working on new hardware probes as we speak they are in the making so I am certainly hoping that early next year we can start delivering them to people who are really, really interested in plugging them in. We are still working on the UI, I am hoping next year we can catch up with the things we wanted to do this year as well.
You may have heard that we are offering new sponsorship opportunities, which in reality means we put a better definition of what we offer to sponsors and what they get in return around RIPE Atlas. If you are so inclined and installing anchors or you would like to support the RIPE Atlas software and service, then please step up as a sponsor and you will find all the details behind the link.
As you can imagine we are still working on the infrastructure, there is always a lot to do, and in this particular case, we, together with the other teams at the RIPE NCC, are looking at the possibility to put the actual data, real data, back into the Cloud and see if that makes sense, if that's going to make things more flexible, easier and so on and so forth. And following that, we will probably look at whether infrastructure should or could be on the Cloud as well; we don't know yet, we will figure it out.
About RIPEstat, many of you know that RIPEstat received a new UI very recently, or actually I think that as the RIPE meetings come and go it's probably not going to be that recent any more. We call it the new UI, UI 2020, which was released in 2021 for your convenience. This is much more mobile friendly than it used to be, it has a slightly different layout and it is basically aimed at people digging into particular sets of information, instead of just dumping everything on the users in terms of widgets and letting them figure out what they want. We are still supporting the old UI for the foreseeable future; foreseeable in this particular case probably means half a year or a year, and we are looking at what features people still need from the old UI and which features should be migrated into the new one in some shape or form so we can deprecate the old UI in time.
The team is also working on a revamp of the data API; this is a consolidation of parameters and of how we present data and the infrastructure behind it, so it's actually more useful to people. We hope that this is actually going to be an improvement from the users' perspective as well. RIPEstat is one of the first applications that started to use a new kind of documentation layout; you can expect that other applications will follow this as well. It has full text search and everything that helps you to figure out how to use RIPEstat. So please go ahead and try that.
Recently stat embarked on a block list project, we reached out to the community to ask, this is a feature that seems to be useful but we would like to tap into the common knowledge about what data sets we should be using in UI and what makes sense and what doesn't so this is an ongoing work, we expect to finish it soonish, a team of community members together with the RIPEstat team have been working on this recently so I expect that this will come to fruition soon.
And then we have operations. As you can imagine, however many queries a day, I think it's floating around 70 million or so, does require some infrastructure that we can trust, so the team has been working on a dedicated cluster for the new UIs, so basically the back‑end operations for that, and then they are also putting energy into making sure that the whole service is up and running and hopefully overloading one set of queries will not affect the others and so on.
User engagement. RIPEstat is using Feature Upvote, so if you would like to give us feedback then you can do that, basically starting from the UI, so to speak, and then using this feature you can let us know what you are interested in and then the team will follow up and see which ones are the most popular.
Okay. IPmap. RIPE IPmap is our geolocation infrastructure, infrastructure geolocation service. This did not really get the love and care, but in the last six months or so we stepped up our efforts to make it something that we actually wanted to do in the future. So, what happened is that we revamped the landing page to give you a bit more of a feel of what this data is about and where it is useful, so now there is a new feature where you can copy and paste a random looking traceroute and put it in, and the system will try to figure out the hops along that and try to visualise that on a map. Obviously, the success of this depends on how good the quality is and how much data we have behind the whole engines, so we have been working and we will continue to work on the so‑called engines in the back‑end that actually provide geolocation to the IP addresses that we get. So one of those engines is a latency engine, which is tapping into the RIPE Atlas traceroute data sets and tries to figure out, based on that, the adjacencies from known points around the globe and tries to map where those individual addresses are. A different back end is the Reverse‑DNS back end where, based on the names of the IP addresses, so basically the Reverse‑DNS names, we are trying to tap into that and figure out what names are mentioned there, and give a different opinion on where IP addresses are, which may or may not correlate with the latency engine, but basically if they agree with each other that's a stronger signal. For the moment we are using a relatively simple approach for the DNS but we are going to make it stronger over time.
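The idea that two engines agreeing gives a stronger signal can be pictured with a very small sketch. This is not IPmap's actual algorithm: the engine names, location codes and scores below are hypothetical, and summing scores per location is just one simple way to let agreement reinforce a candidate.

```python
from collections import defaultdict


def combine_opinions(opinions):
    """Combine per-engine location opinions for one IP address.

    opinions is a list of (engine, location, score) tuples. Scores for the
    same location are summed, so engines that agree reinforce each other.
    Returns (location, total_score) for the best candidate, or None.
    """
    totals = defaultdict(float)
    for _engine, location, score in opinions:
        totals[location] += score
    if not totals:
        return None
    return max(totals.items(), key=lambda item: item[1])


# Hypothetical opinions: both engines suggest AMS, reverse DNS also hints at FRA.
best = combine_opinions([
    ("latency", "AMS", 0.6),
    ("rdns", "AMS", 0.3),
    ("rdns", "FRA", 0.7),
])
```

Here the single strongest individual opinion points at FRA, but the combined evidence favours AMS because two independent engines agree on it.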
We will also add more techniques, we have some more ideas about how to incorporate these, including crowd sourcing and others, if you are interested please let us know and we are happy to explain it to you as well.
As a by‑product this can be used to verify the location of RIPE Atlas probes and anchors and we can spot some of the errors so we can reach out to hosts and say can you please fix these things, which is quite useful. As I mentioned with stat, the documentation is now in a new form, it's more accessible, it's easier to use, so we are doing that. If you are interested in this effort you can sign up to our mailing list, that is the mailing list itself and behind that link you can find the sign‑up form. The development team behind IPmap will actually send a mail very soon now I think, even today or tomorrow, so if you are really interested sign up now because you will get the latest updates from the horse's mouth very soon.
Where are we with RIS? We have introduced a new peering coordinator role and, as you can imagine, that is someone who is actually dealing with the peerings behind RIPE RIS, so that's great. We added two new route collectors, RRC25 and RRC26. RRC25 is a multihop collector in Amsterdam and RRC26 is in Dubai; if you would like to peer with them let us know.
Finally, I think this is finally, finally, there is going to be a RIS JAM very soon now, about half an hour from now, if you are interested in RIS or you want to talk to the peering coordinators, please join us there, the team will be there to support you and answer your questions. And I think that was it. Are there any questions?
BRIAN TRAMMEL: Don't be shy.
ROBERT KISTELEKI: This time we actually have time for questions. That's new for me. This is a shiny new experience.
BRIAN TRAMMEL: Going once, going twice. Robert, thank you very much.
ROBERT KISTELEKI: Okay, you know where to find us, if you do have questions come to our mailing list or talk to us. Thank you very much.
BRIAN TRAMMEL: Thank you. All right. Let us see if we see, I do not see Massimo in the chat so I will ask the host to go ahead and queue up the video and we will get started with this and hopefully have Massimo join us at the end.
MASSIMO CANDELA: I am happy to be here, it's always a pleasure. I am really sorry that you are watching this prerecorded, which means my flight is probably delayed and I am not yet in my hotel room. I hope to be online before the end of the presentation to answer all your questions and of course you can always contact me by e‑mail.
So, I work for NTT, which is a big company doing a lot of things; among these, we are a tier 1 provider and I work on the automation and monitoring of our global IP network.
NTT has been supporting Open Source projects for quite some time. Myself, I also like Open Source; I have been developing various tools for the analysis and consolidation of network data and my latest effort is a tool called BGPalerter. I have already presented this on previous occasions, but the presentation I am going to do today is the first time I do it. I developed BGPalerter because we needed something in NTT to do monitoring: hijack, visibility, RPKI and many, many more monitors that have been developed on top of it. Afterwards we released it as Open Source and now several operators, hundreds of operators worldwide, are using it, so my previous presentations were targeted to that type of audience, where I explain how to use it and how to install it. But today I am going to show you how to use it for research. I collaborate with researchers and while I was developing this tool I thought to keep it flexible and open so I can use it for my research and, since it's Open Source, everybody can do it also. And I am pleased to announce that, in the last year, I have received more e‑mails from people, researchers and colleagues in general, who are implementing their own analysis of data, so I thought it would be good to do a presentation about it and maybe bootstrap some knowledge about how the system works. You can find the URL of the GitHub in the slide, and this is a sort of quick tutorial ‑‑ you will be able to follow also the real implementation; we will implement a quick data analysis of BGP data in the next minutes.
So, for this quick tutorial I wanted to find a case that was covering various aspects like BGP data analysis and RPKI validation but, at the same time, one where there was a time dimension for the data analysis and that we could do easily in the next few minutes, so I came up with this use case. So, as we mentioned, we want to monitor BGP announcements and we want to detect RPKI invalid announcements, and we want to detect especially bursts of RPKI invalid announcements, and we want to record for each burst the origin AS and the number of RPKI invalids that are announced, and store the BGP announcement messages and the analysis somewhere.
I think it's interesting, this burst thing, because it happens quite often that at some point autonomous systems start to propagate announcements that perhaps they were not supposed to; sometimes this is due to ‑‑ supposed to optimise routing and after, so I thought it was interesting in this case. And of course we have to define "burst", so I just said a burst is at least 20 RPKI invalid announcements from the same autonomous system in 6 minutes. So these numbers, 20 and 6, are of course arbitrary, just an example; you could justify maybe a dynamic threshold, but for this quick demo it's good enough.
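The burst definition above (at least 20 RPKI invalid announcements from the same autonomous system within 6 minutes) can be sketched as a sliding window. This is an illustrative Python sketch only, not code from BGPalerter, which is a JavaScript tool; the class name is hypothetical and it assumes timestamps are in seconds and arrive in non-decreasing order for each AS.

```python
from collections import defaultdict, deque

BURST_SIZE = 20          # at least 20 RPKI invalid announcements...
WINDOW_SECONDS = 6 * 60  # ...from the same AS within 6 minutes


class BurstDetector:
    """Sliding-window burst detector, one window per origin AS."""

    def __init__(self, size=BURST_SIZE, window=WINDOW_SECONDS):
        self.size = size
        self.window = window
        self.seen = defaultdict(deque)  # origin AS -> timestamps of invalids

    def add(self, origin_as, timestamp):
        """Record one RPKI invalid announcement.

        Returns True if this origin AS now has at least `size` invalid
        announcements inside the last `window` seconds.
        """
        stamps = self.seen[origin_as]
        stamps.append(timestamp)
        # Drop announcements that have fallen out of the time window.
        while stamps and timestamp - stamps[0] > self.window:
            stamps.popleft()
        return len(stamps) >= self.size
```

A dynamic threshold, as mentioned in the talk, could replace the two constants; the fixed values are kept here because they are the ones used in the demo.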
Each sentence has parentheses: filter condition, monitor condition, group condition and squash condition, we will see what these are.
So to understand how you could benefit from using BGPalerter, I should explain to you a bit what you will get before I implement it, and how the architecture works. The architecture is composed of various components, but the main pipeline is composed of the connectors, the monitors and the reports. The connectors connect to sources of data; they take the BGP data and they push it into the entrance of the pipeline. By default BGPalerter connects to the amazing RIPE RIS Live, which is managed by the RIPE NCC, where hundreds of peers provide BGP updates to you in realtime. But you can also use, for example, BGPdump, another really important Open Source project that you can use to read MRT files or streams, and you can interact with other repositories like Route Views or, as several people are doing recently, you can provide, for example, BMP data.
We will not talk about connectors because the default one is good enough, and for many of the analyses you will do, you will probably not have to implement anything, and if you have to, it's relatively easy.
Just notice there are two files. prefixes.yml is where you declare what prefixes you are interested in monitoring on the platform, and for our use case we just set it to everything, all of IPv4 and IPv6, which is quite some data, and it's also a way to show you how you can efficiently operate on such a flow of data with BGPalerter.
And the other file is config.yml, which we will change.
The reports are the last component of the pipeline: once an analysis has an output, the reports make sure that output is ‑‑ or in any way shared. Essentially there are several reports implemented and they can send to or ‑‑ we use the default one, which is to store on file; that's easy and it solves many of the use cases that you will have.
Instead, we will focus on the centre part, which is the monitors. Monitors are where the analysis and the logic of the analysis happen, and you will inherit a lot of stuff already working; the only three methods you will need to implement are filter, monitor and squash. This is essentially a zoom‑in on that process. Each monitor has this filter function, which is the entry point of the BGP message. It's a synchronous method: if you return true, the BGP message goes to the next step, which is the monitor; otherwise it's discarded and no more resources are used. This filter function is used to discard BGP messages that you are sure are not going to be relevant for your analysis.
The monitor method instead is an asynchronous method, which means it can process several BGP messages in parallel, and what you do here is implement the real logic of the analysis, and you can connect to external resources like ‑‑ in our case we will connect to an RPKI validator. Afterwards, when the condition that you are trying to verify is met, you will have to create an alert. I call it an alert, but it's just a data structure that contains a description of the verified condition and the list of BGP messages that verified that condition.
Automatically, BGPalerter is able to manage the alerts in buckets. These buckets are there because you, at some point, will almost surely need a group‑by clause in your analysis. For example, in our use case this was per autonomous system, which means group by autonomous system; this is one of the things you inherit from the system. The last method is the squash alerts, which essentially receives a bucket as input and provides a single unit, a summary of the bucket; essentially it is a data ‑‑ if you are more familiar with that term, with that name.
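A minimal sketch of the bucket idea described here, assuming alerts are grouped under a key and each bucket is later offered to a squash function. The `AlertBuckets` class, its method names and the alert fields are hypothetical and only illustrate the concept; they are not BGPalerter's internal code.

```javascript
// Hypothetical stand-in for the bucket handling described in the talk;
// this is NOT BGPalerter's internal code.
class AlertBuckets {
  constructor() {
    this.buckets = new Map(); // group key -> array of alerts
  }

  // Store an alert under its group key (for example, the origin AS).
  publish(key, alert) {
    if (!this.buckets.has(key)) {
      this.buckets.set(key, []);
    }
    this.buckets.get(key).push(alert);
  }

  // Offer every bucket to the squash function. A bucket whose squash
  // returns a summary is reported and removed; the others keep collecting.
  squashAll(squash) {
    const summaries = [];
    for (const [key, alerts] of this.buckets) {
      const summary = squash(alerts);
      if (summary !== undefined && summary !== null) {
        summaries.push({ key, summary });
        this.buckets.delete(key);
      }
    }
    return summaries;
  }
}

const buckets = new AlertBuckets();
buckets.publish("65001", { prefix: "192.0.2.0/24" });
buckets.publish("65001", { prefix: "198.51.100.0/24" });
buckets.publish("65002", { prefix: "203.0.113.0/24" });

// Example squash: summarise only buckets holding at least two alerts.
const demoSummaries = buckets.squashAll((alerts) =>
  alerts.length >= 2 ? `${alerts.length} alerts` : undefined
);
```

In this sketch the grouping key alone decides which alerts are summarised together, which is the group‑by behaviour the talk attributes to the buckets.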
Now, let's go back to the use case. Now you understand these conditions, filter condition, monitor condition, group condition and squash condition: they are the various steps that we have to do. Let's start with coding our filter condition, which is that we want to monitor BGP announcements, okay? So the first thing I do is I do a ‑‑ from the repository; after, I go here, to src, and then monitors, and you will see a list of files which are the monitors already implemented and used by default, by the operators. In this case we are not going to use those, we want to implement a new one, so we create a file which I call monitor RPKI burst and, in it, we create a class monitor RPKI burst and we extend monitor. Monitor has ‑‑ must be extended. That is how you get for free a lot of things already implemented, and you only have to focus on the filter, monitor and squash alerts methods.
So once we do that, I also put here, on the bottom right, the BGP message format that is used by BGPalerter. You do not have to worry about that; it's for your convenience to know what the format is. It is JSON. This is the format across the entire pipeline: the connectors already insert BGP updates into the pipeline in that format.
So now, the filter function has to be something done on the fly, just to know if the message is of interest or not. We said we want to analyse announcements, so the simple thing is to check if the type is announcement, which means we discard updates about the session and whatever; we just get the announcements. When this condition is true, the message will reach the monitor, so basically what we have to implement now is the monitor condition and the group condition. The monitor condition is that we have to identify RPKI invalids, so we have to do RPKI validation, and we also need to group these ‑‑ again, the invalids, by autonomous system.
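The filter step described here can be sketched as a cheap synchronous predicate. The message shape and the `"announcement"` type value below are assumptions made for illustration, not a statement of BGPalerter's exact message format.

```javascript
// Assumed message shape: { type, prefix, originAS }. The field names and
// the "announcement" value are illustrative assumptions.
function filterAnnouncements(message) {
  // Synchronous and cheap: only a type check, no validation yet.
  return message.type === "announcement";
}

const sampleMessages = [
  { type: "announcement", prefix: "192.0.2.0/24", originAS: 65001 },
  { type: "withdrawal", prefix: "192.0.2.0/24" },
];

// Only messages passing the filter would reach the monitor step.
const keptMessages = sampleMessages.filter(filterAnnouncements);
```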
So, we start implementing the monitor function. To do RPKI validation we need two things, the prefix and the origin AS, the autonomous system, so we extract those from the message, and the next thing I do is pass the prefix and the origin AS to the validation.
Now, this .rpki is something that you don't have to worry about; it's one of the many things that are already implemented in the architecture and that you will find already there. Basically, BGPalerter already has an RPKI validator inside and ‑‑ like the one managed by the RIPE NCC, which is nice for doing some quick analysis, but of course the best way is for you to do the RPKI validation yourself and provide BGPalerter with the file. You can do that in the yml: just provide where the file is and BGPalerter will manage the entire ‑‑ and if that file changes, it will reload the VRPs.
So when the validation is done, you receive the result, and you want to check if it's valid or not. We are only interested in RPKI invalids, which means that we will strictly check if it's false. I would like to remind you that the validation result can be true or false, but also null, so there has to be a strict equality check.
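The point about strict equality can be illustrated with a small sketch. `validateStub` is a hypothetical stand-in for an RPKI validator that matches prefixes exactly and ignores max length; only the three-valued result (true, false, null) is taken from the talk.

```javascript
// Hypothetical VRP list and validator stub (exact prefix match only).
const stubVrps = [{ prefix: "192.0.2.0/24", asn: 65001 }];

async function validateStub(prefix, originAS) {
  const covering = stubVrps.filter((vrp) => vrp.prefix === prefix);
  if (covering.length === 0) return null; // no ROA covers the prefix
  return covering.some((vrp) => vrp.asn === originAS); // true or false
}

// Strict comparison: only `false` means RPKI invalid.
function isInvalid(validationResult) {
  return validationResult === false; // `!validationResult` would also match null
}

// Asynchronous check, mirroring the asynchronous monitor step.
async function isRpkiInvalid(message) {
  const result = await validateStub(message.prefix, message.originAS);
  return isInvalid(result);
}
```

A loose check such as `!result` would treat announcements with no covering ROA as invalid, which is why the comparison is strict.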
So when you are inside here, it means that the condition that you wanted to identify is met, and in that case we create the alert, the data structure, and we have to define our group‑by strategy; in our case, it's just the origin AS. The key of the bucket is what defines the grouping and you can put whatever you want; in our case the autonomous system is the only thing we need to group by. Then we do this publish alert, where the first parameter is the key and the second one is the prefix, which is the target of the alert; it doesn't really matter in this case. The third parameter is ‑‑ we don't need it for this analysis. The fourth parameter is the BGP message that actually verified the condition, so we pass the message inside, and null also for the last parameter.

Once we do that, the alert goes into the automatically managed bucket, and the last thing that we have to implement is the squash condition. So, when do we provide an answer, when is the final result produced? Only when the burst is composed of at least 20 RPKI invalids in 6 minutes. So we go and implement squash alerts, and we receive the alerts, which I remind you come from a bucket. The simple thing I did is: I want the list of alerts to be of at least 20 items, and this is what I validate here, that the alerts length is at least 20. Of course, this is just an example. There can be multiple announcements of the same prefixes, so the alerts may not all be unique, but it's okay, it's just an example and I want to keep it simple; it would not be difficult to count only unique prefixes. What you do here is, if the condition is met, you squash: you return something, in this case a string that says this autonomous system announced blah‑blah invalids, okay?
If the condition is not met, there is no return, and no return value means that the bucket is not ready and cannot be squashed, okay?
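The squash step described above can be sketched as a function that returns a summary string only when a bucket holds at least 20 alerts and otherwise returns nothing. The alert fields (`originAS`, `prefix`) and the function name are hypothetical; the threshold is the one given in the talk.

```javascript
const MIN_BURST_SIZE = 20; // threshold from the talk's burst definition

// `alerts` is assumed to be the array collected in one bucket, each item
// shaped like { originAS, prefix }. Both field names are hypothetical.
function squashBurst(alerts) {
  if (alerts.length >= MIN_BURST_SIZE) {
    const originAS = alerts[0].originAS;
    return `AS${originAS} announced ${alerts.length} RPKI invalid prefixes`;
  }
  // Falling through returns undefined: the bucket is not ready yet.
}

// Build demo buckets of a given size for a single origin AS.
function makeAlerts(count) {
  return Array.from({ length: count }, (_, i) => ({
    originAS: 65001,
    prefix: `10.0.${i}.0/24`,
  }));
}

const notReady = squashBurst(makeAlerts(19));
const burstSummary = squashBurst(makeAlerts(20));
```

As in the talk, this counts alerts rather than unique prefixes; deduplicating by prefix before comparing with the threshold would be a small extension.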
So, here we have implemented the at‑least‑20 part. The last thing we need to implement is the 6‑minute thing. We could actually implement it in the code, but there is no need to do that: BGPalerter has a setting called fade‑off seconds, and I set it to 360 seconds, 6 minutes. The alerts have 6 minutes and they will automatically ‑‑ after 6 minutes if they are not squashed.
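A rough sketch of what a fade‑off window means in practice, assuming each alert carries a receive timestamp and alerts older than the window are dropped. The `receivedAt` field and the function below are hypothetical; how BGPalerter applies the setting internally is not described in the talk.

```javascript
const FADE_OFF_SECONDS = 360; // 6 minutes, the value used in the talk

// Keep only alerts younger than the fade-off window. Timestamps are in
// milliseconds; the `receivedAt` field name is an assumption.
function dropFadedAlerts(alerts, nowMs, fadeOffSeconds = FADE_OFF_SECONDS) {
  const cutoff = nowMs - fadeOffSeconds * 1000;
  return alerts.filter((alert) => alert.receivedAt >= cutoff);
}

const demoNow = 1000000;
const demoAlerts = [
  { prefix: "192.0.2.0/24", receivedAt: demoNow - 361000 }, // too old
  { prefix: "198.51.100.0/24", receivedAt: demoNow - 359000 }, // still fresh
];
const freshAlerts = dropFadedAlerts(demoAlerts, demoNow);
```

Combined with a size check of at least 20 alerts per bucket, a window like this gives the "20 invalids within 6 minutes" burst definition.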
Another thing I would suggest you change is the environment, to research. This means that you will remove some limitations that are there in production to avoid, for example, excessive use of memory, or whatever, and in that way you will have ‑‑ on the entire pipeline. Of course, you have to declare in the configuration which monitor you want to use, as usual.
So, now, the only thing that you need to do is npm run serve and you are running the code. A logs directory will be created and inside there is a reports log file where, after some time, you will start seeing lines like "this autonomous system announced blah‑blah‑blah invalid prefixes", according to the message we wrote before. Here it is more sense, just because this is a sample. Now, the presentation is over. So people who implement stuff with this, please contribute back; that would be really nice. I am not receiving so many requests at the moment, but that would be great. Thank you very much for your attention and, of course, if you want to send me an e‑mail with feedback or whatever, you can do it here. So thank you very much.
BRIAN TRAMMEL: Thank you very much, Massimo, who, in the 20 minutes while we were playing that video, has managed to make it into the room. That's amazing, it's actually a slightly different ‑‑ you could almost tell this has actually happened. Have we got any questions for Massimo? We have time.
MASSIMO CANDELA: I was saying I had to move some furniture to recreate... anyway, I am sorry that I had to pre‑record the video but I hope it was ‑‑
BRIAN TRAMMEL: Thank you very much for pre‑recording the video that made this possible. Going once, going twice, the questions?
MASSIMO CANDELA: Guys, I had to move furniture.
BRIAN TRAMMEL: A general question from Pavlos Sermpezis: "Any insights for building Open Source tools for network operators?" Just general from this experience.
MASSIMO CANDELA: So, are you asking for, like, a general suggestion or, like, insights in general, that's the question?
BRIAN TRAMMEL: Insights in general.
MASSIMO CANDELA: So, well, my first insight is that ‑‑ especially if you are targeting network operators, the thing I realised is that a lot of people really don't have the time, and so you have to make it really something like a one‑liner: copy and paste has to work immediately, otherwise everybody is ‑‑ but, yes, that is one insight. But, in general, I think it is a good ‑‑ I mean, having these Open Source tools is a great thing. They really help with, for example, the monitoring part. Some companies have the energy and the number of people to develop things in‑house or install them; others are smaller, the network is not their core business, and they are a little more reluctant to do, for example, monitoring or things like this. One option is of course to pay for a service, which is a good option, but others, instead, just don't do it at all, so having something easy and Open Source could help in at least constructing this article, let's say. So a lot of the effort I put in is actually also in making it easy, so that it self‑configures and stuff like this, because I see that it increases the number of people installing it. That's the suggestion I have for you.
BRIAN TRAMMEL: Cool. Thank you very much. We have got one more: "Have you run this on historic data or is this streaming off existing stuff and, if so, are there any insights over the longer time period?"
MASSIMO CANDELA: So we have been using this for more than two years now at NTT, and we use it mostly for our day‑to‑day monitoring, and for this reason we are basically using only the realtime version. But, as I also said in the presentation, I have received various e‑mails from people that are actually running it to do sort of historical experiments, and they are reading the files, so I don't have any personal experience with it, but I hope some of these researchers can call and ‑‑ can maybe, at some point, write something that we can all use.
BRIAN TRAMMEL: Cool. Thank you very much. And with that, we have used our time. Thanks again, Massimo, for rushing to your hotel room and moving the furniture to be able to talk to us.
MASSIMO CANDELA: It's always a pleasure, thank you very much.
BRIAN TRAMMEL: Thanks again to all of our speakers and we hope to see you soon. Have a great one everyone, thanks a lot.