RIPE NCC Services Working Group
Wednesday, 24 November
15:30 (UTC + 1)
KURTIS LINDQVIST: So, welcome back, everyone. That was a very short break; I hope you all managed to get something. For the next session we have a new scribe, Rosanna from the RIPE NCC, so thank you for scribing this part. And we are going to move over to the RIPE Operational Update, so it's back to you, Felipe.
FELIPE VICTOLLA SILVEIRA: Hello. Let me share my slides. All right. So, welcome back, everyone. I hope you had time to refresh a little bit. So now I'd like to change topics and present you the operational update and the main topics that I would like to look over today are the workload challenges that we have in relation to services, compliance and resilient infrastructure.
So I'll start with Registry Services and the challenges with the workload.
To start, I'd like to paint a picture of where exactly this increase in the workload is coming from and what exactly it looks like. So, as you can see here in this chart, it's basically the number of transfer tickets over the last four years, so each of the lines here represents a different year and the purple line represents 2021.
So you can see here that the numbers in 2021 have been consistently above all the previous years. To be a bit more specific, you can see that, for example, in August, we had a total of 906 transfer tickets, which represents a growth of 53% when compared to August 2020. In September this year, we had 1,444 transfer tickets, and that's a growth of 68% when compared to September 2020. And in October this year, we had 1,316, which corresponds to a growth of 85% when compared to October of the previous year.
So, on average, what we are seeing is around 70% growth in transfer tickets when compared to 2020, and 150% when compared to 2019. So that's basically two and a half times what we had two years ago.
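As an editorial illustration of the arithmetic behind these figures, here is a minimal Python sketch. The three monthly growth percentages are the ones quoted in the talk; the function name and the dictionary are hypothetical and only serve the example.

```python
from statistics import mean

# Year-on-year growth in transfer tickets quoted in the talk (percent vs 2020).
growth_vs_2020 = {"August": 53, "September": 68, "October": 85}


def growth_to_multiplier(growth_percent: float) -> float:
    """Convert a percentage growth into a 'times as many' multiplier."""
    return 1 + growth_percent / 100


# Average of the three quoted months, which the talk rounds to "around 70%".
average_growth = mean(growth_vs_2020.values())
print(f"average growth vs 2020: {average_growth:.1f}%")

# 150% growth means the original volume plus one and a half times more of it.
print(f"150% growth vs 2019 = {growth_to_multiplier(150):.1f}x the 2019 volume")
```

The second line of output shows why 150% growth corresponds to "two and a half times" the earlier volume rather than one and a half times.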
Another trend that we are seeing is a huge increase in the number of new LIR applications. So, what is in this chart is basically the number of new LIR accounts activated per month over the last ten years, and you can see that, around 2019, there is this huge peak, and that's related to the IPv4 run‑out that we had back then. Around that time we had 600‑plus new LIR accounts being activated per month, and the workload was so extreme that we had a backlog and waiting times of up to a month. So from the moment that you submitted your new LIR application to the moment that we actually started working on it, that would take up to a month during that time.
So after the run‑out, the number really dropped and went back to almost 2012 levels, and then, recently, we are seeing a second spike, and the main reason for that is that the price for a /24 in the transfer market became more expensive than opening a new LIR account with the RIPE NCC and holding it for two years. We also observed that the majority of these clients were not new LIRs; they were actually opening additional LIR accounts.
So, since last week, we have a waiting list to get a /24, so we expect these numbers to go down once again. So we should have the same effect that we had post run‑out. It is also important to mention that there are no major blocks currently in quarantine that will be released over, say, the next six months.
So now I'd like to share the effect that this high workload has on our services.
So this metric measures the number of tickets that we didn't respond to within one business day, so that's the internal target that we have, and it measures the ticket response time. And you can see here that the numbers really don't look very great. So, we start in September, around 14%, and it got a bit better in March, and then, since then, it has not been very good, and it actually got bad around summertime, and it was a combination of reasons. One was an unusually high number of tickets, and I explained before where all these tickets were coming from, and, at the same time, we had many colleagues going on holidays, and I'll talk more about that later.
After the holidays, the numbers stabilised, so it went down from 35% down to 17%, and in October it's much better, it's back to 6.6%.
And it's also important to mention that there was no backlog formed during this period, unlike what we had in 2019 during the run‑out. So the tickets would take maybe one day longer before we could start our work on them, but then, after that, they were processed more or less quickly.
Even though I consider the previous KPI an important metric, I think this one paints a better picture of how long the waiting time actually was. So what you see here is basically the policy transfer tickets lead time. So this is from the moment of submission of a ticket to the moment that the ticket is actually completed, so it's closed. And it's measured in business hours. So, for instance, in October, we have 72% of all the transfer tickets completed within two business days, 87% within a week and 92% within two weeks. So, we have this kind of tail of around 8% of tickets that take longer than that, and these cases are usually due to extended sanction checks, and I am going to explain more about that later, or complex cases where they have missing documentation.
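To make the lead-time metric concrete, here is a small editorial sketch in Python of how such cumulative percentages could be computed from per-ticket lead times. It is not the RIPE NCC's actual tooling: the function name, the sample data, and the assumption of an 8-hour business day and a 5-day business week are all hypothetical.

```python
# Assumptions for illustration only; the talk does not define these values.
BUSINESS_HOURS_PER_DAY = 8
BUSINESS_DAYS_PER_WEEK = 5


def share_completed_within(lead_times_hours, limit_hours):
    """Percentage of tickets whose lead time is at most limit_hours."""
    if not lead_times_hours:
        return 0.0
    done = sum(1 for hours in lead_times_hours if hours <= limit_hours)
    return 100 * done / len(lead_times_hours)


two_days = 2 * BUSINESS_HOURS_PER_DAY
one_week = BUSINESS_DAYS_PER_WEEK * BUSINESS_HOURS_PER_DAY
two_weeks = 2 * one_week

# Made-up lead times in business hours, one value per closed ticket.
sample = [3, 5, 9, 12, 14, 16, 20, 38, 75, 200]

for label, limit in [("two days", two_days), ("one week", one_week), ("two weeks", two_weeks)]:
    print(f"within {label}: {share_completed_within(sample, limit):.0f}%")

# The long tail is whatever is left after the widest threshold.
# With the October figure quoted in the talk: 100% - 92% = 8%.
october_tail = 100 - 92
print(f"October tail beyond two weeks: {october_tail}%")
```

Because the thresholds are cumulative, each percentage includes the tickets counted in the previous one, which is why the remaining tail is simply 100 minus the last figure.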
Another thing that you might notice here is that there is a pattern that, by the end of the year, the numbers look pretty good, and then after, in January, they drop, and if you go back in time and you look at 2020, 2019, it's exactly the same. So the numbers get better through the year and they drop around January.
And the reason for that is that, by the end of the year, we receive a lot of these transfer tickets that we call consolidations. These are people that opened LIR accounts, say, three years ago, though mostly it's actually two, and now the resources no longer have to be held, so they can transfer them away, and these tickets tend to be relatively easy to process. So all the documents are in place and they tend to move relatively quickly.
After that, we are basically left with tickets that are complex and difficult to process. So that's the main reason why we see this drop here.
So to summarise:
The time to resolution has been influenced by four main factors. The first one was a significant increase in the number of transfer tickets when compared to previous years. It's mostly related to LIR accounts consolidating after the two‑year waiting period.
The second reason was the significant increase in the number of new LIR applications, almost at pre‑run‑out levels. And then, on top of that, this was combined with a lot of extra work to make sure that we remain compliant with European Union sanctions, and I have some slides about that part specifically later today.
And on top of that, there was really a need for our staff to get some holidays. People were getting worn out with the whole Covid situation, plus the very high number of tickets, so we were very generous in approving vacation requests during summertime. That was a conscious decision by myself and by my colleagues and I fully stand behind that decision.
And I think another very important thing to say is that we have prioritised correctness over fast responses. An accurate registry is more important than a quick ticket resolution. Of course, we're doing our best to improve the ticket resolution time, but when both things are at odds, we are going to prioritise an accurate registry.
So, what are we doing to improve the situation? As Hans Petter already mentioned, we have hired extra staff for RS. They have been trained and started picking up tickets at the beginning of September, so that's also why you see an improvement in the numbers in September and October. We have also onboarded four temps and they are assisting with compliance checks automation.
One thing that is important to say is that, in the short term, it was very important to make sure that the queues are properly staffed, like the queues for transfer tickets and so on, because we did expect a high number of tickets, mostly in the last quarter. What caught us a bit off guard was the new LIRs, which we didn't see coming, and that, unfortunately, required parking some projects, and one of the things that we had to park or reduce was ARCs.
As Hans Petter also mentioned, the long‑term solution involves automation and decentralisation of the registry. One thing we have been very careful about is not to hire too much staff as the trend, I believe, is temporary, so I think it would be irresponsible with membership money.
I'd now like to change topics and talk about compliance, and what we are doing to enhance our internal controls.
Hans Petter already mentioned the sanctions transparency report, and I would like to go into a bit more detail here. The main idea of this transparency report is to provide data on how our members and users and legacy holders are affected by sanctions while, at the same time, respecting confidentiality and privacy. This report is published at this link over here, and in this table you can see the summary of the report. The report at the link is a bit more extensive than that and contains more information, but this is the report in a nutshell. So we had one member, from Iran, that was frozen before and was restored on 9 November. And we currently have two members frozen, one from Iran and another one from Syria, and they have been frozen since 1 April 2020.
I explained before that we had a lot of extra checks for sanctions, so let me go into a bit more detail here. We have been strengthening our sanctions process, I'd say over the last year‑and‑a‑half. One of the main changes we made was to use a third party, or two third parties, and also to screen end users as well. So all these extra checks added to the existing workload. And the only way forward here in terms of operations is to fully automate all those checks, so that's what we have been working on now. We have to regularly screen about 40,000 legal entities and natural persons, and that's something that's just not possible to do without having this fully automated. So that's why we have temps on board to assist in building a portfolio in these tools and, once this work is completed, it will remove the need to do checks during the transfers, so our process should be much simpler.
Another area that we have been improving concerning due diligence automation is our ID verification process. We have been using a third party, which is called iDenfy, and the main goal for that is to ensure consistency in our checks, more reliability and also efficiency, because we don't have to do these checks ourselves any more. This work was deployed in October, so all the processes that require ID verification are linked to iDenfy and have been done that way since the beginning of October.
And we have published a Labs article, early last month, I believe, where we explained the usage of all this third party tooling to automate the due diligence, and that includes iDenfy and it includes the others. If you are interested in that, I strongly recommend you read it.
I'd now like to talk about registry accuracy, and what we are doing to measure the accuracy, to automate and to ensure quality services.
So first of all, I'd like to show some metrics concerning registry accuracy. I explained in the last Services Working Group how we measure that; it's a bit tricky to measure the accuracy of the registry, so I'm going to explain very quickly now how we do it.
So the general idea is that we calculate how old the registration information in the registry is. In other words, when was the last time that we verified the registered data, either for a certain member or for a certain end user, against an official source like an online registry, company registration papers and so on. So at that point in time we know for sure that that record is accurate.
The first time we measured it was in May, and the numbers didn't look that impressive. We had around half of our members with what we consider recent registration information, so less than two years old. Then we had around a quarter of our members with, let's say, older registration information, more than five years old. Since then, we have been targeting the members with old registration information, and we managed to bring this number down from 25% to 20.7%, so there was an improvement there. Our aim is to have a more substantial change early next year, when our ARC capacity goes back to normal. As I explained before, we are making sure that the queues for transfers are properly staffed, which means some people who were doing ARCs actually have to do transfers now. In January, we should go back to normal and then we should see a big improvement here. Well, I hope so at least.
Now, concerning end users, the numbers really didn't look very good. Just a quarter of our end users had recent registration information, and 55% of them had old registration information. Since then, the numbers have improved significantly. The share with old registration information went down from 55% to 32.9%, and the share with recent registration information went up from 26% to 51.7%.
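As an editorial illustration of the age-based accuracy metric described above, the following Python sketch buckets records by the date they were last verified against an official source. The two thresholds (less than two years is "recent", more than five years is "old") come from the talk; the function names, the "middle" bucket label, the 365.25-day year and the sample dates are assumptions made for the example.

```python
from datetime import date

RECENT_MAX_YEARS = 2  # "recent": verified less than two years ago (from the talk)
OLD_MIN_YEARS = 5     # "old": verified more than five years ago (from the talk)


def age_bucket(last_verified: date, today: date) -> str:
    """Classify a record by the age of its last verification."""
    age_years = (today - last_verified).days / 365.25  # approximation
    if age_years < RECENT_MAX_YEARS:
        return "recent"
    if age_years > OLD_MIN_YEARS:
        return "old"
    return "middle"  # hypothetical label for everything in between


def bucket_shares(last_verified_dates, today):
    """Percentage of records in each age bucket."""
    counts = {"recent": 0, "middle": 0, "old": 0}
    for verified in last_verified_dates:
        counts[age_bucket(verified, today)] += 1
    total = len(last_verified_dates)
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100 * n / total for name, n in counts.items()}


# Made-up verification dates, evaluated on the day of this session.
today = date(2021, 11, 24)
sample = [date(2021, 3, 1), date(2020, 6, 15), date(2018, 1, 10), date(2014, 5, 5)]
print(bucket_shares(sample, today))
```

Re-verifying a record moves it into the "recent" bucket, which is why a verification campaign shifts both the "old" and the "recent" percentages at the same time.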
And the main reason for such a significant improvement was actually a by‑product of automating our sanctions checks. As I explained before, we hired temps to assist with the compliance checks automation, and part of that work included fixing the administration for end users in our internal systems, which basically required them to be re‑verified. So we expect to have further improvement in these numbers as we go through all the end users now.
So that brings me to the next topic, which is active registry monitoring. The general idea behind this project is to use all the third parties for monitoring changes in the registry, and that's to make sure that we have a compliant, more accurate registry, and also to ease the processing of requests: if the records are 100% accurate, then there is much less work on our side to process a request.
This work has already started for sanctions, as I explained, monitoring for members has largely been implemented, so we have 97%‑ish of the members already being monitored. But for end users, it's just in the beginning, so we have around 3% only. And we expect to have that done at some point in Q1 or Q2 next year.
The idea is that we use the same tooling, not only to monitor for sanctions, but also for any changes in the registry. So things like M&As, legal name changes and so on, so we get the notification and then we can follow up from that. That requires some integration with our internal tooling and we are working on that right now. There is a project that is being implemented now by software engineering. It also requires that we have enough staff to be able to handle all these incoming requests. So even if the tooling were ready now, it would be impossible for us to start doing this monitoring, but early next year we really hope that we can start doing that.
Another thing that we are doing to improve our service delivery, first of all, is to align our metrics on membership satisfaction across the RIPE NCC. That's a project that Fergal is leading, to have one unified way to measure membership satisfaction across all our different services, and one of the services that we want to monitor is ticket satisfaction, and we are really aiming to get that done before the end of the year.
And the idea is that we also integrate this feedback response in our work flows.
Another thing that we are doing is to publish quarterly road maps. We started with RPKI, so for Q3 we published for RPKI, and then for Q4 we included the RIPE Database and business applications as well, and we are planning to continue publishing this regularly.
One of the things that you see there for Q4, for business applications, is improvements in the ticketing system. What we want to do there is address the concerns that have been raised on the mailing list, mostly to do with integration in our LIR Portal. We are also planning some improvements in our SSO for early next year, and that will include things like more options for two‑factor authentication, improved security, and also additional profiles. That was one of the things that were mentioned on the mailing lists, to have read‑only profiles, so we are taking all this feedback into account. So please look at these road maps and let us know your feedback.
And finally, a more long‑term goal is digitising trust with digital IDs and digital signatures.
So, that's a much more long‑term goal. We won't have something done by, say, next quarter or even next year, but we are already starting some pilots. That's the plan. If it's not being done now, it's going to be done in Q1 next year and then we are going to continue from that.
Now, my next topic, which is RPKI, and what we are doing to harden our infrastructure and our processes. Starting with processes: since late last year, we have been defining a control framework for RPKI, which is based on the SOC 3/ISO audit report and, based on that, we have identified a total of 182 controls. These are very broad controls, and include areas like availability, security, processing integrity and so on. Based on this control framework, we have performed a gap analysis and we identified a total of 49 missing controls. So that is basically a comparison between the control framework and what we have in our work instructions and in our internal procedures.
By now, most of the control gaps, they have been closed. We still have seven of them missing, and we aim to have all gaps closed before the end of the year. And by doing that, we would be in a position to perform a SOC 3 audit report by Q2 next year. Still deciding whether we're going to do SOC 3 or perhaps another kind of audit report, but we are going to do an audit report. The beauty of the SOC 3 is that it can be shared publicly, so it's a report that doesn't contain any confidential information.
I'm now changing topics to our infrastructure. Ties gave a really nice presentation yesterday in Routing, going into all the details of what exactly they are doing with the RRDP repositories, so if you are interested in that, I strongly recommend watching the recording.
What I'm going to do now is just give a very quick explanation of what we're doing here. The main thing is that we have moved to a new setup in our two data centres in Amsterdam. This work was completed successfully last week, and we're going to keep the existing AWS repository that we had in the past, but as a warm node, so it's being kept up to date and could be switched to relatively quickly, but it's not taking any production load. So in case it's needed and Amsterdam fails completely, we can fail over to AWS. And we'll continue using a CDN, basically, to increase availability and reduce latency.
And our next step here is, basically, how can we increase the resiliency by adding some extra nodes? And that can be done in line with our Cloud strategy, so what we want to do is basically decommission AWS and replace that with something else, to have as a reserve infrastructure.
Switching now to the RPKI core, and perhaps this is more important than the availability of the repositories, as Randy was explaining in the previous session for the Cloud strategy.
Here, we want to ensure that we have resiliency in our core, and one of the things that we are doing is replacing our offline HSM this year, so the replacement is scheduled for some point in December. Our online HSMs are reaching end of life and we are planning a replacement for late next year. The costs have already been budgeted.
We have also just done a penetration test. There were some findings there and the teams are working to close all the findings and, once that's done, we can publish the report.
And finally, we're planning a red team test for early next year. So for the ones of you who do not know, a red team test is when you have a team, like an external team, basically trying to breach the security of your systems, and that includes not only trying to hack the code, but also trying to use social engineering, trying to bypass physical defences, get access to the office, get access to the data centres and so on. So it's pretty exciting and I am very curious about what results we are going to find from this exercise.
So that's what I had, and now I open the floor for questions.
KURTIS LINDQVIST: Thank you, Felipe. We do have quite a few questions, so I am going to try to read them out to you.
The first one is from Rudiger Volk: "Is there any rough estimate on the total cost for NCC for the sanctions compliance?"
FELIPE VICTOLLA SILVEIRA: That's a very good question, actually. I believe we included this in the ‑‑ I see Hans Petter can take that one.
HANS PETTER HOLEN: Sure, I can do that. So, we haven't created a number for that yet, but thank you for the question, and I'll look into if we should provide some numbers on that in the annual report. I think that is worth some thinking and see if we should be more transparent about it, but I don't have a solid number offhand, sorry about that.
KURTIS LINDQVIST: Thank you. Next one is Yvonne: "If we know about nonexistent organisations and the NCC members, what should I do, how does the NCC respond?"
FELIPE VICTOLLA SILVEIRA: Yes, please report to us and we're going to trigger an investigation. I don't know exactly which e‑mail address you can use, some of my colleagues are in the chat and they know exactly the e‑mail address, I ask that you post there, but yeah, for sure, it will trigger an investigation and we are going to see what exactly is going on.
KURTIS LINDQVIST: And then we have a question from Elvis Velea: "Regarding due diligence, if only three members were affected by all those extra checks, isn't delaying every request an overkill?"
FELIPE VICTOLLA SILVEIRA: The problem is not the ones that are confirmed cases; the problem is the investigations. We get a lot of false positives that go under investigation. The tools that we are using for sanctions compliance are not exact, so they say, well, there might be a sanctions violation here, and then that triggers an investigation, and this investigation might take a very long time to complete. So it's just a few cases that are affected, but then they can take a very long time. Most of the other transfers are not affected by this and they are processed quite quickly, as I showed in the chart. So 72% are processed within a couple of days, and just a few of them take a very long time.
KURTIS LINDQVIST: Next question is also from Elvis. "ARIN has assigned an AS number to Amazon in a weekend. If ARIN has enabled services to members that offer urgent evaluation at weekends, would the RIPE NCC consider the same thing, a member to pay for VIP services year long or for specific [something] for example?"
HANS PETTER HOLEN: I can try to answer that. I know that ARIN has introduced a Premier programme and I think this is for members that pay more than 30K a year. Now, for the RIPE NCC members, we have had the model based on equality, so everybody pays the same and everybody gets the same service. Now, if the members want to change this significantly, of course we can discuss that, but my feeling is that the membership has not asked for premium support to pay more to get ahead of the queue. I think also, from the discussion on Twitter, on the one ASN that was assigned during a weekend, I don't think that was a systemic thing, I think that was just a staff member that worked at the weekend, which may or may not happen for the RIPE NCC as well.
KURTIS LINDQVIST: The next question again from Elvis. "When will you restore realtime statistics and response times and the other metrics you choose to add now?"
HANS PETTER HOLEN: We don't have a timeline on that yet.
KURTIS LINDQVIST: Then we have Elvis again: "The RIPE NCC RS department is, for many members, the only department they interact with." And the first question was: "Was the RIPE NCC SLA ever a thing or was this just a figment of my imagination?"
HANS PETTER HOLEN: Since I have consulted with Legal, I have not found any formal SLA, service level agreement, in place. So I will not comment on the alternative.
KURTIS LINDQVIST: Okay. The second question is: "When was it removed or cancelled? Hans Petter and Christian e‑mail, singular, one e‑mail sent by Hans Petter signed on behalf of himself and Christian, a board member."
HANS PETTER HOLEN: As I said, I believe this was an internal objective, and for some reason, it's been called a service level agreement. Now, in my book, a service level agreement is more than just one metric and a target. So I believe this was an internal target. I am definitely working on getting services described and setting service level objectives for these services, that is an ongoing process, and we will continue to work on that and to improve the performance. And I don't think it's particularly useful to go back to what was said and decided 10, 15 years ago and whether or not this still applies. I am spending my time on looking forward to improve.
KURTIS LINDQVIST: With that, I think you answered the third question, but, for the audience, that was about if now is time to create an SLA, but I think you just answered that as well. So I'll move on then.
James Kennedy had a question: "Please continue to work on the identifying and reporting on unusual member behaviours and trends regarding LIR accounts and resources. This helps the community to optimise related policies."
And then we had a comment from Marco Schmidt from the RIPE NCC: "Regarding the question of where to report closed organisations, the best way is to use our contact form on https://www.ripe.net/contactform and select 'bankruptcy, liquidation and insolvency'."
Thank you, Marco.
Then we have a question from Erik Bais: "Seeing as, last year, the RPKI validator was put end of service and support, are there stats available about how many instances are still out there, especially since there have been new CVEs published last month?"
FELIPE VICTOLLA SILVEIRA: Thanks for the question. We do have the stats and there are a few still using, actually, our validator. I don't have the stats off the top of my head now. I will ask Nathalie to have a look, she knows the numbers.
KURTIS LINDQVIST: And the last question we have is from Dmitry Kohmanyuk: "While member data verification and monitoring is important, what part of support staff, time, resources is spent on answering requests rather than doing data checks?"
FELIPE VICTOLLA SILVEIRA: That's in the Activity Plan and Budget. You can see the exact numbers there. Off the top of my head, I think we have 12 FTEs working on either investigations or ARCs or audits and data checks. And we have 16 FTEs working on actually processing requests, plus around three that work on the resource requests, so around 20‑ish, 19, on processing the tickets, and then you also have people in Member Services that deal with on‑boarding of members and so on. So that's all explained in the Activity Plan and Budget.
KURTIS LINDQVIST: Okay. Thank you, Felipe. That was all the questions we had.
So, we had one presentation left from before the break, which we ran out of time for, which is the update from the Database Task Force, by Bijal, so I am going to hand over to Bijal to do that.
BIJAL SANGHANI: Okay. Hello. I am Bijal Sanghani of the RIPE Database Requirements Task Force, and I am going to give you a very quick update on the work that the task force has been doing.
This is the task force, the members of the task force, and obviously we had a lot of support from the RIPE NCC, so I want to take this opportunity to thank the RIPE NCC for their support over the last couple of years while we have been working on the report.
The task force has now published the document. There have been a number of updates during RIPE 83 already. James presented in the Address Policy Working Group, Peter presented in the Cooperation Working Group and Shane presented earlier on today in the Database Working Group. So, there have been a number of updates from the task force members as well, and I am pleased to say that RIPE‑767 was published a few weeks ago.
Our work: We met quite a lot. Most of it was actually remote due to Covid. Our minutes are all publicly available. We conducted a number of surveys, we held a number of BoFs and we also gave updates during the RIPE meetings, and, in the end, we published three drafts and, like I said, the final draft was published recently.
There were a few obstacles, and, you know, Covid, as I mentioned, being one of them, the scope of the document, once we started working, I think we started off very ambitious and when we started looking at what we had in front of us, it was quite a large task, and so what we tried to do was take a real high‑level view so that the details can be discussed within the Working Groups and the community after the report has been published.
There are also, you know, other different views on some of the recommendations and some of the discussions we had. We talked about the IPAM earlier, the legal address was something that came up again and again, and also was addressed in the NCC Services Working Group a couple of years ago. And obviously technical topics were hard to steer away from. So, what we tried to do is actually just stay away from the implementation details and leave that for the community.
What went well: We had an excellent team, and like I said, the strong support from the NCC Services was fantastic. The feedback that we got from the community, you know, during the various talks that were given, through the BoFs, through the surveys, even just one‑to‑one conversations with some of the community who had ‑‑ who wanted to, you know, specifically speak to the task force on specific topics.
I'm not going to go through the principles and purposes. We have talked about this in the Database Working Group and obviously we're a little bit short on time here, but I really recommend everyone read the report. It's not very long. And, you know, it's quite high level, so it's a good read and it will give you some good information on what our recommendations are. And like I said, they are only recommendations. These will need to be discussed and agreed. And anything that is agreed will have to follow the usual RIPE processes and the PDP, if it comes to that.
This is a list of our ‑‑ the recommendations and the requirements. They have already been discussed in a number of Working Groups, as I said before, and these will continue to be discussed in those Working Groups.
Up next: So the RIPE Chair team will coordinate with the relevant Working Groups, and then the community and the NCC will work together to implement the recommendations. And really, we look forward to seeing the discussions and what comes out of the recommendations next.
So that's it with my update. Are there any questions?
KURTIS LINDQVIST: No. That's it. Thank you, Bijal. With that, we will move onto the second to last agenda item, which is the Working Group Chair selection process. I am going to hand over to Rob on that one.
ROB EVANS: Thank you, Kurtis. So, we have one of our regular cycles through the Chairs, as we are mandated to do, and this time it was Bijal's turn to stand down by rotation. There were no volunteers for quite a while, and by that I mean no volunteers for quite a while, until the incumbent was encouraged to consider restanding. No other volunteers came forward before the time was up. I sent that to the list a couple of weeks back and there's been very little, actually, feedback since then, so, I mean, I don't know whether we're suffering from co‑chair reselection fatigue or what, but unless there are any objections now, I'd like to welcome Bijal back to the team. And we'll have a think about what to do for the next co‑chair rotation in May.
KURTIS LINDQVIST: So, that brings us to ‑‑ thank you Rob. That brings us to the last agenda point, which is the open mic, which is where, if you have anything, you have four minutes, but if there is anything you want to raise that's not on the agenda regarding the NCC or the NCC Services, this is the membership's opportunity to bring those up or ask questions or thoughts. I see none.
So, with that, we are actually going to finish on time, and actually four minutes early, and that gives us all 15 minutes before the GM. And as usual, I would say that that's in this room, so please leave. But it's not in this room, but you have to leave this room to join the GM. So we'll see you in the GM and hopefully we'll see you in the hybrid in Berlin.