Geekzone: technology news, blogs, forums
8 posts

Wannabe Geek


  Reply # 2121238 7-Nov-2018 08:03

It was an actual DDOS attack, apparently the biggest they've ever had.

 

 

 

I'd be interested to see a more detailed post-mortem as well.


175 posts

Master Geek
+1 received by user: 57


  Reply # 2121242 7-Nov-2018 08:20

VintageGnu:

 

It was an actual DDOS attack, apparently the biggest they've ever had.

 

I'd be interested to see a more detailed post-mortem as well.

 

 

 

 

+1 on the root cause analysis (of the voice platform outage post DDoS). But of course we'll never get to see it.

 

Interesting that it is one of the biggest DDoS attacks they've experienced, yet there is very little sign of it at all on their local throughput graphs at AKL IX:

 

https://metrics.ix.nz/d/000000009/ix-peer?orgId=2&var-Customer=AS56030%20Voyager&var-polling_interval=1m&var-Device=All&var-ASN=56030&from=now-14d&to=now

 

Just a small increase there over the last couple of days. Maybe the attack wasn't very distributed...

 

 

 

 


 
 
 
 


8 posts

Wannabe Geek


  Reply # 2121254 7-Nov-2018 08:45

That's what they told us at least.

The targeted servers are in the Christchurch datacentre but they blackholed the IPs fairly quickly.

54 posts

Master Geek
+1 received by user: 40

Trusted
Voyager
Lifetime subscriber

  Reply # 2121259 7-Nov-2018 08:50
3 people support this post

speed:

 

VintageGnu:

 

It was an actual DDOS attack, apparently the biggest they've ever had.

 

I'd be interested to see a more detailed post-mortem as well.

 

 

+1 on the root cause analysis (of the voice platform outage post DDoS). But of course we'll never get to see it.

 

Interesting that it is one of the biggest DDoS attacks they've experienced, yet there is very little sign of it at all on their local throughput graphs at AKL IX:

 

https://metrics.ix.nz/d/000000009/ix-peer?orgId=2&var-Customer=AS56030%20Voyager&var-polling_interval=1m&var-Device=All&var-ASN=56030&from=now-14d&to=now

 

Just a small increase there over the last couple of days. Maybe the attack wasn't very distributed...

 

 

It's not the biggest we've ever seen - far from it. (insert quip about 'seeing bigger' :P)
It had a reasonably high PPS around the 1mill PPS mark, but bandwidth wise, it was roughly 4G.
Our core network took it like a champ, with no issues at other sites/transits/backhauls. Unfortunately the target was behind a smaller router that was unable to take that volume.
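
As a rough sanity check on those two figures, the average packet size follows from dividing bandwidth by packet rate. The sketch below assumes "4G" means 4 Gbit/s and "1mill PPS" means 1,000,000 packets per second; the function name is illustrative only.

```python
def average_packet_size_bytes(bits_per_second: float, packets_per_second: float) -> float:
    """Return the mean packet size in bytes for a given bit rate and packet rate."""
    if packets_per_second <= 0:
        raise ValueError("packets_per_second must be positive")
    # Convert bits to bytes, then spread them evenly over the packets sent per second.
    return bits_per_second / 8 / packets_per_second


# Assumed readings of the figures in the post: ~4 Gbit/s and ~1 million PPS.
attack_bps = 4e9
attack_pps = 1e6

avg_size = average_packet_size_bytes(attack_bps, attack_pps)
print(f"Average packet size: {avg_size:.0f} bytes")
```

On these assumptions the average works out to about 500 bytes per packet.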

 

There has already been a full post-mortem provided yesterday (06/11/18 - 2:28pm) which outlines the cause, the fallout, resolution, timeline .. the works.
If you're a Voyager Voice customer, you should have received it already. If not, let me know and I'll get it sent to you (PM me your account details). Otherwise, I'm happy to post it here if that suits.

 

Voyager's a huge advocate of being open and honest about outages/events.
No point trying to pull the wool over people's eyes - they will just see through it :)





Voyager Internet - Network Monkey

494 posts

Ultimate Geek
+1 received by user: 56

Trusted

  Reply # 2121264 7-Nov-2018 08:55

 

It's not the biggest we've ever seen - far from it. (insert quip about 'seeing bigger' :P)
It had a reasonably high PPS around the 1mill PPS mark, but bandwidth wise, it was roughly 4G.
Our core network took it like a champ, with no issues at other sites/transits/backhauls. Unfortunately the target was behind a smaller router that was unable to take that volume.

 

There has already been a full post-mortem provided yesterday (06/11/18 - 2:28pm) which outlines the cause, the fallout, resolution, timeline .. the works.
If you're a Voyager Voice customer, you should have received it already. If not, let me know and I'll get it sent to you (PM me your account details). Otherwise, I'm happy to post it here if that suits.

 

Voyager's a huge advocate of being open and honest about outages/events.
No point trying to pull the wool over people's eyes - they will just see through it :)

 

 

I'd be keen for it to be posted here. We are a customer via an excellent reseller and it would be nice to see the information.

 

 


8 posts

Wannabe Geek


  Reply # 2121269 7-Nov-2018 09:06

Perhaps it was the biggest the datacentre had seen? I'm not sure on the exact wording as they weren't talking directly to me.

 

 

 

I'm not a Voice customer, but I am the admin of the target/s, so posting the details here would be great.


54 posts

Master Geek
+1 received by user: 40

Trusted
Voyager
Lifetime subscriber

  Reply # 2121989 8-Nov-2018 09:16
2 people support this post

Here's the incident report, sent 06/11/18 - 2:28pm (the day after the outage).

 

 

Incident Report

 

Summary:
At 09:23 Voyager experienced a Distributed Denial of Service (DDoS) attack to one of our web-hosting customers in our Christchurch data-centre.

The 4 Christchurch CVS database nodes became unresponsive as the network “flapped”, resulting in them crashing. Once the attack was mitigated, the databases were restored. Registrations started to return to normal but calls were not proceeding. Investigations identified the cause as a configuration file which had been updated while the databases were unavailable, and which was replicated across all 8 CVS database servers as they came online. The configuration file had been corrupted as part of the cascading database failures, so to resolve the issue a backup version of the file was restored.

With the database nodes back online and registrations increasing, each was attempting to synchronise across the cluster causing heavy load which in turn caused individual nodes to crash. The Voyager engineering team decided to close down and restore the database nodes individually with all voice services disabled. Each database server was shut down and then restarted one by one, allowing each node to synchronise and replicate before the next node was started. This was a lengthy process, which was completed at 13:10. Engineers then began the process of restoring the voice service to all databases.
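
The node-by-node restore described in this paragraph can be sketched roughly as below. This is only an illustration of the sequencing (stop everything, then start one node at a time and wait for it to catch up before starting the next), not Voyager's actual tooling. The `Node` class, its simulated sync behaviour, and the function names are all hypothetical.

```python
import time


class Node:
    """Hypothetical stand-in for one database node; a real node would be driven via its own admin tooling."""

    def __init__(self, name: str, polls_until_synced: int = 2) -> None:
        self.name = name
        self.running = False
        self._polls_until_synced = polls_until_synced
        self._polls_left = 0

    def start(self) -> None:
        self.running = True
        self._polls_left = self._polls_until_synced

    def stop(self) -> None:
        self.running = False

    def is_synced(self) -> bool:
        # Simulated replication: reports synced after a fixed number of polls.
        if not self.running:
            return False
        if self._polls_left > 0:
            self._polls_left -= 1
            return False
        return True


def wait_until_synced(node: Node, poll_interval: float, max_polls: int) -> None:
    """Poll the node until it reports it has caught up, or raise after max_polls."""
    for _ in range(max_polls):
        if node.is_synced():
            return
        time.sleep(poll_interval)
    raise TimeoutError(f"{node.name} did not synchronise in time")


def staged_restore(nodes: list[Node], poll_interval: float = 0.0, max_polls: int = 10) -> list[str]:
    """Stop every node, then start them one at a time, waiting for each to sync."""
    for node in nodes:
        node.stop()
    started_order = []
    for node in nodes:
        node.start()
        wait_until_synced(node, poll_interval, max_polls)
        started_order.append(node.name)
    return started_order


cluster = [Node(f"db{i}") for i in range(1, 9)]  # 8 nodes, as in the report
order = staged_restore(cluster)
print(order)
```

Starting nodes strictly in sequence trades speed for stability: only one node is synchronising at any moment, which avoids the situation where every node tries to catch up at once.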

As this was being processed, some calling started to work again on CVS – however as the service was coming back a second and third DDoS attack was experienced. The decision was taken to move voice services to a new internal network with significantly more capacity to protect from further DDoS attacks.
 
At 14:45 the majority of the service was restored. All tollfree inbound calls (0800 and 0508) were still failing. This was investigated and found to be due to an outdated configuration having been restored; this was identified and resolved. There was also an issue with any callflows updated during the morning of 5/11/18, whereby these callflows were not working because their numbers had been set to unassigned. A manual process was undertaken to identify these numbers and resolve the issue. By 16:15 all issues were resolved and service was operating normally.
 
 
Timeline:
09:23 – First notification received of Auckland and Christchurch SBC being down.
09:27 – Issue identified as a DDoS attack across Voyager network.
09:55 – DDoS issues are resolved but the voice platform remains down.
10:00 – Vendor engaged for investigative help in determining issue.
10:16 – All databases are stopped and brought back – appear stable.
10:24 – Registrations are coming back but calls are failing.
10:55 – The issue related to call failures is identified as a configuration file corrupted as part of the cascading database failures.
10:57 – Earlier version of the configuration file is located. Restore process commences.
11:07 – While updating, database failure happens again which takes down Auckland and Christchurch SBC.
11:20 – Databases continue to be restarted and crash.
11:30 – The database nodes are attempting to synchronise across the cluster and are under heavy load which causes individual nodes to crash.
12:05 – The database cluster is still unresponsive. A decision is undertaken to close down and restore the database nodes individually with voice service disabled.
12:30 – Each database server is shut down and then restarted one by one, allowing each node to synchronise and replicate before the next node is started.
13:10 – All database nodes have been restored.
13:14 – Database cluster is stable, engineers start process of restoring voice services.
13:30 – Some outbound and inbound calls are being processed but due to high load there are significant delays in processing times.
13:35 – As the voice service is being restored another DDoS attack occurs.
13:50 – DDoS is resolved.
14:00 – Network traffic returns to normal but high load is resulting in call failures.
14:15 – A further DDoS attack occurs and is resolved promptly. This affects registrations on CVS.
14:30 – Network traffic returns to normal but high load is resulting in call failures.
14:38 – The decision is taken to move voice services to a new internal network with significantly more capacity to protect from further DDoS attacks.
14:45 – The move to a new network is completed.
14:46 – Normal service is resumed for all calls.
15:10 – An issue is identified where Tollfree numbers are not working for inbound calls.
15:20 – Tollfree issue is identified as being due to the earlier corrupt configuration file.
15:45 – A secondary issue is identified: callflows updated by customers during the morning (09:22-10:30) have been prefixed with +1, resulting in inbound calls not working for a very small number of customers.
15:55 – The issue with Tollfree inbound calling is resolved.
16:13 – The issue with prefixed callflows is resolved for all accounts known to have the issue.
16:15 – All systems should now be operating normally.
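
For anyone wanting to pull durations out of the timeline above, a minimal sketch follows. The timestamps are copied from the timeline; the phase labels and the `elapsed` helper are illustrative, and all times are assumed to fall on the same day.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Timestamps copied from the timeline; the phase labels are illustrative.
phases = {
    "first notification until first DDoS resolved": ("09:23", "09:55"),
    "node-by-node database restore": ("12:30", "13:10"),
    "move to new internal network": ("14:38", "14:45"),
    "first notification until all systems normal": ("09:23", "16:15"),
}

for label, (start, end) in phases.items():
    minutes = int(elapsed(start, end).total_seconds() // 60)
    print(f"{label}: {minutes} min")
```

By this reckoning the incident ran for 6 hours 52 minutes from first notification to the all-clear, of which the staged database restore took 40 minutes.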

Post Event Actions:
Voyager Senior Management and Engineers are conducting an ongoing review into events and formulating an action plan to address resilience of the voice service. We will be communicating these outcomes in due course.

 





Voyager Internet - Network Monkey

54 posts

Master Geek
+1 received by user: 40

Trusted
Voyager
Lifetime subscriber

  Reply # 2121992 8-Nov-2018 09:19
3 people support this post

There was also a mass-email sent out to voice customers by Seeby (Founder, CEO, big-cheese!) on Monday afternoon (05/11/18 - 4:25pm - same day as the outage).
Here's the contents of that email also:

 

 

Voyager Voice outage - Monday 5th November 2018

This morning, starting at 9.20am, Voyager began experiencing serious ongoing issues with part of our voice network.
 
Voyager has several parts to our voice network, these are:
CVR - Call Voice Router - This is the part of our network that connects to other telcos and routes calls. CVR is UNAFFECTED and has only had one outage in the last four years.
CVS - Cloud Voice Switch - This is our current Cloud PABX offering, and currently all calls in and out of CVS are AFFECTED AND DOWN. Most of our customers are on CVS.
KVX - This is our new Cloud Voice offering. It has been operational for 1 year, and is due to replace CVS, however most customers have not yet been migrated to it. UNAFFECTED
SSST - Soft Switch Sip Trunk - This is the service through which high-volume calling customers (usually Wholesale) connect to us via SIP trunks. This service is UNAFFECTED
 
What’s happening?

 

  • At 9.20am, Voyager received a DDOS (Distributed Denial of Service) attack on one of our web-hosting customers in our Christchurch data-centre.
  • Whilst CVS runs on 8 database servers in different locations around New Zealand for redundancy, the DDOS attack meant that a corrupted update was propagated between all the database servers. So instead of 8 good database servers, we had 8 corrupted database servers.
  • We have backups of the database servers, however these take time to restore.
  • After restoring the database servers, we experienced loading issues, because so many customers' IP phones were trying to register against the SIP servers.
  • Because of the loading issues, we took the opportunity to re-start all the database servers, because while many had been up for 900+ days and have been reliable, we wanted to rule out any unknown memory issues.
  • We restored services briefly, but then unfortunately received a second and third DDOS attack on the same customer in Christchurch on different IP addresses.
  • Whilst our Voice network and Data networks are running on separate infrastructure, the reason this particular DDOS attack was so damaging is that the customer being DDOSed and our Voice databases were on the same 1Gbit trunk port. Whilst we have never had any loading issues on this trunk port before, in this particular instance traffic levels were able to cause cascading issues.
  • We have completed moving Voice services in Christchurch to a separate 10Gbit trunk port.
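
To put the 1Gbit versus 10Gbit trunk ports in perspective, here is a back-of-the-envelope utilisation check. It assumes the "roughly 4G" attack figure quoted earlier in this thread means about 4 Gbit/s and that all of it was headed for the port in question; it also ignores legitimate traffic. The function name is illustrative only.

```python
def utilisation(offered_gbps: float, port_gbps: float) -> float:
    """Return offered load as a fraction of port capacity (above 1.0 means oversubscribed)."""
    if port_gbps <= 0:
        raise ValueError("port_gbps must be positive")
    return offered_gbps / port_gbps


# Assumption: the "roughly 4G" attack figure from earlier in the thread means ~4 Gbit/s.
attack_gbps = 4.0

for port_gbps in (1.0, 10.0):
    u = utilisation(attack_gbps, port_gbps)
    state = "oversubscribed" if u > 1.0 else "within capacity"
    print(f"{port_gbps:g} Gbit port: {u:.0%} of capacity ({state})")
```

Under those assumptions the attack alone would be about four times what a 1Gbit port can carry, but well under half of a 10Gbit port.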

What are we going to do?

 

  • We realise that this outage, along with the Friday 26th October unrelated outage, is COMPLETELY UNACCEPTABLE to our customers.
  • After ensuring that the voice system is stable, we will be having a “War room” meeting to brainstorm ANY and EVERY possible solution to make the system more reliable.
  • The stability and recent outages of CVS are very frustrating for us (AND YOU), because we are already building its successor, KVX, with much newer hardware, software, systems and better features.  However there is still further development and testing required before the platform is production ready.
  • From next week, we will have additional human resource to put WHATEVER additional redundancies in place are necessary, however we do not believe that this outage was due to a lack of skills or Human Resources. 

The company will have NO other priority than investigating this issue and defending against future issues from this moment forward.
 
Please feel free to call me personally if you would like to discuss. 
 
Regards
Seeby Woodhouse
 
Voyager CEO 
Mobile xxx xxx xxx (number removed for security!)





Voyager Internet - Network Monkey

5188 posts

Uber Geek
+1 received by user: 1681


  Reply # 2122249 8-Nov-2018 16:04
2 people support this post

@VygrNetworkMonkey thanks for such a detailed explanation.


3831 posts

Uber Geek
+1 received by user: 2180

Trusted
Lifetime subscriber

  Reply # 2122356 8-Nov-2018 18:21

Fantastic response 

 

John





Ex JohnR VodafoneNZ 17 years 4 days

3679 posts

Uber Geek
+1 received by user: 1389

Subscriber

  Reply # 2122380 8-Nov-2018 18:58
2 people support this post

Putting your mobile number on a mass communication speaks volumes. Pretty impressive to see a CEO doing that!!


175 posts

Master Geek
+1 received by user: 57


  Reply # 2122388 8-Nov-2018 19:22
2 people support this post

RunningMan:

 

@VygrNetworkMonkey thanks for such a detailed explanation.

 

 

 

 

Hear, hear. I retract my statement about never seeing a report and give credit where it's due.

 

 


8 posts

Wannabe Geek


  Reply # 2122392 8-Nov-2018 19:43

That's great, thanks @VygrNetworkMonkey


54 posts

Master Geek
+1 received by user: 40

Trusted
Voyager
Lifetime subscriber

  Reply # 2122428 8-Nov-2018 21:57

chevrolux:

 

Putting your mobile number on a mass communication speaks volumes. Pretty impressive to see a CEO doing that!!

 

 

Seeby is quite happy talking directly to customers, be it hearing troubles or receiving praise - it's always been his 'thing'.
I'm sure he'll draw the line at answering the phone at 3am though ... can't say I've ever been game enough to try ;)

 

He's easily available via Twitter, and even lurks around Geekzone every now and again!





Voyager Internet - Network Monkey

18743 posts

Uber Geek
+1 received by user: 5376

Trusted
Lifetime subscriber

  Reply # 2122430 8-Nov-2018 22:11

Interesting, we are a reseller and I didn't get that email. 

 

 

