Closed Bug 675712 Opened 13 years ago Closed 13 years ago

Set up Elastic Search indexing on two production processors

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: laura, Assigned: jason)

References

Details

adrian/rhelmer can give config instructions, but let's get this underway. We're going to set two processors to index reports into ES, so we can watch performance of those nodes compared to the others. We also need the ES VIP set up (bug 673507).
Blocks: 651279
Until the VIP is set, feel free to point at all four nodes in round-robin fashion: hp-node6[1-4].phx1.mozilla.com:9200
The above URLs are part of a cluster; there shouldn't be any delay in indexing if you hit just one node.
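For illustration, a minimal round-robin sketch in Python (the ES_NODES list and the next_es_url helper are hypothetical names for this example, not part of Socorro's actual processor config):

import itertools

# The four hosts named above; 9200 is the standard Elasticsearch HTTP port.
ES_NODES = ["http://hp-node6%d.phx1.mozilla.com:9200" % n for n in range(1, 5)]

# Cycle through the nodes so index requests are spread evenly
# until the VIP exists.
_node_cycle = itertools.cycle(ES_NODES)

def next_es_url():
    # Each call returns the next node URL, wrapping around forever.
    return next(_node_cycle)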
Ping?
Jason should have the VIP up soon. I'd also really strongly recommend not doing this until we have the new dedicated Elasticsearch hardware. It feels like we have enough problems with Hadoop on the existing hp-nodes, and introducing even more production services on non-redundant-by-design hardware feels like a bad idea. Not to mention we are waiting for the other hardware so that we can claim hp-nodes 61-70 as stage Hadoop nodes; otherwise this will block the staging environment too. This is just my opinion.
Assignee: jdow → jthomas
Component: Socorro → Server Operations
Product: Webtools → mozilla.org
QA Contact: socorro → mrz
Version: 1.7 → other
I understand that the hardware is non-redundant-by-design, but ES is a clustered application with redundancy built into it, so suffering the loss of a node won't cause any noticeable impact. We also have a few extra nodes available, so we can handle the loss of more than one node if it comes to that. If we have some sort of catastrophic failure, it won't have a major impact on Socorro production, as we are not looking to cut over to using ES exclusively right now. We need to get some high-volume, real-world use of the cluster to validate the performance testing we have done so far, especially since we want extremely specific criteria for ordering the new hardware rather than just throwing "more than enough" at it. We can't get that detailed performance data without having something that looks very much like the real world to base it on.
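As a quick sanity check on that redundancy claim, the standard Elasticsearch cluster-health endpoint reports whether every shard still has its replicas allocated. A minimal sketch, querying one node picked arbitrarily (any cluster member can answer):

import json
import urllib.request

# hp-node61 is an arbitrary choice; this is not a special/master node.
HEALTH_URL = "http://hp-node61.phx1.mozilla.com:9200/_cluster/health"

with urllib.request.urlopen(HEALTH_URL) as resp:
    health = json.load(resp)

# "green": every primary and replica allocated; "yellow": all primaries
# up but some replicas unassigned; "red": some data unavailable.
print(health["status"], health["number_of_nodes"], health["unassigned_shards"])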
So, just two hours after I posted comment 4, hp-node64 dropped offline due to a failed /dev/sda disk. Given our SLA on those servers, etc., I expect 2-3 weeks before it is back online, maybe longer since one of our ops guys will be on vacation. If these SLAs and failure scenarios are OK with Laura and the Socorro team, then we can proceed with setting up the VIP and the service. Jason is point on this, and he's also working on the new stage build-out, so we'll have to prioritize accordingly.
They are fine for Metrics. We still have 5 nodes in the cluster, and we could likely bring three more (67, 68, 69) into it, depending on their health. Immediately upon the loss of a node, the cluster begins rebalancing any under-replicated shards to the remaining nodes until the health of the cluster is green again. If we lose so many nodes that we run out of disk space on the remaining ones, then we can reconfigure the processors to stop sending data to the ES cluster until we have the new hardware ordered and in place.
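To watch that rebalancing complete, the same Elasticsearch health endpoint accepts a wait_for_status parameter, which blocks server-side until the cluster reaches the requested status or a timeout expires. A minimal sketch, again against an arbitrary node and with a 60-second timeout chosen only for the example:

import json
import urllib.request

URL = ("http://hp-node61.phx1.mozilla.com:9200/_cluster/health"
       "?wait_for_status=green&timeout=60s")

with urllib.request.urlopen(URL) as resp:
    health = json.load(resp)

# timed_out is set when the requested status wasn't reached in time.
if health["timed_out"]:
    print("still rebalancing; current status:", health["status"])
else:
    print("cluster is back to green")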
@adrian/rhelmer could you please provide configuration instructions?
Whiteboard: allhands
Please reopen when we have more information.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → mozilla.org Graveyard