Closed
Bug 779487
Opened 12 years ago
Closed 12 years ago
possible hardware issues with bld-centos6-hp-*.build.scl1.mozilla.com
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Assigned: arich)
References
Details
(Whiteboard: HP 4642125689 [reit-hp])
Not sure if this should go to dcops or the sres, so putting in the general queue for proper reassignment.
bld-centos6-hp-003 is refusing to boot from its disk, saying that the drives were imaged on a newer firmware revision (uh, whut?). Something sounds distinctly wonky with the hardware. Just in case, I tried reinstalling with kickstart. That appeared to work, but when the machine rebooted, it tried to boot off the network again, giving the same error about the drive not being ready and having been imaged with the wrong firmware version.
Updated•12 years ago
Component: Server Operations → Server Operations: DCOps
QA Contact: jdow → dmoore
Updated•12 years ago
colo-trip: --- → scl1
Assignee
Comment 1•12 years ago
bld-centos6-hp-001 now has the same problem.
Summary: possible hardware issues with bld-centos6-hp-003.build.scl1.mozilla.com → possible hardware issues with bld-centos6-hp-001.build.scl1.mozilla.com and bld-centos6-hp-003.build.scl1.mozilla.com
Assignee
Updated•12 years ago
Blocks: b-linux64-hp-0020
Comment 2•12 years ago
catlee is the current user of both of these slaves (per slavealloc); keeping him in the loop.
Whiteboard: [reit]
Updated•12 years ago
Whiteboard: [reit] → [reit-hp]
Updated•12 years ago
Depends on: b-linux64-hp-0023
Updated•12 years ago
Summary: possible hardware issues with bld-centos6-hp-001.build.scl1.mozilla.com and bld-centos6-hp-003.build.scl1.mozilla.com → possible hardware issues with bld-centos6-hp-{001..004}.build.scl1.mozilla.com
Comment 3•12 years ago
This has affected staging slaves only so far. We've been doing a lot of work on these in preparation for moving some jobs over to the production slaves. The failures on these staging slaves seem correlated with the increased load we've put on them during this development process.
Updated•12 years ago
No longer depends on: b-linux64-hp-0023
Comment 4•12 years ago
If there are HP tools that should be installed here, I can do so in bug 741249. I'm not seeing much that's helpful, though.
Comment 5•12 years ago
Sorry, bug 733648.
Updated•12 years ago
Blocks: b-linux64-hp-0030
Comment 6•12 years ago
The same issue has appeared on bld-centos6-hp-013.
Summary: possible hardware issues with bld-centos6-hp-{001..004}.build.scl1.mozilla.com → possible hardware issues with bld-centos6-hp-*.build.scl1.mozilla.com
Comment 7•12 years ago
dcops: this looks to be affecting every machine releng uses for staging. The staging runs are prep for doing the same thing in production. So, when you're back in town, this should be a pretty high priority, especially since we may have to RMA drives or something like that.
Comment 8•12 years ago
In the iLO web interface's Information > Integrated Management Log, there is
POST Error: 1785-Drive Array not Configured
which appears to be a reasonable way to confirm this issue (assuming the timestamp is fresh).
bld-centos6-hp-007 is also in that state.
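For scanning the rest of the fleet for that same IML entry without clicking through each iLO web UI, one hypothetical approach is to read the IML from the OS with hpasmcli. This is a sketch only; it assumes the hp-health package (which provides hpasmcli) is installed on these hosts, which is not stated anywhere in this bug.

```shell
# Hypothetical sketch: read the Integrated Management Log from the OS and
# flag the 1785 POST error seen in the iLO web UI. hpasmcli comes from the
# hp-health package; its presence on these hosts is an assumption.
if command -v hpasmcli >/dev/null 2>&1; then
  hpasmcli -s "SHOW IML" | grep -i '1785' || echo "no 1785 entries found"
else
  echo "hpasmcli not available (hp-health not installed)"
fi
```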
Updated•12 years ago
Blocks: b-linux64-hp-0026
Assignee
Comment 9•12 years ago
bld-centos6-hp-012 just ran into this issue as well
Comment 10•12 years ago
I think it's safe to say that any HP we put load on will fail within a few days. Unless that's production load, there's probably not much sense in continuing to take out systems until we get to the bottom of this.
Comment 11•12 years ago
These hosts show no RAID configuration during POST. Checked the RAID BIOS, and it shows that the logical drive is missing. We're recreating the logical RAID 0 drive. This appears to be some kind of firmware issue.
Comment 12•12 years ago
Please let us know if the contents of the drive have been lost in this process.
Comment 13•12 years ago
After recreating the logical RAID drives, all 6 hosts have booted up to the login screen. DCOps does not have the credentials to log in to verify the contents.
Comment 14•12 years ago
Hal, drive contents were not lost.
Can you (releng) put load back on these six hosts? If they stay up, we should perform the same maintenance on the remaining hosts.
Comment 15•12 years ago
I just installed hpacucli on bld-centos6-hp-001, and it doesn't find a controller. Yet hitting F8 after the BIOS startup shows an HP Smart Array B110i SATA RAID Cont in slot 0.
I looked on bld-centos6-hp-015, which wasn't repaired earlier in this bug (and hasn't failed yet). Its RAID config looks the same - a single logical drive, 232.9GB. No license keys are installed. Option ROM Config version 8.20.60.00.
So, I can't square the obvious presence of a RAID controller, which actually recommends using HPACU online, with the tool not finding any controllers:
----
[root@bld-centos6-hp-001 ~]# hpacucli
HP Array Configuration Utility CLI 8.60-8.0
Detecting Controllers...Done.
Type "help" for a list of supported commands.
Type "exit" to close the console.
=> controller all show
Error: No controllers detected.
=> controller slot=0 show
Error: The controller identified by "slot=0" was not detected.
----
(same on bld-centos6-hp-015)
So, a bit of a mystery. Let's get these loaded back up and see what happens.
Assignee
Comment 16•12 years ago
According to Van in bug 782640:
"I've ran across this issue before while working at Yahoo. It turned out to be a
firmware issue with the raid controller and drives when there was a lot of
load. The hardware we used, although not the same are related (HP raid
controller, HP branded drives). We had to manually update the firmware on both
drives and controller."
I went looking through HP's firmware for the DL120, and I didn't see anything obvious about this. Van, do you happen to remember the HP bug id, working firmware page, anything?
Comment 17•12 years ago
For those following along at home, these are DL120 G7s (E3-1220, 8GB RAM).
Assignee
Comment 18•12 years ago
I'm going to move this over to the SRE queue since they interface with HP over hardware issues.
SRE folks, to sum up: a number of CentOS 6.2 machines that had been running fine for months suddenly lose their array configurations and will no longer boot off of disk. This has happened to about 10 machines already. Going into the BIOS/RAID config tool and recreating the RAID (it's only one disk, so RAID 0) allows the machine to boot back up again without issue. We don't have a way to reproduce this issue on demand, but we do have one machine that's down because of it right now (bld-centos6-hp-015.build.scl1.mozilla.com).
We checked out HP's site for obvious firmware patches, but didn't see anything that looks like it talks about this specific case. There were some OS level packages for linux that talked about RAID corruption, but that doesn't seem to fit the bill here.
I've left bld-centos6-hp-015.build.scl1.mozilla.com booted up in the RAID configuration screen for further troubleshooting.
Thanks in advance for the help.
Component: Server Operations: DCOps → Server Operations
QA Contact: dmoore → jdow
Comment 19•12 years ago
At first glance, I didn't see anything obvious to cause this.
My suspicion is that there is a buggy BIOS or firmware.
I will run the Smart Update Firmware DVD on the node that's down.
The Smart Update Firmware DVD delivers a collection of firmware for your ProLiant servers and options. Update your ProLiant firmware using one of the following methods: HP Smart Update Manager, ROMPaq (iLO only), or Online ROM flash components.
Assignee: server-ops → dgherman
Comment 20•12 years ago
bld-centos6-hp-015.build.scl1.mozilla.com had only two upgrades: iLO and BIOS.
Fixing the array now and please let me know if the problems persist.
Assignee
Comment 22•12 years ago
Dumitru: how invasive is it to install this on the other machines? Does it require a reboot, downtime, etc? I'm not sure it's going to be a fix, but I'd like a sample size of a few machines, at least.
Comment 23•12 years ago
Yeah, it does require a reboot. You need to mount the DVD .iso from your local computer (maybe there's a better way to do it for you remoties, but from the MV office this is the fastest way for me). Then let it automatically scan the system and apply all the updates.
The download link for the DVD is:
http://tinyurl.com/8joxqyr
Make sure to get the latest version. As of today, this is 10.10 (4 Jun 2012).
Assignee
Comment 24•12 years ago
To clear up a misconception here: we do not know if the patches fixed the issue. I haven't seen the issue on *any* more machines since we patched this one. Are all of the HPs under load right now, and has load been added back to bld-centos6-hp-015?
Assignee
Comment 25•12 years ago
To summarize:
* These systems have been running for months with no discernible issues.
* The presentation of this issue was that the system in question lost its RAID configuration between reboots.
* As far as I know, we are unable to reproduce this error at will nor do we understand what causes it.
* There was some speculation that this might be a firmware issue, but there is nothing that we've found via HP that mentions this error mode.
* The following machines lost their raid config and had it rebuilt by relops or dcops:
bld-centos6-hp-001
bld-centos6-hp-002
bld-centos6-hp-003
bld-centos6-hp-004
bld-centos6-hp-007
bld-centos6-hp-012
bld-centos6-hp-013
* This machine was patched with the latest general firmware/ilo patches and had its raid rebuilt:
bld-centos6-hp-015
* The other HP machines have not presented with issues so far.
* None of the HP machines have presented with issues since bld-centos6-hp-015 was patched.
So at this point, we have three classes of HP machines running: those that have never had a problem; seven that have had an issue and had the RAID config rebuilt; and one that has been patched with the general iLO and firmware patches and had its RAID config rebuilt.
* We are not sure if these patches will solve the problem.
* We also want to make sure that these patches did not impact the building process (they shouldn't have, but we want to be sure).
* The machine that we patched has not been put back into service yet (bhearsum is working on that now).
We are more than happy to add the patch to more servers but cannot offer any proof that this is going to solve the problem.
Comment 26•12 years ago
Well, 015 has been offline for I don't know how long, but I've got it back in the production pool now and it should be relatively loaded from here on in. I'm not 100% sure if that helps us figure out what's next here. Please let me know if there's anything else I can do to help out.
Assignee
Comment 27•12 years ago
bld-centos6-hp-016 is also showing this issue now. I'm going to fix the RAID config, boot it back up, then patch the iLO and firmware. I'll re-enable it after I'm done.
Assignee
Comment 28•12 years ago
I've patched bld-centos6-hp-016 by using the virtual media option and booting off of the firmware ISO on the kickstart server: http://10.12.75.25/FW1010.2012_0530.49.iso
I've now put the slave back into rotation.
Comment 29•12 years ago
Plan of record: releng is fine with applying these as they occur (we don't yet know if they are a fix)
Assignee
Comment 30•12 years ago
I've patched bld-centos6-hp-018 by using the virtual media option and booting off of the firmware ISO on the kickstart server: http://10.12.75.25/FW1010.2012_0530.49.iso
The slave is back in rotation, and I updated slavealloc with a comment about the firmware being patched.
Assignee
Updated•12 years ago
Assignee: dgherman → arich
Component: Server Operations → Server Operations: RelEng
QA Contact: jdow → arich
Assignee
Updated•12 years ago
colo-trip: scl1 → ---
Assignee
Comment 31•12 years ago
I preemptively told the first 10 machines to do their next boot off of the firmware CD and patch. 007 and 008 are still listed at iLO 1.26, despite trying to get them to patch several times.
dumitru, could you take a look at those two? I've left them disabled, so feel free to do whatever is necessary to beat them into submission.
Thanks!
Assignee
Comment 32•12 years ago
Manually resetting the iLO on 7 and 8 seems to have "fixed" the firmware update failure.
Assignee
Comment 33•12 years ago
At this point, all of the bld-centos6-hp machines have had their firmware upgraded except bld-centos6-hp-034, which is waiting on an iLO reset.
The following machines still need to have their firmware patched (and may, unlike the bld machines, require downtimes):
talos-w8-hp-001.releng.ad.mozilla.com
talos-w8-hp-002.releng.ad.mozilla.com
talos-w8-hp-003.releng.ad.mozilla.com
foopy25.build.scl1.mozilla.com
foopy26.build.mtv1.mozilla.com
foopy27.build.mtv1.mozilla.com
foopy28.build.mtv1.mozilla.com
foopy29.build.mtv1.mozilla.com
foopy30.build.mtv1.mozilla.com
foopy31.build.mtv1.mozilla.com
foopy32.build.mtv1.mozilla.com
Assignee
Comment 34•12 years ago
Per my conversation with Armen about the state of the w8 HPs, all of the HP machines have had their firmware patched now except those that were originally repurposed as foopies:
foopy25.build.scl1.mozilla.com
foopy26.build.mtv1.mozilla.com
foopy27.build.mtv1.mozilla.com
foopy28.build.mtv1.mozilla.com
foopy29.build.mtv1.mozilla.com
foopy30.build.mtv1.mozilla.com
foopy31.build.mtv1.mozilla.com
foopy32.build.mtv1.mozilla.com
Hal/Callek, do we need a downtime to patch those, or are they not yet in production?
Comment 35•12 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #34)
> Per my conversation with Armen about the state of the w8 HPs, all of the HP
> machines have had their firmware patched now except those that were
> originally repurposed as foopies:
>
> foopy25.build.scl1.mozilla.com
> foopy26.build.mtv1.mozilla.com
> foopy27.build.mtv1.mozilla.com
> foopy28.build.mtv1.mozilla.com
> foopy29.build.mtv1.mozilla.com
> foopy30.build.mtv1.mozilla.com
> foopy31.build.mtv1.mozilla.com
> foopy32.build.mtv1.mozilla.com
>
> Hal/Callek, do we need a downtime to patch those, or are they not yet in
> production?
They are in production. They do not need a tree-closing downtime, but they do need at least 24-48 hours for releng to coordinate our end of the "downtime" for this.
Assignee
Comment 36•12 years ago
callek: okay, when do you want to schedule these? We can easily do them one at a time.
Comment 37•12 years ago
Since you're both waiting for the other to make a move:
> foopy25.build.scl1.mozilla.com
> foopy26.build.mtv1.mozilla.com
> foopy27.build.mtv1.mozilla.com
Friday at noon pacific
> foopy28.build.mtv1.mozilla.com
> foopy29.build.mtv1.mozilla.com
> foopy30.build.mtv1.mozilla.com
Monday at noon pacific
> foopy31.build.mtv1.mozilla.com
> foopy32.build.mtv1.mozilla.com
Tuesday at noon pacific
Comment 38•12 years ago
update per IRC: let's do them all Friday at noon pacific.
Comment 39•12 years ago
update per IRC:
(note out of order)
> foopy27.build.mtv1.mozilla.com
> foopy28.build.mtv1.mozilla.com
> foopy29.build.mtv1.mozilla.com
> foopy30.build.mtv1.mozilla.com
> foopy31.build.mtv1.mozilla.com
> foopy32.build.mtv1.mozilla.com
Friday at noon pacific
> foopy25.build.scl1.mozilla.com
> foopy26.build.mtv1.mozilla.com
Monday at noon pacific
Comment 40•12 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #39)
> > foopy25.build.scl1.mozilla.com
> > foopy26.build.mtv1.mozilla.com
> Monday at noon pacific
Joel/Clint, FYI we'll temporarily lose these foopies at ~that time. Do you want/need me to handle prepping for this on Monday, or will you?
Assignee
Comment 41•12 years ago
Well, so much for this patch fixing the issue. bld-centos6-hp-004 got both the ROM and iLO patch, and still lost its RAID configuration yesterday. Going to pass this back to dumitru for further investigation directly with HP. Dumitru, I've pulled this machine out of production, so do with it what you will.
Assignee: arich → dgherman
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → jdow
Assignee
Comment 42•12 years ago
bld-centos6-hp-013 is also now in the same state. Patched, still lost its RAID config.
Assignee
Updated•12 years ago
Severity: normal → major
Assignee
Comment 43•12 years ago
The following have been patched:
foopy27.build.mtv1.mozilla.com
foopy28.build.mtv1.mozilla.com
foopy29.build.mtv1.mozilla.com
foopy30.build.mtv1.mozilla.com
foopy31.build.mtv1.mozilla.com
foopy32.build.mtv1.mozilla.com
That leaves these for monday:
foopy25.build.scl1.mozilla.com
foopy26.build.mtv1.mozilla.com
Comment 44•12 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #41)
bld-centos6-hp-001 appears to have gotten back into this state too.
Assignee
Comment 45•12 years ago
Dumitru: bumping this up to critical since it's taking out production machines and we don't have a fix.
Severity: major → critical
Updated•12 years ago
Whiteboard: [reit-hp] → HP 4642125689
Updated•12 years ago
Whiteboard: HP 4642125689 → HP 4642125689 [reit-hp]
Comment 46•12 years ago
Nagios is saying bld-centos6-hp-017 and 18 are down. Is that from work here, or have they fallen over? 18 is at a BIOS screen; 17 isn't showing me any video in the remote console.
Comment 47•12 years ago
Hello Dumitru,
My name is Gilles from the ProLiant L2 team, and I have taken ownership of your DL120 G7 issue of losing the disc drive configuration.
I was looking at the reports you sent us, and I would like to share a few things with you.
At first glance, I looked at the firmware versions for the system, the disc controller, and the disc drive as well.
I am not sure about the OS you are using on this system; from the report you sent us, it seems to be CentOS 6, but I would need confirmation.
I did look at the disc drive firmware and do see HPG0. I don't know if all your DL120 G7 discs are the same, but if that is the case, it might explain why you are getting this issue.
[ Physical Drive 1I:1:1 ]
Physical Drive Status
SCSI Bus 0 (0x00)
SCSIID 0 (0x00)
Block Size 512 Bytes Per Block (0x0200)
Total Blocks 250 GB (0x1d1c5970)
Reserved Blocks 0x00010000
Drive Model ATA VB0250EAVER <<<=======
Drive Serial Number Z2A3TVS4
Drive Firmware Revision HPG0 <<<============
SCSI Inquiry Bits 0x00
The latest Firmware for those discs is HPG7, as you can see in the link below:
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=329290&prodSeriesId=3690351&swItem=MTX-703620254a70407c96d79c5faa&prodNameId=4134173&swEnvOID=4103&swLang=13&taskId=135&mode=5
You can download it from:
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareIndex.jsp?lang=en&cc=us&prodNameId=4134173&prodTypeId=329290&prodSeriesId=3690351&swLang=13&taskId=135&swEnvOID=54
It contains the following fixes:
Version: HPG7 (B) (4 Sep 2012)
Fixes
Upgrade Requirement:
Online firmware flashing of drives attached to an HP Smart Array controller running in Zero Memory (ZM) mode or an HP ProLiant host bus adapter (HBA) is NOT supported. Only offline firmware flashing of drives is supported for these configurations.
--------------------------------------------------------------------------------
Problems Fixed
Resolved an issue where the drive was not recognized when the power was turned on. <<===
Improved drive performance when booting at cold temperatures.
Properly set the drive write cache to off as the default power-up setting.
So, I would suggest updating the disc firmware to HPG7 and seeing if it helps.
Please let me know if you have any questions.
Thanks,
Gilles Lucier
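As a side note (a sketch, not a step taken in this bug), the drive firmware revision Gilles refers to can also be read from the running OS, assuming smartmontools is installed, which is not confirmed on these hosts:

```shell
# Sketch: read the drive firmware revision from the OS. The "Firmware
# Version" field in smartctl's identify output corresponds to the
# HPG0/HPG7 revision discussed above. smartmontools being installed is
# an assumption.
if command -v smartctl >/dev/null 2>&1; then
  smartctl -i /dev/sda | grep -i '^firmware'
else
  echo "smartctl not available (smartmontools not installed)"
fi
```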
Updated•12 years ago
Blocks: b-linux64-hp-0028
Comment 48•12 years ago
I spent most of the day figuring out an automated way to perform those firmware upgrades on the drives, but it looks like the only choice we have is to use the USB bootable image.
We cannot use other methods because:
a) the RAID controller is one with zero memory, so online upgrades (running the upgrade software from the OS) are not possible
b) offline upgrades using the HP Service Pack for ProLiant DVD are not possible for this controller. Even though the latest SPP contains this upgrade, when selected it says to use the USB key method.
When I left the office, I left the HP USB Key Creator Utility running to finish creating this bootable image on a USB drive.
I'll test it tomorrow with DCOps help.
Assignee
Comment 49•12 years ago
Any luck with the patching?
Comment 50•12 years ago
The first USB I created didn't work at all, so I made a new one.
This one worked, and the system booted with it, but unfortunately the upgrade is not seen by HP SUM. I exhausted all my Google-fu and troubleshooting, and emailed our HP level 2 engineer back for support. Waiting on his advice.
Updated•12 years ago
Blocks: b-linux64-hp-0029
Comment 51•12 years ago
Hi Dumitru,
I do see what you mean. I will need to research this and will probably need to try to reproduce it in our lab.
Thanks,
Gilles Lucier
CPRQ GCC ISS/SW Engineer
Updated•12 years ago
Blocks: b-linux64-hp-0031
Updated•12 years ago
Blocks: b-linux64-hp-0034
Comment 52•12 years ago
Hello Dumitru,
I want to give you an update on this. I was able to reproduce the issue in our lab, but I do not have a solution for you yet.
I will work on this and will get back to you when I have new info.
Thanks,
Gilles Lucier
CPRQ GCC ISS/SW Engineer
Updated•12 years ago
Blocks: b-linux64-hp-0027
Updated•12 years ago
Blocks: b-linux64-hp-0032
Comment 53•12 years ago
Hello Dumitru,
I thought I had a solution, but it turned out that it doesn't work. I am thinking about something else; I will need to verify whether it could work, and I will let you know.
I have been out of the office for a few days last week, and I apologize for the delay, but I should be able to get back to you on this by the end of this week.
Regards,
Gilles Lucier
Comment 54•12 years ago
I upgraded the HDD firmware on bld-centos6-hp-001 and bld-centos6-hp-015 and the method[1] works.
Unfortunately we can't automate this, so please send me a list of hosts that can be taken offline for 10 minutes for this upgrade.
We can also do this via IRC, just ping me.
[1]https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=29329401
Comment 55•12 years ago
Unfortunately, we'll need to coordinate downtimes. The HPs deployed as foopies need approx 2 hours to bring down cleanly, and we can only do a small number per batch due to the impact on the test pool.
The HPs deployed as mock builders can likely be batched, but need about the same time to be taken offline cleanly.
Add all that together, and we're likely going to do 2 batches a day max. We'll work out the rest of the details on IRC and log progress here.
Comment 56•12 years ago
Alright, let me know on IRC when you guys are ready to do these, thanks!
Status: NEW → ASSIGNED
Comment 57•12 years ago
bld-centos6-hp-013 seems to be down. Now disabled in slavealloc as well. If you can get to it, fire away.
Comment 58•12 years ago
(In reply to Aki Sasaki [:aki] from comment #57)
> bld-centos6-hp-013 seems to be down. Now disabled in slavealloc as well.
> If you can get to it, fire away.
Fixed this one, too.
Comment 59•12 years ago
bld-centos6-hp-002 fixed
(requested on IRC by jhopkins and hwine)
Comment 60•12 years ago
bld-centos6-hp-012 is down, which makes it a candidate, but bld-centos6-hp-013 is back down again post-fix.
Comment 61•12 years ago
I may know why bld-centos6-hp-013 is sad, so I re-flashed it.
bld-centos6-hp-012 flashed with the new firmware.
Comment 62•12 years ago
013 and 001 are still experiencing the same issue even with the new HDD firmware.
I updated HP.
Comment 63•12 years ago
New logs sent to HP, waiting on their input now...
Comment 64•12 years ago
Hi Dumitru,
I was looking at this, and I found an issue where the drive is not recognized when the serial port is enabled in the BIOS. This was seen on a different server, but one in the same family and with the same B110i array controller. I cannot tell whether this is related or not, but if you reboot your server, you can check the BIOS setting for the serial port configuration to see if it is enabled or disabled. This is a quick test that can be done.
Also, there is a new BIOS that should come out this weekend for the DL120 G7. It talks about the serial console but, again, I do not know whether this could help or not, since I do not see any mention of the B110i controller in this new BIOS.
However, looking at this other issue reported about the serial port makes me think there might be a relation between the two.
You can try looking at the serial port config and toggling it, if that doesn't affect your operations, and we will see if it changes anything.
Thanks,
Gilles Lucier
CPRQ GCC ISS/SW Engineer
--
I disabled serial ports on bld-centos6-hp-013.
Comment 65•12 years ago
This alert should not page oncall.
<nagios-releng> Thu 15:50:08 PDT [459] bld-centos6-hp-018.build.scl1.mozilla.com:disk - / is CRITICAL: Timeout while attempting connection
Comment 66•12 years ago
disregard comment 65 - wrong bug
Comment 67•12 years ago
I've repaired bld-centos6-hp-016, 018, and 019 to see how fast they fall over again. If it's not too quick, I think it would be worth picking up machines as they fall while HP works on a permafix.
Comment 68•12 years ago
Hi Dumitru,
Thanks for the reports and answers to my questions. One other thing I forgot to ask you because I found something concerning the B110i array controller. It seems that those are using the driver to load the "firmware", as you can see in the following advisory:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02862729&lang=en&cc=us&taskId=101&prodSeriesId=5075933&prodTypeId=15351
You will see: "This occurs because the B110i firmware runs from the operating system driver instead of the Option ROM. "
This is completely different from the other array controllers.
So, the reports you sent me do not contain any driver version info, and since we do not "officially" support driver installation on CentOS, knowing that CentOS is quite similar to RHEL, the only recommendation I can give you is to make sure you have the latest hpahcisr driver version for RHEL, according to your CentOS equivalent, which is probably RHEL 6. Just pay attention to the Update#.
Maybe you can verify the version of hpahcisr driver you have and see if it is the latest available, as described on the following site:
http://h20000.www2.hp.com/bizsupport/TechSupport/SoftwareDescription.jsp?lang=en&cc=us&prodTypeId=15351&prodSeriesId=5075933&swItem=MTX-02e812787510460e8422d5bb65&prodNameId=5075937&swEnvOID=4103&swLang=8&taskId=135&mode=5
Make sure you read the Installation instructions tab to use the appropriate version of the driver.
Please let me know what HPAHCISR driver version you are using.
Since you are on the latest disc drive firmware, the controller FW needs to be examined to make sure we are on the latest also. BTW, did you try installing the latest BIOS, 2012.08.10, which came out last Friday, on one of the DL120s? We may want to try it if we do have the latest array controller driver already installed.
Do we have an equivalent to the RHEL "sosreport" on CentOS? If so, can you send it to me, please?
Thanks,
Gilles Lucier
Comment 69•12 years ago
Hi Gilles,
Looks like CentOS has sosreport integrated; I've attached the file.
I found something interesting:
[root@bld-centos6-hp-001 ~]# lspci | grep -i raid
00:1f.2 RAID bus controller: Intel Corporation 6 Series/C200 Series Chipset Family SATA RAID Controller (rev 05)
Apparently, the OS thinks this is an Intel controller!
I have found several blog posts online about CentOS not seeing this controller properly, sometimes even seeing the drives directly (people reporting that they had configured RAID 1 with two drives, but CentOS saw two disks instead of the logical drive).
I am now imaging one of the servers with RHEL 6 to verify how the controller appears, and then I will have more to think about.
Thanks, let me know what your thoughts are.
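A hedged follow-up check (a sketch, not something run in this bug) would be to see which kernel driver has actually claimed that PCI device; the generic ahci driver owning it, rather than HP's hpahcisr, would fit the detection problems described in the blog posts above:

```shell
# Sketch: -k prints a "Kernel driver in use:" line for each device, and -s
# limits output to the slot reported by lspci above. The fallback echo is
# only there so this snippet degrades gracefully on machines without
# pciutils or without that slot.
lspci -k -s 00:1f.2 || echo "lspci not available or slot not present"
```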
Comment 70•12 years ago
Hello Dumitru,
Like you said, very interesting. I did look at the sosreport you sent, and it doesn't contain all the details about the driver, but there is enough to say the driver is the correct one. And when I say correct, I mean it is the AHCI driver, but not the one from HP made specially for this B110i controller. This means it doesn't contain all the necessary code to fix the controller issues, which might lead to strange controller behaviours.
It would be nice if you could effectively mount a DL120 G7 with RHEL and see how it detects the controller. However, for the best controller performance, I would recommend installing the HP version of the AHCI driver (hpahcisr), because it contains the necessary code to optimize your B110i controller.
Usually the HP-supplied driver for the equivalent RHEL version, from the HP download site, works pretty well with CentOS, even though, like I said earlier, we do not "officially" support it. So, if it does work correctly on your RHEL system, you may want to try it on one of your non-critical CentOS systems, for test purposes.
Thanks,
Gilles Lucier
Comment 71•12 years ago
Gilles,
Here are some reports from a host running RHEL 6.
[root@bld-centos6-hp-009 ~]# lsscsi
[0:0:0:0] disk ATA VB0250EAVER HPG0 /dev/sda
[root@bld-centos6-hp-009 ~]# lspci | grep -i raid
00:1f.2 RAID bus controller: Intel Corporation 6 Series/C200 Series
Chipset Family SATA RAID Controller (rev 05)
[root@bld-centos6-hp-009 ~]# cat /etc/issue
Red Hat Enterprise Linux Server release 6.3 (Santiago)
Kernel \r on an \m
Very interesting, right? Instead of seeing the logical volume, it
detects the drive itself. This puzzles me.
I am attaching a sosreport from this system, running RHEL 6.
Funny enough:
[root@bld-centos6-hp-009 ~]# rpm -Uvh kmod-hpahcisr-1.2.6-14.rhel6u3.x86_64.rpm
Preparing...                ########################################### [100%]
#######################################################
# Hpahcisr is currently not controlling any storage. #
# Loading this driver could displace the current     #
# storage driver causing filesystem corruption.      #
# Exiting!                                           #
#######################################################
error: %pre(kmod-hpahcisr-1.2.6-14.rhel6u3.x86_64) scriptlet failed, exit status 1
error: install: %pre scriptlet failed (2), skipping kmod-hpahcisr-1.2.6-14.rhel6u3
[root@bld-centos6-hp-009 ~]# cat /etc/issue
Red Hat Enterprise Linux Server release 6.3 (Santiago)
Kernel \r on an \m
Assignee
Comment 73•12 years ago
These machines have been broken for 3 months now. Who can we escalate this to in order to get it resolved?
Comment 74•12 years ago
Remember that you are using an officially unsupported OS on them.
The HP engineer sent me this article to try:
http://linuximagination.blogspot.com/2011/04/centos-installer-wasnt-detecting-sata.html
His last email that he sent me yesterday was:
"I did some researches but wasn't able to find any answer yet. I will try to look at it tomorow but I might need to reproduce it if I do not find anything documented, which may take more time.
It would have been interesting to know if the procedure described in this post would work ...
Let me check on my side what I can find and I will keep you posted."
I just don't have the time to do it, working on my other quarterly goals currently.
If you can find the human resources to do what that article says, that'd be great.
Assignee
Comment 75•12 years ago
I've gone a different route since we don't seem to be able to get a fix for this bug. At the moment, I'm switching from RAID to AHCI support to see if that solves the issue. I'm tracking the machines that I update in slavealloc.
So far, that's:
bld-centos6-hp-007
bld-centos6-hp-008
bld-centos6-hp-013
bld-centos6-hp-015
bld-centos6-hp-016
bld-centos6-hp-017
bld-centos6-hp-018
Since fixing 7 and 8 yesterday, I haven't seen them go back down yet, so I am cautiously optimistic.
I will also re-kickstart bld-centos6-hp-009 and apply the same change.
Comment 76•12 years ago
|
||
(In reply to Amy Rich [:arich] [:arr] from comment #75)
> I've gone a different route since we don't seem to be able to get a fix for
> this bug. At the moment, I'm switching from RAID to AHCI support to see if
> that solves the issue. I'm tracking the machines that I update in
> slavealloc.
Just for the record, can you confirm we were running RAID 0 previously? I.e. we were never depending on raid for any disc corruption/recovery, so moving away from raid does not change anything about using, monitoring, or managing this box. (I hope I'm correct in that statement.)
Comment 77•12 years ago
|
||
I had the same idea yesterday, and after discussing it with various people, they advised trying it.
I enabled it on 009 last night to see if the change would break anything, and it didn't.
I put it back to RAID because I installed the HP driver on RHEL, but that made the system kernel panic. So 009 can be all yours, since we want to try this on it too.
In the meantime I also emailed Rich and he said he will try and get us more HP resources on the case.
Updated•12 years ago
|
QA Contact: jdow → shyam
Comment 78•12 years ago
|
||
(In reply to Hal Wine [:hwine] from comment #76)
> (In reply to Amy Rich [:arich] [:arr] from comment #75)
> > I've gone a different route since we don't seem to be able to get a fix for
> > this bug. At the moment, I'm switching from RAID to AHCI support to see if
> > that solves the issue. I'm tracking the machines that I update in
> > slavealloc.
>
> Just for the record, can you confirm we were running RAID 0 previously? I.e.
> we were never depending on raid for any disc corruption/recovery, so moving
> away from raid does not change anything about using, monitoring, or managing
> this box. (I hope I'm correct in that statement.)
Correct, it was RAID 0. But even with the SATA controller in RAID mode, the OS detected it incorrectly and it shows up using the "ahci" driver; go figure.
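For anyone checking other boxes: here's one way to sanity-check from the running OS whether the ahci driver is actually in play. This is a heuristic sketch only; ahci may be built in rather than loaded as a module (hence the sysfs check), and a controller left in legacy/IDE mode would typically bind ata_piix instead:

```shell
# Sketch: check whether the ahci driver is loaded or bound on this box.
# Heuristic only; exact module names vary by kernel configuration.
if grep -qw ahci /proc/modules 2>/dev/null || [ -d /sys/bus/pci/drivers/ahci ]; then
  status="ahci driver present"
else
  status="ahci driver not loaded"
fi
echo "$status"
```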
QA Contact: shyam → jdow
Updated•12 years ago
|
QA Contact: jdow → shyam
Assignee | ||
Comment 79•12 years ago
|
||
At this point, I think I've disabled RAID on all of the slaves and switched them to AHCI. Interestingly, I found a few of them were set to Legacy SATA instead of RAID or AHCI SATA.
Assuming that this does prove to be the fix, we still need to do this to the handful of machines that have been repurposed as things other than mock slaves.
Here's the list of machines that are done (and marked as such in slavealloc):
bld-centos6-hp-001
bld-centos6-hp-002
bld-centos6-hp-003
bld-centos6-hp-004
bld-centos6-hp-005
bld-centos6-hp-006
bld-centos6-hp-007
bld-centos6-hp-008
bld-centos6-hp-009
bld-centos6-hp-012
bld-centos6-hp-013
bld-centos6-hp-015
bld-centos6-hp-016
bld-centos6-hp-017
bld-centos6-hp-018
bld-centos6-hp-019
bld-centos6-hp-024
bld-centos6-hp-025
bld-centos6-hp-026
bld-centos6-hp-027
bld-centos6-hp-028
bld-centos6-hp-029
bld-centos6-hp-030
bld-centos6-hp-031
bld-centos6-hp-032
bld-centos6-hp-035
Assignee: dgherman → arich
Component: Server Operations → Server Operations: RelEng
QA Contact: shyam → arich
Assignee | ||
Comment 80•12 years ago
|
||
No RAID lossage since the fix last week. Callek, when do you want to do the ones that were retasked as foopies, and how many do you want to do at once?
foopy25 - foopy37 are the ones that need a quick reboot.
Flags: needinfo?(bugspam.Callek)
Assignee | ||
Comment 82•12 years ago
|
||
Per Callek on IRC yesterday, we're going to shoot for fixing the foopies on Nov 20th. Specific times forthcoming.
Comment 83•12 years ago
|
||
(In reply to Amy Rich [:arich] [:arr] from comment #82)
> per callek on irc yesterday, we're going to shoot for fixing the foopies on
> Nov 20th. SPecific times forthcoming.
Sorry, I meant to comment here on the 15th/16th. Let's shoot for the 20th, with a window for IT from 9am PT to 11am PT.
IT work can begin right at 9; just ping me in IRC. It should complete no later than 11am PT (given my understanding and chat with :arr, that's much more than enough time). I'll be on point for bringing the software on the systems back up properly when IT is done.
Flags: needinfo?(bugspam.Callek)
Assignee | ||
Comment 84•12 years ago
|
||
foopy25 - foopy37 have been modified as well.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations