Closed Bug 836808 Opened 12 years ago Closed 11 years ago

mac address collisions found in panda pool

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: dividehex, Assigned: dividehex)

While investigating troubled pandas at scl1, I found some boards which had identical MAC addresses. Some were just incorrectly entered in the inventory k/v store, but others were actual collisions. We may need to revisit the kernel patches, but in the meantime I think we should check the switch port/MAC address tables to determine whether each case is an actual collision or a false inventory entry. If it is a real collision, we will replace the highest-numbered panda in the collision set.

A quick peek into the mozpool DB reveals 5 sets of collisions:

mysql> SELECT mac_address, COUNT(*) c FROM devices GROUP BY mac_address HAVING c > 1;
+--------------+---+
| mac_address  | c |
+--------------+---+
| 0e6092094e01 | 2 |
| 2e6093784e01 | 2 |
| 2e60b8754e01 | 2 |
| 2e60c63d4e01 | 2 |
| 2e60f64e4e01 | 2 |
+--------------+---+
5 rows in set (0.01 sec)

mysql> select name, mac_address from devices where mac_address='0e6092094e01';
+------------+--------------+
| name       | mac_address  |
+------------+--------------+
| panda-0670 | 0e6092094e01 |
| panda-0822 | 0e6092094e01 |
+------------+--------------+
2 rows in set (0.00 sec)

mysql> select name, mac_address from devices where mac_address='2e6093784e01';
+------------+--------------+
| name       | mac_address  |
+------------+--------------+
| panda-0531 | 2e6093784e01 |
| panda-0574 | 2e6093784e01 |
+------------+--------------+
2 rows in set (0.00 sec)

mysql> select name, mac_address from devices where mac_address='2e60b8754e01';
+------------+--------------+
| name       | mac_address  |
+------------+--------------+
| panda-0158 | 2e60b8754e01 |
| panda-0465 | 2e60b8754e01 |
+------------+--------------+
2 rows in set (0.00 sec)

mysql> select name, mac_address from devices where mac_address='2e60c63d4e01';
+------------+--------------+
| name       | mac_address  |
+------------+--------------+
| panda-0479 | 2e60c63d4e01 |
| panda-0482 | 2e60c63d4e01 |
+------------+--------------+
2 rows in set (0.00 sec)

mysql> select name, mac_address from devices where mac_address='2e60f64e4e01';
+------------+--------------+
| name       | mac_address  |
+------------+--------------+
| panda-0486 | 2e60f64e4e01 |
| panda-0771 | 2e60f64e4e01 |
+------------+--------------+
2 rows in set (0.00 sec)
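The per-MAC lookups above can be folded into one pass. The following is only a sketch: it assumes a `devices` table with `name` and `mac_address` columns as shown in the output above, and it uses an in-memory SQLite database as a stand-in for the mozpool MySQL DB. The helper names and the `panda-0001` row are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Same idea as the GROUP BY ... HAVING query above, but returning the
# device names for every duplicated MAC in a single result set.
COLLISION_QUERY = """
SELECT mac_address, name
FROM devices
WHERE mac_address IN (
    SELECT mac_address FROM devices
    GROUP BY mac_address HAVING COUNT(*) > 1
)
ORDER BY mac_address, name
"""


def find_collisions(conn):
    """Return {mac_address: [device names]} for MACs shared by 2+ devices."""
    collisions = defaultdict(list)
    for mac, name in conn.execute(COLLISION_QUERY):
        collisions[mac].append(name)
    return dict(collisions)


def demo_connection():
    """Build a throwaway in-memory table holding a few rows from this bug."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE devices (name TEXT, mac_address TEXT)")
    conn.executemany(
        "INSERT INTO devices VALUES (?, ?)",
        [
            ("panda-0479", "2e60c63d4e01"),
            ("panda-0482", "2e60c63d4e01"),
            ("panda-0670", "0e6092094e01"),
            ("panda-0822", "0e6092094e01"),
            ("panda-0001", "2e60aaaa4e01"),  # hypothetical non-colliding row
        ],
    )
    return conn


if __name__ == "__main__":
    for mac, names in find_collisions(demo_connection()).items():
        print(mac, ", ".join(names))
```

The query itself is plain SQL and should behave the same way on MySQL; only the connection setup is SQLite-specific.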
I pulled these 2 from the chassis so I could check (at a later time) whether the CPU die IDs were also identical.

+------------+--------------+
| name       | mac_address  |
+------------+--------------+
| panda-0479 | 2e60c63d4e01 |
| panda-0482 | 2e60c63d4e01 |
+------------+--------------+
I suspect this will become more important as we add more pandas. Assuming the die IDs are not the same, how hard is it to hack the kernel patch to use more of the available address space? As is, it only changes two(?) of the hex digits.
(In reply to Amy Rich [:arich] [:arr] from comment #2)
> I suspect this will become more important as we add more pandas. Assuming
> the die IDs are not the same, how hard is it to hack the kernel patch to use
> more of the available address space? As is, it only changes two(?) of the
> hex digits.

As long as the die IDs don't happen to be identical, it shouldn't be terribly difficult to fix. And I'm fairly sure the IDs are wider than 6 bytes.
The patch XORs the TAP_IDCODE and the die ID. I don't know what those are, but if they both differ from device to device, then it's certainly possible to generate collisions.

In practice, numbered from left to right, only bits 4, 17-32, and 37-40 differ from panda to panda, although those last four bits are always either 1110 or 0101. So effectively, there are 18 bits of entropy here (1 + 16 + 1, since the last four bits only ever take two values).

A better algorithm would be to pick a range of reserved or unused vendor codes, then generate the rightmost 24 bits using some fast hash with a reasonable diffusion factor (taking 24 bits from md5 would do). But we have what we have.

If there are only a few MAC collisions, then we can probably just ensure they're not in the same VLAN. We could add a quick script to run from cron on one of the imaging servers to alert us to any same-VLAN conflicts.
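The hash-based scheme described above could look roughly like the sketch below. This is not the kernel patch and not a proposal for its exact form. It assumes the TAP_IDCODE fits in 32 bits and that the die ID is available as raw bytes, and it uses a locally administered prefix (`02:00:00`) purely as a placeholder for whatever reserved or unused vendor range would actually be chosen. The function name is made up.

```python
import hashlib

# Placeholder prefix (assumption): locally administered, unicast.
# A real deployment would substitute the reserved/unused range it picked.
PREFIX = bytes([0x02, 0x00, 0x00])


def mac_from_ids(idcode: int, die_id: bytes, prefix: bytes = PREFIX) -> str:
    """Derive a MAC from a fixed 3-byte prefix plus 24 bits of an md5 digest.

    idcode is assumed to be a 32-bit value; die_id is the raw die ID bytes.
    """
    if len(prefix) != 3:
        raise ValueError("prefix must be exactly 3 bytes")
    if prefix[0] & 0x01:
        raise ValueError("prefix must be unicast (lowest bit of first byte clear)")
    # Hash every identifying bit, so a difference anywhere in the inputs
    # can affect all 24 bits of the suffix.
    digest = hashlib.md5(idcode.to_bytes(4, "big") + die_id).digest()
    suffix = digest[:3]  # rightmost 24 bits of the MAC
    return ":".join(f"{b:02x}" for b in prefix + suffix)


if __name__ == "__main__":
    # Made-up inputs, only to show the output format.
    print(mac_from_ids(0x12345678, bytes(range(16))))
```

Note that 24 hashed bits can still collide by chance; the scheme only widens the space relative to the roughly 18 bits observed above, so a collision check on the inventory remains useful.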
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations
Blocks: panda-0479
Blocks: panda-0482
Jake: assuming we are going with Dustin's suggestions in comment 4, what is left to do here?
Flags: needinfo?(jwatkins)
(In reply to John Hopkins (:jhopkins) from comment #5)
> Jake: assuming we are going with Dustin's suggestions in comment 4, what is
> left to do here?

Since there are no other conflicting MACs in our current pool and we have no plans to purchase more, I figure we can safely close this bug.
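Should the pool change later, the same-VLAN check suggested in comment 4 could be sketched as below. How device-to-VLAN data would be obtained is not stated in this bug, so the input here is simply assumed to be an iterable of (name, mac_address, vlan) tuples. The function name and the VLAN numbers in the example are hypothetical.

```python
from collections import defaultdict


def same_vlan_conflicts(devices):
    """Return {(vlan, mac): [names]} for MACs that appear twice in one VLAN.

    devices: iterable of (name, mac_address, vlan) tuples (assumed format).
    """
    groups = defaultdict(list)
    for name, mac, vlan in devices:
        # Normalize case so the same address in different cases still matches.
        groups[(vlan, mac.lower())].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # MACs are taken from this bug; the VLAN assignments are invented.
    sample = [
        ("panda-0479", "2e60c63d4e01", 10),
        ("panda-0482", "2e60c63d4e01", 10),
        ("panda-0486", "2e60f64e4e01", 10),
        ("panda-0771", "2e60f64e4e01", 20),
    ]
    for (vlan, mac), names in same_vlan_conflicts(sample).items():
        print(f"VLAN {vlan}: {mac} shared by {', '.join(names)}")
```

A cron job could run this and send mail only when the returned dict is non-empty, so a clean pool stays silent.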
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(jwatkins)
Resolution: --- → WORKSFORME