Closed Bug 641644 (Opened 14 years ago, Closed 12 years ago)
Reject duplicate personas automatically

Categories: addons.mozilla.org Graveyard :: Developer Pages (enhancement, P4)
Tracking: (Not tracked)
Status: RESOLVED FIXED 2013-06-06
People: (Reporter: clouserw, Assigned: kngo)
Whiteboard: [monarch]
A recurring problem on getpersonas is that duplicate personas are uploaded, and unless you have someone who has seen every persona ever, it's hard to flag these.
We should be tracking unique values for personas (a hash in the db is fine) and comparing on upload and rejecting if they are the same.
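A minimal sketch of that idea, with hypothetical names (the real check would compare against a hash column on the themes table rather than an in-memory set):

```python
import hashlib

def image_hash(data: bytes) -> str:
    """SHA-256 of the raw uploaded bytes."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical in-memory stand-in for the db column described above.
known_hashes = set()

def accept_upload(data: bytes) -> bool:
    """Reject an upload whose hash matches one already stored."""
    h = image_hash(data)
    if h in known_hashes:
        return False
    known_hashes.add(h)
    return True
```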
Updated•13 years ago
Blocks: greater-percona
Comment 1•12 years ago
(In reply to Wil Clouser [:clouserw] from comment #0)
> A recurring problem on getpersonas is that duplicate personas are uploaded
> and unless you have someone that has seen every persona ever, it's hard to
> flag these.
>
> We should be tracking unique values for personas (a hash in the db is fine)
> and comparing on upload and rejecting if they are the same.
If the same image is uploaded twice, the hash will probably be different, since we always resize the images with PIL. We can store the hash upon upload, before resizing, though.
Severity: normal → enhancement
Priority: P2 → P4
Target Milestone: Q3 2011 → ---
Comment 2•12 years ago
(In reply to Chris Van Wiemeersch [:cvan] from comment #1)
> (In reply to Wil Clouser [:clouserw] from comment #0)
> > A recurring problem on getpersonas is that duplicate personas are uploaded
> > and unless you have someone that has seen every persona ever, it's hard to
> > flag these.
> >
> > We should be tracking unique values for personas (a hash in the db is fine)
> > and comparing on upload and rejecting if they are the same.
>
> If the same image is uploaded twice, the hash will probably will be
> different since we use PIL to resize the images always. We can store the
> hash upon upload before resizing though.
Wil, how would you suggest we do this?
Reporter
Comment 3•12 years ago
Now that I think about it, I'm not sure the PIL resizing is the problem. Hypothetically, resizing the same image twice is going to deliver the exact same image both times, right? So, I'd say:
- Calculate hashes of final, full sized images (both header and footer) for all themes
- On upload, we calculate the hashes like normal and put them in the db. New themes go in the review queue as normal.
- We alter the review queue with a place for a message to the reviewer, and if the hash is the same as one already in the db (please add an index to that column!), then a short message to the reviewer is all that is needed. Something like "This theme might be a duplicate of _$themename_". If it is a dupe, they should reject/delete it like any other disallowed material.
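The steps above could be sketched like this (hypothetical names; the real version would use an indexed db column rather than a dict, and hang the warning message off the review-queue entry):

```python
import hashlib

# Hypothetical stand-in for the themes table's indexed hash column.
themes_by_hash = {}  # hash -> theme name

def file_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def register_theme(name: str, header: bytes, footer: bytes) -> list:
    """Store hashes for a new theme's header and footer images, and
    return reviewer warnings for any hash already in the table."""
    warnings = []
    for data in (header, footer):
        h = file_hash(data)
        if h in themes_by_hash:
            warnings.append(
                "This theme might be a duplicate of %s" % themes_by_hash[h])
        else:
            themes_by_hash[h] = name
    return warnings
```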
Comment 4•12 years ago
What if somebody uploads somebody else's persona that has been published elsewhere, and then the author decides to upload it to AMO?
Comment 5•12 years ago
(In reply to Aleksej [:Aleksej] from comment #4)
[nm, I should have read comment #3 first]
Assignee
Updated•12 years ago
Assignee: nobody → ngoke
Assignee
Comment 6•12 years ago
Should we calculate these hashes for all currently existing 360k+ personas? With the images being hosted on a static server, it would take 720k+ HTTP requests to pull the images (header and footer).
I suggest we just start calculating hashes for personas from here on out? I'd assume most duplicate personas come from an artist accidentally submitting a theme twice in a row.
Reporter
Comment 7•12 years ago
We have the personas on disk, we wouldn't need to pull over http. They are all in the files/ dir organized by add-on ID.
We can handle the huge load of back processing with celery. Off the top of my head:
> def backfill_hashes():
>     themes = select all themes where hash = "" limit 1000
>     for theme in themes:
>         theme.hash = calculate_hash(theme) or null
>         theme.save()
We just call that 360 times in a bash loop (sleeping 60s in between) and eventually we have our hashes - doesn't matter if it takes days or weeks. Null is different from "", so those rows wouldn't be returned again. If we do get a null (problem calculating the hash) it should be logged though, so we know what's up.
Actually, there are a bunch of themes which aren't themes (.doc files, pdfs, exes, etc.) back from the days before we checked those things. It'd be awesome if you made this script open each image and make sure it was a real image. I think that's just a matter of x = Image.open(). If x has a width and height, it's an image.
We have code in AMO for celery tasks and image processing which can be copied for this. Let me know if you've got concerns about it but I think it's totally possible and happy to help.
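A self-contained sketch of one batch pass, under stated assumptions: the names are hypothetical, a magic-byte check stands in for the Image.open() width/height test (which needs PIL), and a list of dicts stands in for the db rows a celery task would select:

```python
import hashlib

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"

def looks_like_image(data: bytes) -> bool:
    """Cheap magic-byte check, standing in for the Image.open()
    validation described above."""
    return data.startswith(PNG_MAGIC) or data.startswith(JPEG_MAGIC)

def backfill_batch(rows, batch_size=1000):
    """Process one batch of themes whose hash column is still "".

    Non-images get hash None (and would be logged in the real task),
    so they are not selected again on the next pass. Returns the
    number of rows processed.
    """
    pending = [r for r in rows if r["hash"] == ""][:batch_size]
    for row in pending:
        data = row["data"]
        if looks_like_image(data):
            row["hash"] = hashlib.sha256(data).hexdigest()
        else:
            row["hash"] = None
    return len(pending)
```

Calling this repeatedly, as comment 7 suggests, drains the backlog batch by batch until no empty-hash rows remain.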
Reporter
Comment 8•12 years ago
> Actually, there are a bunch of themes which aren't themes (.doc files, pdfs,
> exes, etc.) back from the days before we checked those things. It'd be
> awesome if you made this script open each image and make sure it was a real
> image. I think that's just a matter of x = Image.open(). If x has a width
> and height, it's an image.
There is also potential for a file to simply be missing (see 861234). It's all part of the same handling/reporting but another case to make sure we hit.
Again, I'm happy to help out here.
Assignee
Comment 9•12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Reporter
Updated•12 years ago
Target Milestone: --- → 2013-05-16
Assignee
Comment 10•12 years ago
Still need to get that migration script working.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 11•12 years ago
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Reporter
Comment 12•12 years ago
This is a 2 year old bug. Awesome to get it closed! Thanks.
Target Milestone: 2013-05-16 → 2013-06-06
Updated•9 years ago
Product: addons.mozilla.org → addons.mozilla.org Graveyard