Closed Bug 1483667 Opened 6 years ago Closed 6 years ago

Pocket personalization V2: add text tagger

Categories

(Firefox :: New Tab Page, enhancement, P1)

Tracking


RESOLVED FIXED
Firefox 64
Iteration:
64.1 - Sep 14
Tracking Status
firefox64 --- fixed

People

(Reporter: nanj, Assigned: jkoren)

Details

### Description

This PR adds the ability to classify text. We define two different classifiers: a Naïve Bayes (NB) classifier and a multiclass nonnegative matrix factorization (NMF) classifier. Both use bag-of-words TF-IDF vectors as features.

The purpose of this code is to allow Firefox to classify pages into topics by examining the text found on the page. This code is part of the Pocket Personalization v2 experiment, which uses content analysis to locally build interest profiles.

This code is dark.

### Testing

Unit tests.

This code has no current consumers.

### Related

We reviewed this internally on PR https://github.com/Pocket/activity-stream/pull/1

### See Also

https://docs.google.com/document/d/12OtUZywivIvBnO3hmMNjptmIQ8cQLFOIOzJO4bRLqdg/edit
https://en.wikipedia.org/wiki/Naive_Bayes_classifier
https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
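
For readers unfamiliar with the approach, the combination of bag-of-words TF-IDF features and a Naïve Bayes classifier mentioned in the description can be sketched roughly as follows. This is a minimal illustration in Python, not the code from the PR: the function names, the smoothing choices, and the toy training data are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(token_lists):
    """Compute smoothed inverse document frequencies (an assumed smoothing scheme)."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def tfidf(tokens, idf):
    """Build an L2-normalized TF-IDF vector as a dict; unknown tokens are dropped."""
    counts = Counter(tok for tok in tokens if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def train_nb(vectors, labels, vocab, alpha=1.0):
    """Fit multinomial Naive Bayes on TF-IDF weights with additive smoothing."""
    class_counts = Counter(labels)
    weight_sums = {label: defaultdict(float) for label in class_counts}
    for vec, label in zip(vectors, labels):
        for tok, w in vec.items():
            weight_sums[label][tok] += w
    model = {}
    for label, n_docs in class_counts.items():
        denom = sum(weight_sums[label].values()) + alpha * len(vocab)
        log_lik = {
            tok: math.log((weight_sums[label].get(tok, 0.0) + alpha) / denom)
            for tok in vocab
        }
        model[label] = (math.log(n_docs / len(labels)), log_lik)
    return model


def predict_nb(model, vec):
    """Return the label with the highest log prior plus weighted log likelihood."""
    scores = {
        label: prior + sum(w * log_lik[tok] for tok, w in vec.items())
        for label, (prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy training data, invented for this example only.
TRAINING = [
    ("Team wins the football championship final", "sports"),
    ("Star striker scores twice in football match", "sports"),
    ("New phone ships with faster processor", "technology"),
    ("Software update improves browser performance", "technology"),
]

token_lists = [tokenize(text) for text, _ in TRAINING]
IDF = fit_idf(token_lists)
MODEL = train_nb(
    [tfidf(tokens, IDF) for tokens in token_lists],
    [label for _, label in TRAINING],
    vocab=list(IDF),
)


def classify(text):
    """Classify a piece of text (e.g. a page title) into one of the toy topics."""
    return predict_nb(MODEL, tfidf(tokenize(text), IDF))
```

With this toy data, `classify("football match tonight")` picks `"sports"` and `classify("browser software update")` picks `"technology"`. A real model would be trained offline on a much larger corpus and shipped as data.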
Assignee: nobody → jkoren
Hey :jkoren, I have a few questions about this feature. Could you clarify?

* What's the input of this tagger? I assume it's Places, correct? If so, what tables/columns will be used? How far back do we want to look in the user's browsing history?
* Are we going to use both NB and NMF? Or just try them first to see which one performs better?
* How often do we want to conduct this tagging? As you might know, in personalization V1, we build the site affinity profile in the browser daily-idle handler to minimize the impact of the calculation. I guess we'd use the same strategy for text tagging as well.
Flags: needinfo?(jkoren)
0) This is only the first PR, and it's meant to be a bit general. The code that uses it is coming in another PR. We're trying to queue the PRs up so each part can be looked at effectively and won't overwhelm anyone. That said...

1) We are currently planning for two sources of input into the tagger. The first is the title and description fields from the Places DB. The second is the title and description fields from the items received from the Pocket servers. In our prototype we've been going back 30 days. Without any sort of parallelization, the prototype takes 9 to 10 seconds to process. I would not characterize this as "fast". We don't want small time ranges, because they don't capture enough long-term interests.

2) We're using a two-tier ontology. We've found that the NB algorithm works better on the top level, while the NMF algorithm works better on the lower level. (I don't know why, but it's, like, a lot better.) So we do use them both. Later, we'll probably replace both of these algorithms with something else, but that's what we're using on the Pocket servers now.

3) We've been thinking every 24 hours or so. We could do incremental updates (which would be faster), but it's easier to just recalculate the whole thing while continuing to use the old one until it finishes.
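
To make the two-tier idea in point 2 concrete, here is a rough Python sketch: a Naïve Bayes scorer picks the top-level category, then an NMF-style step picks the subtopic within it. Everything here is an assumption for illustration (the vocabulary, the probability tables, the component matrices, and the function names are invented), and it is not the code from the PR. The lower tier solves for nonnegative activations against a fixed component matrix using multiplicative updates and takes the strongest component as the subtopic.

```python
import math

# Hypothetical vocabulary shared by both tiers.
VOCAB = ["football", "tennis", "goal", "racket", "phone", "browser", "battery", "tab"]
INDEX = {tok: i for i, tok in enumerate(VOCAB)}

# Hypothetical per-class word probabilities standing in for a trained NB model.
TOP_LEVEL = {
    "sports": [0.2, 0.2, 0.2, 0.2, 0.05, 0.05, 0.05, 0.05],
    "technology": [0.05, 0.05, 0.05, 0.05, 0.2, 0.2, 0.2, 0.2],
}

# Hypothetical NMF components per top-level category: (subtopic labels, rows over VOCAB).
LOWER_LEVEL = {
    "sports": (
        ["football", "tennis"],
        [[1, 0, 1, 0, 0, 0, 0, 0], [0, 1, 0, 1, 0, 0, 0, 0]],
    ),
    "technology": (
        ["mobile", "web"],
        [[0, 0, 0, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 1, 0, 1]],
    ),
}


def count_vector(text):
    """Turn text into a bag-of-words count vector over VOCAB."""
    vec = [0.0] * len(VOCAB)
    for tok in text.lower().split():
        if tok in INDEX:
            vec[INDEX[tok]] += 1.0
    return vec


def nb_top_level(vec):
    """Pick the top-level class by summed log likelihood (uniform prior assumed)."""
    scores = {
        label: sum(count * math.log(p) for count, p in zip(vec, probs))
        for label, probs in TOP_LEVEL.items()
    }
    return max(scores, key=scores.get)


def nmf_activations(vec, components, iterations=50, eps=1e-9):
    """Find nonnegative h with vec ~ h . components via multiplicative updates."""
    k = len(components)
    h = [1.0] * k
    numer = [sum(c * v for c, v in zip(row, vec)) for row in components]
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in components] for r1 in components]
    for _ in range(iterations):
        denom = [sum(gram[i][j] * h[j] for j in range(k)) for i in range(k)]
        h = [h[i] * numer[i] / (denom[i] + eps) for i in range(k)]
    return h


def tag(text):
    """Return (top_level, subtopic), or None if no known words are present."""
    vec = count_vector(text)
    if not any(vec):
        return None
    top = nb_top_level(vec)
    labels, components = LOWER_LEVEL[top]
    h = nmf_activations(vec, components)
    return top, labels[h.index(max(h))]
```

In this toy setup, `tag("football goal goal")` yields `("sports", "football")` and `tag("browser tab phone")` yields `("technology", "web")`. Keeping a separate component matrix per top-level category means the lower tier only has to distinguish subtopics inside the category the first tier already chose.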
Flags: needinfo?(jkoren)
Iteration: --- → 64.1 (Sep 14)
Priority: -- → P1
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Blocks: 1489962
Backout by btara@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/d2e41f2f964d
Backed out changeset 8dde92f89a24 for browser_asrouter_cfr.js failures. a=backout

Relanded:
https://hg.mozilla.org/mozilla-central/rev/581019e9ea70
Component: Activity Streams: Newtab → New Tab Page