Closed
Bug 1301717
Opened 8 years ago
Closed 8 years ago
Store extracted website meta data
Categories
(Firefox for Android Graveyard :: General, defect, P1)
Tracking
(firefox52 fixed)
RESOLVED
FIXED
Firefox 52
Tracking | Status | |
---|---|---|
firefox52 | --- | fixed |
People
(Reporter: sebastian, Assigned: Grisha)
References
Details
(Whiteboard: [MobileAS])
Attachments
(4 files)
In bug 1301715 we are going to write some code to extract website metadata. We will need to store it somewhere. It might be useful to store it in a way that it can be used in queries ("Select highlights with images").
Reporter | ||
Updated•8 years ago
|
Comment 1•8 years ago
|
||
Here's the desktop add-on's schema:
https://github.com/mozilla/activity-stream/blob/master/addon/MetadataStore.js
We'll be storing similar in Tofino using Datomish, eventually.
Updated•8 years ago
|
Updated•8 years ago
|
Assignee: nobody → gkruglov
Updated•8 years ago
|
Priority: P2 → P1
Assignee | ||
Comment 2•8 years ago
|
||
There are a few approaches here. One of the core assumptions I'm making is that we don't yet fully know what metadata information we want to store - and once we'll know, it's guaranteed that metadata will evolve over time; we want to optimize for "ease of iteration".
We already have two tables for storing metadata-like data for URLs - in fact, one is even called "metadata".
sqlite> .schema metadata
CREATE TABLE metadata (id INTEGER PRIMARY KEY, url TEXT NON NULL UNIQUE, tileImage STRING, tileColor STRING, touchIcon STRING);
sqlite> .schema urlannotations
CREATE TABLE urlannotations(_id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL, key TEXT NOT NULL, value TEXT, created INTEGER NOT NULL, modified INTEGER NOT NULL, sync_status TINYINT NOT NULL DEFAULT 0 );
"metadata" is used from gecko to store some icon information, but urlannotations is interesting because it's just a KV store, which is currently being used to track an assortment of information (feed subscriptions, screenshots, home screen shortcut interactions, etc).
So we could piggy back on top of urlannotations, and either store individual pieces of metadata information for websites as kv entries, or store the whole blob as JSON. Along with a creation timestamp, either approach should give us an easy migration path should our metadata schema change (which is to be expected), without requiring writing any db schema migrations. Worse case, we'll have to switch our keys to something like "metadata-v2", or just kill off all "metadata" rows.
This approach also has a benefit of being able to track metadata over time - if that's a desired property.
Another approach is to define a concrete schema for metadata, and go with that. This has a benefit of being able to query for individual properties (but that's also possible with a simple KV approach, and harder with a json-blob approach), and will give us overall faster queries (no need to either join a bunch of rows together as in the KV approach, or if metadata is stored as one JSON blob, parse it further) - although the difference might not be significant, especially if compared to using JSON blobs.
Downside to defining a schema and using a separate table is that we now have to maintain it, migrate it, etc., which could be painful and might be just unnecessary friction. On the other hand, perhaps our page metadata won't evolve over time as much as I think it might - so longer term maintenance cost becomes less of a concern. Somehow I doubt this, given how very early we are in the design of this thing for mobile.
I think that using urlannotations strikes a good balance here between "quick to implement", "easy to change", "could be made fast", and it doesn't corner us into any one direction.
Assignee | ||
Updated•8 years ago
|
Priority: P1 → P2
Assignee | ||
Updated•8 years ago
|
Priority: P2 → P1
Reporter | ||
Comment 3•8 years ago
|
||
JSON blobs: That's definitely the least attractive one: For most things (except simple reads) we will need to perform a full table scan and parse every blob. We can't easily select websites with features (Select websites with big header images) and migration is really painful to do (Read, Parse, Write).
URL annotations: Easy to extend and fairly easy to migrate. We only need to store something if we actually have something available. Selecting multiple values requires multiple joins. That's a little bit annoying to write. However it's super flexible.
Separate table: Requires schema changes whenever we introduce a new field. However it is simple to query and join. We will need to store a lot of null values for websites without all metadata. But this should be fairly cheap.
Overall I see the following use cases currently:
* Joining with existing tables to have richer data to display (Join with bookmarks/history)
* Selecting based on metadata ("Get history with nice images")
* (Maybe) performing (partial) ranking of highlights in SQL - The desktop add-on currently performs the ranking after selecting some potential highlights but it would be interesting to perform parts of it in SQLite directly.
We shouldn't be too fixated on not requiring a schema migration. After all it is a straight-forward process supported by the platform (even though always a bit scary). Even if we store JSON blobs then we might end up in a situation where we want to store data differently or drop data and then we have to migrate some kind of schema; even though not formally.
Anyways, right now I prefer the url_annotations path. However we never really thought about how we want to get rid of outdated/old data. So far we just insert/update and never remove anything.
Comment 4•8 years ago
|
||
(In reply to Sebastian Kaspari (:sebastian) from comment #3)
> URL annotations: Easy to extend and fairly easy to migrate. We only need to
> store something if we actually have something available. Selecting multiple
> values requires multiple joins. That's a little bit annoying to write.
> However it's super flexible.
IIRC the URL anno table is stringly-typed. That's not space-efficient for high-volume storage, and self-joins on string-typed unindexed data is very slow, quite apart from the horror of writing the query.
You should also think about how you plan to expire/evict/delete these things, which is hard with annos.
> Separate table: Requires schema changes whenever we introduce a new field.
> However it is simple to query and join. We will need to store a lot of null
> values for websites without all metadata. But this should be fairly cheap.
Nulls are essentially free. However, this approach still either relies on slow table scans (few indices) or takes a lot of space (indexed).
E.g., you want to find sites with a big header image? Do you index on the header image columns, or do you walk the entire table to find out?
> * (Maybe) performing (partial) ranking of highlights in SQL - The desktop
> add-on currently performs the ranking after selecting some potential
> highlights but it would be interesting to perform parts of it in SQLite
> directly.
If you try to solve this problem in the general case in SQLite, you'll end up building a storage system (DAMHIK!).
Trying to build an extensible, queryable, general store with computed ranking is the kind of hard problem that motivated CQRS. (I'll send you a link to a blog post.)
I recommend you _not_ use URL annos, and instead store JSON blobs with some lookup FKs. E.g.,
history_id (for URL)
date_added
has_rich_images
json
Any more than that -- e.g., you want FTS lookup, you want efficient indexed querying of properties, you want to do indexed search over properties you don't know yet -- and you're definitely well into Datomish's territory.
If you want to build a strong app around this data model, with all that stuff, it will involve a lot of work, and might converge with Datomish's reimplementation in Rust… and so thinking in terms of a stop-gap might not be a bad idea.
Reporter | ||
Updated•8 years ago
|
Blocks: as-android-metadata
Reporter | ||
Comment 5•8 years ago
|
||
(In reply to Richard Newman [:rnewman] from comment #4)
> Trying to build an extensible, queryable, general store with computed
> ranking is the kind of hard problem that motivated CQRS. (I'll send you a
> link to a blog post.)
Thanks for that! Interesting read and now I have a bunch of follow-up articles in my reading list.. :)
> I recommend you _not_ use URL annos, and instead store JSON blobs with some
> lookup FKs. E.g.,
>
> history_id (for URL)
> date_added
> has_rich_images
> json
>
> Any more than that -- e.g., you want FTS lookup, you want efficient indexed
> querying of properties, you want to do indexed search over properties you
> don't know yet -- and you're definitely well into Datomish's territory.
That's pretty much the direction we are going in. From the mocks for the MVP we have a rough idea what we need. However we will go through some iterations, so some properties will come and go.
What is the advantage of the JSON blob over columns that are either indexed or not (depending on whether we need to use it in some fancy query). We won't need schema migrations for just adding simple properties but if we need to move a property from the JSON to an unindexed column then this seems to be quite expensive?
Looking at the requirements for the MVP all we need right now is just a way to enrich existing queries with metadata (history with metadata, bookmarks with metadata, ..). So for now even the JSON blob should be good enough for that.
Comment 6•8 years ago
|
||
(In reply to Sebastian Kaspari (:sebastian) from comment #5)
> What is the advantage of the JSON blob over columns that are either indexed
> or not (depending on whether we need to use it in some fancy query). We
> won't need schema migrations for just adding simple properties but if we
> need to move a property from the JSON to an unindexed column then this seems
> to be quite expensive?
The advantage is that you will never need to change your schema just to save a new piece of metadata from a page, and your data ingress and egress becomes much simpler.
My assumption is that you have a big, growing, undefined, sparse set of properties that you'll pull out of a page as JSON with Fathom, and will want to pull back out of the database isomorphically for later use.
A very small, perhaps fixed, subset or function of those columns will be used to _find_ the data: which page, modified time, some measure of computed quality, and the foreign key into history and bookmarks.
If you think that you'll be wanting complicated searches over the metadata -- "find me page metadata with two large images that include a product costing less than $100" -- then you will need to do this the hard way, with a very large table and lots of SQL migrations, splitting up parsed JSON and validating it for saving, and reconstructing JSON from the contents of the table.
IMO if you can avoid that, do, because you will grow a lot of code for maintenance. You might even find that it's more efficient to do a larger sampling query and filter on the JSON in memory.
> Looking at the requirements for the MVP all we need right now is just a way
> to enrich existing queries with metadata (history with metadata, bookmarks
> with metadata, ..). So for now even the JSON blob should be good enough for
> that.
Yup!
Updated•8 years ago
|
Iteration: --- → 1.7
Assignee | ||
Updated•8 years ago
|
Assignee | ||
Comment 7•8 years ago
|
||
I'm going with "store it as a JSON blob" approach. Current schema:
- _id
- history_guid FK (cascading delete, cascading update)
- has_image TINYINT (assuming we'd want to find history items with a nice image to display in highlights, etc)
- json TEXT
We should put some sanity limits on all of the extracted data before we attempt to store it. No sense in storing tons of SEO spam that people often stuff into description/keywords tags. As a first pass on this, adding a character limit is probably enough - that puts an approximate upper limit on how much data we'll store.
Currently we extract the following things out of a page:
- description
- keywords
- icon_url
- image_url
- title
- type
- url
I think our current needs primarily require us to store... "image_url", "type", perhaps "description"?
We already have "title" and "url" of course in the history record, though these might be different as they're coming from "og:*" tags. "icon_url" is most likely the same as what we'll already have as a favicon.
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Reporter | ||
Comment 16•8 years ago
|
||
mozreview-review |
Comment on attachment 8805370 [details]
Bug 1301717 - Schema migration: add page_metadata table
https://reviewboard.mozilla.org/r/89062/#review88844
::: mobile/android/base/java/org/mozilla/gecko/db/BrowserContract.java:186
(Diff revision 2)
> public static final String IS_LOCAL = "is_local";
> }
>
> + public interface PageMetadataColumns {
> + public static final String HISTORY_GUID = "history_guid";
> + public static final String DATE_ADDED = "date_added";
nit: It looks like this introduces a third style for the timestamp. So far we have:
* TIME_CREATED = "timeCreated"
* CREATED = "created"
::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:227
(Diff revision 2)
> + "FOREIGN KEY (" + Visits.HISTORY_GUID + ") REFERENCES " +
> + TABLE_HISTORY + "(" + History.GUID + ") ON DELETE CASCADE ON UPDATE CASCADE" +
> + ");");
> +
> + // Establish a 1-to-1 relationship with History table.
> + db.execSQL("CREATE UNIQUE INDEX page_metadata_history_guid ON " + TABLE_PAGE_METADATA + "("
> + + PageMetadata.HISTORY_GUID + ")");
> + // Improve performance of commonly occurring selections.
> + db.execSQL("CREATE INDEX page_metadata_history_guid_and_has_image ON " + TABLE_PAGE_METADATA + "("
This is the part I'm not sure about. Right now we are extracting metadata whenever we visit a site. However going forward we will have "highlights" consisting of pages that we might not have visited before. It's hard to tell where this data is coming from and where/how we are going to store them though. But without a history entry we couldn't store them here, right?
The favicons and thumbnails table seem to be more relaxed and just use the URL to join.
Are you using the GUID instead of the ID so that this is synchronizable in theory?
Reporter | ||
Comment 17•8 years ago
|
||
mozreview-review |
Comment on attachment 8805373 [details]
Bug 1301717 - Limit what we extract, store incoming page metadata
https://reviewboard.mozilla.org/r/89068/#review88848
::: mobile/android/base/java/org/mozilla/gecko/BrowserApp.java:1933
(Diff revision 2)
> + try {
> + hasImage = !TextUtils.isEmpty(metadata.getString("image_url"));
> + // Missing key, or wrong data type. We're really just catching a missing key here.
> + } catch (NativeJSObject.InvalidPropertyException e) {
> + hasImage = false;
> + }
You could use optString() here if you just care about handling the missing key.
::: mobile/android/base/java/org/mozilla/gecko/GlobalPageMetadata.java:79
(Diff revision 2)
> + // Don't bother with metadata that doesn't have anything we care about.
> + if (metadataJSON.equals("{}")) {
> + return;
> + }
Isn't this something we should do at the beginning of the method? Or on metadataToInsert instead of metadataJSON?
::: mobile/android/base/java/org/mozilla/gecko/GlobalPageMetadata.java:84
(Diff revision 2)
> + boolean inserted = db.insertPageMetadata(contentProviderClient, uri, hasImage, metadataToInsert);
> +
> + if (inserted) {
nit: final or maybe just "if (db.insertPageMeta.... ?
::: mobile/android/base/java/org/mozilla/gecko/GlobalPageMetadata.java:108
(Diff revision 2)
> + synchronized (queuedMetadata) {
> + if (!queuedMetadata.containsKey(uri)) {
> + return;
> + }
> +
> + bundledMetadata = queuedMetadata.get(uri);
> + queuedMetadata.remove(uri);
> + }
> +
> + insertMetadataBundleForUri(uri, bundledMetadata);
> + }
The read is synchronized but the write in doAdd() is not.
::: mobile/android/base/java/org/mozilla/gecko/GlobalPageMetadata.java:140
(Diff revision 2)
> + if (TextUtils.isEmpty(metadataJSON)) {
> + Log.e(LOG_TAG, "Metadata bundle contained empty metadata json");
> + return;
> + }
> +
> + // Insert!
I'm not sure if this is a helpful comment :)
::: mobile/android/base/java/org/mozilla/gecko/GlobalPageMetadata.java:153
(Diff revision 2)
> + /* package-local */ static String processMetadataJSON(@NonNull String json) throws JSONException {
> + final JSONObject unprocessedMetadata = new JSONObject(json);
> + final JSONObject metadata = new JSONObject();
> +
> + if (unprocessedMetadata.has(META_TYPE)) {
> + metadata.put(META_TYPE, unprocessedMetadata.get(META_TYPE));
> + }
> +
> + if (unprocessedMetadata.has(META_IMAGE_URL)) {
> + metadata.put(META_IMAGE_URL, unprocessedMetadata.get(META_IMAGE_URL));
> + }
> +
> + if (unprocessedMetadata.has(META_DESCRIPTION)) {
> + final String description = (String) unprocessedMetadata.get(META_DESCRIPTION);
> + final String descriptionToInsert;
> + if (description.length() > DESCRIPTION_CHAR_LIMIT) {
> + descriptionToInsert = description.substring(0, DESCRIPTION_CHAR_LIMIT);
> + } else {
> + descriptionToInsert = description;
> + }
> + metadata.put(META_DESCRIPTION, descriptionToInsert);
> + }
> +
> + return metadata.toString();
> + }
I wonder if we should do this in WebsiteMetadata.jsm - There's maybe no reason to send everything to Java and then throw most of it away.
In addition to that we should make sure that we only extract things we want to use now or later (and store now). But we could look at this in a follow-up too.
Right now we only need:
* image url
* provider name (Rule tbd - bug 1311164)
I can look at the trello cards to see what we might need in the future. But I think we can omit type and description for now.
Updated•8 years ago
|
Iteration: 1.7 → 1.8
Assignee | ||
Comment 18•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8805373 [details]
Bug 1301717 - Limit what we extract, store incoming page metadata
https://reviewboard.mozilla.org/r/89068/#review88848
> The read is synchronized but the write in doAdd() is not.
It's a synchronized map, so write is synchronized by definition. However, I synchronized on the map itself for compound access (lookup, get, write).
> I wonder if we should do this in WebsiteMetadata.jsm - There's maybe no reason to send everything to Java and then throw most of it away.
>
> In addition to that we should make sure that we only extract things we want to use now or later (and store now). But we could look at this in a follow-up too.
>
> Right now we only need:
> * image url
> * provider name (Rule tbd - bug 1311164)
>
> I can look at the trello cards to see what we might need in the future. But I think we can omit type and description for now.
I'll move this logic to WebsiteMetadata.jsm and limit extraction to just image_url for now, let's not spread the extraction logic across the js/java boundary :-) Once provider extraction is available, we'd only need to change the extraction code.
At that point, Java side of things can assume that it needs to store whatever is received.
Assignee | ||
Comment 19•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8805370 [details]
Bug 1301717 - Schema migration: add page_metadata table
https://reviewboard.mozilla.org/r/89062/#review88844
> This is the part I'm not sure about. Right now we are extracting metadata whenever we visit a site. However going forward we will have "highlights" consisting of pages that we might not have visited before. It's hard to tell where this data is coming from and where/how we are going to store them though. But without a history entry we couldn't store them here, right?
>
> The favicons and thumbnails table seem to be more relaxed and just use the URL to join.
>
> Are you using the GUID instead of the ID so that this is synchronizable in theory?
If highlights information isn't coming from our history, it's still coming from somewhere. We will probably end up with another table to house "never-visited" highlights, which seems like a reasonable assumption. Let's call that table "highlights". If that is the case, we have at least two paths to take:
1) Relax history_guid FK constraint (allow nulls), add nullable highlights_id FK, enforce that both can't be null (should be possible with a CHECK constraint).
2) Drop foreign keys in the metadata table, and move them to history table and to highlights table.
Both approaches will let us store metadata regardless where it's coming from in the same table, without having to duplicate URLs, and keeping benefits of things like database-level integrity checks and cascading updates/deletes.
First approach models our hypothetical data relationships better. We'll have [history/highlights 0..1 <--> 0..1 metadata]. 2nd approach will be [history/highlights 0..many <--> 0..1]. If our history/highlights are unique, that shouldn't be necessary.
I think either GUID or ID could be used here. I'm mostly following what we're doing in other cases when we're referring to history records (e.g. visits table).
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Reporter | ||
Comment 28•8 years ago
|
||
(In reply to :Grisha Kruglov from comment #19)
> If highlights information isn't coming from our history, it's still coming
> from somewhere. We will probably end up with another table to house
> "never-visited" highlights, which seems like a reasonable assumption. Let's
> call that table "highlights". If that is the case, we have at least two
> paths to take:
> 1) Relax history_guid FK constraint (allow nulls), add nullable
> highlights_id FK, enforce that both can't be null (should be possible with a
> CHECK constraint).
> 2) Drop foreign keys in the metadata table, and move them to history table
> and to highlights table.
>
> Both approaches will let us store metadata regardless where it's coming from
> in the same table, without having to duplicate URLs, and keeping benefits of
> things like database-level integrity checks and cascading updates/deletes.
One existing example is bookmarks. They show up in highlights but there might not be a history record for an existing bookmark. Right now this means we do not have any metadata either (because we only extract it on visit) - but eventually we might need some mechanism to grab metadata for new sites. Well, that's a problem we can tackle later and then adapt the schema.
Reporter | ||
Comment 29•8 years ago
|
||
mozreview-review |
Comment on attachment 8805370 [details]
Bug 1301717 - Schema migration: add page_metadata table
https://reviewboard.mozilla.org/r/89062/#review90060
::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:2116
(Diff revision 4)
> case 35:
> upgradeDatabaseFrom34to35(db);
> break;
> +
> + case 36:
> + upgradeDatabaseFrom35to36(db);
nit: break? Preventing the next one from forgetting it. :)
Attachment #8805370 -
Flags: review?(s.kaspari) → review+
Reporter | ||
Comment 30•8 years ago
|
||
mozreview-review |
Comment on attachment 8805371 [details]
Bug 1301717 - Allow querying, inserting and cleaning up page metadata via BrowserProvider
https://reviewboard.mozilla.org/r/89064/#review90080
::: mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java:581
(Diff revision 4)
> - deleted = deleteHistory(uri, selection, selectionArgs);
> + deletePageMetadataForHistory(db, historyGUIDs);
> + deleted = deleteHistory(db, uri, selection, selectionArgs);
> deleteUnusedImages(uri);
Interestingly we delete favicons/thumbnails here without having a foreign key reference (just by URL).
Attachment #8805371 -
Flags: review?(s.kaspari) → review+
Reporter | ||
Comment 31•8 years ago
|
||
mozreview-review |
Comment on attachment 8805372 [details]
Bug 1301717 - Dispatch an event informing listeners that uri has been stored in history table
https://reviewboard.mozilla.org/r/89066/#review90084
Attachment #8805372 -
Flags: review?(s.kaspari) → review+
Reporter | ||
Comment 32•8 years ago
|
||
mozreview-review |
Comment on attachment 8805373 [details]
Bug 1301717 - Limit what we extract, store incoming page metadata
https://reviewboard.mozilla.org/r/89068/#review90068
::: mobile/android/base/java/org/mozilla/gecko/BrowserApp.java:1932
(Diff revision 4)
> - // TODO: Store metadata (Bug 1301717)
> + boolean hasImage;
> + try {
> + hasImage = !TextUtils.isEmpty(metadata.optString("image_url", null));
> + // Somehow we got the type wrong!
> + } catch (NativeJSObject.InvalidPropertyException e) {
> + Log.e(LOGTAG, "Type of image_url metadata is not String");
> + return;
> + }
> +
> + final boolean hasImageToPost = hasImage;
nit: I think now hasImage can be final and hasImageToPost is not needed anymore.
::: mobile/android/base/java/org/mozilla/gecko/BrowserApp.java:1935
(Diff revision 4)
> + // Somehow we got the type wrong!
> + } catch (NativeJSObject.InvalidPropertyException e) {
I wonder if it's in this case okay to just crash - that's something we notice and can fix.
::: mobile/android/base/java/org/mozilla/gecko/GlobalPageMetadata.java:77
(Diff revision 4)
> + // NB: This also means we won't override existing metadata with empty metadata.
> + if (preparedMetadataJSON.equals("{}")) {
This is interesting. Maybe "this website does not have any metadata" (= empty JSON object) is an information we are interested in. It at least tells us that we do not need to fetch metadata actively (if we are going to implement this).
Maybe it's too early for this optimization? Not clearing metadta that does not exist anymore /could/ be a problem.
::: mobile/android/base/java/org/mozilla/gecko/GlobalPageMetadata.java:82
(Diff revision 4)
> + // If we could insert page metadata, we're done.
> + if (db.insertPageMetadata(contentProviderClient, uri, hasImage, preparedMetadataJSON)) {
> + return;
> + }
> +
> + // Otherwise, we need to queue it for future insertion when history record is available.
> + Bundle bundledMetadata = new Bundle();
> + bundledMetadata.putBoolean(KEY_HAS_IMAGE, hasImage);
> + bundledMetadata.putString(KEY_METADATA_JSON, preparedMetadataJSON);
> + queuedMetadata.put(uri, bundledMetadata);
Isn't there a chance for a race condition to occur here? -> We can't add the metadata here and the history event comes in before we added the event to the queue.
Or is it guaranteed that both calls will happen on the same thread (and not concurrent)? If this is the case then let's add an ThreadUtils assert call.
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 38•8 years ago
|
||
mozreview-review-reply |
Comment on attachment 8805373 [details]
Bug 1301717 - Limit what we extract, store incoming page metadata
https://reviewboard.mozilla.org/r/89068/#review90068
> This is interesting. Maybe "this website does not have any metadata" (= empty JSON object) is an information we are interested in. It at least tells us that we do not need to fetch metadata actively (if we are going to implement this).
>
> Maybe it's too early for this optimization? Not clearing metadta that does not exist anymore /could/ be a problem.
Agreed that this needs more thought and at this point it's an optimization we don't need. I've added code/tests to ensure that trying to save "{}" metadata is equivalent to deleting metadata for a given record. Also an optimization, but a straightforward one :-)
> Isn't there a chance for a race condition to occur here? -> We can't add the metadata here and the history event comes in before we added the event to the queue.
>
> Or is it guaranteed that both calls will happen on the same thread (and not concurrent)? If this is the case then let's add an ThreadUtils assert call.
They will both be called from the same background thread. I've added an assert call to ensure.
Reporter | ||
Comment 39•8 years ago
|
||
mozreview-review |
Comment on attachment 8805373 [details]
Bug 1301717 - Limit what we extract, store incoming page metadata
https://reviewboard.mozilla.org/r/89068/#review90308
Great! Let's see how it works in Nightly. :)
::: mobile/android/base/java/org/mozilla/gecko/GlobalPageMetadata.java:44
(Diff revision 6)
> + private static final GlobalPageMetadata instance = new GlobalPageMetadata();
> +
> + private static final String KEY_HAS_IMAGE = "hasImage";
> + private static final String KEY_METADATA_JSON = "metadataJSON";
> +
> + private final Map<String, Bundle> queuedMetadata = Collections.synchronizedMap(new HashMap<String, Bundle>());
The only thing I'm a little bit afraid of is that we queue up things here and (for whatever reason) do not get a "visited" call for some URLs.
In theory it would be enough to just queue the last metadata and wait for the history event. However visits / metadata parsing may happen faster than visits are reported (guess). But I think we could - to be safe - limit the size of the map, so that it doesn't grow endlessly, e.g.:
http://docs.oracle.com/javase/7/docs/api/java/util/LinkedHashMap.html#removeEldestEntry%28java.util.Map.Entry%29
In addition to that it would be nice to log whenever we reach the limit - if this happens often then we should revisit the approach.
Attachment #8805373 -
Flags: review?(s.kaspari) → review+
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 44•8 years ago
|
||
Pushed by gkruglov@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4339d7d42fc2
Schema migration: add page_metadata table r=sebastian
https://hg.mozilla.org/integration/autoland/rev/ee201a2c701f
Allow querying, inserting and cleaning up page metadata via BrowserProvider r=sebastian
https://hg.mozilla.org/integration/autoland/rev/03343ba4f573
Dispatch an event informing listeners that uri has been stored in history table r=sebastian
https://hg.mozilla.org/integration/autoland/rev/5063c78131ec
Limit what we extract, store incoming page metadata r=sebastian
Comment 45•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/4339d7d42fc2
https://hg.mozilla.org/mozilla-central/rev/ee201a2c701f
https://hg.mozilla.org/mozilla-central/rev/03343ba4f573
https://hg.mozilla.org/mozilla-central/rev/5063c78131ec
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox52:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 52
Updated•4 years ago
|
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•