Consider understanding the unexpected files insides the origin directories
Categories
(Core :: Storage: Quota Manager, enhancement, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox69 | --- | fixed |
People
(Reporter: tt, Assigned: tt)
References
Details
Attachments
(2 files, 1 obsolete file)
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/plain
|
chutten
:
data-review+
|
Details |
We've landed some fixes based on the reports on Bugzilla. However, the number of having unexpected files on the origin directories goes up and down recently for no visible reasons [1].
We are considering the fix this issue a bit more aggressive rather than just waiting for files bugs from users. So, at least, inside origin directories, all the files are managed by the QuotaManager and it only allows file on the allow list. In this level, it shouldn't have a file with a sensitive filename (e.g. filename with origins or something). Unexpected files are expected to be created by OS or anti-virus application.
Therefore, maybe collected filename on this level (inside the origin directory) and only on the Nightly channel can work. If we can do that, that will help us fix this issue efficiently.
[1] https://sql.telemetry.mozilla.org/queries/61236/source#157855
Assignee | ||
Comment 1•6 years ago
|
||
Hey Chris,
We are considering getting the filenames of unexpected files in the origin directories (For instance, $profile/storage/default/https+++foo.com/UNEXPECTED_FILE).
In these directories, QuotaManager only allows or expect files like: clients' folder (e.g. idb/, cache/, ...), metadata files (e.g. .metadata, .metadata-v2), OS metadata (e.g. Desktop.ini, ...), and dot-files (e.g. .xxx). And, if there is any file aside from this allow-list, then it would just block the process of initialization. We haven't been bother with these for a long while.
Recently, [1] shows that we got more errors about unexpected files on origin directories. Since we haven't landed any relative code in to m-c, I believe these files are just some files haven't been included in our allow-list (instead of a regression).
Our current approach for dealing with this is a bit passive, we adjust the allow-list based on the reports on Bugzilla. I wonder if it's possible to take a little more aggressive approach: collecting these file name on Nighlty only for few versions.
So, do you have any suggestion about collecting these data (filename)? We lean to recokn that there shouldn't have any file with a sensitive filename in this level. I believe most of name we will get is OS-created files (e.g. thumb.db) which should be Category 1.
Furthermore, I don't think there would have sensitive filenames, but if you are worried about this, would you recommend collecting filename in the origin directories on Nightly only by Prio? Thanks!
[1] https://sql.telemetry.mozilla.org/queries/61236/source#157855
Comment 2•6 years ago
|
||
Prio encoding at this stage is only capable of recording boolean information, and reporting aggregate counts of the boolean values from the population. Recording unknown strings isn't one of its capabilities at the present time, so I don't think it'd be the right tool for the job.
If we have a list of potential candidate os-created file names, let's write them down and count them. (Maybe as a categorical histogram). If we're still left with too many unknowns we can broach the question of whether the names of files in that directory can be reported, and how we'd report them.
Or, if we'd like to skip right to that, we can pose a question to Marshall about which Category of collection file names in that obscure technical folder would be.
Assignee | ||
Comment 3•6 years ago
|
||
(In reply to Chris H-C :chutten from comment #2)
Thanks for the reply!
Prio encoding at this stage is only capable of recording boolean information, and reporting aggregate counts of the boolean values from the population. Recording unknown strings isn't one of its capabilities at the present time, so I don't think it'd be the right tool for the job.
I see, so the Prio is not an option for now.
If we have a list of potential candidate os-created file names, let's write them down and count them. (Maybe as a categorical histogram). If we're still left with too many unknowns we can broach the question of whether the names of files in that directory can be reported, and how we'd report them.
The thing is that we've put all reported or knowing os-created file in our allow-list and we still got errors for unexpected files on different directories.
So, I guess you are right. My question should be "whether the names of files in the origin directory can be reported?". Personally, I think we can, because:
- These directories are managed by QuotaManager, putting a self-defined file would just break the functionality of the QuotaManager. Users cannot get any benefit, so I don't think users intend to do that.
- If it's on the parents level of the origin directories, there are more chances to get the names of origin. However, in the origin directories level, we shouldn't get these data.
But, I have to also admit that there is a rather small chance to get a sensitive filename.
Or, if we'd like to skip right to that, we can pose a question to Marshall about which Category of collection file names in that obscure technical folder would be.
Could you point me out where should raise the discussion or just in this bug would be okay? Thanks!
Comment 4•6 years ago
|
||
For raising questions with Marshall a needinfo on a bug with a self-contained summary background+question comment usually does the trick.
I wonder, though, if maybe we should look into what we'd do with the data. If we received perfect data that half of the problematic files were path/to/filename.ext
... what then? What actions would we take?
What if it were only 1/4 of the problematic files? 1% of them?
Will our actions be different if we can intuit from the filename what its function is? Or will the response always be "whitelist anything that comes up"?
( I ask because sometimes we end up learning that a no-knowledge solution would be roughly equivalent to a perfect-knowledge solution so we can skip the phase of data collection, collation, and analysis and jump straight to solving the problem )
Assignee | ||
Comment 5•6 years ago
|
||
(In reply to Chris H-C :chutten from comment #4)
For raising questions with Marshal a needinfo on a bug with a self-contained summary background+question comment usually does the trick.
Got it! Thanks!
I wonder, though, if maybe we should look into what we'd do with the data. If we received perfect data that half of the problematic files were
path/to/filename.ext
... what then? What actions would we take?
If the filename doesn't make any sense (e.g. abc.ext) and its counter is quite low, we would consider removing them while initialization.
If the filename is probably created by the OS, we would add them into our allow-list.
I think, if there is a filename and it's counted myriad times, we would consider ignoring these files since that probably means a lot of users are doing that and we should maybe just ignore that.
What if it were only 1/4 of the problematic files? 1% of them?
The number of these unexpected files with the same name might be a factor for us to make the decision. And, that would be useful for us to check if it's a regression or not.
Will our actions be different if we can intuit from the filename what its function is? Or will the response always be "whitelist anything that comes up"?
Yes, basically, our major problem now is that we are not sure if we can either just ignore, remove while finding them if we don't know their name.
If we take an aggressive approach (always remove the unexpected files):
- we might drop our performance if these are os files.
If we just ignore all the unexpected files,
- we might not be able to find the problem if these are created by us accidentally.
- some users might complain about Firefox eats too much space while QuotaManager thinks it's okay (since we ignore these files).
( I ask because sometimes we end up learning that a no-knowledge solution would be roughly equivalent to a perfect-knowledge solution so we can skip the phase of data collection, collation, and analysis and jump straight to solving the problem )
It would be really helpful if we can collect data before making a decision even collecting them just for few releases. Also, we can apply the allow-list to different directories (e.g. repositories directories, this level contains files with origin names).
Comment 6•6 years ago
|
||
Great, thank you for so fully answering my questions. Let's see if I can craft a good Marshall comment :)
Marshall, we're looking for a decision on what Data Collection Category filenames within an internal, obscure Firefox folder would be. There should be no reason for anyone to put anything there on purpose unless by some automated means, but there's also nothing preventing someone from putting something meaningful there.
We need to know the names of these files and the frequencies they appear to learn if it would be better to ignore or to wipe out these files. If we can't figure that out, we'd have to treat them the same: either ignoring everything (in which case we might accidentally balloon our own storage without limit) or removing everything (in which case we might be fighting with an automated process about who can use the disk fastest).
If this ends up being Cat3 or above, can you think of a mitigation that would be acceptable to satisfy these requirements?
Chris, I can't think of a mitigation here that would help. In similar situations in the past, we use a white list to only collect non-sensitive fields. But if I understand this bug correctly, if we knew this correct whitelist, we wouldn't need the collection in the first place.
This does though strike me as category 1, with some very small (or possibly non-existent) possibility that the data is category 3. In that case, it seems reasonable to me to collect the data and it is within policy because there is no practical risk. If our assumptions are wrong, we could then make a correction and stop the collection.
Tom, can you plan on revisiting this once you have the data just to confirm that it is category 1 as expected?
Assignee | ||
Comment 8•5 years ago
|
||
(In reply to Merwin from comment #7)
Chris, I can't think of a mitigation here that would help. In similar situations in the past, we use a white list to only collect non-sensitive fields. But if I understand this bug correctly, if we knew this correct whitelist, we wouldn't need the collection in the first place.
This does though strike me as category 1, with some very small (or possibly non-existent) possibility that the data is category 3. In that case, it seems reasonable to me to collect the data and it is within policy because there is no practical risk. If our assumptions are wrong, we could then make a correction and stop the collection.
Tom, can you plan on revisiting this once you have the data just to confirm that it is category 1 as expected?
Thanks for the information! I'll definitely revisit this once I get the data.
My plan is only collect filename of unexpected files in the origin directories on Nightly Channel. Note that these filenames shouldn't contain information related to origin. Let's only collect that for 69 as our first baby step if it's possible. Then, I will discuss the result on our team meeting and make a decision if we want to keep collecting them in the future.
I'll work on a patch as soon as possible.
Assignee | ||
Comment 9•5 years ago
|
||
IIUC of the document, there seem to have two ways to collect unknown strings. One is Scalar (kind string), but a user might have multiple unexpected files on origin directories of their own profile. Another seems to be keyed Scalar (kind uint/boolean), but I'm not so sure if there will be more than 100 different filenames.
Chris, do you happen to know if there is another way to get unknown strings by Scalar/Histograms? Thanks in advance!
Comment 10•5 years ago
|
||
Your understanding is correct: we don't make it easy to collect batches of unknown strings in Telemetry.
Keyed scalars (uint or bool) might be the best way to transport this information, as awkward as that may be.
Assignee | ||
Comment 11•5 years ago
|
||
Assignee | ||
Comment 12•5 years ago
|
||
(In reply to Chris H-C :chutten from comment #10)
Your understanding is correct: we don't make it easy to collect batches of unknown strings in Telemetry.
Keyed scalars (uint or bool) might be the best way to transport this information, as awkward as that may be.
Thanks! I decided to try the keyed scalars (uint).
Assignee | ||
Comment 13•5 years ago
|
||
Comment 14•5 years ago
|
||
Assignee | ||
Comment 15•5 years ago
|
||
I'm going to split the patch into two pieces. One is implementation and another one is the test because I somehow cannot get Services.telemetry.getSnapshotForKeyedScalars("main", false)["parent"];
on platforms of try server (https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=252322639&repo=try&lineNumber=5742). I can get that on my Mac, though.
I will land the implementation part today.
Updated•5 years ago
|
Assignee | ||
Comment 16•5 years ago
|
||
Depends on D34721
Comment 17•5 years ago
|
||
Comment 18•5 years ago
|
||
bugherder |
Updated•4 years ago
|
Description
•