Draw a FSM about how the metrics ping should work
Categories
(Data Platform and Tools :: Glean: SDK, task, P1)
Tracking
(Not tracked)
People
(Reporter: Dexter, Assigned: brizental)
References
Details
(Whiteboard: [telemetry:glean-rs:m11])
Attachments
(2 files, 8 obsolete files)
We should document (with a finite state machine?) how the metrics ping scheduler should work. Let's look at the docs for this, not at the implementation :)
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Assignee | ||
Comment 1•5 years ago
|
||
This is a first draft of the diagram: https://drive.google.com/file/d/1_2rLawKlCdeV3YZXT1ZSaEPkmJDQRm_X/view?usp=sharing
Two things I am not happy about:
- I don't like the date notation I used, if anyone knows how to make it clearer, let me know.
- I don't think the example about timezones in
**
helps much, what do you think?
One thing I think is a hole in the loop:
- Here is a situation:
- It is a fresh install and it is before 4am
- Collection get's scheduled for 4am the same day
- User closes the app before the collection is done
- User then opens the app after 4am
I believe that app will see that it is past scheduled time and collect as soon as the app is opened again, but I don't foresee this situation in the diagram. I always rely on the on app start to check if it is time to collect or to schedule, but in this situation there is no measurement window start date
yet.
Either one of these two options seem possible as far as I understand:
- There is actually a measurement window that is started when the user first opens the app from a fresh install before checking if is past 4am, but I didn't add this step to the diagram.
- The user will go the "fresh install" path everytime they open the app until they collect something. And then, maybe I should change the
is this a fresh install?
condition tohas this user ever collected the metrics ping?
, because that would be clearer. From what I understandis this a fresh install?
now means "is this the first time ever the user is opening firefox in this machine".
PS.: I will add more color to this diagram after discussion and changes.
Comment 2•5 years ago
|
||
This is a good start!
First, I think that your date formats aren't really necessary. Would it be more clear to just refer to the dates by what they represent? For instance, rather than saying "start new measurement windows at dd/mm/yy", you can say "Start new measurement window for current day", or "Is it past the end of the 4am window for the current day?"
The timezones example is a good specific case that deserves being called out, but rather than a footnote, maybe call it out directly as a special case that needs to be handled correctly.
In regards to the situation where you think there is a hole, when the app is first opened, regardless of whether it's before or after 4am, should create the window. So the next time the app is launched, after 4 am, the flow chart should have an active measurement window "start" to "close".
I think the fresh install terminology isn't necessary. Remember that there are 3 cases that we are covering when Glean first starts up:
- The ping was already sent today, so schedule one for the next 4AM interval (this wouldn't be true on a fresh install as we wouldn't have sent anything yet
- The ping was not sent today and it's past the due time, so collect and send immediately and schedule for the next calendar day (If this were a fresh install, then collection would return no data and the ping would be ignored)
- The ping was not sent today and it's before the due time, so just schedule for the next 4AM interval (If this were a fresh install, it would collect at the 4am window, and if there was no data it would ignore the ping, otherwise it would go ahead and send it)
So I think your flow chart can be cleaned up to show the above 3 paths more clearly.
It's also important to point out that there is another entry point: when the 4AM window occurs, WorkManager
wakes the app to send the ping and reschedules itself if there was data for that day.
I attached a png example that may help explain the flow I'm explaining.
Reporter | ||
Comment 3•5 years ago
|
||
Nothing to add to what Travis said :)
Comment 4•5 years ago
|
||
I thought of something to add, that might be sort of important to help understand the logic, now that I think about it: Whenever the collection is scheduled, it replaces any and all other scheduled metrics ping collection jobs. The replacing part is important because we don't want multiple jobs waking up at 4AM to try and do the same thing. So if there is already a job scheduled for the next 4AM interval, and the application starts and reschedules it, then the new one would just replace the existing one.
Assignee | ||
Comment 5•5 years ago
|
||
(In reply to Travis Long [:travis_] from comment #2)
Thanks for the detailed review, Travis!
I agree with the date formats being unnecessary, I was very caught up on that and needed someone to just shake it off me, thanks :)
I changed the diagram based on your observations and I think it is much clearer, but after I did it I realized that it is actually not exactly the same as the documentation describes. If you take the last example "The user changes timezone in an extreme enough fashion that they cross 4AM twice on the same date", if we just check if the metrics was collected today to know if we should collect it, that example will play out differently and the user will indeed collect on Feb8, since they will open the app and the app will realise it is past 4am and metrics ping hasn't been collected.
Unfortunately there was no way for me to recover the previous diagram so I just redrew it taking into account your observations as well. It is much more complex than the new one. I left the "fresh install" and "measurement window start date" stuff because that is what I understood from the documentation.
I think the new version makes it much easier to understand and also makes timezones not so confusing because there is no more comparing dates that might be from different timezones.
I left the two diagrams there for comparison.
About the WorkManager
stuff, it is very platform specific. Do you think it makes sense to add it to this diagram?
Reporter | ||
Comment 6•5 years ago
|
||
Some thoughts:
- can you please attach the diagram to this bug?
- I think it's ok to remove the "is there any data"/"send" parts, as they are not really part of the metrics ping scheduling. You can also rename "collect" to "submit", which is a more appropriate terminology
- the
WorkManager
is platform specific, but the 4am signal it sends it isn't. - Can you make the two diagrams two different files?
- Did you generate the diagram with graphviz?
Comment 7•5 years ago
|
||
I think I like the flow of your new diagram much better!
I'm still not sure we need to treat the fresh install any differently than the case of "We haven't sent the metrics ping today". Both of those situations get handled the same way: We check to see when the last ping was sent, and if it wasn't today (date = null or date = yesterday), then we collect the ping. Since we will only send if there is data, a fresh install will probably into the "no-data so don't send" category. I'm just not certain it deserves a special path in your diagram because it makes it seem like we check for a fresh install every time we start, and we don't we currently just check for the last ping sent date/time.
As for the timezone concern, if we are scheduling this for a specific time and date, then crossing the 4AM time boundary twice shouldn't matter, we would have already collected for that day so when the scheduler runs it should say "I've already collected today, just reschedule for the next day at 4AM". The side effect of this that I can see is that a metrics ping window may be larger or smaller than 24 hours because of timezone changes.
The WorkManager
stuff is very platform specific, you are right. Unfortunately that's our only platform with a Metrics ping scheduler so that's the only logical example we can examine for how this works, and right now WorkManager
is tied up in that logic because of it's behavior. If you can find a way to disentangle it from the logic for your diagram, that would be ideal, but I'm not sure we can for diagramming how it works right now. This may be something that needs to change as we move the logic into Rust.
Comment 8•5 years ago
|
||
Also, I recommend https://draw.io for drawing flow chart diagrams
Assignee | ||
Comment 9•5 years ago
|
||
Assignee | ||
Comment 10•5 years ago
|
||
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 11•5 years ago
|
||
Assignee | ||
Comment 12•5 years ago
|
||
Assignee | ||
Comment 13•5 years ago
|
||
Assignee | ||
Comment 14•5 years ago
|
||
Answering [:Dexter]:
Did you generate the diagram with graphviz?
I used draw.io, I found it easier than graphviz.
Answering [:travis_]:
I think the way you worded how it works vs. how it is worded in the documentation are very different, but ultimately both do the same thing. Sorry for my confusion. I also think the "REVISED" version of the diagram is much easier to follow.
To add the WorkManager
part, I think it would make sense to bring back the "Is there data?" part. Let me know what you think.
OBS.: Just realized I forgot to change "has the metrics ping been collected today" to "has the metrics ping been submitted today".
Assignee | ||
Comment 15•5 years ago
|
||
Okay, I was thinking about this and I still think, even though both diagrams accomplish basically the same thing, there are slight differences. Because the path they take to accomplish it is different.
If you take again the last example in the documentation "The user changes timezone in an extreme enough fashion that they cross 4AM twice on the same date" and change the first step from "The application is opened on Feb 7 at 3PM in timezone UTC+11, closed at 3:05PM" to "The application is opened on Feb 6 at 3PM in timezone UTC+11, closed at 3:05PM", it will play out differently in each path. In the revised path the app will collect on step 3., which it shouldn't according to documentation.
Assignee | ||
Comment 16•5 years ago
|
||
Assignee | ||
Comment 17•5 years ago
|
||
Assignee | ||
Comment 18•5 years ago
|
||
Assignee | ||
Comment 19•5 years ago
|
||
Reporter | ||
Updated•5 years ago
|
Comment 20•5 years ago
|
||
Assignee | ||
Comment 21•5 years ago
|
||
(In reply to Travis Long [:travis_] from comment #20)
No worries, [:travis_]! I really appreciate the discussion.
When I first started this task, the idea was to make a diagram only from what is in the documentation (see comment #1), that is one of the reasons I said I'd do it, because I have never even seen the implementation. After that, someone else would make a similar diagram only from looking at the implementation and we would see the differences, etc.
I agree with you on all your observations, about the WorkManager
and the "fresh install" and I even think that the diagram I drew after your explanations is much easier to follow than the one I drew from documentation.
Still, the new diagram is not accurate to what is in the documentation. So, if you take very literally the task as I understood it, the first diagram is what fulfills it, and not the latter. I think, if I would be to make a new diagram based on implementation, that would be a new task and would require me to go over the code and back to the drawing board.
What do you think [:Dexter]?
Reporter | ||
Comment 22•5 years ago
|
||
(In reply to Beatriz Rizental from comment #21)
(In reply to Travis Long [:travis_] from comment #20)
No worries, [:travis_]! I really appreciate the discussion.
When I first started this task, the idea was to make a diagram only from what is in the documentation (see comment #1), that is one of the reasons I said I'd do it, because I have never even seen the implementation. After that, someone else would make a similar diagram only from looking at the implementation and we would see the differences, etc.
Yes, this bug is specifically about getting the behaviour from the docs. After this bug is resolved, we will go through the implementation and look for discrepancies. This is not a task for this bug.
I agree with you on all your observations, about the
WorkManager
and the "fresh install" and I even think that the diagram I drew after your explanations is much easier to follow than the one I drew from documentation.
I agree with Travis about the importance of the 4AM signal, I don't agree we should mention the WorkManager
. We just need to add a new entry point somewhere called "4AM trigger".
I think it's ok to call out "fresh install", as this is specifically mentioned in the docs.
Still, the new diagram is not accurate to what is in the documentation. So, if you take very literally the task as I understood it, the first diagram is what fulfills it, and not the latter. I think,
if I would be to make a new diagram based on implementation, that would be a new task and would require me to go over the code and back to the drawing board.
What do you think [:Dexter]?
I think that this bug is very specific about "diagram from the docs" :) The rest will come as separate bugs. Please don't look at the implementation! As you said, the whole reason why you're doing this, is that you're new to the codebase and are not biased ;)
Assignee | ||
Comment 23•5 years ago
|
||
(In reply to Alessio Placitelli [:Dexter] from comment #22)
I agree with Travis about the importance of the 4AM signal, I don't agree we should mention the WorkManager. We just need to add a new entry point somewhere called "4AM trigger".
If we are going to be consistent about the "only from current documentation" part of this task, then I disagree with adding the 4AM trigger. The documentation doesn't mention it, in fact all the examples in it describe the scheduler as "only works when the app is open", there is no mention of background tasks or anything.
Reporter | ||
Comment 24•5 years ago
|
||
(In reply to Beatriz Rizental from comment #23)
(In reply to Alessio Placitelli [:Dexter] from comment #22)
I agree with Travis about the importance of the 4AM signal, I don't agree we should mention the WorkManager. We just need to add a new entry point somewhere called "4AM trigger".
If we are going to be consistent about the "only from current documentation" part of this task, then I disagree with adding the 4AM trigger. The documentation doesn't mention it, in fact all the examples in it describe the scheduler as "only works when the app is open", there is no mention of background tasks or anything.
The reason why I think a 4AM trigger is needed, is that in your diagram you say "schedule .. at 4am": scheduling doesn't imply sending, necessarily :) So, to make that reflect the docs, you'd still need some way to represent the fact that something which was previously scheduled happens at 4am and collects the ping. Does this make sense?
If you think this is already captured by your diagram, then it's fine, but please call it out :)
Comment 25•5 years ago
|
||
r+ from me with [:Dexter] comment about 4AM trigger addressed. Great work!
Updated•5 years ago
|
Assignee | ||
Comment 26•5 years ago
|
||
[:Dexter] Here is the diagram now with the 4am trigger, as discussed.
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 27•5 years ago
|
||
Assignee | ||
Comment 28•5 years ago
|
||
Assignee | ||
Comment 29•5 years ago
|
||
Sorry, there was an extra arrow in the previous one. This is the last change.
Should I close this bug as FIXED? What are the next steps here?
Reporter | ||
Comment 30•5 years ago
|
||
(In reply to Beatriz Rizental from comment #29)
Sorry, there was an extra arrow in the previous one. This is the last change.
Should I close this bug as FIXED? What are the next steps here?
Yes please, do close this as fixed. Can you please file a follow-up bug (depending on this one) about checking if the implementation matches what's highlighted in this diagram? :)
Assignee | ||
Updated•5 years ago
|
Description
•