Closed Bug 1590123 Opened 5 years ago Closed 5 years ago

Draw a FSM about how the metrics ping should work

Categories

(Data Platform and Tools :: Glean: SDK, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: brizental)

References

Details

(Whiteboard: [telemetry:glean-rs:m11])

Attachments

(2 files, 8 obsolete files)

We should document (with a finite state machine?) how the metrics ping scheduler should work. Let's look at the docs for this, not at the implementation :)

Priority: -- → P3
Whiteboard: [telemetry:glean-rs:m?]
Assignee: nobody → brizental
Priority: P3 → P1
Summary: Draw a FSM abou thow the metrics ping should work → Draw a FSM about how the metrics ping should work
Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:m11]

This is a first draft of the diagram: https://drive.google.com/file/d/1_2rLawKlCdeV3YZXT1ZSaEPkmJDQRm_X/view?usp=sharing


Two things I am not happy about:

  • I don't like the date notation I used, if anyone knows how to make it clearer, let me know.
  • I don't think the example about timezones in ** helps much, what do you think?

One thing I think is a hole in the loop:

  • Here is a situation:
    • It is a fresh install and it is before 4am
    • Collection get's scheduled for 4am the same day
    • User closes the app before the collection is done
    • User then opens the app after 4am

I believe that app will see that it is past scheduled time and collect as soon as the app is opened again, but I don't foresee this situation in the diagram. I always rely on the on app start to check if it is time to collect or to schedule, but in this situation there is no measurement window start date yet.

Either one of these two options seem possible as far as I understand:

  • There is actually a measurement window that is started when the user first opens the app from a fresh install before checking if is past 4am, but I didn't add this step to the diagram.
  • The user will go the "fresh install" path everytime they open the app until they collect something. And then, maybe I should change the is this a fresh install? condition to has this user ever collected the metrics ping?, because that would be clearer. From what I understand is this a fresh install? now means "is this the first time ever the user is opening firefox in this machine".

PS.: I will add more color to this diagram after discussion and changes.

Flags: needinfo?(tlong)
Flags: needinfo?(alessio.placitelli)
Attached image Untitled Diagram.png (deleted) —

This is a good start!

First, I think that your date formats aren't really necessary. Would it be more clear to just refer to the dates by what they represent? For instance, rather than saying "start new measurement windows at dd/mm/yy", you can say "Start new measurement window for current day", or "Is it past the end of the 4am window for the current day?"

The timezones example is a good specific case that deserves being called out, but rather than a footnote, maybe call it out directly as a special case that needs to be handled correctly.

In regards to the situation where you think there is a hole, when the app is first opened, regardless of whether it's before or after 4am, should create the window. So the next time the app is launched, after 4 am, the flow chart should have an active measurement window "start" to "close".

I think the fresh install terminology isn't necessary. Remember that there are 3 cases that we are covering when Glean first starts up:

  • The ping was already sent today, so schedule one for the next 4AM interval (this wouldn't be true on a fresh install as we wouldn't have sent anything yet
  • The ping was not sent today and it's past the due time, so collect and send immediately and schedule for the next calendar day (If this were a fresh install, then collection would return no data and the ping would be ignored)
  • The ping was not sent today and it's before the due time, so just schedule for the next 4AM interval (If this were a fresh install, it would collect at the 4am window, and if there was no data it would ignore the ping, otherwise it would go ahead and send it)

So I think your flow chart can be cleaned up to show the above 3 paths more clearly.

It's also important to point out that there is another entry point: when the 4AM window occurs, WorkManager wakes the app to send the ping and reschedules itself if there was data for that day.

I attached a png example that may help explain the flow I'm explaining.

Flags: needinfo?(tlong)

Nothing to add to what Travis said :)

Flags: needinfo?(alessio.placitelli)

I thought of something to add, that might be sort of important to help understand the logic, now that I think about it: Whenever the collection is scheduled, it replaces any and all other scheduled metrics ping collection jobs. The replacing part is important because we don't want multiple jobs waking up at 4AM to try and do the same thing. So if there is already a job scheduled for the next 4AM interval, and the application starts and reschedules it, then the new one would just replace the existing one.

(In reply to Travis Long [:travis_] from comment #2)

Thanks for the detailed review, Travis!

I agree with the date formats being unnecessary, I was very caught up on that and needed someone to just shake it off me, thanks :)

I changed the diagram based on your observations and I think it is much clearer, but after I did it I realized that it is actually not exactly the same as the documentation describes. If you take the last example "The user changes timezone in an extreme enough fashion that they cross 4AM twice on the same date", if we just check if the metrics was collected today to know if we should collect it, that example will play out differently and the user will indeed collect on Feb8, since they will open the app and the app will realise it is past 4am and metrics ping hasn't been collected.

Unfortunately there was no way for me to recover the previous diagram so I just redrew it taking into account your observations as well. It is much more complex than the new one. I left the "fresh install" and "measurement window start date" stuff because that is what I understood from the documentation.

I think the new version makes it much easier to understand and also makes timezones not so confusing because there is no more comparing dates that might be from different timezones.

I left the two diagrams there for comparison.

About the WorkManager stuff, it is very platform specific. Do you think it makes sense to add it to this diagram?

Some thoughts:

  • can you please attach the diagram to this bug?
  • I think it's ok to remove the "is there any data"/"send" parts, as they are not really part of the metrics ping scheduling. You can also rename "collect" to "submit", which is a more appropriate terminology
  • the WorkManager is platform specific, but the 4am signal it sends it isn't.
  • Can you make the two diagrams two different files?
  • Did you generate the diagram with graphviz?
Flags: needinfo?(brizental)

I think I like the flow of your new diagram much better!

I'm still not sure we need to treat the fresh install any differently than the case of "We haven't sent the metrics ping today". Both of those situations get handled the same way: We check to see when the last ping was sent, and if it wasn't today (date = null or date = yesterday), then we collect the ping. Since we will only send if there is data, a fresh install will probably into the "no-data so don't send" category. I'm just not certain it deserves a special path in your diagram because it makes it seem like we check for a fresh install every time we start, and we don't we currently just check for the last ping sent date/time.

As for the timezone concern, if we are scheduling this for a specific time and date, then crossing the 4AM time boundary twice shouldn't matter, we would have already collected for that day so when the scheduler runs it should say "I've already collected today, just reschedule for the next day at 4AM". The side effect of this that I can see is that a metrics ping window may be larger or smaller than 24 hours because of timezone changes.

The WorkManager stuff is very platform specific, you are right. Unfortunately that's our only platform with a Metrics ping scheduler so that's the only logical example we can examine for how this works, and right now WorkManager is tied up in that logic because of it's behavior. If you can find a way to disentangle it from the logic for your diagram, that would be ideal, but I'm not sure we can for diagramming how it works right now. This may be something that needs to change as we move the logic into Rust.

Also, I recommend https://draw.io for drawing flow chart diagrams

Attached file metrics_ping_scheduler (obsolete) (deleted) —
Flags: needinfo?(brizental)
Attached file metrics_ping_scheduler-REVISED (obsolete) (deleted) —
Attachment #9107243 - Attachment description: metrics → metrics_ping_scheduler-REVISED
Attached image metrics_ping_scheduler.png (obsolete) (deleted) —
Attachment #9107240 - Attachment is obsolete: true
Attached image metrics_ping_scheduler.png (obsolete) (deleted) —
Attachment #9107245 - Attachment is obsolete: true
Attached image metrics_ping_scheduler-REVISED.png (obsolete) (deleted) —
Attachment #9107243 - Attachment is obsolete: true

Answering [:Dexter]:

Did you generate the diagram with graphviz?

I used draw.io, I found it easier than graphviz.

Answering [:travis_]:

I think the way you worded how it works vs. how it is worded in the documentation are very different, but ultimately both do the same thing. Sorry for my confusion. I also think the "REVISED" version of the diagram is much easier to follow.

To add the WorkManager part, I think it would make sense to bring back the "Is there data?" part. Let me know what you think.

OBS.: Just realized I forgot to change "has the metrics ping been collected today" to "has the metrics ping been submitted today".

Okay, I was thinking about this and I still think, even though both diagrams accomplish basically the same thing, there are slight differences. Because the path they take to accomplish it is different.

If you take again the last example in the documentation "The user changes timezone in an extreme enough fashion that they cross 4AM twice on the same date" and change the first step from "The application is opened on Feb 7 at 3PM in timezone UTC+11, closed at 3:05PM" to "The application is opened on Feb 6 at 3PM in timezone UTC+11, closed at 3:05PM", it will play out differently in each path. In the revised path the app will collect on step 3., which it shouldn't according to documentation.

Attached image metrics_ping_scheduler-REVISED(1).png (obsolete) (deleted) —
Attachment #9107247 - Attachment is obsolete: true
Attached image metrics_ping_scheduler(1).png (obsolete) (deleted) —
Attachment #9107246 - Attachment is obsolete: true
Comment on attachment 9107474 [details] metrics_ping_scheduler(1).png I think this is ready for review. The task was to draw a diagram from what I understood from the [the documentation](https://mozilla.github.io/glean/book/user/pings/metrics.html) and I believe this diagram is accurate to what is described there. I still think there is a hole for when the user opens the app from a fresh install before 4am and closes it before 4am. There is no path for that situation (which the other diagram has). Maybe this is a hole in the actual documentation? I also think we found some incoherences between the implementation and the documentation. See [comment #15](https://bugzilla.mozilla.org/show_bug.cgi?id=1590123#c15) for a better description of it. One other issue that was raised, was about adding the `WorkManager` to this diagram. That is not part of the documentation so I left it out, but I do think it makes sense to add it after discussion. To do that, I think we should bring back the steps of checking weather there is any data in the ping and the step of sending it (which are also not in the documentation), because it interferes in this part.
Flags: needinfo?(tlong)
Flags: needinfo?(alessio.placitelli)
Attachment #9107474 - Flags: review+
Comment on attachment 9107474 [details] metrics_ping_scheduler(1).png I think this is ready for review. The task was to draw a diagram from what I understood from the [the documentation](https://mozilla.github.io/glean/book/user/pings/metrics.html) and I believe this diagram is accurate to what is described there. I still think there is a hole for when the user opens the app from a fresh install before 4am and closes it before 4am. There is no path for that situation (which the other diagram has). Maybe this is a hole in the actual documentation? I also think we found some incoherences between the implementation and the documentation. See [comment #15](https://bugzilla.mozilla.org/show_bug.cgi?id=1590123#c15) for a better description of it. One other issue that was raised, was about adding the `WorkManager` to this diagram. That is not part of the documentation so I left it out, but I do think it makes sense to add it after discussion. To do that, I think we should bring back the steps of checking weather there is any data in the ping and the step of sending it (which are also not in the documentation), because it interferes in this part.
Attachment #9107474 - Flags: review?(tlong)
Attachment #9107474 - Flags: review?(alessio.placitelli)
Attachment #9107474 - Flags: review+
Flags: needinfo?(tlong)
Flags: needinfo?(alessio.placitelli)
Comment on attachment 9107474 [details] metrics_ping_scheduler(1).png Sorry for being so opinionated about this! I think that this is looking good and you are getting very close, I just need a little more convincing on a couple of things: > I still think there is a hole for when the user opens the app from a fresh install before 4am and closes it before 4am. There is no path for that situation (which the other diagram has). Maybe this is a hole in the actual documentation? There isn't a hole, and this is why I don't think we can ignore the `WorkManager` as a point of entry for this diagram. If the user used the app for the very first time before 4AM, and then closed the app before 4AM, currently the `WorkManager` would wake the application up (in the background) to send the ping at @ 4AM, which is different than if the user launches the application. If we are truly wanting to draw a diagram about how the MPS _should_ work like the title of the bug, then I totally agree that the `WorkManager` point of entry shouldn't be on here. If we are diagramming how it _currently_ works, then we need to consider the `WorkManager`. I'll admit, I'm not looking at the docs and how they describe the scheduling, I'm looking at the code and how this is working in practice. If we are diagramming the metrics ping scheduling as it currently works on Android, we have to treat the `WorkManager` point of entry differently because it doesn't check to see when the last ping was sent like when the user launches the app, it simply performs the collection and reschedules (if there was data) for the next 4AM interval. I know that this complicates things, especially as we try and port this to platforms that can't schedule things like this (iOS specifically will face this issue). I'm still opposed to putting the "Is this a fresh install?" step in the diagram because we don't really check for a fresh install every time the app opens. What we are really checking for is the last time the ping was sent. In the case of a fresh install, the last time the ping was sent happens to be "never", but we don't treat that any differently than if we had sent the ping before, but not on that particular day. We schedule it for immediate collection and reschedule for the next 4AM interval in both cases.
Attachment #9107474 - Flags: feedback-

(In reply to Travis Long [:travis_] from comment #20)

No worries, [:travis_]! I really appreciate the discussion.

When I first started this task, the idea was to make a diagram only from what is in the documentation (see comment #1), that is one of the reasons I said I'd do it, because I have never even seen the implementation. After that, someone else would make a similar diagram only from looking at the implementation and we would see the differences, etc.

I agree with you on all your observations, about the WorkManager and the "fresh install" and I even think that the diagram I drew after your explanations is much easier to follow than the one I drew from documentation.

Still, the new diagram is not accurate to what is in the documentation. So, if you take very literally the task as I understood it, the first diagram is what fulfills it, and not the latter. I think, if I would be to make a new diagram based on implementation, that would be a new task and would require me to go over the code and back to the drawing board.

What do you think [:Dexter]?

Flags: needinfo?(alessio.placitelli)

(In reply to Beatriz Rizental from comment #21)

(In reply to Travis Long [:travis_] from comment #20)

No worries, [:travis_]! I really appreciate the discussion.

When I first started this task, the idea was to make a diagram only from what is in the documentation (see comment #1), that is one of the reasons I said I'd do it, because I have never even seen the implementation. After that, someone else would make a similar diagram only from looking at the implementation and we would see the differences, etc.

Yes, this bug is specifically about getting the behaviour from the docs. After this bug is resolved, we will go through the implementation and look for discrepancies. This is not a task for this bug.

I agree with you on all your observations, about the WorkManager and the "fresh install" and I even think that the diagram I drew after your explanations is much easier to follow than the one I drew from documentation.

I agree with Travis about the importance of the 4AM signal, I don't agree we should mention the WorkManager. We just need to add a new entry point somewhere called "4AM trigger".

I think it's ok to call out "fresh install", as this is specifically mentioned in the docs.

Still, the new diagram is not accurate to what is in the documentation. So, if you take very literally the task as I understood it, the first diagram is what fulfills it, and not the latter. I think,

if I would be to make a new diagram based on implementation, that would be a new task and would require me to go over the code and back to the drawing board.

What do you think [:Dexter]?

I think that this bug is very specific about "diagram from the docs" :) The rest will come as separate bugs. Please don't look at the implementation! As you said, the whole reason why you're doing this, is that you're new to the codebase and are not biased ;)

Flags: needinfo?(alessio.placitelli)

(In reply to Alessio Placitelli [:Dexter] from comment #22)

I agree with Travis about the importance of the 4AM signal, I don't agree we should mention the WorkManager. We just need to add a new entry point somewhere called "4AM trigger".

If we are going to be consistent about the "only from current documentation" part of this task, then I disagree with adding the 4AM trigger. The documentation doesn't mention it, in fact all the examples in it describe the scheduler as "only works when the app is open", there is no mention of background tasks or anything.

(In reply to Beatriz Rizental from comment #23)

(In reply to Alessio Placitelli [:Dexter] from comment #22)

I agree with Travis about the importance of the 4AM signal, I don't agree we should mention the WorkManager. We just need to add a new entry point somewhere called "4AM trigger".

If we are going to be consistent about the "only from current documentation" part of this task, then I disagree with adding the 4AM trigger. The documentation doesn't mention it, in fact all the examples in it describe the scheduler as "only works when the app is open", there is no mention of background tasks or anything.

The reason why I think a 4AM trigger is needed, is that in your diagram you say "schedule .. at 4am": scheduling doesn't imply sending, necessarily :) So, to make that reflect the docs, you'd still need some way to represent the fact that something which was previously scheduled happens at 4am and collects the ping. Does this make sense?

If you think this is already captured by your diagram, then it's fine, but please call it out :)

r+ from me with [:Dexter] comment about 4AM trigger addressed. Great work!

Attachment #9107474 - Flags: review?(tlong) → review+
Attached image metrics_ping_scheduler.png (obsolete) (deleted) —

[:Dexter] Here is the diagram now with the 4am trigger, as discussed.

Attachment #9107473 - Attachment is obsolete: true
Attachment #9107474 - Attachment is obsolete: true
Attachment #9107474 - Flags: review?(alessio.placitelli)
Flags: needinfo?(alessio.placitelli)
Flags: needinfo?(alessio.placitelli)
Comment on attachment 9108431 [details] metrics_ping_scheduler.png This looks good to me.
Attachment #9108431 - Flags: review+
Attached image metrics_ping_scheduler(1).png (deleted) —
Attachment #9108431 - Attachment is obsolete: true

Sorry, there was an extra arrow in the previous one. This is the last change.

Should I close this bug as FIXED? What are the next steps here?

(In reply to Beatriz Rizental from comment #29)

Sorry, there was an extra arrow in the previous one. This is the last change.

Should I close this bug as FIXED? What are the next steps here?

Yes please, do close this as fixed. Can you please file a follow-up bug (depending on this one) about checking if the implementation matches what's highlighted in this diagram? :)

Flags: needinfo?(brizental)
Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(brizental)
Resolution: --- → FIXED
Blocks: 1596436
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: