Closed
Bug 1311487
Opened 8 years ago
Closed 7 years ago
main_summary dataset should validate/clean dates
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P3)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: ddurst, Unassigned)
References
Details
Attachments
(1 file)
(deleted),
application/zip
|
Details |
Chutten and I came across this oddity when looking at s.t.m.o for some forensic analysis on orphan updating: your results may be different using different kinds of dashes (probably in error) as part of the date "format" (%Y-%m-%d).
So I ran a small test with https://sql.telemetry.mozilla.org/queries/1463/source#table.
I've attached the results of running that query with five permutations of the date format of '2016-10-20':
1. H: both dashes are hyphens
2. N1: first dash is a hyphen, second is an ndash
3. N2: both dashes are ndashes
4. M1: first dash is a hyphen, second is an mdash
5. M2: both dashes are mdashes
Diff'd the output and 1, 2, and 4 are all equal. 3 and 5 are equal. So this tells me that the character is not really the issue, but rather how that character is treated/translated.
1 and 3 are not equal: 1 contains results in 2016 and 3 only has results with year > 2016.
So what happens if someone erroneously formats a date with non-hyphens? How are hyphens being handled on the back-end?
Reporter | ||
Comment 1•8 years ago
|
||
robotblake made a good point in IRC via regex experiment: most likely, the activity_date is being compared as strings (duh), so xxxx— vs xxxx&hyphen diffs are probably due to just ASCII values.
But this begs a different question, then: if we need to query a specific date, what's the best (read: safest) non-ASCII way to do it?
Use a regex to remove non-numerics? try to cast to a date type (all such efforts have failed)? format as a date string?
Comment 3•8 years ago
|
||
Yes, we can do this, I've added an issue to our tracker: https://github.com/mozilla/redash/issues/11
Flags: needinfo?(rmiller)
Comment 4•8 years ago
|
||
robotblake is correct, the 'activity_date' column in this table is a string, not a date, and '>=' therefore does string comparison.
Here is a version of the query that parses the date column and compares it to a date value (after first using a regex to ignore rows with invalid date syntax):
SELECT *
FROM client_count
WHERE
normalized_channel = 'Other'
AND regexp_like(activity_date, '\d{4}-\d{2}-\d{2}')
AND from_iso8601_date(activity_date) >= date '2016-10-20'
AND country = 'CN'
AND os_version = '10.0'
AND app_version = '45.3.0'
ORDER BY activity_date ASC
Reporter | ||
Comment 5•8 years ago
|
||
(O hai, ashort!)
The only problem with that is the "first using a regex to ignore rows with invalid date syntax" -- if those rows have invalid date syntax, are we guaranteed that they have invalid date *data* (i.e. we are safe to exclude them)?
So, basically, your comment confirms the issue. Aside from the question of excluding rows based on syntax, is this a preferred way of working around this issue, or is some other solution going to be implemented that obviates it?
I understand the issue isn't *with* redash, but I'd expect either a function to safely cast to date types (which would force consistent handling of any of - – —) or something implicit that would do the same (without any regex conversions).
Flags: needinfo?(rmiller)
Flags: needinfo?(ashort)
Comment 6•8 years ago
|
||
The best thing to do would clearly be to fix the data source. Putting the burden of safe conversion and comparison on the querying tool is always going to be brittle; there's no way we're going to be able to account for all permutations of what might end up in the database. We can try to provide something as a band-aid, but is there a bug open anywhere for someone to turn this field into an actual date field so we fix the root of the issue?
Flags: needinfo?(rmiller)
Comment 7•8 years ago
|
||
(In reply to David Durst [:ddurst] from comment #5)
> The only problem with that is the "first using a regex to ignore rows with
> invalid date syntax" -- if those rows have invalid date syntax, are we
> guaranteed that they have invalid date *data* (i.e. we are safe to exclude
> them)?
The error I got when leaving out this regex was 'Cannot parse "22951-04-0"' so there are clearly some entries that just aren't recognizable as dates at all.
Flags: needinfo?(ashort)
Updated•8 years ago
|
Summary: Dashes in dates in s.t.m.o are not treated equally? → main_summary dataset should validate/clean dates
Comment 8•8 years ago
|
||
We should convert the data in the main summary dataset to use actual parquet "Date" types. Then we can treat them in a sane way, with any assumptions documented accordingly (like using null for invalid dates).
Comment 9•8 years ago
|
||
(In reply to Rob Miller [:RaFromBRC :rmiller] from comment #6)
> The best thing to do would clearly be to fix the data source. Putting the
> burden of safe conversion and comparison on the querying tool is always
> going to be brittle; there's no way we're going to be able to account for
> all permutations of what might end up in the database. We can try to provide
> something as a band-aid, but is there a bug open anywhere for someone to
> turn this field into an actual date field so we fix the root of the issue?
We tried using a proper date type in Parquet in the past but the version of Presto we were using at the time had some issues with it. We should verify that our current version of Presto still has this issue.
Updated•8 years ago
|
Points: --- → 3
Priority: -- → P3
Comment 10•7 years ago
|
||
Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Updated•7 years ago
|
Updated•6 years ago
|
Product: Cloud Services → Cloud Services Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•