-
Notifications
You must be signed in to change notification settings - Fork 393
Fix UnitParser.ParseUnit for multiple-worded units #254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
#253 1. Changed regex to accept multiple worded non-numeric units 2. Used string.Replace to remove all known separators. This way, we don't need to include it in our regex. Also the separator "," is already defined in NumberFormatInfo numFormat 3. Removed code block for <invalid> group since it will not go into this anymore. 4. Added Unit test for Parsing Mass "Short Tons" and "Long Tons"
Added: in/s yd/s yd/min
Added Speed Units
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new units look good and could be merged, but since you have mixed that with the unit parsing change, you will first have to address the unit parsing comments or create a separate PR for the new units.
@"(?<unit>[^\s\d,]+)", // capture Unit (non-whitespace) input | ||
@"(and)?,?", // allow "and" & "," separators between quantities | ||
@"(?<invalid>[a-z]*)?"); // capture invalid input | ||
@"(?<unit>[^\d]+)"); // capture Unit (non-numeric) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works with our current cases, but we might stumble onto problems here later if we get abbreviations with numbers in them. As an example, we currently only list m²
in Area.Squaremeter
, but for convenience sake we might want to also add m^2
so that they can parse both variants. The latter would not work with this implementation.
I'm fine with postponing that until we cross that bridge, but one option is to define the regex as a number immediately followed by any of the known unit abbreviations for the specified culture (IFormatProvider
).
Something like this should work:
string valueRegex = $"{numRegex}{exponentialRegex}";
string unitRegex = "(m²|m\^2|cm²|cm\^2)";
string valueUnitRegex = $"(?<value>{valueRegex})\s?(?<unit>{unitRegex})"`;
The unitRegex
string values can be enumerated by something like this:
var unitSystem = UnitSystem.GetCached(formatProvider);
string[] abbrevsForAllUnits =
Enum.GetValues(typeof(TUnit))
.Cast<TUnit>()
.SelectMany(unitSystem.GetAllAbbreviations)
.Select(Regex.Escape)
.ToArray();
And you'd need to refactor the UnitParser.ParseUnit()
method a bit:
- Add
where TUnit : struct, IComparable, IFormattable
(should have been added to begin with)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever I add where TUnit : struct, IComparable, IFormattable
, I get a bunch of errors.
I think you're mistaking TUnit in UnitParser.ParseUnit() as the Unit Enums (i.e. LengthUnit, MassUnit, etc) but TUnit is actually the UnitClasses (Length, Mass).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.. so I tried hacking into the UnitSystem class to get all the abbreviation given a UnitClasses type.
The problem is, using the format of regex you suggested. It will fail on test "1ft 1invalid" since it recognizes the "INvalid" as inches. It also fails the test "1ft monkey 1in" since it recognizes "FT monkey".
I thought about restricting it to match only exact abbreviation but how will it handle something like "m" and "mm"?
I think having a single regex or parse algorithm for all the Unit types will have its limitations.
If we want the parsing to handle special cases then maybe it will be easier to implement one parse method for each Unit type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you raise some good points here.
I think you're mistaking TUnit in UnitParser.ParseUnit() as the Unit Enums
You are right, I was probably too quick here, but UnitSystem
should be able to provide this information as you found out.
It will fail on test "1ft 1invalid" since it recognizes the "INvalid" as inches. It also fails the test "1ft monkey 1in" since it recognizes "FT monkey". It also fails the test "1ft monkey 1in" since it recognizes "FT monkey".
I thought about restricting it to match only exact abbreviation but how will it handle something like "m" and "mm"?
I think your suggestion should work, we just need ensure we don't match on text that is not part of a valid pair of {value}{unit}
or the accepted delimiters between pairs or between value and unit. I tried a few things and wound up with this: ^((?<value>\d+)\s?(?<unit>m²|m\^2|cm²|cm\^2|ft|in|m|mm)\s?)+$
and I added unit tests for the cases you listed above that you can view here: https://regex101.com/r/EMLIwt/1
With the start/end boundaries, a +
repeater to match on at least one pair and an optional whitespace \s?
between each pair, the regex now seems to properly match on full abbreviations, and it should not incorrectly match m
on 1mm
.
I think having a single regex or parse algorithm for all the Unit types will have its limitations.
If we want the parsing to handle special cases then maybe it will be easier to implement one parse method for each Unit type.
I hear you and I agree that might be something we eventually have to reach for, but I also do think my updated regex might just work. The upside of the complexity is that it allows for parsing any number of valid pairs of value+unit and that the implementation is uniform across units, so it is consistent and easier to reason about. The downside is the complexity of the regex itself, performance and the fact that it might not make sense or at least not be commonly used to parse all kinds of pairs of quantities from a string, like we are used to from 1ft 2in
and 1' 2''
.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I have been working on another project last week thus the late response. I'll get back to this once I finished the said project (maybe a couple weeks more). :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries, thanks for posting an update! Did my proposal make sense to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it allows for parsing any number of valid pairs of value+unit and that the implementation is uniform across units, so it is consistent and easier to reason about.
Yes! especially that sentence. I'll try using your regex and add unit tests for it when I get the time. 👍
[TestCase("333 long tn", Result = 333)] | ||
public double ParseMassLongTon(string s) | ||
{ | ||
return Mass.Parse(s).LongTons; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
@@ -68,6 +68,12 @@ public class SpeedTests : SpeedTestsBase | |||
|
|||
protected override double MillimetersPerHourInOneMeterPerSecond => 3600000; | |||
|
|||
protected override double InchesPerSecondInOneMeterPerSecond => 39.3700787; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you've mixed two pieces of work here. I saw you were working on the master branch, but it would probably be better for you to work on a feature branch to separate the two bulks of changes and create separate pull requests.
We can proceed like this if you are not familiar with how to do it, no biggie, but just a heads up for next time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I messed up with the branches. I'm not familiar using Git yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, I might as well ask you how should I have done the two pieces of work?
Should I have created two forks? one for the bugfix and one for the addfeature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend googling for some guides on git, github, forks and pull requests. I haven't read this one, but at first glance it seems to cover the basics: https://gist.github.com/Chaser324/ce0505fbed06b947d962
The basic gist of it is:
- Add official repo as remote so you can sync in changes, I prefer to call this
origin
and if I have a fork I call thatmy
, but that is just my preference:git remote set-url origin https://github.com/anjdreas/unitsnet
andgit remote add my https://github.com/gojanpaolo/unitsnet
- Create a local branch (based on
origin/master
) for any new work you want to create a pull request for,fix-unit-parsing
oradd-pascal-unit
:git checkout -b fix-unit-parsing origin/master
- Do your work as one or more commits
- When you are ready for review,
rebase
your branch on top of any new commits on the official repo, then push that branch up to your github fork:git rebase origin/master && git push -u my
- Visit github.com to create a pull request from that branch
It's a bit boilerplate the first time, especially setting up the remotes and stuff, but it gets easier with practice. You might also want to use some GUI clients to make things easier, but I personally prefer the command line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, for the second pull request, repeat all but the first step, this way both branches are based on origin/master and don't have any commits from the other branch.
|
||
protected override double YardsPerMinuteInOneMeterPerSecond => 65.6167979; | ||
|
||
protected override double YardsPerSecondInOneMeterPerSecond => 1.09361330; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked these values though, looks good!
Still waiting for you to address my feedback. Also, it needs to be rebased on top of latest master as there is a merge conflict right now. |
I think I will still be busy on projects at work for the next couple of months. :/ |
Fixed by #265. |
#253
don't need to include it in our regex. Also the separator "," is already
defined in NumberFormatInfo numFormat
anymore.