Skip to content

Fix UnitParser.ParseUnit for multiple-worded units #254

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

Fix UnitParser.ParseUnit for multiple-worded units #254

wants to merge 3 commits into from

Conversation

gojanpaolo
Copy link
Contributor

#253

  1. Changed regex to accept multiple worded non-numeric units
  2. Used string.Replace to remove all known separators. This way, we
    don't need to include it in our regex. Also the separator "," is already
    defined in NumberFormatInfo numFormat
  3. Removed code block for group since it will not go into this
    anymore.
  4. Added Unit test for Parsing Mass "Short Tons" and "Long Tons"

#253

1. Changed regex to accept multiple worded non-numeric units
2. Used string.Replace to remove all known separators. This way, we
don't need to include it in our regex. Also the separator "," is already
defined in NumberFormatInfo numFormat
3. Removed code block for <invalid> group since it will not go into this
anymore.
4. Added Unit test for Parsing Mass "Short Tons" and "Long Tons"
Added:
in/s
yd/s
yd/min
Copy link
Owner

@angularsen angularsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new units look good and could be merged, but since you have mixed that with the unit parsing change, you will first have to address the unit parsing comments or create a separate PR for the new units.

@"(?<unit>[^\s\d,]+)", // capture Unit (non-whitespace) input
@"(and)?,?", // allow "and" & "," separators between quantities
@"(?<invalid>[a-z]*)?"); // capture invalid input
@"(?<unit>[^\d]+)"); // capture Unit (non-numeric)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works with our current cases, but we might stumble onto problems here later if we get abbreviations with numbers in them. As an example, we currently only list in Area.Squaremeter, but for convenience sake we might want to also add m^2 so that they can parse both variants. The latter would not work with this implementation.
I'm fine with postponing that until we cross that bridge, but one option is to define the regex as a number immediately followed by any of the known unit abbreviations for the specified culture (IFormatProvider).

Something like this should work:

string valueRegex = $"{numRegex}{exponentialRegex}";
string unitRegex = "(m²|m\^2|cm²|cm\^2)";
string valueUnitRegex = $"(?<value>{valueRegex})\s?(?<unit>{unitRegex})"`;

The unitRegex string values can be enumerated by something like this:

var unitSystem = UnitSystem.GetCached(formatProvider);
string[] abbrevsForAllUnits = 
    Enum.GetValues(typeof(TUnit))
    .Cast<TUnit>()
    .SelectMany(unitSystem.GetAllAbbreviations)
    .Select(Regex.Escape)
    .ToArray();

And you'd need to refactor the UnitParser.ParseUnit() method a bit:

  • Add where TUnit : struct, IComparable, IFormattable (should have been added to begin with)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever I add where TUnit : struct, IComparable, IFormattable, I get a bunch of errors.
I think you're mistaking TUnit in UnitParser.ParseUnit() as the Unit Enums (i.e. LengthUnit, MassUnit, etc) but TUnit is actually the UnitClasses (Length, Mass).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.. so I tried hacking into the UnitSystem class to get all the abbreviation given a UnitClasses type.
The problem is, using the format of regex you suggested. It will fail on test "1ft 1invalid" since it recognizes the "INvalid" as inches. It also fails the test "1ft monkey 1in" since it recognizes "FT monkey".
I thought about restricting it to match only exact abbreviation but how will it handle something like "m" and "mm"?

I think having a single regex or parse algorithm for all the Unit types will have its limitations.
If we want the parsing to handle special cases then maybe it will be easier to implement one parse method for each Unit type.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you raise some good points here.

I think you're mistaking TUnit in UnitParser.ParseUnit() as the Unit Enums

You are right, I was probably too quick here, but UnitSystem should be able to provide this information as you found out.

It will fail on test "1ft 1invalid" since it recognizes the "INvalid" as inches. It also fails the test "1ft monkey 1in" since it recognizes "FT monkey". It also fails the test "1ft monkey 1in" since it recognizes "FT monkey".
I thought about restricting it to match only exact abbreviation but how will it handle something like "m" and "mm"?

I think your suggestion should work, we just need ensure we don't match on text that is not part of a valid pair of {value}{unit} or the accepted delimiters between pairs or between value and unit. I tried a few things and wound up with this: ^((?<value>\d+)\s?(?<unit>m²|m\^2|cm²|cm\^2|ft|in|m|mm)\s?)+$ and I added unit tests for the cases you listed above that you can view here: https://regex101.com/r/EMLIwt/1

With the start/end boundaries, a + repeater to match on at least one pair and an optional whitespace \s? between each pair, the regex now seems to properly match on full abbreviations, and it should not incorrectly match m on 1mm.

I think having a single regex or parse algorithm for all the Unit types will have its limitations.
If we want the parsing to handle special cases then maybe it will be easier to implement one parse method for each Unit type.

I hear you and I agree that might be something we eventually have to reach for, but I also do think my updated regex might just work. The upside of the complexity is that it allows for parsing any number of valid pairs of value+unit and that the implementation is uniform across units, so it is consistent and easier to reason about. The downside is the complexity of the regex itself, performance and the fact that it might not make sense or at least not be commonly used to parse all kinds of pairs of quantities from a string, like we are used to from 1ft 2in and 1' 2''.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I have been working on another project last week thus the late response. I'll get back to this once I finished the said project (maybe a couple weeks more). :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, thanks for posting an update! Did my proposal make sense to you?

Copy link
Contributor Author

@gojanpaolo gojanpaolo May 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it allows for parsing any number of valid pairs of value+unit and that the implementation is uniform across units, so it is consistent and easier to reason about.

Yes! especially that sentence. I'll try using your regex and add unit tests for it when I get the time. 👍

[TestCase("333 long tn", Result = 333)]
public double ParseMassLongTon(string s)
{
return Mass.Parse(s).LongTons;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@@ -68,6 +68,12 @@ public class SpeedTests : SpeedTestsBase

protected override double MillimetersPerHourInOneMeterPerSecond => 3600000;

protected override double InchesPerSecondInOneMeterPerSecond => 39.3700787;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you've mixed two pieces of work here. I saw you were working on the master branch, but it would probably be better for you to work on a feature branch to separate the two bulks of changes and create separate pull requests.
We can proceed like this if you are not familiar with how to do it, no biggie, but just a heads up for next time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I messed up with the branches. I'm not familiar using Git yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I might as well ask you how should I have done the two pieces of work?
Should I have created two forks? one for the bugfix and one for the addfeature?

Copy link
Owner

@angularsen angularsen May 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend googling for some guides on git, github, forks and pull requests. I haven't read this one, but at first glance it seems to cover the basics: https://gist.github.com/Chaser324/ce0505fbed06b947d962

The basic gist of it is:

  • Add official repo as remote so you can sync in changes, I prefer to call this origin and if I have a fork I call that my, but that is just my preference: git remote set-url origin https://github.com/anjdreas/unitsnet and git remote add my https://github.com/gojanpaolo/unitsnet
  • Create a local branch (based on origin/master) for any new work you want to create a pull request for, fix-unit-parsing or add-pascal-unit: git checkout -b fix-unit-parsing origin/master
  • Do your work as one or more commits
  • When you are ready for review, rebase your branch on top of any new commits on the official repo, then push that branch up to your github fork: git rebase origin/master && git push -u my
  • Visit github.com to create a pull request from that branch

It's a bit boilerplate the first time, especially setting up the remotes and stuff, but it gets easier with practice. You might also want to use some GUI clients to make things easier, but I personally prefer the command line.

Copy link
Owner

@angularsen angularsen May 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, for the second pull request, repeat all but the first step, this way both branches are based on origin/master and don't have any commits from the other branch.


protected override double YardsPerMinuteInOneMeterPerSecond => 65.6167979;

protected override double YardsPerSecondInOneMeterPerSecond => 1.09361330;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked these values though, looks good!

@angularsen
Copy link
Owner

Still waiting for you to address my feedback. Also, it needs to be rebased on top of latest master as there is a merge conflict right now.

@gojanpaolo
Copy link
Contributor Author

I think I will still be busy on projects at work for the next couple of months. :/

@angularsen
Copy link
Owner

Fixed by #265.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants