Feature/improve parsing #81

maherkassim · 2015-05-18T15:36:10Z

Added the ability to parse multiple quantities (eg. 1ft 1in)

angularsen · 2015-05-18T19:16:48Z

As already stated, I think this is a fair approach, but I have a hunch we might run into some edge cases in the parsing, especially for unit abbreviations with numbers in them. We don't have any today, since we normally go with m² and m³, but it's fair to assume we will have abbreviations like m^2 and m^3 at some point. I think we need more tests to be sure though, and I guess adding abbreviations like that is needed to fully test it.

I just added a couple of new tests in b215ac5 that might serve as a reference on how to temporarily add new unit abbreviations in a test, to test the outcome. Since UnitSystem is a singleton-per-culture, it will also affect parsing done in unit classes.

This way we can test the parse behavior for both realistic and contrived cases, such as "5 m^2 2m²" equals 7 m². While writing that I just realized an edge case: if we allow removing the space, you get a pretty ambiguous expression, "5m^22m²". This is why I don't want to allow omitting the space between them, but for some cases like 5"2' we do want to allow that. So I'm starting to question whether we should allow 2 units for all types of units, or maybe we should only allow them for the known cases, like feet/inches and stone/pounds?
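
To make the ambiguity concrete, here is a minimal sketch in Python (not the library's C#). The `QUANTITY` pattern and `split_quantities` helper are hypothetical and only assume that a unit abbreviation may end in `^` plus digits or in a superscript.

```python
import re

# Hypothetical quantity pattern: a number followed by a unit whose
# abbreviation may end in "^<digits>" or a superscript (e.g. "m^2", "m²").
# This is an illustration only, not the regex used by the library.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*([a-zA-Z]+(?:\^\d+|[²³])?)")

def split_quantities(text):
    """Return the (number, unit) pairs found from left to right."""
    return [(float(n), u) for n, u in QUANTITY.findall(text)]

spaced = split_quantities("5 m^2 2m²")  # two quantities, as intended
glued = split_quantities("5m^22m²")     # the "2" is swallowed by the exponent
print(spaced)
print(glued)
```

With the space, the two quantities come out separately. Without it, the greedy exponent reads the unit as "m^22" and the trailing "m²" is left unmatched.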

maherkassim · 2015-05-18T19:40:32Z

I'm leaning more towards handling imperial units like feet/inches and stone/pounds separately.

The updated regex will be able to parse 1' 1" (not 1'1"), but I believe that to be sufficient for now. I've reverted the unit matching back, so it will allow cases like 1m^2 again.

maherkassim · 2015-05-19T12:58:21Z

I haven't yet been able to find a way to modify the regex to parse both 1'1" and 1m^2. Right now it can parse 1' 1" and 1m^2, but it needs that space between the two quantities in the Imperial units.

One approach we can take is to attempt both regexes. Specifically, if it fails to parse the string using the initial regex, then it can try an alternate regex for Imperial units where it stops matching the unit once it reaches a number. This should match all imperial quantities of the form 1'1".
The only case I believe won't work with this approach is the one you mentioned with a string of the form "5m^22m²", but I believe that is simply poorly formatted input and should fail (it's not particularly human-readable).

A potential concern with this approach may be performance. However, if we only run the second regex when the first fails, the only times where both are run would be for invalid input or for Imperial units with no space in between. So I'm not sure how big of a concern it is, but we may want to take it into consideration.

What do you think?
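
The fallback idea described above could look roughly like this. It is a sketch in Python rather than the library's C#, and the patterns, the `KNOWN_UNITS` table and the `parse_pairs` name are all assumptions made for illustration.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"
# First pass: the unit is any run of non-space characters, so "m^2" stays
# whole, but quantities must be separated by whitespace.
PRIMARY = re.compile(rf"({NUMBER})\s*(\S+)")
# Fallback: the unit stops at the next digit or space, so 1'1" splits in two.
FALLBACK = re.compile(rf"({NUMBER})\s*([^\d\s]+)")

# Hypothetical abbreviation table, only for this example.
KNOWN_UNITS = {"'", '"', "m^2", "ft", "in"}

def parse_pairs(text):
    """Try the primary regex, then the fallback; fail if neither fits."""
    for pattern in (PRIMARY, FALLBACK):
        pairs = pattern.findall(text)
        if pairs and all(unit in KNOWN_UNITS for _, unit in pairs):
            return [(float(n), unit) for n, unit in pairs]
    raise ValueError(f"cannot parse {text!r}")
```

The second regex only runs when the first one yields an unknown unit, which is the case for invalid input and for feet/inches written without a space. A string like "5m^22m²" fails both passes and raises.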

angularsen · 2015-05-19T19:50:59Z

I'm still not agreeing with myself on whether we want to allow 2 units for all our unit classes, or only implement for the few known cases such as 5" 3' and possibly pounds/stones (not too familiar with that one).

I do think it is nice that we can parse strings like "5m 10cm". Possibly even "5m and 10cm", which should be trivial to support with regex. However, it feels strange that this would also parse "5Pa 3 bars" or "5m 2yd". But then again, why not? As long as it parses the correct quantity, why limit the option? That way we don't have to keep implementing all the special cases of 2 units in the future when we discover them.

So in the end, I'm leaning towards keeping the current implementation, but we need to figure out how to support parsing 5"2' , because I believe that will be commonly used. Maybe as you said, a second pass of regex? I don't know what the performance hit is here, but if it's starting to become significant, then I do want to optimize for the 80% of usages, and that is single quantity/unit.

How about this:

1st pass: single quantity/unit regex (most common)
2nd pass: feet/inches special case regex (quite common)
3rd pass: dual quantity/unit regex (rare)

maherkassim · 2015-05-20T14:26:04Z

I believe that having a second pass should be sufficient. I don't think we'll see a significant performance improvement from splitting the current regex into two separate passes.

I think another option might be to just separate the functionality completely into a separate "ParseImperial" method which would be dedicated to handling the special parsing rules. This might also help clarify how the input is expected to be formatted for each method (ie. only allowing multiple quantities like "1ft2in" and "1ft 2in" in ParseImperial).

It might also make sense to limit ParseImperial to only be implemented for specific classes like:

Length (eg. inches, feet, yards)
Mass (eg. pounds, ounces)

I don't think it makes much sense for most of the other classes (eg. Acceleration, Temperature).

angularsen · 2015-05-20T17:14:48Z

I can go with that. I too don't think performance will be a big problem for the short strings we are parsing. We can compile and cache the regexes too to further improve performance if need be.
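
As a rough illustration of compiling once and reusing the result, here is a Python sketch (not the library's C#). The `quantity_regex` function and its pattern are hypothetical. Python's `re` module also keeps its own internal cache of compiled patterns, so the explicit cache here mainly makes the reuse visible.

```python
import re
from functools import lru_cache

@lru_cache(maxsize=None)
def quantity_regex(unit_pattern):
    """Build the quantity regex for a unit pattern once and reuse it."""
    return re.compile(rf"(\d+(?:\.\d+)?)\s*({unit_pattern})")

first = quantity_regex(r"ft|in")
second = quantity_regex(r"ft|in")   # served from the cache, not recompiled
pairs = first.findall("1ft 2in")
```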

I think ParseImperial may be a bit generic though. The way I see it, the only special case we need to handle is to allow 1"2' with no space between them. All other single/dual units can be handled by the current regex implementation, maybe extended with support for "and" between dual quantities. Am I right?

As per my last post, I am currently leaning towards allowing dual quantities for all units, simply because it is less maintenance if we later find we need dual quantities for units X and Y as well, and I don't see the harm in parsing two units that can logically be added together although they may not be commonly used together, such as "1L 2dm³" or "1L and 2dm³" parses to 3 liters. As long as we require a space between the two quantities, we should hopefully avoid many edge cases.

maherkassim · 2015-05-20T18:02:28Z

Ok, so I will just add the second regex that will attempt to match strings like 1'1" if the current regex fails. Also, using the current regex the "and" will just be ignored, so it should work without any additional modification.

angularsen · 2015-05-20T18:09:15Z

Perfect, sounds good to me.

maherkassim · 2015-05-22T13:32:46Z

If the input is "1ft and 1in" then that will work with the parsing, but so will nonsense like "1ft chicken cow sheep 1in" as the extra words simply won't be matched by the regex. Should we be concerned with this or ignore it and leave it up to the user? We should document the behavior we choose (ie. let users know that all invalid input mixed in with valid input will be ignored).

On one hand, we can ignore the invalid input and not limit the user. On the other hand, it is technically invalid, so maybe we should treat it as such? For example, how should we treat 1'1? It will currently ignore the second 1 and parse as 1 foot because it isn't matched by the regex. I came across this case when I was writing tests for the parsing and wasn't sure how it should be handled.
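
A small Python sketch of the behavior in question (not the library's C#; the pattern and helper names are made up for this example). Scanning for matches skips anything the regex does not recognize, while looking at the leftover text shows what was silently ignored.

```python
import re

# Hypothetical pattern covering only a few length abbreviations.
QUANTITY = re.compile(r"(\d+)\s*(ft|in|'|\")")

def lenient(text):
    """Collect every quantity and silently skip everything else."""
    return QUANTITY.findall(text)

def leftover(text):
    """Return whatever remains once quantities and whitespace are removed."""
    return QUANTITY.sub("", text).split()

print(lenient("1ft chicken cow sheep 1in"))   # both quantities are found
print(leftover("1ft chicken cow sheep 1in"))  # the ignored words
print(lenient("1'1"), leftover("1'1"))        # trailing 1 is dropped
```

A strict parser could reject the input whenever `leftover` returns anything other than accepted separators.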

angularsen · 2015-05-22T19:43:40Z

After some thought, I think we should allow multiple quantities, but not allow invalid strings. Generally, I favor parsing the quantities as long as they can logically be parsed, and leave it up to user code to ensure what is parsed actually makes sense.

My thoughts are as follows

// Length.Parse() examples
// Multiple quantities are added together
// No invalid text is allowed
// Allowed delimiters between quantities: 1 comma, 1 word "and", N whitespace, or any combination 

// Valid strings

1m => 1 meter                       // single quantity
1m 1" => (1 meter + 1 inch)         // valid, but unconventional
1'1" => (1 foot + 1 inch)           // special case for feet/inches, allow no space
1dm³ 1L => 2 liters                 // 2 quantities, separated by space
1m 1m 1m => 3m                      // 3 quantities, separated by space
1m 1m ... 1m, N times => N m        // N quantities, separated by space
1m and 1cm => 1.01m                 // 2 quantities, separated by "and"-word and whitespace
1m,1cm,1mm => 1.011m                // 3 quantities, separated by comma
1m, 1cm, 1mm => 1.011m              // 3 quantities, separated by comma and whitespace
1m, 1cm and 1mm => 1.011m           // 3 quantities, separated by a mix of delimiters
1m, 1cm, and 1mm => 1.011m          // 3 quantities, separated by a mix of delimiters
1m     and      1cm => 1.01m        // 2 quantities, separated by a mix of delimiters
1m  ,   and,      1cm => 1.01m      // 2 quantities, separated by a mix of delimiters

// Invalid strings, throws exception

1m1cm           // 2 quantities, no space
1m monkey 1cm   // invalid word
1''             // invalid unit
1mmm            // invalid unit
1'1"2"          // only 2 quantities, exactly feet then inches, are supported without space
1"1'            // only 2 quantities, exactly feet then inches, are supported without space

It's possible we might run into trouble allowing comma as a delimiter, since different cultures use the comma differently in number formatting. To my knowledge none of them result in a number starting or ending with a comma, but I could be wrong. For instance, US English often uses ".5" to denote 0.5, so I suppose some cultures allow the same for comma. However, we could simply specify that number formats like ",5" and ".5" are not supported.
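
The rules listed above, including the comma delimiter, can be sketched as follows. This is Python rather than the library's C#, it covers only a handful of length units, and every name in it (`METERS_PER_UNIT`, `parse_length_meters`, the patterns) is hypothetical.

```python
import re

# Hypothetical conversion table (meters per unit); not the library's data.
METERS_PER_UNIT = {"mm": 0.001, "cm": 0.01, "m": 1.0,
                   "ft": 0.3048, "in": 0.0254, "'": 0.3048, '"': 0.0254}

# Longer abbreviations first so "mm" is tried before "m".
UNIT = "|".join(re.escape(u)
                for u in sorted(METERS_PER_UNIT, key=len, reverse=True))
QUANTITY = rf"\d+(?:\.\d+)?\s*(?:{UNIT})"
# One comma and/or whitespace, optionally followed by the word "and".
DELIMITER = r"(?:\s*,\s*|\s+)(?:and\b\s*,?\s*)?"

WHOLE = re.compile(rf"\s*{QUANTITY}(?:{DELIMITER}{QUANTITY})*\s*")
PAIR = re.compile(rf"(\d+(?:\.\d+)?)\s*({UNIT})")
FEET_INCHES = re.compile(r"(\d+)'(\d+)\"")  # special case: no space allowed

def parse_length_meters(text):
    """Sum all quantities in text as meters; raise on any invalid text."""
    special = FEET_INCHES.fullmatch(text.strip())
    if special:
        feet, inches = map(float, special.groups())
        return feet * 0.3048 + inches * 0.0254
    if not WHOLE.fullmatch(text):
        raise ValueError(f"invalid length string: {text!r}")
    return sum(float(n) * METERS_PER_UNIT[u] for n, u in PAIR.findall(text))
```

The whole string has to match before anything is summed, so stray words, unknown units and missing delimiters all raise instead of being ignored. Numbers such as ",5" or ".5" are not accepted by this sketch, in line with the suggestion to leave them unsupported.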

Topic of interesting complexity, this is. --Yoda

Update: Fixed " and ' for feet/inches.

maherkassim · 2015-05-22T19:52:56Z

Minor clarification: I believe that 1' refers to 1 foot and 1" refers to 1 inch, although this may vary for other languages. This was how I entered it in the abbreviations for feet and inches in the unit definition. http://en.wikipedia.org/wiki/Prime_(symbol)

maherkassim · 2015-05-22T19:56:19Z

Regarding the note about parsing ".5", I think it would be useful to observe the behavior of double.Parse() as that is ultimately what decides whether or not it can be parsed (unless we specifically pad it with a zero).

angularsen · 2015-05-22T19:56:20Z

You are right of course, I'm a metric person myself so got them mixed up for some reason :-)

maherkassim · 2015-05-22T19:58:00Z

Ahh ok :) Sorry for nitpicking, I just wanted to make sure there wasn't an actual difference that we'd have to support. Although that's not the case here, I wonder if there are actually any units where the abbreviations differ by language?

angularsen · 2015-05-22T20:00:16Z

No idea, but I'm tempted to simply assume no for now until someone points out otherwise.

angularsen · 2015-06-24T19:06:25Z

I don't mean to nag, but any progress on this?

maherkassim · 2015-06-24T19:58:40Z

Wow! Sorry about this. Was really preoccupied for a while and it completely slipped my mind to come back to it. I'll take a look to refresh my memory and then submit the remaining changes.

angularsen · 2015-06-25T21:28:30Z

Just a heads up, I am about to delete the develop and stable branches and return to using a master branch. It seems pull requests can't be edited to change target branch, so I will leave develop up until this is merged in.

maherkassim · 2015-06-26T18:52:15Z

Ok, so I've updated the PR to simplify the parsing (splitting it into two regex calls seemed like a bit much). I added a capturing group to the regex for invalid strings, which should capture nonsense like "monkey". I think the regex should now behave as expected, and any further limitations we want to place on the parsing can be applied after the matching, for example if we want to enforce that 1"1' is invalid (it should be 1'1"). Please note that I have not added checks like that in these changes, but we may wish to add them in the future (perhaps we should add a ticket in the issue tracker).

angularsen · 2015-06-26T19:24:30Z

So mostly nitpicking on my end. Do you think we should just merge it in as-is or do you want to act on any of the feedback first?

maherkassim · 2015-06-26T19:48:17Z

Ok, I think it can be merged unless there is a different alternative to the groups["invalid"].Value == "" check that I missed (couldn't find groups["invalid"].IsMatch and groups["invalid"].Success didn't work as expected).
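
For comparison, the "invalid" capturing group idea can be sketched in Python (not the library's C#, and not the regex from this PR). The `TOKEN` pattern and `parse_pairs` name are assumptions. In Python, a named group that did not take part in a match is `None`, so that is what the check looks for.

```python
import re

# One alternative per kind of token; anything that is not a quantity or an
# accepted separator lands in the "invalid" group. Hypothetical pattern.
TOKEN = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>ft|in|'|\")"
    r"|(?P<separator>,|\band\b)"
    r"|(?P<invalid>\S+)"
)

def parse_pairs(text):
    """Return (value, unit) pairs; raise if the invalid group ever matches."""
    pairs = []
    for match in TOKEN.finditer(text):
        if match.group("invalid") is not None:
            raise ValueError(f"unexpected text: {match.group('invalid')!r}")
        if match.group("value") is not None:
            pairs.append((float(match.group("value")), match.group("unit")))
    return pairs
```

With this layout "1ft and 1in" parses, while "1ft monkey 1in" and "1'1" both raise because the leftover text is captured instead of skipped.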

Improve parsing of unit quantities. Not all the cases in the Length.Parse() examples above are implemented. For instance, it happily parses unit quantity pairs with no space or delimiters between them. This is left for a future improvement.

angularsen · 2015-06-26T19:57:48Z

Ok, I haven't played too much with them myself. If it works, it's fine by me.

Merging in!

Thanks for the 1+ month effort :-) 👍

maherkassim · 2015-06-26T20:03:48Z

Haha yeah, sorry about the extended delay. There's still room for improvement, but I think we've made some progress.

angularsen · 2015-06-26T20:24:00Z

The NuGet package is now out: 3.14.0.

maherkassim added 2 commits May 18, 2015 11:31

Updated parsing to allow multiple quantities (eg. 1ft 1in)

2ff5f32

Generated Code for Updated parsing to allow multiple quantities

9495043

maherkassim mentioned this pull request May 18, 2015

Parsing feet and inches ignores inches in results #72

Closed

maherkassim added 2 commits May 18, 2015 15:29

Updated parsing, Added ft/in symbols, and minor formatting

2cb8e04

Generated Code for Updated parsing, Added ft/in symbols, and minor formatting

42fd425

maherkassim added 2 commits May 21, 2015 14:53

angularsen#72 Added second pass in parsing to allow 1'1"

b82bb29

Generated Code for Added second pass in parsing to allow 1'1"

8981414

maherkassim added 2 commits June 26, 2015 14:45

Simplified parsing, captured invalid input, added parsing for separator

b9fb93a

Generated Code for Simplified parsing, captured invalid input, added parsing for separator

234bbd8

Removed unnecessary escape from parse tests

1f1b8c7

angularsen merged commit 7353c4b into angularsen:develop Jun 26, 2015

angularsen added a commit that referenced this pull request Jun 26, 2015

Merge in pull request #81 from develop

6fb85f8

PR was based on develop, so have to manually merge it into master after changing the branching style (again).

angularsen mentioned this pull request Jan 10, 2018

"Length.TryParse" parses invalid values #343

Closed

bplubell mentioned this pull request Jan 11, 2018

v5: Wishlist for breaking changes #180

Closed

13 tasks
