-
Notifications
You must be signed in to change notification settings - Fork 393
Feature/improve parsing #81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/improve parsing #81
Conversation
As already stated, I think this is a fair approach, but I have a hunch we might run into some edge cases in the parsing, especially for unit abbreviations with numbers in them. We don't have any today, since we normally go with m² and m³, but it's fair to assume we will have abbreviations like m^2 and m^3 at some point. I think we need more tests to be sure though, and I guess adding abbreviations like that is needed to fully test it. I just added a couple of new tests in b215ac5 that might serve as a reference on how to temporarily add new unit abbreviations in a test, to test the outcome. Since UnitSystem is a singleton-per-culture, it will also affect parsing done in unit classes. This way we can test the parse behavior for both realistic and contrived cases, such as "5 m^2 2m²" equals 7 m². While writing that I just realized an edge case. If we allow removing the space you get a pretty ambiguous expression "5m^22m²". This is why I don't want to allow space between them, but for some cases like 5"2' we do want to allow that. So I'm starting to question whether we should allow 2 units for all types of units, or maybe we should only allow them for the known cases, like feet/inches and stone/pounds? |
I'm leaning more towards handling imperial units like feet/inches and stone/pounds separately. The updated regex will be able to parse 1' 1" (not 1'1"), but I believe that to be sufficient for now. I've reverted the unit matching back, so it will allow cases like 1m^2 again. |
I haven't yet been able to find a way to modify the regex to parse both 1'1" and 1m^2. Right now it can parse 1' 1" and 1m^2, but it needs that space between the two quantities in the Imperial units. One approach we can take is to attempt both regex's. Specifically, if it fails to parse the string using the initial regex then it can try an alternate regex for Imperial units where it stops matching the unit once it reaches a number. This should match all imperial quantities of the form 1'1". A potential concern with this approach may be performance. However, if we only run the second regex when the first fails, the only times where both are run would be for invalid input or for Imperial units with no space in between. So I'm not sure how big of a concern it is, but we may want to take it into consideration. What do you think? |
I'm still not agreeing with myself on whether we want to allow 2 units for all our unit classes, or only implement for the few known cases such as 5" 3' and possibly pounds/stones (not too familiar with that one). I do think it is a nice that we can parse strings like "5m 10cm". Possibly even "5m and 10cm", which should be trivial to support with regex. However, it feels strange that this would also parse "5Pa 3 bars" or "5m 2yd". But then again, why not? As long as it parses the correct quantity, why limit the option. That way we don't have to keep implementing all the special cases of 2 units in the future when we discover them. So in the end, I'm leaning towards keeping the current implementation, but we need to figure out how to support parsing 5"2' , because I believe that will be commonly used. Maybe as you said, a second pass of regex? I don't know what the performance hit is here, but if it's starting to become significant, then I do want to optimize for the 80% of usages, and that is single quantity/unit. How about this:
|
I believe that having a second pass should be sufficient. I don't think we'll see a significant performance improvement from splitting the current regex into two separate passes. I think another option might be to just separate the functionality completely into a separate "ParseImperial" method which would be dedicated to handling the special parsing rules. This might also help clarify how the input is expected to be formatted for each method (ie. only allowing multiple quantities like "1ft2in" and "1ft 2in" in ParseImperial). It might also make sense to limit ParseImperial to only be implemented for specific classes like:
|
I can go with that. I too don't think performance will be a big problem for the short strings we are parsing. We can compile and cache the regexes too to further improve performance if need be. I think ParseImperial may be a bit generic though. The way I see it, the only special case we need to handle is to allow 1"2' with no space between them. All other single/dual units can be handled by the current regex implementation, maybe extended with support for "and" between dual quantities. Am I right? As per my last post, I am currently leaning towards allowing dual quantities for all units, simply because it is less maintenance if we later find we need dual quantities for units X and Y as well, and I don't see the harm in parsing two units that can logically be added together although they may not be commonly used together, such as "1L 2dm³" or "1L and 2dm³" parses to 3 liters. As long as we require a space between the two quantities, we should hopefully avoid many edge cases. |
Ok, so I will just have add the second regex that will attempt to match strings like 1'1" if the current regex fails. Also, using the current regex the "and" will just be ignored so it should work without any additional modification. |
Perfect, sounds good to me. |
If the input is "1ft and 1in" then that will work with the parsing, but so will nonsense like "1ft chicken cow sheep 1in" as the extra words simply won't be matched by the regex. Should we be concerned with this or ignore it and leave it up to the user? We should document the behavior we choose (ie. let users know that all invalid input mixed in with valid input will be ignored). On one hand, we can ignore the invalid input and not limit the user. On the other hand, it is technically invalid, so maybe we should treat it as such? For example, how should we treat 1'1? It will currently ignore the second 1 and parse as 1 foot because it isn't matched by the regex. I came across this case when I was writing tests for the parsing and wasn't sure how it should be handled. |
After some thought, I think we should allow multiple quantities, but not allow invalid strings. Generally, I favor parsing the quantities as long as they can logically be parsed, and leave it up to user code to ensure what is parsed actually makes sense. My thoughts are as follows
It's possible we might run into trouble allowing comma as delimiter, since different cultures use comma differently in the number formatting, but to my knowledge none of them result in a number starting or ending with a comma, but I could be wrong. For instance US English often use ".5" to denote 0.5, so I suppose some cultures allow the same for comma. However, we could simply specify that number formats like ",5" and ".5" are not supported. Topic of interesting complexity, this is. --Yoda Update: Fixed " and ' for feet/inches. |
Minor clarification: I believe that 1' refers to 1 foot and 1" refers to 1 inch, although this may vary for other languages. This was how I entered it in the abbreviations for feet and inches in the unit definition. http://en.wikipedia.org/wiki/Prime_(symbol) |
Regarding the note about parsing ".5", I think it would be useful to observe the behavior of double.Parse() as that is ultimately what decides whether or not it can be parsed (unless we specifically pad it with a zero). |
You are right of course, I'm a metric person myself so got them mixed up for some reason :-) |
Ahh ok :) Sorry for nitpicking, I just wanted to make sure there wasn't an actual difference that we'd have to support. Although that's not the case here, I wonder if there are actually any units where the abbreviations differ by language? |
No idea, but I'm tempted to simply assume no for now until someone points out otherwise. |
I don't mean to nag, but any progress on this? |
Wow! Sorry about this. Was really preoccupied for a while and it completely slipped my mind to come back to it. I'll take a look to refresh my memory and then submit the remaining changes. |
Just a heads up, I am about to delete develop and stable branches, and returning to using a master branch. It seems pull requests can't be edited to change target branch, so I will leave develop up until this is merged in. |
Ok, so I've updated to simplify the parsing (splitting to have two regex calls seemed like a bit much). I added a capturing group to the regex for invalid strings that should capture nonsense like "monkey". I think the regex should now behave as expected, and any further limitations we want to place on the parsing might be done after the matching. For example, if we want to enforce that 1"1' is invalid (should be 1'1"). Please note that I have not added checks like that with these changes, but we may wish to add it in the future (perhaps we should add a ticket in the issue tracker). |
So mostly nitpicking on my end. Do you think we should just merge it in as-is or do you want to act on any of the feedback first? |
Ok, I think it can be merged unless there is a different alternative to the groups["invalid"].Value == "" check that I missed (couldn't find groups["invalid"].IsMatch and groups["invalid"].Success didn't work as expected). |
Improve parsing of unit quantities. Not all the cases below are implemented. For instance it happily parses unit quantity pairs with no space or delimiters between them. This is left for a future improvement. --- // Length.Parse() examples // Multiple quantities are added together // No invalid text is allowed // Allowed delimiters between quantities: 1 comma, 1 word "and", N whitespace, or any combination // Valid strings 1m => 1 meter // single quantity 1m 1" => (1 meter + 1 inch) // valid, but unconventional 1'1" => (1 foot + 1 inch) // special case for feet/inches, allow no space 1dm³ 1L => 2 liters // 2 quantities, separated by space 1m 1m 1m => 3m // 3 quantities, separated by space 1m 1m ... 1m, N times => N m // N quantities, separated by space 1m and 1cm => 1.011m // 2 quantities, separated by "and"-word and whitespace 1m,1cm,1mm => 1.011m // 3 quantities, separated by comma 1m, 1cm, 1mm => 1.011m // 3 quantities, separated by comma 1m, 1cm and 1mm => 1.011m // 3 quantities, separated by a mix of delimiters 1m, 1cm, and 1mm => 1.011m // 3 quantities, separated by a mix of delimiters 1m and 1cm => 1.011m // 3 quantities, separated by a mix of delimiters 1m , and, 1cm => 1.011m // 3 quantities, separated by a mix of delimiters // Invalid strings, throws exception 1m1cm // 2 quantities, no space 1m monkey 1cm // invalid word 1'' // invalid unit 1mmm // invalid unit 1'1"2" // only 2 quantities of exactly foot and inch is supported without space 1"1' // only 2 quantities of exactly foot and inch is supported without space
Ok, I haven't played too much with them myself. If it works, it's fine by me. Merging in! Thanks for the 1+ month effort :-) 👍 |
Haha yeah, sorry about the extended delay. There's still room for improvement, but I think we've made some progress. |
PR was based on develop, so have to manually merge it into master after changing the branching style (again).
Nuget is now out 3.14.0. |
Added the ability to parse multiple quantities (eg. 1ft 1in)