Discussion of the new XML processing feature

**Describe the bug**

This is not a bug but a discussion of a new feature: how we can extend XML processing.

A customer has requested that we extend the engines' XML parsing capability. Of course, we should add this to both engines with the same behavior.

## Current behavior

Consider this payload:

```XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <level1>
    <level2>
      <node>foo1</node>
      <node>bar1</node>
    </level2>
    <level2>
      <node>foo2</node>
      <node>bar2</node>
    </level2>
  </level1>
</root>
```

In their current state, the engines expose this payload as follows:

(mod_security2)
```
[/post][9] Target value: "  foo1  bar1  foo2  bar2"
```
(libmodsecurity3)
```
[/post] [9] Target value: "  foo1  bar1  foo2  bar2" (Variable: XML:/*)
```
(lines from the debug logs)

## Problem

The problem is that exclusions for sub-parts and specific nodes do not work. See the example:

```
SecRule XML:/* "@rx ^foo.*" \
	"id:10001,\
	phase:2,\
	t:none,\
	log,\
	pass,\
	ctl:ruleRemoveTargetById=930120;XML:/level1/level2/node"
```

This fails because the `XML` variable holds the **concatenated node values**, not key:value pairs as with JSON. It is therefore impossible to create an exclusion for a specific node against any rule.
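To illustrate why the individual nodes cannot be addressed, here is a minimal Python sketch. It assumes the standard-library `xml.etree.ElementTree` as a stand-in for the engines' parser, and the helper name `xpath_string_value` is hypothetical. It mimics what an XPath expression such as `/*` yields as a string: the text of all descendant nodes joined together, with no element names attached.

```python
import xml.etree.ElementTree as ET

# Same payload as in "Current behavior".
PAYLOAD = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
  <level1>
    <level2>
      <node>foo1</node>
      <node>bar1</node>
    </level2>
    <level2>
      <node>foo2</node>
      <node>bar2</node>
    </level2>
  </level1>
</root>
"""


def xpath_string_value(payload: bytes) -> str:
    """Join the text of every descendant node, like the string value of /*."""
    root = ET.fromstring(payload)
    return "".join(root.itertext())


value = xpath_string_value(PAYLOAD)
# Only the values survive (plus whitespace); the element names are gone,
# so there is nothing a per-node exclusion could refer to.
print(value.split())
```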

## Possible solution

Consider this converted structure (XML to JSON):

```JSON
{
  "root": {
    "level1": {
      "level2": [
        {
          "node": [
            "foo1",
            "bar1"
          ]
        },
        {
          "node": [
            "foo2",
            "bar2"
          ]
        }
      ]
    }
  }
}
```

This payload is expanded like this:

(mod_security2)
```
[/post][9] Adding JSON argument 'root.level1.level2.level2.node.node' with value 'foo1'
[/post][9] Adding JSON argument 'root.level1.level2.level2.node.node' with value 'bar1'
[/post][9] Adding JSON argument 'root.level1.level2.level2.node.node' with value 'foo2'
[/post][9] Adding JSON argument 'root.level1.level2.level2.node.node' with value 'bar2'
```
(libmodsecurity3)
```
[/post] [4] Adding request argument (JSON): name "json.root.level1.level2.array_0.node.array_0", value "foo1"
[/post] [4] Adding request argument (JSON): name "json.root.level1.level2.array_0.node.array_1", value "bar1"
[/post] [4] Adding request argument (JSON): name "json.root.level1.level2.array_1.node.array_0", value "foo2"
[/post] [4] Adding request argument (JSON): name "json.root.level1.level2.array_1.node.array_1", value "bar2"
```

**The idea is to transform the XML structure in a similar way.**

Example:

(libmodsecurity3)
```
[/post] [4] Adding request argument (XML): name "xml.root.level1.level2.array_0.node.array_0", value "foo1"
[/post] [4] Adding request argument (XML): name "xml.root.level1.level2.array_0.node.array_1", value "bar1"
[/post] [4] Adding request argument (XML): name "xml.root.level1.level2.array_1.node.array_0", value "foo2"
[/post] [4] Adding request argument (XML): name "xml.root.level1.level2.array_1.node.array_1", value "bar2"
```
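The naming scheme above could be sketched roughly as follows. This is only an illustration in Python, not engine code: the function name `flatten_xml` is hypothetical, and the rule "siblings with the same tag get an `array_N` suffix" is an assumption inferred from the example names. Attributes, namespaces and mixed content are ignored here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
  <level1>
    <level2><node>foo1</node><node>bar1</node></level2>
    <level2><node>foo2</node><node>bar2</node></level2>
  </level1>
</root>
"""


def flatten_xml(payload: bytes, prefix: str = "xml"):
    """Return (name, value) pairs for every leaf element that has text."""
    root = ET.fromstring(payload)
    pairs = []

    def walk(elem, path):
        children = list(elem)
        if not children:
            text = (elem.text or "").strip()
            if text:
                pairs.append((path, text))
            return
        # Assumption: a tag that occurs more than once among siblings
        # is treated like a JSON array and gets an index suffix.
        totals = Counter(child.tag for child in children)
        seen = Counter()
        for child in children:
            child_path = f"{path}.{child.tag}"
            if totals[child.tag] > 1:
                child_path += f".array_{seen[child.tag]}"
                seen[child.tag] += 1
            walk(child, child_path)

    walk(root, f"{prefix}.{root.tag}")
    return pairs


for name, value in flatten_xml(SAMPLE):
    print(f'name "{name}", value "{value}"')
```

One open design point this sketch makes visible: a single `<node>` child would not get an `array_0` suffix, so the argument name depends on how many siblings are present in a given request.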

## Possible risks

* if we introduce this "new" collection under an existing one, it will cause false positive matches
* the cost of parsing an XML structure is very high

## How can we avoid/handle the risks?

We can leave it to the user to decide whether the new collection should appear under `ARGS` or not, by introducing a new configuration keyword, e.g. `SecParseXMLintoArgs` (and consider an optional runtime setting, e.g. `ctl:parseXMLintoArgs`).

As with JSON, we can introduce a new configuration keyword that controls the maximum number of XML levels that can be analyzed, e.g. `SecRequestBodyXMLDepthLimit` (see [SecRequestBodyJSONDepthLimit](https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecRequestBodyJsonDepthLimit))
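A depth limit of this kind could work roughly like the following Python sketch. It is an assumption about the intended semantics (the root element counts as level 1, and processing stops as soon as the limit is exceeded); the names `max_depth_within` and `DepthLimitExceeded` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


class DepthLimitExceeded(Exception):
    """Raised when the XML nesting goes deeper than the configured limit."""


def max_depth_within(payload: bytes, limit: int) -> int:
    """Return the deepest nesting level, or raise if it exceeds `limit`."""
    depth = 0
    deepest = 0
    events = ET.iterparse(io.BytesIO(payload), events=("start", "end"))
    for event, _elem in events:
        if event == "start":
            depth += 1
            deepest = max(deepest, depth)
            if depth > limit:
                raise DepthLimitExceeded(f"depth {depth} exceeds limit {limit}")
        else:
            depth -= 1
    return deepest


DOC = b"<root><level1><level2><node>foo1</node></level2></level1></root>"
print(max_depth_within(DOC, limit=10))
```

With this counting, the payload from "Current behavior" has four levels (`root`, `level1`, `level2`, `node`), so a limit of 3 would reject it.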


## More todo's

We have to:

* analyze XML parser performance
  * should we change from libxml2 to another parser? Libexpat? Or another one?
* check the effect of [SecArgumentsLimit](https://github.com/owasp-modsecurity/ModSecurity/wiki/Reference-Manual-(v2.x)#user-content-SecArgumentsLimit) in the case of JSON parsing
* design and apply this behavior to XML parsing
* explore the possibility of additional XML validation methods (e.g. XXE (XML External Entity) detection)
* decide the question of compatibility or uniform behavior across versions

For the last item: the behavior of JSON parsing differs between the two versions. Consider the payload `{"a":1,"b":[{"a1":"a1val"},{"a1":"a2val"}]}` (note that it contains a list!), which is equivalent to this XML:

```
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <a>1</a>
    <b>
        <element>
            <a1>a1val</a1>
        </element>
        <element>
            <a1>a2val</a1>
        </element>
    </b>
</root>
```

The JSON payload produces these results:

(mod_security2)
```
[/post][9] Adding JSON argument 'a' with value '1'
[/post][9] Adding JSON argument 'b.b.a1' with value 'a1val'
[/post][9] Adding JSON argument 'b.b.a1' with value 'a2val'
```

(libmodsecurity3)
```
[/post] [4] Adding request argument (JSON): name "json.a", value "1"
[/post] [4] Adding request argument (JSON): name "json.b.array_0.a1", value "a1val"
[/post] [4] Adding request argument (JSON): name "json.b.array_1.a1", value "a2val"
```

Note the list items with the same keys! I think we should follow libmodsecurity3's behavior, but then XML and JSON won't be compatible. (Which implies the next question: do we want to align mod_security2's behavior?)

Any feedback is welcome!
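To make the difference concrete, the two naming schemes can be imitated with the Python sketch below. It only reproduces the argument names seen in the log excerpts in this issue. It is not taken from either code base, the function names are hypothetical, and cases the logs do not show (top-level arrays, booleans, `null`) are not modeled.

```python
import json


def flatten_v3_style(value, path="json"):
    """Names as in the libmodsecurity3 logs: list items get array_N."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten_v3_style(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten_v3_style(child, f"{path}.array_{index}")
    else:
        yield path, str(value)


def flatten_v2_style(value, path="", last_key=""):
    """Names as in the mod_security2 logs: list items repeat the key."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else key
            yield from flatten_v2_style(child, child_path, key)
    elif isinstance(value, list):
        for child in value:
            child_path = f"{path}.{last_key}" if path else last_key
            yield from flatten_v2_style(child, child_path, last_key)
    else:
        yield path, str(value)


payload = json.loads('{"a":1,"b":[{"a1":"a1val"},{"a1":"a2val"}]}')
print(list(flatten_v2_style(payload)))
print(list(flatten_v3_style(payload)))
```

In the first scheme both list items end up under the same name (`b.b.a1`), so they cannot be excluded individually. In the second scheme each item has its own name.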


