
## Overview

This example addresses the problem of mapping user questions to pre-existing Question & Answer (Q&A) pairs, as typically provided in a list of Frequently Asked Questions (FAQ) or in the Q&A pairs present on websites like [Stack Overflow](https://stackoverflow.com/). There are many ways to match a question to its correct answer, such as finding the answer that is most similar to the question. This example, however, matches open ended questions to previously asked questions, on the assumption that each answer in the FAQ can answer multiple semantically equivalent questions.

The key steps required to deliver this solution are as follows:

1. Clean and process text data.
2. Learn informative phrases, which are multi-word sequences that provide more information when viewed in sequence than when treated independently. These learned phrases are later treated as single compound tokens during feature extraction (step 3), so the classification models in step 4 can use multi-word concepts as features.
3. Extract features from text data.
4. Train text classification models and evaluate model performance.


## Data Description

The dataset used in this example is the Stack Exchange Data Dump hosted on [archive.org](https://archive.org/details/stackexchange), an anonymized dump of all user-contributed content on the [Stack Exchange network](https://stackexchange.com/). Each site in the network is packaged as a separate archive of XML files compressed via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory, and PostLinks.

### Data Source

This example uses the [Posts data (10 GB)](https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z) and [PostLinks data (515 MB)](https://archive.org/download/stackexchange/stackoverflow.com-PostLinks.7z) from the Stack Overflow site. For complete schema information, see the [readme.txt](https://ia800500.us.archive.org/22/items/stackexchange/readme.txt).

The `PostTypeId` field in the Posts data indicates whether a post is a `Question` or an `Answer`; the `PostLinkTypeId` field in the PostLinks data indicates whether two posts are linked or duplicates. Question posts typically include tags, which are keywords that categorize a question together with similar or duplicate questions. Some high-frequency tags, such as `javascript`, `java`, `c#`, and `php`, comprise a large number of question posts. This example extracts a subset of Q&A pairs with the `javascript` tag.
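
Purely as an illustration of this kind of tag filtering (not the sample's actual selection logic), a minimal pandas sketch might look as follows; the column names follow the published Stack Exchange schema, while the flat file name is an assumption:

```python
import pandas as pd

# Hypothetical flat export of the Posts data; the column names follow the
# published Stack Exchange schema, the file name is an assumption.
posts = pd.read_csv("posts.csv")

# PostTypeId == 1 marks question posts; tags are stored like "<javascript><html>".
questions = posts[posts["PostTypeId"] == 1]
js_questions = questions[questions["Tags"].str.contains("<javascript>", regex=False, na=False)]

print(f"{len(js_questions)} javascript-tagged question posts")
```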

Additionally, a question post may be related to multiple answer posts and duplicate question posts. To construct a FAQ list from these two datasets, several data collection criteria are applied. The three sets of compiled data are selected using a SQL script, which is not included in this example; their descriptions are as follows:

- `Original Questions (Q)`: Question posts asked and answered on the Stack Overflow site.
- `Duplications (D)`: Question posts that duplicate other pre-existing questions (`PostLinkTypeId = 3`), which are the original questions. Duplications are considered semantically equivalent to the original questions, in the sense that the answer provided to the original question also answers the new duplicate question (a sketch of extracting these links follows this list).
- `Answers (A)`: Answer posts that provide the answers to the original questions; each original question and its answer form a pre-existing Q&A pair.
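
Purely as an illustration (the actual selection is done with the SQL script that is not included here), the duplicate-to-original links could be extracted from flat exports of the two datasets along these lines; file and column names are assumptions based on the schema described above:

```python
import pandas as pd

# Hypothetical flat exports of the Posts and PostLinks data.
posts = pd.read_csv("posts.csv")
links = pd.read_csv("postlinks.csv")

# PostLinkTypeId == 3 (as named in this README) marks duplicate links:
# PostId is the duplicate question, RelatedPostId the original it duplicates.
dup_links = links.loc[links["PostLinkTypeId"] == 3, ["PostId", "RelatedPostId"]]

# Pair each duplicate question (D) with its original question (Q).
q_text = posts.set_index("Id")[["Title", "Body"]]
duplications = dup_links.join(q_text, on="PostId").join(
    q_text, on="RelatedPostId", rsuffix="_original"
)
```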
The data schema and direct download links of the three compiled datasets can be found in the `Part_1_Data_Preparation.ipynb` notebook.

## Scenario Structure

The Q&A matching example is presented through three types of files. The first is a series of Jupyter Notebooks that describe the entire workflow step by step. The second is a set of custom Python modules for phrase learning and feature extraction; these modules are generic enough to serve not only this example but also other use cases. The third is a set of Python scripts for tuning hyper-parameters and tracking model performance using the Azure Machine Learning Workbench.

The files in this example are organized as follows.


### Data Ingestion & Transformation

The three compiled datasets are stored in Azure Blob storage and are retrieved in the `Part_1_Data_Preparation.ipynb` notebook.
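
As a minimal sketch of that retrieval step (the URL pattern and tab-separated file names are assumptions; the real locations are defined in the notebook):

```python
import pandas as pd

# Hypothetical Blob storage location; replace the placeholders with the
# URLs defined in Part_1_Data_Preparation.ipynb.
BASE_URL = "https://<storage-account>.blob.core.windows.net/<container>"

questions = pd.read_csv(f"{BASE_URL}/questions.tsv", sep="\t")  # Original Questions (Q)
dupes = pd.read_csv(f"{BASE_URL}/dupes.tsv", sep="\t")          # Duplications (D)
answers = pd.read_csv(f"{BASE_URL}/answers.tsv", sep="\t")      # Answers (A)
```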

Before the text classification models are trained, the text in the questions is cleaned and preprocessed to exclude code snippets. Unsupervised phrase learning is then applied over the training material to learn informative multi-word sequences. These phrases are represented as single compound word units in the downstream bag-of-words (BOW) featurization used by the text classification models; in a BOW featurization, each document is represented by the counts of the words (and, here, learned phrases) it contains, ignoring word order.
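
A minimal sketch of the kind of cleaning involved (the exact rules live in `Part_1_Data_Preparation.ipynb`; the regular expressions below are illustrative assumptions, based on Stack Overflow post bodies being HTML with code wrapped in `<pre>`/`<code>` tags):

```python
import re

def clean_text(html_body: str) -> str:
    """Illustrative cleanup of a Stack Overflow post body."""
    # Drop code snippets: <pre>...</pre> blocks and inline <code>...</code>.
    text = re.sub(r"<pre>.*?</pre>", " ", html_body, flags=re.DOTALL)
    text = re.sub(r"<code>.*?</code>", " ", text, flags=re.DOTALL)
    # Strip the remaining HTML tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

print(clean_text("<p>How do I parse JSON?</p> <pre><code>JSON.parse(s)</code></pre>"))
# -> "how do i parse json?"
```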

The detailed step-by-step descriptions of text preprocessing and phrase learning can be found in the Notebooks `Part_1_Data_Preparation.ipynb` and `Part_2_Phrase_Learning.ipynb`, respectively.
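
The sample implements phrase learning in its own Python module; purely to illustrate the underlying idea (this is not that module's actual algorithm), adjacent word pairs can be scored by how much more often they co-occur than chance would predict, for example with pointwise mutual information (PMI):

```python
import math
from collections import Counter

def learn_phrases(sentences, min_count=5, threshold=1.0):
    """Score adjacent word pairs by PMI and keep the strongest as phrases."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
        total += len(sent)
    phrases = {}
    for (w1, w2), n in bigrams.items():
        if n < min_count:
            continue
        # PMI, with unigram and bigram counts both normalized by the
        # total token count (a common simplification).
        pmi = math.log(n * total / (unigrams[w1] * unigrams[w2]))
        if pmi > threshold:
            phrases[(w1, w2)] = pmi
    return phrases

def apply_phrases(sent, phrases):
    """Merge learned word pairs into single compound tokens."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and (sent[i], sent[i + 1]) in phrases:
            out.append(sent[i] + "_" + sent[i + 1])
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out

sents = [
    ["use", "a", "regular", "expression"],
    ["regular", "expression", "syntax"],
    ["match", "digits", "with", "a", "regular", "expression"],
    ["a", "regular", "expression", "for", "dates"],
    ["regular", "expression", "in", "javascript"],
]
print(apply_phrases(["a", "regular", "expression"], learn_phrases(sents)))
# -> ['a', 'regular_expression']
```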

### Model Training

The model training process is illustrated in `Part_3_Model_Training_and_Evaluation.ipynb`.
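
The sample's own models live in that notebook; as a generic sketch of BOW-based text classification (toy data and scikit-learn stand in for the notebook's actual training code), a classifier can be trained over the phrase-merged tokens and then used to rank answer classes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins: cleaned, phrase-merged question texts and the Id of the
# answer class each training question maps to.
train_texts = [
    "how to parse json in javascript",
    "parse a json string into an object",
    "regular_expression to match digits",
    "match digits with a regular_expression",
]
train_answer_ids = [101, 101, 102, 102]

# Tokens are already normalized, so split on whitespace only.
model = make_pipeline(
    CountVectorizer(token_pattern=r"\S+"),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_answer_ids)

# Rank all answer classes for a new question by predicted probability.
probs = model.predict_proba(["regular_expression for digits"])[0]
ranked = [cls for _, cls in sorted(zip(probs, model.classes_), reverse=True)]
print(ranked)  # best-matching answer class first
```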
### Evaluation

Two different evaluation metrics are used to assess performance; both are sketched in code after this list.
1. `Average Rank (AR)`: indicates the average position at which the correct answer is found in the ranked list of retrieved Q&A pairs (out of the full set of answer classes).
2. `Top 3 Percentage`: indicates the percentage of test questions for which the correct answer is retrieved within the top three choices of the returned ranked list.
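
A minimal sketch of both computations, assuming each test question comes with a ranked list of answer-class Ids (position 1 = best match):

```python
def evaluate(ranked_lists, true_ids):
    """ranked_lists[i] is the ranked list of answer Ids for test question i."""
    ranks = [
        ranked.index(true) + 1  # 1-based position of the correct answer
        for ranked, true in zip(ranked_lists, true_ids)
    ]
    average_rank = sum(ranks) / len(ranks)
    top3_pct = 100.0 * sum(r <= 3 for r in ranks) / len(ranks)
    return average_rank, top3_pct

ar, top3 = evaluate([[101, 102, 103], [102, 101, 103]], [101, 103])
print(ar, top3)  # ranks are 1 and 3 -> AR = 2.0, Top 3 Percentage = 100.0
```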

The evaluation is demonstrated in `Part_3_Model_Training_and_Evaluation.ipynb`.


## Conclusion & Next Steps

This example highlights how to use well-known text analytics techniques, such as phrase learning and text classification, to produce a robust model. It also showcases how Azure Machine Learning Workbench supports interactive solution development and tracks model performance.

Some key highlights of this example are:

- The question and answer matching problem can be effectively addressed with phrase learning and text classification models, as shown by the evaluation results in `Part_3_Model_Training_and_Evaluation.ipynb`.
- Azure Machine Learning Workbench supports interactive model development through Jupyter Notebook.
- Azure Machine Learning Workbench manages the run history and learned models, and logs the evaluation metrics for comparison. These features enable quick hyper-parameter tuning and help identify the best performing models.

