-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HADOOP-19515. ABFS: Updating Documentations of Hadoop Drivers for Azure #7540
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
anujmodi2021
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments
This comment was marked as outdated.
This comment was marked as outdated.
anujmodi2021
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
| the credential provider framework to securely store them and access them | ||
| through configuration. The following describes its use for Azure credentials | ||
| in WASB FileSystem. | ||
| _You cannot enable Hierarchical Namespaces on an existing storage account_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of _ in the start and end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for italics
| A key aspect of ADLS Gen 2 is its support for | ||
| [hierarchical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace) | ||
| These are effectively directories and offer high performance rename and delete operations | ||
| —something which makes a significant improvement in performance in query engines |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Give space between —something -> — something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken
| For additional reading on the credential provider API see: | ||
| [Credential Provider API](../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html). | ||
| _**Containers in a storage account with Hierarchical Namespaces are | ||
| not (currently) readable through the `deprecated wasb:` connector.**_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need COLON here? deprecated wasb:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was present in the earlier doc as well. The colon was to show the URI scheme
But yes, would separate the words deprecated and wasb:
|
|
||
| For additional reading on the credential provider API see: | ||
| [Credential Provider API](../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html). | ||
| _**Containers in a storage account with Hierarchical Namespaces are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of _ in the start and end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for italics
| % hadoop credential create fs.azure.account.key.youraccount.blob.core.windows.net -value 123 | ||
| -provider localjceks://file/home/lmccay/wasb.jceks | ||
| $ az storage container list --account-name abfswales1 | ||
| Blob API is not yet supported for hierarchical namespace accounts. ErrorCode: BlobApiNotYetSupportedForHierarchicalNamespaceAccounts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line should be outside bash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, taken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this done?
|
|
||
| ###### distcp | ||
| The [az storage](https://docs.microsoft.com/en-us/cli/azure/storage?view=azure-cli-latest) subcommand | ||
| handles all storage commands, [`az storage account create`](https://docs.microsoft.com/en-us/cli/azure/storage/account?view=azure-cli-latest#az-storage-account-create) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to keep it [az storage account create] without square brackets (az storage account create).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think square brackets were for hyperlink?
Its .md syntax
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think he meant that we remove the escape quotes so the hyperlink is visible clearly
like keep it [az storage account create] instead of [az storage account create]
| added job specific configuration to a generic core-site.xml. The square brackets above illustrate | ||
| this capability. | ||
| Check that all is well by verifying that the usage command includes `--hierarchical-namespace`: | ||
| ``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should it be ```bash?
| ``` | ||
|
|
||
| #### Protecting the Azure Credentials for WASB within an Encrypted File | ||
| You can list locations from `az account list-locations`, which lists the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In future, there can be addition or deletion of few of the locations, do you want to hardcode all locations here or just it would be better to write 2-3 locations and mark ... for others. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Added the output as a small sample table
| ```bash | ||
| az storage account show-connection-string --name abfswales1 | ||
| { | ||
| "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this AccountKey belongs to actual account? We should mask this key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. It was present in the earlier doc as well so been long
The key would have had expired or rotated now I believe. They might have created it for a temporary storage account
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but better to remove it now
| <name>fs.azure.shellkeyprovider.script</name> | ||
| <value>PATH TO DECRYPTION PROGRAM</value> | ||
| <name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name> | ||
| <value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Key would have expired/rotated. (Temporary) account might have been deleted.
| ``` | ||
|
|
||
| You then need to add the access key to your `core-site.xml`, JCEKs file or | ||
| use your cluster management tool to set it the option `fs.azure.account.key.STORAGE-ACCOUNT` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like STORAGE-ACCOUNT is part of the string itself. We can write it something like this: fs.azure.account.key.<STORAGE-ACCOUNT>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine. We have been following this format everywhere.
| For most of the cases, combining data from multiple `write()` calls in | ||
| blocks of 4Mb is a good optimization. But, in others cases, like HBase log files, | ||
| every call to `hflush()` or `hsync()` must upload the data to the service. | ||
| Key Steps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: Key Steps (Key Steps in bold)
| `/`hsync()` runs once compaction process, if number of blocks in the blob | ||
| is above 32,000. | ||
| 1. Create a new Storage Account in a location which suits you. | ||
| 1. "Basics" Tab: select "StorageV2". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numbering correction 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in md file it shows list with same number only, @anujmodi2021 you had pointed this out earlier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct.
We just need to specify 1. for everyline and rendering will automatically put right numbers
| is above 32,000. | ||
| 1. Create a new Storage Account in a location which suits you. | ||
| 1. "Basics" Tab: select "StorageV2". | ||
| 1. "Advanced" Tab: enable "Hierarchical Namespace". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, numbering correction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed
| `fs.azure.block.blob.with.compaction.dir` to a comma-separated list of | ||
| folder names. | ||
| 1. Go to the Azure Portal. | ||
| 1. Select "Storage Accounts" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, numbering correction for all below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed
| Block compaction search and replaces a sequence of small blocks with one big | ||
| block. That means there is associated cost with block compaction: reading | ||
| small blocks back to the client and writing it again as one big block. | ||
| You have now created your storage account. Next, get the key for authentication |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sentence doesn't sound correct. Please check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified
| See also: | ||
|
|
||
| * [WASB Deprecation](./wasb.html) | ||
| * [ABFS](./index.html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rename index.md with something better ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the landing page has to be named index.md.
Its same for all the sites in hadoop repo
| [Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java). | ||
|
|
||
| To make it part of Apache Hadoop's default classpath, simply make sure that | ||
| `HADOOP_OPTIONAL_TOOLS`in `hadoop-env.sh` has `'hadoop-azure` in the list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in should be after space
|
|
||
| * Read and write data stored in an Azure Blob Storage account. | ||
| * Present a hierarchical file system view by implementing the standard Hadoop | ||
| [`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
full Hadoop Filesystem in italics ?
| The internal implementation also uses blobs to persist the file system | ||
| hierarchy and other metadata. | ||
| stored in a single container. | ||
| * **Blob**: A file of any type and size stored with the existing wasb connector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why wasb here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected
| After credentials are configured in core-site.xml, any Hadoop component may | ||
| reference files in that Azure Blob Storage account by using URLs of the following | ||
| format: | ||
| The concepts covered there are beyond the scope of this document to cover; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
, instead of ; ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks correct I think
| file path, or rely on some external locking mechanism of its own. Failure to do so will result in | ||
| unexpected behavior. | ||
| *Note*: The source of the account key can be changed through a custom key provider; | ||
| one exists to execute a shell script to retrieve it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one needs to ?
| </property> | ||
| <property> | ||
| <name>fs.azure.account.oauth2.client.endpoint</name> | ||
| <value></value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add the format here or something like endpoint instead of keeping it empty
| </property> | ||
| <property> | ||
| <name>fs.azure.account.oauth2.client.id</name> | ||
| <value></value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for client id and secret
| </property> | ||
| <property> | ||
| <name>fs.azure.account.oauth2.user.name</name> | ||
| <value></value> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same ad above
| To enable SAS key generation locally following property needs to be set to true. | ||
| ### <a name="oauth-refresh-token"></a> OAuth 2.0: Refresh Token | ||
|
|
||
| With an existing Oauth 2.0 token, make a request of the Active Directory endpoint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of -> to
| PUT/POST operations are idempotent and need no specific handling | ||
| except for Rename and Delete operations. | ||
|
|
||
| Rename idempotency checks are made by ensuring the LastModifiedTime on destination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add the recent changes made for rename and create idempotency ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bhattmanish98 could you help here with what to add?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@manika137 For now we can just remove these lines from this file.
except for Rename and Delete operations.
Rename idempotency checks are made by ensuring the LastModifiedTime on destination is recent if source path is found to be non-existent on retry.
In a later PR, we can add a description of how we achieved this for rename and create operations.
|
🎊 +1 overall
This message was automatically generated. |
…re (apache#7540) Updating documentations to reflect announcement for FNS-Blob Support and Wasb Deprecation Contributed by Manika Joshi. Reviwed by Anmol, Manish and Anuj. Signed off by Anuj Modi<[email protected]>
…re (apache#7540) Updating documentations to reflect announcement for FNS-Blob Support and Wasb Deprecation Contributed by Manika Joshi. Reviwed by Anmol, Manish and Anuj. Signed off by Anuj Modi<[email protected]>
…re (#7540) (#7595) Updating documentations to reflect announcement for FNS-Blob Support and Wasb Deprecation Contributed by Manika Joshi. Reviwed by Anmol, Manish and Anuj. Signed off by Anuj Modi<[email protected]>
Description of PR
JIRA ticket: https://issues.apache.org/jira/browse/HADOOP-19515
Fixing some typos, details and adding links for better readability in the .md files for ABFS driver.
How was this patch tested?
No production code change, no testing needed.