Skip to content

Conversation

@manika137
Copy link
Contributor

@manika137 manika137 commented Mar 26, 2025

Description of PR

JIRA ticket: https://issues.apache.org/jira/browse/HADOOP-19515
Fixing some typos, details and adding links for better readability in the .md files for ABFS driver.

How was this patch tested?

No production code change, no testing needed.

@manika137 manika137 changed the title HADOOP-19515. ABFS: Modifications in .md Files HADOOP-19515. ABFS: Updating Documentations of Hadoop Drivers for Azure Mar 26, 2025
@hadoop-yetus

This comment was marked as outdated.

@hadoop-yetus

This comment was marked as outdated.

Copy link
Contributor

@anujmodi2021 anujmodi2021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments

@hadoop-yetus

This comment was marked as outdated.

Copy link
Contributor

@anujmodi2021 anujmodi2021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@hadoop-yetus

This comment was marked as outdated.

@hadoop-yetus

This comment was marked as outdated.

@hadoop-yetus

This comment was marked as outdated.

the credential provider framework to securely store them and access them
through configuration. The following describes its use for Azure credentials
in WASB FileSystem.
_You cannot enable Hierarchical Namespaces on an existing storage account_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of _ in the start and end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for italics

A key aspect of ADLS Gen 2 is its support for
[hierarchical namespaces](https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace)
These are effectively directories and offer high performance rename and delete operations
—something which makes a significant improvement in performance in query engines
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give space between —something -> — something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken

For additional reading on the credential provider API see:
[Credential Provider API](../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html).
_**Containers in a storage account with Hierarchical Namespaces are
not (currently) readable through the `deprecated wasb:` connector.**_
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need COLON here? deprecated wasb:

Copy link
Contributor Author

@manika137 manika137 Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was present in the earlier doc as well. The colon was to show the URI scheme
But yes, would separate the words deprecated and wasb:


For additional reading on the credential provider API see:
[Credential Provider API](../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html).
_**Containers in a storage account with Hierarchical Namespaces are
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of _ in the start and end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for italics

% hadoop credential create fs.azure.account.key.youraccount.blob.core.windows.net -value 123
-provider localjceks://file/home/lmccay/wasb.jceks
$ az storage container list --account-name abfswales1
Blob API is not yet supported for hierarchical namespace accounts. ErrorCode: BlobApiNotYetSupportedForHierarchicalNamespaceAccounts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line should be outside bash.

Copy link
Contributor Author

@manika137 manika137 Apr 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, taken

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this done?


###### distcp
The [az storage](https://docs.microsoft.com/en-us/cli/azure/storage?view=azure-cli-latest) subcommand
handles all storage commands, [`az storage account create`](https://docs.microsoft.com/en-us/cli/azure/storage/account?view=azure-cli-latest#az-storage-account-create)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to keep it [az storage account create] without square brackets (az storage account create).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think square brackets were for hyperlink?
Its .md syntax

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think he meant that we remove the escape quotes so the hyperlink is visible clearly
like keep it [az storage account create] instead of [az storage account create]

added job specific configuration to a generic core-site.xml. The square brackets above illustrate
this capability.
Check that all is well by verifying that the usage command includes `--hierarchical-namespace`:
```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be ```bash?

```

#### Protecting the Azure Credentials for WASB within an Encrypted File
You can list locations from `az account list-locations`, which lists the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In future, there can be addition or deletion of few of the locations, do you want to hardcode all locations here or just it would be better to write 2-3 locations and mark ... for others. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Added the output as a small sample table

```bash
az storage account show-connection-string --name abfswales1
{
"connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA=="
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this AccountKey belongs to actual account? We should mask this key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. It was present in the earlier doc as well so been long
The key would have had expired or rotated now I believe. They might have created it for a temporary storage account

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but better to remove it now

<name>fs.azure.shellkeyprovider.script</name>
<value>PATH TO DECRYPTION PROGRAM</value>
<name>fs.azure.account.key.abfswales1.dfs.core.windows.net</name>
<value>ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==</value>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key would have expired/rotated. (Temporary) account might have been deleted.

```

You then need to add the access key to your `core-site.xml`, JCEKs file or
use your cluster management tool to set it the option `fs.azure.account.key.STORAGE-ACCOUNT`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like STORAGE-ACCOUNT is part of the string itself. We can write it something like this: fs.azure.account.key.<STORAGE-ACCOUNT>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine. We have been following this format everywhere.

For most of the cases, combining data from multiple `write()` calls in
blocks of 4Mb is a good optimization. But, in others cases, like HBase log files,
every call to `hflush()` or `hsync()` must upload the data to the service.
Key Steps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Key Steps (Key Steps in bold)

`/`hsync()` runs once compaction process, if number of blocks in the blob
is above 32,000.
1. Create a new Storage Account in a location which suits you.
1. "Basics" Tab: select "StorageV2".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numbering correction 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in md file it shows list with same number only, @anujmodi2021 you had pointed this out earlier

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct.
We just need to specify 1. for everyline and rendering will automatically put right numbers

is above 32,000.
1. Create a new Storage Account in a location which suits you.
1. "Basics" Tab: select "StorageV2".
1. "Advanced" Tab: enable "Hierarchical Namespace".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, numbering correction

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed

`fs.azure.block.blob.with.compaction.dir` to a comma-separated list of
folder names.
1. Go to the Azure Portal.
1. Select "Storage Accounts"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, numbering correction for all below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed

Block compaction search and replaces a sequence of small blocks with one big
block. That means there is associated cost with block compaction: reading
small blocks back to the client and writing it again as one big block.
You have now created your storage account. Next, get the key for authentication
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sentence doesn't sound correct. Please check

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified

See also:

* [WASB Deprecation](./wasb.html)
* [ABFS](./index.html)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we rename index.md with something better ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the landing page has to be named index.md.
Its same for all the sites in hadoop repo

[Azure Storage SDK for Java](https://github.com/Azure/azure-storage-java).

To make it part of Apache Hadoop's default classpath, simply make sure that
`HADOOP_OPTIONAL_TOOLS`in `hadoop-env.sh` has `'hadoop-azure` in the list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in should be after space


* Read and write data stored in an Azure Blob Storage account.
* Present a hierarchical file system view by implementing the standard Hadoop
[`FileSystem`](../api/org/apache/hadoop/fs/FileSystem.html) interface.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full Hadoop Filesystem in italics ?

The internal implementation also uses blobs to persist the file system
hierarchy and other metadata.
stored in a single container.
* **Blob**: A file of any type and size stored with the existing wasb connector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why wasb here ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected

After credentials are configured in core-site.xml, any Hadoop component may
reference files in that Azure Blob Storage account by using URLs of the following
format:
The concepts covered there are beyond the scope of this document to cover;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, instead of ; ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks correct I think

file path, or rely on some external locking mechanism of its own. Failure to do so will result in
unexpected behavior.
*Note*: The source of the account key can be changed through a custom key provider;
one exists to execute a shell script to retrieve it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one needs to ?

</property>
<property>
<name>fs.azure.account.oauth2.client.endpoint</name>
<value></value>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the format here or something like endpoint instead of keeping it empty

</property>
<property>
<name>fs.azure.account.oauth2.client.id</name>
<value></value>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for client id and secret

</property>
<property>
<name>fs.azure.account.oauth2.user.name</name>
<value></value>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same ad above

To enable SAS key generation locally following property needs to be set to true.
### <a name="oauth-refresh-token"></a> OAuth 2.0: Refresh Token

With an existing Oauth 2.0 token, make a request of the Active Directory endpoint
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of -> to

PUT/POST operations are idempotent and need no specific handling
except for Rename and Delete operations.

Rename idempotency checks are made by ensuring the LastModifiedTime on destination
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add the recent changes made for rename and create idempotency ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bhattmanish98 could you help here with what to add?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@manika137 For now we can just remove these lines from this file.

except for Rename and Delete operations.
Rename idempotency checks are made by ensuring the LastModifiedTime on destination is recent if source path is found to be non-existent on retry.

In a later PR, we can add a description of how we achieved this for rename and create operations.

@hadoop-yetus
Copy link

🎊 +1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 34s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+0 🆗 markdownlint 0m 0s markdownlint was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
_ trunk Compile Tests _
+1 💚 mvninstall 39m 15s trunk passed
+1 💚 mvnsite 0m 44s trunk passed
+1 💚 shadedclient 80m 47s branch has no errors when building and testing our client artifacts.
-0 ⚠️ patch 81m 12s Used diff version of patch file. Binary files and potentially other changes not applied. Please rebase and squash commits if necessary.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 39s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 mvnsite 0m 44s the patch passed
+1 💚 shadedclient 40m 9s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 asflicense 0m 35s The patch does not generate ASF License warnings.
124m 21s
Subsystem Report/Notes
Docker ClientAPI=1.48 ServerAPI=1.48 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7540/7/artifact/out/Dockerfile
GITHUB PR #7540
Optional Tests dupname asflicense mvnsite codespell detsecrets markdownlint
uname Linux 43c894518105 5.15.0-131-generic #141-Ubuntu SMP Fri Jan 10 21:18:28 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 4214942
Max. process+thread count 736 (vs. ulimit of 5500)
modules C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7540/7/console
versions git=2.25.1 maven=3.6.3
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@anujmodi2021 anujmodi2021 merged commit 50553de into apache:trunk Apr 4, 2025
4 checks passed
YanivKunda pushed a commit to YanivKunda/hadoop that referenced this pull request Apr 4, 2025
…re (apache#7540)

Updating documentations to reflect announcement for FNS-Blob Support and Wasb Deprecation
Contributed by Manika Joshi.
Reviwed by Anmol, Manish and Anuj.

Signed off by Anuj Modi<[email protected]>
manika137 added a commit to ABFSDriver/AbfsHadoop that referenced this pull request Apr 10, 2025
…re (apache#7540)

Updating documentations to reflect announcement for FNS-Blob Support and Wasb Deprecation
Contributed by Manika Joshi.
Reviwed by Anmol, Manish and Anuj.

Signed off by Anuj Modi<[email protected]>
anujmodi2021 pushed a commit that referenced this pull request Apr 10, 2025
…re (#7540) (#7595)

Updating documentations to reflect announcement for FNS-Blob Support and Wasb Deprecation
Contributed by Manika Joshi.
Reviwed by Anmol, Manish and Anuj.

Signed off by Anuj Modi<[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants