Skip to content
Merged
Show file tree
Hide file tree
Changes from 8 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -1503,6 +1503,16 @@ public class DFSConfigKeys extends CommonConfigurationKeys {
public static final boolean DFS_PROTECTED_SUBDIRECTORIES_ENABLE_DEFAULT =
false;

public static final String DFS_DATANODE_ALLOW_SAME_DISK_TIERING =
"dfs.datanode.same-disk-tiering.enabled";
public static final boolean DFS_DATANODE_ALLOW_SAME_DISK_TIERING_DEFAULT =
false;

public static final String DFS_DATANODE_RESERVE_FOR_ARCHIVE_PERCENTAGE =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take another look at the patch, I think it may be better to have the percentage as a tag added to the configuration "dfs.datanode.data.dir", just following the storage type tag. In this way on the same datanode we can have different percentage settings for different mount points. What do you think?

Copy link
Contributor Author

@LeonGao91 LeonGao91 Oct 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention is to have a configuration as a "default value" for all disks, as in normal cases one datanode server comes with the same type of HDDs. Therefore we can keep the DN configuration less verbose for most of the use cases.

However, you are right that we should allow users to configure different values, and it is a good idea to put it under "dfs.datanode.data.dir".
I will create a follow-up JIRA to address it, so we can keep this PR from being too big, as that could involve quite some change.

"dfs.datanode.reserve-for-archive.percentage";
public static final double
DFS_DATANODE_RESERVE_FOR_ARCHIVE_PERCENTAGE_DEFAULT = 0.0;

// dfs.client.retry confs are moved to HdfsClientConfigKeys.Retry
@Deprecated
public static final String DFS_CLIENT_RETRY_POLICY_ENABLED_KEY
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -190,6 +190,12 @@ public FsVolumeImpl getVolume(final ExtendedBlock b) {
}
}

// Get volume by device and storage type.
// Only used when turning on same disk tiering feature.
FsVolumeReference getVolumeRef(String device, StorageType storageType) {
return volumes.getVolumeRefByDeviceAndStorageType(device, storageType);
}

@Override // FsDatasetSpi
public Block getStoredBlock(String bpid, long blkid)
throws IOException {
Expand Down Expand Up @@ -365,7 +371,7 @@ public LengthInputStream getMetaDataInputStream(ExtendedBlock b)
RoundRobinVolumeChoosingPolicy.class,
VolumeChoosingPolicy.class), conf);
volumes = new FsVolumeList(volumeFailureInfos, datanode.getBlockScanner(),
blockChooserImpl);
blockChooserImpl, conf);
asyncDiskService = new FsDatasetAsyncDiskService(datanode, this);
asyncLazyPersistService = new RamDiskAsyncLazyPersistService(datanode, conf);
deletingBlock = new HashMap<String, Set<Long>>();
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -134,6 +134,9 @@ public class FsVolumeImpl implements FsVolumeSpi {
private final FileIoProvider fileIoProvider;
private final DataNodeVolumeMetrics metrics;
private URI baseURI;
private boolean enableSameDiskArchival;
private final String device;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about using storageID replace device? IMO both of them are in order to index single volume, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "device" here is the string value of the filesystem mount point. I wanted to use it to keep track of which two volumes are on the same mount (thus the same disk). Datanode can use the existing DF#getMount() to detect it automatically.
I can probably change the name to "mount" to make it more clear.

private double reservedForArchive;

/**
* Per-volume worker pool that processes new blocks to cache.
Expand Down Expand Up @@ -190,6 +193,26 @@ public class FsVolumeImpl implements FsVolumeSpi {
}
this.conf = conf;
this.fileIoProvider = fileIoProvider;
this.enableSameDiskArchival =
conf.getBoolean(DFSConfigKeys.DFS_DATANODE_ALLOW_SAME_DISK_TIERING,
DFSConfigKeys.DFS_DATANODE_ALLOW_SAME_DISK_TIERING_DEFAULT);
if (enableSameDiskArchival) {
this.device = usage.getMount();
reservedForArchive = conf.getDouble(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reservedForArchive try to define reserve for archive percentage. If there are heterogeneous disks located one node, do we need config them separate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a good point. The reason I put it this way is to make configuration less verbose for normal use cases that datanode only has one type of disk. Otherwise, users will need to tag all the disks which is less readable and easy to make mistakes.

I think we can introduce additional config for the use case you mentioned later, to list out each volume and target ratio.

DFSConfigKeys.DFS_DATANODE_RESERVE_FOR_ARCHIVE_PERCENTAGE,
DFSConfigKeys.DFS_DATANODE_RESERVE_FOR_ARCHIVE_PERCENTAGE_DEFAULT);
if (reservedForArchive >= 1) {
FsDatasetImpl.LOG.warn("Value of reserve-for-archival is >= 100% for "
+ currentDir + ". Setting it to 99%.");
reservedForArchive = 0.99;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why reservedForArchive has to less than 1 here, IIUC it means that this is ARCHIVE device when reservedForArchive set to 1. Right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think you are right, I will update and make this at most 1.

}
} else {
device = "";
}
}

String getDevice() {
return device;
}

protected ThreadPoolExecutor initializeCacheExecutor(File parent) {
Expand Down Expand Up @@ -412,16 +435,31 @@ long getBlockPoolUsed(String bpid) throws IOException {
*/
@VisibleForTesting
public long getCapacity() {
long capacity;
if (configuredCapacity < 0L) {
long remaining;
if (cachedCapacity > 0L) {
remaining = cachedCapacity - getReserved();
} else {
remaining = usage.getCapacity() - getReserved();
}
return Math.max(remaining, 0L);
capacity = Math.max(remaining, 0L);
} else {
capacity = configuredCapacity;
}

if (enableSameDiskArchival) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return value seems not expected as annotation says if enable this feature.

the capacity of the file system excluding space reserved for non-HDFS.

IMO, the part for ARCHIVE should also be calculated. It seems be not differentiated by NameNode for DISK or ARCHIVE per storage of DataNode. Please correct if something wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually the important part to enable this feature, to allow users to configure the capacity of a fsVolume.

As we are configuring two fsVolume on the same underlying filesystem, if we do nothing the capacity will be calculated twice thus all the stats being reported will be incorrect.

Here is an example:
Let's say we want to configure [DISK]/data01/dfs and [ARCHIVE]/data01/dfs_archive on a 4TB disk mount /data01, and we want to assign 1 TB to [DISK]/data01/dfs and 3 TB for [ARCHIVE]/data01/dfs_archive, we can make reservedForArchive to be 0.75 and put those two dirs in the volume list.

In this case, /data01/dfs will be reported as a 1TB volume and /data01/dfs_archive will be reported as 3TB volume to HDFS. Logically, HDFS will just treat them as two separate volumes.

If we don't make the change here, HDFS will see two volumes and each of them is 4TB, in that case, the 4TB disk will be counted as 4 * 2 = 8TB capacity in namenode and all the related stats will be wrong.

Another change we need to make is the getActualNonDfsUsed() as below. Let's say in the above 4TB disk setup we use 0.1TB as reserved, and [ARCHIVE]/data01/dfs_archive already has 2TB capacity used, in this case when we are calculating the getActualNonDfsUsed() for [DISK]/data01/dfs it will always return 0, which is not correct and it will cause other weird issues. As the two fsVolumes are on the same filesystem, the reserved space should be shared.

According to our analysis and cluster testing result, updating these two functions getCapacity() and getActualNonDfsUsed() is enough to keep stats correct for the two "logical" fsVolumes on same disk.

I can update the java doc to reflect this when the feature is turned on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @LeonGao91 for your detailed comments. It makes sense for me.

double reservedForArchival = conf.getDouble(
DFSConfigKeys.DFS_DATANODE_RESERVE_FOR_ARCHIVE_PERCENTAGE,
DFSConfigKeys.DFS_DATANODE_RESERVE_FOR_ARCHIVE_PERCENTAGE_DEFAULT);
if (storageType == StorageType.ARCHIVE) {
capacity = (long) (capacity * reservedForArchival);
} else {
capacity = (long) (capacity * (1 - reservedForArchival));
}
}
return configuredCapacity;

return capacity;
}

/**
Expand Down Expand Up @@ -452,7 +490,33 @@ public long getAvailable() throws IOException {
}

long getActualNonDfsUsed() throws IOException {
return usage.getUsed() - getDfsUsed();
// DISK and ARCHIVAL on same disk
// should share the same amount of reserved capacity.
// When calculating actual non dfs used,
// exclude DFS used capacity by another volume.
if (enableSameDiskArchival &&
(storageType == StorageType.DISK
|| storageType == StorageType.ARCHIVE)) {
StorageType counterpartStorageType = storageType == StorageType.DISK
? StorageType.ARCHIVE : StorageType.DISK;
FsVolumeReference counterpartRef = dataset
.getVolumeRef(device, counterpartStorageType);
if (counterpartRef != null) {
FsVolumeImpl counterpartVol = (FsVolumeImpl) counterpartRef.getVolume();
long used = getDfUsed() - getDfsUsed() - counterpartVol.getDfsUsed();
counterpartRef.close();
return used;
}
}
return getDfUsed() - getDfsUsed();
}

/**
* This function is only used for Mock.
*/
@VisibleForTesting
public long getDfUsed() {
return usage.getUsed();
}

private long getRemainingReserved() throws IOException {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -29,12 +29,14 @@
import java.util.Map;
import java.util.TreeMap;
import java.util.Set;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.StorageType;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.protocol.BlockListAsLongs;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeReference;
import org.apache.hadoop.hdfs.server.datanode.fsdataset.FsVolumeSpi;
Expand Down Expand Up @@ -62,9 +64,14 @@ class FsVolumeList {
private final VolumeChoosingPolicy<FsVolumeImpl> blockChooser;
private final BlockScanner blockScanner;

private boolean enableSameDiskTiering;
private ConcurrentMap<String, Map<StorageType, FsVolumeImpl>>
deviceVolumeMapping;

FsVolumeList(List<VolumeFailureInfo> initialVolumeFailureInfos,
BlockScanner blockScanner,
VolumeChoosingPolicy<FsVolumeImpl> blockChooser) {
VolumeChoosingPolicy<FsVolumeImpl> blockChooser,
Configuration config) {
this.blockChooser = blockChooser;
this.blockScanner = blockScanner;
this.checkDirsLock = new AutoCloseableLock();
Expand All @@ -73,6 +80,12 @@ class FsVolumeList {
volumeFailureInfos.put(volumeFailureInfo.getFailedStorageLocation(),
volumeFailureInfo);
}
enableSameDiskTiering = config.getBoolean(
DFSConfigKeys.DFS_DATANODE_ALLOW_SAME_DISK_TIERING,
DFSConfigKeys.DFS_DATANODE_ALLOW_SAME_DISK_TIERING_DEFAULT);
if (enableSameDiskTiering) {
deviceVolumeMapping = new ConcurrentHashMap<>();
}
}

/**
Expand All @@ -82,6 +95,29 @@ List<FsVolumeImpl> getVolumes() {
return Collections.unmodifiableList(volumes);
}

/**
* Get vol by device and storage type.
* This is used when same-disk-tiering is enabled.
*/
FsVolumeReference getVolumeRefByDeviceAndStorageType(String device,
StorageType storageType) {
if (deviceVolumeMapping != null
&& deviceVolumeMapping.containsKey(device)) {
try {
FsVolumeImpl volume = deviceVolumeMapping
.get(device).getOrDefault(storageType, null);
if (volume != null) {
return volume.obtainReference();
}
} catch (ClosedChannelException e) {
FsDatasetImpl.LOG.warn("Volume closed when getting volume" +
" by device and storage type: "
+ device + ", " + storageType);
}
}
return null;
}

private FsVolumeReference chooseVolume(List<FsVolumeImpl> list,
long blockSize, String storageId) throws IOException {
while (true) {
Expand Down Expand Up @@ -291,6 +327,23 @@ public String toString() {
void addVolume(FsVolumeReference ref) {
FsVolumeImpl volume = (FsVolumeImpl) ref.getVolume();
volumes.add(volume);
if (enableSameDiskTiering &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe consider putting this check into a method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, will do that

(volume.getStorageType() == StorageType.DISK
|| volume.getStorageType() == StorageType.ARCHIVE)) {
String device = volume.getDevice();
if (!device.isEmpty()) {
Map<StorageType, FsVolumeImpl> storageTypeMap =
deviceVolumeMapping
.getOrDefault(device, new ConcurrentHashMap<>());
if (storageTypeMap.containsKey(volume.getStorageType())) {
FsDatasetImpl.LOG.error("Found storage type already exist." +
" Skipping for now. Please check disk configuration");
} else {
storageTypeMap.put(volume.getStorageType(), volume);
deviceVolumeMapping.put(device, storageTypeMap);
}
}
}
if (blockScanner != null) {
blockScanner.addVolumeScanner(ref);
} else {
Expand All @@ -311,6 +364,18 @@ void addVolume(FsVolumeReference ref) {
*/
private void removeVolume(FsVolumeImpl target) {
if (volumes.remove(target)) {
if (enableSameDiskTiering &&
(target.getStorageType() == StorageType.DISK
|| target.getStorageType() == StorageType.ARCHIVE)) {
String device = target.getDevice();
if (!device.isEmpty()) {
Map storageTypeMap = deviceVolumeMapping.get(device);
storageTypeMap.remove(target.getStorageType());
if (storageTypeMap.isEmpty()) {
deviceVolumeMapping.remove(device);
}
}
}
if (blockScanner != null) {
blockScanner.removeVolumeScanner(target);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5975,4 +5975,13 @@
</description>
</property>

<property>
<name>dfs.datanode.same-disk-tiering.enabled</name>
<value>false</value>
</property>

<property>
<name>dfs.datanode.reserve-for-archive.percentage</name>
<value>0.0</value>
</property>
</configuration>
Loading