HADOOP-13708. Fix typos in *.md documents, some of which are suggested by Andrew Wang. #141

Closed
wants to merge 2 commits into from
@@ -35,7 +35,7 @@ Installation

Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster or installing it via a packaging system as appropriate for your operating system. It is important to divide up the hardware into functions.

Typically one machine in the cluster is designated as the NameNode and another machine the as ResourceManager, exclusively. These are the masters. Other services (such as Web App Proxy Server and MapReduce Job History server) are usually run either on dedicated hardware or on shared infrastrucutre, depending upon the load.
Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as Web App Proxy Server and MapReduce Job History server) are usually run either on dedicated hardware or on shared infrastrucutre, depending upon the load.

The rest of the machines in the cluster act as both DataNode and NodeManager. These are the workers.

@@ -68,15 +68,15 @@ Wire compatibility concerns data being transmitted over the wire between Hadoop
#### Use Cases

* Client-Server compatibility is required to allow users to continue using the old clients even after upgrading the server (cluster) to a later version (or vice versa). For example, a Hadoop 2.1.0 client talking to a Hadoop 2.3.0 cluster.
* Client-Server compatibility is also required to allow users to upgrade the client before upgrading the server (cluster). For example, a Hadoop 2.4.0 client talking to a Hadoop 2.3.0 cluster. This allows deployment of client-side bug fixes ahead of full cluster upgrades. Note that new cluster features invoked by new client APIs or shell commands will not be usable. YARN applications that attempt to use new APIs (including new fields in data structures) that have not yet deployed to the cluster can expect link exceptions.
* Client-Server compatibility is also required to allow users to upgrade the client before upgrading the server (cluster). For example, a Hadoop 2.4.0 client talking to a Hadoop 2.3.0 cluster. This allows deployment of client-side bug fixes ahead of full cluster upgrades. Note that new cluster features invoked by new client APIs or shell commands will not be usable. YARN applications that attempt to use new APIs (including new fields in data structures) that have not yet been deployed to the cluster can expect link exceptions.
* Client-Server compatibility is also required to allow upgrading individual components without upgrading others. For example, upgrade HDFS from version 2.1.0 to 2.2.0 without upgrading MapReduce.
* Server-Server compatibility is required to allow mixed versions within an active cluster so the cluster may be upgraded without downtime in a rolling fashion.

#### Policy

* Both Client-Server and Server-Server compatibility is preserved within a major release. (Different policies for different categories are yet to be considered.)
* Compatibility can be broken only at a major release, though breaking compatibility even at major releases has grave consequences and should be discussed in the Hadoop community.
* Hadoop protocols are defined in .proto (ProtocolBuffers) files. Client-Server protocols and Server-protocol .proto files are marked as stable. When a .proto file is marked as stable it means that changes should be made in a compatible fashion as described below:
* Hadoop protocols are defined in .proto (ProtocolBuffers) files. Client-Server protocols and Server-Server protocol .proto files are marked as stable. When a .proto file is marked as stable it means that changes should be made in a compatible fashion as described below:
* The following changes are compatible and are allowed at any time:
* Add an optional field, with the expectation that the code deals with the field missing due to communication with an older version of the code.
* Add a new rpc/method to the service
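
The first compatible change listed above depends on readers tolerating an absent field. The following is a minimal Python sketch of that pattern, not Hadoop code: it models a decoded message as a plain dict instead of using the ProtocolBuffers API, and the names `block_id`, `storage_type`, `read_block_report` and the default `"DISK"` are hypothetical.

```python
# Hypothetical illustration only: a decoded message is modelled as a dict,
# not as a real ProtocolBuffers object.
DEFAULT_STORAGE_TYPE = "DISK"  # assumed default for the newer optional field


def read_block_report(message):
    """Read a message that may come from an older sender.

    `block_id` stands for a field every version sends; `storage_type`
    stands for an optional field added in a later version.
    """
    block_id = message["block_id"]
    # An older sender never sets the optional field, so fall back to a default.
    storage_type = message.get("storage_type", DEFAULT_STORAGE_TYPE)
    return block_id, storage_type


old_sender = {"block_id": 42}
new_sender = {"block_id": 42, "storage_type": "SSD"}
print(read_block_report(old_sender))
print(read_block_report(new_sender))
```

Because the reader supplies its own default, a newer server can accept messages from older clients without any change on the client side.
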
@@ -101,7 +101,7 @@ Wire compatibility concerns data being transmitted over the wire between Hadoop

### Java Binary compatibility for end-user applications i.e. Apache Hadoop ABI

As Apache Hadoop revisions are upgraded end-users reasonably expect that their applications should continue to work without any modifications. This is fulfilled as a result of support API compatibility, Semantic compatibility and Wire compatibility.
As Apache Hadoop revisions are upgraded end-users reasonably expect that their applications should continue to work without any modifications. This is fulfilled as a result of supporting API compatibility, Semantic compatibility and Wire compatibility.

However, Apache Hadoop is a very complex, distributed system and services a very wide variety of use-cases. In particular, Apache Hadoop MapReduce is a very, very wide API; in the sense that end-users may make wide-ranging assumptions such as layout of the local disk when their map/reduce tasks are executing, environment variables for their tasks etc. In such cases, it becomes very hard to fully specify, and support, absolute compatibility.

@@ -115,12 +115,12 @@ However, Apache Hadoop is a very complex, distributed system and services a very

* Existing MapReduce, YARN & HDFS applications and frameworks should work unmodified within a major release i.e. Apache Hadoop ABI is supported.
* A very minor fraction of applications maybe affected by changes to disk layouts etc., the developer community will strive to minimize these changes and will not make them within a minor version. In more egregious cases, we will consider strongly reverting these breaking changes and invalidating offending releases if necessary.
* In particular for MapReduce applications, the developer community will try our best to support provide binary compatibility across major releases e.g. applications using org.apache.hadoop.mapred.
* In particular for MapReduce applications, the developer community will try our best to support providing binary compatibility across major releases e.g. applications using org.apache.hadoop.mapred.
* APIs are supported compatibly across hadoop-1.x and hadoop-2.x. See [Compatibility for MapReduce applications between hadoop-1.x and hadoop-2.x](../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html) for more details.

### REST APIs

REST API compatibility corresponds to both the request (URLs) and responses to each request (content, which may contain other URLs). Hadoop REST APIs are specifically meant for stable use by clients across releases, even major releases. The following are the exposed REST APIs:
REST API compatibility corresponds to both the requests (URLs) and responses to each request (content, which may contain other URLs). Hadoop REST APIs are specifically meant for stable use by clients across releases, even major ones. The following are the exposed REST APIs:

* [WebHDFS](../hadoop-hdfs/WebHDFS.html) - Stable
* [ResourceManager](../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html)
@@ -136,7 +136,7 @@ The APIs annotated stable in the text above preserve compatibility across at lea

### Metrics/JMX

While the Metrics API compatibility is governed by Java API compatibility, the actual metrics exposed by Hadoop need to be compatible for users to be able to automate using them (scripts etc.). Adding additional metrics is compatible. Modifying (eg changing the unit or measurement) or removing existing metrics breaks compatibility. Similarly, changes to JMX MBean object names also break compatibility.
While the Metrics API compatibility is governed by Java API compatibility, the actual metrics exposed by Hadoop need to be compatible for users to be able to automate using them (scripts etc.). Adding additional metrics is compatible. Modifying (e.g. changing the unit or measurement) or removing existing metrics breaks compatibility. Similarly, changes to JMX MBean object names also break compatibility.
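
To illustrate why renaming a metric or changing its unit breaks automation, here is a small Python sketch of the kind of monitoring script the paragraph has in mind. The metric names `CapacityUsedGB` and `BlocksTotal`, the threshold, and the dict-shaped snapshot are all hypothetical; they are not actual Hadoop metrics output.

```python
# Hypothetical metric snapshot; a real script would fetch it from a metrics endpoint.
def capacity_alert(metrics, threshold_gb=900):
    """Return True when used capacity exceeds the threshold.

    The script hard-codes both the metric name and its unit (GB), which is
    why renaming the metric or changing its unit would break it.
    """
    used_gb = metrics["CapacityUsedGB"]  # KeyError if the metric is renamed or removed
    return used_gb > threshold_gb


snapshot = {"CapacityUsedGB": 950, "BlocksTotal": 120000}
print(capacity_alert(snapshot))

# Adding a metric is compatible: the script ignores keys it does not know.
snapshot["SomeNewMetric"] = 1
print(capacity_alert(snapshot))
```

A unit change is the more dangerous case in this sketch, since the lookup still succeeds and the script silently compares the wrong quantity against its threshold.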

#### Policy

@@ -148,7 +148,7 @@ User and system level data (including metadata) is stored in files of different

#### User-level file formats

Changes to formats that end-users use to store their data can prevent them for accessing the data in later releases, and hence it is highly important to keep those file-formats compatible. One can always add a "new" format improving upon an existing format. Examples of these formats include har, war, SequenceFileFormat etc.
Changes to formats that end-users use to store their data can prevent them from accessing the data in later releases, and hence it is highly important to keep those file-formats compatible. One can always add a "new" format improving upon an existing format. Examples of these formats include har, war, SequenceFileFormat etc.

##### Policy

@@ -185,7 +185,7 @@ Depending on the degree of incompatibility in the changes, the following potenti

### Command Line Interface (CLI)

The Hadoop command line programs may be use either directly via the system shell or via shell scripts. Changing the path of a command, removing or renaming command line options, the order of arguments, or the command return code and output break compatibility and may adversely affect users.
The Hadoop command line programs may be used either directly via the system shell or via shell scripts. Changing the path of a command, removing or renaming command line options, the order of arguments, or the command return code and output break compatibility and may adversely affect users.

#### Policy

@@ -44,15 +44,15 @@ Interfaces have two main attributes: Audience and Stability

Audience denotes the potential consumers of the interface. While many interfaces
are internal/private to the implementation, other are public/external interfaces
are meant for wider consumption by applications and/or clients. For example, in
that are meant for wider consumption by applications and/or clients. For example, in
posix, libc is an external or public interface, while large parts of the kernel
are internal or private interfaces. Also, some interfaces are targeted towards
other specific subsystems.

Identifying the audience of an interface helps define the impact of breaking
it. For instance, it might be okay to break the compatibility of an interface
whose audience is a small number of specific subsystems. On the other hand, it
is probably not okay to break a protocol interfaces that millions of Internet
is probably not okay to break a protocol interface that millions of Internet
users depend on.

Hadoop uses the following kinds of audience in order of increasing/wider visibility:
@@ -75,7 +75,7 @@ referred to as project-private).

The interface is used by a specified set of projects or systems (typically
closely related projects). Other projects or systems should not use the
interface. Changes to the interface will be communicated/ negotiated with the
interface. Changes to the interface will be communicated/negotiated with the
specified projects. For example, in the Hadoop project, some interfaces are
LimitedPrivate{HDFS, MapReduce} in that they are private to the HDFS and
MapReduce projects.
@@ -92,28 +92,28 @@ the interface are allowed. Hadoop APIs have the following levels of stability.
#### Stable

Can evolve while retaining compatibility for minor release boundaries; in other
words, incompatible changes to APIs marked Stable are allowed only at major
words, incompatible changes to APIs marked as Stable are allowed only at major
releases (i.e. at m.0).

#### Evolving

Evolving, but incompatible changes are allowed at minor release (i.e. m .x)
Evolving, but incompatible changes are allowed at minor releases (i.e. m .x)

#### Unstable

Incompatible changes to Unstable APIs are allowed any time. This usually makes
Incompatible changes to Unstable APIs are allowed at any time. This usually makes
sense for only private interfaces.

However one may call this out for a supposedly public interface to highlight
that it should not be used as an interface; for public interfaces, labeling it
as Not-an-interface is probably more appropriate than "Unstable".

Examples of publicly visible interfaces that are unstable
(i.e. not-an-interface): GUI, CLIs whose output format will change
(i.e. not-an-interface): GUI, CLIs whose output format will change.

#### Deprecated

APIs that could potentially removed in the future and should not be used.
APIs that could potentially be removed in the future and should not be used.
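
The Stable, Evolving and Unstable levels described above can be summarised as a small decision function. This is a sketch in Python under stated assumptions, not part of Hadoop: the helper name `incompatible_change_allowed` and the `(major, minor, patch)` tuple representation are hypothetical, and the sketch assumes that anything allowed at a minor release is also allowed at a major one. Deprecated is not a stability level in this sense, so the helper rejects it.

```python
def incompatible_change_allowed(stability, old_version, new_version):
    """Decide whether an incompatible API change may ship in new_version.

    Versions are (major, minor, patch) tuples. The rules paraphrase the
    stability levels in the text; this helper is hypothetical.
    """
    major_bump = new_version[0] > old_version[0]
    # Assumption: a major release also counts as a minor release boundary.
    minor_bump = major_bump or new_version[1] > old_version[1]
    if stability == "Unstable":
        return True        # allowed at any time
    if stability == "Evolving":
        return minor_bump  # allowed at minor releases (m.x)
    if stability == "Stable":
        return major_bump  # allowed only at major releases (m.0)
    raise ValueError(f"unknown stability level: {stability}")


print(incompatible_change_allowed("Stable", (2, 3, 0), (2, 4, 0)))
print(incompatible_change_allowed("Evolving", (2, 3, 0), (2, 4, 0)))
```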

How are the Classifications Recorded?
-------------------------------------
@@ -155,13 +155,13 @@ FAQ
* e.g. In HDFS, NN-DN protocol is private but stable and can help
implement rolling upgrades. It communicates that this interface should
not be changed in incompatible ways even though it is private.
* e.g. In HDFS, FSImage stability can help provide more flexible roll backs.
* e.g. In HDFS, FSImage stability provides more flexible rollback.

* What is the harm in applications using a private interface that is stable? How
is it different than a public stable interface?
* While a private interface marked as stable is targeted to change only at
major releases, it may break at other times if the providers of that
interface are willing to changes the internal users of that
interface are willing to change the internal users of that
interface. Further, a public stable interface is less likely to break even
at major releases (even though it is allowed to break compatibility)
because the impact of the change is larger. If you use a private interface
@@ -182,11 +182,11 @@ FAQ
away with private then do so; if the interface is really for general use
for all applications then do so. But remember that making an interface
public has huge responsibility. Sometimes Limited-private is just right.
* A good example of a limited-private interface is BlockLocations, This is
* A good example of a limited-private interface is BlockLocations, This is a
fairly low-level interface that we are willing to expose to MR and perhaps
HBase. We are likely to change it down the road and at that time we will
have get a coordinated effort with the MR team to release matching
releases. While MR and HDFS are always released in sync today, they may
coordinate release effort with the MR team.
While MR and HDFS are always released in sync today, they may
change down the road.
* If you have a limited-private interface with many projects listed then you
are fooling yourself. It is practically public.
@@ -207,7 +207,7 @@ FAQ
break it at minor releases.
* One example of a public interface that is unstable is where one is
providing an implementation of a standards-body based interface that is
still under development. For example, many companies, in an attampt to be
still under development. For example, many companies, in an attempt to be
first to market, have provided implementations of a new NFS protocol even
when the protocol was not fully completed by IETF. The implementor cannot
evolve the interface in a fashion that causes least distruption because
@@ -35,7 +35,7 @@ of the client.

**Implementation Note**: the static `FileSystem get(URI uri, Configuration conf) ` method MAY return
a pre-existing instance of a filesystem client class—a class that may also be in use in other threads.
The implementations of `FileSystem` which ship with Apache Hadoop
The implementations of `FileSystem` shipped with Apache Hadoop
*do not make any attempt to synchronize access to the working directory field*.

## Invariants
@@ -105,8 +105,6 @@ may differ from the local user account name.
of the caller.**


#### Preconditions


#### Postconditions

@@ -214,7 +212,6 @@ response, then, if a listing `listStatus("/d")` takes place concurrently with th

[a, part-0000001, ... , part-9999999]
[part-0000001, ... , part-9999999, z]

[a, part-0000001, ... , part-9999999, z]
[part-0000001, ... , part-9999999]

@@ -282,7 +279,7 @@ value is an instance of the `LocatedFileStatus` subclass of a `FileStatus`,
and that rather than return an entire list, an iterator is returned.

This is actually a `protected` method, directly invoked by
`listLocatedStatus(Path path):`. Calls to it may be delegated through
`listLocatedStatus(Path path)`. Calls to it may be delegated through
layered filesystems, such as `FilterFileSystem`, so its implementation MUST
be considered mandatory, even if `listLocatedStatus(Path path)` has been
implemented in a different manner. There are open JIRAs proposing
@@ -442,10 +439,9 @@ the convention is generally retained.

### `long getDefaultBlockSize()`

Get the "default" block size for a filesystem. This often used during
Get the "default" block size for a filesystem. This is often used during
split calculations to divide work optimally across a set of worker processes.
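
As a rough illustration of how a block size feeds into split calculations, the Python sketch below cuts a file into block-aligned `(offset, length)` ranges. It is a simplified model rather than the logic of any Hadoop input format; the function name `compute_splits` is hypothetical, and the 128 MiB value merely stands in for whatever `getDefaultBlockSize()` would return.

```python
def compute_splits(file_length, block_size):
    """Return (offset, length) pairs covering a file, one per block."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    splits = []
    offset = 0
    while offset < file_length:
        # The final split may be shorter than a full block.
        length = min(block_size, file_length - offset)
        splits.append((offset, length))
        offset += length
    return splits


MIB = 1024 * 1024
# A 300 MiB file with an assumed 128 MiB block size.
for offset, length in compute_splits(300 * MIB, 128 * MIB):
    print(offset // MIB, length // MIB)
```

Each range can then be handed to a separate worker, so a worker reads at most one block's worth of data.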

#### Preconditions

#### Postconditions

@@ -466,8 +462,6 @@ A FileSystem MAY make this user-configurable (the S3 and Swift filesystem client
Get the "default" block size for a path —that is, the block size to be used
when writing objects to a path in the filesystem.

#### Preconditions


#### Postconditions

@@ -604,7 +598,7 @@ This MAY be a bug, as it allows >1 client to create a file with `overwrite==fals
and potentially confuse file/directory logic

* The Local FileSystem raises a `FileNotFoundException` when trying to create a file over
a directory, hence it is is listed as an exception that MAY be raised when
a directory, hence it is listed as an exception that MAY be raised when
this precondition fails.

* Not covered: symlinks. The resolved path of the symlink is used as the final path argument to the `create()` operation
@@ -898,7 +892,7 @@ Renaming a file where the destination is a directory moves the file as a child
##### Renaming a directory onto a directory

If `src` is a directory then all its children will then exist under `dest`, while the path
`src` and its descendants will no longer not exist. The names of the paths under
`src` and its descendants will no longer exist. The names of the paths under
`dest` will match those under `src`, as will the contents:

if isDir(FS, src) isDir(FS, dest) and src != dest :
@@ -928,7 +922,7 @@ The outcome is no change to FileSystem state, with a return value of false.
*Local Filesystem, S3N*

The outcome is as a normal rename, with the additional (implicit) feature
that the parent directores of the destination also exist
that the parent directores of the destination also exist.

exists(FS', parent(dest))

@@ -1018,9 +1012,9 @@ HDFS: All source files except the final one MUST be a complete block:


HDFS's restrictions may be an implementation detail of how it implements
`concat` -by changing the inode references to join them together in
`concat` by changing the inode references to join them together in
a sequence. As no other filesystem in the Hadoop core codebase
implements this method, there is no way to distinguish implementation detail.
implements this method, there is no way to distinguish implementation detail
from specification.

