added smith waterman algorithm #9001
Merged
12 commits (this view shows changes from 10 commits)
4672878 added smith waterman algorithm (BAW2501)
44314e4 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
65b95a6 descriptive names for the parameters a and b (BAW2501)
8e56b7e Merge branch 'master' of https://github.com/BAW2501/Python (BAW2501)
d8a6bcb doctesting lowercase upcase empty string cases (BAW2501)
fc58801 updated block quot,fixed traceback and doctests (BAW2501)
0662f69 shorter block quote (BAW2501)
892858a global vars to func params,more doctests (BAW2501)
37d7fed Merge branch 'master' of https://github.com/BAW2501/Python (BAW2501)
0e199f7 updated doctests (BAW2501)
2729a57 user access to SW params (BAW2501)
2a3e20a formating (BAW2501)
@@ -0,0 +1,185 @@
"""
https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm
The Smith-Waterman algorithm is a dynamic programming algorithm used for sequence
alignment. It is particularly useful for finding similarities between two sequences,
such as DNA or protein sequences. In this implementation, gaps are penalized
linearly, meaning that the score is reduced by a fixed amount for each gap introduced
in the alignment. However, it's important to note that the Smith-Waterman algorithm
supports other gap penalty methods as well.
"""


def score_function(
    source_char: str,
    target_char: str,
    match: int = 1,
    mismatch: int = -1,
    gap: int = -2,
) -> int:
    """
    Calculate the score for a character pair based on whether they match or mismatch.
    Returns `match` if the characters are equal, `mismatch` if they differ, and `gap`
    if either of the characters is a gap ('-'). The defaults are 1, -1 and -2.
    >>> score_function('A', 'A')
    1
    >>> score_function('A', 'C')
    -1
    >>> score_function('-', 'A')
    -2
    >>> score_function('A', '-')
    -2
    >>> score_function('-', '-')
    -2
    """
    if "-" in (source_char, target_char):
        return gap
    return match if source_char == target_char else mismatch


def smith_waterman(query: str, subject: str) -> list[list[int]]:
    """
    Perform the Smith-Waterman local sequence alignment algorithm.
    Returns a 2D list representing the score matrix. Each value in the matrix
    corresponds to the score of the best local alignment ending at that point.

    >>> smith_waterman('ACAC', 'CA')
    [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 2], [0, 1, 0]]

    >>> smith_waterman('acac', 'ca')
    [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 2], [0, 1, 0]]

    >>> smith_waterman('ACAC', 'ca')
    [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 2], [0, 1, 0]]

    >>> smith_waterman('acac', 'CA')
    [[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 2], [0, 1, 0]]

    >>> smith_waterman('ACAC', '')
    [[0], [0], [0], [0], [0]]

    >>> smith_waterman('', 'CA')
    [[0, 0, 0]]

    >>> smith_waterman('AGT', 'AGT')
    [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 2, 0], [0, 0, 0, 3]]

    >>> smith_waterman('AGT', 'GTA')
    [[0, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 2, 0]]

    >>> smith_waterman('AGT', 'GTC')
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 2, 0]]

    >>> smith_waterman('AGT', 'G')
    [[0, 0], [0, 0], [0, 1], [0, 0]]

    >>> smith_waterman('G', 'AGT')
    [[0, 0, 0, 0], [0, 0, 1, 0]]

    >>> smith_waterman('AGT', 'AGTCT')
    [[0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 2, 0, 0, 0], [0, 0, 0, 3, 1, 1]]

    >>> smith_waterman('AGTCT', 'AGT')
    [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 2, 0], [0, 0, 0, 3], [0, 0, 0, 1], [0, 0, 0, 1]]

    >>> smith_waterman('AGTCT', 'GTC')
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 2, 0], [0, 0, 0, 3], [0, 0, 1, 1]]
    """
    # make both query and subject uppercase
    query = query.upper()
    subject = subject.upper()

    # Initialize score matrix
    m = len(query)
    n = len(subject)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    gap = score_function("-", "-")

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Calculate scores for each cell
            match = score[i - 1][j - 1] + score_function(query[i - 1], subject[j - 1])
            delete = score[i - 1][j] + gap
            insert = score[i][j - 1] + gap

            # Take maximum score
            score[i][j] = max(0, match, delete, insert)

    return score


def traceback(score: list[list[int]], query: str, subject: str) -> str:
    r"""
    Perform traceback to find the optimal local alignment.
    Starts from the highest scoring cell in the matrix and traces back iteratively
    until a 0 score is found. Returns the alignment strings.
    >>> traceback([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 2], [0, 1, 0]], 'ACAC', 'CA')
    'CA\nCA'
    >>> traceback([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 2], [0, 1, 0]], 'acac', 'ca')
    'CA\nCA'
    >>> traceback([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 2], [0, 1, 0]], 'ACAC', 'ca')
    'CA\nCA'
    >>> traceback([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 2], [0, 1, 0]], 'acac', 'CA')
    'CA\nCA'
    >>> traceback([[0], [0], [0], [0], [0]], 'ACAC', '')
    ''
    """
    # make both query and subject uppercase
    query = query.upper()
    subject = subject.upper()
    # find the indices of the maximum value in the score matrix
    max_value = float("-inf")
    i_max = j_max = 0
    for i, row in enumerate(score):
        for j, value in enumerate(row):
            if value > max_value:
                max_value = value
                i_max, j_max = i, j
    # Traceback logic to find optimal alignment
    i = i_max
    j = j_max
    align1 = ""
    align2 = ""
    gap = score_function("-", "-")
    # guard against an empty query or subject, or a matrix with no positive score
    if i == 0 or j == 0:
        return ""
    # stop at the first 0 score, which marks the start of the local alignment
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + score_function(
            query[i - 1], subject[j - 1]
        ):
            # optimal path is diagonal: take both letters
            align1 = query[i - 1] + align1
            align2 = subject[j - 1] + align2
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            # optimal path is vertical: gap in the subject
            align1 = query[i - 1] + align1
            align2 = f"-{align2}"
            i -= 1
        else:
            # optimal path is horizontal: gap in the query
            align1 = f"-{align1}"
            align2 = subject[j - 1] + align2
            j -= 1

    return f"{align1}\n{align2}"


if __name__ == "__main__":
    query = "HEAGAWGHEE"
    subject = "PAWHEAE"

    score = smith_waterman(query, subject)
    print(traceback(score, query, subject))
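The module docstring mentions that Smith-Waterman also works with gap penalty schemes other than the linear one used here. As a rough illustration only, and not part of this PR, an affine gap penalty (the Gotoh variant) could be sketched as follows. The function name `affine_local_score` and the `gap_open` / `gap_extend` parameters are assumptions made for this example; `gap_open` is charged for the first character of a gap and `gap_extend` for each further one.

```python
def affine_local_score(
    query: str,
    subject: str,
    match: int = 1,
    mismatch: int = -1,
    gap_open: int = -2,
    gap_extend: int = -1,
) -> int:
    """Return the best local alignment score under an affine gap penalty.

    Hypothetical helper for illustration; it only returns the score and
    does not reconstruct the alignment.
    """
    query = query.upper()
    subject = subject.upper()
    m, n = len(query), len(subject)
    neg_inf = float("-inf")

    # best[i][j]: best local alignment ending at (i, j), in any state
    best = [[0] * (n + 1) for _ in range(m + 1)]
    # gap_in_query[i][j]: best alignment ending with a gap in the query
    gap_in_query = [[neg_inf] * (n + 1) for _ in range(m + 1)]
    # gap_in_subject[i][j]: best alignment ending with a gap in the subject
    gap_in_subject = [[neg_inf] * (n + 1) for _ in range(m + 1)]

    top = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # either open a new gap or extend the one already running
            gap_in_query[i][j] = max(
                best[i][j - 1] + gap_open, gap_in_query[i][j - 1] + gap_extend
            )
            gap_in_subject[i][j] = max(
                best[i - 1][j] + gap_open, gap_in_subject[i - 1][j] + gap_extend
            )
            pair = match if query[i - 1] == subject[j - 1] else mismatch
            best[i][j] = max(
                0,
                best[i - 1][j - 1] + pair,
                gap_in_query[i][j],
                gap_in_subject[i][j],
            )
            top = max(top, best[i][j])
    return top
```

With `gap_open == gap_extend` this reduces to the linear penalty used in the PR. For example, `affine_local_score("AAAAGGGG", "AAAATTGGGG")` bridges the two-character gap at a cost of 3 instead of 4.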
Could you add optional parameters for the score constants in this function's header as well? Users will call this function specifically to run the algorithm, and currently there's no way for the user to pass in their desired score constants into this function.
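One way the request above might look, as a minimal sketch rather than the code that was eventually merged: `smith_waterman` takes the same three optional constants as `score_function` and forwards them. The parameter names mirror `score_function`; everything else is an assumption made for illustration.

```python
def score_function(
    source_char: str,
    target_char: str,
    match: int = 1,
    mismatch: int = -1,
    gap: int = -2,
) -> int:
    # same scoring rule as in the diff above
    if "-" in (source_char, target_char):
        return gap
    return match if source_char == target_char else mismatch


def smith_waterman(
    query: str,
    subject: str,
    match: int = 1,
    mismatch: int = -1,
    gap: int = -2,
) -> list[list[int]]:
    # sketch: the score constants are now caller-supplied (defaults unchanged)
    query = query.upper()
    subject = subject.upper()
    m, n = len(query), len(subject)
    score = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = score[i - 1][j - 1] + score_function(
                query[i - 1], subject[j - 1], match=match, mismatch=mismatch, gap=gap
            )
            delete = score[i - 1][j] + gap
            insert = score[i][j - 1] + gap
            score[i][j] = max(0, diagonal, delete, insert)

    return score
```

A caller could then write `smith_waterman("AGT", "AGT", match=2)`. Note that `traceback` would need the same constants passed in, since it recomputes the per-cell scores to decide which direction to follow.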
I hate repeating code, but there seems to be no way around it when it comes to default params.
Either way I go about it, I'll have to unpack the kwargs in the score function's params or in the body of the function, and I think the params are the better way for clarity.
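For comparison, one possible way to avoid repeating the defaults in every signature is to bundle the constants in a single object. This is only a sketch of an alternative, not what the PR does; `ScoringScheme` and `score_matrix` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringScheme:
    """Hypothetical container for the three score constants (not in the PR)."""

    match: int = 1
    mismatch: int = -1
    gap: int = -2

    def score(self, source_char: str, target_char: str) -> int:
        # same rule as score_function in the diff above
        if "-" in (source_char, target_char):
            return self.gap
        return self.match if source_char == target_char else self.mismatch


def score_matrix(
    query: str, subject: str, scheme: ScoringScheme = ScoringScheme()
) -> list[list[int]]:
    # the frozen (immutable) default instance is safe to share between calls
    query = query.upper()
    subject = subject.upper()
    m, n = len(query), len(subject)
    score = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = score[i - 1][j - 1] + scheme.score(query[i - 1], subject[j - 1])
            delete = score[i - 1][j] + scheme.gap
            insert = score[i][j - 1] + scheme.gap
            score[i][j] = max(0, diagonal, delete, insert)

    return score
```

The defaults then live in one place, and a traceback function could accept the same `scheme` argument. The trade-off is an extra type for callers to learn, which may be why plain keyword parameters read as clearer here.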