[IR2Vec][llvm-ir2vec] Revamp triplet generation and add entity mapping mode #149214
Open: svkeerthy wants to merge 2 commits into main from users/svkeerthy/07-16-revamp-triplet-gen
+336 −93
@@ -13,17 +13,21 @@ DESCRIPTION
 
 :program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
 generates IR2Vec embeddings for LLVM IR and supports triplet generation
-for vocabulary training. It provides two main operation modes:
+for vocabulary training. It provides three main operation modes:
 
-1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+1. **Triplet Mode**: Generates numeric triplets in train2id format for vocabulary
    training from LLVM IR.
 
-2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+2. **Entity Mode**: Generates entity mapping files (entity2id.txt) for vocabulary
+   training.
+
+3. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
    at different granularity levels (instruction, basic block, or function).
 
 The tool is designed to facilitate machine learning applications that work with
 LLVM IR by converting the IR into numerical representations that can be used by
-ML models.
+ML models. The triplet mode generates numeric IDs directly instead of string
+triplets, streamlining the training data preparation workflow.
 
 .. note::
 
@@ -34,18 +38,46 @@ ML models.
 OPERATION MODES
 ---------------
 
+Triplet Generation and Entity Mapping Modes are used for preparing
+vocabulary and training data for knowledge graph embeddings. The Embedding Mode
+is used for generating embeddings from LLVM IR using a pre-trained vocabulary.
+
+The Seed Embedding Vocabulary of IR2Vec is trained on a large corpus of LLVM IR
+by modeling the relationships between opcodes, types, and operands as a knowledge
+graph. For this purpose, Triplet Generation and Entity Mapping Modes generate
+triplets and entity mappings in the standard format used for knowledge graph
+embedding training (see
+<https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format>
+for details).
+
 Triplet Generation Mode
 ~~~~~~~~~~~~~~~~~~~~~~~
 
-In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
-consisting of opcodes, types, and operands. These triplets can be used to train
-vocabularies for embedding generation.
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts numeric
+triplets consisting of opcode IDs, type IDs, and operand IDs. These triplets
+are generated in train2id format. The tool outputs numeric IDs directly using
+the ir2vec::Vocabulary mapping infrastructure, eliminating the need for
+string-to-ID preprocessing.
 
 Usage:
 
 .. code-block:: bash
 
+   llvm-ir2vec --mode=triplets input.bc -o triplets_train2id.txt
+
+Entity Mapping Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In entity mode, :program:`llvm-ir2vec` generates the entity mappings supported by
+IR2Vec in entity2id format. This mode outputs all supported entities (opcodes,
Review comment: Same here for
+types, and operands) with their corresponding numeric IDs, and is not specific to
+any LLVM IR file.
+
+Usage:
+
+.. code-block:: bash
+
-   llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+   llvm-ir2vec --mode=entities -o entity2id.txt
 
 Embedding Generation Mode
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
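The two usage lines in the hunk above can be wrapped in a small script when preparing training data in bulk. This is a minimal sketch, not part of the PR: `ir2vec_argv` is a hypothetical helper name, and it only uses the flags that appear in the documentation diff (`--mode=triplets`, `--mode=entities`, `-o`). It builds the argument list without running anything, so it does not assume `llvm-ir2vec` is installed.

```python
import shlex
from typing import List, Optional


def ir2vec_argv(mode: str, output: str, input_file: Optional[str] = None) -> List[str]:
    """Build an llvm-ir2vec argument list (hypothetical helper, not from the PR)."""
    if mode == "triplets":
        # Triplet mode analyzes a concrete IR file, so an input is required.
        if input_file is None:
            raise ValueError("triplets mode needs an input IR file")
        return ["llvm-ir2vec", "--mode=triplets", input_file, "-o", output]
    if mode == "entities":
        # Entity mode lists all supported entities and takes no IR input.
        return ["llvm-ir2vec", "--mode=entities", "-o", output]
    raise ValueError(f"unsupported mode for this sketch: {mode}")


triplet_cmd = ir2vec_argv("triplets", "triplets_train2id.txt", "input.bc")
entity_cmd = ir2vec_argv("entities", "entity2id.txt")
print(shlex.join(triplet_cmd))
print(shlex.join(entity_cmd))
```

The resulting lists can be passed to `subprocess.run(argv, check=True)` on a machine where the tool is on `PATH`.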
@@ -67,6 +99,7 @@ OPTIONS
    Specify the operation mode. Valid values are:
 
    * ``triplets`` - Generate triplets for vocabulary training
+   * ``entities`` - Generate entity mappings for vocabulary training
    * ``embeddings`` - Generate embeddings using trained vocabulary (default)
 
 .. option:: --level=<level>
@@ -115,7 +148,7 @@ OPTIONS
 
 ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
 ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
-mode. These options are ignored in triplet mode.
+mode. These options are ignored in triplet and entity modes.
 
 INPUT FILE FORMAT
 -----------------
@@ -129,14 +162,34 @@ OUTPUT FORMAT
 Triplet Mode Output
 ~~~~~~~~~~~~~~~~~~~
 
-In triplet mode, the output consists of lines containing space-separated triplets:
+In triplet mode, the output consists of numeric triplets in train2id format with
+a metadata header. The format is:
 
 .. code-block:: text
 
+   MAX_RELATIONS=<max_relations_count>
+   <head_entity_id> <tail_entity_id> <relation_id>
+   <head_entity_id> <tail_entity_id> <relation_id>
+   ...
+
+Each line after the header represents one instruction relationship, with numeric
+IDs for head entity, tail entity, and relation, in that order. The MAX_RELATIONS
+header gives the relation count for post-processing and training setup.
+
+Entity Mode Output
+~~~~~~~~~~~~~~~~~~
+
+In entity mode, the output consists of entity mappings in the following format:
+
+.. code-block:: text
+
-   <opcode> <type> <operand1> <operand2> ...
+   <total_entities>
+   <entity_string> <numeric_id>
+   <entity_string> <numeric_id>
+   ...
 
-Each line represents the information of one instruction, with the opcode, type,
-and operands.
+The first line contains the total number of entities, followed by one entity
+mapping per line with tab-separated entity string and numeric ID.
 
 Embedding Mode Output
 ~~~~~~~~~~~~~~~~~~~~~
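The triplet output layout described in the hunk above (a `MAX_RELATIONS=` header followed by `head tail relation` lines) is simple enough to read back with a few lines of Python. The following is a sketch under that stated layout. `parse_triplet_output` is a hypothetical name, and the IDs in the sample are invented for illustration rather than taken from real tool output.

```python
from typing import List, Tuple

Triplet = Tuple[int, int, int]  # (head_id, tail_id, relation_id)


def parse_triplet_output(text: str) -> Tuple[int, List[Triplet]]:
    """Parse triplet-mode output: a MAX_RELATIONS header, then ID triples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("MAX_RELATIONS="):
        raise ValueError("missing MAX_RELATIONS header")
    max_relations = int(lines[0].split("=", 1)[1])
    triplets: List[Triplet] = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != 3:
            raise ValueError(f"expected three IDs, got: {ln!r}")
        head, tail, rel = (int(tok) for tok in fields)
        triplets.append((head, tail, rel))
    return max_relations, triplets


# Invented sample; real IDs come from the tool's vocabulary mapping.
sample = "MAX_RELATIONS=3\n12 79 0\n12 91 2\n12 30 1\n"
max_rel, triples = parse_triplet_output(sample)
print(max_rel, triples)
```

Keeping the column order as head, tail, relation matches the placeholders in the documented format, which is worth checking before feeding the data to a trainer that expects a different order.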
@@ -0,0 +1,95 @@
+; RUN: llvm-ir2vec --mode=entities | FileCheck %s
+
+CHECK: 92
+CHECK-NEXT: Ret 0
+CHECK-NEXT: Br 1
+CHECK-NEXT: Switch 2
+CHECK-NEXT: IndirectBr 3
+CHECK-NEXT: Invoke 4
+CHECK-NEXT: Resume 5
+CHECK-NEXT: Unreachable 6
+CHECK-NEXT: CleanupRet 7
+CHECK-NEXT: CatchRet 8
+CHECK-NEXT: CatchSwitch 9
+CHECK-NEXT: CallBr 10
+CHECK-NEXT: FNeg 11
+CHECK-NEXT: Add 12
+CHECK-NEXT: FAdd 13
+CHECK-NEXT: Sub 14
+CHECK-NEXT: FSub 15
+CHECK-NEXT: Mul 16
+CHECK-NEXT: FMul 17
+CHECK-NEXT: UDiv 18
+CHECK-NEXT: SDiv 19
+CHECK-NEXT: FDiv 20
+CHECK-NEXT: URem 21
+CHECK-NEXT: SRem 22
+CHECK-NEXT: FRem 23
+CHECK-NEXT: Shl 24
+CHECK-NEXT: LShr 25
+CHECK-NEXT: AShr 26
+CHECK-NEXT: And 27
+CHECK-NEXT: Or 28
+CHECK-NEXT: Xor 29
+CHECK-NEXT: Alloca 30
+CHECK-NEXT: Load 31
+CHECK-NEXT: Store 32
+CHECK-NEXT: GetElementPtr 33
+CHECK-NEXT: Fence 34
+CHECK-NEXT: AtomicCmpXchg 35
+CHECK-NEXT: AtomicRMW 36
+CHECK-NEXT: Trunc 37
+CHECK-NEXT: ZExt 38
+CHECK-NEXT: SExt 39
+CHECK-NEXT: FPToUI 40
+CHECK-NEXT: FPToSI 41
+CHECK-NEXT: UIToFP 42
+CHECK-NEXT: SIToFP 43
+CHECK-NEXT: FPTrunc 44
+CHECK-NEXT: FPExt 45
+CHECK-NEXT: PtrToInt 46
+CHECK-NEXT: IntToPtr 47
+CHECK-NEXT: BitCast 48
+CHECK-NEXT: AddrSpaceCast 49
+CHECK-NEXT: CleanupPad 50
+CHECK-NEXT: CatchPad 51
+CHECK-NEXT: ICmp 52
+CHECK-NEXT: FCmp 53
+CHECK-NEXT: PHI 54
+CHECK-NEXT: Call 55
+CHECK-NEXT: Select 56
+CHECK-NEXT: UserOp1 57
+CHECK-NEXT: UserOp2 58
+CHECK-NEXT: VAArg 59
+CHECK-NEXT: ExtractElement 60
+CHECK-NEXT: InsertElement 61
+CHECK-NEXT: ShuffleVector 62
+CHECK-NEXT: ExtractValue 63
+CHECK-NEXT: InsertValue 64
+CHECK-NEXT: LandingPad 65
+CHECK-NEXT: Freeze 66
+CHECK-NEXT: FloatTy 67
+CHECK-NEXT: FloatTy 68
+CHECK-NEXT: FloatTy 69
+CHECK-NEXT: FloatTy 70
+CHECK-NEXT: FloatTy 71
+CHECK-NEXT: FloatTy 72
+CHECK-NEXT: FloatTy 73
+CHECK-NEXT: VoidTy 74
+CHECK-NEXT: LabelTy 75
+CHECK-NEXT: MetadataTy 76
+CHECK-NEXT: UnknownTy 77
+CHECK-NEXT: TokenTy 78
+CHECK-NEXT: IntegerTy 79
+CHECK-NEXT: FunctionTy 80
+CHECK-NEXT: PointerTy 81
+CHECK-NEXT: StructTy 82
+CHECK-NEXT: ArrayTy 83
+CHECK-NEXT: VectorTy 84
+CHECK-NEXT: VectorTy 85
+CHECK-NEXT: PointerTy 86
+CHECK-NEXT: UnknownTy 87
+CHECK-NEXT: Function 88
+CHECK-NEXT: Pointer 89
+CHECK-NEXT: Constant 90
+CHECK-NEXT: Variable 91
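The expected output in this test shows one detail that matters when loading the mapping: some entity strings repeat (for example `FloatTy` covers IDs 67 through 73), so a name-to-ID dictionary would silently drop entries. The sketch below keys on the numeric ID instead. `parse_entity2id` is a hypothetical helper, and the sample is a shortened excerpt with its count line adjusted to match, not the full 92-entry output.

```python
from typing import Dict


def parse_entity2id(text: str) -> Dict[int, str]:
    """Parse entity-mode output: a count line, then '<entity> <id>' lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty entity mapping")
    total = int(lines[0].strip())
    id_to_name: Dict[int, str] = {}
    for ln in lines[1:]:
        # rsplit on whitespace accepts both tab- and space-separated fields.
        name, id_str = ln.rsplit(None, 1)
        id_to_name[int(id_str)] = name
    if len(id_to_name) != total:
        raise ValueError(f"header says {total} entities, found {len(id_to_name)}")
    return id_to_name


# Shortened excerpt; the count line is adjusted to the excerpt's length.
sample = "4\nRet\t0\nBr\t1\nFloatTy\t67\nFloatTy\t68\n"
mapping = parse_entity2id(sample)
print(mapping)
```

Checking the entry count against the first line is a cheap way to catch a truncated file before training.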
Review comment: Add a link to explain what train2id format is?
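For context on the question above: in the OpenKE data format that the documentation links to, `train2id.txt` starts with a line giving the number of training triples, followed by one `e1 e2 rel` line per triple. The tool's triplet output instead starts with a `MAX_RELATIONS=` header, so a small post-processing step is needed before OpenKE can read it. The following is a sketch of one way to do that, assuming the layouts just described. `to_openke_train2id` is a hypothetical name and this is not the PR's own tooling.

```python
def to_openke_train2id(tool_output: str) -> str:
    """Replace the MAX_RELATIONS header with an OpenKE-style triple count."""
    lines = [ln.strip() for ln in tool_output.splitlines() if ln.strip()]
    # Drop the metadata header; every remaining line is one 'e1 e2 rel' triple.
    triples = [ln for ln in lines if not ln.startswith("MAX_RELATIONS=")]
    for ln in triples:
        if len(ln.split()) != 3:
            raise ValueError(f"not a triple: {ln!r}")
    return "\n".join([str(len(triples))] + triples) + "\n"


# Invented IDs, for illustration only.
converted = to_openke_train2id("MAX_RELATIONS=3\n12 79 0\n12 91 2\n")
print(converted, end="")
```

This sketch keeps duplicate triples as they are. Whether to deduplicate them across a corpus is a training decision that the diff does not address.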