[IR2Vec] Add triplet generation utility script for vocabulary training #149215

Open: wants to merge 1 commit into base: main


svkeerthy
Contributor

@svkeerthy svkeerthy commented Jul 16, 2025

Added a Python utility script for generating IR2Vec triplets and updated documentation to reference it.

The script generates triplets in a form suitable for training the vocabulary.

(Tracking issues - #141817, #141834; closes - #141834)
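
For readers who want to try the script: its third positional argument, `ll_file_list`, is read one path per line (see `_read_file_list` in the diff). A minimal sketch of preparing such a list is below; `write_ll_file_list`, `ir_root`, and `list_path` are hypothetical names and not part of this PR.

```python
from pathlib import Path
from typing import List


def write_ll_file_list(ir_root: Path, list_path: Path) -> List[Path]:
    """Collect every .ll file under ir_root and write one path per line."""
    # Absolute paths keep the list valid regardless of the working directory.
    ll_files = sorted(p.resolve() for p in ir_root.rglob("*.ll"))
    list_path.write_text("".join(f"{p}\n" for p in ll_files))
    return ll_files
```

The resulting file can then be passed as `<ll_file_list>` in the usage line from the script's docstring.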

Contributor Author

svkeerthy commented Jul 16, 2025

github-actions bot commented Jul 16, 2025

✅ With the latest revision this PR passed the Python code formatter.

@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch from 42f9479 to 528ac7b Compare July 16, 2025 23:32
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-triplet-ext-script branch from 702adeb to faa92fe Compare July 16, 2025 23:32
@svkeerthy svkeerthy changed the title triplet-ext-script [NFC][IR2Vec] Add reference to generateTriplets.py in documentation Jul 16, 2025
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-triplet-ext-script branch from faa92fe to 5a8f74a Compare July 16, 2025 23:46
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch from 528ac7b to 09d483a Compare July 16, 2025 23:46
@svkeerthy svkeerthy changed the title [NFC][IR2Vec] Add reference to generateTriplets.py in documentation [IR2Vec] Add triplet generation utility script for vocabulary training Jul 16, 2025
@svkeerthy svkeerthy marked this pull request as ready for review July 16, 2025 23:50
@llvmbot
Member

llvmbot commented Jul 16, 2025

@llvm/pr-subscribers-mlgo

@llvm/pr-subscribers-llvm-binary-utilities

Author: S. VenkataKeerthy (svkeerthy)

Changes

Added a Python utility script for generating IR2Vec triplets and updated documentation to reference it.

The script generates triplets in a form suitable for training the vocabulary.

(Tracking issue - #141817)


Full diff: https://github.com/llvm/llvm-project/pull/149215.diff

2 Files Affected:

  • (modified) llvm/docs/CommandGuide/llvm-ir2vec.rst (+3)
  • (added) llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py (+291)
diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
index 56ece4f509f6e..e39a663e3be5a 100644
--- a/llvm/docs/CommandGuide/llvm-ir2vec.rst
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -50,6 +50,9 @@ embedding training (see
 <https://github.com/thunlp/OpenKE/tree/OpenKE-PyTorch?tab=readme-ov-file#data-format> 
 for details).
 
+See `llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py` for more details on how
+these two modes are used to generate the triplets and entity mappings.
+
 Triplet Generation Mode
 ~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py b/llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py
new file mode 100644
index 0000000000000..0858d10ce0138
--- /dev/null
+++ b/llvm/utils/mlgo-utils/IR2Vec/generateTriplets.py
@@ -0,0 +1,291 @@
+# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+"""IR2Vec Triplet Generator
+
+Generates IR2Vec triplets by applying random optimization levels to LLVM IR files
+and extracting triplets using llvm-ir2vec. Automatically generates preprocessed
+files: entity2id.txt, relation2id.txt, and train2id.txt.
+
+Usage:
+    python generateTriplets.py <llvm_build_dir> <num_optimizations> <ll_file_list> <output_dir>
+"""
+
+import argparse
+import logging
+import os
+import random
+import subprocess
+import sys
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from pathlib import Path
+from typing import List, Set, Tuple
+
+# Configuration
+OPT_LEVELS = ["O0", "O1", "O2", "O3", "Os", "Oz"]
+DEFAULT_MAX_WORKERS = 100
+
+logger = logging.getLogger(__name__)
+
+
+class TripletResult:
+    """Result from processing a single LLVM IR file"""
+
+    __slots__ = ["triplets", "max_relation"]
+
+    def __init__(self, triplets: Set[str], max_relation: int):
+        self.triplets = triplets
+        self.max_relation = max_relation
+
+
+class IR2VecTripletGenerator:
+    """Main class for generating IR2Vec triplets"""
+
+    def __init__(
+        self,
+        llvm_build_dir: Path,
+        num_optimizations: int,
+        output_dir: Path,
+        max_workers: int = DEFAULT_MAX_WORKERS,
+    ):
+        self.llvm_build_dir = llvm_build_dir
+        self.num_optimizations = num_optimizations
+        self.output_dir = output_dir
+        self.max_workers = max_workers
+
+        # Tool paths
+        self.opt_binary = os.path.join(llvm_build_dir, "bin", "opt")
+        self.ir2vec_binary = os.path.join(llvm_build_dir, "bin", "llvm-ir2vec")
+
+        self._validate_setup()
+
+    def _validate_setup(self):
+        """Validate that all required tools and paths exist"""
+        if not self.llvm_build_dir.exists():
+            raise FileNotFoundError(
+                f"LLVM build directory not found: {self.llvm_build_dir}"
+            )
+
+        if not os.path.isfile(self.opt_binary) or not os.access(
+            self.opt_binary, os.X_OK
+        ):
+            raise FileNotFoundError(
+                f"opt binary not found or not executable: {self.opt_binary}"
+            )
+
+        if not os.path.isfile(self.ir2vec_binary) or not os.access(
+            self.ir2vec_binary, os.X_OK
+        ):
+            raise FileNotFoundError(
+                f"llvm-ir2vec binary not found or not executable: {self.ir2vec_binary}"
+            )
+
+        if not (1 <= self.num_optimizations <= len(OPT_LEVELS)):
+            raise ValueError(
+                f"Number of optimizations must be between 1 and {len(OPT_LEVELS)}"
+            )
+
+        self.output_dir.mkdir(parents=True, exist_ok=True)
+
+    def _select_optimization_levels(self) -> List[str]:
+        """Select unique random optimization levels"""
+        return random.sample(OPT_LEVELS, self.num_optimizations)
+
+    def _process_single_file(self, input_file: Path) -> TripletResult:
+        """Process a single LLVM IR file with multiple optimization levels"""
+        all_triplets = set()
+        max_relation = 1
+        opt_levels = self._select_optimization_levels()
+
+        for opt_level in opt_levels:
+            try:
+                triplets, file_max_relation = self._run_pipeline(input_file, opt_level)
+                if triplets:
+                    all_triplets.update(triplets)
+                    max_relation = max(max_relation, file_max_relation)
+                    logger.debug(
+                        f"Generated {len(triplets)} triplets for {input_file} with {opt_level}"
+                    )
+            except Exception as e:
+                logger.warning(f"Error processing {input_file} with {opt_level}: {e}")
+
+        return TripletResult(all_triplets, max_relation)
+
+    def _run_pipeline(self, input_file: Path, opt_level: str) -> Tuple[Set[str], int]:
+        """Run opt | llvm-ir2vec pipeline elegantly."""
+        pipeline_cmd = (
+            f'"{self.opt_binary}" -{opt_level} "{input_file}" -o - | '
+            f'"{self.ir2vec_binary}" --mode=triplets - -o -'
+        )
+
+        try:
+            result = subprocess.run(
+                pipeline_cmd, shell=True, capture_output=True, text=True, check=True
+            )
+            return self._parse_triplet_output(result.stdout)
+        except subprocess.CalledProcessError:
+            return set(), 1
+
+    def _parse_triplet_output(self, output: str) -> Tuple[Set[str], int]:
+        """Parse triplet output and extract max relation"""
+        if not output.strip():
+            return set(), 1
+
+        lines = output.strip().split("\n")
+        max_relation = 1
+
+        # Extract max relation from metadata line
+        if lines and lines[0].startswith("MAX_RELATION="):
+            max_relation = int(lines[0].split("=")[1])
+            lines = lines[1:]
+
+        # Remove duplicate triplets by converting to a set
+        return set(lines), max_relation
+
+    def generate_triplets(self, file_list: Path) -> None:
+        """Main method to generate triplets from a list of LLVM IR files"""
+        input_files = self._read_file_list(file_list)
+        logger.info(
+            f"Processing {len(input_files)} files with {self.num_optimizations} "
+            f"optimization levels using {self.max_workers} workers"
+        )
+
+        all_triplets = set()
+        global_max_relation = 1
+
+        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
+            future_to_file = {
+                executor.submit(self._process_single_file, file): file
+                for file in input_files
+            }
+
+            for future in as_completed(future_to_file):
+                try:
+                    result = future.result()
+                    all_triplets.update(result.triplets)
+                    global_max_relation = max(global_max_relation, result.max_relation)
+                except Exception as e:
+                    file_path = future_to_file[future]
+                    logger.error(f"Error processing {file_path}: {e}")
+
+        self._generate_output_files(all_triplets, global_max_relation)
+        logger.info("Processing completed successfully")
+
+    def _read_file_list(self, file_list: Path) -> List[Path]:
+        """Read and validate the list of input files"""
+        input_files = []
+        with open(file_list, "r") as f:
+            for line_num, line in enumerate(f, 1):
+                if line := line.strip():
+                    file_path = Path(line)
+                    if file_path.exists():
+                        input_files.append(file_path)
+                    else:
+                        logger.warning(f"File not found (line {line_num}): {file_path}")
+
+        if not input_files:
+            raise ValueError("No valid input files found")
+        return input_files
+
+    def _generate_output_files(self, all_triplets: Set[str], max_relation: int) -> None:
+        """Generate the final output files"""
+        logger.info(f"Generating output files with {len(all_triplets)} unique triplets")
+
+        # Write all output files -- train2id.txt, entity2id.txt, relation2id.txt
+        train2id_file = os.path.join(self.output_dir, "train2id.txt")
+        entity2id_file = os.path.join(self.output_dir, "entity2id.txt")
+        relation2id_file = os.path.join(self.output_dir, "relation2id.txt")
+
+        with open(train2id_file, "w") as f:
+            f.write(f"{len(all_triplets)}\n")
+            f.writelines(f"{triplet}\n" for triplet in all_triplets)
+
+        self._generate_entity2id(entity2id_file)
+        self._generate_relation2id(relation2id_file, max_relation)
+
+    def _generate_entity2id(self, output_file: Path) -> None:
+        """Generate entity2id.txt using llvm-ir2vec"""
+        subprocess.run(
+            [str(self.ir2vec_binary), "--mode=entities", "-o", str(output_file)],
+            check=True,
+            capture_output=True,
+        )
+
+    def _generate_relation2id(self, output_file: Path, max_relation: int) -> None:
+        """Generate relation2id.txt from max relation"""
+        max_relation = max(max_relation, 1)  # At least Type and Next relations
+        num_relations = max_relation + 1
+
+        with open(output_file, "w") as f:
+            f.write(f"{num_relations}\n")
+            f.write("Type\t0\n")
+            f.write("Next\t1\n")
+            f.writelines(f"Arg{i-2}\t{i}\n" for i in range(2, num_relations))
+
+
+def main():
+    """Main entry point"""
+    parser = argparse.ArgumentParser(
+        description="Generate IR2Vec triplets from LLVM IR files",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+
+    parser.add_argument(
+        "llvm_build_dir", type=Path, help="Path to LLVM build directory"
+    )
+    parser.add_argument(
+        "num_optimizations",
+        type=int,
+        help="Number of optimization levels to apply (1-6)",
+    )
+    parser.add_argument(
+        "ll_file_list",
+        type=Path,
+        help="File containing list of LLVM IR files to process",
+    )
+    parser.add_argument(
+        "output_dir", type=Path, help="Output directory for generated files"
+    )
+    parser.add_argument(
+        "-j",
+        "--max-workers",
+        type=int,
+        default=DEFAULT_MAX_WORKERS,
+        help=f"Maximum number of parallel workers (default: {DEFAULT_MAX_WORKERS})",
+    )
+    parser.add_argument(
+        "-v", "--verbose", action="store_true", help="Enable debug logging"
+    )
+    parser.add_argument(
+        "-q", "--quiet", action="store_true", help="Suppress all output except errors"
+    )
+
+    args = parser.parse_args()
+
+    # Configure logging
+    level = (
+        logging.ERROR
+        if args.quiet
+        else (logging.DEBUG if args.verbose else logging.INFO)
+    )
+    logging.basicConfig(
+        level=level,
+        format="[%(asctime)s] %(levelname)s: %(message)s",
+        datefmt="%H:%M:%S",
+    )
+
+    try:
+        generator = IR2VecTripletGenerator(
+            args.llvm_build_dir,
+            args.num_optimizations,
+            args.output_dir,
+            args.max_workers,
+        )
+        generator.generate_triplets(args.ll_file_list)
+    except Exception as e:
+        logger.error(f"Error: {e}")
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
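
A note on `_run_pipeline` in the diff above: with `shell=True`, `check=True` only sees the exit status of the last command in the pipe, so unless the shell has `pipefail` set, a failing `opt` could go unnoticed whenever `llvm-ir2vec` still exits 0. A minimal sketch of a shell-free alternative follows; `run_pipeline` and its parameter names are hypothetical and not part of this PR.

```python
import subprocess
import sys
from typing import List


def run_pipeline(producer_cmd: List[str], consumer_cmd: List[str]) -> str:
    """Run `producer | consumer` without a shell and return the consumer's stdout.

    Raises subprocess.CalledProcessError if either stage exits non-zero.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_cmd, stdin=producer.stdout, stdout=subprocess.PIPE, text=True
    )
    # Close the parent's copy of the pipe so the producer sees a broken pipe
    # if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer_rc = producer.wait()
    # Check both stages, which a plain shell pipe does not do by default.
    if producer_rc != 0:
        raise subprocess.CalledProcessError(producer_rc, producer_cmd)
    if consumer.returncode != 0:
        raise subprocess.CalledProcessError(consumer.returncode, consumer_cmd)
    return output
```

In the script, the two stages would presumably be `[self.opt_binary, f"-{opt_level}", str(input_file), "-o", "-"]` and `[self.ir2vec_binary, "--mode=triplets", "-", "-o", "-"]`, using the same flags as the diff. Passing argument lists also avoids quoting problems with unusual file names.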

@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-triplet-ext-script branch from 5a8f74a to 0007c06 Compare July 17, 2025 18:04
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch 2 times, most recently from 3f8c21f to dff3bdb Compare July 17, 2025 19:12
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-triplet-ext-script branch 2 times, most recently from b2e9297 to e088eb8 Compare July 17, 2025 19:55
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch from dff3bdb to ea01937 Compare July 17, 2025 19:55
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-triplet-ext-script branch from e088eb8 to d19f53d Compare July 17, 2025 19:58
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch 2 times, most recently from b209998 to 2f6e0b4 Compare July 17, 2025 20:45
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-triplet-ext-script branch from d19f53d to c412299 Compare July 17, 2025 20:46
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch from 2f6e0b4 to a1e45d7 Compare July 28, 2025 23:22
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-triplet-ext-script branch 2 times, most recently from 51293a8 to 8725ed9 Compare July 29, 2025 17:18
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-revamp-triplet-gen branch 8 times, most recently from 36543d6 to e609434 Compare July 29, 2025 18:27
svkeerthy added a commit that referenced this pull request Jul 29, 2025
…g mode (#149214)

Add entity mapping mode to llvm-ir2vec and improve triplet generation format for knowledge graph embedding training.

This change streamlines the workflow for training the vocabulary embeddings with IR2Vec by:
1. Directly generating numeric IDs instead of requiring string-to-ID preprocessing
2. Providing entity mappings in standard knowledge graph embedding format
3. Structuring triplet output in train2id format compatible with knowledge graph embedding frameworks
4. Adding metadata headers to simplify post-processing and training setup

These improvements make IR2Vec more compatible with standard knowledge graph embedding training pipelines and reduce the preprocessing steps needed before training.
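
To illustrate how the metadata header feeds the relation mapping, here is a small sketch that mirrors `_parse_triplet_output` and `_generate_relation2id` from the #149215 diff as standalone functions. The function names are hypothetical, and the sample triplet lines used with it are made-up placeholders rather than real `llvm-ir2vec` output.

```python
from typing import Set, Tuple


def parse_triplets(output: str) -> Tuple[Set[str], int]:
    """Split tool output into unique triplet lines and the MAX_RELATION value."""
    lines = output.strip().splitlines()
    max_relation = 1
    # The first line may carry the metadata header, e.g. "MAX_RELATION=3".
    if lines and lines[0].startswith("MAX_RELATION="):
        max_relation = int(lines[0].split("=", 1)[1])
        lines = lines[1:]
    return set(lines), max_relation


def relation2id_text(max_relation: int) -> str:
    """Build relation2id.txt content: a count line, then name<TAB>id rows."""
    num_relations = max(max_relation, 1) + 1  # Type and Next always exist
    rows = ["Type\t0", "Next\t1"]
    rows += [f"Arg{i - 2}\t{i}" for i in range(2, num_relations)]
    return f"{num_relations}\n" + "\n".join(rows) + "\n"
```

With a header of `MAX_RELATION=3`, this yields four relations: `Type`, `Next`, `Arg0`, and `Arg1`.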

See #149215 for more details on how it is used.

(Tracking issues - #141817, #141834)
Base automatically changed from users/svkeerthy/07-16-revamp-triplet-gen to main July 29, 2025 18:56
@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-triplet-ext-script branch 2 times, most recently from 1b49ce3 to 9611b33 Compare July 29, 2025 23:09
@svkeerthy svkeerthy requested a review from boomanaiden154 July 29, 2025 23:11
Contributor

@boomanaiden154 boomanaiden154 left a comment

One minor comment, otherwise LGTM.

class TripletResult:
    """Result from processing a single LLVM IR file"""

    __slots__ = ["triplets", "max_relation"]

Contributor

Are the performance benefits of this tangible?

This would be better suited as a dataclass.

Maybe add a TODO to turn this into a dataclass with slots=True when LLVM moves to Python 3.10.
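
A sketch of what the suggestion might look like, assuming the fields stay exactly as in the diff. This is one possible shape, not necessarily what was merged; `slots=True` is left as a TODO because that `dataclass` parameter only exists from Python 3.10.

```python
from dataclasses import dataclass
from typing import Set


# TODO: use @dataclass(slots=True) once the minimum supported Python is 3.10.
@dataclass
class TripletResult:
    """Result from processing a single LLVM IR file."""

    triplets: Set[str]
    max_relation: int
```

Call sites such as `TripletResult(all_triplets, max_relation)` would keep working unchanged, since the generated `__init__` takes the fields in declaration order.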

@svkeerthy svkeerthy force-pushed the users/svkeerthy/07-16-triplet-ext-script branch from 9611b33 to 0807195 Compare July 29, 2025 23:54