Skip to content

Add git-based build information for better issue tracking #1232

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 1, 2023

Conversation

DannyDaemonic
Copy link
Contributor

@DannyDaemonic DannyDaemonic commented Apr 29, 2023

Description:

This pull request adds a build-info.h header file, which will contain git-related build information such as branch and commit count. This will help users and maintainers to identify the build being used when reporting issues or bugs. The build-info.h file is generated using CMake and Makefile whenever .git/index changes, and is printed to stderr in main.cpp.

Changes:

  1. Modified .gitignore to include build-info.h.
  2. Added build information generation to CMakeLists.txt (and updated main's CMakeLists.txt to add a dependency).
  3. Added build information generation to Makefile.
  4. Modified examples/main/main.cpp to print build information at runtime.

This will allow us to track the build version in a more granular way and may be helpful in identifying and addressing issues or bugs until an official versioning system is introduced.

Edit: In the end this also ended being added to all examples.

@slaren
Copy link
Member

slaren commented Apr 29, 2023

I think this is a good idea. One question, currently it looks like this outputs the number of commits in HEAD, but how do we find what version they are using from this information? I think I would prefer a commit hash id, such as obtained by git rev-parse --short HEAD.

@DannyDaemonic
Copy link
Contributor Author

DannyDaemonic commented Apr 29, 2023

@slaren

but how do we find what version they are using from this information?

To go the other direction:

git rev-list HEAD --skip=$(( $(git rev-list HEAD --count) - BUILD_NUMBER)) --max-count=1

So if you were looking for 461:

$ git rev-list HEAD --skip=$(( $(git rev-list HEAD --count) - 461)) --max-count=1
305eb5afd51325e3142c01c17431febb7c67de87

It's not as straight forward I suppose. I just thought a build number would be something easier for people to understand and without looking up the hash, judge how old the build is, and at the same time allow people (or documentation) to say "you need at least build number 461 to do X."

If you want the direct hash, how would you feel about a combination? Something like:

build = 461 (305eb5afd51325e3142c01c17431febb7c67de87)
seed  = 189273487

@j-f1
Copy link
Collaborator

j-f1 commented Apr 29, 2023

How about pulling out the commit date?

build = 2022-04-29 (305eb5afd51325e3142c01c17431febb7c67de87)

@Green-Sky
Copy link
Collaborator

Green-Sky commented Apr 29, 2023

Did you test how this behaves when it is a cmake subdirectory?

edit: also git submodules. which have a .git file that points to the root.

@Green-Sky Green-Sky added enhancement New feature or request build Compilation issues labels Apr 29, 2023
@DannyDaemonic
Copy link
Contributor Author

DannyDaemonic commented Apr 29, 2023

@j-f1:

The date might be better for a quick judge of how old something is. I thought a straight number component would be more useful in the long run because it could be used more easily programmatically.

For example, if we wanted to tag all the new quantizations with which build number they were quantized in and there were some quantization bugs fixed in commit number 670, you just have to test if the build number in the quantized file was < 760 and then warn the user they should requantize for best performance. The hash alone wouldn't be good enough and a date would mostly work, but it's not exact (and time zones?) and no one wants to parse date strings.

@Green-Sky:

I tried it with main (which is in another subdirectory, if that's what you mean?) and it worked fine. The BUILD_INFO project it makes is itself relative to the source root so even when I write CMAKE_CURRENT_SOURCE_DIR it's still using that path and should use only the .git at that that particular location but there might be some edge cases I'm not thinking of? I can replace CMAKE_CURRENT_SOURCE_DIR with CMAKE_SOURCE_DIR just to cover my bases.

It's still all configured on my dev machine if there's a particular cmake command you want me to try out.

@Green-Sky
Copy link
Collaborator

I tried it with main (which is in another subdirectory, if that's what you mean?)

hm no.

The BUILD_INFO project it makes is itself relative to the source root so even when I write CMAKE_CURRENT_SOURCE_DIR it's still using that path and should use only the .git at that that particular location but there might be some edge cases I'm not thinking of?

I was thinking of including llama.cpp as a git submodule or git subtree and including the cmake via eg add_subirectory(external/llama.cpp) .

CMAKE_CURRENT_SOURCE_DIR

In the llama.cpp's project root directory is the way to go, I think.

@DannyDaemonic
Copy link
Contributor Author

DannyDaemonic commented Apr 29, 2023

I committed a change so it shows the hash and the build number. I think this is better after all.

main: build = 464 (5e7d90a838c2543687fcff2e603b1189b6bc1084)
main: seed  = 1682785118

Previously, this would have shown my branch instead (conditionally, because it wasn't master), but now it just always shows the hash, which I do admit is easier. The example even points to this exact commit: 5e7d90a838c2543687fcff2e603b1189b6bc1084

And while I don't think it was necessary (because BUILD_INFO is treated as its own project in cmake), I didn't see any harm in replacing all the CMAKE_CURRENT_SOURCE_DIR with CMAKE_SOURCE_DIR just to be extra careful. (Spoiler, I undid this below.)

@DannyDaemonic
Copy link
Contributor Author

@Green-Sky Oh I see what you're saying. If someone else is using this project in their own project, cmake should be smart about that since it would still be a sub directory, but I haven't tested it and now I think CMAKE_CURRENT_SOURCE_DIR was actually the right thing to do. I need to go back undo that.

@DannyDaemonic
Copy link
Contributor Author

DannyDaemonic commented Apr 29, 2023

Alright, I put CMAKE_CURRENT_SOURCE_DIR back and checked for any other spots it may have caused issues when used with add_subdirectory(). I also made the building comments a little more descript.

Copy link
Contributor

@sw sw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the working directory is not a git repo (for example when downloading a ZIP from Github), the Cmake build fails:

...
-- Configuring done (0.4s)
CMake Error at examples/main/CMakeLists.txt:5 (add_dependencies):
  The dependency target "BUILD_INFO" of target "main" does not exist.

Can we handle this gracefully?

@DannyDaemonic
Copy link
Contributor Author

DannyDaemonic commented Apr 29, 2023

@sw Yeah, it's embarrassingly easy to fix. Everything else is all set up to generate dummy values. I just overlooked that dependency line because I originally wasn't going to add it (since BUILD_INFO is it's own thing and part of ALL) but added it the last minute in case someone builds only main (which could cause other issues anyway), and it missed the .git testing. Give me a second.

Edit: git will still spit out an error claiming "not a git repository", but only when configuring, not during builds. I can also hide that but I think maybe it's good people see that so if they look at their logs they will understand why their build info is "unknown"?

@DannyDaemonic DannyDaemonic requested a review from sw April 29, 2023 19:24
@sw
Copy link
Contributor

sw commented Apr 29, 2023

Thanks @DannyDaemonic, I've noticed another problem, though. CMake doesn't find the git executable:

-- Could NOT find Git (missing: GIT_EXECUTABLE)

This is on Ubuntu 22.04 with cmake 3.22.1 and git 2.34.1

Also seems to happen in the CI build: https://github.com/ggerganov/llama.cpp/actions/runs/4840382315/jobs/8625989141?pr=1232#step:4:22

I think you have to use find_package(Git) in the top-level CMakeLists.txt as well, not just in the generated BUILD_INFO.cmake.

@sw
Copy link
Contributor

sw commented Apr 29, 2023

Couldn't you add BUILD_INFO.cmake as a regular file in scripts/, instead of writing it to the build directory?

I think the goal of this PR is good, the implementation just seems very complicated. But I'm no Cmake expert.

@DannyDaemonic DannyDaemonic force-pushed the build-info branch 2 times, most recently from 40aec68 to 8977b8a Compare April 29, 2023 22:40
@DannyDaemonic
Copy link
Contributor Author

DannyDaemonic commented Apr 29, 2023

@sw

I think you have to use find_package(Git) in the top-level CMakeLists.txt as well, not just in the generated BUILD_INFO.cmake.

Hmm, I don't know why that would make a difference when BUILD_INFO has its own top-level project (at build/BUILD_INFO.vcxproj. Unless you mean it didn't work when called during the config but then it worked after? Either way, I added a fallback on which, that always works (outside of Windows anyway). I tested it and it was able to find git even in a subshell. I also double checked and it was working in Windows without the which work around. 🤷

As for making a separate cmake file, all of the other .cmake files are generated by the CMakeLists.txt files, so it seemed natural for it to create BUILD_INFO.cmake as well. It was also easier to work with the code in one place, but it has grown quite a bit since so I separated it out and left just the project stuff behind in CMakeLists.txt.

The other downside to separating it into scripts was it messed with the path stuff Green-Sky and I were talking about earlier. Even thought I was using CMAKE_CURRENT_SOURCE_DIR it was using the build directory and apparently you can't override the source directories with -D. I was able to find a way to do it with nmake 3.12, which luckily, we already require, it but it took longer than I'd like to admit to figure out what was wrong...

@slaren

Alright. I've put it in all the examples. I tried to place it after parameter checks that could cause an exit, but before the seed is written out so that they'd be printed together. This also means I had to update the makefile to depend on the .h file for generation. The only thing is this made the Makefile a bit messy. We could probably clean it up with a pattern. But I'm not sure that's desired. I can always try to clean it up in a separate PR to see if that's prefered.

@slaren
Copy link
Member

slaren commented Apr 29, 2023

This is very minor, but I wonder if the filter-out in the Makefile could be replaced by a more generic $(filter-out %.h,$^).

@DannyDaemonic
Copy link
Contributor Author

DannyDaemonic commented Apr 30, 2023

@slaren Oh, Yeah, I could do that much without reshaping the makefile. Previously, I was thinking of something like this:

# Define the targets
TARGETS := main quantize quantize-stats perplexity embedding

# Define dependencies for each target
main_DEPS := ggml.o llama.o common.o $(OBJS)
quantize_DEPS := ggml.o llama.o $(OBJS)
quantize-stats_DEPS := ggml.o llama.o $(OBJS)
perplexity_DEPS := ggml.o llama.o common.o $(OBJS)
embedding_DEPS := ggml.o llama.o common.o $(OBJS)

# Pattern rule for compiling and linking
%.out: %.cpp $(%_DEPS) build-info.h
	$(CXX) $(CXXFLAGS) $(filter-out *.h,$^) -o $@ $(LDFLAGS)

But I don't know if that's too far removed from a normal simple makefile.

@slaren
Copy link
Member

slaren commented Apr 30, 2023

Yeah that probably would be better in the long run. I also noticed that the LLAMA_GPROF and LLAMA_PERF flags don't work with cuBLAS. There is some refactoring to do there, but that would be a different PR.

Copy link
Member

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use 4 spaces indentation in the CMake

@DannyDaemonic
Copy link
Contributor Author

@ggerganov: Oh, good catch.

@slaren I tried to make that particular pattern/rule work but it was really not having it. The only way to join the target with a variable is to use an $eval.

It still looks great to use:

TARGETS_CPP += quantize
DEPS_quantize := examples/quantize/quantize.cpp ggml.o llama.o

TARGETS_CPP += quantize-stats
DEPS_quantize-stats := examples/quantize-stats/quantize-stats.cpp ggml.o llama.o

TARGETS_CPP += perplexity
DEPS_perplexity := examples/perplexity/perplexity.cpp ggml.o llama.o common.o

TARGETS_CPP += embedding
DEPS_embedding := examples/embedding/embedding.cpp ggml.o llama.o common.o

But then you have something like this at the end that includes all the extras (OBJS, build-info, etc):

define template_cpp
OUTP_$(1) ?= $(1)
$(1): $$(DEPS_$(1)) $$(OBJS) build-info.h
	$$(CXX) $$(CXXFLAGS) $$(filter-out build-info.h,$$^) -o $$(OUTP_$(1)) $$(LDFLAGS)
	$$(if $$(value EXEC_$(1)),$$(EXEC_$(1)))
endef

$(foreach target,$(TARGETS_CPP),$(eval $(call template_cpp,$(target))))

I commented it nicely but I don't know if that's still preferred.

Thoughts?

Makefile Outdated
Comment on lines 244 to 249
define template_cpp
OUTP_$(1) ?= $(1)
$(1): $$(DEPS_$(1)) $$(OBJS) build-info.h
$$(CXX) $$(CXXFLAGS) $$(filter-out build-info.h,$$^) -o $$(OUTP_$(1)) $$(LDFLAGS)
$$(if $$(value EXEC_$(1)),$$(EXEC_$(1)))
endef
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too "smart", in my humble opinion.

The Makefile could be much simpler, actually.

VPATH= examples examples/main
%.o: %.c
	$(CC)  $(CPPFLAGS) $(CFLAGS)   -c $< -o $@
%.o: %.cpp
	$(CXX) $(CPPFLAGS) $(CXXFLAGS) -c $< -o $@
%: %.o
	$(CXX) $(LDFLAGS) -o $@ $^ $(LDLIBS)
ggml.o: ggml.h ggml-cuda.h
llama.o: ggml.h ggml-cuda.h llama.h llama_util.h
common.o: common.h
main: ggml.o llama.o common.o $(OBJS)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Makefile was already pretty busy when I started trying to edit build-info.h in. Each target had its own rule. Since the build info has to be generated from git, the example projects have to depend on it. But putting it in the prerequisites was somehow causing the compilers on macOS to try compiling it and failing, so you then have to exclude it from the compiler with $(filter-out build-info.h, $^). I couldn't get it to pass the build tests any other way, so I had to edit that into every example target.

Originally, I had thought I could use a static pattern rule for this (even mentioned as much earlier) and hoped to wrap up all the example target rules into one pattern rule so everyone wouldn't have to see a dozen lines of $(filter-out build-info.h, $^).

My goal was to put keep all the weird stuff in one place. This way, no has to understand, or copy and paste without understanding, code to change the makefile when a new project is added. There were so many little things different. It seemed each target had its own exception. (main wants to echo text when it's done, benchmark's source is actually benchmark-q4_0-matmult.c not benchmark.c.

Maybe if I had started with a Makefile from scratch I could have come up with a better solution, but I just wanted to get the example projects in line with the change. I do think a pattern rule is ideal, but I just couldn't get one to work in practice.

I hated the foreach eval when I first thought about it, but after stumbling over patterns for way too long, it seemed like the only solution where each example, each program, each test, could just be written the same way in the Makefile. My argument for this approach is it's easy to add future examples and if you need to change the way we're building those examples you just need to edit one line (the template), which also means escaping your $ with another $.

What I can do is revert it to how it was before and just make the bare minimum changes to each target.. After spending a little too much time fighting with patterns, I kind of wish I had just done to begin with, but since I arrived at a destination that wasn't too bad I decided to push it and see what people's thoughts were. Part of the reason I pointed out the "smart" code in a comment after my commit was so that I could try to judge people's comfort level with it. I just want to make sure the code is easy to follow and that people are comfortable updating it if needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's not too crazy if I look at the whole file.

The Makefile was already pretty busy when I started trying to edit build-info.h in. Each target had its own rule. Since the build info has to be generated from git, the example projects have to depend on it. But putting it in the prerequisites was somehow causing the compilers on macOS to try compiling it and failing

Wouldn't it be better to generate build-info.c that has global variables and have everyone link to build-info.o? It also means that it wouldn't be necessary to rebuild everything on every git pull if some files have not changed.

main wants to echo text when it's done

Double colon rule could work?

benchmark's source is actually benchmark-q4_0-matmult.c not benchmark.c

I would say it's two targets, a PHONY benchmark that depends on benchmark-q4_0-matmult and runs it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think the general consensus has been that Makefile needs some refactoring, but this PR was never intended to do that. I've reset it to a more minimally changed version and just tried to make it less ugly with *.h instead of build-info.h, which actually covers future cases of header only includes. I still have my new Makefile stashed away; maybe I'll open a different pull request with a draft for cleaning up Makefile to see if there's any interest there, but I'm less comfortable taking that role as I'm not the person who edits new stuff in regularly.

Building a build-info.o would add a little more complexity to the process, but I do agree it would be most effective at preventing unnecessary recompilation. I imagine when there's an official version system, that's how it will be done (but included in ggml.o, llama.o, or both). I still like the minimalism of the header to get build info into the examples. And while it can cause unnecessary rebuilds, this is limited to the examples and doesn't include the larger files like ggml.c and llama.c and the other .o (and .so) files.

@DannyDaemonic DannyDaemonic force-pushed the build-info branch 5 times, most recently from 8653aca to 8f8715c Compare May 1, 2023 02:09
@DannyDaemonic
Copy link
Contributor Author

DannyDaemonic commented May 1, 2023

Rebased with less "fancy" Makefile.

Used shorter hash as originally recommended by @slaren. If this is no longer desired, let me know and I'll put it back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Compilation issues enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants