Enable external storage shim (extstore). #38

Conversation
Looks like we're running into a combination of memcached/memcached#319 and later memcached/memcached#374.

For reference, we support all of amd64, arm32v5, arm32v6, arm32v7, arm64v8, i386, ppc64le, and s390x, so I'd rather wait to merge this until the functionality is working on all supported platforms.

Any chance you could test the 'next' branch? Or does it have to wait for a released version? Got fixes for all of my buildbots so far, but I think you have a few extra platforms.
For the official builds it ought to be a release, but I'm happy to do some tests on all our arches tomorrow from the next branch to help test! 🎊

Would be really useful, since I'd have to make another release if it didn't work :P
Found some time tonight, so I'm running lots of […] key: https://gist.github.com/tianon/d3aa50500e22cf8686a08ae4fa2cbae1

I'm confused about the […] also: there's one extstore-specific test that, when it fails, appears to spew some absolutely massive test data -- not sure how I'll share this massive log ATM or what to trim to get it down to a useful size without trimming useful data, and I think the offending test is […]

Now, if your response to this is that we need to trim the list of architectures we support, that's fair too (we're not opposed to doing that), but the current image (which includes […])

(Also, in the future, feel free to ping me directly on anything you'd like either multiarch or Docker tested before release -- always happy to help out where we can!! ❤️ ❤️)
whaaat, arm64v8 is that which doesn't have the crc32c instruction? Is there an -march or cross-compile issue? The v5 failure doesn't seem to have anything to do with extstore, but the v6 one looks crazy weird.

I get it though; the patch from qualcomm wasn't good enough for this. If the CPU instruction isn't known to the compiler it won't build, which is usually dealt with by replacing the symbol with the opcode. I can also add configure-time tests, which is a pain... but that arm64 really should have that instruction; my rpi3 does. Running tests on rpi2 now.

I bet chunked on the arm32's is something else... but repro'ing this might be a pain if my rpi2 won't do it (it's running tests now.. was too lazy to plug it in).

Is it possible to mask extstore for arm/390 but leave other options for the primary builds? Extstore doesn't make a ton of sense on small platforms, but it would on proper aarch64 server boards with lots of disk... those should build just fine with the crc instruction. On primary platforms extstore can benefit people now; it'd suck to hold back its progress (but also to lose any small platforms for non-extstore purposes).
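(Aside: a minimal sketch of the "replace the symbol with the opcode" trick mentioned here -- the helper name and pinned registers are illustrative, not memcached's actual code. If the assembler is too old to know the crc32cx mnemonic, the raw instruction word can be emitted instead:)

```c
#include <stdint.h>

/* Hypothetical sketch: emit the AArch64 CRC32CX instruction by its raw
 * encoding.  Pinning the operands to w0/x1 keeps the hard-coded word
 * 0x9ac15c00 ("crc32cx w0, w0, x1") valid. */
static inline uint32_t crc32cx_raw(uint32_t crc, uint64_t data)
{
    register uint32_t c __asm__("w0") = crc;
    register uint64_t d __asm__("x1") = data;

    __asm__(".inst 0x9ac15c00" : "+r"(c) : "r"(d));
    return c;
}
```

This builds even when the assembler has no idea what crc32cx is; the tradeoff is that the binary will SIGILL on CPUs lacking the instruction, which is why runtime detection still matters (see below).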
The arm64 box I was testing on is a HiSilicon Hi1616 chip (64 cores) graciously provided to us by the WorksOnArm project, so definitely a server chip, and we're compiling directly on that chip natively (no cross-compile, emulation, or anything like that), which makes it even more odd that it didn't work. I've also got access to an 8-core AppliedMicro XGene1, but it fails with a similar error: […]

So maybe this instruction isn't well-supported across the breadth of arm64 chips? I'm happy to re-test […]

Regarding the IBM s390x mainframe stuff, I'll give a poke to some contacts I've got on that team and see if they've got any ideas (but for now, just planning to exclude extstore on s390x and arm32 is totally sane).
what platform is the arm32v5? I haven't looked at the s390x failures, but I have no way of developing on one and am not sure of any users, certainly not extstore users. I'd prefer to put my efforts toward making sure arm works better. If it runs fine without extstore, please stick with that :)

All I have is a couple of RPIs though... one of the tests is flaky under 32bit mode, but they all do pass. I have an rpi2 in pure 32bit, and a rpi3 with a (painfully built) 64bit kernel running aarch64... also, the patch I got which does this came from qualcomm, but I'm unsure what they tested on exactly. I'll do some blind googling and see if I can figure this out.

(also, thanks so much for testing! It's really hard to get access to these exotic platforms...)
How much RAM do the armv5 and v6 platforms have? Looks like some of my tests are flaky under 32bit mode, as they're trying to fill disk and expecting the page layout to be in specific forms. Staring hard at the failure, for some of these that might just be the case. armv5 failing normal chunked-items.t (not even extstore) might also be similar; unfortunately the test platform makes it hard to tell if the daemon just died during the test, or if it gave invalid output.

re: v8, I'll have to think about how to determine via configure if armv8-a+crc should be forced :/ might be as simple as "if I force this march and it works", but I have a deep fear of that accidentally compiling something that won't run if the target arch is actually v7 or something.
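(Aside: one way to defuse that "compiles but won't run" fear is to have the configure check execute the probe rather than just link it. A sketch of such a probe, compiled with CFLAGS="-march=armv8-a+crc" -- illustrative, not memcached's actual check:)

```c
/* If this program compiles AND runs successfully on the build machine,
 * the toolchain accepts +crc and the CPU really implements the
 * instruction.  (A cross-compile would have to skip the run step.) */
#include <arm_acle.h>

int main(void)
{
    (void)__crc32cd(0, 42);  /* executes a hardware crc32c instruction */
    return 0;
}
```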
Alright, looks like I goofed my cross compile on the rpi3: it wasn't actually building with aarch64. You probably do need the -march to get the silly thing to compile. For now, I've added an "--enable-arm-crc32" option which gates the instructions for those who're familiar enough to build it properly. This is pushed to 'next'; you should try again with --enable-extstore but without --enable-arm-crc32.

The other platforms I can look at once you let me know how much RAM they have. I've been testing builds with 512MB. On the other hand, armv5 might be too old for me to want to really pursue.
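(Aside: roughly, a build-time switch like --enable-arm-crc32 ends up selecting between a hardware path and a portable fallback in the source. A minimal sketch of that shape -- the ENABLE_ARM_CRC32 macro and the function name are illustrative, not memcached's:)

```c
#include <stdint.h>

#if defined(ENABLE_ARM_CRC32) && defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
/* hardware path: single crc32c instruction per 8 bytes */
static uint32_t crc32c_update(uint32_t crc, uint64_t data)
{
    return __crc32cd(crc, data);
}
#else
/* portable fallback: bitwise reflected CRC-32C (Castagnoli), slow but
 * builds and runs anywhere */
static uint32_t crc32c_update(uint32_t crc, uint64_t data)
{
    for (int i = 0; i < 64; i += 8) {
        crc ^= (uint8_t)(data >> i);
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78u : 0);
    }
    return crc;
}
#endif
```

The __ARM_FEATURE_CRC32 guard means the intrinsic is only referenced when the compiler was actually given a +crc target, which is exactly the failure mode hit above.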
We build them on that same HiSilicon chip (which has ~125GB of RAM 😅), although I've got a real one here at home (the "Pogoplug") that has 256MB of RAM. Definitely 100% understand if you want to avoid wasting more cycles on this arch (it's very much on the way out -- generally has no hardware floating point chip, so it's about as speedy as you'd imagine).
Yeah, I put out a feeler to IBM, but haven't seen a response, so for now we'll just go extstore-less there. 👍
Ouch. I use https://wiki.debian.org/RaspberryPi3 on mine, which has worked pretty well (arm64 Debian).
Amen -- so, so true. 😞
Ours is a special case (we build on the HiSilicon), but AFAIK, in the real world v6 is really only the RPi 1 and RPi Zero -- even the RPi 2 is a v7 chip (and the 3 is a 64bit v8, as you know).
HiSilicon machine:

$ grep crc /proc/cpuinfo | sort -u
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
$

XGene1:

$ grep crc /proc/cpuinfo | sort -u
$ grep '^Features' /proc/cpuinfo | sort -u
Features : fp asimd evtstrm
$

😕 😞 Is there any way to do runtime detection on arm64 instead? (I mean, you could scrape /proc/cpuinfo […])
That'd be https://pkgs.alpinelinux.org/package/v3.7/main/aarch64/gcc and https://packages.debian.org/stretch/gcc, so 6.4.0 on Alpine and 6.3.0 on Debian.
Sure thing, firing off some fresh builds!

Ok, is it still useful for me to give you the arm32 failing logs, or are you good? The […]

RE: runtime detection, that's exactly what the code is doing, but if you have a compiler error you can't fix that at runtime :P I'll need to revisit by either stubbing the opcodes like I mentioned earlier, or some other trick. Most likely there'll just be a configure option to force the -march build. :/

Is […]? If 1.5.7 passes, does fac34333294ae6a05e822498e3295ec45ce16de4 fail? (Or just bisect it; I have no idea which commit would've caused chunked items to fail in that way.)
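(Aside: the runtime-detection half of this on aarch64 Linux is typically a one-time check of the auxiliary vector. A minimal sketch, assuming Linux and <sys/auxv.h> -- illustrative, not memcached's actual code:)

```c
#include <stdio.h>
#include <sys/auxv.h>

/* HWCAP_CRC32 comes from <asm/hwcap.h> on aarch64; defined here as a
 * fallback so the sketch stays self-contained. */
#ifndef HWCAP_CRC32
#define HWCAP_CRC32 (1UL << 7)
#endif

int main(void)
{
    /* the kernel advertises CPU features via the auxiliary vector, so
     * the crc32c implementation can be chosen once at startup */
    if (getauxval(AT_HWCAP) & HWCAP_CRC32)
        puts("using hardware crc32c");
    else
        puts("using software crc32c fallback");
    return 0;
}
```

As noted, though, this only helps at runtime; if the compiler refuses to emit the instruction in the first place, detection never gets a chance to run.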
Ah, I've been building on a different host than the real builds -- they're currently all using a QEMU VM on my local machine instead. Trying again now on the proper hardware... (We've got a number of arm32 builds that fail on our arm64 hardware, even though it supports almost all arm32 instructions.)

Ok, building on the proper hardware, memcached/memcached@cf3b853 works fine with […]
that's nuts! was the QEMU VM set to low RAM or something? I guess it could also be a timing issue, since I still have some of those in the test suite.

No, I think it must be a timing issue -- the failures were from the beefy HiSilicon box (the one with over 100GB of RAM and 64 cores). 😅 The QEMU emulated machine is a more modest 1GB of RAM with only 4 cores.

ah, confused; I thought you said the failures were from QEMU
Sorry for the confusion! To hopefully clear things up: most of our […]

Got it... well, let me know which ones still fail on the real hardware. I can work through the timing issues in the tests over a longer period of time, which should eventually help with the QEMU stuff.

The ones that fail on the real hardware are down to just […]

Alright, let's focus on arm32v6. Is it still that huge spew of data? If so, you can probably cut like... the middle 50% of it and give me the start and the end bits.
It fails in a slightly different way when I run it on a RPi2 in arm32v6 mode:

+ make test
./sizes
Slab Stats 64
Thread stats -6464
Global stats 184
Settings 528
Item (no cas) 32
Item (cas) 40
extstore header 12
Libevent thread 104
Connection 352
----------------------------------------
libevent thread cumulative 6440
Thread stats cumulative 6336
./testapp
1..52
ok 1 - cache_create
ok 2 - cache_constructor
ok 3 - cache_constructor_fail
ok 4 - cache_destructor
ok 5 - cache_reuse
ok 6 - cache_redzone
ok 7 - issue_161
ok 8 - strtol
ok 9 - strtoll
ok 10 - strtoul
ok 11 - strtoull
ok 12 - issue_44
ok 13 - vperror
ok 14 - issue_101
Signal handled: Terminated.
ok 15 - start_server
ok 16 - issue_92
ok 17 - issue_102
ok 18 - binary_noop
ok 19 - binary_quit
ok 20 - binary_quitq
ok 21 - binary_set
ok 22 - binary_setq
ok 23 - binary_add
ok 24 - binary_addq
ok 25 - binary_replace
ok 26 - binary_replaceq
ok 27 - binary_delete
ok 28 - binary_deleteq
ok 29 - binary_get
ok 30 - binary_getq
ok 31 - binary_getk
ok 32 - binary_getkq
ok 33 - binary_gat
ok 34 - binary_gatq
ok 35 - binary_gatk
ok 36 - binary_gatkq
ok 37 - binary_incr
ok 38 - binary_incrq
ok 39 - binary_decr
ok 40 - binary_decrq
ok 41 - binary_version
ok 42 - binary_flush
ok 43 - binary_flushq
ok 44 - binary_append
ok 45 - binary_appendq
ok 46 - binary_prepend
ok 47 - binary_prependq
ok 48 - binary_stat
ok 49 - binary_illegal
ok 50 - binary_pipeline_hickup
Signal handled: Interrupt.
ok 51 - shutdown
ok 52 - stop_server
getaddrinfo(): Name does not resolve
failed to listen on TCP port 37893: Invalid argument
slab class 1: chunk size 80 perslab 13107
slab class 2: chunk size 104 perslab 10082
slab class 3: chunk size 136 perslab 7710
slab class 4: chunk size 176 perslab 5957
slab class 5: chunk size 224 perslab 4681
slab class 6: chunk size 280 perslab 3744
slab class 7: chunk size 352 perslab 2978
slab class 8: chunk size 440 perslab 2383
slab class 9: chunk size 552 perslab 1899
slab class 10: chunk size 696 perslab 1506
slab class 11: chunk size 872 perslab 1202
slab class 12: chunk size 1096 perslab 956
slab class 13: chunk size 1376 perslab 762
slab class 14: chunk size 1720 perslab 609
slab class 15: chunk size 2152 perslab 487
slab class 16: chunk size 2696 perslab 388
slab class 17: chunk size 3376 perslab 310
slab class 18: chunk size 4224 perslab 248
slab class 19: chunk size 5280 perslab 198
slab class 20: chunk size 6600 perslab 158
slab class 21: chunk size 8256 perslab 127
slab class 22: chunk size 10320 perslab 101
slab class 23: chunk size 12904 perslab 81
slab class 24: chunk size 16136 perslab 64
slab class 25: chunk size 20176 perslab 51
slab class 26: chunk size 25224 perslab 41
slab class 27: chunk size 31536 perslab 33
slab class 28: chunk size 39424 perslab 26
slab class 29: chunk size 49280 perslab 21
slab class 30: chunk size 61600 perslab 17
slab class 31: chunk size 77000 perslab 13
slab class 32: chunk size 96256 perslab 10
slab class 33: chunk size 120320 perslab 8
slab class 34: chunk size 150400 perslab 6
slab class 35: chunk size 188000 perslab 5
slab class 36: chunk size 235000 perslab 4
slab class 37: chunk size 293752 perslab 3
slab class 38: chunk size 367192 perslab 2
slab class 39: chunk size 524288 perslab 2
<26 server listening (auto-negotiate)
<27 new auto-negotiating client connection
<27 connection closed.
slab class 1: chunk size 80 perslab 13107
slab class 2: chunk size 104 perslab 10082
slab class 3: chunk size 136 perslab 7710
slab class 4: chunk size 176 perslab 5957
slab class 5: chunk size 224 perslab 4681
slab class 6: chunk size 280 perslab 3744
slab class 7: chunk size 352 perslab 2978
slab class 8: chunk size 440 perslab 2383
slab class 9: chunk size 552 perslab 1899
slab class 10: chunk size 696 perslab 1506
slab class 11: chunk size 872 perslab 1202
slab class 12: chunk size 1096 perslab 956
slab class 13: chunk size 1376 perslab 762
slab class 14: chunk size 1720 perslab 609
slab class 15: chunk size 2152 perslab 487
slab class 16: chunk size 2696 perslab 388
slab class 17: chunk size 3376 perslab 310
slab class 18: chunk size 4224 perslab 248
slab class 19: chunk size 5280 perslab 198
slab class 20: chunk size 6600 perslab 158
slab class 21: chunk size 8256 perslab 127
slab class 22: chunk size 10320 perslab 101
slab class 23: chunk size 12904 perslab 81
slab class 24: chunk size 16136 perslab 64
slab class 25: chunk size 20176 perslab 51
slab class 26: chunk size 25224 perslab 41
slab class 27: chunk size 31536 perslab 33
slab class 28: chunk size 39424 perslab 26
slab class 29: chunk size 49280 perslab 21
slab class 30: chunk size 61600 perslab 17
slab class 31: chunk size 77000 perslab 13
slab class 32: chunk size 96256 perslab 10
slab class 33: chunk size 120320 perslab 8
slab class 34: chunk size 150400 perslab 6
slab class 35: chunk size 188000 perslab 5
slab class 36: chunk size 235000 perslab 4
slab class 37: chunk size 293752 perslab 3
slab class 38: chunk size 367192 perslab 2
slab class 39: chunk size 524288 perslab 2
<26 server listening (ascii)
<27 new ascii client connection.
<27 connection closed.
slab class 1: chunk size 80 perslab 13107
slab class 2: chunk size 104 perslab 10082
slab class 3: chunk size 136 perslab 7710
slab class 4: chunk size 176 perslab 5957
slab class 5: chunk size 224 perslab 4681
slab class 6: chunk size 280 perslab 3744
slab class 7: chunk size 352 perslab 2978
slab class 8: chunk size 440 perslab 2383
slab class 9: chunk size 552 perslab 1899
slab class 10: chunk size 696 perslab 1506
slab class 11: chunk size 872 perslab 1202
slab class 12: chunk size 1096 perslab 956
slab class 13: chunk size 1376 perslab 762
slab class 14: chunk size 1720 perslab 609
slab class 15: chunk size 2152 perslab 487
slab class 16: chunk size 2696 perslab 388
slab class 17: chunk size 3376 perslab 310
slab class 18: chunk size 4224 perslab 248
slab class 19: chunk size 5280 perslab 198
slab class 20: chunk size 6600 perslab 158
slab class 21: chunk size 8256 perslab 127
slab class 22: chunk size 10320 perslab 101
slab class 23: chunk size 12904 perslab 81
slab class 24: chunk size 16136 perslab 64
slab class 25: chunk size 20176 perslab 51
slab class 26: chunk size 25224 perslab 41
slab class 27: chunk size 31536 perslab 33
slab class 28: chunk size 39424 perslab 26
slab class 29: chunk size 49280 perslab 21
slab class 30: chunk size 61600 perslab 17
slab class 31: chunk size 77000 perslab 13
slab class 32: chunk size 96256 perslab 10
slab class 33: chunk size 120320 perslab 8
slab class 34: chunk size 150400 perslab 6
slab class 35: chunk size 188000 perslab 5
slab class 36: chunk size 235000 perslab 4
slab class 37: chunk size 293752 perslab 3
slab class 38: chunk size 367192 perslab 2
slab class 39: chunk size 524288 perslab 2
<26 server listening (binary)
<27 new binary client connection.
<27 connection closed.
Invalid value for binding protocol: http
-- should be one of auto, binary, or ascii
Maximum connections must be greater than 0
Maximum connections must be greater than 0
Number of threads must be greater than 0
t/00-startup.t .............. ok
t/64bit.t ................... skipped: Skipping 64-bit tests on 32-bit build
t/binary-extstore.t ......... ok
t/binary-get.t .............. ok
t/binary-sasl.t ............. skipped: Skipping SASL tests
t/binary.t .................. ok
t/bogus-commands.t .......... ok
t/cas.t ..................... ok
t/chunked-extstore.t ........ ok
t/chunked-items.t ........... ok
t/daemonize.t ............... ok
t/dash-M.t .................. ok
t/dyn-maxbytes.t ............ ok
t/evictions.t ............... ok
t/expirations.t ............. ok
t/extstore-buckets.t ........ ok
# Failed test '0 pages are free'
# at t/extstore.t line 100.
# got: '1'
# expected: '0'
# Looks like you failed 1 test of 27.
t/extstore.t ................
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/27 subtests
t/flags.t ................... ok
t/flush-all.t ............... ok
t/getandtouch.t ............. ok
t/getset.t .................. ok
t/idle-timeout.t ............ ok
t/incrdecr.t ................ ok
t/inline_asciihdr.t ......... ok
t/issue_104.t ............... ok
t/issue_108.t ............... ok
t/issue_14.t ................ ok
t/issue_140.t ............... skipped: Fix for Issue 140 was only an illusion
t/issue_152.t ............... ok
t/issue_163.t ............... ok
t/issue_183.t ............... ok
t/issue_192.t ............... ok
t/issue_22.t ................ ok
t/issue_260.t ............... skipped: Only possible to test #260 under artificial conditions
t/issue_29.t ................ ok
t/issue_3.t ................. ok
t/issue_41.t ................ ok
t/issue_42.t ................ ok
t/issue_50.t ................ ok
t/issue_61.t ................ ok
t/issue_67.t ................ ok
t/issue_68.t ................ ok
t/issue_70.t ................ ok
Item max size cannot be less than 1024 bytes.
Cannot set item size limit higher than 1/2 of memory max.
t/item_size_max.t ........... ok
t/line-lengths.t ............ ok
t/lru-crawler.t ............. ok
t/lru-maintainer.t .......... ok
t/lru.t ..................... ok
t/malicious-commands.t ...... ok
t/maxconns.t ................ ok
t/misbehave.t ............... skipped: Privilege drop not supported
t/multiversioning.t ......... ok
t/noreply.t ................. ok
t/quit.t .................... ok
t/refhang.t ................. skipped: Test is flaky. Needs special hooks.
t/slabhang.t ................ skipped: Test is flaky. Needs special hooks.
t/slabs-reassign-chunked.t .. ok
t/slabs-reassign2.t ......... ok
t/slabs_reassign.t .......... ok
PORT: 46715
t/stats-conns.t ............. ok
t/stats-detail.t ............ ok
t/stats.t ................... ok
t/touch.t ................... ok
t/udp.t ..................... ok
t/unixsocket.t .............. ok
t/watcher.t ................. ok
t/whitespace.t .............. skipped: Skipping tests probably because you don't have git.
Test Summary Report
-------------------
t/extstore.t (Wstat: 256 Tests: 27 Failed: 1)
Failed test: 20
Non-zero exit status: 1
Files=67, Tests=64765, 532 wallclock secs (122.04 usr 5.04 sys + 213.23 cusr 32.06 csys = 372.37 CPU)
Result: FAIL
make: *** [Makefile:1873: test] Error 1
The command '/bin/sh -c set -x && apk add --no-cache --virtual .build-deps autoconf automake ca-certificates coreutils cyrus-sasl-dev dpkg-dev dpkg gcc libc-dev libevent-dev libressl linux-headers make perl perl-utils tar && wget -O memcached.tar.gz "https://github.com/memcached/memcached/archive/$MEMCACHED_COMMIT.tar.gz" && mkdir -p /usr/src/memcached && tar -xzf memcached.tar.gz -C /usr/src/memcached --strip-components=1 && rm memcached.tar.gz && cd /usr/src/memcached && ./autogen.sh && ./configure --build="$(dpkg-architecture --query DEB_BUILD_GNU_TYPE)" --enable-extstore --enable-sasl && make -j "$(nproc)" && make test && make install && cd / && rm -rf /usr/src/memcached && runDeps="$( scanelf --needed --nobanner --format '%n#p' --recursive /usr/local | tr ',' '\n' | sort -u | awk 'system("[ -e /usr/local/lib/" $1 " ]") == 0 { next } { print "so:" $1 }' )" && apk add --virtual .memcached-rundeps $runDeps && apk del .build-deps && memcached -V' returned a non-zero code: 2
that failure I'm familiar with. my rpi2 did that once but on repeated runs works okay. The test is flaky but passable. I'll boot my pi2 back up and try to improve that now, I guess.
Anything more I can provide to help debug the failure in the longer log? That's from the machine that'll actually be building this once it's released, so it's more concerning than just a flaky test (since we automatically re-attempt builds several times to account for flaky failures when we build the real thing).
Try 'next' branch again, please? I can no longer repro the second failure on my rpi2, and at least the start of the log spew is likely the same issue in the other test. I changed the pacing of the inserts and ramped up the count a bit to make up for extra space available on 32bit systems.

broke it in a different way, hold on...
okay, pushed 'next' for chunked item changes again. re-tuned a lot of things, and shrunk the number of tests by a lot. if it spews again the file should at least be smaller.

hmm... running them in a loop, they do occasionally fail, but less often :/ need a better mechanism for doing a trailing fill on such slow systems...

edit: alright, 'next' is rebased again... now passes every time on my desktop and rpi2, at least so far, which is much more often than all previous tests. Sorry about that... it took a long time to figure out that the compactor was helpfully rescuing the damn canary values :) So now it's told to back off until it's required for the tests. That, among a few other changes. This might fix all of the tests everywhere.

edit2: rebased with one more tiny change. it passed in a loop for an hour on the rpi2.
Nice! I've tested […] The money shots: […]
Think I need to task off of this for a while; does the arm32v6 always fail in that same spot, or does it pass sometimes? I think that's still a pacing issue. The s390x... just mask it off for now. Would you mind opening an issue on the memcached project and linking back to both logs? There're some other user-facing bugs and some bench work I'd like to prioritize for the next week.

Thanks a ton for all your help so far... sorry my tests are so flaky :| Feel dumb it took so long to figure out the compactor race. No idea what's wrong with the s390x though.
No problem -- definitely not complaining, just trying to be helpful in one of the only ways I can! 😄 Will file issues to track these further and we'll just mask out on those two arches for now. 👍

(Yes, I think both test failures are pretty consistent -- with s390x, we've got both Debian and Alpine failing in exactly the same way.)

Ah crap, spoke too soon. On […]

Do you still want another issue just for […]? (Filed s390x at memcached/memcached#381)
Yeah, please do. The tests shouldn't be flaky, and I beat most of that out of them already. Dunno howtf your platform is so sensitive to it, though. The compaction algorithms changed a bunch since I first wrote the tests, so they were due for some work.
Done deal! memcached/memcached#382 (I've just confirmed on s390x, and that one is definitely not a flakiness issue -- it's gotta be something deeper. 😞)
oh shoot... actually, in t/chunked-extstore.t, can you try changing: […] with drop_under being 3, it could race and throw out odd- or even-numbered objects :| that won't fix s390x though. I have no idea what's going on there.
Still flaky with that patched from 3 to 1. 😞
nuts. punting! fuck it. I tried :) |
LGTM
- `bash`: 4.4.23
- `ghost`: 1.23.1
- `julia`: 0.6.3
- `matomo`: GPG race conditions (matomo-org/docker#105)
- `memcached`: `extstore` (docker-library/memcached#38)
- `mongo`: 4.0.0~rc2
- `openjdk`: remove 9 (docker-library/openjdk#199), add 10 for Windows (docker-library/openjdk#200), 11-ea+16
- `owncloud`: update PECL exts (docker-library/owncloud#102)
- `percona`: 5.7.22, 5.6.40
- `php`: fix `wget: error getting response: Connection reset by peer`
- `piwik`: GPG race conditions (matomo-org/docker#105)
- `python`: add `nis` nonsense to 2.7 (docker-library/python#281), 3.7.0b5
- `rocket.chat`: 0.65.1
- `ruby`: 2.6.0-preview2
- `wordpress`: update GPG for wp-cli
Extstore info at: https://github.com/memcached/memcached/wiki/Extstore
Looks like it is maturing very quickly, and it would be worthwhile to have access to it in the Docker images.
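(Note: building with `--enable-extstore` only compiles the feature in; it still has to be switched on at runtime via `ext_path`. Per the wiki page above -- the path and size here are illustrative:)

```console
$ ./configure --enable-extstore && make
$ memcached -m 1024 -o ext_path=/data/extstore:5G
```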