Message boards : Number crunching : Rosetta@home using AVX / AVX2 ?
| Author | Message | 
|---|---|
| [VENETO] boboviz Joined: 1 Dec 05 Posts: 2124 Credit: 12,428,047 RAC: 2,329 | 
 There are MANY things that can be done to significantly improve performance without a major rewrite. So, what's the problem?? The first change I talked about was introducing "homogeneous coordinates". This is very nice because it does not "really" change the "project code". You can introduce the C++ TEMPLATE typedef changes, recompile, and you should get the EXACT SAME ANSWER with the new compile options. So, again, what's the problem?? The second place where substantial improvement can be accomplished with little effort is by upgrading the server to steer optimized applications to target crunchers. Waiting for crowdfunding :-P | 
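A minimal sketch of the "homogeneous coordinates" idea mentioned above, assuming it means storing each x, y, z point in four lanes with a zero in the extra lane so that it fills one 256-bit register. This is not Rosetta's actual code: the names `Vec`, `xyzVector3`, `xyzVector4`, `make3`, `make4` and `dot` are hypothetical and only illustrate why a typedef change can leave the numerical answer unchanged.

```cpp
#include <array>
#include <cstddef>

// Hypothetical fixed-size vector. N is the storage width, which may be
// larger than the number of meaningful coordinates.
template <typename T, std::size_t N>
struct Vec {
    std::array<T, N> v{};  // value-initialized, so unused lanes start at 0
};

// Switching the project-wide typedef from the 3-wide to the 4-wide type is
// the kind of "template typedef change" the post describes.
typedef Vec<double, 3> xyzVector3;
typedef Vec<double, 4> xyzVector4;  // x, y, z plus one padding lane

// Dot product over all stored lanes. With a zero padding lane the extra
// term is 0 * 0, so the result matches the 3-wide version.
template <typename T, std::size_t N>
T dot(const Vec<T, N>& a, const Vec<T, N>& b) {
    T sum = T(0);
    for (std::size_t i = 0; i < N; ++i) sum += a.v[i] * b.v[i];
    return sum;
}

// Helpers that set only x, y, z. The fourth lane of xyzVector4 stays 0.
inline xyzVector3 make3(double x, double y, double z) {
    xyzVector3 r;
    r.v = {x, y, z};
    return r;
}

inline xyzVector4 make4(double x, double y, double z) {
    xyzVector4 r;
    r.v = {x, y, z, 0.0};
    return r;
}
```

The padded type costs one extra double per point, which is the price of letting the compiler treat every point as a full 4-wide vector.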
| Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 | 
 I believe the reference to upgrading the server was meant to indicate software updates that would support task routing by host type. rjs5 will have to clarify, but I believe his study and figures are estimates. To track back to original source code and make the suggested change and measure results is another story. That makes miscommunication very easy too when one is looking at one end of the elephant, and the other is looking at the other (source code vs. executable code). Rosetta Moderator: Mod.Sense | 
| [VENETO] boboviz Joined: 1 Dec 05 Posts: 2124 Credit: 12,428,047 RAC: 2,329 | 
 "I believe the reference to upgrading the server was meant to indicate software updates that would support task routing by host type." Crowdfunding may also be for software (an RHEL license, for example, or a new Visual Studio licence). "To track back to original source code and make the suggested change and measure results is another story. That makes miscommunication very easy too when one is looking at one end of the elephant, and the other is looking at the other (source code vs. executable code)." Are they scared to "fork the code" and try it? Is it a waste of resources? I think that, for example, the admins lost a lot of time and resources on the Android version. IMHO. | 
| David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 | 
 Rosetta is constantly evolving and keeping up with the latest optimization technology and developing/testing/maintaining many different builds would be a very complex task which we currently don't have the resources for aside from taking away resources for research. rjs5 has put a huge effort into helping and looking into optimization possibilities with Rosetta. Now that CASP is almost over, we can get back to this. Our servers are chugging along, and our throughput has nearly doubled quite recently, relatively speaking, due to an influx of hosts from Charity Engine, I believe. Despite this, the load on our servers is fine. In the meantime, there have been publications in Nature and Science, and some exciting results with co-evolution, currently under review, that relied heavily on R@h. This research will hopefully have a huge positive impact in the future and make good use of DNA sequence data. There are videos and articles explaining this in a recent Science publication, as noted in our news. We are not sure if hiring a developer specifically for the task of optimization will work as DB has mentioned it has been tried before but with the complexity and dynamics of Rosetta, it has been difficult to accomplish. | 
| [VENETO] boboviz Joined: 1 Dec 05 Posts: 2124 Credit: 12,428,047 RAC: 2,329 | 
 "Rosetta is constantly evolving and keeping up with the latest optimization technology and developing/testing/maintaining many different builds would be a very complex task which we currently don't have the resources for aside from taking away resources for research......We are not sure if hiring a developer specifically for the task of optimization will work as DB has mentioned it has been tried before but with the complexity and dynamics of Rosetta, it has been difficult to accomplish." OK, we understand: no optimizations in the near future, and no new or updated servers. Please close my thread about crowdfunding; it's a waste of time. And, personally, I'll stop bothering you. | 
| David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1480 Credit: 4,334,829 RAC: 0 | 
 "Rosetta is constantly evolving and keeping up with the latest optimization technology and developing/testing/maintaining many different builds would be a very complex task which we currently don't have the resources for aside from taking away resources for research......We are not sure if hiring a developer specifically for the task of optimization will work as DB has mentioned it has been tried before but with the complexity and dynamics of Rosetta, it has been difficult to accomplish." We are going to update the database and file servers. And rjs5 may be able to help us further with optimizations. And you are not bothering us at all; we appreciate the discussion and input! It's all with good intentions. And crowdfunding may be promising. I haven't had a chance to read that thread yet. But we did look into our donations and it's under a couple grand, I believe, so more would help! | 
| Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 | 
 "I haven't had a chance to read that thread yet. But we did look into our donations and it's under a couple grand I believe so more would help!" It's good to hear from you. :-) I think experiment.com would be a good place to start. There are already crowdfunding campaigns from other universities. Random pick: "Bacterial Vesicular Delivery: A One-Step Protein Transport Method". Kickstarter et al. are probably not the right choice: there are some hefty fees and they are not really science-related. The r@h users usually participate in other forums, too, so it would be a good idea to come up with a standardized signature for those forum posts and advertise a little ;-) | 
| Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 | 
 "Ok, we understand:" I think that is overly negative. If they can make scientific progress by changing around their present applications, that may be a more productive use of their time than optimizing their applications. The latter might tend to freeze the present science in place rather than allowing it to advance (I don't know, I just raise the issue). We are dealing with cutting-edge science here, not turning out widgets on a production line, and they have to be free to go where it leads them. But if they just need money for servers, I am sure we can help. Money is not always the limiting factor though. | 
| rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0 | 
 "I believe the reference to upgrading the server was meant to indicate software updates that would support task routing by host type." I think the 32-bit SSE2 applications can be shipped to any cruncher. It probably makes sense to build in the application routing for new and more highly optimized applications. Level 1) What is being shipped today. New level 2) Applications modified for SSE2 PLUS vector padding. Fast level 3) AFTER ROUTING is implemented ... application #2) but compiled for AVX2 for wider optimization and routed to AVX2 crunchers. The figures are my estimates based on analysis of dynamic execution profiling over 1-hour Rosetta runs. It is VERY, VERY, VERY HARD to assign a FIXED improvement since the machines are very different ... microarchitecture, cache, memory sizes, disk type (HDD or SSD), .... When I started a couple of PrimeGrid jobs in parallel, they degraded Rosetta by 30% ... so giving a single number is "tough". | 
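The three levels above can be sketched as a small decision function. This is only an illustration under stated assumptions: `AppLevel`, `choose_level` and `host_has_avx2` are hypothetical names, the real routing would happen in the project's server software rather than in client code, and `__builtin_cpu_supports` is a GCC/Clang builtin that exists only for x86 targets.

```cpp
// Hypothetical labels for the three levels described in the post.
enum class AppLevel {
    Current,     // level 1: what is being shipped today
    Sse2Padded,  // level 2: SSE2 plus vector padding, runs on any cruncher
    Avx2         // level 3: level 2 sources compiled for AVX2
};

// Decide which build a host would receive.
//   padded_build_available: the level 2 application has been released
//   routing_enabled:        the server can route by host CPU type
//   host_avx2:              the host reports AVX2 support
inline AppLevel choose_level(bool padded_build_available,
                             bool routing_enabled,
                             bool host_avx2) {
    // Level 3 is only safe once routing exists, because an AVX2 binary
    // would fail on hosts without AVX2.
    if (routing_enabled && host_avx2) return AppLevel::Avx2;
    if (padded_build_available) return AppLevel::Sse2Padded;
    return AppLevel::Current;
}

// Ask the local CPU whether it supports AVX2. On compilers or
// architectures without the builtin this conservatively reports false.
inline bool host_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}
```

A host could be classified with `choose_level(true, true, host_has_avx2())`. The point of the ordering is that levels 1 and 2 need no routing at all, which matches the post's suggestion to ship them to everyone.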
| [VENETO] boboviz Joined: 1 Dec 05 Posts: 2124 Credit: 12,428,047 RAC: 2,329 | 
 "Level 1) What is being shipped today." The great "strength" of the Rosetta code is that, with a single code base, you can crunch a lot of different and heterogeneous simulations. This strength, on the other hand, is also its biggest weakness: a lot of different needs create a lot of "fluffy" code. A solution may be to split the code into different specialized apps (one for ab initio, one for folding, one for docking, etc.) like A LOT of BOINC projects do. | 
| [VENETO] boboviz Joined: 1 Dec 05 Posts: 2124 Credit: 12,428,047 RAC: 2,329 | 
 Developer "F": commenting on my recommendation for homogeneous coordinates ... Eigen is very powerful (TensorFlow uses it, for example), but I don't know if Rosy's team uses it: http://programmingexamples.net/wiki/Eigen | 
| sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 | 
 I got a little too curious about AVX / AVX2 and decided to do a little experiment: I made a little program that multiplies a 4x4 matrix by a 4x1 vector. It has two subroutines: one does it using simple loops, the other tries to be as 'AVX' as possible. Each is run a hundred times and the CPU cycles are counted.
#include <iostream>
//#include <ia32intrin.h>
#include <x86intrin.h>
#include <immintrin.h>
using namespace std;
#define ALIGN __attribute__ ((aligned (32)))
void matrix_multiply(double *mat, double *vec, double *res);
void matrix_mul_simd(double *mat, double *vec, double *res);
void print(double res[4]);
int main() {
	unsigned long long timestm, delta;
	cout << "avxtest" << endl; // prints avxtest
	double ALIGN mat[4][4] = {{1.0, 2.0, 3.0, 4.0}, {5.0, 6.0, 7.0, 8.0},
			{9.0, 10.0, 11.0, 12.0}, {13.0, 14.0, 15.0, 16.0}};
	double ALIGN vec[4] = {1.0, 2.0, 3.0, 4.0}, res[4] = { 0.0, 0.0, 0.0, 0.0};
	timestm = __rdtsc();
	for(int i=0;i<100; i++)
		matrix_multiply((double *)&mat, (double *)&vec, (double *)&res);
	print(res); // console output is inside the timed region, so its cost is counted too
	delta = __rdtsc() - timestm;
	cout << "Loop: " << delta << endl;
	timestm = __rdtsc();
	for(int i=0;i<100; i++)
		matrix_mul_simd((double *)&mat, (double *)&vec, (double *)&res);
	print(res); // same here: the AVX figure also includes the print
	delta = __rdtsc() - timestm;
	cout << "AVX: " << delta << endl;
	return 0;
}
void print(double res[4]) {
	cout << "result: ";
	for(int i=0; i<4; i++) {
		if (i > 0)	cout << ", ";
		cout << res[i] ; }
	cout << endl;
}
void matrix_multiply(double *mat, double *vec, double *res) {
	int i, j;
	for(i=0; i<4; i++) *(res+i) = 0;
	for(i=0; i<4; i++) {
		for(j=0; j<4; j++) {
			*(res+i) += *(mat + j + i*4) * *(vec+j);
		}}}
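void matrix_mul_simd(double *mat, double *vec, double *res) {
	// accumulator r starts at 0.0 in all four lanes
	double ALIGN t[4] = {0.0, 0.0, 0.0, 0.0};
	__m256d r = _mm256_broadcast_sd(&t[0]);
	// gather offsets 0, 4, 8, 12: one element from each row of the row-major matrix
	__m128i d = _mm256_castsi256_si128 (_mm256_set_epi32 (0, 0, 0, 0, 12, 8, 4, 0));
	for(int i=0; i<4; i++) {
		__m256d v = _mm256_broadcast_sd(vec+i);          // vec[i] in all four lanes
		__m256d a = _mm256_i32gather_pd (mat + i, d, 8); // column i of mat
		r = _mm256_fmadd_pd(a,v,r);                      // r += column i * vec[i]
	}
	_mm256_store_pd(res, r); // aligned store: res must be 32-byte aligned
}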
}
Compile and run the code with GCC (on Linux):
> g++ -O2 -mavx2 -mavx -mfma -o avxtest avxtest.cpp
> g++ -O2 -mavx2 -mavx -mfma -S avxtest.cpp
> ./avxtest
avxtest
result: 30, 70, 110, 150
Loop: 96439 << these are cpu cycles
result: 30, 70, 110, 150
AVX: 71015 << these are cpu cycles
> ./avxtest
avxtest
result: 30, 70, 110, 150
Loop: 95024
result: 30, 70, 110, 150
AVX: 64296
> ./avxtest
avxtest
result: 30, 70, 110, 150
Loop: 68096
result: 30, 70, 110, 150
AVX: 53596
So AVX didn't really reduce it to a small fraction; the differences are perhaps marginal. And as it turns out, GCC is 'smart' enough to use AVX-encoded FMA instructions for the 'loop' code on its own, although only in scalar form (vfmadd231sd works on one double at a time) rather than as packed 4-wide operations (excerpts from the generated assembly):
        .globl  _Z15matrix_multiplyPdS_S_
        .type   _Z15matrix_multiplyPdS_S_, @function
_Z15matrix_multiplyPdS_S_:
.LFB2030:
        .cfi_startproc
        movq    $0, (%rdx)
        movq    $0, 8(%rdx)
        xorl    %ecx, %ecx
        movq    $0, 16(%rdx)
        movq    $0, 24(%rdx)
.L16:
        vmovsd  (%rdx,%rcx), %xmm0
        xorl    %eax, %eax
.L19:
        vmovsd  (%rdi,%rax), %xmm1
        vfmadd231sd     (%rsi,%rax), %xmm1, %xmm0
        addq    $8, %rax
        vmovsd  %xmm0, (%rdx,%rcx)
and this is the part that is hand optimised 
        .globl  _Z15matrix_mul_simdPdS_S_
        .type   _Z15matrix_mul_simdPdS_S_, @function
_Z15matrix_mul_simdPdS_S_:
.LFB2031:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        vxorpd  %xmm0, %xmm0, %xmm0
        xorl    %eax, %eax
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        andq    $-32, %rsp
        addq    $16, %rsp
        vcmppd  $0, %ymm0, %ymm0, %ymm3
        movq    $0, -48(%rsp)
        movq    $0, -40(%rsp)
        vmovdqa .LC3(%rip), %xmm4
        movq    $0, -32(%rsp)
        movq    $0, -24(%rsp)
.L23:
        leaq    (%rdi,%rax), %rcx
        vmovapd %ymm3, %ymm5
        vbroadcastsd    (%rsi,%rax), %ymm1
        addq    $8, %rax
        vgatherdpd      %ymm5, (%rcx,%xmm4,8), %ymm2
        cmpq    $32, %rax
        vfmadd231pd     %ymm1, %ymm2, %ymm0
        jne     .L23
        vmovapd %ymm0, (%rdx)
        vzeroupper
        leave
Conclusions:
1) GCC/G++ is pretty (very) 'smart': if you simply select the target options, e.g. -O2 -mavx2 -mavx -mfma, the compiler uses AVX/FMA instructions by itself, without hand-written intrinsics. In the listing above it used the scalar form; packing several loop iterations into one instruction (auto-vectorization) may additionally need -O3 or -ftree-vectorize, depending on the GCC version.
2) The hand-optimised code seemed fairly consistently faster, but this is a 'toy' problem. A real problem may take an exorbitant effort to optimise, so it seems good to let GCC / the compilers do the optimizations where convenient / appropriate.
An app like r@h needs to run on a large number of platforms, and some (many) of them do not have AVX, let alone AVX2. We would not want the app to 'crash' on those platforms simply because they lack AVX/AVX2. Hence such optimization is a compromise of sorts: the same program may need to contain both (AVX) optimised and non-optimised code paths, and the run-time switching may incur some performance penalty in addition. Nevertheless, 'safe' compiler optimizations are probably a 'good thing', e.g. a hybrid binary that uses AVX where available but also runs on non-AVX platforms. | 
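A minimal sketch of such a 'hybrid' binary, assuming GCC or Clang on x86-64. It is an illustration, not how r@h is built: `matvec4_generic`, `matvec4_avx2` and `select_matvec4` are hypothetical names, the `target` function attribute and `__builtin_cpu_supports` are compiler-specific extensions, and on other compilers or architectures the sketch simply falls back to the plain version.

```cpp
#include <cstddef>

// Portable version: plain loops, compiled with the program's default flags.
static void matvec4_generic(const double* mat, const double* vec, double* res) {
    for (std::size_t i = 0; i < 4; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < 4; ++j) sum += mat[i * 4 + j] * vec[j];
        res[i] = sum;
    }
}

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
// Same source, but the compiler is allowed to use AVX2 and FMA instructions
// inside this one function even if the rest of the program targets plain
// x86-64. It must only be called on CPUs that support both.
__attribute__((target("avx2,fma")))
static void matvec4_avx2(const double* mat, const double* vec, double* res) {
    for (std::size_t i = 0; i < 4; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < 4; ++j) sum += mat[i * 4 + j] * vec[j];
        res[i] = sum;
    }
}
#endif

typedef void (*MatVecFn)(const double*, const double*, double*);

// Choose an implementation at run time. Calling this once at startup and
// keeping the pointer limits the switching cost to one indirect call.
static MatVecFn select_matvec4() {
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
        return matvec4_avx2;
#endif
    return matvec4_generic;
}
```

Usage would be `MatVecFn f = select_matvec4(); f(mat, vec, res);`. The same executable then runs on hosts with and without AVX2, at the cost of carrying two copies of each hot routine.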
| sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 | 
 Intel CPUs can process 4 double-precision values per SIMD instruction with AVX/AVX2 'per core'; I'm not sure if things like instruction-level parallelism and hyperthreading (does that make it 2 cores of AVX2?) could make that even 'more parallel'. But I've also run some other 'benchmark' apps (e.g. http://www.openblas.net/) and noted that the benefit of AVX/AVX2 depends on the problem being able to run 'completely in the CPU' without needing to touch RAM or disk (slowest). For that matter, the cases that truly reach > 100 (say close to 200) Gflops on even the 'average' i7 desktop are *multiplying large square matrices*, mostly dense square matrices of large dimensions, say 10,000 x 10,000 (i.e. 10,000 unknowns), while tiny matrices like 4x4 show little if any perceptible performance gain from AVX. Other overheads such as disk I/O far overwhelm the time to work on that 4x4 matrix. I'd guess, in the same light, that if r@h problem scenarios fit those *special cases*, such as multiplying 10,000 x 10,000 square matrices, AVX/AVX2 (SIMD) may turn out to be a significant advantage. And if the matrix dimensions are even larger, high-end GPUs may show (very) significant performance gains over the CPU, but at the cost of much higher power consumption (e.g. 200-300 watts just for the GPU card itself). As it stands, though, I'd think the 'problem' would need to be expressible as 'multiplying large square matrices'. Not all problems are that simple: some have if-then-else dependencies, and in others the next iteration of a 'small' problem depends on the result of the previous iteration, which makes them probably 'impossible' to vectorize. There is, of course, also the additional effort needed to study the code and make those changes, which may be non-trivial. | 
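The difference between work that vectorizes and work that does not can be shown with two small loops. This is a generic sketch, not taken from Rosetta; the function names `scale_and_shift` and `iterate` are made up for the example.

```cpp
#include <cstddef>
#include <vector>

// Independent work: element i never looks at element i-1, so several
// elements can be processed by one SIMD instruction.
inline void scale_and_shift(std::vector<double>& x, double a, double b) {
    for (std::size_t i = 0; i < x.size(); ++i) x[i] = a * x[i] + b;
}

// Loop-carried dependency: every step needs the result of the previous
// step, so the steps have to run one after another no matter how wide
// the vector unit is.
inline double iterate(double x0, double a, double b, std::size_t steps) {
    double x = x0;
    for (std::size_t k = 0; k < steps; ++k) x = a * x + b;
    return x;
}
```

Both loops perform the same multiply-add, but only the first one gives the compiler independent operations to pack together. The second case is the 'next iteration depends on the previous' situation described above.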
| [VENETO] boboviz Joined: 1 Dec 05 Posts: 2124 Credit: 12,428,047 RAC: 2,329 | 
 "and for apps like r@h, it needs to run on a large number of platforms some (many) do not have AVX let alone AVX2. we'd not want the app to 'crash' on those platforms simply because they don't have AVX/AVX2. Hence, such optimizations is a compromise of sorts, the same program may need to have both (AVX) optimised codes and non-optimised codes." No problem: two apps, and an updated scheduler that correctly recognizes the CPU and sends the right app. I think, to start, SSE3 will be enough.... | 
| sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 | 
 Same avxtest as two posts earlier, but with each subroutine run 1000 times. The results are much closer. It suggests that the GCC/G++-optimised code is pretty much as good as the hand-optimised code. GCC/G++ may not 'catch all' cases of loops and do those optimizations, especially for 'real world' problems where it isn't this simple to 'unroll' the loops.
> ./avxtest
avxtest
result: 30, 70, 110, 150
Loop: 103224
result: 30, 70, 110, 150
AVX: 100612
> ./avxtest
avxtest
result: 30, 70, 110, 150
Loop: 107428
result: 30, 70, 110, 150
AVX: 108980
> ./avxtest
avxtest
result: 30, 70, 110, 150
Loop: 111144
result: 30, 70, 110, 150
AVX: 108093 | 
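The cycle counts barely change between 100 and 1000 iterations, which suggests the figures may be dominated by something other than the arithmetic, possibly the print() call that sits inside the timed region of the original program. A hedged sketch of a measurement that keeps output outside the timed region follows. It uses std::chrono rather than __rdtsc, and `matvec4` and `time_matvec4` are hypothetical names, not part of the original test.

```cpp
#include <chrono>
#include <cstddef>

// Plain-loop 4x4 matrix times 4x1 vector (row-major matrix).
inline void matvec4(const double* mat, const double* vec, double* res) {
    for (std::size_t i = 0; i < 4; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < 4; ++j) sum += mat[i * 4 + j] * vec[j];
        res[i] = sum;
    }
}

// Run matvec4 `reps` times and return the elapsed nanoseconds.
// Nothing is printed between the two clock reads.
inline long long time_matvec4(std::size_t reps, double* last_res0) {
    const double mat[16] = {1, 2, 3, 4, 5, 6, 7, 8,
                            9, 10, 11, 12, 13, 14, 15, 16};
    double vec[4] = {1, 2, 3, 4};
    double res[4] = {0, 0, 0, 0};
    volatile double input = 1.0;  // re-read every iteration, so the
                                  // multiply cannot be hoisted out of the loop
    volatile double sink = 0.0;   // written every iteration, so the
                                  // result cannot be discarded

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t r = 0; r < reps; ++r) {
        vec[0] = input;
        matvec4(mat, vec, res);
        sink = res[0];
    }
    const auto stop = std::chrono::steady_clock::now();

    *last_res0 = sink;  // report a result only after timing has stopped
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

With output excluded, the time should scale roughly with `reps`; if it does not, the measurement is still capturing overhead rather than the multiply itself.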
| sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 | 
 The other issue, though, is that it seems it isn't quite possible (yet) for BOINC to distribute apps based on CPU 'architecture' (i.e. whether the host has SSE, AVX, AVX2, FMA etc.); it seems that currently the only distinction possible is 32 bits or 64 bits. Yup, going 64 bits, especially on the Windows platform (which today is 32 bits), would likely see some gains. A 'hybrid' app (for SSE/AVX or none) is possibly more appropriate, because distributing apps is its own 'logistics' issue. It's just that a hybrid app depends on compiler capabilities, and if the compiler can't do it on its own, the app may need to be 'hand tuned'. | 
| [VENETO] boboviz Joined: 1 Dec 05 Posts: 2124 Credit: 12,428,047 RAC: 2,329 | 
 "the other issue though, is that it seemed it isn't quite possible (yet) for boinc to distribute apps based on cpu 'architecture' (i.e. has SSE, AVX, AVX2, FMA etc), it seemed currently what's possible is 32 bits or 64 bits. yup going 64 bits esp for Windows platform (which today is 32 bits) would likely see some gains." A little help with app_config.... :-) | 
| sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 | 
 Here is a very interesting article / set of slides on *AVX/AVX2* from CERN, the HPC (high performance computing) people who deal with *physics*: "Haswell Conundrum: AVX or not AVX?" (2014; see its Conclusions) https://indico.cern.ch/event/327306/contributions/760669/attachments/635800/875267/HaswellConundrum.pdf
 They are in BOINC too and you can run their simulations: http://atlasathome.cern.ch/ That *special scenario* is apparently things like the Linpack benchmark, which depends heavily on the subroutine DGEMM (double precision general matrix multiplication), e.g. multiplying very *big/large* *square matrices*, say 10,000 x 10,000: https://www.pugetsystems.com/labs/hpc/Haswell-Floating-Point-Performance-493/ Once the math scenario falls outside this DGEMM use case of multiplying very big square matrices, all that vector / parallel CPU hardware and even the extreme-speed GPU (*petaflops*) hardware is simply *useless*. E.g. if you are trying to solve 2x2 matrices a billion times and each iteration depends on the result of the previous one, it would be just as slow as simply doing it in loops with no SSE, AVX or AVX2, lol. In short, everything from SSE/AVX to those super high-end vectorized extreme-performance GPUs is only good if the whole world is simply DGEMM. Too bad DGEMM covers just a few of the true real-world scenarios, lol. | 
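A rough way to see why DGEMM is the favourable case is to compare floating-point operations per byte of data. The sketch below is back-of-the-envelope arithmetic under simple assumptions (8-byte doubles, every matrix or vector element moved to or from memory exactly once); `gemm_intensity` and `matvec_intensity` are hypothetical helper names. For an n x n matrix product the ratio is n / 12, about 833 flops per byte at n = 10,000, while a 4x4 matrix times a 4x1 vector comes to about 0.17.

```cpp
// Flops per byte for C = A * B with n x n matrices of doubles.
inline double gemm_intensity(double n) {
    const double flops = 2.0 * n * n * n;    // n^3 multiplies plus n^3 adds
    const double bytes = 3.0 * n * n * 8.0;  // A, B and C, 8 bytes per element
    return flops / bytes;                    // simplifies to n / 12
}

// Flops per byte for y = M * x with an n x n matrix and n-vectors.
inline double matvec_intensity(double n) {
    const double flops = 2.0 * n * n;              // n^2 multiplies plus n^2 adds
    const double bytes = (n * n + 2.0 * n) * 8.0;  // M, x and y
    return flops / bytes;
}
```

The large product reuses each loaded value thousands of times, so the vector units stay busy; the small matrix-vector case does less than one operation per byte moved, so wider arithmetic has little to speed up.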
| [VENETO] boboviz Joined: 1 Dec 05 Posts: 2124 Credit: 12,428,047 RAC: 2,329 | 
 Interesting tests, but every simulation is different (DENIS@home sped up its computation 10 times with SSE3), so these results may be different in Rosetta's environment. In the end we "know" that AVX/AVX2/AVX-512 requires a large refactoring of the code, while SSE2, for example, needs only a flag when recompiling.... | 
         ©2025 University of Washington 
https://www.bakerlab.org