Hi

We all know that chuck is not the fastest audio software out there. But I guess like me you've all found ways to work around that. I often found myself wondering "how much can I save by disconnecting this UGen" or "exactly how expensive is another NRev".

For this purpose I started to do some benchmarking in the form of a bunch of .ck files and a bash script. My initial results are here:

http://atte.dk/chuck/results.txt

The first line is "chuck --loop", so the VM alone. "nb" is the number of files, "cpu" is the cpu usage as reported by htop on my laptop (2GHz Intel dual core), and "cpu normalized" is the cpu usage (simple multiplication) at nb=100.

Of course this doesn't make sense without seeing the .ck files, so I've put them here:

http://atte.dk/chuck/performance_tests.tgz

I'm quite aware that this approach is very un-scientific, so any input on how to improve it is more than welcome. Especially I'm wondering how 10 * PRCRev = 7% while 50 * PRCRev = 46% (it should have been 35%).

I'm gonna continue my tests, but you're all welcome to supply files for testing. Maybe this should all end up on the wiki?

NB: This is in no way a critique of the developers. Sure I would love to see a faster chuck, but we still love it and use it.

-- Atte
http://atte.dk http://modlys.dk
Atte; this is a wonderful project.
Some notes on your method: you are using a file (and hence a shred) per UGen. I believe there is some small overhead to running a shred, so you may be ending up with slightly high numbers. I once tested the cpu cost of sporking a thousand (maybe ten thousand) shreds that were all just waiting for an event, and this took a small but significant amount of cpu.
Maybe the most remarkable result to me here is the cost of setting a SndBuf to loop; that's a rather big hit for what I imagine comes down to a single if-then instruction per sample. I also wonder why Pan2 is more expensive than 2 Gains; we could use 2 Gain UGens (and a function) to emulate it, so maybe this instead tests a higher cost for forcing the DAC to operate in stereo? Finally, like you, I wonder about the different normalised results for the two tests of PRCRev that you performed. Some difference would be understandable; for example, the VM itself takes some CPU which would be scaled along with the whole thing, but this looks like a rather significant difference to me. I wonder how we could explain that.
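For illustration, such a two-Gain emulation could look something like this (an untested sketch; the equal-power curve is just one possible choice):

    // sketch: emulating Pan2 with two Gains and a pan function
    SinOsc s => Gain gL => dac.left;
    s => Gain gR => dac.right;

    // p runs from -1 (hard left) to 1 (hard right)
    fun void pan( float p )
    {
        Math.cos( (p + 1.0) * pi / 4.0 ) => gL.gain;
        Math.sin( (p + 1.0) * pi / 4.0 ) => gR.gain;
    }

    pan( 0.0 );
    1::second => now;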
As for further tests that might be useful/interesting: you could compare the Blit oscillators to the regular ones (as well as try to determine what, if any, difference the number of harmonics makes). Another interesting thing to compare might be LiSa with multiple voices vs. as many copies of SndBuf.

If at all possible it would also be interesting to fully automate a series of tests like this so we could compare versions of ChucK later. We might be interested in the exact difference that a future implementation of block processing could make, for example.
The list goes on; we might want to know the cost of member functions like the .freq() parameter of filters, and we may even want to know the cost of certain operations. For example, in the case of the looping SndBuf above: we could make it loop using an "if the pointer goes out of range, put it back at the start" construct, or we could perform a modulo on the same pointer (which might carry the remainder, which would probably be desirable there). Which would be cheaper in ChucK?
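To make that concrete, the two constructs could look something like this (an untested sketch; the sample path is made up):

    SndBuf buf => dac;
    "mysample.wav" => buf.read;   // hypothetical sample

    while( true )
    {
        // strategy A: conditional wrap; any overshoot is dropped
        if( buf.pos() >= buf.samples() )
            0 => buf.pos;

        // strategy B: modulo wrap; carries the remainder instead
        // buf.pos() % buf.samples() => buf.pos;

        1::samp => now;
    }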
Those last tests would probably be going too far and be too detailed, but we know very little about the price of such operations in ChucK.
Thanks again for sharing your results so far!
Yours,
Kas.
Kassen wrote:
Some notes on your method: you are using a file (and hence a shred) per UGen. I believe there is some small overhead to running a shred, so you may be ending up with slightly high numbers.
I'll check the difference. But it seems to me that for this purpose ("which of A and B is more expensive") it doesn't matter since we're interested in the relative speed of code.
As for further tests that might be useful/interesting: you could compare the Blit oscillators to the regular ones (as well as try to determine what, if any, difference the number of harmonics makes). Another interesting thing to compare might be LiSa with multiple voices vs. as many copies of SndBuf.
Agreed, and I have lots of other ideas + plan on testing things as they occur in my work.
If at all possible it would also be interesting to fully automate a series of tests like this so we could compare versions of ChucK later.
You're absolutely right. I actually started something automatic, but got stuck, and thought a "by-hand" start was better than nothing. I'll for sure see if I can make a 100% automatic test. Thanks for your feedback!

-- Atte
http://atte.dk http://modlys.dk
It would also probably be good to know which kernel you're running these tests on.

-Eric
http://greyrockstudio.blogspot.com
Atte;

I'll check the difference. But it seems to me that for this purpose ("which of A and B is more expensive") it doesn't matter since we're interested in the relative speed of code.
I think getting as close as possible to the actual cost of a single UGen would be useful. We may have several ways of constructing a certain sound out of a few UGens; in that case the relative cost of single UGens would be quite relevant. I recognise that we may not be able to find the exact cost in all cases as it may depend on the context, especially if/when we start optimising ChucK, but it could still be interesting to get as close as possible. Even the question of whether or not such a difference really exists in an appreciable form would be interesting; such info would help us make more informed choices about the structure of our programs.

Agreed, and I have lots of other ideas + plan on testing things as they occur in my work.
Great! I'm looking forward to reading about this.

You're absolutely right. I actually started something automatic, but got stuck, and thought a "by-hand" start was better than nothing. I'll for sure see if I can make a 100% automatic test.
Oh, yes, clearly, this is already quite interesting. Eric had an interesting note about different kernels; I wonder whether that would make a difference beyond the relative cost of UGens compared to each other (as percentages) and the overhead cost any given OS has for running a program at all. I could imagine different CPUs making a difference, provided we have a good compiler and compile each on its own system. We could create a standard for testing this; for example saying the cost of a single SinOsc is "1", always using the same buffer size, and so on.

Yours,
Kas.
Kassen wrote:
I think getting as close as possible to the actual cost of a single UGen would be useful. We may have several ways of constructing a certain sound out of a few UGens, in that case the relative cost of single UGens would be quite relevant.
I'm not sure what you mean. Could you outline how one of the test .ck files should look and how you propose I call chuck on it from my test script? Note that the "heaviness" on the cpu (in my case the number of instances of the .ck file) should be controllable from the bash script and hence not hardcoded into the chuck code.
Oh, yes, clearly, this is already quite interesting.
Automation is on its way, needs some polish, but it basically works now!
Eric had an interesting note about different kernels;
Kernels are indeed interesting, along with stuff like buffer size. However my main focus is to compare the various UGens (and stuff like modulo and casting), but once automation is working and the measurement is accurate it would be easy to test all sorts of things.

-- Atte
http://atte.dk http://modlys.dk
Atte André Jensen wrote:
Automation is on it's way, need some polish, but it basically works now!
Done. Slightly different output...

-- Atte
http://atte.dk http://modlys.dk
Atte;

I think getting as close as possible to the actual cost of a single UGen would be useful. We may have several ways of constructing a certain sound out of a few UGens; in that case the relative cost of single UGens would be quite relevant.

I'm not sure what you mean. Could you outline how one of the test .ck files should look and how you propose I call chuck on it from my test script? Note that the "heaviness" on the cpu (in my case the number of instances of the .ck file) should be controllable from the bash script and hence not hardcoded into the chuck code.
Sorry. Here I meant we are working from the assumption that every UGen will have a (knowable) cost in cpu. So if we want to create a certain sound in ChucK, this will use certain UGens in a certain configuration. In some cases we will have several ways of creating a UGen graph that will yield this sound. At that point we may choose one option for its CPU cost, but we can only make that choice knowing the cost of single UGens (so we may add those). Above I meant that for this scenario it makes a difference to know what those costs are exactly, as opposed to just knowing what order the UGens would be in if we ordered them by CPU cost. I meant this as a reason to be curious about the exact (as opposed to relative) cost of UGens, not as a specific scenario to test. I hope that clarifies.

Yours,
Kas.
Kassen wrote:
I meant this as a reason to be curious about the exact (as opposed to relative) cost of UGens, not as a specific scenario to test.
I hope that clarifies.
Kind of. I still don't get it. You're saying that it's a problem that I run several shreds per test, right?

Obviously I'm doing that to get numbers (cpu usage) in a range where they make sense, admittedly entirely based on my gut feeling. Comparing cpu loads of 2.1 and 2.2 is not as good as 84.0 and 88.0, plus I'd expect small numbers to be relatively more polluted with "stuff from the system", including the VM. Also, close to 100% things are useless; the system would start to blow up, stutter etc.

But suppose we compare these lines (a result of the current version of the test):

file                             x10    x50    x100
--------------------------------------------------------
01_PulseOsc.ck                   2.7    8.5    18.5
01_SawOsc.ck                     3.0    10.2   22.5
01_SinOsc.ck                     5.0    28.0   44.5
01_SqrOsc.ck                     2.7    8.7    19.2
01_TriOsc.ck                     3.0    10.2   22.5

Another run:

file                             x10    x50    x100
--------------------------------------------------------
01_PulseOsc.ck                   3.0    9.5    19.5
01_SawOsc.ck                     3.0    11.0   22.0
01_SinOsc.ck                     5.5    21.5   45.0
01_SqrOsc.ck                     3.0    9.5    19.5
01_TriOsc.ck                     3.5    10.5   23.5

Wouldn't you say that it's safe to say that SinOsc is *about* twice as expensive as the others? I mean, the numbers in the same column should be directly comparable, or?

Naturally that is provided that the measurements are sane. Thinking about the statistics mentioned by Tom (http://www.zedshaw.com/essays/programmer_stats.html) this is actually a real challenge. A simple way would be to have chuck run for a number of seconds, and take measurements of the cpu load at certain intervals. Then (without knowing much about statistics) something like throwing away measurements that are way off and averaging the remaining could make sense. Or one might be interested in the maximum load generated by the code, although many things outside of chuck (the system) could account for the "jitter". For instance the x50 of SinOsc is 28.0 compared to 21.5 in the two different runs. I have no idea where this jitter comes from (the specific UGen or my system or something else), but clearly that's something that should be improved upon.

-- Atte
http://atte.dk http://modlys.dk
Atte;
I still don't get it. You're saying that it's a problem that I run several shreds per test, right?
Yes. I think a shred in and of itself takes some cpu (based on tests I ran a long time ago using empty shreds). Thus, if we have a single UGen per shred and we wish to measure UGens, we'd be measuring the cost of a UGen + the cost of a shred. This would mean that all UGens would end up looking slightly more expensive than they are, and that according to the numbers a network of 5 UGens would be at a disadvantage compared to one of 3 UGens if we want to compare what they cost. In the first case we'd have 5 times our "cost of measuring", in the second only 3 times.

Let's say a bag of candy costs $2, a pizza is $4 and driving to the store costs me $1. This means buying a bag of candy costs $3 (in practice), a pizza would cost me $5, but driving to the store to buy both would only be 2+4+1 = $7 as I'd only have to drive once. Here driving equates to "having a shred".

With a test like this:

    repeat( 100 ) SinOsc s => dac;
    week => now;

we'd have 100 UGens and only a single shred, instead of 100 UGens and 100 shreds.
Obviously I'm doing that to get numbers (cpu usage) in a range where they make sense, admittedly entirely based on my gut feeling. Comparing cpu loads of 2.1 and 2.2 is not as good as 84.0 and 88.0, plus I'd expect small numbers to be relatively more polluted with "stuff from the system", including the VM. Also, close to 100% things are useless; the system would start to blow up, stutter etc.
Makes sense.
Wouldn't you say that it's safe to say that SinOsc is *about* twice as expensive as the others? I mean the numbers in the same column should be directly comparable, or?
Yes. I also think that the cost of a shred should be small compared to the cost of a UGen. Here is a test to benchmark the cost of a 100 shreds that all do nothing:

    fun void wait()
    {
        week => now;
    }

    repeat( 100 ) spork ~ wait();
    week => now;

As you'll see, the cost of those is non-zero.
Naturally that is provided that the measurements are sane. Thinking about the statistics mentioned by Tom (http://www.zedshaw.com/essays/programmer_stats.html) this is actually a real challenge. A simple way would be to have chuck run for a number of seconds, and take measurements of the cpu load at certain intervals. Then (without knowing much about statistics) something like throwing away measurements that are way off and averaging the remaining could make sense. Or one might be interested in the maximum load generated by the code, although many things outside of chuck (the system) could account for the "jitter".
Yes, true, though if some UGen were correlated with high jitter that would be an interesting metric as well. I'm not sure we have such UGens. I recognise that this is a hard thing to measure.
For instance the x50 of SinOsc is 28.0 compared to 21.5 in the two different runs. I have no idea where this jitter comes from (the specific UGen or my system or something else), but clearly that's something that should be improved upon.
Yes. I saw that too. When I benchmark my own programs to see what they cost (to see whether I think a certain change is worthwhile) I've often seen considerable jitter. I'm inclined to blame the OS in most cases. Typically I wait for a bit and see what the worst it ever does is, as it's only the worst case scenario that affects me. Taking only 5% is still no good to me if it occasionally spikes and glitches. Still, even if it's hard, this is very interesting and very worthwhile, I feel.

Yours,
Kas.
Kassen wrote:
Let's say a bag of candy costs $2, a pizza is $4 and driving to the store costs me $1. This means buying a bag of candy costs $3 (in practice), a pizza would cost me $5, but driving to the store to buy both would only be 2+4+1 = $7 as I'd only have to drive once. Here driving equates to "having a shred".
But if the price of driving were magnitudes lower (say $0.01) than candy and pizza, and you could only pick up one thing at a time, you could compare the price of candy and pizza by getting 100 of each:

pizza: 100 * (4 + 0.01) = 401
candy: 100 * (2 + 0.01) = 201

Still the picture is a little skewed by the cost of driving, but if that's low (and my tests suggest the shreds are) you're pretty close. Still, I'm gonna work with your idea! Stupid me doesn't know of a way to pass arguments to a chuck file from the command line. I'd rather not have the 100 as in "create 100 SinOscs" hardcoded in the files.

-- Atte
http://atte.dk http://modlys.dk
Atte;

But if the price of driving were magnitudes lower (say $0.01) than candy and pizza, and you could only pick up one thing at a time, you could compare the price of candy and pizza by getting 100 of each:

pizza: 100 * (4 + 0.01) = 401
candy: 100 * (2 + 0.01) = 201

Still the picture is a little skewed by the cost of driving, but if that's low (and my tests suggest the shreds are) you're pretty close.
That's fair, yes, but the cost of shreds is non-zero; maybe this was affected by my test being on the miniAudicle, which also reports on the length of time a shred has been running for, creating more of a per-shred overhead. Only picking up a single type of product seems like the way to go.
Still, I'm gonna work with your idea! Stupid me doesn't know of a way to pass arguments to a chuck file from the command line. I'd rather not have the 100 as in "create 100 SinOscs" hardcoded in the files.
Well, even if the difference is slight I do think my idea has some merit.
Command line argument syntax is documented in VERSIONS.txt in your ChucK dir, as well as in an example or two in the examples dir. Regardless of the exact numerical outcome, using those should simplify the testing process, which has advantages in itself. For one thing, the outcome of a more simple test should be more simple to analyse. That might go some way in preventing that guy from killing us :¬).

Happy testing!
Kas.
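For reference, the colon syntax looks something like this (the file name is made up; the argument is read inside the file):

    // invoked as:  chuck test.ck:100
    0 => int instances;
    if( me.args() ) me.arg(0) => Std.atoi => instances;
    <<< "creating", instances, "instances" >>>;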
Kassen wrote:
Well, even if the difference is slight I do think my idea has some merit.
Agreed.
Command line argument syntax is documented in VERSIONS.txt in your ChucK dir, as well as in an example or two in the examples dir. Regardless of the exact numerical outcome, using those should simplify the testing process, which has advantages in itself. For one thing, the outcome of a more simple test should be more simple to analyse.
Never used that, but now I have :-) The test now uses Your Way (TM); the differences go like this:

Old style:

file                             x10    x50    x100
--------------------------------------------------------
00_wait.ck                       1.5    1.7    1.7
01_PulseOsc.ck                   2.7    8.5    18.5
01_SawOsc.ck                     3.0    10.2   22.5
01_SinOsc.ck                     5.0    28.0   44.5
01_SqrOsc.ck                     2.7    8.7    19.2
01_TriOsc.ck                     3.0    10.2   22.5
02_SndBuf.ck                     2.5    7.7    15.5
02_SndBuf_loaded_not_playing.ck  1.5    2.0    3.0
02_SndBuf_loop.ck                5.2    24.2   51.0
03_Gain.ck                       2.5    7.5    14.7
03_Pan2.ck                       5.0    24.0   49.2
05_Delay.ck                      3.2    13.5   30.7
05_Echo.ck                       3.5    14.7   39.7
06_BPF.ck                        4.0    10.2   20.5
06_BRF.ck                        3.0    10.2   20.5
06_HPF.ck                        3.0    10.0   21.5
06_LPF.ck                        2.7    10.0   20.5
07_JCRev.ck                      12.2   75.2   97.2
07_NRev.ck                       18.7   97.0   97.2
07_PRCRev.ck                     7.2    45.5   93.2
08_Chorus.ck                     7.2    37.7   79.5
90_SinOscLPF.ck                  6.5    32.5   67.5

New style:

file                             x10    x50    x100
--------------------------------------------------------
00_wait.ck                       1.5    1.5    1.5
01_PulseOsc.ck                   2.8    8.3    18.2
01_SawOsc.ck                     2.5    10.2   22.0
01_SinOsc.ck                     4.1    21.5   43.2
01_SqrOsc.ck                     2.6    8.5    24.3
01_TriOsc.ck                     3.0    10.2   22.0
02_SndBuf.ck                     2.5    7.3    14.7
02_SndBuf_loaded_not_playing.ck  1.5    2.0    2.5
02_SndBuf_loop.ck                5.4    24.7   52.5
03_Gain.ck                       2.5    7.2    14.0
03_Pan2.ck                       5.0    24.5   49.2
05_Delay.ck                      3.2    13.2   28.7
05_Echo.ck                       3.6    12.8   40.5
06_BPF.ck                        2.8    9.7    20.4
06_BRF.ck                        3.0    10.0   20.4
06_HPF.ck                        2.8    9.7    20.5
06_LPF.ck                        2.4    10.0   20.2
07_JCRev.ck                      12.7   75.7   97.2
07_NRev.ck                       18.2   97.5   97.7
07_PRCRev.ck                     6.3    45.2   92.0
08_Chorus.ck                     7.2    33.3   81.7
90_SinOscLPF.ck                  6.7    32.5   67.0

One thing that clutters up the picture is that I also changed the script to take three measurements and throw away the largest and smallest value, which should (and seems to) get rid of the worst jitter. Besides that they don't look that different, or...

Anyways, this is the content of for instance 01_SinOsc.ck, do you think it looks sane?

    0 => int instances;
    if( me.args() ) me.arg(0) => Std.atoi => instances;

    repeat( instances )
    {
        SinOsc s => dac;
        0 => s.gain;
    }

    1::week => now;

-- Atte
http://atte.dk http://modlys.dk
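The keep-the-middle-of-three logic is simple; expressed here in ChucK purely for illustration (the actual script does this in bash):

    // return the middle of three cpu measurements,
    // i.e. discard the largest and the smallest
    fun float middleOfThree( float a, float b, float c )
    {
        return a + b + c
             - Math.max( a, Math.max( b, c ) )
             - Math.min( a, Math.min( b, c ) );
    }

    <<< middleOfThree( 5.0, 28.0, 21.5 ) >>>;   // the middle value: 21.5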
Atte;

One thing that clutters up the picture is that I also changed the script to take three measurements and throw away the largest and smallest value, which should (and seems to) get rid of the worst jitter. Besides that they don't look that different, or...
Slightly lower numbers for the new style. The difference in x10 and x100 still shows the static cost of the VM but we can compensate for that now in those cases where using x50 would be more practical.
Anyways, this is the content of for instance 01_SinOsc.ck, do you think it looks sane?
Yeah, that looks "by the book" to me :¬)

One thing that has me wondering now is the cost of a non-playing SndBuf; that's very low indeed, lower than a Gain. Some optimisation must be going on there; nice.

Nice work!

Yours,
Kas.

PS; As a side note: GMail users can enable a Google Labs extension that will allow them to view selected emails in a fixed-width font. This makes looking at tables like this a lot more convenient.
Kassen wrote:
Anyways, this is the content of for instance 01_SinOsc.ck, do you think it looks sane?
Yeah, that looks "by the book" to me :¬)
Ok, thanks.
One thing that has me wondering now is the cost of a non-playing SndBuf; that's very low indeed, lower than a Gain. Some optimisation must be going on there; nice.
OR: it wasn't connected to dac :-) I'm not sure it was on purpose, but agree it's confusing, at least by looking at the filename (I suspect you didn't look at the chuck code). With it connected it looks different:

file                             x10    x50    x100   x1000
--------------------------------------------------------
02_SndBuf.ck                     2.6    7.5    15.2   96.7
02_SndBuf_loaded_not_playing.ck  4.6    24.2   53.0   97.2
02_SndBuf_loop.ck                5.3    24.7   53.2   97.0
03_Gain.ck                       2.6    7.3    14.4   96.2

A different story: Do you have any ideas about how to test stuff like multiplication vs division? These results lead me to believe that the usual rave about division being expensive is totally irrelevant as soon as you connect even a single UGen in your chuck code :-)

file                             x10    x50    x100   x1000
--------------------------------------------------------
11_cast_float_to_int.ck          2.2    2.2    2.2    2.2
11_divide_integers.ck            2.5    2.3    2.5    2.4
11_modulo_integer.ck             2.2    2.2    2.2    2.4
11_multiply_integers.ck          2.4    2.2    2.2    2.2

atte@vestbjerg:~/music/chuck/performance_tests$ cat tests/11_divide_integers.ck

    0 => int instances;
    if( me.args() ) me.arg(0) => Std.atoi => instances;

    int i;
    repeat( instances )
    {
        while( true )
        {
            1::samp => now;
            10 / 2 => i;
        }
    }
    1::week => now;
PS; As a side note: GMail users can enable a Google Labs extension that will allow them to view selected emails in a fixed-width font. This makes looking at tables like this a lot more convenient.
I'm using thunderbird...

-- Atte
http://atte.dk http://modlys.dk
Atte André Jensen wrote:
02_SndBuf_loaded_not_playing.ck  4.6    24.2   53.0   97.2
02_SndBuf_loop.ck                5.3    24.7   53.2   97.0
Just to elaborate: this means that the non-playing trick "s.samples() => s.pos" won't do anything for the cpu (which I thought it would). It must be disconnected from dac to be easy on the cpu... Time to change habits, I guess :-)

-- Atte
http://atte.dk http://modlys.dk
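A minimal sketch of the two cases (the sample path is made up):

    SndBuf s => dac;
    "mysample.wav" => s.read;   // hypothetical sample

    s.samples() => s.pos;   // silences playback, but s still gets ticked
    s =< dac;               // actually disconnects; this is what saves cpu
    1::second => now;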
Atte;

OR: it wasn't connected to dac :-)
Right, yes, that would explain it :¬)
I'm not sure it was on purpose, but agree it's confusing at least by looking at the filename (I suspect you didn't look at the chuck code). With it connected it looks different:
Ah, yes. (pasting your notes from your other mail here)
This means that the non-playing trick "s.samples() => s.pos" won't do anything for the cpu (which I thought it would). It must be disconnected from dac to be easy on the cpu...
That's right; that trick saves the ears, not the cpu. It's still a good trick as it saves a lot of noise in setups where we use a lot of samples.

A different story: Do you have any ideas about how to test stuff like multiplication vs division?
What I came up with seems very close to your solution.

These results lead me to believe that the usual rave about division being expensive is totally irrelevant as soon as you connect even a single UGen in your chuck code :-)
The difference seems quite small indeed; interesting. I wonder how conditions compare against those; I always thought those were relatively expensive.

PS; As a side note: GMail users can enable a Google Labs extension that will allow them to view selected emails in a fixed-width font. This makes looking at tables like this a lot more convenient.
I'm using thunderbird...
My lifestyle forces me to use a web-based solution. I'd like to say that's because of travelling so much, but in reality it has to do with tinkering with computers and OSes and occasionally having lost an install and all the mail archives that were in it :¬)

Kas.
2009/3/10 Kassen
I also wonder why Pan2 is more expensive than 2 Gains; we could use 2 Gain UGens (and a function) to emulate it, so maybe this instead tests a higher cost for forcing the DAC to operate in stereo?
As far as I can tell, Gain is implemented almost as a no-op UGen:

    // from ugen_xxx.cpp
    if( !type_engine_import_ugen_begin( env, "Gain", "UGen", env->global(),
                                        NULL, NULL, NULL, NULL ) )
        return FALSE;

Those NULLs are function pointers for a constructor, destructor, tick (sample generator), and something else I don't know about. There is probably no Gain-specific code; it only needs behavior that's shared among all UGens anyway, whereas Pan2 has Pan2-specific behavior.

-- Tom Lieber
http://AllTom.com/
Kassen wrote:
As for further tests that might be useful/interesting: you could compare the Blit oscillators to the regular ones (as well as try to determine what, if any, difference the number of harmonics makes).
They are indeed more expensive than their counterparts. Number of harmonics doesn't seem to matter, though (see link below).
Another interesting thing to compare might be LiSa with multiple voices vs. as many copies of SndBuf.
Hmm. Could you supply a test file for this?
The list goes on; we might want to know the cost of member functions like the .freq() parameter of filters, we may even want to know the cost of certain operations.
But this (and other operations) needs to be done at a rate. Would every samp make sense? Or should we pump more operations per samp, for instance with an inline repeat?

Another thing: Some UGens could prove more expensive in use (Envelope and ADSR seem to fall into this category); which ones should I test, and how? For Envelope/ADSR I simply connected them to a SinOsc, and found that the cost of ADSR => SinOsc => dac is more than ADSR => dac + SinOsc => dac.

I've wrapped it up a bit, with direct links to the test files for easier browsing:

http://atte.dk/chuck/

-- Atte
http://atte.dk http://modlys.dk
Atte;

Another day, more testing!

They are indeed more expensive than their counterparts. Number of harmonics doesn't seem to matter, though (see link below).
Interesting; I expected them to be more expensive but I wasn't sure we could get any number of harmonics for that price.
Another interesting thing to compare might be LiSa with multiple voices vs. as many copies of SndBuf.
Hmm. Could you supply a test file for this?
Let me think about this, because doing it in a way that's as neutral as possible doesn't seem so easy now that I've considered it.
The list goes on; we might want to know the cost of member functions like the .freq() parameter of filters, and we may even want to know the cost of certain operations.

But this (and other operations) needs to be done at a rate. Would every samp make sense? Or should we pump more operations per samp, for instance with an inline repeat?
I'd say we need to do it at whatever rate gives us numbers in a 40-80% (or so) CPU usage range, because of the factors you pointed out previously: at too little CPU usage other factors weigh too heavily, at too much the measuring will likely become inaccurate. Maybe an inline repeat makes more sense than looping at fractions of a samp; we'd have fewer conditions (per second), but that would also make the code slightly more complicated, and I'm also in favour of simple code here, for simple analysis. Sigh; not so easy, this.
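An untested sketch of the inline-repeat idea (the count of 100 is arbitrary):

    float f;
    while( true )
    {
        // perform the operation under test many times per sample tick,
        // so it dominates the fixed per-shred cost
        repeat( 100 )
            10.0 / 2.0 => f;
        1::samp => now;
    }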
Another thing: Some UGens could prove more expensive in use (Envelope and ADSR seem to fall into this category); which ones should I test, and how? For Envelope/ADSR I simply connected them to a SinOsc, and found that the cost of ADSR => SinOsc => dac is more than ADSR => dac + SinOsc => dac.
The one difference I know for sure is there is that in the case of chaining them up there will be a second UGen pulling ticks through the chain; you won't have just dac polling the envelope and the osc, but dac polling the osc, then the osc polling the envelope. Maybe that's more expensive. It could also be that SinOsc calculates numbers in a more complicated way when it's fed a signal at its sync input? I suppose that to get beyond this we're simply going to have to dig into the code to see what exactly is going on there. There might be something going on that will be interesting should we want to optimise UGen graphs for speed of calculation.
I've wrapped it up a bit, with direct links to the test files for easier browsing:
Cool! Kas.
On Mar 12, 2009, at 3:26 AM, Atte André Jensen wrote:
Another thing: Some UGens could prove more expensive in use (Envelope and ADSR seem to fall into this category); which ones should I test, and how? For Envelope/ADSR I simply connected them to a SinOsc, and found that the cost of ADSR => SinOsc => dac is more than ADSR => dac + SinOsc => dac.
There is an additional cost to chucking anything into SinOsc -- that might be what you are seeing here. SinOsc (and some other Oscs) maps its input to frequency (depending on the .sync parameter); this entire code path is skipped if the tick function detects that there are no input ugens. Perhaps a more fair comparison would be to replace SinOsc with a ugen that does not make this optimization, e.g. Gain, or to do something like:

    ADSR env => SinOsc s => dac;
    Gain g => s;

vs.

    Gain g => SinOsc s => dac;
    ADSR env => dac;

spencer
Interesting stuff! Please keep us informed.

One useful benchmark I used when I was testing my Ruby ChucK library was "how many of this UGen can I connect to the dac before the audio starts to stutter?" In other words, how many copies of a UGen can chuck handle in real time? It's perhaps more precise to use timings of chuck --silent (add UGens until chuck --silent takes as long as the amount of virtual time that passes), but the idea is the same. It would let you set up an exchange rate for UGens, like "1 Pan2 = 3 Gain". I only tested with Gain and SinOsc, but I found the numbers were fairly consistent as long as nothing else was running.

Also, recommended reading:

http://www.zedshaw.com/essays/programmer_stats.html

If you skip the first half, there's actual information down below.

-- Tom Lieber
http://AllTom.com/
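A minimal sketch of that kind of stutter test (untested):

    // add one SinOsc per second; note the count when the audio breaks up
    SinOsc oscs[0];
    while( true )
    {
        SinOsc s => dac;
        oscs << s;   // keep a reference, and keep count
        <<< oscs.size(), "oscillators connected" >>>;
        1::second => now;
    }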
Tom;

Also, recommended reading:
http://www.zedshaw.com/essays/programmer_stats.html
If you skip the first half, there's actual information down below.
Yes. The first half implies being tall should lead to fewer problems with women. At 2 meters I must count as "tall" and I didn't find this at all. It does get interesting and relevant right after that, though (and as opinionated as it is, it's quite pleasant to read). Fortunately we are quite safe from one of the major complications: optimisation. If ChucK had an optimising compiler, for example, we'd be in serious trouble. :¬)

As for Gain & Pan2: you are quite right. Here we could say that Gain is more or less like the "base cost" of all UGens. Still, a modest little Gain is capable of adding, multiplying and so on. Clearly Pan2 needs its own code, but it's not clear to me yet why that code has to be so much more expensive than just two Gains. One explanation would be that the base cost of a Gain doesn't account for everything; a Gain set to multiply might be more expensive than a plain one, or maybe stereo adds cost. I suspect there are hidden issues that we haven't evaluated yet.

At any rate this shows that the base costs of adding a UGen at all are there and are quite significant compared to the cost of a full UGen. IMHO this pleads for the "blob" project. For example, a crossfader is a device that we could quite easily make ourselves out of other UGens, so is a clipper and a state-variable filter (the list goes on), but this table already shows us that despite these being possible to create already, there would be significant benefits to increasing our range of available low-level building blocks. This should in turn cut down on the costs of implementing more advanced/interesting synthesis techniques in ChucK.

Oh, and SndBuf is very cheap indeed, far cheaper than I thought it was; that's nice to know.

Cheers,
Kas.
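One way to check the "Gain set to multiply" guess, in the style of the existing test files (a hypothetical, untested file):

    // 03_Gain_multiply.ck (hypothetical): a Gain with two inputs, set to multiply
    0 => int instances;
    if( me.args() ) me.arg(0) => Std.atoi => instances;

    repeat( instances )
    {
        SinOsc a => Gain g => dac;
        SinOsc b => g;
        3 => g.op;      // op 3: multiply inputs instead of summing them
        0 => g.gain;    // keep it silent, like the other tests
    }
    1::week => now;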
2009/3/10 Kassen
As for Gain & Pan2: you are quite right. Here we could say that Gain is more or less like the "base cost" of all UGens. Still, a modest little Gain is capable of adding, multiplying and so on. Clearly Pan2 needs its own code, but it's not clear to me yet why that code has to be so much more expensive than just two Gains. One explanation would be that the base cost of a Gain doesn't account for everything; a Gain set to multiply might be more expensive than a plain one, or maybe stereo adds cost. I suspect there are hidden issues that we haven't evaluated yet.
I'm pretty sure there being no Gain-specific functionality is all there is to it. Can you find a UGen that doesn't support "3 => u.op"? I think "Gain" is just the most generic UGen possible. (I should really go verify it in code, but I'm a little busy tonight.) (Same reason I haven't tried my hand at the ChucK benchmarks.) (I'll probably forget about both of these by the time I'm free, though. ;p)
At any rate this shows that the base costs of adding a UGen at all are there and are quite significant compared to the cost of a full UGen. IMHO this pleads for the "blob" project.
Or an optimizing ChucK compiler! (heh, heh, any takers?) -- Tom Lieber http://AllTom.com/
Tom;
I'm pretty sure there being no Gain-specific functionality is all there is to it. Can you find a UGen that doesn't support "3 => u.op"? I think "Gain" is just the most generic UGen possible.
I agree.
(I should really go verify it in code, but I'm a little busy tonight.) (Same reason I haven't tried my hand at the ChucK benchmarks.) (I'll probably forget about both of these by the time I'm free, though. ;p)
Well, there are UGens with no input (SndBuf, the STK instruments...); I'm not sure what good .op() is for them, but I do believe the docs claim they have it. That's not actually useful, but then again, this compiles as well:

    2 => blackhole.op; // it has a .gain() as well; there's something Zen about that

While we're at it: dac has a .op as well, I found that earlier. This can indeed be used to create terrible noises after sporking a few shreds, as the dac is the same across all shreds, in case you were wondering. In my experience it's best left for those situations where you were planning to end your set anyway :¬).
At any rate this shows that the base costs of adding a UGen at all are there and are quite significant compared to the cost of a full UGen. IMHO this pleads for the "blob" project.
Or an optimizing ChucK compiler! (heh, heh, any takers?)
That would mean re-compiling the whole UGen graph on any addition, as we can hot-swap running UGens; I posted about that when we were talking about more advanced uses of casting. This would be a spectacularly bad idea unless we could also somehow preserve the state of individual UGens to prevent glitches. We may be able to have an optimising compiler, but we can't (preserving current functionality) condense the UGen graph like some digital modulars do (simplifying groups of UGens internally). Well, maybe we could, but even thinking about it makes my head hurt. IMHO we need more facilities for hot-swapping UGens, not less.

Cheers,
Kas.
On Tue, Mar 10, 2009 at 11:49 PM, Kassen
That would mean re-compiling the whole UGen graph on any addition, as we can hot-swap running UGens; I posted about that when we were talking about more advanced uses of casting. This would be a spectacularly bad idea unless we could also somehow preserve the state of individual UGens to prevent glitches. We may be able to have an optimising compiler, but we can't (preserving current functionality) condense the UGen graph like some digital modulars do (simplifying groups of UGens internally). Well, maybe we could, but even thinking about it makes my head hurt. IMHO we need more facilities for hot-swapping UGens, not less.
That's basically true. Even if the compiler produced machine code as fast as C, it would still issue calls to connect up UGens the same way it is done now. On the other hand, if the compiler really could generate code that fast, it would be worthwhile to implement UGens directly in ChucK.

I've been playing more and more with LLVM lately, but it's still too early to say anything. I was messing around with the ChucK compiler previously, and found a few places to play with its opcode generator. Don't hold your breath though, as I'm way too busy on other things. :) I'd love to spend some more time with it in the summer though.

Steve
Steve;
That's basically true. Even if the compiler produced machine code as fast as C, it would still issue calls to connect up UGens the same way it is done now.
Yes, I was talking about optimising UGen performance, though. There are some digital modular synths that optimise by taking a group of "UGens" and converting those to instructions in an optimised way. I meant we couldn't do that as we want to be able to re-route them.
On the other hand, if the compiler really could generate code that fast, it would be worthwhile to implement UGens directly in ChucK.
I still think that would be worthwhile to have as an option. Even if it would be slow(er), it would be great for prototyping and would lead to cleaner code. For example, I sometimes use an overdrive that consists of a set of UGens. When I use that a few times, on a few sounds, I end up with code that may work and be relatively fast, but that's not so pretty.
I've been playing more and more with LLVM lately, but it's still too early to say anything. I was messing around with the ChucK compiler previously, and found a few places to play with its opcode generator. Don't hold your breath though, as I'm way too busy on other things. :) I'd love to spend some more time with it in the summer though.
Could you explain what you mean by "ChucK's opcode generator"? "Opcode" is a CSound term that translates roughly to "UGen", right?

Kas.
On Wed, Mar 11, 2009 at 10:08 AM, Kassen
Steve;
Yes, I was talking about optimising UGen performance, though. There are some digital modular synths that optimise by taking a group of "UGens" and converting those to instructions in an optimised way. I meant we couldn't do that as we want to be able to re-route them.
That's basically true. It's not _impossible_ to optimize them (inlining "tick" functions, etc), but it would be a completely different problem to tackle. A pretty interesting one however!
Could you explain what you mean by "ChucK's opcode generator"? "Opcode" is a CSound term that translates roughly to "UGen", right?
Sorry for the confusion. I meant opcode as it is used in talking about compilers. The word just refers to "operation codes", i.e., codes that represent machine instructions. In ChucK, its VM opcodes are actually instances of the Chuck_Instr class. Not the same as CSound's opcodes. (Though, shamefully, I don't actually know CSound very well.)

Steve
Steve;

That's basically true. It's not _impossible_ to optimize them (inlining "tick" functions, etc), but it would be a completely different problem to tackle. A pretty interesting one however!
That would be an interesting approach indeed. We would need to be careful and make sure we could still consider the internal state of individual UGens in such a graph, as we may want to change the UGen graph with as few glitches as possible.
Sorry for the confusion. I meant opcode as it is used in talking about compilers. The word just refers to "operation codes", i.e., codes that represent machine instructions. In ChucK, its VM opcodes are actually instances of the Chuck_Instr class. Not the same as CSound's opcodes. (Though, shamefully, I don't actually know CSound very well.)
Ah, I see. IMHO CSound is nice in that it's so old and established that many very worthwhile articles have been written with it in mind; it can pay off to be able to read it, to be able to understand those articles. To me personally it does look a lot like a language that's decades old if we compare it to ChucK; I think you'll be fine without it. Anyway, that's something completely different indeed. Fortunately we have words like "spork" and "shred"; at least those won't easily be confused with other concepts elsewhere!

Yours,
Kas.
2009/3/10 Kassen
Well, there are UGens with no input (SndBuf, the STK instruments...); I'm not sure what good .op() is for them, but I do believe the docs claim they have it. That's not actually useful, but then again, this compiles as well:
Really?

    adc => StifKarp k => dac;

Look out, there could be a Gain behind you, too.

-- Tom Lieber
http://AllTom.com/
participants (6):
- Atte André Jensen
- Eric Hedekar
- Kassen
- Spencer Salazar
- Stephen Sinclair
- Tom Lieber